NoHostAvailableException also if client is idle

Description

Hi all,
I'm already using

start_native_transport: true
rpc_address: IP address
rpc_keepalive: true

The documentation states that "Session instances are thread-safe and usually a single instance is enough per application". Based on this information I created a singleton to retrieve the session:

But after a few minutes of very low usage (even idle ...), I can see all connections to my dev cluster fail:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: [dev-cross-cassandra-3.pgol.net/172.26.11.12:9042, dev-cross-cassandra-4.pgol.net/172.26.11.13:9042, dev-cross-cassandra-1.pgol.net/172.26.11.10:9042, dev-cross-cassandra-2.pgol.net/172.26.11.11:9042, /172.26.11.14:9042] - use getErrors() for details)

getErrors() just says that it attempted to connect to each node without success.
In the log I also see that there are a lot of attempts:

[Cassandra Java Driver worker-4] WARN com.datastax.driver.core.Connection - Timeout while setting keyspace on connection to /172.26.11.14:9042. This should not happen but is not critical (it will retried)
[Cassandra Java Driver worker-5] WARN com.datastax.driver.core.Connection - Timeout while setting keyspace on connection to dev-cross-cassandra-2.pgol.net/172.26.11.11:9042. This should not happen but is not critical (it will retried)

It's surely not a problem with the cluster, since if I restart my application without touching the cluster everything works again for 15/20 minutes. As you can see I've added a

but nothing changed.

My dev cluster is a 4-node RF3 using Cassandra 2.0.9

Looking for similar issues I found only duplicate and closed ones ( ), but the problem persists in both 2.0.4 and 2.0.5.

Thanks,
Carlo

Environment

None

Pull Requests

None

Activity

Olivier Michallat
September 1, 2014, 2:06 PM

The argument to ConstantReconnectionPolicy is in milliseconds, so 60 seems very low to me.

Looking at the code for 2.0.5, a reconnection can be triggered by an AuthenticationException or an UnsupportedProtocolVersionException. Are you sure that your client authenticates correctly?
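To make the unit mismatch concrete, here is a rough back-of-the-envelope sketch in plain Java (no driver classes involved). It assumes five hosts, taken from the five addresses listed in the exception, and it assumes every attempt fails instantly, so the figures are only an upper bound. The class and method names are hypothetical.

```java
public class ReconnectRate {
    /**
     * Upper bound on reconnection attempts per second across all hosts,
     * assuming one attempt per host every delayMs milliseconds and
     * ignoring the time each attempt itself takes.
     */
    public static double attemptsPerSecond(long delayMs, int hosts) {
        return hosts * (1000.0 / delayMs);
    }

    public static void main(String[] args) {
        int hosts = 5; // assumption: the five addresses in the exception

        // 60 read as milliseconds (what the policy actually does)
        System.out.println("60 ms delay: " + attemptsPerSecond(60, hosts) + " attempts/s");

        // 60 seconds expressed in milliseconds (a plausible intended value)
        System.out.println("60 s delay:  " + attemptsPerSecond(60_000, hosts) + " attempts/s");
    }
}
```

Under these assumptions a 60 ms delay allows on the order of 83 attempts per second, against fewer than 0.1 per second for a 60 second delay.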

I think you're getting into a situation where connections are established at the TCP level but rejected for some other reason, and then the driver is rapidly firing reconnection attempts and exhausting file descriptors before the discarded connections can be closed.

Try setting the log level to DEBUG to see the reconnection attempts. Also, try filtering your lsof commands for socket descriptors on port 9042. On Ubuntu I think the following should work: lsof -i | grep 9042 | wc -l
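If you would rather track that count from inside the application (for example, logging it periodically), the same filter can be sketched in Java over lines of captured lsof output. This is only an illustration: the class name is hypothetical, the sample lines are made up rather than real output from this cluster, and the match on ":9042" is slightly stricter than the plain grep above.

```java
import java.util.Arrays;
import java.util.List;

public class PortSocketCount {
    /** Counts lines mentioning ":<port>", roughly like `grep 9042 | wc -l`. */
    public static long countLinesWithPort(List<String> lsofLines, int port) {
        String needle = ":" + port;
        return lsofLines.stream()
                .filter(line -> line.contains(needle))
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines in the general shape of `lsof -i` output.
        List<String> sample = Arrays.asList(
            "java 1234 app 45u IPv4 0x0 0t0 TCP 10.0.0.5:51000->172.26.11.10:9042 (ESTABLISHED)",
            "java 1234 app 46u IPv4 0x0 0t0 TCP 10.0.0.5:51001->172.26.11.11:9042 (CLOSE_WAIT)",
            "java 1234 app 47u IPv4 0x0 0t0 TCP 10.0.0.5:51002->10.0.0.9:443 (ESTABLISHED)");

        System.out.println("Sockets on port 9042: " + countLinesWithPort(sample, 9042));
    }
}
```

A count that keeps growing between samples would be consistent with connections being opened faster than discarded ones are closed.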

Carlo Bertuccini
September 2, 2014, 12:58 PM

I'm not sure how it's possible, but I think the issue is related to Apache HttpClient.
I wrote an HTTP connector library that uses org.apache.httpcomponents.httpclient 4.3.3. In my library I have a running thread that closes idle/expired connections every 30 seconds. It invokes closeExpiredConnections() and closeIdleConnections() from PoolingHttpClientConnectionManager.

According to the documentation, these two methods only close connections within the pool, so I am wondering how this can affect the driver pool.

btw: this would also explain why I could run the unit tests with no issue.

Olivier Michallat
September 2, 2014, 1:18 PM

I don't know how you use HttpClient, but I doubt it can interfere with the driver's connections in any way. The driver uses its own private pools; the only thing external code can do is close the Cluster.

Carlo Bertuccini
September 3, 2014, 6:18 AM

Olivier, it was very strange to me too. What happened is that while I was removing the HttpClient thread, the system engineers updated our cluster to 2.0.10, and I didn't know that. Now the problem cannot be reproduced anymore with the new Cassandra version. Thanks for your patience.

Olivier Michallat
September 3, 2014, 7:16 AM

OK, I'll close the issue since your problem is solved.

If that happens again, take a closer look at your connection opening rate (through debug logs and detailed lsof).

Cannot Reproduce

Assignee

Unassigned

Reporter

Carlo Bertuccini

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

None

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Components

Priority

Critical