I'm already setting rpc_address to the node's IP address.
The documentation says that "Session instances are thread-safe and usually a single instance is enough per application", so based on that I created a singleton to retrieve the session:
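Roughly, the singleton looks like this (a sketch against the 2.0.x Java driver API; the contact point and keyspace name are placeholders, not my real config):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Lazily creates a single shared Session, as the driver docs recommend.
public final class CassandraSessionHolder {

    private static volatile Session session;

    private CassandraSessionHolder() {}

    public static Session getSession() {
        if (session == null) {
            synchronized (CassandraSessionHolder.class) {
                if (session == null) {
                    Cluster cluster = Cluster.builder()
                            .addContactPoint("dev-cross-cassandra-1.pgol.net") // placeholder
                            .build();
                    session = cluster.connect("my_keyspace"); // placeholder keyspace
                }
            }
        }
        return session;
    }
}
```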
But after a few minutes of very low usage (even idle), I see all connections to my dev cluster fail:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: [dev-cross-cassandra-3.pgol.net/172.26.11.12:9042, dev-cross-cassandra-4.pgol.net/172.26.11.13:9042, dev-cross-cassandra-1.pgol.net/172.26.11.10:9042, dev-cross-cassandra-2.pgol.net/172.26.11.11:9042, /172.26.11.14:9042] - use getErrors() for details)
getErrors() just says it attempted to connect to each node without success.
And in the log I also see a lot of reconnection attempts:
[Cassandra Java Driver worker-4] WARN com.datastax.driver.core.Connection - Timeout while setting keyspace on connection to /172.26.11.14:9042. This should not happen but is not critical (it will retried)
[Cassandra Java Driver worker-5] WARN com.datastax.driver.core.Connection - Timeout while setting keyspace on connection to dev-cross-cassandra-2.pgol.net/172.26.11.11:9042. This should not happen but is not critical (it will retried)
It's surely not a problem with the cluster, since if I restart my application without touching the cluster, everything works again for another 15-20 minutes. As you can see, I've added a ConstantReconnectionPolicy, but nothing changed.
My dev cluster has 4 nodes with RF=3, running Cassandra 2.0.9.
Looking for similar issues I found only a duplicate that was closed ( ), but the problem persists with both driver 2.0.4 and 2.0.5.
The argument to ConstantReconnectionPolicy is in milliseconds; 60 ms seems very low to me.
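To illustrate the unit issue: a delay intended to be 60 seconds has to be passed as 60000, not 60. A sketch against the 2.0.x driver API (the contact point is a placeholder):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;

public class ReconnectExample {
    public static Cluster buildCluster() {
        // ConstantReconnectionPolicy takes its delay in milliseconds,
        // so 60 means 60 ms; a 60-second delay is 60 * 1000.
        return Cluster.builder()
                .addContactPoint("dev-cross-cassandra-1.pgol.net") // placeholder
                .withReconnectionPolicy(new ConstantReconnectionPolicy(60 * 1000L))
                .build();
    }
}
```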
Looking at the code for 2.0.5, a reconnection can also be triggered by an AuthenticationException or an UnsupportedProtocolVersionException. Are you sure your client authenticates correctly?
I think you're getting into a situation where connections are established at the TCP level but rejected for some other reason, and then the driver is rapidly firing reconnection attempts and exhausting file descriptors before the discarded connections can be closed.
Try setting the log level to DEBUG to see the reconnection attempts. Also try filtering your lsof output for socket descriptors on port 9042; on Ubuntu the following should work: lsof -i | grep 9042 | wc -l
I'm not sure how it's possible, but I think the issue is related to Apache HttpClient.
I wrote an HTTP connector library that uses org.apache.httpcomponents:httpclient 4.3.3. In my library, a background thread closes idle/expired connections every 30 seconds by invoking closeExpiredConnections() and closeIdleConnections() on PoolingHttpClientConnectionManager.
According to the documentation, these two methods only close connections within HttpClient's own pool, so I'm wondering how they could affect the driver's pool.
btw: this would also explain why I could run the unit tests with no issues.
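For context, the eviction thread is roughly this (a simplified sketch, not my exact code; only the 30-second interval comes from what I described above):

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Background thread that periodically evicts expired and idle
// connections from HttpClient's own pool.
public class IdleConnectionEvictor extends Thread {

    private final PoolingHttpClientConnectionManager connectionManager;
    private volatile boolean shutdown;

    public IdleConnectionEvictor(PoolingHttpClientConnectionManager cm) {
        this.connectionManager = cm;
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(30_000); // run every 30 seconds
                }
                // Both calls only touch connections held by this manager's pool.
                connectionManager.closeExpiredConnections();
                connectionManager.closeIdleConnections(30, TimeUnit.SECONDS);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }
}
```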
I don't know how you use HttpClient, but I doubt it can interfere with the driver's connections in any way. The driver uses its own private pools; the only thing external code can do is close the Cluster.
Olivier, it was very strange to me too. What happened is that while I was removing the HttpClient thread, the system engineers upgraded our cluster to 2.0.10 without my knowledge. The problem can no longer be reproduced with the new Cassandra version. Thanks for your patience.
OK, I'll close the issue since your problem is solved.
If it happens again, take a closer look at your connection opening rate (through debug logs and detailed lsof output).