Requests from client to a specific cassandra instance start timing out after few days of operation. Requests to other instances are fine. Every request to that specific instance shows error like this:
2014/08/20 15:43:04.103 6da90bb3-6be4-43cf-b3e5-e1adaf95d986 bld14D70 svn16039 tags[canary] TRACE [ApplicationWorkerThread-122] com.datastax.driver.core.RequestHandler - Querying node xyz/xyz-ip:9042
2014/08/20 15:43:09.103 6da90bb3-6be4-43cf-b3e5-e1adaf95d986 bld14D70 svn16039 tags[canary] DEBUG [ApplicationWorkerThread-122] com.datastax.driver.core.RequestHandler - Error querying xyz/xyz-ip:9042, trying next host (error is: com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections))
See the attached logs where this happens to all application threads.
The server side metrics indicate that these requests never reached server. When the client is restarted, it can connect to the same instance with out any issues.
My guess is: this is caused by failure to decrement the request count on a connection in some error condition. This is likely causing the number of requests to pile up on connections to a given instance. When they all reach the maximum number of requests, we get in to this condition of time out. Unfortunately, the number of requests per connections is not logged even at trace level.
This is causing major pain for our operation folks who have to bounce the cassandra client applications many times to make it healthy again. Can you suggest ways to narrow down the problem?
I don't believe the SocketOptions.getReadTimeoutMillis is customized. Would that make any difference?
I do not have access to the logs right before the problem started to happen. I am trying to get that next time when this happens.
We may have to hot-fix as this issue is happening for a long time and we need to get to the bottom of it. Can you send me a patch I can apply locally?
Actually, I just realized that could explain this problem. Does your application use asynchronous requests, and possibly cancel the ResultSetFutures before they have completed?
Yes, I confirmed that our application uses asynchronous requests, and cancel the ResultSetFutures before they have completed. It is very likely we are affected by JAVA-407. I would like to get the official 2.0.5 build with the fix. I do not yet find the build 2.0.5 on http://mvnrepository.com/artifact/com.datastax.cassandra/cassandra-driver-core/2.0.5
Can you make it available soon so we can try it out
Likely caused by 407.