Long running client times out against a specific cassandra instance

Description

Requests from client to a specific cassandra instance start timing out after few days of operation. Requests to other instances are fine. Every request to that specific instance shows error like this:

2014/08/20 15:43:04.103 6da90bb3-6be4-43cf-b3e5-e1adaf95d986 bld14D70 svn16039 tags[canary] TRACE [ApplicationWorkerThread-122] com.datastax.driver.core.RequestHandler - Querying node xyz/xyz-ip:9042
2014/08/20 15:43:09.103 6da90bb3-6be4-43cf-b3e5-e1adaf95d986 bld14D70 svn16039 tags[canary] DEBUG [ApplicationWorkerThread-122] com.datastax.driver.core.RequestHandler - Error querying xyz/xyz-ip:9042, trying next host (error is: com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections))

See the attached logs where this happens to all application threads.

The server side metrics indicate that these requests never reached server. When the client is restarted, it can connect to the same instance with out any issues.

My guess is: this is caused by failure to decrement the request count on a connection in some error condition. This is likely causing the number of requests to pile up on connections to a given instance. When they all reach the maximum number of requests, we get in to this condition of time out. Unfortunately, the number of requests per connections is not logged even at trace level.

This is causing major pain for our operation folks who have to bounce the cassandra client applications many times to make it healthy again. Can you suggest ways to narrow down the problem?

Environment

None

Pull Requests

None

Activity

Show:
Vishy Kasar
August 22, 2014, 5:03 PM

I don't believe the SocketOptions.getReadTimeoutMillis is customized. Would that make any difference?

I do not have access to the logs right before the problem started to happen. I am trying to get that next time when this happens.

We may have to hot-fix as this issue is happening for a long time and we need to get to the bottom of it. Can you send me a patch I can apply locally?

Olivier Michallat
August 22, 2014, 7:16 PM

Actually, I just realized that could explain this problem. Does your application use asynchronous requests, and possibly cancel the ResultSetFutures before they have completed?

Vishy Kasar
August 22, 2014, 11:24 PM

Yes, I confirmed that our application uses asynchronous requests, and cancel the ResultSetFutures before they have completed. It is very likely we are affected by JAVA-407. I would like to get the official 2.0.5 build with the fix. I do not yet find the build 2.0.5 on http://mvnrepository.com/artifact/com.datastax.cassandra/cassandra-driver-core/2.0.5

Can you make it available soon so we can try it out

Olivier Michallat
August 23, 2014, 10:34 AM
Olivier Michallat
January 13, 2015, 6:57 PM

Likely caused by 407.

Duplicate

Assignee

Unassigned

Reporter

Vishy Kasar

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

External issue ID

None

Components

Priority

Major
Configure