I'm encountering a case where the failure of futures returned from Session.executeAsync is dramatically delayed (over one minute). It seems this is caused by cluster.manager.executor becoming very backlogged when my hosts enter SUSPECT state and connections are closed while there are many inflight requests:
If the connection cannot be re-established quickly, the executor becomes very backlogged. The problem appears to come from RequestHandler.retry using the executor to retry a request, combined with DCAwareRoundRobinPolicy.waitOnReconnection blocking on each suspected host:
Retry task causing TIMED_WAITING in cluster.manager.executor thread
Example of future taking longer than a minute to complete
The executor is intended for non-blocking tasks, but this particular task can block if the host is in a suspected state (depending also on the implementation of the LoadBalancingPolicy in use). I therefore think we should move this task (the work in RequestHandler.retry) to a different executor, since it can delay important work such as triggering up and down events. Delaying the setting of an exception on the future is probably acceptable compared to delaying that other work. I should note that as soon as I had a non-suspected host, these tasks completed quickly.
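To illustrate the failure mode (this is not the driver's actual code, just a minimal, self-contained sketch with hypothetical names): a single blocking task starves a single-threaded executor that was meant for non-blocking work, delaying everything queued behind it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorBacklogDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for cluster.manager.executor: intended for non-blocking tasks only.
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // A retry-style task that blocks, simulating
        // DCAwareRoundRobinPolicy.waitOnReconnection on a suspected host.
        executor.submit(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Important work (e.g. completing a future exceptionally, or delivering
        // an up/down event) queued behind it waits for the full 500 ms.
        long queuedAt = System.nanoTime();
        long delayMs = TimeUnit.NANOSECONDS.toMillis(
                executor.submit(() -> System.nanoTime() - queuedAt).get());
        System.out.println("important task delayed >= 400 ms: " + (delayMs >= 400));
        executor.shutdown();
    }
}
```

With many suspected hosts and many in-flight requests, the queue holds many such blocking tasks at once, which is why futures can take over a minute to fail.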
Example showing backlogged executor:
The test scenario used to reproduce this:
Configure read timeout of 500ms, connection timeout of 100ms.
Send continual queries (selects, deletes, writes, etc.); up to 2400 simultaneous requests against 3 local nodes.
Run a script that repeatedly does kill -STOP <pid> and then kill -CONT <pid> on the Cassandra nodes.
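For reference, the timeout configuration from step 1 can be expressed via the driver's SocketOptions; a configuration sketch (the contact point is a placeholder):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class ReproConfig {
    public static void main(String[] args) {
        // Aggressive timeouts used to reproduce the backlog:
        // 500 ms read timeout, 100 ms connect timeout.
        SocketOptions socketOptions = new SocketOptions()
                .setReadTimeoutMillis(500)
                .setConnectTimeoutMillis(100);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder contact point
                .withSocketOptions(socketOptions)
                .build();
        Session session = cluster.connect();
        // ... send continual queries via session.executeAsync here ...
        cluster.close();
    }
}
```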
It takes a fairly contrived scenario to create this, though it is theoretically possible for it to happen without these parameters (many in-flight requests while a connection is closing and all hosts are suspected), so the severity may be lower than 'Major'.
Hello, I am facing this issue and it is our main concern about our usage of C*. My thread dumps match exactly the stack trace mentioned in the description for almost all of our Cassandra Java Driver worker threads (as soon as we detect the NoHostAvailableException). We lowered the probability of the issue occurring by changing some parameters (e.g. the read timeout is already at 25 seconds) but it still occurs. We are working on speeding up our C* requests, but CF histograms show that the maximum response time on the server side is far higher than the mean response time, so the timeouts are hard to understand/avoid. We currently have to detect the NoHostAvailableException in log files in order to trigger a restart of our application so that all connections get renewed.
See https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/b076HRgEfoo for the complete discussion about this.
Could you please give this issue a higher priority? Thanks
More evidence of this happening in the field, reported by user safato in #datastax-drivers (gist showing the thread dump). He is reporting that all of his hosts become suspected, so what is likely happening is that a bunch of pending retry requests are preventing the reconnection attempts from being processed in a timely manner.
Each of 3 hosts becoming suspected (one recovers but gets suspected again):
User never sees a 'Transport initialized and ready' message again. This possibly means the initialReconnectAttempt future is backed up on cluster.manager.executor.
I think this is happening to enough users that it may be worth creating a separate executor until we get rid of the SUSPECT state (), though this may become less important depending on how the speculative retry implementation works ().
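A separate executor along those lines could look like the following sketch (pure JDK, hypothetical names; not the driver's actual implementation): potentially-blocking retry work goes to its own pool, so the main executor stays responsive for events and future completion.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SeparateRetryExecutorDemo {
    public static void main(String[] args) throws Exception {
        // Main executor keeps only non-blocking work (up/down events,
        // completing futures with exceptions, etc.).
        ExecutorService mainExecutor = Executors.newSingleThreadExecutor();
        // Potentially-blocking retries (e.g. waitOnReconnection) get their own pool.
        ExecutorService retryExecutor = Executors.newSingleThreadExecutor();

        // The blocking retry no longer runs on the main executor.
        retryExecutor.submit(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Important work is processed promptly despite the blocked retry.
        long queuedAt = System.nanoTime();
        long delayMs = TimeUnit.NANOSECONDS.toMillis(
                mainExecutor.submit(() -> System.nanoTime() - queuedAt).get());
        System.out.println("event processed promptly: " + (delayMs < 100));

        mainExecutor.shutdown();
        retryExecutor.shutdownNow();
    }
}
```

The trade-off is an extra thread pool to size and shut down, which is why removing the SUSPECT state entirely may be the cleaner long-term fix.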
The error occurs less often when using the AlwaysIgnoreRetryPolicy, but it is still there. Adding new nodes to the cluster also helps.
For information, I worked on a fork of the Java Driver (see class Cluster) to try to solve the issue, but I have not been able to test it properly (because of a problem querying a CCM-managed cluster from my machine). This modified version is running, and although some NoHostAvailableExceptions still appear, the driver seems able to recover. I am still not sure how much it is worth, though.
, which would remove the SUSPECT state that causes this issue, is currently targeted for 2.0.10 (subject to change). Thought I'd share that for the awareness of those watching this issue.
Issue is fixed via .