We are seeing an issue in our production cluster:
00:00 - Client starts out with 4 Cassandra nodes: n1, n2, n3, n4
01:00 - Error creating connection pool to n1 due to a connection timeout. No attempt to re-establish the connection to n1.
01:02 - Same phenomenon for n2
02:01 - Same phenomenon for n3
02:04 - Same phenomenon for n4. Now we are out of nodes to send requests to and start throwing "no host was tried"
The key issue seems to be: no attempt is made to re-establish the connection when connection establishment fails with a timeout. The connection timeout exception is attached in conn-time-out.txt.
The full log has been sent to DataStax separately.
I narrowed down the current "no host was tried" error to the following commit made for java-driver 2.0.7:
Author: olim7t <email@example.com>
Date: Wed Oct 15 15:58:44 2014 +0200
Ensure failed reconnection attempts don't re-trigger onDown.
With this change, onDown is not called on the load balancer if the host has already been marked down. However, there are bugs in java-driver where the node is actually up and the load balancer thinks it is up, but the driver thinks it is down. In such a case, the above commit fails to trigger onDown on the load balancer.
This led the load balancer to continue using that node while java-driver continued to think the node was down. Eventually, when the connection to that node actually goes down, java-driver does none of the things it normally does for a node transitioning from up to down, because it thinks the node was already down. Specifically:
It does not call onDown on load balancer
It does not re-attempt connections to that node
Because of this, the load balancer thinks the node is still up and attempts to send queries to it. The driver promptly rejects each attempt, as there is no connection pool for that node, and sends the request to the next node. One node is now lost. When the same thing happens to all connected nodes, we start seeing the "no host was tried" error.
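The failure mode described above can be illustrated with a small stand-alone model. This is only a sketch and does not use the driver's real classes: `NoHostTriedSketch`, `losePoolSilently` and `execute` are hypothetical names, and the model assumes that a host without a connection pool is simply skipped in the query plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not the java-driver API.
final class NoHostTriedSketch {
    // Hosts the load balancing policy still believes are up (onDown never reached it).
    private final List<String> policyLiveHosts = new ArrayList<String>();
    // Hosts for which the client actually holds a connection pool.
    private final Set<String> pools = new HashSet<String>();

    NoHostTriedSketch(List<String> hosts) {
        policyLiveHosts.addAll(hosts);
        pools.addAll(hosts);
    }

    // The pool is lost, but the policy is not notified and no reconnection is scheduled.
    void losePoolSilently(String host) {
        pools.remove(host);
    }

    // Walks the query plan and returns the first host that still has a pool.
    String execute() {
        for (String host : policyLiveHosts) {
            if (pools.contains(host)) {
                return host;
            }
            // No pool for this host: skip it without trying it.
        }
        throw new IllegalStateException("no host was tried");
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("n1", "n2", "n3", "n4");
        NoHostTriedSketch client = new NoHostTriedSketch(hosts);
        for (String host : hosts) {
            System.out.println("request served by " + client.execute());
            client.losePoolSilently(host);
        }
        try {
            client.execute();
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

Because nothing ever puts a pool back, each lost host is lost permanently, which matches the timeline in the report where n1 through n4 disappear one by one.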
I think we can fix the issue by reverting the above commit from 2.0.7. What do you think Olivier? What is the impact of reverting? Does any functionality break in 2.0.7 if we revert?
The issue has been identified and fixed (see detailed analysis below), and is currently undergoing testing. was also discovered in the process, it is fixed as well. We are working on releasing 2.0.9 ASAP to deliver these fixes.
It has been established that commit f42f825d was not the cause; however, that commit will still be reverted for reasons explained here.
Analysis for 577:
A host starts throwing errors, so it gets into the SUSPECT state, where we try a quick reconnection before marking it down.
In this case the reconnection succeeds, so we call onUp to renew the connection pools. However, the node is still overloaded and one of the connections gets a timeout while initializing:
The failed pooled connections call defunct, which correctly triggers onDown (thanks to the instanceof check, isReconnectionAttempt evaluates to false).
However in onDown we have this check:
And isSuspectedVerification is false (the only time it's true is when onDown was invoked directly from onSuspected because of an exception).
Therefore we return from onDown without having scheduled a reconnection attempt, and the host stays SUSPECT forever.
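The effect of that early return can be modelled in a few lines. The exact check in the driver is not reproduced in this thread, so the guard below is an assumption inferred from the description (return when the host is SUSPECT and isSuspectedVerification is false); `SuspectGuardSketch`, `Host` and `reconnectionScheduled` are hypothetical names, not the driver's code.

```java
// Hypothetical model of the behaviour described above, not the driver's source.
final class SuspectGuardSketch {
    enum State { UP, SUSPECT, DOWN }

    static final class Host {
        State state = State.UP;
        boolean reconnectionScheduled = false;
    }

    // Assumed shape of the guard: a SUSPECT host is only taken down when the
    // call comes from the suspicion verification itself.
    static void onDown(Host host, boolean isSuspectedVerification) {
        if (host.state == State.SUSPECT && !isSuspectedVerification) {
            return; // early return: no state change, no reconnection attempt
        }
        host.state = State.DOWN;
        host.reconnectionScheduled = true;
    }

    public static void main(String[] args) {
        Host host = new Host();
        host.state = State.SUSPECT;

        // A pooled connection fails during initialization and triggers onDown.
        onDown(host, false);
        System.out.println(host.state + ", reconnection scheduled: " + host.reconnectionScheduled);
    }
}
```

In this model the host remains SUSPECT with no reconnection scheduled after the call, which is the stuck state the analysis describes.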
Validated against the 2.1 and 2.0 branches that this issue no longer manifests, along with JAVA-587. To test this scenario, executed the following:
Executed a version of the test that included against a local 3 node cluster.
While the test was running, ran a script that randomly suspends 1 or 2 of the Cassandra instances by sending a kill -stop <pid> to the process, waiting 100-1000 milliseconds and then sending a kill -cont. The process then repeats indefinitely.
Observed the status of the hosts to ensure that none enters a 'SUSPECTED' state with its initialReconnectFuture being non-null or being completed.
With this scenario I can reproduce this issue typically within a minute. On the current 2.1 and 2.0 branches, I cannot reproduce this after an hour of sustained runtime. I will run this scenario continuously over the next few days to ensure it does not remanifest.
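For anyone wanting to reproduce the suspend/resume part of this scenario, a rough sketch in Java follows. It is not the script that was actually used: `SuspendResumeSketch` and its methods are hypothetical, it assumes a Unix-like system with a `kill` command on the PATH, it takes the Cassandra process ids as command-line arguments, and it suspends only one instance at a time rather than 1 or 2.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical reproduction helper; assumes a Unix-like "kill" command is available.
final class SuspendResumeSketch {
    // Builds e.g. ["kill", "-STOP", "1234"].
    static List<String> killCommand(String signal, long pid) {
        return Arrays.asList("kill", "-" + signal, Long.toString(pid));
    }

    // Pause length between 100 and 1000 milliseconds, inclusive.
    static long randomPauseMillis(Random random) {
        return 100 + random.nextInt(901);
    }

    static void run(List<String> command) throws IOException, InterruptedException {
        int exit = new ProcessBuilder(command).inheritIO().start().waitFor();
        if (exit != 0) {
            throw new IOException(command + " exited with status " + exit);
        }
    }

    // Suspends the process, waits, then always resumes it.
    static void suspendOnce(long pid, Random random) throws IOException, InterruptedException {
        run(killCommand("STOP", pid));
        try {
            Thread.sleep(randomPauseMillis(random));
        } finally {
            run(killCommand("CONT", pid));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: SuspendResumeSketch <pid> [<pid> ...]");
            return;
        }
        Random random = new Random();
        while (true) {
            // Pick one of the given Cassandra processes at random and freeze it briefly.
            long pid = Long.parseLong(args[random.nextInt(args.length)]);
            suspendOnce(pid, random);
        }
    }
}
```

Resuming in a `finally` block is a deliberate choice so that an interrupted run does not leave a node frozen.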
Thanks Andrew. Can we make that test part of your long running regression test suite so that the issue does not show up again in a future version?
This is absolutely something we are planning to incorporate, . There are some other scenarios we'd like to bring in as well around connectivity (timeouts, connection loss, etc.) among other things. We'll be tracking this in .