This was detected while testing JAVA-577:
The control connection is connected to host1.
host3 was down but the background reconnection attempt just succeeded. We ask the control connection to refresh host3's info.
However, host1 is also overloaded, which causes the refresh query to time out:
This is (wrongly) interpreted as not finding host3 in host1's system.peers table, which causes the host to be permanently ignored (a check that was added as part of ):
Validated against the 2.1 and 2.0 branches that this issue no longer manifests along with . To test this scenario, I executed the following:
Executed a version of the test that included in against a local 3-node cluster.
While the test was running, ran a script that randomly suspends 2 of the Cassandra instances by sending kill -stop <pid> to each process, waiting 100-1000 milliseconds, and then sending kill -cont. The process then repeats indefinitely. Suspending a Cassandra node together with another Cassandra node that happens to be the host the driver is using for the control connection helps manifest this issue.
Observed the status of the hosts to ensure that none of them enters a 'DOWN' state with its reconnectionAttempt being non-null or being completed.
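For anyone who wants to repeat this, the suspend/resume script can be sketched roughly as below. This is not the script that was actually used, only a minimal Python approximation of the behaviour described above. It assumes the 100-1000 millisecond wait happens between the stop and the continue signals, and that the PIDs of the local Cassandra processes are already known. The names suspend_cycle and run_forever are made up for this sketch.

```python
import os
import random
import signal
import time


def suspend_cycle(pids, rng=random, kill=os.kill, sleep=time.sleep):
    """Suspend two randomly chosen processes, wait 100-1000 ms, resume them.

    rng, kill and sleep are injectable only so the cycle can be exercised
    without touching real processes (an assumption of this sketch).
    """
    victims = rng.sample(list(pids), 2)
    for pid in victims:
        kill(pid, signal.SIGSTOP)  # equivalent to `kill -stop <pid>`
    pause = rng.uniform(0.1, 1.0)  # seconds, i.e. 100-1000 milliseconds
    try:
        sleep(pause)
    finally:
        # Always resume, even if the sleep is interrupted (e.g. Ctrl-C),
        # so no Cassandra process is left stopped.
        for pid in victims:
            kill(pid, signal.SIGCONT)  # equivalent to `kill -cont <pid>`
    return victims, pause


def run_forever(pids):
    """Repeat the suspend/resume cycle until the script is interrupted."""
    while True:
        suspend_cycle(pids)
```

Calling run_forever with the three local Cassandra PIDs starts the loop. With 3 nodes and 2 suspended per cycle, the control connection host is among the suspended nodes in roughly two out of three cycles, which is the combination that helps trigger the issue.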
With this scenario I can reproduce this issue typically within a minute on 2.0.8 and 2.1.3. On the current 2.1 and 2.0 branches, I cannot reproduce this after an hour of sustained runtime. I will run this scenario continuously over the next few days to ensure it does not re-manifest.