While running a test scenario that constantly suspects connections by setting very low connect and read timeouts, I found that the Cluster's query plan lost all but one host over time (hours), even though all hosts were up.
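For reference, the aggressive timeouts can be configured roughly like this (a sketch against the Java driver's SocketOptions API; the contact point and millisecond values are illustrative, not the exact ones used in the scenario):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

// Sketch: timeouts set far below the defaults (5000 ms connect,
// 12000 ms read) so that slow responses constantly suspect connections.
SocketOptions socketOptions = new SocketOptions()
        .setConnectTimeoutMillis(50)
        .setReadTimeoutMillis(50);

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1") // illustrative contact point
        .withSocketOptions(socketOptions)
        .build();
```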
For example, in the case below only one host remained in the query plan, and this continued indefinitely.
While this looks similar to , it's slightly different in that the hosts are not in a 'SUSPECTED' state. Host#isUp() returns true even if the host is suspected, so I took a heap dump and confirmed that host.state was 'UP' for all three hosts.
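For completeness, the client-side view can be dumped like this (a diagnostic sketch; `cluster` is assumed to be an already-built Cluster instance). It also shows why the heap dump was needed: Host#isUp() alone cannot distinguish UP from SUSPECTED.

```java
import com.datastax.driver.core.Host;

// Diagnostic sketch: print what the driver reports for each known host.
// Note: Host#isUp() returns true even while a host is merely suspected,
// so host.state had to be inspected via a heap dump instead.
for (Host host : cluster.getMetadata().getAllHosts()) {
    System.out.println(host.getAddress() + " isUp=" + host.isUp());
}
```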
This is much more difficult to reproduce than .
Restarting the Cassandra node associated with a host tends to get it back into a good state: the socket connections are lost, which triggers an onDown event, which in turn causes a reconnect, bringing the host back into the LB policy.
After 17+ hours of running a scenario that typically manifests the issue within a few hours, I am unable to reproduce this with the fixes in the 2.0 branch.
Completed validation against the 2.0 and 2.1 branches.