The SUSPECT state was introduced to handle the case where connections are dropped without the driver noticing.
Starting with 2.0.10 / 2.1.5, this case is covered by the connection heartbeat (JAVA-533). The SUSPECT state adds a lot of complexity to notification handling, so we should consider removing it.
Another problem with the suspicion mechanism is that load balancing policies will wait on suspected hosts' initial reconnection attempts. If this happens at the beginning of the query plan, it will cause the driver to block before the first request is sent, so even asyncExecute will block.
"If this happens at the beginning of the query plan, it will cause the driver to block before the first request is sent"
Unless there is an unexpected bug, that's not how it works. Suspected nodes are handled differently and we only wait on the initial reconnection after all other nodes have been tried. In other words, you'll block only if no other node can be contacted, which is pretty reasonable.
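The ordering described above can be illustrated with a minimal sketch. This is hypothetical code, not the driver's actual implementation: it builds a query plan that tries UP hosts first and appends suspected hosts at the end, so the plan only waits on a suspected host's initial reconnection when no other host can be contacted:

```java
import java.util.*;

public class QueryPlanSketch {
    enum State { UP, SUSPECT }

    // Hypothetical helper: UP hosts come first, SUSPECT hosts are deferred
    // to the tail of the plan, so we only block on a suspected host's
    // reconnection attempt after every other host has been tried.
    static List<String> newQueryPlan(Map<String, State> hosts) {
        List<String> plan = new ArrayList<>();
        List<String> suspected = new ArrayList<>();
        for (Map.Entry<String, State> e : hosts.entrySet()) {
            if (e.getValue() == State.UP) plan.add(e.getKey());
            else suspected.add(e.getKey());
        }
        plan.addAll(suspected);
        return plan;
    }

    public static void main(String[] args) {
        Map<String, State> hosts = new LinkedHashMap<>();
        hosts.put("10.0.0.1", State.SUSPECT);
        hosts.put("10.0.0.2", State.UP);
        hosts.put("10.0.0.3", State.UP);
        // The suspected host ends up last in the plan.
        System.out.println(newQueryPlan(hosts));
    }
}
```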
Overall, I agree that the handling of SUSPECT is a bit complex, but I also wouldn't say that a heartbeat fully covers the same cases, as heartbeats are imprecise by nature: if your connection broke somehow, heartbeats won't help until the next one is sent. In particular, this means the interval at which heartbeats are sent is likely not universal and is one more thing users have to tune (so I would argue that it's more complex for users, even if it's simpler in the driver's implementation).
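The "imprecise by nature" point can be made concrete: in the worst case a broken connection goes unnoticed for a full heartbeat interval plus the time spent waiting for the heartbeat response. A rough sketch of the bound (the 30s/12s figures are illustrative assumptions, not statements about the driver's defaults):

```java
public class HeartbeatDelay {
    // Worst case: the connection breaks just after a heartbeat succeeds, so we
    // wait a full interval before the next heartbeat is sent, then a full read
    // timeout before concluding its response is not coming.
    static int worstCaseDetectionMillis(int heartbeatIntervalMillis, int readTimeoutMillis) {
        return heartbeatIntervalMillis + readTimeoutMillis;
    }

    public static void main(String[] args) {
        // e.g. a 30s heartbeat interval and a 12s read timeout (assumed values)
        System.out.println(worstCaseDetectionMillis(30_000, 12_000) + " ms");
    }
}
```

Shrinking that window means shortening the interval, which is exactly the per-deployment tuning trade-off mentioned above.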
I'm not absolutely against removing the suspect state, but I'm not entirely convinced either.
Yes, I should have said "if all hosts are down or suspect", I had a user report that today. I agree that it's probably not a very common case.
To expand on my case to revert it:
onSuspected can't use the host's "notification lock" that was recently introduced to serialize status events (I won't go into full details here, but there is a deadlock in a specific scenario). That means we have to reason about complex races between onSuspected and onUp; an earlier issue was an example of this.
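The races in question have roughly the following shape. This is only an illustrative sketch (HostState and its CAS-based transitions are hypothetical, not the driver's code), showing why onSuspected and onUp must each tolerate the other having won the race once they can't share a lock:

```java
import java.util.concurrent.atomic.AtomicReference;

public class HostState {
    enum State { UP, SUSPECT, DOWN }

    final AtomicReference<State> state = new AtomicReference<>(State.UP);

    // Only transition UP -> SUSPECT; if a concurrent onUp or onDown already
    // changed the state, this must silently lose the race (return false).
    boolean onSuspected() {
        return state.compareAndSet(State.UP, State.SUSPECT);
    }

    // onUp must handle being called while the host is SUSPECT (the initial
    // reconnection attempt succeeded) as well as while it is DOWN.
    boolean onUp() {
        State prev = state.getAndSet(State.UP);
        return prev != State.UP; // true if this call actually changed the state
    }
}
```

Without a single lock serializing the events, every interleaving of transitions like these has to be checked for lost updates, which is the complexity being argued against.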
If all hosts are suspect, retries will block on suspected reconnection attempts (5 seconds per host with the default connect timeout). Both retries and suspected reconnection attempts are scheduled on the same executor (Cluster.Manager.executor). We believe there can be starvation issues where suspected reconnection attempts end up enqueued behind the retries that are waiting on them (JAVA-637 is an example of this – we haven't gotten to the bottom of it yet, but we see the executor queue size going through the roof and hosts staying suspect forever).
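The suspected starvation can be reproduced in miniature: when a task that waits on a result and the task that produces it share one bounded executor, the waiter can occupy every thread while the producer sits behind it in the queue. A self-contained sketch (StarvationDemo is illustrative, not driver code):

```java
import java.util.concurrent.*;

public class StarvationDemo {

    // Submits a "retry" that waits for a "reconnection" signal, then submits
    // the reconnection itself on the SAME executor. Returns whether the retry
    // ever saw the reconnection complete.
    static boolean retrySeesReconnection(int threads) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            CountDownLatch reconnected = new CountDownLatch(1);
            // The retry blocks, occupying one of the executor's threads.
            Future<Boolean> retry =
                executor.submit(() -> reconnected.await(500, TimeUnit.MILLISECONDS));
            // The reconnection attempt is enqueued behind it.
            executor.submit(reconnected::countDown);
            return retry.get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // One shared thread: the reconnection starves behind the blocked retry.
        System.out.println("single thread: " + retrySeesReconnection(1));
        // A spare thread lets the reconnection run, so the retry succeeds.
        System.out.println("two threads:   " + retrySeesReconnection(2));
    }
}
```

In the single-threaded case the retry times out and the host would stay suspect, which matches the queue-growth symptom described above.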
This hurts some of our users. I'm considering doing this for 2.0.10.
Validated in the following ways:
Executed TimeoutStressTest and observed that hosts were no longer suspected on read timeouts, so connections were not torn down and were reused; that particular test now runs much more stably in terms of achieved throughput.
Executed the Duration Test while suspending Cassandra nodes (ccm pause/resume) to ensure they are properly marked down via detection by the control connection and brought back up. Also validated that worker threads did not get backed up by retries, since the RRLoadBalancingPolicies no longer wait on suspected hosts.