The control connection is established to host1. Requests go to connection pools of 8 connections each on host1 through host10. Connection 2 on host1 had a problem. Currently the entire host1 (including the control connection to it) is marked down. There does not seem to be a reason to mark down the perfectly working control connection in this case.
This is very likely the reason that triggered https://datastax-oss.atlassian.net/browse/JAVA-497: Control connections are retried too frequently, causing a server outage.
This becomes more important once we allow limiting of connections on the server through:
Related: ConvictionPolicy.Simple#addFailure is always returning true.
Fix: keep the host up as long as it has at least one live connection. The driver tries to reopen the missing connections according to the reconnection policy.
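A minimal sketch of that fix, assuming a hypothetical policy class (the class name and shape are invented for illustration; this is not the driver's actual ConvictionPolicy API): the host is only convicted once its last live connection is gone, instead of on the first failure.

```java
// Hypothetical sketch, not the driver's real API: a conviction policy that
// only marks the host down when no live connections remain.
public class SingleFailureAwarePolicy {
    private int liveConnections;

    public SingleFailureAwarePolicy(int initialConnections) {
        this.liveConnections = initialConnections;
    }

    /**
     * Called when one connection to the host fails.
     *
     * @return true only if the host should now be marked down,
     *         i.e. its last live connection just died.
     */
    public boolean addFailure() {
        if (liveConnections > 0) {
            liveConnections--;
        }
        return liveConnections == 0; // convict only when the pool is empty
    }

    /** Called when a reconnection attempt succeeds. */
    public void addSuccess() {
        liveConnections++;
    }
}
```

With a pool of 8, a single failed connection (the host1/connection2 case above) would leave 7 live connections, so `addFailure()` returns false and the control connection survives.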
There does not seem to be a reason to mark the perfectly working control connection in this case.
The driver's current assumption is that if a connection has a problem, the host itself is most likely down, and it is thus marked as such. This is admittedly a somewhat pessimistic assumption, but I'm not sure there are many situations that can lead to just one connection having a problem without the node itself being down or dysfunctional. If we were more conservative and did not mark nodes down on connection problems, then in all the cases where a dead connection does mean a dead node, the driver would end up losing time and CPU trying to use the remaining connections (and potentially trying to recreate new ones).
So I guess I'd like to understand what makes your connections to host1 have problems other than host1 being down?
This is very likely the reason that triggered https://datastax-oss.atlassian.net/browse/JAVA-497
I don't know; it sounds to me like the primary cause has been identified and fixed. For this to cause a massive amount of reconnections in itself, you'd have to have connections to the node failing all the time, at which point it's worth understanding what's going on with your setup.
A connection failure does not necessarily mean the host is down.
1. The server closes a new connection attempt because a limit is reached. (https://issues.apache.org/jira/browse/CASSANDRA-8086)
2. An intermediary such as a firewall closes the connection based on its policies (inactivity, oldest connection, etc.)
3. A new connection attempt times out because the server was busy at that point with GC/compaction/streaming/a request surge, etc.
4. A new connection attempt fails because the server ran out of file descriptors at that point.
5. A temporary hiccup on the network path between client and server.
In such cases, it is unnecessary, and in some cases detrimental, to close all the existing connections to the host.
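To make the cases above concrete, here is an illustrative sketch of how a client could treat a failure as transient rather than host-down when other pooled connections to the same host are still alive. All names here (`FailureClassifier`, `Verdict`) are hypothetical; the real driver exposes richer error information.

```java
// Hypothetical sketch: classify a single connection failure as "transient"
// (keep the host up, retry per the reconnection policy) vs. "possibly down".
import java.net.SocketTimeoutException;

public class FailureClassifier {
    public enum Verdict { TRANSIENT, POSSIBLY_DOWN }

    public static Verdict classify(Exception cause, int remainingLiveConnections) {
        // If other connections to the same host are still serving requests,
        // one failure is better treated as transient (cases 1-5 above).
        if (remainingLiveConnections > 0) {
            return Verdict.TRANSIENT;
        }
        // A timeout on the last connection could be GC/compaction pressure
        // (case 3): still plausibly transient, but worth backing off.
        if (cause instanceof SocketTimeoutException) {
            return Verdict.TRANSIENT;
        }
        // No live connections and a hard failure: the host may really be down.
        return Verdict.POSSIBLY_DOWN;
    }
}
```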
The initial trigger was breaking the existing control connection. What got fixed was the connection storm that resulted while re-establishing the control connection. If we had not killed the control connection in the first place, I don't think we would have seen the control connection surge. As explained in JAVA-497, we have a very large number of clients in that cluster. A temporary issue such as #3 above caused pretty much all clients to abandon their existing connections and pound the host with new connection requests.
I understand the motivation behind CASSANDRA-8086, but we need to consider this carefully. This is a game changer for the driver's error handling.
For example, if connections are refused in the middle of initializing a connection pool, should we keep the pool anyway, with only part of the core connections? If so, when do we retry creating the remaining ones? How does the server notify clients when it goes back under the limit? We could address that with new server notifications, but this introduces more moving parts in the client-side logic.
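One possible answer to the first question above, as a hedged sketch: keep the partial pool and report the deficit so a background task can retry per the reconnection policy. Everything here (`PartialPoolInit`, `ConnectionFactory`) is invented for illustration, not driver code.

```java
// Hypothetical sketch: pool initialization that tolerates refused connections
// (e.g. a CASSANDRA-8086-style server-side limit) instead of failing outright.
import java.util.List;

public class PartialPoolInit {
    /** Stand-in for opening one connection; returns false if the server refuses. */
    public interface ConnectionFactory {
        boolean tryOpen();
    }

    /**
     * Tries to open coreConnections connections into the pool.
     *
     * @return how many connections are still missing; a background task
     *         would retry these according to the reconnection policy.
     */
    public static int initPool(ConnectionFactory factory, int coreConnections,
                               List<Object> pool) {
        for (int i = 0; i < coreConnections; i++) {
            if (factory.tryOpen()) {
                pool.add(new Object()); // stand-in for a live connection
            }
        }
        return coreConnections - pool.size();
    }
}
```

The open question of how the server signals it is back under the limit is left out deliberately; as noted above, that would need new server notifications.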
A connection failure does not necessarily mean the host is down
No, but it is more likely that it does than that something else is wrong. The reason for the driver's behavior is that never marking a host down because of a connection failure would be inefficient: most connection failures probably are a problem with the node, and continuing to try a node too aggressively when it is likely (but not certainly) dead would make the death of a node have more impact on the overall latency of operations.

I'm not saying the current behavior is perfect, and we can probably improve it, but assuming that a connection failure never means the node is down would be, imo, a worse solution. We should also be careful not to over-complicate error handling by trying to be too smart, as, as Olivier said, this could introduce more moving parts with more risk of problems in practice. So again, I'm not saying the driver's behavior is perfect; I'm merely trying to explain the rationale behind it and to say that it's unclear to me how to improve it further (without introducing bigger problems and/or too much complexity).
My other point is that, with the fix in place, I'm not entirely convinced that mistakenly reconnecting the control connection from time to time could actually create server outages.
We saw a somewhat related issue in one of our stress tests.
We were running out of stream ids on a connection (because responses weren't being received for the cancelled requests). When this happens, the connection gets closed, which is fine. However, the node was also marked down, which caused issues culminating in the Cassandra cluster becoming progressively unresponsive until it had to be shut down.
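For readers unfamiliar with the failure mode: each native-protocol connection multiplexes requests over a fixed set of stream ids, and an id is only freed when its response arrives. A rough sketch of such a per-connection id pool (the class is hypothetical, not driver code) shows why leaked ids from cancelled requests should retire only that connection, not the host:

```java
// Hypothetical sketch of a per-connection stream-id pool. If cancelled
// requests never release their ids, borrow() eventually returns -1 and the
// connection must be replaced -- but the host itself may be perfectly healthy.
import java.util.BitSet;

public class StreamIdPool {
    private final BitSet inUse;
    private final int maxStreams;

    public StreamIdPool(int maxStreams) {
        this.maxStreams = maxStreams;
        this.inUse = new BitSet(maxStreams);
    }

    /** @return a free stream id, or -1 when exhausted (replace this connection only). */
    public int borrow() {
        int id = inUse.nextClearBit(0);
        if (id >= maxStreams) {
            return -1;
        }
        inUse.set(id);
        return id;
    }

    /** Called when the response for this id arrives; skipped for cancelled requests, so ids leak. */
    public void release(int id) {
        inUse.clear(id);
    }
}
```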