We have a production cluster of 72 instances spread across 2 DCs, with a large number (~40,000) of clients hitting it. Each client is configured to connect to only 4 Cassandra instances at a time. A new table was added to the cluster keyspace yesterday, which triggered a schema refresh event for all the clients.
The schema refresh timed out for many clients. The timeout resulted in an attempt to establish the control connection, which failed. The retry here is supposed to use exponential back-off, but what I see is one back-off attempt followed by a flurry of attempts to re-establish the control connection. Every client doing this may have caused the trauma we saw on the server side yesterday.
com.datastax.driver.core.ControlConnection#reconnect uses the policy from cluster.reconnectionPolicy(). The client has not changed the default (ExponentialReconnectionPolicy) here.
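For reference, this is roughly what that configuration looks like when made explicit; a minimal sketch assuming the 2.x/3.x driver API, with a placeholder contact point and the documented default values (1 s base delay, 10 min cap):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

// Equivalent to leaving the default in place; the values below are the
// documented defaults, and the contact point is a placeholder.
Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withReconnectionPolicy(new ExponentialReconnectionPolicy(1000L, 10L * 60 * 1000))
        .build();
```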
The logs have been sent separately.
The control connection does not check if a reconnection attempt is already in progress. We get multiple remove events that time out on refreshNodeListAndTokenMap, each scheduling a concurrent reconnection attempt.
To clarify my previous comment: the control connection does prevent simultaneous attempts, but each new attempt cancels the previous one, which causes much shorter delays than a single attempt would produce. There was also a subtle race in that cancellation. I'm refactoring the code to fix that race and let the initial attempt run.
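To illustrate why this defeats the back-off: each attempt pulls a fresh schedule from the policy, so an attempt that is cancelled and replaced after every failure never gets past the first delay. A minimal sketch against the public policy API (the class name and loop are purely illustrative):

```java
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;
import com.datastax.driver.core.policies.ReconnectionPolicy;
import com.datastax.driver.core.policies.ReconnectionPolicy.ReconnectionSchedule;

public class BackoffResetDemo {
    public static void main(String[] args) {
        ReconnectionPolicy policy = new ExponentialReconnectionPolicy(1000L, 10L * 60 * 1000);

        // One long-lived attempt: delays from the same schedule keep growing.
        ReconnectionSchedule longLived = policy.newSchedule();
        for (int i = 0; i < 4; i++)
            System.out.println("long-lived attempt delay: " + longLived.nextDelayMs() + " ms");

        // Attempt cancelled and recreated after every failure: each new schedule
        // starts over, so the observed delay never grows beyond the first value.
        for (int i = 0; i < 4; i++)
            System.out.println("restarted attempt delay: " + policy.newSchedule().nextDelayMs() + " ms");
    }
}
```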
There is still a race condition in AbstractReconnectionHandler that will lead to a thread leak (a simplified sketch follows the steps below):
Multiple threads create a new ARH instance simultaneously and call .start().
These ARH instances all share the same AtomicReference<ListenableFuture<?>> for the current attempt.
Each of those threads schedules a task on the executor, which will result in its run() method being called.
Those threads are then context-switched out.
The executor now runs the newly submitted tasks, which do the following:
Check whether they have been cancelled, which at this point is false for both tasks.
Enter a sleep loop until the isActive variable is set to true.
The two original threads are now put back on the CPU by the scheduler.
One of them eventually cancels its associated task, because it detects that multiple tasks have been submitted.
The task that has just been cancelled is now stuck in the sleep loop, as it never checks again whether it has been cancelled.
Eventually, the JVM will run out of native threads.
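Here is a simplified, self-contained sketch of that sequence (not the driver's actual code; the class and field names are only meant to mirror the pattern). The loser of the compare-and-set is cancelled with cancel(false) after it has already passed its single cancellation check, so it spins in the sleep loop forever and pins an executor thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

public class ReconnectionRaceSketch {
    // Shared by all handler instances, like the ARH's current-attempt reference.
    static final AtomicReference<Future<?>> currentAttempt = new AtomicReference<>();

    static class Handler implements Runnable {
        volatile boolean isActive = false;
        volatile Future<?> self;

        @Override
        public void run() {
            // The cancellation flag is only consulted once, before the wait loop.
            if (self != null && self.isCancelled())
                return;
            // Sleep loop waiting for activation; cancellation is never re-checked,
            // and cancel(false) does not interrupt the sleeping thread.
            while (!isActive) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // only cancel(true) would land here
                    return;
                }
            }
            System.out.println("reconnecting...");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newCachedThreadPool();

        Handler first = new Handler();
        Handler second = new Handler();
        first.self = executor.submit(first);
        second.self = executor.submit(second);

        Thread.sleep(100); // let both tasks pass their cancellation check and enter the loop

        // Both callers race to publish their future; the loser is cancelled
        // without interruption (cancel(false)), exactly as in the steps above.
        if (!currentAttempt.compareAndSet(null, first.self))
            first.self.cancel(false);
        if (!currentAttempt.compareAndSet(null, second.self))
            second.self.cancel(false);

        first.isActive = true; // the winning attempt proceeds and reconnects
        // second was cancelled but stays in its sleep loop: its executor thread never
        // terminates (this program never exits), and with enough leaked attempts the
        // JVM runs out of native threads.
    }
}
```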
Possible fixes include:
call .cancel(true) so that the task's thread is actually interrupted
replace the isActive boolean with a more suitable concurrency primitive (such as a Semaphore or CountDownLatch) and remove the sleep loop; see the sketch below
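A minimal sketch of what the second option could look like (again, not the actual patch, just the shape of it): a CountDownLatch replaces isActive and the sleep loop, and because await() responds to interruption, combining it with cancel(true) lets a cancelled task exit promptly instead of leaking its thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;

public abstract class LatchBasedReconnectionHandler implements Runnable {
    private final CountDownLatch activated = new CountDownLatch(1);
    volatile Future<?> self;

    /** Called by whichever thread wins the race and decides this attempt should run. */
    public void activate() {
        activated.countDown();
    }

    @Override
    public void run() {
        try {
            // Blocks without spinning; a cancel(true) interrupts the wait immediately.
            activated.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        // Re-check cancellation after waking up, instead of only once at the start.
        if (self != null && self.isCancelled())
            return;
        reconnect();
    }

    protected abstract void reconnect();
}
```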
Just pushed the fix I did yesterday for that.