Control connections are retried too frequently, causing a server outage

Description

We have a production cluster with 72 instances spread across 2 DCs, and a large number (~40,000) of clients hitting this cluster. Each client is configured to connect to only 4 Cassandra instances at a time. A new table was added to the cluster keyspace yesterday, which triggered a schema refresh event for all the clients.

The schema refresh timed out for many clients. The timeout resulted in an attempt to re-establish the control connection, which failed. The retry here is supposed to use exponential back-off. However, what I see is a single back-off followed by a flurry of attempts to re-establish the control connection. Every client doing this may have caused the trauma we saw yesterday on the server side.

com.datastax.driver.core.ControlConnection#reconnect uses the policy returned by cluster.reconnectionPolicy(). The client has not changed the default (ExponentialReconnectionPolicy) here.
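
For context, here is a minimal sketch (assuming the 2.x Cluster builder API; the base/max delay constants below are assumptions, not taken from our configuration) of how the reconnection policy is set on the Cluster, which is what ControlConnection#reconnect ends up consulting:

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

public class ClusterSetup {
    public static void main(String[] args) {
        // Explicitly setting what we believe matches the default policy:
        // exponential back-off from a 1 s base delay up to a 10 min cap
        // (these constants are an assumption, not verified against our config).
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // hypothetical contact point
                .withReconnectionPolicy(new ExponentialReconnectionPolicy(1000, 10 * 60 * 1000))
                .build();

        System.out.println("Reconnection policy: "
                + cluster.getConfiguration().getPolicies().getReconnectionPolicy());
        cluster.close();
    }
}
{code}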

The logs have been sent separately.

Environment

None

Pull Requests

None

Activity

Olivier Michallat
October 10, 2014, 8:20 PM

The control connection does not check if a reconnection attempt is already in progress. We get multiple remove events that time out on refreshNodeListAndTokenMap, each scheduling a concurrent reconnection attempt.

Olivier Michallat
October 13, 2014, 7:42 PM

To clarify my previous comment: the control connection does prevent simultaneous attempts, but each new attempt cancels the previous one, which results in much shorter delays than a single attempt would produce. There was also a subtle race in that cancellation. I'm refactoring the code to fix that race and let the initial attempt keep running.
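
To illustrate why that matters, here is a small sketch (assuming ExponentialReconnectionPolicy's newSchedule()/nextDelayMs() contract, with hypothetical base/max delays): a single long-lived schedule backs off progressively, while an attempt that is cancelled and restarted gets a fresh schedule each time and keeps retrying with the smallest delays.

{code:java}
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;
import com.datastax.driver.core.policies.ReconnectionPolicy.ReconnectionSchedule;

public class ScheduleResetIllustration {
    public static void main(String[] args) {
        ExponentialReconnectionPolicy policy =
                new ExponentialReconnectionPolicy(1000, 10 * 60 * 1000);

        // One long-lived attempt: the delay grows on every retry.
        ReconnectionSchedule single = policy.newSchedule();
        for (int i = 0; i < 4; i++) {
            System.out.println("single attempt, retry " + i + ": " + single.nextDelayMs() + " ms");
        }

        // An attempt that is cancelled and restarted gets a brand new schedule
        // each time, so it never progresses past the first (smallest) delays.
        for (int i = 0; i < 4; i++) {
            ReconnectionSchedule restarted = policy.newSchedule();
            System.out.println("restarted attempt " + i + ": " + restarted.nextDelayMs() + " ms");
        }
    }
}
{code}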

Pierre Laporte
October 22, 2014, 7:05 PM

There is still a race condition in AbstractReconnectionHandler that will lead to a thread leak:

  1. Multiple threads create a new ARH instance simultaneously and call .start().

    1. Those ARH all share the same AtomicReference<ListenableFuture<?>> for the current attempt.

  2. Each of those threads schedules a task in the executor, which will result in its run() method being called.

  3. Those threads are context-switched out

  4. The executor now runs the newly submitted tasks, which perform the following:

    1. Check whether they have been canceled, which is currently false for both tasks

    2. Enter a sleeping loop until the isActive variable is set to true

  5. The two previous threads are now scheduled back onto the CPU.

    1. One of them will eventually cancel its associated task, because it detects that multiple tasks have been submitted.

  6. The task that should be cancelled is now stuck in the sleep loop, as it never checks again whether it has been cancelled.

  7. Eventually, the JVM will run out of native threads (a condensed sketch of this scenario is shown below).
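
Here is a condensed, self-contained sketch of the scenario above (simplified, with hypothetical names; the real AbstractReconnectionHandler and its executor are more involved):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SleepLoopLeak {
    // Shared reference to the current attempt, as in step 1.1
    // (the real code holds a ListenableFuture; a plain Future is enough here).
    static final AtomicReference<Future<?>> currentAttempt = new AtomicReference<Future<?>>();
    static volatile boolean isActive = false;

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();

        // The submitted task parks in a sleep loop until isActive becomes true (step 4.2).
        Future<?> attempt = executor.submit(new Runnable() {
            public void run() {
                while (!isActive) {
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        return; // only an interrupt can get us out of here
                    }
                }
                // ... would perform the reconnection attempt ...
            }
        });
        currentAttempt.set(attempt);

        // Steps 5.1/6: the task is cancelled without interruption, so the sleeping
        // thread never notices and keeps looping; the worker thread is leaked (step 7).
        attempt.cancel(false);

        executor.shutdown();
        System.out.println("executor terminated: "
                + executor.awaitTermination(2, TimeUnit.SECONDS)); // prints false

        executor.shutdownNow(); // interruption is what actually frees the thread
    }
}
{code}

Note that the final shutdownNow() is what actually releases the stuck thread, which is essentially what the first fix suggested below relies on.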

Possible fixes include:

  • call .cancel(true) so that the task's thread is actually interrupted

  • replace the isActive boolean with a more suitable concurrency primitive (such as a Semaphore or a CountDownLatch) and remove the sleep loop (see the sketch after this list)
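
A rough sketch of the second option (hypothetical names, not the actual driver code): the waiting task blocks on a CountDownLatch instead of polling a flag, and both activation and cancellation release it, so a cancelled attempt can never get stuck.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class LatchBasedHandler {
    private final CountDownLatch ready = new CountDownLatch(1);
    private volatile boolean cancelled = false;

    // Replaces the "while (!isActive) sleep" loop: block until released.
    void runAttempt() throws InterruptedException {
        ready.await();
        if (cancelled) {
            return; // released by cancellation: exit cleanly
        }
        // ... perform the reconnection attempt ...
    }

    void activate() { // equivalent of setting isActive = true
        ready.countDown();
    }

    void cancel() { // releases the waiting thread instead of leaking it
        cancelled = true;
        ready.countDown();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        final LatchBasedHandler handler = new LatchBasedHandler();

        Future<?> attempt = executor.submit(new Runnable() {
            public void run() {
                try {
                    handler.runAttempt();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        handler.cancel(); // the task wakes up and exits; no stuck thread
        attempt.get(1, TimeUnit.SECONDS);
        executor.shutdown();
        System.out.println("terminated: " + executor.awaitTermination(1, TimeUnit.SECONDS));
    }
}
{code}

A Semaphore would work just as well; the important part is that cancellation releases the waiting thread rather than relying on it to keep polling a flag.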

Olivier Michallat
October 22, 2014, 7:09 PM

Just pushed the fix I did yesterday for that.

Fixed

Assignee

Olivier Michallat

Reporter

Vishy Kasar

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Components

Priority

Major