Optimize session initialization when some hosts are not responsive

Description

Scenario:

  • Start a Cassandra cluster with 6 nodes, then "freeze" 5 of them with kill -STOP.

  • Create a Cluster instance that has all 6 nodes as contact points.

  • Create a session with cluster.connect().

Currently, it takes 194 seconds before the session is ready. This is far too long; granted, the control connection (and the session, when it creates the initial pools) blocks on each node, but that should take at most 5 * (connect timeout).
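The expected bound can be sketched with a toy timing model (illustrative only, not driver code; the 5-second connect timeout is assumed here as the driver default):

```java
// Hypothetical timing model for session initialization with frozen hosts.
public class InitTimeModel {
    static final long CONNECT_TIMEOUT_MS = 5_000; // assumed connect timeout

    // Worst case when unresponsive contact points are tried one after
    // another: each frozen node costs a full connect timeout.
    static long sequentialWorstCaseMs(int frozen) {
        return frozen * CONNECT_TIMEOUT_MS;
    }

    // If the attempts ran concurrently instead, all frozen nodes would
    // time out within a single connect-timeout window.
    static long parallelWorstCaseMs(int frozen) {
        return frozen == 0 ? 0 : CONNECT_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        // 5 frozen nodes should cost at most 25 s, not the 194 s observed.
        System.out.println(sequentialWorstCaseMs(5)); // 25000
        System.out.println(parallelWorstCaseMs(5));   // 5000
    }
}
```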

Environment

None

Pull Requests

None

Activity

Olivier Michallat
October 31, 2014, 9:48 AM
Edited

Would it be possible that other nodes in the cluster were up but not healthy? Nodes that are down are not really a problem, because trying to connect to them causes an immediate "connection refused" error; the problem is nodes that cause connection or query timeouts, because during initialization the driver waits synchronously on these timeouts.
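The distinction can be reproduced with a plain socket probe (an illustrative sketch, not driver internals): connecting to a closed port fails immediately with "connection refused", whereas an unresponsive host makes the attempt hang until the connect timeout fires. (Note that a SIGSTOPped node may still complete the TCP handshake in the kernel and only time out later, on the query or startup exchange.)

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ProbeHost {
    // Classifies a connection attempt: "refused", "timeout", or "connected".
    static String probe(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return "connected";
        } catch (ConnectException e) {
            return "refused";   // node down: fails fast, costs almost nothing
        } catch (SocketTimeoutException e) {
            return "timeout";   // node unresponsive: costs the full timeout
        } catch (IOException e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        // A closed port on localhost is refused immediately, which is why
        // truly-down nodes don't slow initialization.
        System.out.println(probe("127.0.0.1", 65534, 1000));
    }
}
```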

Vishy Kasar
October 31, 2014, 5:26 PM

I did not see any Java driver logs at INFO level that show the slow connections to up-but-unhealthy nodes. Are there any?

Is there any log on the Cassandra server that shows this?

session.init() waits for the connection pool to be established to one host before trying the next host. We should attempt them in parallel and wait on the combined future with a timeout (connection timeout?). That way we do not take a long time to initialize the session when there are many nodes in the cluster (say 768), some of them up but not fully healthy. Is that possible?
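The suggested approach could be sketched as follows (a minimal illustration using CompletableFuture, with a fake connect routine standing in for pool creation; names and the frozen-host convention are hypothetical):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class ParallelInit {
    // Attempt all hosts concurrently and wait at most `timeoutMs` overall,
    // instead of paying one full timeout per unresponsive host sequentially.
    static List<String> connectAll(List<String> hosts, long timeoutMs,
                                   ExecutorService pool) {
        List<CompletableFuture<String>> futures = hosts.stream()
            .map(h -> CompletableFuture.supplyAsync(() -> fakeConnect(h), pool))
            .collect(Collectors.toList());
        CompletableFuture<Void> all =
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
        try {
            all.get(timeoutMs, TimeUnit.MILLISECONDS); // combined wait
        } catch (Exception ignored) {
            // timeout or failed host; proceed with whatever finished
        }
        return futures.stream()
            .filter(f -> f.isDone() && !f.isCompletedExceptionally())
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }

    // Stand-in for a pool-creation attempt: "frozen*" hosts hang.
    static String fakeConnect(String host) {
        if (host.startsWith("frozen")) {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }
        return host;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        // Two frozen hosts cost one shared 500 ms wait, not 2 x 60 s.
        System.out.println(connectAll(
            List.of("up1", "frozen1", "frozen2", "up2"), 500, pool));
        pool.shutdownNow();
    }
}
```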

Olivier Michallat
October 31, 2014, 5:55 PM

I did not see any java driver logs at INFO level that shows the slow connections to up but unhealthy node. Is there any?

"Connection timeout" errors in your logs indicate this.

We should attempt them in parallel and wait on the combined future

Yes, that's one of the improvements that I'm testing right now.

Vishy Kasar
November 3, 2014, 5:28 PM

The connection timeouts are occurring for hosts that were powered down.

Other relevant information: the client is running on a 24-core machine, so the thread pool executor will have 24 threads. The client is also initializing multiple sessions on the same cluster of 384 nodes, with 6 nodes down. This is causing the connection attempts to back up.
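The backlog effect can be approximated with a simple model (illustrative arithmetic only; the 20-session count and 5-second cost per blocked attempt are assumptions): when connection attempts outnumber worker threads, they execute in "waves", each wave paying the full timeout of the blocked attempts ahead of it.

```java
// Hypothetical model of attempts queuing behind a fixed-size thread pool.
public class PoolBackpressure {
    // Rough lower bound when `tasks` blocking connection attempts share
    // `threads` workers and each blocked attempt costs `costMs`.
    static long backlogMs(int tasks, int threads, long costMs) {
        long waves = (tasks + threads - 1) / threads; // ceiling division
        return waves * costMs;
    }

    public static void main(String[] args) {
        // 6 down hosts x 20 sessions = 120 attempts on 24 threads at 5 s
        // each: 5 waves, so roughly 25 s of serialized waiting.
        System.out.println(backlogMs(6 * 20, 24, 5_000));
    }
}
```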

Pierre Laporte
November 5, 2014, 3:54 PM

Tested successfully against the latest 2.0.8-SNAPSHOT.

Steps:

  • Start a 6-node Cassandra cluster

  • Freeze (kill -STOP) 5 of them

  • Build a cluster with all 6 IP addresses as contact points

  • Measure the time it takes for the cluster to connect

Before fix: ~130 seconds
After fix: 35 seconds

Fixed

Assignee

Olivier Michallat

Reporter

Olivier Michallat

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None


Time remaining

0m

Components

Priority

Major