Faulty control connection causes reconnecting host to be ignored

Description

This was detected while testing JAVA-577:

The control connection is connected to host1.
host3 was down but the background reconnection attempt just succeeded. We ask the control connection to refresh host3's info.
However, host1 is also overloaded, which causes the refresh query to time out:

This is (wrongly) interpreted as not finding host3 in host1's system.peers table, which causes the host to be permanently ignored (a check that was added as part of ):
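The distinction described above can be sketched as follows. This is a minimal illustration in Python, not the driver's actual code: `RefreshOutcome`, `refresh_host_info` and the `query_peer_row` callback are hypothetical names, and the sketch assumes a timed-out query surfaces as a `TimeoutError`. The point is only that a failed query should give a third, "unknown" outcome rather than being folded into "host not found in system.peers".

```python
from enum import Enum
from typing import Callable, Optional


class RefreshOutcome(Enum):
    FOUND = "found"          # a peers row came back: host info can be updated
    NOT_FOUND = "not_found"  # query succeeded but returned no row for the host
    UNKNOWN = "unknown"      # query failed or timed out: nothing can be concluded


def refresh_host_info(
    query_peer_row: Callable[[str], Optional[dict]], address: str
) -> RefreshOutcome:
    """Classify a refresh attempt without treating a timeout as a missing row.

    query_peer_row is a hypothetical callback standing in for the lookup the
    control connection runs against its current host.
    """
    try:
        row = query_peer_row(address)
    except TimeoutError:
        # The control host was too slow to answer; the peer may still exist.
        return RefreshOutcome.UNKNOWN
    return RefreshOutcome.FOUND if row is not None else RefreshOutcome.NOT_FOUND
```

With this shape, only `NOT_FOUND` would justify ignoring the host; `UNKNOWN` would leave the host eligible for a later refresh.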

Environment

None

Pull Requests

None

Activity

Andy Tolbert
December 16, 2014, 10:09 PM

Validated against the 2.1 and 2.0 branches that this issue no longer manifests along with . To test this scenario, I executed the following:

  1. Executed a version of the test that included in against a local 3-node cluster.

  2. While the test was running, ran a script that randomly suspends 2 of the Cassandra instances by sending a kill -stop <pid> to the process, waiting 100-1000 milliseconds, and then sending a kill -cont. The process then repeats indefinitely. Suspending a Cassandra node together with another Cassandra node that happens to be the host the driver is using for the control connection helps manifest this issue.

  3. Observed the status of the hosts to ensure that none enters a 'DOWN' state with its reconnectionAttempt being non-null or being completed.
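The suspend/resume script from step 2 could look roughly like the sketch below. It is not the script used for this validation: the function names are hypothetical, and the sketch assumes the two chosen instances are suspended at the same time and that the process IDs of the local Cassandra nodes are already known. `os.kill` with `signal.SIGSTOP` and `signal.SIGCONT` is the Python equivalent of `kill -stop` and `kill -cont` on Unix-like systems.

```python
import os
import random
import signal
import time


def suspend_together(pids, min_ms=100, max_ms=1000, kill=os.kill, sleep=time.sleep):
    """Stop the given processes, wait a random interval, then resume them."""
    pause_s = random.uniform(min_ms, max_ms) / 1000.0
    try:
        for pid in pids:
            kill(pid, signal.SIGSTOP)  # same effect as `kill -stop <pid>`
        sleep(pause_s)
    finally:
        # Always resume, even if interrupted, so no node stays frozen.
        for pid in pids:
            kill(pid, signal.SIGCONT)  # same effect as `kill -cont <pid>`
    return pause_s


def run_chaos(all_pids, rounds=None, **kwargs):
    """Suspend 2 randomly chosen processes per round; rounds=None never stops."""
    done = 0
    while rounds is None or done < rounds:
        suspend_together(random.sample(all_pids, 2), **kwargs)
        done += 1
```

Calling `run_chaos` with the three node PIDs and no `rounds` argument repeats indefinitely, as in the scenario above. The `kill` and `sleep` parameters exist only so the loop can be exercised without touching real processes.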

With this scenario I can reproduce this issue typically within a minute on 2.0.8 and 2.1.3. On the current 2.1 and 2.0 branches, I cannot reproduce this after an hour of sustained runtime. I will run this scenario continuously over the next few days to ensure it does not re-manifest.

Fixed

Assignee

Andy Tolbert

Reporter

Olivier Michallat

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Priority

Major