Race condition causing deadlock when canceling ResultSetFuture

Description

Reported via: https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/-Y1srLRzKl0

Jared Kuolt on the user group observed a deadlock that occurred when canceling a ResultSetFuture while the connection associated with that future was being closed because the write operation had failed.

Looks like this may have been introduced by d0d9d6d341a016522e6e8123fa391c389e558fdd.

This was encountered on version 2.0.6, but I suspect it also affects 2.0.7 and 2.0.8 (working on a test scenario to reproduce).

We're using version 2.0.6; however, after looking around, I don't think this has been addressed yet. Please correct me if I'm wrong. (We are upgrading the drivers now, but since this is a rare race condition it will be hard to reproduce.)

We had a large spike of traffic, resulting in a number of timeouts, in which case we attempt to cancel ResultSetFuture objects after submitting Statements via `Session.executeAsync(Statement statement)`.
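For readers unfamiliar with the pattern, the following is a minimal sketch of cancel-on-timeout written against the plain `java.util.concurrent.Future` interface. It is not the reporter's code: the class `CancelOnTimeout` and the helper `getOrCancel` are hypothetical names, and a `CompletableFuture` that never completes stands in for the future returned by `Session.executeAsync`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelOnTimeout {

    // Hypothetical helper: waits up to the given timeout for the result and
    // cancels the future if the timeout elapses. Returns null after a cancel.
    static <T> T getOrCancel(Future<T> future, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // This is the call that, per the report, raced with the
            // connection being closed after a failed write.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a query that never receives a response.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        String result = getOrCancel(neverCompletes, 50, TimeUnit.MILLISECONDS);
        System.out.println("result=" + result
                + " cancelled=" + neverCompletes.isCancelled());
    }
}
```

Under a traffic spike, many such cancellations can run concurrently with connection teardown, which is the window in which the race was observed.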

The thread dump showed that two threads were awaiting locks from one another, resulting in a deadlock. https://gist.github.com/lucky/914b6546ce45bc609a0c

The first thread:

The second thread:

Each thread has what the other is synchronizing on.
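The general shape of such a lock-order inversion can be sketched in isolation. The example below is an illustration only, not the driver's code: `futureLock`, `connectionLock`, and the thread names are hypothetical stand-ins for whatever monitors the two threads in the dump were holding. It forces two daemon threads to take two monitors in opposite order, then asks the JVM to confirm the deadlock through `ThreadMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderDeadlockDemo {

    // Starts two daemon threads that acquire two monitors in opposite order
    // and returns how many of those two threads the JVM reports as deadlocked.
    static int provokeAndCountDeadlockedThreads() throws InterruptedException {
        final Object futureLock = new Object();      // hypothetical name
        final Object connectionLock = new Object();  // hypothetical name
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        Thread canceller = lockInOrder("canceller", futureLock, connectionLock, bothHoldFirst);
        Thread closer = lockInOrder("closer", connectionLock, futureLock, bothHoldFirst);
        canceller.start();
        closer.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = mx.findMonitorDeadlockedThreads(); // null if none
            if (ids != null) {
                int count = 0;
                for (long id : ids) {
                    if (id == canceller.getId() || id == closer.getId()) {
                        count++;
                    }
                }
                if (count > 0) {
                    return count;
                }
            }
            Thread.sleep(50);
        }
        return 0;
    }

    // Builds a daemon thread that takes `first`, waits until the other thread
    // holds its own first monitor, then blocks trying to take `second`.
    static Thread lockInOrder(String name, final Object first, final Object second,
                              final CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread holds `second`.
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit despite the deadlock
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + provokeAndCountDeadlockedThreads());
    }
}
```

The usual remedies are to acquire the locks in one consistent order everywhere or to avoid holding one lock while calling into code that takes the other.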

Environment

None

Pull Requests

None

Activity

Andy Tolbert
December 5, 2014, 5:28 AM

I was able to reproduce this a few times running a duration test against 2.0.6, but have not been able to against 2.0.8 after a couple of hours of trying a number of things to cause a write failure (resetting connections, turning off network interfaces, etc.). That could indicate that the issue does not manifest in 2.0.8, but not with absolute certainty. I removed 2.0.7 and 2.0.8 as affected versions until I can reproduce against them explicitly.

Olivier Michallat
December 5, 2014, 4:18 PM

I can't think of any change that would directly address this issue in 2.0.7 or 2.0.8.
Looking at the code, I just realized that terminationLock does not achieve its intended goal, so I'm going to think of a way to refactor this code. Removing the lock altogether would avoid the deadlock at the cost of possible false warnings (with a low probability), so that might be an option.

Andy Tolbert
December 12, 2014, 4:13 AM
Edited

I can reproduce this rather quickly on a 3 node cluster (0-2 minutes typically) with a targeted test scenario that injects connection resets on multiple connections simultaneously between a client connection and a particular host on both 2.0.6 and 2.1.1 driver versions. The test would also occasionally cancel queries in order to reproduce the scenario. Cannot explicitly reproduce on 2.0.8 and 2.1.3 since seems to manifest in those scenarios instead. Could not reproduce with fix on 2.0 and 2.1 branch, marking as resolved.

Fixed

Assignee

Olivier Michallat

Reporter

Andy Tolbert

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Components

Priority

Critical