Reported via: https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/-Y1srLRzKl0
Jared Kuolt on the user group observed a deadlock encountered when canceling a ResultSetFuture while the connection associated with that future was being closed because a write operation had failed.
Looks like this may have been introduced by d0d9d6d341a016522e6e8123fa391c389e558fdd.
Encountered on version 2.0.6; I suspect this also affects 2.0.7 and 2.0.8 (working on a test scenario to reproduce).
We're using version 2.0.6; after looking around, I don't think this has been addressed yet. Please correct me if I'm wrong. (We are upgrading the driver now, but since this is a rare race condition it will be hard to reproduce.)
We had a large spike of traffic that resulted in a number of timeouts; on timeout we attempt to cancel the ResultSetFuture objects obtained from `Session.executeAsync(Statement statement)`.
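For context, the client-side pattern looks roughly like the sketch below: wait on the future with a timeout, and cancel the in-flight request when the timeout fires. This is a simplified stand-in using a plain `java.util.concurrent.Future` rather than the driver's `ResultSetFuture` (which requires a live cluster); the timeout value and task body are illustrative only.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelOnTimeout {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Stand-in for session.executeAsync(statement): a task that is
        // still running when the client-side timeout fires.
        Future<String> future = pool.submit(() -> {
            Thread.sleep(10_000); // simulated slow query
            return "rows";
        });

        try {
            future.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // On timeout, cancel the in-flight request. With a ResultSetFuture
            // this is the future.cancel(true) call that raced with the
            // connection-close path in the deadlock described here.
            boolean cancelled = future.cancel(true);
            System.out.println(cancelled);
        }
        pool.shutdownNow();
    }
}
```

Under load, many such cancellations can coincide with connections being closed for write failures, which is the window in which the deadlock was observed.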
The thread dump showed that two threads were each awaiting a lock held by the other, resulting in a deadlock: https://gist.github.com/lucky/914b6546ce45bc609a0c
The first thread:
- attempts to cancel the ResultSetFuture, then
- blocks inside Netty's AbstractNioWorker while trying to obtain the channel's write lock: https://github.com/netty/netty/blob/netty-3.9.0.Final/src/main/java/org/jboss/netty/channel/socket/nio/AbstractNioWorker.java#L397
The second thread:
- is a NioWorker thread running the callback on write failure: https://github.com/datastax/java-driver/blob/2.0.6/driver-core/src/main/java/com/datastax/driver/core/Connection.java#L377
- blocks inside the driver's Connection while trying to obtain the termination lock: https://github.com/datastax/java-driver/blob/2.0.6/driver-core/src/main/java/com/datastax/driver/core/Connection.java#L437
Each thread holds the lock the other is waiting for: a classic lock-ordering deadlock.
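The lock inversion can be modeled abstractly: the cancel path holds the driver's termination lock and then needs the Netty channel's write lock, while the write-failure callback holds the channel's write lock and then needs the termination lock. Below is a minimal sketch of that ordering, not the driver's actual code: the two `synchronized` monitors are stood in for by `Semaphore`s (which, unlike monitors, are not reentrant, so a single-threaded demo can show that neither side can take its second lock).

```java
import java.util.concurrent.Semaphore;

public class LockInversionSketch {
    // Stand-ins for Connection's terminationLock and the Netty channel's
    // write lock (both are synchronized monitors in the real code).
    static final Semaphore terminationLock = new Semaphore(1);
    static final Semaphore channelWriteLock = new Semaphore(1);

    public static void main(String[] args) {
        // Thread 1 (cancel path) has entered its first monitor:
        terminationLock.acquireUninterruptibly();
        // Thread 2 (write-failure callback) has entered its first monitor:
        channelWriteLock.acquireUninterruptibly();

        // Neither thread can now take its second lock, so neither releases
        // its first one: both tryAcquire calls fail.
        boolean cancelCanProceed = channelWriteLock.tryAcquire();  // thread 1's next step
        boolean callbackCanProceed = terminationLock.tryAcquire(); // thread 2's next step

        System.out.println(cancelCanProceed);
        System.out.println(callbackCanProceed);
    }
}
```

The standard remedy is to make both paths acquire the locks in the same order, or to avoid holding one while taking the other, which is in line with the refactoring discussed below.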
I was able to reproduce this a few times running a duration test against 2.0.6, but have not been able to against 2.0.8 after a couple of hours trying a number of things to cause a write failure (resetting connections, turning off network interfaces, etc.). That suggests the issue may not manifest in 2.0.8, but it's not conclusive. I removed 2.0.7 and 2.0.8 from the affected versions until I can reproduce against them explicitly.
I can't think of any change that would directly address this issue in 2.0.7 or 2.0.8.
Looking at the code, I just realized that terminationLock does not achieve its intended goal, so I'm going to think of a way to refactor this code. Removing the lock altogether would avoid the deadlock at the cost of possible false warnings (with a low probability), so that might be an option.
I can reproduce this rather quickly (typically within 0-2 minutes) on a 3-node cluster with a targeted test scenario that simultaneously injects connection resets on multiple connections between a client and a particular host, on both the 2.0.6 and 2.1.1 driver versions. The test also occasionally cancels queries in order to reproduce the scenario. I cannot explicitly reproduce on 2.0.8 and 2.1.3, since a different failure seems to manifest in those scenarios instead. Could not reproduce with the fix on the 2.0 and 2.1 branches; marking as resolved.