Race condition on Cluster shutdown leading to a silent deadlock

Description

We reproduced it on 2.1.2.

The problematic sequence of operations:
1. Open a Cluster.
2. Create a Session.
3. Close the Session.
4. Close the Cluster.

If operations 3 and 4 are executed immediately after 2 on a very heavily loaded system, sometimes step 4 (Cluster#close()) freezes completely.

The relevant portion of jstack:

How I believe it is possible to get into that incorrect state:

1. Cluster#close() delegates to Manager#close.
2. Manager#close invokes the following sequence:

and then creates a ClusterCloseFuture which must complete the following:

That future is returned from asyncClose and waited on by close.

4. The onAdd task is processed by the executor and calls this snippet at the end:

5. updateCreatedPools runs in the context of blockingExecutor and blocks (surprise, isn't it?) by waiting on the future of a task it has submitted to the blockingExecutor.

6. Now, if shutdownNow is called on the blockingExecutor (by Cluster#close) after updateCreatedPools manages to submit a task, but before the task gets picked from the queue for execution, the task will never complete, but it will also never be cancelled. Waiting on the future of such a task will block forever. The problem here is that when we call shutdownNow, we never check the list of not-yet-started tasks that the call returns.

7. If onAdd gets blocked permanently on updateCreatedPools, it blocks the main executor from shutting down. We shut down the main executor by calling shutdown(), not shutdownNow(), so it will wait forever for that task to complete.
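
The race in steps 6 and 7 can be reproduced in isolation with plain java.util.concurrent, without the driver. The sketch below is only an illustration: the class name ShutdownNowRace and the method queuedTaskAfterShutdownNow are hypothetical, and the single-threaded pool merely stands in for the blockingExecutor. Cancelling the tasks returned by shutdownNow() is shown as one possible mitigation; it is not necessarily what the fix in 2.1.3 does.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-alone demo; not code from the driver.
class ShutdownNowRace {

    // Returns the Future of a task that was still queued (never started)
    // when shutdownNow() was called. If cancelLeftovers is true, the tasks
    // returned by shutdownNow() are cancelled so that waiters are released.
    static Future<?> queuedTaskAfterShutdownNow(boolean cancelLeftovers)
            throws InterruptedException {
        // Stand-in for the blockingExecutor: a pool with a single worker.
        ExecutorService blockingExecutor = Executors.newFixedThreadPool(1);
        CountDownLatch workerBusy = new CountDownLatch(1);

        // Occupy the only worker so that the next task stays in the queue.
        blockingExecutor.execute(() -> {
            workerBusy.countDown();
            try {
                new CountDownLatch(1).await(); // parks until shutdownNow() interrupts it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        workerBusy.await();

        // Submitted, but not yet picked from the queue.
        Future<?> queued = blockingExecutor.submit(() -> { });

        // shutdownNow() hands back the tasks that never started.
        List<Runnable> neverStarted = blockingExecutor.shutdownNow();
        if (cancelLeftovers) {
            for (Runnable r : neverStarted) {
                if (r instanceof Future) {
                    ((Future<?>) r).cancel(false);
                }
            }
        }
        return queued;
    }

    public static void main(String[] args) throws Exception {
        Future<?> ignored = queuedTaskAfterShutdownNow(false);
        System.out.println("leftovers ignored:   done=" + ignored.isDone());

        Future<?> cancelled = queuedTaskAfterShutdownNow(true);
        System.out.println("leftovers cancelled: done=" + cancelled.isDone()
                + ", cancelled=" + cancelled.isCancelled());
    }
}
```

In the first case the future is never completed and never cancelled, so a get() without a timeout would block forever, which is the state onAdd ends up in. In the second case a waiter would receive a CancellationException instead of hanging.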

Environment

None

Pull Requests

None

Activity

Piotr Kołaczkowski 
December 1, 2014 at 9:45 AM

Yeah, we confirmed your fix. I developed a fix separately, and when I tried to rebase the project onto the head of the 2.1 branch, I saw the conflicts and realised my fix was exactly the same as the fix that was already there.

Olivier Michallat 
December 1, 2014 at 8:34 AM

Yes, I ran into it while working on another issue.
Glad I pre-emptively solved this ticket

Piotr Kołaczkowski 
November 30, 2014 at 7:31 PM

Feel free to close it as a duplicate. I didn't update my 2.1 branch at first; this bug has been fixed in 2.1.3!

Fixed

Details

Created November 30, 2014 at 6:01 PM
Updated December 1, 2014 at 9:45 AM
Resolved December 1, 2014 at 8:30 AM