Deadlock in Netty I/O workers when multiple workers handle failed writes for the same host on different connections

Description

While attempting to reproduce against 2.0.8, I was able to produce a similar but different issue. When a write fails on a connection, Connection#writeHandler calls defunct on the Connection. In Connection#defunct, if the host for that connection is considered down, PooledConnection#notifyOwnerWhenDefunct is called. This triggers the HostConnectionPool to close its transport, which closes all of its connections. Closing each connection requires acquiring the write lock on that connection's channel.

The issue arises when writes fail concurrently on separate connections to the same host. In that situation each Netty worker holds the writeLock for the channel whose write failed, so when it goes on to close the other channels it cannot acquire the writeLock already held by the other worker(s), and the workers deadlock.
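The standalone sketch below is not driver code; it only illustrates the lock-ordering pattern described above, assuming each worker already holds its own channel's write lock when the pool-wide close runs. The class and method names (FakeChannel, closeAll) are hypothetical.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class WriteLockDeadlockSketch {

    // Stand-in for a driver connection's channel with a per-channel write lock.
    static class FakeChannel {
        final ReentrantLock writeLock = new ReentrantLock();
        final String name;
        FakeChannel(String name) { this.name = name; }
    }

    // Mimics the pool closing every connection to a down host:
    // closing each connection acquires that channel's write lock.
    static void closeAll(List<FakeChannel> pool) {
        for (FakeChannel c : pool) {
            c.writeLock.lock();   // blocks forever if another worker holds this lock
            try {
                System.out.println(Thread.currentThread().getName() + " closed " + c.name);
            } finally {
                c.writeLock.unlock();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeChannel c1 = new FakeChannel("channel-1");
        FakeChannel c2 = new FakeChannel("channel-2");
        List<FakeChannel> pool = Arrays.asList(c1, c2);

        // Each "worker" simulates a failed write: it holds its own channel's
        // write lock, then defuncts the connection, which closes the whole
        // pool and therefore needs the other channel's lock as well.
        Runnable worker1 = () -> {
            c1.writeLock.lock();
            sleepQuietly(100);    // give the other worker time to grab its lock
            closeAll(pool);
        };
        Runnable worker2 = () -> {
            c2.writeLock.lock();
            sleepQuietly(100);
            closeAll(pool);
        };

        Thread t1 = new Thread(worker1, "netty-worker-1");
        Thread t2 = new Thread(worker2, "netty-worker-2");
        t1.start();
        t2.start();
        t1.join(2000);
        t2.join(2000);
        System.out.println("t1 alive=" + t1.isAlive() + ", t2 alive=" + t2.isAlive()
                + " (both true => deadlocked as described)");
    }

    static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}
{code}

Running this, both worker threads remain alive after the join timeout: each re-enters its own lock (ReentrantLock allows that) but blocks on the lock held by the other, which matches the cycle seen in the heap dump described below.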

I verified via a heap dump that the two channel.writeLock objects belonged to different channels to the same destination host.

I was able to produce this by disabling a network interface belonging to a CCM node while running a stress test.

Environment

3-node CCM cluster running Cassandra 2.0.8

Pull Requests

None

Activity

Andy Tolbert
December 12, 2014, 4:03 AM

I can reproduce this rather quickly on a 3-node cluster (typically within 0-5 minutes) using a targeted test scenario that simultaneously injects connection resets on multiple connections between the client and a particular host, on both the 2.0.8 and 2.1.3 driver versions. I could not reproduce it with the fix on the 2.0 and 2.1 branches, so I'm marking this as resolved.

Fixed

Assignee

Olivier Michallat

Reporter

Andy Tolbert

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None


Components

Priority

Critical