Connection leak when connecting to a node in a different cluster

Description

We had a few production clusters that, over time, ended up with incorrect entries in the system.peers table. The entries pointed to nodes in another cluster. The Java driver connects to them, checks the cluster name, and rejects such connections. However:

1. It does not close those connections, thus leaking them.
2. It keeps retrying the connections, thus increasing the leak.

Over time, the client runs out of file descriptors and we have to restart the process.

You can recreate the issue by creating a 3-node local cluster, injecting a reachable node from another cluster into the system.peers table of one node, and applying some heavy load. A test program that reproduces the issue is attached.
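For reference, here is a minimal sketch of what such a load program could look like (the attached test program is the authoritative reproduction; the contact point, query, and class name below are placeholders). Repeatedly creating sessions forces the driver to attempt a connection pool for every known host, including the rogue system.peers entry, so each iteration can leak another rejected connection.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hedged sketch only: drives the session-creation path that triggers pool creation
// to every known host, including the rogue peer injected into system.peers.
public class ConnectionLeakRepro {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // any node of the local 3-node cluster
                .build();
        try {
            for (int i = 0; i < 10000; i++) {
                Session session = cluster.connect();
                session.execute("SELECT release_version FROM system.local");
                session.close();
            }
        } finally {
            cluster.close();
        }
    }
}

Each rejected pool attempt shows up in the client log as: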

Error creating pool to /ip:9042
com.datastax.driver.core.ClusterNameMismatchException: /ip:9042 Host /ip:9042 reports cluster name 'xyz cluster' that doesn't match our cluster name 'dev3 dev3 cluster'. This host will be ignored.
at com.datastax.driver.core.Connection.checkClusterName(Connection.java:240)
at com.datastax.driver.core.Connection.initializeTransport(Connection.java:156)
at com.datastax.driver.core.Connection.<init>(Connection.java:112)
at com.datastax.driver.core.PooledConnection.<init>(PooledConnection.java:35)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:522)
at com.datastax.driver.core.HostConnectionPool.<init>(HostConnectionPool.java:86)
at com.datastax.driver.core.SessionManager.replacePool(SessionManager.java:269)
at com.datastax.driver.core.SessionManager.access$400(SessionManager.java:39)
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:301)
at com.datastax.driver.core.SessionManager$3.call(SessionManager.java:293)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)

Environment

None

Pull Requests

None

Activity

Vishy Kasar
December 19, 2014, 11:21 PM

Here is how I introduce a reachable IP of the wrong cluster in one of the nodes:
insert into system.peers (peer,rpc_address,data_center) values ('ip1','ip1','DEV');

Andy Tolbert
December 20, 2014, 12:19 AM
Edited

Hi, thanks for providing clear steps on how to reproduce this. I was also able to reproduce it against 2.0.8:

(127.0.0.5 is a host from another cluster)

There was another issue related to leaking connections discovered recently (). Out of curiosity, I applied the fix for that issue and retried the scenario. In that case the connection reattempts still occurred, but the sockets were closed as they should be:

Also, it would be good to check whether the leaking connections apply only to the host from another cluster, or to all hosts that have had failed reconnections (as I saw in ). You can check that with 'lsof -n -p <pid>' (remove the '-n' if on OS X) or by using a profiler like YourKit.
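As an aside, the descriptor count can also be sampled from inside the test program itself. The snippet below is illustrative only (it assumes Linux and /proc, which this ticket does not mention) and gives the same signal as lsof without leaving the JVM.

import java.io.File;

// Illustrative only: count the current process's open file descriptors via /proc (Linux).
final class FdCount {
    static int openFileDescriptors() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + openFileDescriptors());
    }
}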

While the sockets are now being closed with that fix, I wonder whether the attempted reconnects should really happen at all. Also, I'm thinking the host should be cleared from the driver's cluster metadata as well. I can see the argument that perhaps the node is configured incorrectly and will eventually be updated, but that would require a node restart (I think), and then the system.peers table would be repopulated.

Olivier Michallat
January 26, 2015, 1:48 PM

The driver should not attempt to reconnect. Investigating the issue.

Olivier Michallat
January 26, 2015, 5:31 PM
Edited

This happens when the rogue peer row is already there before the driver starts.

Manager.init() marks the node UP but does not create a pool to it. The node will never be used for queries, but each time we create a session or run session.updateCreatedHosts (in onUp or onDown for another node), the driver tries to create its pool again.
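To illustrate the behavior this describes (and the pattern the fix needs), here is a standalone sketch using plain sockets rather than the driver's internals; ValidatingConnector, readClusterName, and ignoredHosts are made-up names for illustration, not driver API. The two essential points are that a cluster-name mismatch must close the socket and must mark the host so no further pool creation is attempted for it.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch, not driver code: a cluster-name mismatch must both close
// the socket and mark the host as ignored so no further attempts are made.
final class ValidatingConnector {
    private final Set<InetSocketAddress> ignoredHosts =
            Collections.newSetFromMap(new ConcurrentHashMap<InetSocketAddress, Boolean>());

    Socket connect(InetSocketAddress host, String expectedClusterName) throws IOException {
        if (ignoredHosts.contains(host))
            return null; // previously rejected: do not retry, nothing to leak

        Socket socket = new Socket();
        boolean ok = false;
        try {
            socket.connect(host);
            String remote = readClusterName(socket); // native-protocol handshake, omitted here
            if (!expectedClusterName.equals(remote)) {
                ignoredHosts.add(host); // never attempt this host's pool again
                return null;            // socket is closed in the finally block
            }
            ok = true;
            return socket;
        } finally {
            if (!ok)
                socket.close(); // release the descriptor on any failure path
        }
    }

    private String readClusterName(Socket socket) throws IOException {
        // Placeholder: the real driver queries system.local over the native protocol.
        throw new UnsupportedOperationException("handshake not implemented in this sketch");
    }
}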

Andy Tolbert
March 5, 2015, 11:26 PM

Validated that the connection is closed when connecting to a node with the wrong cluster name and that reconnections are not scheduled. Integration tests should_ignore_recommissioned_node_on_session_init() and should_ignore_node_that_does_not_support_protocol_version_on_session_init() were added to RecommissionedNodeTest.

Fixed

Assignee

Olivier Michallat

Reporter

Vishy Kasar

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Components

Priority

Major