We had a few production clusters that over time ended up with incorrect entries in the system.peers table. The entries pointed to nodes in another cluster. The Java driver connects to them, checks the cluster name, and rejects such connections. However:
1. It does not close those connections, thus leaking them
2. It keeps retrying the connections, thus compounding the leak
Over time, the client runs out of file descriptors and we have to restart the process.
You can recreate the issue by creating a three-node local cluster, injecting a reachable node from another cluster into system.peers on one node, and throwing some heavy load at it. A test program that reproduces the issue is attached.
Error creating pool to /ip:9042
com.datastax.driver.core.ClusterNameMismatchException: /ip:9042 Host /ip:9042 reports cluster name 'xyz cluster' that doesn't match our cluster name 'dev3 dev3 cluster'. This host will be ignored.
Here is how I introduce a reachable IP from the wrong cluster on one of the nodes:
insert into system.peers (peer,rpc_address,data_center) values ('ip1','ip1','DEV');
Hi, thanks for providing clear steps on how to reproduce this. I was also able to reproduce it against 2.0.8:
(127.0.0.5 is a host from another cluster)
There was another issue discovered recently that has to do with leaking connections (). Just out of curiosity, I applied the fix for that issue and retried the scenario. In this case the connection reattempts still occurred, but the sockets were closed as they should be:
Also, it would be good to check whether the leaking connections apply only to the host from another cluster, or to all hosts that have had failed reconnections (as I saw in ). You can check that by running 'lsof -n -p <pid>' (remove the 'n' if on OS X) or by using a profiler like YourKit.
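Besides running lsof externally, a rough check can also be made from inside the JVM on Linux by counting the entries under /proc/self/fd between reconnection cycles. This helper is only a sketch for diagnosis (the class and method names are hypothetical, not driver API):

```java
import java.io.File;

// Hypothetical diagnostic helper (Linux only, not part of the driver):
// counts this JVM's open file descriptors via /proc/self/fd. A count
// that grows steadily across reconnection attempts points at leaked sockets.
public class FdCount {

    static int openDescriptors() {
        File[] fds = new File("/proc/self/fd").listFiles();
        return fds == null ? -1 : fds.length; // -1 if /proc is unavailable
    }

    public static void main(String[] args) {
        int before = openDescriptors();
        // ... exercise the driver here, e.g. force a few reconnection cycles ...
        int after = openDescriptors();
        System.out.println("fds before=" + before + " after=" + after);
    }
}
```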
While the sockets are now closed with that fix, I wonder whether the reconnection attempts should happen at all. I'm also thinking the host should be cleared from the driver's cluster metadata. I can see the argument that the node may simply be misconfigured and will eventually be updated, but that would require a node restart (I think), and then the system.peers table would be repopulated anyway.
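For illustration, here is a minimal sketch of the close-on-rejection pattern the fix needs, written with plain java.net sockets rather than the driver's actual connection code (class and method names are hypothetical):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch of the close-on-rejection pattern: if cluster-name
// validation fails, the socket must be closed before surfacing the error,
// otherwise every scheduled retry leaks one more descriptor.
public class PeerValidation {

    // Returns the open socket on success, null on cluster-name mismatch.
    static Socket connectAndValidate(String host, int port,
                                     String expectedClusterName,
                                     String reportedClusterName) throws IOException {
        Socket socket = new Socket(host, port);
        boolean accepted = false;
        try {
            accepted = expectedClusterName.equals(reportedClusterName);
            return accepted ? socket : null;
        } finally {
            if (!accepted) {
                socket.close(); // reject the host AND release the descriptor
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket s = connectAndValidate("127.0.0.1", server.getLocalPort(),
                                          "dev3 cluster", "xyz cluster");
            System.out.println(s == null ? "rejected, socket closed" : "accepted");
        }
    }
}
```

The key point is that the rejection path and the error path both go through the finally block, so a mismatching peer never leaves a dangling descriptor behind for the next retry to add to.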
The driver should not attempt to reconnect. Investigating the issue.
This happens when the rogue peer row is already there before the driver starts.
Manager.init() marks the node UP but does not create a pool to it. The node will never be used for queries, but each time we create a session or run session.updateCreatedHosts (in onUp, onDown for another node), it will try to create its pool again.
Validated that the connection is closed when connecting to a node with the wrong cluster name, and that reconnections are not scheduled. Integration tests should_ignore_recommissioned_node_on_session_init() and should_ignore_node_that_does_not_support_protocol_version_on_session_init() were added to RecommissionedNodeTest.