Node list refresh behavior in 4.x is different from 3.x

Description

In Datastax4, the entire Cassandra node list is refreshed via MetadataManager#refreshNodeList. This happens in two cases: 1. when a session is initialized, and 2. when we reconnect to a node that was temporarily unavailable.

Note that a refresh does not happen if a node goes down because (for example) it was terminated ungracefully. As far as I can tell, the “node removed” topology event never comes in (tested with Cassandra 3.0.X). In that case the driver holds on to the old node and tries to reconnect periodically, leading to warn-level log messages that look like this:

2022-03-24 16:07:23.846  WARN 11077 --- [Thread-1] c.d.o.d.i.c.p.ChannelPool                : [sessionname|/1.2.3.4:1234]  Error while opening new channel (ConnectionInitException: [sessionname|connecting...] Protocol initialization request, step 1 (STARTUP {CQL_VERSION=3.0.0, DRIVER_NAME=DataStax Java driver for Apache Cassandra(R), DRIVER_VERSION=4.13.0, CLIENT_ID=16f1615a-ef11-4111-a111-c9b01112f453}): failed to send request (java.nio.channels.NotYetConnectedException))

Datastax3 was more aggressive when it came to refreshing node lists; for instance, it would refresh the list when connecting to a new node [1] [2]. This means that when a node came up (with a different IP) to replace a node that had been terminated ungracefully, the entire node list would be refreshed and the bad node would be removed.

My question here is: is it worth revisiting the behavior w.r.t. node list refresh frequency in Datastax4? If not, I can implement a workaround on our end, e.g. by implementing a NodeStateListener (see the sketch below; should be straightforward, but any advice there would be appreciated too!)
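For reference, here is a minimal sketch of what such a NodeStateListener workaround could look like with the 4.x driver. The class name is hypothetical; onDown is only the hook point, and actually forcing a node list refresh from there would have to go through the driver’s internal MetadataManager, which is an assumption on my part and not a stable public API.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.metadata.Node;
import com.datastax.oss.driver.api.core.metadata.NodeStateListener;

// Hypothetical listener: with an ungraceful termination, onDown fires but
// onRemove never does, so onDown is where a workaround would have to act,
// e.g. by alerting or by triggering a node list refresh via the driver's
// internal MetadataManager (internal, version-dependent - not shown here).
public class DownNodeListener implements NodeStateListener {

  @Override
  public void onDown(Node node) {
    System.err.println("Node reported DOWN, no removal event expected: " + node.getEndPoint());
  }

  @Override
  public void onAdd(Node node) {}

  @Override
  public void onUp(Node node) {}

  @Override
  public void onRemove(Node node) {}

  @Override
  public void close() {}
}

The listener would be registered when building the session:

CqlSession session = CqlSession.builder()
    .withNodeStateListener(new DownNodeListener())
    .build();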

Also note that I tested a node going down gracefully - the node removed topology event comes through and so we don’t run into this issue there.

Similarly to https://datastax-oss.atlassian.net/browse/JAVA-3002, I was able to reproduce this by killing a single node in a multi-node cluster.

(initially via email)

1: https://github.com/datastax/java-driver/blob/3.11.1/driver-core/src/main/java/com/datastax/driver/core/ControlConnection.java#L319

2: https://github.com/datastax/java-driver/blob/3.11.1/driver-core/src/main/java/com/datastax/driver/core/ControlConnection.java#L758

Environment

None

Pull Requests

None

Activity

Ammar Khaku
July 4, 2022 at 7:51 PM

(also while I have your attention I’d love it if I could get a pair of eyes on https://datastax-oss.atlassian.net/browse/JAVA-3010 too)

Ammar Khaku
July 4, 2022 at 7:48 PM

Thanks Alex! I only noticed your comment in https://datastax-oss.atlassian.net/browse/JAVA-3002 after I had already posted mine here - great that we landed on similar conclusions though!

Yeah, agreed that the change should go into the driver - it’s a little less efficient to always refresh on a new node addition (rather than only when we know that a node is down), but unlikely to make a difference since refreshing the node list is fairly cheap. And the one-line change is a lot less complexity, plus it’s already debounced (see the illustration below)! I put up https://github.com/datastax/java-driver/pull/1604 with your suggested change.
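For anyone unfamiliar with the debouncing mentioned above, here is a generic illustration (not the driver’s actual internal implementation) of the idea, assuming a simple trailing-edge debounce: refresh requests that arrive in quick succession collapse into a single refresh once a quiet window elapses.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Generic sketch of a trailing-edge debouncer: a burst of requestRefresh()
// calls results in a single execution of refreshTask after the last call.
public class RefreshDebouncer {

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final Runnable refreshTask;
  private final long windowMillis;
  private ScheduledFuture<?> pending;

  public RefreshDebouncer(Runnable refreshTask, long windowMillis) {
    this.refreshTask = refreshTask;
    this.windowMillis = windowMillis;
  }

  // Cancel any still-pending refresh and re-arm the timer, so only the
  // last request in a burst actually schedules work.
  public synchronized void requestRefresh() {
    if (pending != null) {
      pending.cancel(false);
    }
    pending = scheduler.schedule(refreshTask, windowMillis, TimeUnit.MILLISECONDS);
  }
}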

Alex Dutra
July 4, 2022 at 8:26 AM

I investigated this issue and 3002 as well; see my comment here: https://datastax-oss.atlassian.net/browse/JAVA-3002?focusedCommentId=54976. I agree that Cassandra does not send a REMOVE_NODE event when a node is replaced, and that 3.x is more resilient to that than 4.x. Your workaround is smart and allows existing applications to bypass the issue; however, I think we should go ahead and implement the one-line change I suggested in my comment.

Ammar Khaku
July 3, 2022 at 10:04 PM

For others following this issue: I just put up a PR with a sample implementation https://github.com/datastax/java-driver/pull/1603/files - I tested a slightly modified version of this on our infrastructure and it seems to work well.

Brett, Andreas: I should note that if we replace Cassandra nodes gracefully, the onRemoved topology event comes through properly and we don’t run into this issue. It’s when Cassandra nodes die unexpectedly that this happens. I wouldn’t be surprised if the real “root” cause is a bug in the server where it doesn’t send the onRemoved topology event in that case (even though system.peers has been updated), but I haven’t found any discussion about this on the server side. The 3.x driver didn’t have an issue with this since it refreshed the node list a lot more often; the 4.x driver is more conservative - which is generally the right thing to do, but can uncover bugs like this.

Brett: yeah I agree it’s not super common. We run tens of thousands of Cassandra nodes across hundreds of clusters on cloud infra and anecdotally we see a handful of failures a month.

Andreas Wederbrand
July 2, 2022 at 12:29 PM

I’m the author of https://datastax-oss.atlassian.net/browse/JAVA-3002 (by proxy - it was originally asked in the forums). I’m also happy that this is getting some traction, and I hope it’s the same root cause so both bugs are fixed with the same changes.

The workaround of restarting all clients is not viable for us either. We have hundreds of microservices all using Cassandra, and our DBAs would like to rotate nodes much more frequently than is currently realistic.


We know of the problem, so we’re able to work around it, but we try to stay away from any changes to the cluster that would trigger this degradation in the services.

Thanks.

Fixed

Details

Created April 8, 2022 at 7:48 PM
Updated July 13, 2022 at 7:10 PM
Resolved July 13, 2022 at 7:10 PM