Driver should handle system.peers inconsistencies better

Description

I have a 3-node cluster (Cassandra 2.0.14) and decided to add 3 new nodes. Two of them failed to join because of hardware failures (virtualized environment).

The process was automated, so what was supposed to happen was:

  • Node 4 joins

  • wait until status is UN and then 2min more

  • Node 5 joins

  • wait until status is UN and then 2min more

  • Node 6 joins

  • wait until status is UN and then 2min more
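The join-and-wait steps above could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual automation: the helper names `node_states` and `wait_until_joined`, the polling interval, and the timeout are all assumptions made for illustration. It polls the text printed by `nodetool status` and waits for the joining node's own address to show `UN` before applying the 2-minute settle delay.

```python
import re
import time

# A data row in `nodetool status` output starts with a two-letter code:
# U/D (up/down) followed by N/L/J/M (normal/leaving/joining/moving).
_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def node_states(status_output):
    """Map each address listed in `nodetool status` text to its state code."""
    states = {}
    for line in status_output.splitlines():
        match = _ROW.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def wait_until_joined(address, fetch_status, settle_seconds=120,
                      poll_seconds=10, timeout_seconds=3600,
                      sleep=time.sleep):
    """Hypothetical helper: wait for `address` itself to report UN.

    `fetch_status` is any callable returning `nodetool status` text.
    Returns True after the settle delay, or False if the timeout expires.
    """
    waited = 0
    while waited <= timeout_seconds:
        if node_states(fetch_status()).get(address) == "UN":
            sleep(settle_seconds)  # the extra 2 minutes from the steps above
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

In a real script, `fetch_status` might be something like `lambda: subprocess.check_output(["nodetool", "status"], text=True)`. Checking the joining node's own address, rather than checking that every listed node is `UN`, means a node that disappears mid-bootstrap leads to a timeout instead of being mistaken for a successful join.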

What happened:

  • Node 4 joins

  • Wait...

  • Node 5 joins

  • VM fails while node is starting.

  • VM 6 starts, no node with UN, waits 2min

  • Node 6 joins

  • VM fails while node is starting.

After this, nodetool reports 4 nodes, all UN.
While trying an application (Datastax Java Driver 2.1) the debug log reports that it tries to connect to Node 5 and 6 and fails.

Checking the system.peers table, I saw both nodes there, so I tried "nodetool removenode <ID>" with the IDs from the table.
It blew up with the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Host ID not found.
Then I decided to do the following:
DELETE from peers where ID in (ID1, ID2);
All good: the cluster is still happy and the driver is no longer complaining.

The driver should handle system.peers inconsistencies better. Maybe issue a warning?

Environment

None

Pull Requests

None

Activity

Olivier Michallat 
May 12, 2015 at 12:50 PM

The driver's only sources of information are system.peers, actually trying to connect to the nodes, and UP/DOWN notifications received from other nodes. So unfortunately I don't think it can do any better.
CASSANDRA-9180 will prevent these "phantom" rows from being created when bootstrap fails; that should solve your issue.

Former user 
May 11, 2015 at 10:44 PM

The behavior is exactly what you described.

The driver tries to connect to the nodes and fails.
I thought the driver could get more information about node status, the same way nodetool does. It could then detect a mismatch between the two sources (nodetool says the nodes do not exist, system.peers says they do) and issue a warning, or avoid connecting to those nodes.
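The mismatch described in this comment can be illustrated with a small operator-side check. This is only a sketch under stated assumptions: `phantom_peers` is a hypothetical helper, and it assumes the caller has already collected (address, host ID) pairs from system.peers and the list of addresses that nodetool reports. It does not show anything the driver itself does.

```python
def phantom_peers(peer_rows, ring_addresses):
    """Return system.peers rows whose address is absent from the ring view.

    peer_rows:      iterable of (address, host_id) pairs read from system.peers
    ring_addresses: iterable of addresses reported by nodetool
    """
    known = set(ring_addresses)
    return sorted((addr, host_id) for addr, host_id in peer_rows
                  if addr not in known)


# Example with made-up addresses: nodes 5 and 6 never finished bootstrapping.
peers = [("10.0.0.2", "id2"), ("10.0.0.4", "id4"),
         ("10.0.0.5", "id5"), ("10.0.0.6", "id6")]
ring = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
for addr, host_id in phantom_peers(peers, ring):
    print("in system.peers but not in the ring:", addr, host_id)
```

A check like this has to run where nodetool output is available, that is, next to the cluster rather than inside a client application.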

Alex Popescu 
May 11, 2015 at 10:39 PM

Can you please provide more details about what behavior you are seeing?

From the original description, I understand that the driver attempts to connect to these nodes and fails. The client driver cannot determine the real status of those nodes (e.g. maybe it is the client that is on the wrong side of a network split and the nodes are healthy).

Former user 
May 11, 2015 at 10:22 PM

I understand that the driver tries to connect to the nodes in system.peers. I was advised to file this bug, similar to the one filed against the C# driver.

Mailing list thread: http://qnalist.com/questions/6031649/nodes-failed-to-bootstrap-no-nodetool-info-but-system-peer-populated

The question is whether the driver should handle more gracefully the case where nodes present in system.peers do not actually exist in Cassandra (CASSANDRA-9180).
Otherwise, mark it as invalid.

Andy Tolbert 
May 11, 2015 at 10:09 PM

Just noticed this:

"While trying an application (Datastax Java Driver 2.1) the debug log reports that it tries to connect to Node 5 and 6 and fails."

What exceptions are you seeing? It may be perfectly natural for the driver to try to connect to the nodes if they are present in system.peers.

Not a Problem

Details

Created May 11, 2015 at 7:16 PM
Updated June 21, 2020 at 5:36 PM
Resolved February 16, 2016 at 10:14 PM