Client not able to add/use C* nodes that have changed from private IP to public IP (Ec2MultiRegionSnitch + Translator)

Description

We are switching from Ec2Snitch to Ec2MultiRegionSnitch in our C* 2.1.5 cluster. To do that, we follow these steps:

One node at a time:
1 - Stop C*
a - Update listen_address to private IP
b - Update broadcast_address to public IP
c - Update broadcast_rpc_address to public IP
d - Switch to Ec2MultiRegionSnitch
2 - Start C*

Everything works fine on the server side: nodetool status shows the restarted C* node with its public IP, and the node appears in the gossip info (nodetool gossipinfo). However, when we check which C* nodes our clients are connected to (netstat -anp | grep 9042), we only see the IP of the restarted C* node in the TIME_WAIT state, and then it disappears from the netstat output. In other words, the client is not able to establish a connection with the restarted C* node. If we restart the client, it connects to ALL the nodes, but we do not want to restart the client.

When we perform this process on one C* node, the client logs the following error:

-----------------

I think I know what is happening:

When we switch one C* node from Ec2Snitch to Ec2MultiRegionSnitch, Cassandra sends a TOPOLOGY_CHANGE event of type NEW_NODE to the client, and the client then translates the received public IP to the private IP.
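To make the translation step concrete, here is a minimal, driver-independent sketch. The class name StaticAddressTranslator, the static map, and the example IPs are all hypothetical; this is not the driver's translator, only an illustration that the event carries the public IP while the client ends up working with the private one.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an address translator: maps the public IP
// announced by a node to the private IP the client connects to.
final class StaticAddressTranslator {
    private final Map<String, String> publicToPrivate = new HashMap<>();

    void register(String publicIp, String privateIp) {
        publicToPrivate.put(publicIp, privateIp);
    }

    // Returns the private address when a mapping exists; otherwise the
    // announced address is returned unchanged.
    InetSocketAddress translate(InetSocketAddress announced) {
        String ip = announced.getAddress().getHostAddress();
        String mapped = publicToPrivate.get(ip);
        if (mapped == null) {
            return announced;
        }
        return new InetSocketAddress(mapped, announced.getPort());
    }
}
```

With a mapping from 203.0.113.10 to 10.0.0.5 registered, translating 203.0.113.10:9042 yields 10.0.0.5:9042, which is the address the client then uses for its host lookup.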

Then it calls refreshNodeInfo(Host) with the private IP.

refreshNodeInfo(newHost) decides this node is not a new one because there is still a Connection object using that node:

and tries to get its updated info from the system.peers table (fetchNodeInfo()), but it does not find the node info there because the system.peers table now contains public IPs rather than private ones.

In the end it returns false, and the driver never adds this "new" Cassandra node to its list of C* nodes.
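The sequence I suspect can be modelled in a few lines. This is a simplified sketch of the behaviour as I understand it, not the driver's code: the class PeerLookupModel, its fields, and the example IPs are hypothetical, and the branch for a genuinely new host is an assumption added only to make the model complete.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of the suspected behaviour; not the driver's implementation.
final class PeerLookupModel {
    // Addresses for which the client still holds a Connection (private IPs).
    final Set<String> connectedHosts = new HashSet<>();
    // Stand-in for system.peers, keyed by peer address (public IPs after the switch).
    final Map<String, String> systemPeers = new HashMap<>();

    boolean refreshNodeInfo(String translatedIp) {
        if (!connectedHosts.contains(translatedIp)) {
            // Assumption: a host with no existing connection is handled as new.
            return true;
        }
        // Host is considered known, so its row is looked up by the translated
        // (private) address, which system.peers no longer contains.
        return systemPeers.containsKey(translatedIp);
    }
}
```

With a connection still open to 10.0.0.5 and system.peers holding only 203.0.113.10, refreshNodeInfo("10.0.0.5") returns false in this model, which matches what we observe.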

For this to work, you must restart your clients, and that is something we (maybe everyone?) try to avoid. Maybe the client should check both the translated and the untranslated IP, or maybe close the connection?
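As a rough illustration of the first suggestion (checking both addresses), here is a minimal sketch. The class DualAddressPeerLookup, its methods, and the in-memory map standing in for system.peers are hypothetical; this is only one possible shape for the idea, not a proposed patch against the driver.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup that tries the translated address first and then
// falls back to the untranslated (announced) address.
final class DualAddressPeerLookup {
    // Stand-in for system.peers: peer address -> some row value (here, a data center name).
    private final Map<String, String> systemPeers = new HashMap<>();

    void addPeer(String peerIp, String dataCenter) {
        systemPeers.put(peerIp, dataCenter);
    }

    Optional<String> findPeer(String translatedIp, String announcedIp) {
        String row = systemPeers.get(translatedIp);
        if (row == null) {
            // The translated (private) IP is not in the table; try the public one.
            row = systemPeers.get(announcedIp);
        }
        return Optional.ofNullable(row);
    }
}
```

In this sketch, a peer stored under its public IP is still found when the client asks with the private IP, as long as the announced public IP is passed along as well.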

Thanks!

Environment

None

Pull Requests

None

Assignee

Unassigned

Reporter

Mario Lazaro

Labels

None

PM Priority

None

Reproduced in

2.0.9.2
2.1.5

Fix versions

None

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Priority

Major