We are switching from Ec2Snitch to Ec2MultiRegionSnitch in our Cassandra 2.1.5 cluster. To do that, we follow these steps:
One node at a time
1 - Stop C*
  a - Update listen_address to the private IP
  b - Update broadcast_address to the public IP
  c - Update broadcast_rpc_address to the public IP
  d - Update endpoint_snitch to Ec2MultiRegionSnitch
2 - Start C*
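For reference, the per-node edits in step 1 look roughly like this in cassandra.yaml (the IPs below are placeholders, substitute the node's real EC2 addresses):

```yaml
# cassandra.yaml - example values only
listen_address: 10.0.0.1          # the node's private EC2 IP
broadcast_address: 54.0.0.1       # the node's public EC2 IP
broadcast_rpc_address: 54.0.0.1   # the node's public EC2 IP
endpoint_snitch: Ec2MultiRegionSnitch
```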
Everything looks fine at first: nodetool status shows the restarted node with its public IP, and the node appears in the gossip info (nodetool gossipinfo). BUT when we check which C* nodes our clients are connected to (netstat -anp | grep 9042), we only see the restarted node's IP in a TIME_WAIT state, and then it disappears from the netstat output entirely. So the client is unable to establish a connection to the restarted C* node. If we restart the client, it connects to ALL the nodes again, but we do not want to restart the client.
When we run this process on one C* node, the client logs the following error:
I think I know what is happening:
When we switch one C* node from Ec2Snitch to Ec2MultiRegionSnitch, Cassandra sends a TOPOLOGY_CHANGE (NEW_NODE) event to the client, and the client translates the received public IP to a private IP.
Then it calls refreshNodeInfo(Host) with the private IP.
refreshNodeInfo(newHost) decides this node is not a new one, because there is still a Connection object referencing it, and tries to get its updated info from the system.peers table (fetchNodeInfo()). But it does not find the node there, because system.peers now contains public IPs, not private ones.
In the end it returns false, and the driver never adds this "new" Cassandra node to its list of C* hosts.
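The mismatch can be illustrated with a small toy model (the IPs and method names below are made up for illustration; this is not the actual driver code):

```java
import java.util.HashMap;
import java.util.Map;

public class SnitchLookupDemo {
    // Toy stand-in for system.peers AFTER the snitch switch:
    // rows are now keyed by PUBLIC broadcast addresses.
    static final Map<String, String> SYSTEM_PEERS = new HashMap<>();
    static {
        SYSTEM_PEERS.put("54.0.0.1", "dc1/rack1");
    }

    // Toy public -> private translation, as the client does on the
    // NEW_NODE event before looking the host up.
    static String translateToPrivate(String publicIp) {
        return publicIp.equals("54.0.0.1") ? "10.0.0.1" : publicIp;
    }

    // Toy fetchNodeInfo(): succeeds only if the address is a key in system.peers.
    static boolean fetchNodeInfo(String hostAddress) {
        return SYSTEM_PEERS.containsKey(hostAddress);
    }

    public static void main(String[] args) {
        String eventIp = "54.0.0.1";                   // NEW_NODE event carries the public IP
        String lookupIp = translateToPrivate(eventIp); // client translates it first
        System.out.println(fetchNodeInfo(lookupIp));   // false: private IP is not in system.peers
        System.out.println(fetchNodeInfo(eventIp));    // true: the untranslated IP would match
    }
}
```

The lookup fails precisely because the key used (the translated private IP) and the key stored (the public IP) are no longer the same address.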
If you want it to work, you must restart your clients, and that is something we (maybe everyone?) try to avoid. Maybe the client should check both the translated and the untranslated IP, or maybe it should close the stale connection?
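A minimal, self-contained sketch of the "check both the translated and the untranslated IP" idea (the peer table, IPs, and method name are hypothetical; this is a suggestion, not the driver's actual behavior):

```java
import java.util.Map;

public class FallbackLookupDemo {
    // Toy stand-in for system.peers after the switch, keyed by PUBLIC IPs.
    static final Map<String, String> SYSTEM_PEERS = Map.of("54.0.0.1", "dc1/rack1");

    // Hypothetical fallback lookup: try the translated (private) address first,
    // then fall back to the untranslated (public) address from the event.
    static boolean fetchNodeInfoWithFallback(String translatedIp, String eventIp) {
        return SYSTEM_PEERS.containsKey(translatedIp)
            || SYSTEM_PEERS.containsKey(eventIp);
    }

    public static void main(String[] args) {
        // The translated-only lookup would fail, but the fallback
        // to the event's untranslated IP succeeds.
        System.out.println(fetchNodeInfoWithFallback("10.0.0.1", "54.0.0.1")); // true
    }
}
```

With such a fallback, the "new" host would be found in system.peers under its public address and could be added to the driver's host list without a client restart.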