Affects Version/s: 2.1.4
Fix Version/s: None
Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
Both jdk1.7.0_45 and jdk1.8.0_25
When a new node is started with auto_bootstrap=false, all clients using the Java CQL driver die with NoHostAvailableException.
We haven't been able to reproduce this problem anywhere but in production (2.0.11), but we have now seen it twice there.
Every client using the driver dies the second you start the new node (with auto_bootstrap=false).
This happens before the new node is actually up.
Restarting the clients worked, but we suffered downtime.
We thought this was fixed in 2.1.4. We saw it first in 2.1.3, and we've had ongoing NoHostAvailableException problems since 2.0.x. We intentionally upgraded everything to 2.1.4 before today's attempt at joining the node again.
It's as if the whole pool is replaced with only the new node (which isn't up yet).
– cassandra07 in DC1 was started with auto_bootstrap=true.
– streaming is finished. compaction starts.
– compactions are finished.
Still nothing happening.
Closer investigation shows that rebuilding two of the four secondary indexes failed with tombstone overwhelm. We've entered a separate issue for this at https://issues.apache.org/jira/browse/CASSANDRA-8798
To get past this we had to raise org.apache.cassandra.db:type=StorageService.TombstoneFailureThreshold and manually rebuild the index.
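For reference, the JMX change described above could be sketched as follows. This is a minimal illustration, not the exact procedure we ran: the class name RaiseTombstoneThreshold and its helper methods are hypothetical, the host, port and new value are passed as arguments, and it assumes JMX is reachable without authentication (7199 is Cassandra's default JMX port). Only the MBean and attribute names are taken from this report.

```java
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: raises the tombstone failure threshold on one node via JMX.
public class RaiseTombstoneThreshold {

    // MBean and attribute names as given in this report.
    public static final String STORAGE_SERVICE = "org.apache.cassandra.db:type=StorageService";
    public static final String ATTRIBUTE = "TombstoneFailureThreshold";

    // Builds the standard RMI-based JMX service URL for a host and port.
    public static JMXServiceURL jmxUrl(String host, int port) throws Exception {
        return new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    // Reads the current threshold, sets the new one, and returns the previous value.
    public static Object raise(MBeanServerConnection conn, int newThreshold) throws Exception {
        ObjectName name = new ObjectName(STORAGE_SERVICE);
        Object previous = conn.getAttribute(name, ATTRIBUTE);
        conn.setAttribute(name, new Attribute(ATTRIBUTE, newThreshold));
        return previous;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: RaiseTombstoneThreshold <host> <jmx-port> <new-threshold>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        int newThreshold = Integer.parseInt(args[2]);

        // JMXConnector is Closeable, so the connection is closed when the block exits.
        try (JMXConnector connector = JMXConnectorFactory.connect(jmxUrl(host, port))) {
            Object previous = raise(connector.getMBeanServerConnection(), newThreshold);
            System.out.println(ATTRIBUTE + " changed from " + previous + " to " + newThreshold);
        }
    }
}
```

A value set this way applies only to the running process. To keep it across restarts, the corresponding tombstone_failure_threshold setting in cassandra.yaml would need to be changed as well.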
– restart node with auto_bootstrap=false
Clients immediately (13:56:37) throw NoHostAvailableException, before the node has finished starting up. All other clients, which use Hector, work OK.
Restarting the clients fixes the problem, so we restart all CQL driver clients as quickly as possible.
Errors on the clients appeared in different ways depending on their application code.
In a client that has otherwise been behaving very well we got
In a client with which we have previously had plenty of trouble with NoHostAvailableExceptions (due to a low connection timeout setting¹) we got
(notice only one host was in the pool) and subsequently
(If it weren't for the other clients also dying, this one would make me wonder about JAVA-663, which is marked Resolved.)
Our clients are configured with
.withLoadBalancingPolicy(new LatencyAwarePolicy.Builder(new RoundRobinPolicy()).build())
¹ the troublesome client has