Project: DataStax Java Driver for Apache Cassandra
Issue: JAVA-665

NoHostsAvailableException when new node finishes bootstrap

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.1.4
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None
    • Environment:

      Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux

      Both jdk1.7.0_45 and jdk1.8.0_25

      C* 2.0.11

    • Sprint: Java P2.1.x

      Description

      When a new node is started with auto_bootstrap=false, all clients using the Java CQL driver die with NoHostAvailableException.

      We haven't been able to reproduce this problem anywhere but in our production cluster (C* 2.0.11), where we have now seen it twice.
      Every client using the driver dies the second the new node is started (with auto_bootstrap=false).

      This happens before the new node is actually up.
      Restarting the clients worked, but we suffered downtime.

      We thought this was fixed in 2.1.4. We saw it first in 2.1.3, and we've had ongoing NoHostAvailableException problems since 2.0.x. We intentionally upgraded everything to 2.1.4 before today's attempt at joining the node again.

      It's as if the whole pool is replaced with only the new node (which isn't up yet).

      History

      Wednesday 11th ~1.17pm – cassandra07 in DC1 was started with auto_bootstrap=true.

      Thursday 12th ~12.05pm – streaming is finished. compaction starts.

      Friday 13th ~7.30am – compactions are finished.

      Friday 13th ~11.26am – still nothing happening.

      Closer investigation showed that rebuilding two of the four secondary indexes had failed because of tombstone overwhelm. We filed a separate issue for this at https://issues.apache.org/jira/browse/CASSANDRA-8798

      To get past this we had to raise org.apache.cassandra.db:type=StorageService.TombstoneFailureThreshold and manually rebuild the index.
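
The threshold change mentioned above could be made over JMX roughly as follows. This is a minimal sketch under stated assumptions, not the procedure actually used: the MBean and attribute names are taken from this report, while the class name, the helper methods, the port 7199 (Cassandra's usual default JMX port), the placeholder value, and the assumption that the attribute is integer-valued are illustrative only.

```java
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RaiseTombstoneThreshold {
    // MBean and attribute names as quoted in the report.
    static final String MBEAN = "org.apache.cassandra.db:type=StorageService";
    static final String ATTRIBUTE = "TombstoneFailureThreshold";

    /** Builds the standard RMI-based JMX service URL for a host and port. */
    static JMXServiceURL serviceUrl(String host, int port) throws Exception {
        return new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    /**
     * Sets the threshold and returns the previous value.
     * Assumption: the attribute is readable, writable, and integer-valued.
     */
    static int raise(MBeanServerConnection mbs, int newValue) throws Exception {
        ObjectName name = new ObjectName(MBEAN);
        int previous = ((Number) mbs.getAttribute(name, ATTRIBUTE)).intValue();
        mbs.setAttribute(name, new Attribute(ATTRIBUTE, newValue));
        return previous;
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7199;
        // Placeholder value; the report does not say what the threshold was raised to.
        int newValue = args.length > 2 ? Integer.parseInt(args[2]) : 1_000_000;

        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
            int previous = raise(connector.getMBeanServerConnection(), newValue);
            System.out.println(ATTRIBUTE + ": " + previous + " -> " + newValue);
        }
    }
}
```

A change made this way would presumably apply only to the running process, so it would have to be reapplied after a restart unless the corresponding setting is also changed in the node's configuration.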

      2015-02-13 13:56:35 – restart node with auto_bootstrap=false
      Clients immediately (13:56:37) throw NoHostAvailableException, before the node has finished starting up. All other clients, which use Hector, work fine.

      Restarting the clients fixes the problem, so we restart all CQL driver clients as quickly as possible.

      Errors on the clients appeared in different ways depending on their application code.
      Sometimes we saw:

      Timeout while setting keyspace on connection to cassandra06.finn.no/152.90.241.35:7615. This should not happen but is not critical (it will retried)

      In a client that has otherwise been behaving very well we got:

      com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra04.finn.no/152.90.241.27:7615 (com.datastax.driver.core.TransportException: [cassandra04.finn.no/152.90.241.27:7615] Connection has been closed), cassandra05.finn.no/152.90.241.30:7615 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response), cassandra06.finn.no/152.90.241.35:7615 (com.datastax.driver.core.TransportException: [cassandra06.finn.no/152.90.241.35:7615] Connection has been closed), cassandra03.finn.no/152.90.241.24:7615, cassandra02.finn.no/152.90.241.23:7615 [only showing errors of first 3 hosts, use getErrors() for more details])
              at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
              at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
              at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
              at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
              at no.finntech.ipgeoimport.dao.IPCounterDao.increment(IPCounterDao.java:291)
              at no.finntech.ipgeoimport.service.IPCounterServiceImpl.incrIpCounter(IPCounterServiceImpl.java:74)
              at no.finntech.ipgeoimport.jobs.KafkaEventConsumer$Callback.consume(KafkaEventConsumer.java:85)
              at no.finntech.ipgeoimport.jobs.KafkaEventConsumer$Callback.consume(KafkaEventConsumer.java:59)
              at no.finntech.commons.messaging.KafkaMessageConsumer$PartitionConsumer.run(KafkaMessageConsumer.java:367)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra04.finn.no/152.90.241.27:7615 (com.datastax.driver.core.TransportException: [cassandra04.finn.no/152.90.241.27:7615] Connection has been closed), cassandra05.finn.no/152.90.241.30:7615 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response), cassandra06.finn.no/152.90.241.35:7615 (com.datastax.driver.core.TransportException: [cassandra06.finn.no/152.90.241.35:7615] Connection has been closed), cassandra03.finn.no/152.90.241.24:7615, cassandra02.finn.no/152.90.241.23:7615 [only showing errors of first 3 hosts, use getErrors() for more details])
              at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
              at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
              ... 3 more
      

      In a client that has previously given us plenty of trouble with NoHostAvailableException (due to a low connection timeout setting¹), we got:

      com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra06.finn.no/152.90.241.35:7615 (com.datastax.driver.core.TransportException: [cassandra06.finn.no/152.90.241.35:7615] Connection has been closed))
              at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
              at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
              at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:237)
              at no.finntech.search.management.LastSearchesDbService.getLastSearchesFromCassandra(LastSearchesDbService.java:88)
              at no.finntech.search.management.LastSearchesDbService.getLastSearches(LastSearchesDbService.java:75)
              at no.finntech.search.management.LastSearchesDbService.getLastSearches(LastSearchesDbService.java:71)
              at no.finntech.search.management.SearchManagementServiceHandler.getSearches(SearchManagementServiceHandler.java:41)
              at no.finntech.search.management.SearchManagementService$Processor$getSearches.getResult(SearchManagementService.java:905)
              at no.finntech.search.management.SearchManagementService$Processor$getSearches.getResult(SearchManagementService.java:889)
              at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
              at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
              at no.finntech.greenpages.thrift.server.PacketInspectingMultiplexedProcessor.processWithStatsReporting(PacketInspectingMultiplexedProcessor.java:90)
              at no.finntech.greenpages.thrift.server.PacketInspectingMultiplexedProcessor.process(PacketInspectingMultiplexedProcessor.java:71)
              at no.finntech.greenpages.thrift.server.ZipkinTracedProcessor.process(ZipkinTracedProcessor.java:54)
              at no.finntech.greenpages.thrift.server.TMultiplexThreadPoolServer$WorkerProcess.run(TMultiplexThreadPoolServer.java:248)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra06.finn.no/152.90.241.35:7615 (com.datastax.driver.core.TransportException: [cassandra06.finn.no/152.90.241.35:7615] Connection has been closed))
              at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
              at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
              ... 3 more

      (notice that only one host was in the pool) and subsequently:

      com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra06.finn.no/152.90.241.35:7615 (com.datastax.driver.core.TransportException: [cassandra06.finn.no/152.90.241.35:7615] Connection has been closed))
              at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
              at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
              at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:237)
              at no.finntech.search.management.LastSearchesDbService.getLastSearchesFromCassandra(LastSearchesDbService.java:88)
              at no.finntech.search.management.LastSearchesDbService.getLastSearches(LastSearchesDbService.java:75)
              at no.finntech.search.management.LastSearchesDbService.getLastSearches(LastSearchesDbService.java:71)
              at no.finntech.search.management.SearchManagementServiceHandler.getSearches(SearchManagementServiceHandler.java:41)
              at no.finntech.search.management.SearchManagementService$Processor$getSearches.getResult(SearchManagementService.java:905)
              at no.finntech.search.management.SearchManagementService$Processor$getSearches.getResult(SearchManagementService.java:889)
              at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
              at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
              at no.finntech.greenpages.thrift.server.PacketInspectingMultiplexedProcessor.processWithStatsReporting(PacketInspectingMultiplexedProcessor.java:90)
              at no.finntech.greenpages.thrift.server.PacketInspectingMultiplexedProcessor.process(PacketInspectingMultiplexedProcessor.java:71)
              at no.finntech.greenpages.thrift.server.ZipkinTracedProcessor.process(ZipkinTracedProcessor.java:54)
              at no.finntech.greenpages.thrift.server.TMultiplexThreadPoolServer$WorkerProcess.run(TMultiplexThreadPoolServer.java:248)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra06.finn.no/152.90.241.35:7615 (com.datastax.driver.core.TransportException: [cassandra06.finn.no/152.90.241.35:7615] Connection has been closed))
              at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
              at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
              ... 3 more
      

      (If it weren't for the other clients also dying, this one would make me wonder about JAVA-663.)

      Our clients are configured with:
      .withLoadBalancingPolicy(new LatencyAwarePolicy.Builder(new RoundRobinPolicy()).build())
      .withSocketOptions(new SocketOptions().setConnectTimeoutMillis(2000))

      ¹ The troublesome client has:
      .withSocketOptions(new SocketOptions().setConnectTimeoutMillis(50).setReadTimeoutMillis(500))

              People

              • Assignee: Andrew Tolbert (andrew.tolbert)
              • Reporter: mck
              • Votes: 0
              • Watchers: 6
