NoHostAvailable when all hosts are up and connectable

Description

More discussion in this mailing list thread:
https://www.mail-archive.com/user@cassandra.apache.org/msg54572.html

Host connections in the pool are being marked down and removed from the pool without being added back correctly. This causes periodic NoHostAvailable errors even though all hosts are up and connectable.

Using uWSGI with the recommended postfork connection setup:
https://datastax.github.io/python-driver/cqlengine/third_party.html#uwsgi

I'm seeing a Heartbeat failed error right before the NoHostAvailable error.

Resolution
SSL WANT_WRITE errors are now handled properly by the reactors to avoid this issue.

Environment

Ubuntu Linux, Python 3.4, uWSGI

Pull Requests

None

Activity

Jim Witschey
June 6, 2018, 12:50 AM

Have you had a look at 's comment? I wonder if your issues are related to similar circumstances:

  • What reactor are you using when you see this issue? Are you able to try GeventConnection, and if you can, does it address the issue?

  • Are you using SSL?

  • How big are the requests you're making when the error happens?

Sorry if I missed any of these reading through your thread on the C* ML.

Jim Witschey
June 6, 2018, 12:58 AM

You don't happen to have a script that reproduces this, do you? We'll work on reproducing on our end, but if you have a test case, that could help us isolate the issue you saw, which is hopefully the same as Alan's.

Mathieu
June 6, 2018, 2:35 AM

Hi, here is a little script to reproduce the issue.

As is, it will trigger a NoHostAvailable.
Comment out ssl_options in cluster_kwargs and it works without SSL.
Uncomment gevent in the imports and connection_class in cluster_kwargs and it works with SSL.

CoinMafia
September 14, 2018, 7:08 AM

Thank you for pursuing this. I am experiencing the same thing. Intermittent connection shut-down. The NGINX log produces this when service is interrupted:

Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "/home/ubuntu/myprojectenv/lib/python3.6/site-packages/cassandra/cqlengine/query.py", line 404, in _execute
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: result = _execute_statement(self.model, statement, self._consistency, self._timeout, connection=connection)
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "/home/ubuntu/myprojectenv/lib/python3.6/site-packages/cassandra/cqlengine/query.py", line 1531, in _execute_statement
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: return conn.execute(s, params, timeout=timeout, connection=connection)
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "/home/ubuntu/myprojectenv/lib/python3.6/site-packages/cassandra/cqlengine/connection.py", line 340, in execute
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: result = conn.session.execute(query, params, timeout=timeout)
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 162.243.42.139 datacenter1>: ConnectionException('Host has been marked down or removed',)})
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: [pid: 1596|app: 0|req: 2586/12460] 141.8.143.121 () {40 vars in 605 bytes} [Thu Sep 13 19:50:13 2018] GET /subsector/entertainment-human-resources => generated 291 bytes in 4 msecs (HTTP/1.1 500) 2 h

If I can help or provide any additional info, please let me know.

Alan Boudreault
January 22, 2019, 3:16 AM

Thanks all for the details and for the test case!

After some investigation, those errors come from the SSL sockets raising WANT_WRITE errors. This is a relatively well-known condition and it is not a fatal error: it just means that the socket is not able to write the chunk at the moment. So, the PR adds a fix for the asyncore and Libev reactors to better handle this error. Gevent, Eventlet, and Twisted are not affected.
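To illustrate the general idea of treating WANT_WRITE as a non-fatal condition, here is a minimal sketch using only the Python standard library. It is not the driver's actual fix: `send_pending` and `FlakySocket` are hypothetical names made up for this example, and the fake socket only stands in for a non-blocking SSL socket so the behaviour can be shown without a cluster.

```python
import ssl


def send_pending(sock, pending):
    """Try to write `pending` bytes to a non-blocking socket.

    Returns the bytes that still need to be sent. A WANT_WRITE or
    WANT_READ condition is treated as "try again later", not as a
    reason to mark the connection as defunct.
    """
    while pending:
        try:
            sent = sock.send(pending)
        except (ssl.SSLWantWriteError, ssl.SSLWantReadError, BlockingIOError):
            # The socket cannot accept the chunk right now; keep the
            # remaining bytes so the event loop can retry later.
            return pending
        pending = pending[sent:]
    return pending


class FlakySocket:
    """Hypothetical stand-in for an SSL socket (illustration only)."""

    def __init__(self, fail_times):
        self.fail_times = fail_times
        self.written = b""

    def send(self, data):
        if self.fail_times > 0:
            self.fail_times -= 1
            raise ssl.SSLWantWriteError("write would block")
        chunk = data[:4]  # accept at most 4 bytes per call
        self.written += chunk
        return len(chunk)


sock = FlakySocket(fail_times=1)
leftover = send_pending(sock, b"heartbeat")  # first attempt hits WANT_WRITE
leftover = send_pending(sock, leftover)      # retry writes the whole payload
```

The point of the sketch is only that the unsent bytes are kept and retried instead of the error propagating and the host being marked down.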

Fixed

Assignee

Unassigned

Reporter

Alan Hamlett

Fix versions

Labels

None

Reproduced in

3.13.0

PM Priority

C

External issue ID

None

Doc Impact

None

Reviewer

None

Size

None

Pull Request

None

Components

Sprint

Py P-DEF

Affects versions

Priority

Major