More discussion in this mailing list thread:
https://www.mail-archive.com/user@cassandra.apache.org/msg54572.html
Host connections in the pool are being marked down and removed from the pool, without being added back to the pool correctly. This causes periodic NoHostAvailable errors, but all hosts are up and connectable.
Using uWSGI with the recommended postfork connection setup:
https://datastax.github.io/python-driver/cqlengine/third_party.html#uwsgi
I'm seeing a Heartbeat failed error right before the NoHostAvailable error:
Resolution
SSL WANT_WRITE errors are now handled properly by the reactors to avoid this issue.
Ubuntu Linux, Python 3.4, uWSGI
Have you had a look at 's comment? I wonder if your issues are related to similar circumstances:
What reactor are you using when you see this issue? Are you able to try GeventConnection, and if you can, does it address the issue?
Are you using SSL?
How big are the requests you're making when the error happens?
Sorry if I missed any of these reading through your thread on the C* ML.
You don't happen to have a script that reproduces this, do you? We'll work on reproducing on our end, but if you have a test case, that could help us isolate the issue you saw, which is hopefully the same as Alan's.
Hi, here is a little script to reproduce the issue.
As is, it triggers a NoHostAvailable error.
Comment out ssl_options in cluster_kwargs and it works without SSL.
Uncomment the gevent import and connection_class in cluster_kwargs and it works with SSL.
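The script itself is not reproduced in this thread. As a rough sketch of the toggles described above: only ssl_options, connection_class, cluster_kwargs and GeventConnection come from the discussion. The helper name build_cluster_kwargs, the contact point and the CA path are hypothetical placeholders, and the real script may be structured differently.

```python
import ssl


def build_cluster_kwargs(use_ssl=True, use_gevent=False,
                         ca_certs="/path/to/ca.pem"):
    """Assemble keyword arguments intended for cassandra.cluster.Cluster.

    Hypothetical helper: the contact point and CA path are placeholders.
    """
    kwargs = {"contact_points": ["127.0.0.1"]}
    if use_ssl:
        # Leaving this key out is the "works without SSL" case.
        kwargs["ssl_options"] = {
            "ca_certs": ca_certs,
            "cert_reqs": ssl.CERT_REQUIRED,
        }
    if use_gevent:
        # Imported lazily so the default path does not need gevent.
        # Assumes the caller has already applied gevent monkey-patching.
        from cassandra.io.geventreactor import GeventConnection
        kwargs["connection_class"] = GeventConnection
    return kwargs
```

The resulting dict would be passed as Cluster(**build_cluster_kwargs(...)). With use_ssl=True and use_gevent=False it corresponds to the failing combination reported here: SSL on the default reactor.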
Thank you for pursuing this. I am experiencing the same thing: intermittent connection shutdowns. The NGINX log shows this when service is interrupted:
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "/home/ubuntu/myprojectenv/lib/python3.6/site-packages/cassandra/cqlengine/query.py", line 404, in _execute
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: result = _execute_statement(self.model, statement, self._consistency, self._timeout, connection=connection)
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "/home/ubuntu/myprojectenv/lib/python3.6/site-packages/cassandra/cqlengine/query.py", line 1531, in _execute_statement
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: return conn.execute(s, params, timeout=timeout, connection=connection)
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "/home/ubuntu/myprojectenv/lib/python3.6/site-packages/cassandra/cqlengine/connection.py", line 340, in execute
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: result = conn.session.execute(query, params, timeout=timeout)
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 162.243.42.139 datacenter1>: ConnectionException('Host has been marked down or removed',)})
Sep 13 19:50:13 prod-ws1 uwsgi[1476]: [pid: 1596|app: 0|req: 2586/12460] 141.8.143.121 () {40 vars in 605 bytes} [Thu Sep 13 19:50:13 2018] GET /subsector/entertainment-human-resources => generated 291 bytes in 4 msecs (HTTP/1.1 500) 2 h
If I can help or provide any additional info, please let me know.
Thanks all for the details and for the test case!
After some investigation, these errors come from SSL sockets raising WANT_WRITE errors. This is a fairly well-known condition and it is not fatal: it only means that the socket cannot write the chunk at the moment. The PR adds a fix so that the asyncore and Libev reactors handle this error properly. Gevent, Eventlet, and Twisted are not affected.
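To illustrate the idea (this is not the code from the PR), here is a minimal sketch of a non-blocking write step that treats WANT_WRITE as "retry later" instead of as a connection failure. The names try_send and FlakySocket are hypothetical; FlakySocket only stands in for a non-blocking SSL socket so the behaviour can be exercised without a server.

```python
import ssl


def try_send(sock, buffer):
    """Attempt one non-blocking send and return the bytes still unsent.

    SSL WANT_WRITE / WANT_READ mean "no progress right now", so the
    buffer is kept intact for a later retry instead of raising.
    """
    try:
        sent = sock.send(buffer)
    except (ssl.SSLWantWriteError, ssl.SSLWantReadError):
        # Not fatal: retry when the event loop reports the socket ready.
        return buffer
    return buffer[sent:]


class FlakySocket:
    """Stand-in socket: the first send raises WANT_WRITE, later sends
    accept at most four bytes each."""

    def __init__(self):
        self.calls = 0
        self.written = b""

    def send(self, data):
        self.calls += 1
        if self.calls == 1:
            raise ssl.SSLWantWriteError("write would block")
        chunk = data[:4]
        self.written += chunk
        return len(chunk)
```

A reactor would call try_send each time the socket becomes writable and keep whatever is returned as the pending buffer. Defuncting the connection on WANT_WRITE instead would match the symptom in this thread, where hosts are marked down although they are reachable.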