PYTHON-1044

Driver hangs/deadlock if all connections dropped by heartbeat whilst request in flight and request times out

Description

A request is sent and added to the request queue of the connection. The calling thread then waits for the response future to set the event that signals completion:

    # cluster.py:4087
    self._event.wait()

The client loses its network connection to all contact points.

The heartbeat defuncts the connection and clears the connection's request queue:

    # connection.py:404
    def error_all_requests(self, exc):
        with self.lock:
            requests = self._requests
            self._requests = {}

The response future times out and checks whether its request is still in the request queue:

    # cluster.py:3551
    def _on_timeout(self, _attempts=0):
        """
        Called when the request associated with this ResponseFuture times out.

        This function may reschedule itself. The ``_attempts`` parameter tracks
        the number of times this has happened. This parameter should only be set
        in those cases, where ``_on_timeout`` reschedules itself.
        """
        # PYTHON-853: for short timeouts, we sometimes race with our __init__
        if self._connection is None and _attempts < 3:
            self._timer = self.session.cluster.connection_class.create_timer(
                0.01,
                partial(self._on_timeout, _attempts=_attempts + 1)
            )
            return

        if self._connection is not None:
            try:
                self._connection._requests.pop(self._req_id)
            # This prevents the race condition of the
            # event loop thread just receiving the waited message
            # If it arrives after this, it will be ignored
            except KeyError:
                return

Because the request has already been removed from the request queue, the timeout handler assumes the request has completed and does not raise an exception. Because the pool has been shut down, the retry task never adds the request to a request queue again, so the caller is left waiting for a response that will never arrive.

Even if the connection reappears, this request is never re-added to the queue, leaving the calling thread stuck.
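
The sequence described above can be illustrated with a small toy model. This is only a sketch: `ToyConnection`, `ToyFuture` and `simulate` are hypothetical names and not the driver's classes, and the model assumes, as the report states, that nothing else completes the future once the request queue has been cleared. The final wait is bounded so the demo terminates, whereas the wait quoted from cluster.py has no timeout.

```python
import threading


class ToyConnection:
    """Hypothetical stand-in for a driver connection (not the real class)."""

    def __init__(self):
        self.lock = threading.Lock()
        self._requests = {}

    def error_all_requests(self):
        # Same shape as the quoted snippet: swap the pending map for an empty one.
        with self.lock:
            requests = self._requests
            self._requests = {}
        return requests


class ToyFuture:
    """Hypothetical stand-in for a response future (not the real class)."""

    def __init__(self, connection, req_id):
        self._connection = connection
        self._req_id = req_id
        self._event = threading.Event()
        self.error = None
        connection._requests[req_id] = self

    def on_timeout(self):
        try:
            self._connection._requests.pop(self._req_id)
        except KeyError:
            # The id is already gone, so the handler assumes the response
            # was handled elsewhere and returns without setting the event.
            return
        self.error = TimeoutError("request timed out")
        self._event.set()


def simulate(clear_before_timeout):
    """Return True if the waiter is released, False if it would stay blocked."""
    conn = ToyConnection()
    future = ToyFuture(conn, req_id=1)
    if clear_before_timeout:
        conn.error_all_requests()  # heartbeat defuncts the connection
    future.on_timeout()            # request timer fires
    # Bounded wait for the demo only.
    return future._event.wait(timeout=0.1)
```

Calling `simulate(False)` models the normal timeout path, where the waiter is released with a timeout error. Calling `simulate(True)` models the reported ordering, where the queue is cleared first and the event is never set.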

Environment

None

Pull Requests

None

Status

Assignee

Unassigned

Reporter

Patrick Engelbert

Fix versions

None

Labels

None

Reproduced in

None

PM Priority

None

External issue ID

None

Doc Impact

None

Reviewer

None

Size

None

Sprint

Affects versions

3.16.0

Priority

Critical