Reconnect not initiated when all nodes are down

Description

I set my reconnection policy to ConstantReconnectionPolicy, which should attempt a reconnect every 10 seconds.

How to reproduce:
All connections are created. Then I shut down the whole cluster with "ccm stop". The uwsgi logs show:

Connection <GeventConnection(67043600) 127.0.0.1:9042> closed by server
Closing connection (67043600) to 127.0.0.1
Closed socket to 127.0.0.1
in-flight callbacks: {}
Connection <GeventConnection(67915280) 127.0.0.1:9042> closed by server
Closing connection (67915280) to 127.0.0.1
Closed socket to 127.0.0.1
in-flight callbacks: {}
Connection <GeventConnection(67912336) 127.0.0.2:9042> closed by server
Closing connection (67912336) to 127.0.0.2
Closed socket to 127.0.0.2
in-flight callbacks: {}

As you can see, no reconnection is initiated. (The "in-flight callbacks" line was added by me for debugging.) It seems that unless I send a new query, the driver won't reconnect.

I noticed that the reconnector is triggered when a defunct/closed connection is returned to the pool, which only happens when the connection has pending callbacks.

The following code shows how a connection is closed from the driver side. error_all_callbacks invokes all pending callbacks; if a ResponseFuture._set_result callback is among them, return_connection will be called and trigger the reconnect. Otherwise, no reconnect is initiated:
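To make the close path concrete, here is a minimal sketch of the mechanism described above. This is not actual driver code; the class and method names mirror the driver's (Connection, ResponseFuture._set_result, return_connection) but the bodies are simplified stand-ins:

```python
# Simplified illustration of the reported behavior -- not driver source.

class Pool:
    def __init__(self):
        self.reconnect_scheduled = False

    def return_connection(self, connection):
        # The driver schedules a reconnect only when a defunct/closed
        # connection is handed back to the pool.
        if connection.is_closed:
            self.reconnect_scheduled = True

class ResponseFuture:
    def __init__(self, pool, connection):
        self.pool = pool
        self.connection = connection

    def _set_result(self, response):
        # Returning the connection here is what triggers the reconnect.
        self.pool.return_connection(self.connection)

class Connection:
    def __init__(self):
        self.is_closed = False
        self._callbacks = {}  # in-flight callbacks, keyed by stream id

    def close(self):
        self.is_closed = True
        self.error_all_callbacks(ConnectionError("closed by server"))

    def error_all_callbacks(self, exc):
        # With no in-flight requests this dict is empty ("{}" in the log
        # above), so return_connection is never called and no reconnect
        # is scheduled.
        for cb in list(self._callbacks.values()):
            cb(exc)
        self._callbacks.clear()

idle_pool = Pool()
idle = Connection()  # no pending requests
idle.close()
print(idle_pool.reconnect_scheduled)  # False: the bug described above

busy_pool = Pool()
busy = Connection()
future = ResponseFuture(busy_pool, busy)
busy._callbacks[1] = lambda exc: future._set_result(exc)
busy.close()
print(busy_pool.reconnect_scheduled)  # True: only in-flight traffic triggers it
```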

I'm also wondering whether a closed connection should be returned to the pool at all.

Environment

Cluster of 2 nodes created by ccm.
uwsgi with 1 process.

Pull Requests

None

Activity

Adam Holmberg 
July 29, 2015 at 4:21 PM

Closing as fixed in later versions of the driver.

Adam Holmberg 
July 27, 2015 at 1:24 PM
(edited)

We do benchmarks with this feature enabled. The heartbeat is intentionally lightweight. If your application is CPU-bound, you may notice the overhead of this thread.

The recommended value would be the largest value that achieves what you need – keeping idle connections open through intermediate network devices, and the appropriate responsiveness detecting dead connections.
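For reference, the interval is configured when constructing the Cluster. This fragment assumes the Python driver's idle_heartbeat_interval parameter (present since the heartbeat feature was introduced; 30 seconds is the default in recent versions):

```python
from cassandra.cluster import Cluster

# idle_heartbeat_interval is in seconds; 0 disables the heartbeat.
# A larger value means less probe traffic but slower detection of
# dead connections when there is no request traffic.
cluster = Cluster(
    contact_points=['127.0.0.1'],
    idle_heartbeat_interval=30,
)
```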

t 
July 27, 2015 at 6:03 AM

I tested with version 2.5.1 and it worked. Just wondering, have you done any performance tests on this new feature? I mean, a new daemon thread constantly checking all connections: does that consume more resources? Is there a recommended value for idle_heartbeat_interval?

Adam Holmberg 
July 24, 2015 at 1:57 PM

Good catch. That was introduced here to cover the specific case in which every node is down (no control connection to receive node status events), and there is no request traffic (IO must be attempted to detect failed connections).

I had not noticed your driver version previously. Are you able to try this with the current version?

t 
July 24, 2015 at 1:40 AM
(edited)

I noticed there's a new feature called Heartbeat in newer versions. It looks like it can fix this issue; however, I'm not sure if that's why the feature was introduced.

Please see the following code. There's a line "owner.return_connection(connection)": the defunct/closed connection is returned to the pool, which is exactly what is missing in the old versions when a connection is closed by the server:
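The behavior that line enables can be sketched as follows. This is an illustration under assumed names (FakeConnection, FakePool, heartbeat_pass are all hypothetical), not the driver's actual heartbeat implementation; the real driver runs a daemon thread every idle_heartbeat_interval seconds and probes idle connections, but the key step is the same: a dead connection is handed back to its owner so a reconnect gets scheduled even with no application traffic.

```python
# Illustrative single pass of a heartbeat check -- not driver source.

class FakeConnection:
    def __init__(self, alive):
        self.alive = alive
        self.is_defunct = False

    def probe(self):
        # Stand-in for the lightweight request the driver sends to
        # verify the connection is still usable.
        if not self.alive:
            raise ConnectionError("connection closed by server")

    def defunct(self):
        self.is_defunct = True

class FakePool:
    def __init__(self):
        self.returned = []

    def return_connection(self, connection):
        # Receiving a defunct connection is what makes the pool
        # schedule a reconnect.
        self.returned.append(connection)

def heartbeat_pass(owner, connections):
    for connection in connections:
        try:
            connection.probe()
        except ConnectionError:
            connection.defunct()
            # The step missing in the old versions: return the bad
            # connection so the owner can react.
            owner.return_connection(connection)

pool = FakePool()
good, dead = FakeConnection(alive=True), FakeConnection(alive=False)
heartbeat_pass(pool, [good, dead])
print(pool.returned == [dead])  # True: only the dead connection is returned
```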

Although the heartbeat feature was introduced in 2.1.4, it didn't return bad connections at that time; that was only added in version 2.5.0, which the above code is taken from. So I wonder whether this is a known issue.

Fixed

Details

Created July 23, 2015 at 8:58 AM
Updated July 29, 2015 at 4:21 PM
Resolved July 29, 2015 at 4:21 PM