We have an old code base where we want to replace Hector with this driver. We have a requirement that each read must finish within 100ms at most. For this we do something like session.executeAsync(query).getUninterruptibly(100, TimeUnit.MILLISECONDS)
I use the stress app example to connect to a 16-node cluster from a node with a single processor (an AWS micro instance)
The command is `bin/stress read -t 40 --ip contact-point` (40 threads, using the blocking executor)
We did the following modifications to the stress app to replicate some of the problems we were seeing in prod:
used a [low timeout|https://github.com/catacgc/java-driver/blob/stress-test-conn-pool/driver-examples/stress/src/main/java/com/datastax/driver/stress/BlockingConsumer.java#L83] (20ms) for the blocking call
set a variable connection pool size (min=2, max=8)
So basically, after a certain period, given enough timeouts, we end up with connections whose inFlight count equals the number of timeouts
Note that because the stream id gets unmarked when a timeout occurs, a connection can reach a state where it reports 128 in-flight requests and, at the same time, 128 available streams (see the attached trace of this happening)
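To make the accounting problem above concrete, here is a toy model of the leak. The class and method names (ToyConnection, write, release, canBeClosed) are illustrative, not the driver's actual classes; the point is only that if a client-side timeout skips the release path, the inFlight counter ends up equal to the number of timeouts and the pool can never consider the connection idle:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightLeakDemo {

    // Hypothetical stand-in for a pooled connection's bookkeeping.
    static class ToyConnection {
        final AtomicInteger inFlight = new AtomicInteger(0);

        void write() { inFlight.incrementAndGet(); }          // request sent
        void release() { inFlight.decrementAndGet(); }        // response handled

        // Pools typically only close connections with no work in flight.
        boolean canBeClosed() { return inFlight.get() == 0; }
    }

    public static void main(String[] args) {
        ToyConnection conn = new ToyConnection();

        // 3 queries time out on the client side: write() happened, but the
        // caller walked away before the response, so release() never runs.
        for (int i = 0; i < 3; i++) conn.write();

        // 1 query completes normally and is released.
        conn.write();
        conn.release();

        System.out.println("inFlight=" + conn.inFlight.get());   // inFlight=3
        System.out.println("canBeClosed=" + conn.canBeClosed()); // canBeClosed=false
    }
}
```

With enough timeouts the counter only grows, which matches the symptom of connections that are never closed.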
We see two issues:
a large number of established connections to port 9042 (understandable, because a connection that has seen at least one client timeout will never get closed)
the driver gets into a state where it just creates a new connection and trashes it immediately
See the attached log file for a trace from a connection pool.
16 hosts, CentOS 6.5
Client is an AWS micro instance
Using a timeout on the result set future doesn't set a timeout for the query; it only bounds how long you (the client) are willing to block waiting on the future. This is simply how a Java Future works, so it is perfectly expected that the connection doesn't get released after such a "timeout". Now, if after having waited some time on the future you decide that you do want to cancel the task, you can call the cancel method, though be sure to read the javadoc carefully as there are caveats. Then the connection should get released (if it isn't, that would be a proper bug, but it's not one if cancel is never called).
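This distinction can be shown with a plain java.util.concurrent Future (no driver involved). Timing out on get() leaves the underlying task running and uncancelled; only an explicit cancel() stops it:

```java
import java.util.concurrent.*;

public class FutureTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // A task standing in for a slow query (well over our 20ms budget).
        Future<String> future = pool.submit(() -> {
            Thread.sleep(500);
            return "row";
        });

        try {
            // The timeout only bounds how long WE block here;
            // the task itself keeps running after the TimeoutException.
            future.get(20, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out; cancelled=" + future.isCancelled()); // cancelled=false
        }

        // To actually stop the work (and allow resources to be released),
        // the caller must cancel explicitly.
        future.cancel(true);
        System.out.println("after cancel: cancelled=" + future.isCancelled()); // cancelled=true

        pool.shutdownNow();
    }
}
```

The driver's ResultSetFuture follows the same contract: the timed get/getUninterruptibly bounds the wait, and cancel is a separate, explicit step.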
Yes, that's exactly it. As you can see in the example I gave here, the future is properly canceled.
But I am missing the part where the connection gets released. See here the code for the cancel method: once there's a timeout, the inFlight counter never returns to 0. When that happens, the connection never closes, even after it's trashed.
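To spell out what "released on cancel" would mean, here is a toy sketch contrasting the two cancel paths. All names are made up for illustration, not the driver's real code; the assumption is simply that a connection can only be closed once its inFlight counter is back to zero:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CancelReleaseDemo {

    // Hypothetical stand-in for a connection's in-flight bookkeeping.
    static class Conn {
        final AtomicInteger inFlight = new AtomicInteger(0);
        void write() { inFlight.incrementAndGet(); }
        void release() { inFlight.decrementAndGet(); }
        boolean closable() { return inFlight.get() == 0; }
    }

    // Buggy path: cancel marks the request abandoned but never releases.
    static void cancelWithoutRelease(Conn c) { /* no release() */ }

    // Fixed path: cancel also releases the connection slot.
    static void cancelWithRelease(Conn c) { c.release(); }

    public static void main(String[] args) {
        Conn a = new Conn();
        a.write();
        cancelWithoutRelease(a);
        System.out.println("buggy closable=" + a.closable()); // false: leaked forever

        Conn b = new Conn();
        b.write();
        cancelWithRelease(b);
        System.out.println("fixed closable=" + b.closable()); // true: can be closed
    }
}
```

Under this model, a single unreleased cancel is enough to pin the connection open for its entire lifetime, which is the behavior observed in the stress run.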
Let me know how I can help you trace this. Maybe there's something obvious that I'm missing, but running that stress test results in a large number of active connections after just a short while.
Correct, we don't properly release the connection on cancel and that is a bug. Pushed a simple fix, thanks for the report. I've manually tested the fix, but it's not trivial to add a (non-racy) unit test for this, so if you can check the current 2.0 branch and validate that it works for you, that would be amazing.
Will test and report back, thanks
FYI, I've tested it and it works as it should - no more leaked connections. Thanks