StackOverflowError on Connection.release

Description

See attachment.

Environment

None

Pull Requests

None

Activity

Andy Tolbert
June 21, 2017, 3:27 PM

Hi, thank you for opening this up.

I have a theory as to what is happening here. It looks like a BusyConnectionException is being thrown because the connection has exhausted its stream ids. The connection gets released back to the pool so it can be reused, it gets reassigned and reused, the exception is encountered again, which releases the connection again, and this repeats over and over until it blows the stack. In practice this shouldn't happen, but we could handle it better.
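To show the shape of that failure mode, here is a simplified, hypothetical sketch (not the driver's actual code): a "release and retry" that stays on the same call stack against a connection whose stream ids never free up adds a frame per attempt until the stack overflows.

public class BusyRetryStackOverflow {

    // Stand-ins for the driver's exception and connection types, invented for illustration.
    static class BusyConnectionException extends RuntimeException {}

    static class Connection {
        int freeStreamIds = 0; // permanently exhausted in this scenario

        void write(String request) {
            if (freeStreamIds == 0) {
                throw new BusyConnectionException();
            }
        }
    }

    // A "pool" consisting of a single exhausted connection, so every borrow returns the same one.
    static Connection pool = new Connection();

    static void execute(String request) {
        Connection connection = pool; // borrow
        try {
            connection.write(request);
        } catch (BusyConnectionException e) {
            // The connection is effectively released and immediately re-borrowed, and the
            // retry happens on the same thread, so each attempt adds another stack frame.
            execute(request);
        }
    }

    public static void main(String[] args) {
        execute("SELECT ..."); // eventually throws StackOverflowError
    }
}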

I would not expect this to happen unless PoolingOptions#getMaxRequestsPerConnection is configured above what the connection is able to support. For protocol version 2 and under (the default for C* 2.0 & C* 1.2, but it can be configured explicitly) there are only 128 stream ids per connection (the number of simultaneous in-flight requests on a connection); for newer protocol versions that number is 32768. Another possibility is that the driver is incorrectly tracking its in-flight requests on connections (which has been a bug before, but not known to have been one going back to driver 2.1).

Based on that, I have a few questions:

1. Are you configuring PoolingOptions in your Cluster configuration? I'm primarily interested in whether you are using setMaxRequestsPerConnection and, if so, what it is set to.
2. What version of Cassandra is your cluster using?
3. Are you setting the ProtocolVersion explicitly using withProtocolVersion?
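For reference, here is a minimal sketch of where the settings asked about in questions 1 and 3 are supplied on a 3.x Cluster builder. The contact point, class name, and values below are placeholders for illustration, not recommendations.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.ProtocolVersion;

public class ClusterConfigExample {
    public static void main(String[] args) {
        // Question 1: an explicit max-requests-per-connection setting (placeholder value).
        PoolingOptions pooling = new PoolingOptions()
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")      // placeholder contact point
                .withPoolingOptions(pooling)
                // Question 3: forcing a protocol version; omit to let the driver negotiate.
                .withProtocolVersion(ProtocolVersion.V4)
                .build();

        cluster.close();
    }
}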

Anton Afanasyev
June 21, 2017, 5:14 PM

Andy,
1.

2. C* 3.10
3. not setting withProtocolVersion

Andy Tolbert
June 22, 2017, 4:20 PM

Thanks

pools.setMaxRequestsPerConnection(HostDistance.LOCAL, 32768);

I would recommend against this, since it doesn't provide much value, for a couple of reasons:

  1. If you are looking to optimize throughput, it's usually better to have more connections per host, as that allows Cassandra to better utilize its resources (only one thread can read from a socket at a time, so having more connections lets you use more CPU resources); see the sketch after this list.

  2. From my testing, there isn't much practical value in allowing a large number of concurrent requests per connection. By default C* will only process 128 requests at a time (native_transport_max_threads), so any requests beyond 128 will be queued (unless you change that config, of course). Your mileage may vary when increasing requests per connection, but in general, as you increase concurrency you may see throughput improvements (at a cost to latency) only up to a point, after which throughput worsens. Having many in-flight requests on a connection can also prevent responses from being delivered back to the client, because both the driver and C* are busy reading more requests off the socket. Because of this, I wouldn't recommend increasing max requests per connection above the default (1024).
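A sketch of the alternative from point 1, assuming the 3.x PoolingOptions API; the class name and the core/max connection counts are illustrative, not tuned recommendations.

import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class MoreConnectionsExample {
    public static PoolingOptions recommendedPooling() {
        return new PoolingOptions()
                // Increase concurrency via more connections per host to LOCAL hosts
                // (placeholder core/max values), rather than raising requests per connection.
                .setConnectionsPerHost(HostDistance.LOCAL, 2, 8);
                // Max requests per connection is left at its default (1024).
    }
}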

That being said, I'd still consider this a bug. We need to investigate:

  1. Why we are getting BusyConnectionException to begin with. I don't think this should happen unless the driver is not appropriately tracking the number of used stream ids, so there could be a bug here.

  2. Whether we can handle this situation better. Releasing the connection and then reusing it on the same calling thread can create enough recursion to blow the stack, as you've demonstrated, although if #1 weren't happening this wouldn't be a concern. Maybe we need to special-case our handling of BusyConnectionException so it doesn't do this.

Olivier Michallat
October 11, 2017, 10:49 PM

Coming back to this ticket because others have run into it (sorry it kind of fell off our radar).

I think the most conservative approach would be to replace the connection if we get a BusyConnectionException in RequestHandler.SpeculativeExecution#query. We'll have to add a parameter to Connection.release to distinguish this case.
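A rough, hypothetical sketch of that idea (not the actual driver API; the class shape, method names, and the boolean flag are invented for illustration): a release triggered by BusyConnectionException replaces the connection instead of returning it, so the calling thread cannot immediately re-borrow the same exhausted one.

import java.util.ArrayDeque;
import java.util.Deque;

public class ReplaceOnBusyPool {

    static class Connection {}

    private final Deque<Connection> available = new ArrayDeque<>();

    public synchronized Connection borrow() {
        return available.isEmpty() ? new Connection() : available.poll();
    }

    // The extra 'busyConnection' parameter stands in for the parameter suggested above
    // for Connection.release, distinguishing the BusyConnectionException case from a
    // normal release.
    public synchronized void release(Connection c, boolean busyConnection) {
        if (busyConnection) {
            available.offer(new Connection()); // replace the exhausted connection
        } else {
            available.offer(c);                // normal reuse
        }
    }
}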

Shane Runquist
March 1, 2018, 8:45 PM

In case it is useful, here is a recent example from DSE 6.0.0 testing. In this log, the stack overflow happened 10 times within about 1 second on the same thread, and it followed immediately after another thread logged:
All connections to destination destination 1781680622 are busy and can't keep up with the current replication throughput. Consider increasing values for driver-connections-max and/or driver-max-requests-per-connection in the destination's config.

According to the stack trace, the advanced replication feature at that time was using dse-java-driver-core-1.6.2.jar. It looks like max-requests-per-connection was set to 30000.
See DSP-14382 for notes from Max, the DSE developer. I'm not sure what protocol version it is using ... Max should be able to help with more info, if needed.

Fixed

Assignee

Olivier Michallat

Reporter

Anton Afanasyev

Labels

None

PM Priority

None

Reproduced in

3.2.0

Affects versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Priority

Major