Race condition in the ControlConnection Dispose method can leak connections


Our product runs multiple services on the same appliance, and each service uses the C# driver to build one connection per keyspace (we have about 5 keyspaces). About one month ago, we upgraded the C# driver from 3.5 to 3.13. We also turned on driver logging so that it integrates with our own logging. Today we noticed that 2 of our services emit the following error, whose log message was introduced in this commit: https://github.com/datastax/csharp-driver/commit/fa451415703b4ddaf8265277710ba39e8b3e5478#diff-f82069886a3a20205b9de5409b69e1ff

2020-03-24T17:49:12.701Z [Error] ( 057) {CassandraClient} Exception thrown while handling cassandra event.
Cassandra.DriverInternalError: Could not schedule event in the ProtocolEventDebouncer.
   at Cassandra.ProtocolEvents.ProtocolEventDebouncer.<ScheduleEventAsync>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.Connections.ControlConnection.<OnConnectionCassandraEvent>d__50.MoveNext()

And then, every 30 seconds, we see the following warning. Services that do not log this warning also do not log the error above. Restarting our service (and therefore the C# driver client) fixes the problem. Since the warning is logged at the driver's heartbeat interval (the Connection idle timeout handler), we think it may come from keyspace connections that the service just isn't using; writes to other keyspaces keep working while this warning is being logged.

2020-03-24T00:00:20.978Z [Warning] ( 088) {CassandraClient} Received heartbeat request exception System.InvalidOperationException: Can not start timer after Disposed
   at Cassandra.Tasks.HashedWheelTimer.Start()
   at Cassandra.Tasks.HashedWheelTimer.NewTimeout(Action`1 action, Object state, Int64 delay)
   at Cassandra.Connections.Connection.RunWriteQueueAction()

It seems like an internal coding error, but I am curious whether you think it could be caused by something we are doing.
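For context, the "Can not start timer after Disposed" error above is the classic shape of a Dispose-vs-schedule race. The sketch below is a hypothetical, self-contained illustration of the general pattern and its usual fix (a shared lock plus a disposed flag); it is not the driver's actual code, which uses its own HashedWheelTimer rather than System.Threading.Timer.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: a component that schedules heartbeat callbacks
// must coordinate scheduling with Dispose, otherwise a late caller can
// try to start a timer that has already been disposed and throw.
class HeartbeatScheduler : IDisposable
{
    private readonly object _lock = new object();
    private readonly Timer _timer;
    private bool _disposed;

    public HeartbeatScheduler()
    {
        // Created idle; a real heartbeat action would go in the callback.
        _timer = new Timer(_ => { /* send heartbeat */ });
    }

    // Returns false instead of throwing once Dispose has run.
    public bool TryScheduleHeartbeat(TimeSpan delay)
    {
        lock (_lock)
        {
            if (_disposed) return false;
            _timer.Change(delay, Timeout.InfiniteTimeSpan);
            return true;
        }
    }

    public void Dispose()
    {
        lock (_lock)
        {
            if (_disposed) return;
            _disposed = true;
            _timer.Dispose();
        }
    }
}

class Program
{
    static void Main()
    {
        var scheduler = new HeartbeatScheduler();
        Console.WriteLine(scheduler.TryScheduleHeartbeat(TimeSpan.FromSeconds(30))); // True
        scheduler.Dispose();
        Console.WriteLine(scheduler.TryScheduleHeartbeat(TimeSpan.FromSeconds(30))); // False
    }
}
```

Without the lock and flag, a thread that checks "not disposed" and then schedules can interleave with Dispose and hit exactly the InvalidOperationException shown in the warning.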


Win10 Enterprise LTSC, C# Driver 3.13, Cassandra 3.11.1


Joao Reis
April 9, 2020, 1:48 PM

I have a fix ready in a PR. I've been testing it with a local console application that I'm using to consistently reproduce the bug and I no longer see it after the fix. I was also able to reproduce the bug in a unit test which is great.

From my local tests it looks like this bug is still present in 3.14.0, but you won't get the periodic warning message every 30 seconds like you did in previous versions (3.13.0 and below). This is due to a behavior change introduced in that version: the driver no longer attempts to send further heartbeat requests when this particular error is thrown while sending one (this applies only to the control connection; the other connections are always disposed when a heartbeat request fails).

In other words, if you decide to upgrade to the current version (3.14.0), you will most likely stop seeing the log spam, but the leaked connection will still be there, and you will still see that warning whenever a protocol event (a schema or topology change) is received. If the log spam is causing issues, you can try upgrading to 3.14.0 to see if it helps; otherwise I recommend waiting for 3.15.0, which will contain this fix and is scheduled to go out next month (May).

Joao Reis
March 31, 2020, 11:09 PM

Ok thanks. I was expecting this to be really rare but evidently that's not the case.

Tania Engel
March 31, 2020, 10:50 PM

I noticed it again today while reviewing a completely different set of appliances. In that case, we would also have run the same code for a short-lived session.

Joao Reis
March 31, 2020, 10:40 PM

Yes, it has no impact on the existing ICluster and ISession instances. Besides the annoying log messages, the extra connection stays open and consumes some resources, but that's about it.

Do you see this issue very often? That scenario might increase the chances of hitting the race condition... I hadn't considered this use case, so thanks for sharing; I'll increase the priority of this ticket.

Tania Engel
March 31, 2020, 10:13 PM

I looked into why we see the extra shutdown. What is happening is that when we remove a node from our cluster and run CQL to change the replication factor, we actually create a new (short-lived) ICluster and ISession with no keyspace specified. When done, we invoke Shutdown and then Dispose on it. In our minds, this has no impact on the existing (long-lived) ICluster/ISession. Is that incorrect?
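The short-lived session described above looks roughly like the following sketch. The contact point and the ALTER KEYSPACE statement are placeholders, and error handling is omitted; this requires a live Cassandra cluster and the DataStax C# driver package, so it is illustrative rather than runnable as-is.

```csharp
using Cassandra; // DataStax C# driver NuGet package

// Hypothetical sketch of the short-lived ICluster/ISession pattern.
// "10.0.0.1" and "my_keyspace" are placeholder values.
var cluster = Cluster.Builder()
    .AddContactPoint("10.0.0.1")
    .Build();

// Connect without specifying a keyspace.
var session = cluster.Connect();

// Run the replication-factor change.
session.Execute(
    "ALTER KEYSPACE my_keyspace WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 2}");

// Tear the short-lived instances down. The long-lived
// ICluster/ISession objects elsewhere in the process keep
// their own connection pools and are not affected by this.
cluster.Shutdown();
cluster.Dispose();
```

Each Cluster instance owns its own control connection and pools, which is why disposing this one should not touch the long-lived instances; the leak discussed in this thread comes from the race inside the disposed instance itself.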
