Race condition in Hashed Wheel Timer causes unhandled exception
Description
UPDATE
I’ve managed to reproduce this issue and it looks like it is a race condition that can cause a specific TimeoutItem to be "removed" from the hashed wheel timer's bucket twice.
Normally, pending cancellations are processed at the start of the timer tick (on the RemoveCancelled method call) so those items will not be in the bucket when the timer tick is iterating through the bucket items.
However, if the TimeoutItem is cancelled while the timer tick is running but after the RemoveCancelled method call has returned, the item will still be present in the bucket while the timer tick is iterating through it, and the tick will remove it from the bucket at that point. Then, on the next timer tick, the RemoveCancelled method will remove the same item from the bucket again, so the same item ends up being “removed” twice, and there are no checks preventing this from happening.
This is also consistent with the fact that the problem usually manifests while a topology change is happening: several requests are cancelled at the same time, which increases the probability of hitting this race.
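To make the sequence concrete, here is a minimal sketch of that double removal on a simplified doubly-linked-list bucket. The Bucket and TimeoutItem types below are stand-ins written for this illustration only (the member names are assumptions based on the stack trace and exception message), not the driver's actual HashedWheelTimer code:
```
// Stand-in types for illustration; not the driver's actual implementation.
using System;

class TimeoutItem
{
    public TimeoutItem Next;
    public TimeoutItem Previous;
}

class Bucket
{
    private TimeoutItem _head;
    private TimeoutItem _tail;

    public void Add(TimeoutItem item)
    {
        if (_tail == null)
        {
            _head = _tail = item;
            return;
        }
        _tail.Next = item;
        item.Previous = _tail;
        _tail = item;
    }

    // Unlinks the item, assuming it is still part of the list.
    public void Remove(TimeoutItem item)
    {
        if (item.Next == null && item != _tail)
        {
            // An item that was already unlinked has Next == null but is no longer
            // the tail, which is the condition reported in the exception.
            throw new ArgumentException("Next is null but it is not the tail");
        }
        if (item.Previous != null) item.Previous.Next = item.Next; else _head = item.Next;
        if (item.Next != null) item.Next.Previous = item.Previous; else _tail = item.Previous;
        item.Next = null;
        item.Previous = null;
    }
}

class Program
{
    static void Main()
    {
        var bucket = new Bucket();
        var cancelled = new TimeoutItem();
        var other = new TimeoutItem();
        bucket.Add(cancelled);
        bucket.Add(other);

        // Tick N: the item is cancelled after RemoveCancelled has already run,
        // so the tick that is iterating the bucket unlinks it itself.
        bucket.Remove(cancelled);

        // Tick N+1: RemoveCancelled processes the pending cancellation and tries
        // to unlink the same item again; Next is now null but the item is not the
        // tail, so the second removal throws.
        bucket.Remove(cancelled);
    }
}
```
Running this against the stand-in types throws the same ArgumentException message as the production stack trace, which is consistent with the double-removal theory; a removal that skips items that have already been unlinked would avoid it.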
When nodes are being restarted/decommissioned and recommissioned, we find that the following exception can be thrown when the node is on the way down:
```
An unhandled exception occurred and the process was terminated.
Application ID: /LM/W3SVC/581/ROOT
Process ID: 10456
Exception: System.ArgumentException
Message: Next is null but it is not the tail
StackTrace: at Cassandra.Tasks.HashedWheelTimer.Bucket.Remove(TimeoutItem item)
at Cassandra.Tasks.HashedWheelTimer.TimerTick(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.TimerQueueTimer.CallCallback()
at System.Threading.TimerQueueTimer.Fire()
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
```
This is reasonably easy for us to reproduce inside our production environment when performing maintenance.
The cluster is configured within the application like so:
```
Cluster.MaxProtocolVersion = 3;
return Cluster.Builder()
    .WithLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
    .WithSocketOptions(new SocketOptions()
        .SetConnectTimeoutMillis(500)
    )
    .WithExecutionProfiles(options => options
        .WithProfile("default", profile => profile
            .WithSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(100, 5))
        ))
    .WithQueryTimeout(2000)
    .AddContactPoints("cassandraseed1","cassandraseed2","cassandraseed3")
    .Build()
    .Connect("Stats");
```
The application connects to 5 keyspaces (if that's relevant).
We also have a WER dump of the application in question; however, as it is a full memory dump, it contains credentials and other information that cannot be shared externally.
Sorry, I forgot to say. No problems at all after rolling that out, so I’m pretty confident this is resolved! Thanks
Joao Reis
September 8, 2022 at 3:10 PM
Hey! Did you hit this issue again during those cluster maintenance operations?
Joao Reis
August 12, 2022 at 9:45 AM
Thanks for the update!
Robert Emery
August 12, 2022 at 6:56 AM
Hiya, finally managed to get this deployed last night. We’ll be doing a lot of blowing away and rebuilding Cassandra nodes over the next 2 weeks, so if we haven’t had another occurrence by the 19th, I’d probably say that it’s confirmed fixed.
Robert Emery
July 20, 2022 at 12:16 PM
Yep, I will endeavour to get that rolled out this afternoon. Thanks!