Race condition in Hashed Wheel Timer causes unhandled exception
Description
UPDATE
I’ve managed to reproduce this issue and it looks like it is a race condition that can cause a specific TimeoutItem to be "removed" from the hashed wheel timer's bucket twice.
Normally, pending cancellations are processed at the start of the timer tick (on the RemoveCancelled method call) so those items will not be in the bucket when the timer tick is iterating through the bucket items.
However, if the TimeoutItem is cancelled while the timer tick is running but after the RemoveCancelled method call has returned, the item will still be present in the bucket while the timer tick is iterating through it, and the tick will remove it from the bucket at that point. Then, on the next timer tick, the RemoveCancelled method will remove the same item from the bucket again, so the same item ends up being “removed” twice, and there are no checks preventing this from happening.
This is also consistent with the fact that the problem usually manifests while a topology change is happening: several requests are cancelled at the same time, which increases the probability of hitting this race.
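To make the sequence concrete, here is a minimal sketch of that double removal on a simplified doubly-linked-list bucket. The Bucket and TimeoutItem types below are stand-ins written for this illustration only (the member names are assumptions based on the stack trace and exception message), not the driver's actual HashedWheelTimer code:
```
// Stand-in types for illustration; not the driver's actual implementation.
using System;

class TimeoutItem
{
    public TimeoutItem Next;
    public TimeoutItem Previous;
}

class Bucket
{
    private TimeoutItem _head;
    private TimeoutItem _tail;

    public void Add(TimeoutItem item)
    {
        if (_tail == null)
        {
            _head = _tail = item;
            return;
        }
        _tail.Next = item;
        item.Previous = _tail;
        _tail = item;
    }

    // Unlinks the item, assuming it is still part of the list.
    public void Remove(TimeoutItem item)
    {
        if (item.Next == null && item != _tail)
        {
            // An item that was already unlinked has Next == null but is no longer
            // the tail, which is the condition reported in the exception.
            throw new ArgumentException("Next is null but it is not the tail");
        }
        if (item.Previous != null) item.Previous.Next = item.Next; else _head = item.Next;
        if (item.Next != null) item.Next.Previous = item.Previous; else _tail = item.Previous;
        item.Next = null;
        item.Previous = null;
    }
}

class Program
{
    static void Main()
    {
        var bucket = new Bucket();
        var cancelled = new TimeoutItem();
        var other = new TimeoutItem();
        bucket.Add(cancelled);
        bucket.Add(other);

        // Tick N: the item is cancelled after RemoveCancelled has already run,
        // so the tick that is iterating the bucket unlinks it itself.
        bucket.Remove(cancelled);

        // Tick N+1: RemoveCancelled processes the pending cancellation and tries
        // to unlink the same item again; Next is now null but the item is not the
        // tail, so the second removal throws.
        bucket.Remove(cancelled);
    }
}
```
Running this against the stand-in types throws the same ArgumentException message as the production stack trace, which is consistent with the double-removal theory; a removal that skips items that have already been unlinked would avoid it.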
When nodes are being restarted/decommissioned and recommissioned, we find that the following exception can be thrown when the node is on the way down:
```
An unhandled exception occurred and the process was terminated.
Application ID: /LM/W3SVC/581/ROOT
Process ID: 10456
Exception: System.ArgumentException
Message: Next is null but it is not the tail
StackTrace: at Cassandra.Tasks.HashedWheelTimer.Bucket.Remove(TimeoutItem item)
at Cassandra.Tasks.HashedWheelTimer.TimerTick(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.TimerQueueTimer.CallCallback()
at System.Threading.TimerQueueTimer.Fire()
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
```
This is reasonably easy for us to reproduce inside our production environment when performing maintenance.
The cluster is configured within the application like so:
```
Cluster.MaxProtocolVersion = 3;
return Cluster.Builder()
    .WithLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
    .WithSocketOptions(new SocketOptions()
        .SetConnectTimeoutMillis(500)
    )
    .WithExecutionProfiles(options => options
        .WithProfile("default", profile => profile
            .WithSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(100, 5))
        ))
    .WithQueryTimeout(2000)
    .AddContactPoints("cassandraseed1","cassandraseed2","cassandraseed3")
    .Build()
    .Connect("Stats");
```
The application connects to 5 keyspaces (if that's relevant).
We also have a WER dump of the application in question; however, as it is a full memory dump, it contains credentials and other information that cannot be shared externally.
Sorry, I forgot to say. No problems at all after rolling that out, so I’m pretty confident this is resolved! Thanks
Joao Reis
September 8, 2022 at 3:10 PM
Hey! Did you hit this issue again during those cluster maintenance operations?
Joao Reis
August 12, 2022 at 9:45 AM
Thanks for the update!
Robert Emery
August 12, 2022 at 6:56 AM
Hiya, finally managed to get this deployed last night. We’ll be doing a lot of blowing away and rebuilding Cassandra nodes over the next 2 weeks, so if we haven’t had another occurrence by the 19th, I’d probably say that it’s confirmed fixed.
Robert Emery
July 20, 2022 at 12:16 PM
Yep, I will endeavour to get that rolled out this afternoon. Thanks!