Invalid or unsupported protocol version (0)

Description

We have had a couple of periods where our production servers started throwing this exception on every C* request:

Invalid or unsupported protocol version (0); supported versions are (3/v3, 4/v4, 5/v5-beta)
at Cassandra.Data.Linq.CqlQueryBase`1.InternalExecuteWithProfileAsync(String executionProfile, String cqlQuery, Object[] values)
at Cassandra.Data.Linq.CqlQueryBase`1.ExecuteCqlQueryAsync(String executionProfile)

The affected servers kept throwing this exception until we restarted them, at which point the exceptions stopped.

Note the "version (0)". I've been reading the driver source and this should really never happen. Everywhere the version can be set, there are safety checks to ensure that it is >= 1.

Around the same time we were seeing a bunch of these exceptions:

System.ArgumentNullException: Null values are not supported inside collections
at Cassandra.Serialization.CollectionSerializer.Serialize(UInt16 protocolVersion, IEnumerable value)
at Cassandra.Serialization.Serializer.Serialize(Object value)
at Cassandra.QueryProtocolOptions.Write(FrameWriter wb, Boolean isPrepared)
at Cassandra.Requests.ExecuteRequest.WriteFrame(Int16 streamId, MemoryStream stream, Serializer serializer)
at Cassandra.Connections.Connection.RunWriteQueueAction()
--- End of stack trace from previous location where exception was thrown ---
at Cassandra.Data.Linq.CqlQueryBase`1.InternalExecuteWithProfileAsync(String executionProfile, String cqlQuery, Object[] values)
at Cassandra.Data.Linq.CqlQueryBase`1.ExecuteCqlQueryAsync(String executionProfile)

I believe the driver starts writing a CQL frame but never finishes it because of the serializer exception. The partial frame is left in the write stream, so the bytes of the next frame land where Cassandra expects the rest of the first frame, and Cassandra then reads a frame header from the wrong offset.
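To make the suspected mechanism concrete, here is a minimal, language-neutral sketch in Python. It is not the driver's code: write_frame, server_seen_versions and demo are hypothetical names, and the "header first with a placeholder length, then body, then patch the length" order is an assumption about how a frame writer might work. The only protocol facts relied on are the 9-byte v3/v4 header layout (version, flags, stream id, opcode, body length) and opcode 0x0A for EXECUTE.

```python
import io
import struct

# CQL native protocol v3/v4 header: version, flags, stream id, opcode, body length (9 bytes).
HEADER = struct.Struct(">BBhBi")
SUPPORTED_VERSIONS = (3, 4)


def write_frame(stream, stream_id, values):
    """Hypothetical frame writer: header with placeholder length, body, then patch the length.

    If a value is None it raises mid-body and leaves the partial frame in the stream.
    """
    start = stream.tell()
    stream.write(HEADER.pack(0x04, 0, stream_id, 0x0A, 0))  # 0x0A = EXECUTE
    for value in values:
        if value is None:
            raise ValueError("Null values are not supported inside collections")
        stream.write(struct.pack(">i", len(value)) + value)
    end = stream.tell()
    stream.seek(start + 5)  # the length field sits after version, flags, stream id, opcode
    stream.write(struct.pack(">i", end - start - HEADER.size))
    stream.seek(end)


def server_seen_versions(data):
    """Walk the byte stream the way a server would and collect each header's version byte."""
    versions, pos = [], 0
    while pos + HEADER.size <= len(data):
        version, _flags, _stream_id, _opcode, length = HEADER.unpack_from(data, pos)
        versions.append(version)
        if version not in SUPPORTED_VERSIONS:
            break  # a real server would reject the connection's frame here
        pos += HEADER.size + length
    return versions


def demo():
    buf = io.BytesIO()
    try:
        write_frame(buf, 1, [b"ok", None])  # fails after the header and one value are written
    except ValueError:
        pass  # the caller sees the serializer error; the stream is left dirty
    write_frame(buf, 2, [b"ok"])  # a perfectly valid request written next
    return server_seen_versions(buf.getvalue())


if __name__ == "__main__":
    print(demo())  # [4, 0]
```

In this sketch the first header still carries the placeholder length of 0, so the reader treats the leftover body bytes as the next header, and the first of those bytes (the high byte of a small length prefix) is 0. That would be consistent with the "version (0)" in the error, though the exact byte layout in the real driver may differ.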

I've created a small piece of code to reproduce this:

Sorry that it's written using some of our in-house helpers (ICommand, dependency injection, ICassandraRepository<T>). I can rewrite it without these things if needed, but I'm hoping this is enough to illustrate the problem.

Most importantly this call:

is essentially a call to

This reproduces the error. I will attach both the console output and a Wireshark capture for your inspection.

Environment

Client:
Windows 10 (v1809 OS Build 17763.503)
dotnet --version = 2.2.101

Cassandra:
Docker image https://hub.docker.com/_/cassandra tag 3.11.3
running under Docker Desktop v2.0.0.3 (31259) build 8858db3

Activity

Collin Sauve
June 8, 2019, 4:16 AM

This seems to fix it: https://github.com/datastax/csharp-driver/pull/457
If there is an error writing a frame, back up the stream to where we started writing the frame.
I can't repro the protocol version error with that code change.
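The idea behind the fix can be sketched as follows. This is a hedged illustration in Python, not the code from the pull request (which is not reproduced in this ticket): write_frame_safely, good_frame and failing_frame are hypothetical names, and an in-memory stream stands in for the driver's write buffer.

```python
import io


def write_frame_safely(stream, write_body):
    """Remember where the frame starts; on any error, rewind and drop the partial bytes."""
    start = stream.tell()
    try:
        write_body(stream)
    except Exception:
        stream.seek(start)  # back up to where this frame began
        stream.truncate()   # discard whatever was written for the failed frame
        raise               # the caller still sees the original error


def good_frame(payload):
    """Build a writer that emits the payload without failing."""
    def writer(stream):
        stream.write(payload)
    return writer


def failing_frame(stream):
    """Writer that emits part of a frame and then fails, like the serializer error."""
    stream.write(b"partial")
    raise ValueError("Null values are not supported inside collections")


def demo():
    buf = io.BytesIO()
    write_frame_safely(buf, good_frame(b"frame-1"))
    try:
        write_frame_safely(buf, failing_frame)
    except ValueError:
        pass  # the failed request is reported, but leaves no bytes behind
    write_frame_safely(buf, good_frame(b"frame-2"))
    return buf.getvalue()


if __name__ == "__main__":
    print(demo())  # b'frame-1frame-2'
```

With the rewind in place, the failed request contributes nothing to the stream, so the next frame starts exactly where the server expects a header.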

Collin Sauve
June 8, 2019, 4:16 AM

Ah, never mind, I see in your cassandra.log file that you managed to reproduce it. Were you running this against the same Cassandra cluster that is being used by your production servers? I assume not, but I wanted to ask either way.

No. When I reproduce this locally I'm hitting the Apache Cassandra 3.11 Docker image. In production we're hitting DSE 5.1.14.

Joao Reis
June 8, 2019, 4:19 AM

Good news: I can reproduce it with CCM locally as well. It also affects 3.9.0, so this bug seems to have been here for a long time. I will try to reproduce it with your code change now.

Joao Reis
June 8, 2019, 4:49 AM

More good news: I can't reproduce the bug with your PR. Looking at the code, it seems pretty obvious what's happening. Thank you very much for the bug report. I'll post an update on this ticket soon.

Joao Reis
June 13, 2019, 3:24 AM

added a link to because it seems to be the same issue

Fixed

Assignee: Unassigned
Reporter: Collin Sauve
Labels:
Reproduced in: 3.9.0, 3.10.0
PM Priority: None
Fix versions:
External issue ID: None
Doc Impact: None
Reviewer: None
Epic Link: None
Sprint: C# P-NEXT
Pull Requests: None
Size: None
Components:
Affects versions:
Priority: Critical