'Adjusted frame length exceeds' error breaks driver's ability to properly read data

Description

Resolution

When this exception is encountered, the request future now fails with a FrameTooLongException and the request is not retried. The connection is no longer closed as it was before; instead, the bytes on the connection are discarded up to the frame length, and the connection remains open and usable.
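Because the request is no longer retried automatically, client code decides how to react. One option raised in the discussion on this ticket is to lower the fetch size and try again. The sketch below illustrates that idea only: FetchSizeBackoff, executeWithBackoff, and the nested FrameTooLongException are hypothetical stand-ins written for this example, not driver classes, and the "query" is a fake that fails for large pages.

```java
import java.util.function.IntFunction;

public class FetchSizeBackoff {

    /** Hypothetical stand-in for the driver's exception; NOT the real driver class. */
    public static class FrameTooLongException extends RuntimeException {
        public FrameTooLongException(String message) {
            super(message);
        }
    }

    /**
     * Runs the query with the given fetch size, halving the fetch size each time
     * the response frame is reported as too long. Gives up (rethrows) once the
     * next fetch size would drop below minFetchSize.
     */
    public static <T> T executeWithBackoff(IntFunction<T> query, int initialFetchSize, int minFetchSize) {
        int fetchSize = initialFetchSize;
        while (true) {
            try {
                return query.apply(fetchSize);
            } catch (FrameTooLongException e) {
                if (fetchSize / 2 < minFetchSize) {
                    throw e; // cannot shrink further; surface the error to the caller
                }
                fetchSize /= 2;
            }
        }
    }

    public static void main(String[] args) {
        // Fake query (assumption for the demo): any page above 1000 rows is "too large".
        IntFunction<String> fakeQuery = size -> {
            if (size > 1000) {
                throw new FrameTooLongException("page of " + size + " rows is too large");
            }
            return "fetched with fetch size " + size;
        };
        System.out.println(executeWithBackoff(fakeQuery, 5000, 100));
    }
}
```

With the real driver, the lambda would presumably set the fetch size on the statement (for example via Statement#setFetchSize) before executing it; that wiring is left out here to keep the sketch self-contained.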

Initial Report

If a Cassandra node is configured with native_transport_max_frame_size_in_mb > 256 and the driver reads a frame larger than 256 MB, it throws an exception:

This breaks the driver's ability to read subsequent packets, since the Decoder for parsing frames is static (code):

When a LengthFieldBasedFrameDecoder encounters a frame that is too large, it throws away all data until the remaining bytes for that frame are consumed. Since the decoder is shared among all connections, it keeps parsing and throwing away bytes, whichever connection they arrive on, until the oversized frame's length is fully consumed.

To fix this, initialize a separate DecoderForStreamIdSize for each Decoder.
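The failure mode can be modelled without Netty. The following is a minimal sketch, not the driver's or Netty's code: ToyFrameDecoder is a hypothetical decoder for a made-up protocol (a one-byte length prefix followed by the payload) that, like the decoder described above, remembers how many bytes of an oversized frame it still has to discard. Sharing one instance between two connections makes the second connection lose a perfectly valid frame; giving each connection its own instance does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SharedDecoderDemo {

    /** Toy length-prefixed decoder (hypothetical); only mimics the "discard" state. */
    public static final class ToyFrameDecoder {
        private final int maxFrameLength;
        private long bytesToDiscard = 0; // state that survives between feed() calls

        public ToyFrameDecoder(int maxFrameLength) {
            this.maxFrameLength = maxFrameLength;
        }

        /**
         * Decodes the frames in one chunk. Assumes for simplicity that every
         * frame within the size limit arrives complete within a single chunk.
         */
        public List<String> feed(byte[] chunk) {
            List<String> frames = new ArrayList<>();
            int pos = 0;
            if (bytesToDiscard > 0) {
                // Still skipping the remainder of an earlier oversized frame.
                int dropped = (int) Math.min(bytesToDiscard, chunk.length);
                bytesToDiscard -= dropped;
                pos = dropped;
            }
            while (pos < chunk.length) {
                int length = chunk[pos] & 0xFF;
                pos++;
                if (length > maxFrameLength) {
                    // A real decoder would also raise an error here; the toy only
                    // records how many payload bytes remain to be thrown away.
                    int dropped = Math.min(length, chunk.length - pos);
                    bytesToDiscard = length - dropped;
                    pos += dropped;
                    continue;
                }
                frames.add(new String(chunk, pos, length, StandardCharsets.US_ASCII));
                pos += length;
            }
            return frames;
        }
    }

    /** Returns what connection B decodes after connection A saw an oversized frame. */
    public static List<String> readOnSecondConnection(boolean shareDecoder) {
        ToyFrameDecoder decoderA = new ToyFrameDecoder(10);
        ToyFrameDecoder decoderB = shareDecoder ? decoderA : new ToyFrameDecoder(10);

        // Connection A: header announces 100 bytes, only 4 payload bytes have arrived.
        decoderA.feed(new byte[] {100, 'x', 'x', 'x', 'x'});

        // Connection B: a valid 5-byte frame.
        return decoderB.feed(new byte[] {5, 'h', 'e', 'l', 'l', 'o'});
    }

    public static void main(String[] args) {
        System.out.println("shared decoder:         " + readOnSecondConnection(true));
        System.out.println("per-connection decoder: " + readOnSecondConnection(false));
    }
}
```

In the shared case the second connection's frame is swallowed as part of the first connection's pending discard, which mirrors why the fix gives each decoder its own state.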

Test case to reproduce:

Note: Also add envMap.put("CCM_MAX_HEAP_SIZE", "1G"); at line 180 of CCMBridge to give the CCM nodes more memory, and bump logging to DEBUG. You'll notice that the exception is encountered and that reconnect attempts time out.

Environment

None

Pull Requests

None

Activity

Andy Tolbert 
September 13, 2016 at 11:05 PM

Sounds good. We have a proof of concept that does that now in a pull request; we'll aim to make it behave that way for 3.0.4.

Vishy Kasar 
September 13, 2016 at 10:49 PM

Yes.

I think it is pointless to send it to RetryPolicy, where the request may be retried on another host.

Throwing the error to user code is the right thing IMO. If the code had already done it, we would have found the root cause for the current issue some time ago.

Andy Tolbert 
September 13, 2016 at 10:34 PM

Even with this issue fixed (where the decoder breaks all parsing), the driver would still defunct (close) the connection and retry the request on the next host, where it would likely raise the same exception, retry on the next host, and repeat that process.

It probably isn't good that the request is retried on the next host as it is very likely that the same exception would be raised again. So instead of closing the connection and retrying, it would now raise a new specialized DriverException named TooLongFrameException which has netty's TooLongFrameException as the cause. The connection will then skip the next bytes up to the frame length and continue to function (this is handled for us by Netty's LengthFieldBasedFrameDecoder).

We could let the user decide what to do by passing the exception to RetryPolicy#onRequestError, but the current behavior of DefaultRetryPolicy#onRequestError is to retry on the next host which is not a good default for this particular error I think.

It seems like it could be more appropriate that this kind of error be handled at the client level instead of RetryPolicy. i.e., the user could handle the exception and react by decreasing the fetch size on their query and retrying.

Does that sound like the right thing to do?

Andy Tolbert 
September 13, 2016 at 2:08 AM

It is a good idea to add a regression test to catch this kind of error.

Good call. I've included an integration test in the PR that exercises this scenario and ensures that the error is handled, the host connection is re-established, and queries can be made.

Andy Tolbert 
September 13, 2016 at 1:54 AM

Opened up CASSANDRA-12630 to request restricting the outbound frame size in C* server.

Fixed

Created September 12, 2016 at 11:51 PM
Updated September 22, 2016 at 5:23 PM
Resolved September 18, 2016 at 9:54 PM