DriverTimeoutException should trigger retry policy

Description

In 3.x, you could implement RetryPolicy#onRequestError, which was invoked when the driver encountered an OperationTimedOutException (i.e. Cassandra didn't respond in time).

As far as I can tell, in 4.x there is no way to retry automatically on a DriverTimeoutException (the equivalent of OperationTimedOutException). Note that this is different from ReadTimeoutException and WriteTimeoutException, where the coordinator responds and says that it couldn't achieve quorum within its timeout; here we're talking specifically about the coordinator not responding in time. Retrying only idempotent requests would probably be fine too.

Was this functionality intentionally not ported? A quick search didn't turn up any mention in the guide, but the code clearly goes straight to setFinalError, bypassing the retry policy entirely. This is fairly important to us for a couple of reasons:

1. We have some low-latency use cases where we fail fast and retry on another node if one node is a little sluggish e.g. due to GC (we don't currently use speculative execution, but we do want to explore that at some point)
2. For cases where there is a network hiccup or the remote node dies unexpectedly, any outstanding requests from the client fail with a timeout. Without the ability to retry automatically, the failure bubbles up to the caller.

Environment

None

Pull Requests

None

Activity


Ammar Khaku
February 24, 2021 at 2:58 AM

Thank you! Please feel free to close as “By Design” or “Won’t Fix” or similar.

Alex Dutra
February 22, 2021 at 10:21 AM

Was this functionality intentionally not ported?

Yes. The main motivation was that in driver 3.x, it was impossible to comply with strict SLAs: if your application has an SLA of at most 5 seconds per request, and a request that keeps timing out after 5 seconds is attempted 5 times, the whole query could take up to 25 seconds to complete.

Instead, driver 4.x introduces the notion of a global client-side timeout: if you set that timeout to 5 seconds, no query will ever take more than 5 seconds to complete.
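To make the arithmetic concrete, here is a minimal sketch comparing the two worst cases. It does not use the driver at all: the class and method names are hypothetical, and the figures (a 5-second timeout, five attempts) are simply the ones from the example above.

```java
import java.time.Duration;

public class TimeoutBudget {

    // 3.x-style behaviour: each attempt gets its own timeout,
    // so the worst case grows linearly with the number of attempts.
    static Duration worstCasePerAttempt(Duration perAttemptTimeout, int attempts) {
        return perAttemptTimeout.multipliedBy(attempts);
    }

    // 4.x-style behaviour: a single global timeout bounds the whole request,
    // no matter how many attempts happen inside it.
    static Duration worstCaseGlobal(Duration globalTimeout, int attempts) {
        return globalTimeout;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(5);
        System.out.println("per-attempt timeout, 5 attempts: "
                + worstCasePerAttempt(timeout, 5).getSeconds() + "s");
        System.out.println("global timeout, 5 attempts: "
                + worstCaseGlobal(timeout, 5).getSeconds() + "s");
    }
}
```

With a per-attempt timeout the caller can wait up to 25 seconds, whereas a global timeout keeps the wait at 5 seconds however many attempts are made.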

We have some low-latency use cases where we fail fast and retry on another node if one node is a little sluggish e.g. due to GC (we don't currently use speculative execution, but we do want to explore that at some point)

Indeed, the alternative in driver 4.x would be speculative executions; that’s exactly the best use case for them. Please give it a try.
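For anyone landing here, a rough sketch of what that could look like in the driver 4.x configuration file (application.conf) follows. The values are illustrative assumptions, not recommendations, and should be tuned to your own latency profile. Note that speculative executions only apply to statements marked idempotent, either per statement or through the default shown here.

```
datastax-java-driver {
  basic.request {
    # Global client-side timeout: upper bound for the whole request.
    timeout = 5 seconds
    # Assumption: only enable this if all your statements really are idempotent.
    default-idempotence = true
  }
  advanced.speculative-execution-policy {
    class = ConstantSpeculativeExecutionPolicy
    # Total executions, including the initial one (illustrative value).
    max-executions = 3
    # How long to wait for a node before trying the next one (illustrative value).
    delay = 100 milliseconds
  }
}
```

With settings along these lines, a sluggish node triggers an extra execution on another node after the delay, while the overall request is still bounded by the global timeout.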


Not a Problem

Details

Assignee

Unassigned

Reporter

Ammar Khaku

Priority

Critical

Created February 20, 2021 at 12:50 AM

Updated February 25, 2021 at 10:06 AM

Resolved February 25, 2021 at 10:06 AM