Latency tracker does not distinguish between success and failure

Description

The LatencyTracker interface has a single call back method that reports latency for a given host.
public void update(Host host, long newLatencyNanos);

This method gets called both on success and failure. However the call back does not distinguish between success and failure. Due to this, a node that fails fast would appear as super-responsive to the latency tracker.

Can we enhance the interface to pass the success/fail information?

Environment

None

Pull Requests

None

Activity

Show:
Olivier Michallat
March 24, 2015, 5:23 PM
Edited

To be clear, I'm not saying the latency-aware policy should ignore all failed requests, it should take a decision based on the type of error (for instance, a client timeout should still be counted, a CQL syntax error probably shouldn't). We've not addressed that yet, I'll add a note in the pull request.

Vishy Kasar
March 24, 2015, 5:25 PM

Sounds good Olivier. Can the fix version be updated to 2.0.10?

Olivier Michallat
March 26, 2015, 11:12 AM

This would make sense for errors that are caught early Cassandra-side (and therefore yield very low latencies that can skew measurements). I think the following fall into that category: UNAVAILABLE, OVERLOADED, IS_BOOTSTRAPPING, UNPREPARED.
In the general case I would lean on the conservative side and keep recording latency.

Sylvain Lebresne
March 26, 2015, 1:17 PM

I think the following fall into that category: UNAVAILABLE, OVERLOADED, IS_BOOTSTRAPPING, UNPREPARED.

Sounds reasonable, but I would tend to agree that invalid requests should probably also be ignored (but is sounded like you've changed your mind on that so...). For OVERLOADED, it almost sounds like to me we should penalize the node by adding a fake large latency, but I'm not sure how to make that work in practice (without risking to over or under penalize in some situation) so maybe ignoring is good enough. As for IS_BOOTSTRAPPING, it would kind of be a bug in the driver that we get this so I guess whether we ignore or not doesn't matter too much.

Andy Tolbert
March 31, 2015, 8:22 PM

The test cases added in LatencyAwarePolicyTest cover this logic pretty well. Also validated manually with LatencyAwarePolicy and ensured updates with UnavailableExceptions were not considered.

Fixed

Assignee

Alexandre Dutra

Reporter

Vishy Kasar

Labels

None

PM Priority

None

Reproduced in

None

Affects versions

None

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

External issue ID

None

Priority

Major
Configure