The LatencyTracker interface has a single callback method that reports latency for a given host.
public void update(Host host, long newLatencyNanos);
This method is called on both success and failure, but the callback does not distinguish between the two. As a result, a node that fails fast appears super-responsive to the latency tracker.
Can we enhance the interface to pass the success/fail information?
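One way to pass that information along is to add the failure cause to the callback. The sketch below is purely illustrative (the parameter name and the stub Host class are assumptions, not the driver's actual API): a null error means success, and the tracker can then decide per error type whether to record the sample.

```java
// Stub standing in for the driver's Host class, just to keep the sketch
// self-contained.
class Host {
    final String address;
    Host(String address) { this.address = address; }
}

// Hypothetical enhanced callback: the error argument is null on success and
// carries the failure cause otherwise.
interface LatencyTracker {
    void update(Host host, Exception error, long newLatencyNanos);
}

public class Main {
    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        // A toy tracker that only records successful updates.
        LatencyTracker tracker = (host, error, latencyNanos) -> {
            if (error == null) {
                log.append(host.address).append(":").append(latencyNanos);
            }
        };
        tracker.update(new Host("10.0.0.1"), null, 1200L);
        tracker.update(new Host("10.0.0.2"), new RuntimeException("failed fast"), 5L);
        // Only the successful update was recorded.
        System.out.println(log);
    }
}
```

With this shape, a fast failure no longer pollutes the tracker's view of a node's responsiveness.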
To be clear, I'm not saying the latency-aware policy should ignore all failed requests; it should make a decision based on the type of error (for instance, a client timeout should still be counted, but a CQL syntax error probably shouldn't). We haven't addressed that yet; I'll add a note in the pull request.
Sounds good Olivier. Can the fix version be updated to 2.0.10?
This would make sense for errors that are caught early Cassandra-side (and therefore yield very low latencies that can skew measurements). I think the following fall into that category: UNAVAILABLE, OVERLOADED, IS_BOOTSTRAPPING, UNPREPARED.
In the general case I would lean on the conservative side and keep recording latency.
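That policy could be sketched as a simple ignore set over the error codes listed above (the enum and method names here are illustrative assumptions, not driver types): record latency by default, and skip only the errors Cassandra raises early.

```java
import java.util.EnumSet;
import java.util.Set;

public class Main {
    // Hypothetical error codes mirroring the ones discussed in this thread.
    enum ErrorCode { UNAVAILABLE, OVERLOADED, IS_BOOTSTRAPPING, UNPREPARED,
                     CLIENT_TIMEOUT, SYNTAX_ERROR }

    // Errors caught early Cassandra-side, whose very low latencies would
    // skew the measurements.
    static final Set<ErrorCode> IGNORED =
        EnumSet.of(ErrorCode.UNAVAILABLE, ErrorCode.OVERLOADED,
                   ErrorCode.IS_BOOTSTRAPPING, ErrorCode.UNPREPARED);

    // Conservative default: record latency unless the error is in the
    // ignore set. A null error means the request succeeded.
    static boolean shouldRecord(ErrorCode error) {
        return error == null || !IGNORED.contains(error);
    }

    public static void main(String[] args) {
        System.out.println(shouldRecord(null));                     // success
        System.out.println(shouldRecord(ErrorCode.CLIENT_TIMEOUT)); // still counted
        System.out.println(shouldRecord(ErrorCode.UNAVAILABLE));    // ignored
    }
}
```

Anything not explicitly listed keeps being recorded, which matches the conservative stance above.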
Sounds reasonable, but I would tend to agree that invalid requests should probably also be ignored (though it sounded like you've changed your mind on that, so...). For OVERLOADED, it almost sounds to me like we should penalize the node by adding a fake large latency, but I'm not sure how to make that work in practice (without risking over- or under-penalizing in some situations), so maybe ignoring is good enough. As for IS_BOOTSTRAPPING, receiving it would essentially be a driver bug, so whether we ignore it or not doesn't matter too much.
The test cases added in LatencyAwarePolicyTest cover this logic pretty well. Also validated manually with LatencyAwarePolicy that updates carrying UnavailableExceptions are not recorded.