TokenAwarePolicy should respect child policy ordering
Chaining LatencyAwarePolicy and TokenAwarePolicy has different downsides depending on the order of chaining:
LatencyAware(TokenAware(..)) - if all three replicas for a given partition are beyond exclusionThreshold, they will be sent to the back of the line by LatencyAwarePolicy, meaning we will end up coordinating through a non-replica coordinator. That coordinator still has to wait on CL replicas, so we incur an unnecessary hop.
TokenAware(LatencyAware(..)) - TokenAware puts the replicas for the partition at the front of the line, regardless of what the childPolicy has decided. So for practical purposes, this chaining is the same as TokenAware(..), excluding LAP completely.
In practice what would be ideal is to prefer local replicas to all other coordinators, but within those replicas prefer non-slow nodes. One approach to achieving this would be for TokenAwarePolicy to still prefer local replicas, but retain their ordering from the childPolicy.
For example, if we have nodes A,B,C,D,E,F and C,D,E are replicas for a partition. Let's say B & C are considered slow nodes by LatencyAware (that is, they are beyond exclusionThreshold). LAP would return a query plan like A,D,E,F,B,C. In the current implementation, TokenAware would prepend C,D,E giving us C,D,E,A,F,B. The proposed implementation in this ticket would give us D,E,C,A,F,B: as a replica, C would still be preferred over A,F,B, but D & E would be preferred first.
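The merge described above could be sketched as follows. This is a minimal illustration with plain strings standing in for driver Host objects; the method name and list-based signature are assumptions, not the actual driver API:

```java
import java.util.*;

// Sketch of the proposed TokenAwarePolicy behavior: replicas found in the
// child policy's query plan are moved to the front, but their relative
// ordering from the child plan is retained.
public class ReplicaOrderingDemo {

    // childPlan: the query plan produced by the child policy (e.g. LAP).
    // replicas: the replica set for the statement's partition key.
    static List<String> newQueryPlan(List<String> childPlan, Set<String> replicas) {
        List<String> replicasInChildOrder = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (String host : childPlan) {
            if (replicas.contains(host)) {
                replicasInChildOrder.add(host);   // keep child's ordering
            } else {
                others.add(host);
            }
        }
        replicasInChildOrder.addAll(others);      // non-replicas follow, also in child order
        return replicasInChildOrder;
    }

    public static void main(String[] args) {
        // LAP plan from the example above: A,D,E,F,B,C; replicas are C,D,E.
        List<String> plan = newQueryPlan(
                Arrays.asList("A", "D", "E", "F", "B", "C"),
                new HashSet<>(Arrays.asList("C", "D", "E")));
        System.out.println(plan); // [D, E, C, A, F, B]
    }
}
```

With the current implementation the same inputs would yield C,D,E,A,F,B; retaining the child ordering pushes the slow replica C behind D and E while still preferring it to all non-replicas.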
Given that users tend to make DCAwareRoundRobinPolicy the innermost child policy, when retaining child ordering the shuffle option in TokenAware is redundant, at least in terms of balancing requests across replicas.
So I propose that the retain-ordering behavior be the only option in TokenAware, replacing the two behaviors that exist today.
The only situation where it might make sense is when ordering replicas between them, for the reasons I explained above.
My quotes around DCAwareOrderByIp were intentional, but not intuitive. I was using IP as an obvious indicator of determinism, but you can replace that with anything that's deterministic. My point was that determinism could (and maybe should?) be achieved in a child policy below TAP rather than adding complexity to TAP.
but maybe we should keep the shuffleReplicas option in TAP for users that want to override the child policy ordering.
I don't follow how this would work. The behavior in this patch is essentially shuffle=false. If this option is intended to support the deterministic case, there is no other value for a shuffle argument that would achieve that.
Re DCAwareOrderByIP: I actually don't see any benefit in ordering hosts in any deterministic fashion; besides, "lesser" hosts (lexicographically speaking) would likely get hot-spotty. The only situation where it might make sense is when ordering replicas between them, for the reasons I explained above.
So I'm actually -1 on the idea of a fully-fledged DCAwareOrderByIP, but maybe we should keep the shuffleReplicas option in TAP for users that want to override the child policy ordering.
If there was something like "DCAwareOrderByIP" then TokenAware(DCAwareOrderByIP) would give you the same effect as shuffle=false
There is no such LBP, although we do use something similar in our tests. Indeed a deterministic child policy would bring us back the shuffle=false effect.
This means that it is not possible anymore to mimic the previous behavior when shuffleReplicas = false.
Is DCAwareRoundRobin the only non-chaining child policy available for DC isolation? It seems the round robin ordering from that child policy is the root of why we can't get a specific ordering. If there was something like "DCAwareOrderByIP" then TokenAware(DCAwareOrderByIP) would give you the same effect as shuffle=false. And imo, would do this more intuitively than TokenAware(DCAwareRoundRobin) does today, where one load balancing policy's ordering is insignificant.
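A deterministic child policy of the kind discussed here could be sketched as follows. This is hypothetical: "DCAwareOrderByIP" is not an existing driver policy, and sorting hosts lexicographically by address is just one obvious deterministic criterion; any other deterministic ordering would do:

```java
import java.util.*;

// Hypothetical "DCAwareOrderByIP"-style child policy: within the local DC,
// order hosts deterministically instead of round-robining. Chained under a
// TokenAwarePolicy that retains child ordering, every query plan for a given
// partition starts with the same replica, reproducing the shuffle=false effect.
public class OrderByAddressDemo {

    // localDcHosts: the hosts of the local datacenter (strings stand in
    // for driver Host objects in this sketch).
    static List<String> queryPlan(Collection<String> localDcHosts) {
        List<String> plan = new ArrayList<>(localDcHosts);
        plan.sort(Comparator.naturalOrder()); // deterministic, not load-balanced
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(queryPlan(Arrays.asList("10.0.0.3", "10.0.0.1", "10.0.0.2")));
        // [10.0.0.1, 10.0.0.2, 10.0.0.3]
    }
}
```

The trade-off raised in the discussion applies: lexicographically "lesser" hosts would attract disproportionate coordinator traffic, which is why a deterministic ordering arguably only makes sense for ordering replicas relative to each other, not the whole ring.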
TAP now honors the underlying ordering in all cases. This means that it is not possible anymore to mimic the previous behavior when shuffleReplicas = false. However, as this SO question suggests, there are situations where it might be desirable to always hit the primary replica:
Mimic strong consistency with CL = ONE (because reads and writes all go to the same node);
Favor a specific availability zone in AWS deployments.