Currently, the driver creates all core connections on every request processor (I/O thread). This works well for smaller clusters, but for larger clusters it inhibits coalescing and consumes more connection resources than necessary. An alternative model could distribute the core connections across all the request processors, which could better leverage both the core connections and the allocated I/O threads.
To accomplish this, the request processors need to know which processors are handling which connections. There needs to be a mapping from the host (`Host` or `Address`) to the request processor(s) (`RequestProcessor`, or a list of them) that handle its connections. I believe the mapping itself can be distributed and updated asynchronously.
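A minimal sketch of what that mapping could look like, with `Host` and the mapping shape as illustrative assumptions rather than the driver's actual types:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for the driver's Host/Address type."""
    address: str

@dataclass
class ProcessorMapping:
    # Host address -> list of request processor ids that own its connections.
    by_host: dict = field(default_factory=dict)

    def processors_for(self, host):
        # Request path lookup: which processor(s) can execute against this host?
        return self.by_host.get(host.address, [])

    def assign(self, host, processor_ids):
        # Called when connections for a host are (re)distributed.
        self.by_host[host.address] = list(processor_ids)
```

A lookup miss returns an empty list, which the caller would treat as "no connections for this host yet".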
The initial mapping would be created (and maintained) by the `Session` object and provided to the request processors when they are initialized. The mapping would then be incrementally updated, via event loop tasks, as hosts are added to or removed from the cluster. In this way, the request path would require no locking to access the local mapping on each request processor.
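The lock-free property comes from each processor holding a private copy of the mapping that is only ever mutated by tasks run on that processor's own event loop thread. A sketch of that update scheme, with all names hypothetical:

```python
import queue

class RequestProcessor:
    def __init__(self, initial_mapping):
        # Private copy: only this processor's loop thread reads or writes it,
        # so the request path needs no locking to consult it.
        self.mapping = dict(initial_mapping)
        self.tasks = queue.Queue()

    def post_host_added(self, address, processor_ids):
        # Called from the Session; the mutation is deferred to the loop thread.
        self.tasks.put(lambda: self.mapping.__setitem__(address, processor_ids))

    def post_host_removed(self, address):
        self.tasks.put(lambda: self.mapping.pop(address, None))

    def run_pending_tasks(self):
        # Drained from the processor's own event loop.
        while not self.tasks.empty():
            self.tasks.get()()
```

Only the hand-off queue needs to be thread-safe; the mapping itself is touched by a single thread.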
Request processors themselves will now need to handle queues of both `RequestHandler`s and `RequestExecution`s. When a `RequestHandler` is processed, the processor would run the load balancing policy for the request, then find the request processor(s) that handle the specific host returned by the policy. The processor would then queue a `RequestExecution` to handle the request on the correct processor. The hard part is correctly balancing the processing of the `RequestHandler` and `RequestExecution` queues; I suspect the `RequestExecution` queue would take priority. A `RequestHandler` is bound to a single processor for the duration of a request's processing; however, it can have executions outstanding on several different processors. For this reason, request handlers will need some low-contention synchronization, e.g. a mutex that protects the load balancing plan and the set of outstanding executions. The request's execution timer will continue to run on the initial processor.
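The dual-queue scheduling described above could be sketched as follows; the execution-first policy and all names are assumptions from this proposal, not a final design:

```python
from collections import deque
import threading

class RequestHandler:
    def __init__(self, plan):
        # Low-contention lock protecting the load balancing plan and the
        # set of outstanding executions, which may span several processors.
        self.lock = threading.Lock()
        self.plan = deque(plan)      # hosts produced by the load balancing policy
        self.outstanding = set()     # executions in flight on other processors

class DualQueueProcessor:
    def __init__(self):
        self.handlers = deque()      # new requests awaiting dispatch
        self.executions = deque()    # executions routed here for a specific host
        self.processed = []

    def process_one(self):
        # Executions take priority: they complete in-flight requests, while a
        # handler only starts new work on (possibly) another processor.
        if self.executions:
            self.processed.append(("execution", self.executions.popleft()))
        elif self.handlers:
            self.processed.append(("handler", self.handlers.popleft()))
```

Strict execution-first draining risks starving the handler queue under sustained load, so the real policy would likely need some fairness bound.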
A disadvantage of this approach is that most requests would need to be processed by two separate threads. We can optimize the single I/O thread case and cases where the mapping leads back to the same processor, but these cover only a subset of requests.
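The same-processor fast path could be as simple as checking the target before queuing; this tiny sketch (illustrative names only) shows the shape of that check:

```python
def dispatch(current_id, target_id, run_inline, queue_to):
    """Run the execution inline when the load balancing policy lands on the
    current processor; otherwise hand it off to the target processor."""
    if target_id == current_id:
        return run_inline()       # single-thread fast path, no cross-thread hop
    return queue_to(target_id)    # cross-processor hand-off (second thread)
```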