We should automate throughput-based benchmarks mainly for regression testing, so we can compare two different branches/commits of the driver and look for regressions or possible perf improvements.
For these benchmarks, we would need 2 things:
A project / tool to test the maximum throughput at different numbers of in-flight requests. Ideally, we should be able to configure different CQL workloads and driver settings.
A test setup using our automated provisioning tool to launch it.
For the project/tool, you can look at the node.js driver one as an example: https://github.com/riptano/nodejs-driver-sut or reuse whatever we have.
If I understand correctly, this (and 99) relates to latency-based benchmarks, to see how the driver behaves when the server is under (artificial) stress.
The purpose of this ticket is to automate throughput-based tests: what’s the maximum throughput you can get from the client side, with different numbers of in-flight requests and a healthy cluster. This is the type of testing we’ve historically used across all drivers to look for driver performance regressions. For example, you can look at the historic results here: https://datastax.jira.com/wiki/spaces/DRIV/pages/101122057/Benchmarks
It’s currently a general performance harness. I’ve got flags in the fallout test to enable/disable CPU burn (and I want to add network latency soon), so it can be used to measure latency regardless of what the server is doing.
I also have this in the backlog, though I think it seeks to measure something a little different from the max throughput tests I’m seeing on that benchmarks page. When I get to implementing it, I might want to chat about the differences in our approaches to make sure I’m not missing anything.
Note that the technique that should be used to identify throughput levels on the driver side is different: what remains constant is the number of in-flight requests (with steady memory usage), as detailed here:
[…] launch a fixed number of asynchronous operations using the concurrency level as the maximum. As each operation completes, add a new one. This ensures your application's asynchronous operations will not exceed the concurrency level.
Ideally, a single client with a simple workload should not be able to saturate a multi-node cluster. With this approach, the driver will yield the maximum throughput when you find the optimal concurrency level (according to hw/cluster size/…). This is not intended to get a “number” per branch/release; it is to prove that the driver is on par with its previous version (or has improved) and that we didn’t introduce a regression (compare X vs Y).
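To make the technique concrete, here is a minimal sketch of that fixed-concurrency pattern in Python with asyncio. All names (`CONCURRENCY`, `TOTAL_REQUESTS`, `execute_query`) are illustrative; in a real harness `execute_query` would be an async call into the driver under test.

```python
import asyncio
import time

CONCURRENCY = 128        # fixed number of in-flight requests (illustrative value)
TOTAL_REQUESTS = 10_000  # total operations for the benchmark run

async def execute_query() -> None:
    """Stand-in for an asynchronous driver call."""
    await asyncio.sleep(0)  # simulate a (near-instant) server round trip

async def run_benchmark() -> float:
    """Keep at most CONCURRENCY operations in flight; as each completes, launch another."""
    started = 0
    pending = set()
    t0 = time.perf_counter()
    # Prime the pump with the first CONCURRENCY operations.
    while started < min(CONCURRENCY, TOTAL_REQUESTS):
        pending.add(asyncio.ensure_future(execute_query()))
        started += 1
    # Refill: each completed operation is replaced by a new one,
    # so in-flight requests never exceed the concurrency level.
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for _ in done:
            if started < TOTAL_REQUESTS:
                pending.add(asyncio.ensure_future(execute_query()))
                started += 1
    elapsed = time.perf_counter() - t0
    return TOTAL_REQUESTS / elapsed  # throughput in requests per second

throughput = asyncio.run(run_benchmark())
print(f"{throughput:.0f} req/s at concurrency {CONCURRENCY}")
```

Sweeping `CONCURRENCY` across a range of values and recording the throughput at each level is what lets you find the optimal concurrency level for a given branch and compare it against the previous one.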
I see. Thanks, that helps. Do you think there is also value in measuring max throughput under constant QPS?
Could be. I think it’s more valuable from the server perspective, as in “how much pressure can I sanely handle”. The server can become unstable for a number of reasons that are independent of the client (e.g., background tasks like compaction or GC).
Measuring requests per second at different concurrency levels has proven successful at surfacing client-level regressions. This way the server is under a steady level of pressure (there are no peaks of in-flight requests because the server is behind), as a new request is only issued after a previous one is fulfilled.