~40% drop in performance when upgrading from 4.3 to 4.5

Description

In Akka we have some basic perf tests for our persistence library, which is built on top of the Cassandra driver.

We noticed a roughly 40% drop in write throughput when upgrading the driver from 4.3 to 4.5.

I ran git bisect on the driver to find the commit that introduced the regression:

git bisect good
24757424b70b3e7bd889e94e8d1acf313ba70fec is the first bad commit
commit 24757424b70b3e7bd889e94e8d1acf313ba70fec
Author: olim7t <omichallat+github@gmail.com>
Date: Mon Feb 3 16:22:59 2020 -0800

JAVA-2637: Bump Netty to 4.1.45

I've also confirmed this by running 4.5.0 with Netty overridden to 4.1.39.Final.

Environment

None

Pull Requests

None

Activity

Olivier Michallat
May 11, 2020, 5:02 PM

The fix is merged, last run to validate it on the Akka Persistence test:

I'll proceed to release 4.6.1.

Olivier Michallat
May 6, 2020, 1:17 AM

Upon further investigation, it's not obvious that Netty is to blame, at least not directly. The driver does its own message coalescing in an attempt to limit the number of I/O syscalls; see DefaultWriteCoalescer. If I replace the current implementation with a "no-op" one (flush after every write), 4.1.43 and 4.1.45 are back to the same order of magnitude.

It looks like something changed in the way the event loop handles scheduled tasks, and that doesn't play well with our coalescer implementation. I suspect this line in particular.

I should also mention that both examples are a bit contrived: executing synchronous requests in a loop means that the coalescer will only ever handle 1 write at a time, which doesn't give it a chance to do its job. If I parallelize the load across multiple client threads, the problem immediately goes away. A perf drop on a basic example is still a bad look though, so I will keep investigating to see how we can adapt the coalescer code.
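To illustrate why a synchronous loop gives the coalescer nothing to batch, here is a toy model in plain Java. It is only a sketch: CoalescerSketch, CountingChannel and BatchingCoalescer are hypothetical names, not driver or Netty classes, and the code is not how DefaultWriteCoalescer is implemented. It counts flushes only, and does not model the event loop's task scheduling or any timing.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of write coalescing; NOT the driver's DefaultWriteCoalescer. */
public class CoalescerSketch {

    /** Hypothetical stand-in for a channel; each flush represents one I/O syscall. */
    static final class CountingChannel {
        int writes;
        int flushes;
        void write(String message) { writes++; }
        void flush() { flushes++; }
    }

    /** Queues writes and flushes once per drain, like a task run on the event loop. */
    static final class BatchingCoalescer {
        private final Queue<String> pending = new ArrayDeque<>();
        private final CountingChannel channel;

        BatchingCoalescer(CountingChannel channel) { this.channel = channel; }

        void enqueue(String message) { pending.add(message); }

        /** Writes everything queued so far, then flushes a single time. */
        void drain() {
            if (pending.isEmpty()) {
                return;
            }
            String message;
            while ((message = pending.poll()) != null) {
                channel.write(message);
            }
            channel.flush();
        }
    }

    /** Synchronous loop: the next request is only sent after the previous drain. */
    public static int flushesForSequentialLoad(int requests) {
        CountingChannel channel = new CountingChannel();
        BatchingCoalescer coalescer = new BatchingCoalescer(channel);
        for (int i = 0; i < requests; i++) {
            coalescer.enqueue("req-" + i);
            coalescer.drain(); // only one write is ever pending here
        }
        return channel.flushes;
    }

    /** Concurrent load (assumed): inFlight requests pile up between two drains. */
    public static int flushesForConcurrentLoad(int requests, int inFlight) {
        CountingChannel channel = new CountingChannel();
        BatchingCoalescer coalescer = new BatchingCoalescer(channel);
        for (int i = 0; i < requests; i++) {
            coalescer.enqueue("req-" + i);
            if ((i + 1) % inFlight == 0) {
                coalescer.drain();
            }
        }
        coalescer.drain(); // flush any remainder
        return channel.flushes;
    }

    public static void main(String[] args) {
        System.out.println("sequential flushes: " + flushesForSequentialLoad(1000));
        System.out.println("concurrent flushes: " + flushesForConcurrentLoad(1000, 10));
    }
}
```

In this model the sequential case performs one flush per write, so the coalescer saves nothing over flushing after every write and only adds the cost of going through the queue and the scheduled task. With ten writes pending per drain, the flush count drops by a factor of ten.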

Olivier Michallat
May 5, 2020, 6:02 PM

We have a reproduction in pure Java, see this cassandra-user ML thread.

I'm raising the priority on this, we'll most likely release a patch version in the next few days to downgrade Netty to 4.1.43. We'll also raise an issue with the Netty project.

Alex Dutra
April 13, 2020, 8:44 AM

Any news on this front? We are trying to narrow down the scope of code involved in the regression, but we could use your help (see Olivier’s last comment).

Olivier Michallat
March 10, 2020, 11:08 PM

I will log a Netty issue for this, but I'm trying to come up with a simpler reproducing case.

Ideally I'd like something that involves just the driver, not Akka persistence. However, just executing queries does not easily reproduce the issue; I think it might have to do with the amount of work done in future callbacks.

I'm not very familiar with Akka, could you help me understand the query execution model? From what I've gathered so far:

  • in CassandraLoadTypedSpec, the main iteration is in testThroughput:

  • the processor behavior is implemented (mocked?) in object Processor, the message ends up in this method:

  • from there it gets murky, but I see methods like CassandraJournal.writeMessages, which is probably what ends up executing the query.

Fixed

Assignee

Olivier Michallat

Reporter

Christopher Batey

Reproduced in

4.5.0
4.6.0
4.5.1

Affects versions

Fix versions