Netty upgrade to 4.x

Description

The Java driver currently uses Netty 3.9.

Upgrading to 4.x will bring valuable improvements like buffer pooling.
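
For reference, buffer pooling in Netty 4 is exposed through PooledByteBufAllocator. The driver's actual bootstrap code is not shown in this ticket, so the following is only a sketch of how a Netty 4 client typically opts into the pooled allocator; the bootstrap wiring is illustrative rather than the driver's own.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class PooledAllocatorSketch {
    // Illustrative only: shows where the pooled allocator is plugged in.
    static Bootstrap newBootstrap(NioEventLoopGroup group) {
        return new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                // Netty 4's pooled allocator recycles ByteBuf instances instead of
                // allocating (and garbage-collecting) a fresh buffer per read/write.
                .option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
    }
}
```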

Note: the maven-shade-plugin configuration (introduced by JAVA-538) will have to be updated. In particular, Netty 4 uses a new root package name.
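
The relocation rule itself lives in the pom and is not reproduced here; the fragment below only illustrates the rename that the shade pattern has to track (Netty 3.x classes sit under org.jboss.netty, Netty 4.x under io.netty). The class is a throwaway example, not driver code.

```java
// Netty 3.x (what the current shade rules are keyed on):
//   import org.jboss.netty.channel.ChannelFuture;
// Netty 4.x moved the whole tree, so relocation patterns matching
// "org.jboss.netty" have to be re-keyed on "io.netty":
import io.netty.channel.ChannelFuture;

class NettyPackageCheck {
    // Compiles only against Netty 4.x; the org.jboss.netty name no longer exists there.
    static Class<?> nettyFutureType() {
        return ChannelFuture.class;
    }
}
```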

Environment

None

Pull Requests

None

Activity

Andy Tolbert
March 28, 2015, 2:50 AM
Edited

Verified against the java622 branch that the driver behaves correctly after upgrading to Netty 4.0.26.Final. I also ran a mixed-request load duration test in the following environment:

  • 8-node Cassandra cluster running 2.0.13 on n1-standard-1 instances.

  • Driver client running on an n1-standard-1 instance.

Each test configuration was executed for 10 minutes.

Test           | Requests Completed | Throughput
---------------|--------------------|-----------
2.0.9.2        | 10483857           | 17473/sec
java622        | 11141451           | 18569/sec
w/o coalescing | 4494692            | 7491/sec
w/ epoll       | 11154753           | 18591/sec

The 'without coalescing' configuration was inhibited by CPU utilization. The client ran on an n1-standard-1 instance, which has only 1 vCPU, and it was completely consumed throughout the test. Without coalescing, a write syscall is made for each write, which is very CPU intensive. All other configurations used roughly the same 75-85% CPU. I suspect that if I were to try this on a larger instance, the 'w/o coalescing' and 'w/ coalescing' numbers would be more in line with one another (see the test results in ). I will validate this tomorrow, but this provides evidence of a tangible benefit to using the coalescing configuration as the default.
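
For reference, the Netty 4 mechanism that coalescing builds on looks roughly like the sketch below; this is a generic illustration (the frame list and method names are made up), not the driver's actual connection code. write() only enqueues into the channel's outbound buffer, while flush() is what pushes the queued bytes to the socket, so batching several writes per flush reduces the number of write syscalls.

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.Channel;
import io.netty.util.CharsetUtil;

import java.util.List;

public class CoalescingSketch {

    // Every writeAndFlush() asks the transport to push bytes to the
    // socket immediately, i.e. roughly one write syscall per frame.
    static void writeUncoalesced(Channel channel, List<String> frames) {
        for (String frame : frames) {
            channel.writeAndFlush(Unpooled.copiedBuffer(frame, CharsetUtil.UTF_8));
        }
    }

    // Coalesced: write() only enqueues into Netty's outbound buffer;
    // a single flush() at the end lets the transport combine the queued
    // frames into far fewer socket writes.
    static void writeCoalesced(Channel channel, List<String> frames) {
        for (String frame : frames) {
            channel.write(Unpooled.copiedBuffer(frame, CharsetUtil.UTF_8));
        }
        channel.flush();
    }
}
```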

Using netty-transport-native-epoll did not seem to offer much benefit. I suspect this is because epoll provides more value when there are many socket connections; with an 8 node cluster, there are 69 connections (8 per node * 8 nodes + 1 control connection).
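
For reference, selecting the native transport in Netty 4 looks roughly like this; a generic sketch (everything outside Netty's own class names is illustrative), not the driver's actual wiring:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.epoll.Epoll;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class TransportSelection {
    // Prefer the native epoll transport when it is available (Linux only),
    // otherwise fall back to the portable NIO transport.
    static Bootstrap newBootstrap() {
        boolean useEpoll = Epoll.isAvailable();
        EventLoopGroup group = useEpoll ? new EpollEventLoopGroup() : new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap().group(group);
        if (useEpoll) {
            bootstrap.channel(EpollSocketChannel.class);
        } else {
            bootstrap.channel(NioSocketChannel.class);
        }
        return bootstrap;
    }
}
```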

Andy Tolbert
March 28, 2015, 7:19 PM

Executed the same test with the client running on an n1-standard-8 instance:

Test           | Requests Completed | Throughput
---------------|--------------------|-----------
2.0.9.2        | 22000475           | 36667/sec
java622        | 21997387           | 36662/sec
w/o coalescing | 21221093           | 35368/sec
w/ epoll       | 21375001           | 35625/sec

Running with a larger-capacity instance, CPU usage is pretty much the same across all the tests (~75-80%). I think this further confirms that I was pushing the n1-standard-1 to its limits previously. Contrary to the previous test, more pressure is being put on the Cassandra end with the larger client instance, enough that it becomes the bottleneck. As the 'java622' configuration performs about the same as or better than '2.0.9.2' both when the client is overtaxed and when the cluster is overtaxed, I think we can conclude that the Netty 4.x upgrade does not negatively impact performance.

Andy Tolbert
April 23, 2015, 1:15 AM

We noticed that Netty 4.0 turns TCP no delay on by default, where previously it did not (PR 939). A user has the option to configure TCP no delay in the driver via SocketOptions.setTcpNoDelay. Currently, if a user does not set this value, it is left to the discretion of Netty or the default behavior of Java sockets. We are considering whether we should take any action, so I ran some tests with no delay on and off to see if there was any performance difference. Note that the impact of disabling Nagle's algorithm (by enabling TCP no delay) is very environment-dependent, so I performed this test on a system in the same EC2 region as my Cassandra cluster (us-west-1) and on one that wasn't (us-east-1).
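
For anyone wanting to pin the behavior explicitly rather than rely on Netty's default, setting it through the driver looks roughly like this; the contact point is a placeholder and the surrounding wiring is illustrative, with SocketOptions.setTcpNoDelay being the setting discussed above:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class TcpNoDelayConfig {
    public static void main(String[] args) {
        // Explicitly set TCP_NODELAY instead of relying on Netty's (changed) default.
        SocketOptions socketOptions = new SocketOptions();
        socketOptions.setTcpNoDelay(true);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder contact point
                .withSocketOptions(socketOptions)
                .build();
        try {
            // ... connect a Session and run the workload ...
        } finally {
            cluster.close();
        }
    }
}
```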

This run used the same workload as the tests documented in the previous comments. The environment used was the following:

  • 8-node Cassandra cluster running 2.0.13 on c3.2xlarge instances in us-west-1.

  • Driver clients running on a c3.2xlarge instance in each of the us-east-1 and us-west-1 regions.

Each test configuration was executed for 10 minutes against the java561 branch.

Client Location | No Delay | Requests Completed | Throughput
----------------|----------|--------------------|-----------
us-west-1       | on       | 41767342           | 69155/sec
us-west-1       | off      | 41534979           | 68209/sec
us-east-1       | on       | 24443146           | 40411/sec
us-east-1       | off      | 19740311           | 32689/sec

Interestingly enough, the performance improvement is more apparent for the remote client, with a throughput improvement of roughly 19%. To ensure this wasn't an anomaly, I ran the test several times and observed similar results. In the local datacenter there is an improvement, but only a marginal one (1.3%). It's possible we were pushing the Cassandra cluster to its limits when using a local client, so any increase would not be as noticeable.

My assertion is that the benefit of Nagle's algorithm (TCP no delay = off) is really only useful on slow or high-latency links (such as a mobile network). I would consider it the more common case that the client is either in the same datacenter as the Cassandra cluster or has a reasonably fast and reliable connection to it.

Fixed

Assignee

Olivier Michallat

Reporter

Olivier Michallat

Labels

PM Priority

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Components

Priority

Major