Netty upgrade to 4.x

Description

The Java driver currently uses Netty 3.9.

Upgrading to 4.x will bring valuable improvements like buffer pooling.
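
For reference, buffer pooling in Netty 4 is exposed through PooledByteBufAllocator. The driver's actual bootstrap code is not shown in this ticket, so the following is only a sketch of how a Netty 4 client typically opts into the pooled allocator; the bootstrap wiring is illustrative rather than the driver's own.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class PooledAllocatorSketch {
    // Illustrative only: shows where the pooled allocator is plugged in.
    static Bootstrap newBootstrap(NioEventLoopGroup group) {
        return new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                // Netty 4's pooled allocator recycles ByteBuf instances instead of
                // allocating (and garbage-collecting) a fresh buffer per read/write.
                .option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
    }
}
```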

Note: the maven-shade-plugin configuration (introduced by JAVA-538) will have to be updated. In particular, Netty 4 uses a new root package name.
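
The relocation rule itself lives in the pom and is not reproduced here; the fragment below only illustrates the rename that the shade pattern has to track (Netty 3.x classes sit under org.jboss.netty, Netty 4.x under io.netty). The class is a throwaway example, not driver code.

```java
// Netty 3.x (what the current shade rules are keyed on):
//   import org.jboss.netty.channel.ChannelFuture;
// Netty 4.x moved the whole tree, so relocation patterns matching
// "org.jboss.netty" have to be re-keyed on "io.netty":
import io.netty.channel.ChannelFuture;

class NettyPackageCheck {
    // Compiles only against Netty 4.x; the org.jboss.netty name no longer exists there.
    static Class<?> nettyFutureType() {
        return ChannelFuture.class;
    }
}
```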

Environment

None

Pull Requests

None

Activity

Andy Tolbert
March 28, 2015, 2:50 AM
Edited

Verified against the java622 branch that the driver behaves correctly after upgrading to Netty 4.0.26.Final. I also ran a mixed-request load duration test in the following environment:

  • 8-node Cassandra cluster running 2.0.13 on n1-standard-1 instances.

  • Driver client running on an n1-standard-1 instance.

Each test configuration was executed for 10 minutes.

Test           | Requests Completed | Throughput
---------------|--------------------|-----------
2.0.9.2        | 10483857           | 17473/sec
java622        | 11141451           | 18569/sec
w/o coalescing | 4494692            | 7491/sec
w/ epoll       | 11154753           | 18591/sec

The 'without coalescing' configuration was inhibited by CPU utilization. The client ran on an n1-standard-1 instance, which has only 1 vCPU, and it was completely consumed throughout the test. Without coalescing, a write syscall is made for each write, which is very CPU intensive. All other configurations used roughly the same 75-85% CPU. I suspect that if I were to try this on a larger instance, the 'w/o coalescing' and 'w/ coalescing' numbers would be more in line with one another (see the test results in ). I will validate this tomorrow, but this provides evidence of a tangible benefit to using the coalescing configuration as the default.
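
For reference, the Netty 4 mechanism that coalescing builds on looks roughly like the sketch below; this is a generic illustration (the frame list and method names are made up), not the driver's actual connection code. write() only enqueues into the channel's outbound buffer, while flush() is what pushes the queued bytes to the socket, so batching several writes per flush reduces the number of write syscalls.

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.Channel;
import io.netty.util.CharsetUtil;

import java.util.List;

public class CoalescingSketch {

    // Every writeAndFlush() asks the transport to push bytes to the
    // socket immediately, i.e. roughly one write syscall per frame.
    static void writeUncoalesced(Channel channel, List<String> frames) {
        for (String frame : frames) {
            channel.writeAndFlush(Unpooled.copiedBuffer(frame, CharsetUtil.UTF_8));
        }
    }

    // Coalesced: write() only enqueues into Netty's outbound buffer;
    // a single flush() at the end lets the transport combine the queued
    // frames into far fewer socket writes.
    static void writeCoalesced(Channel channel, List<String> frames) {
        for (String frame : frames) {
            channel.write(Unpooled.copiedBuffer(frame, CharsetUtil.UTF_8));
        }
        channel.flush();
    }
}
```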

Using netty-transport-native-epoll did not seem to offer much benefit. I suspect this is because epoll provides more value when there are many socket connections; with an 8 node cluster, there are 69 connections (8 per node * 8 nodes + 1 control connection).
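
For reference, selecting the native transport in Netty 4 looks roughly like this; a generic sketch (everything outside Netty's own class names is illustrative), not the driver's actual wiring:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.epoll.Epoll;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class TransportSelection {
    // Prefer the native epoll transport when it is available (Linux only),
    // otherwise fall back to the portable NIO transport.
    static Bootstrap newBootstrap() {
        boolean useEpoll = Epoll.isAvailable();
        EventLoopGroup group = useEpoll ? new EpollEventLoopGroup() : new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap().group(group);
        if (useEpoll) {
            bootstrap.channel(EpollSocketChannel.class);
        } else {
            bootstrap.channel(NioSocketChannel.class);
        }
        return bootstrap;
    }
}
```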

Andy Tolbert
March 28, 2015, 7:19 PM

Executed the same test with the client running on an n1-standard-8 instance:

Test           | Requests Completed | Throughput
---------------|--------------------|-----------
2.0.9.2        | 22000475           | 36667/sec
java622        | 21997387           | 36662/sec
w/o coalescing | 21221093           | 35368/sec
w/ epoll       | 21375001           | 35625/sec

Running with a larger-capacity instance, CPU usage is pretty much the same across all the tests (~75-80%). I think this further confirms that I was pushing the n1-standard-1 to its limits previously. Contrary to the previous test, more pressure is being put on the Cassandra end with the larger client instance, enough that it becomes the bottleneck. As the 'java622' configuration performs about the same as or better than '2.0.9.2' both when the client is overtaxed and when the cluster is overtaxed, I think we can conclude that the Netty 4.x upgrade does not negatively impact performance.

Andy Tolbert
April 23, 2015, 1:15 AM

We noticed that Netty 4.0 turns TCP no delay on by default, where previously it did not (PR 939). A user has the option to configure TCP no delay in the driver via SocketOptions.setTcpNoDelay. Currently, if a user does not set this value, it is left to the discretion of Netty or the default behavior of Java sockets. We are considering whether we should take any action, so I ran some tests with no delay on and off to see if there was any performance difference. Note that the impact of disabling Nagle's algorithm (by enabling TCP no delay) is very environment-dependent, so I performed this test on a system in the same EC2 region as my Cassandra cluster (us-west-1) and on one that wasn't (us-east-1).
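
For anyone wanting to pin the behavior explicitly rather than rely on Netty's default, setting it through the driver looks roughly like this; the contact point is a placeholder and the surrounding wiring is illustrative, with SocketOptions.setTcpNoDelay being the setting discussed above:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class TcpNoDelayConfig {
    public static void main(String[] args) {
        // Explicitly set TCP_NODELAY instead of relying on Netty's (changed) default.
        SocketOptions socketOptions = new SocketOptions();
        socketOptions.setTcpNoDelay(true);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder contact point
                .withSocketOptions(socketOptions)
                .build();
        try {
            // ... connect a Session and run the workload ...
        } finally {
            cluster.close();
        }
    }
}
```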

This run used the same workload as the tests documented in the previous comments. The environment used was the following:

  • 8-node Cassandra cluster running 2.0.13 on c3.2xlarge instances in us-west-1.

  • Driver clients running on a c3.2xlarge instance in each of the us-east-1 and us-west-1 regions.

Each test configuration was executed for 10 minutes against the java561 branch.

Client Location | No Delay | Requests Completed | Throughput
----------------|----------|--------------------|-----------
us-west-1       | on       | 41767342           | 69155/sec
us-west-1       | off      | 41534979           | 68209/sec
us-east-1       | on       | 24443146           | 40411/sec
us-east-1       | off      | 19740311           | 32689/sec

Interestingly enough, the performance improvement is more apparent for the remote client, with a throughput improvement of roughly 19%. To ensure this wasn't an anomaly, I ran the test several times and observed similar results. In the local datacenter there is an improvement, but only a marginal one (1.3%). It's possible we were pushing the Cassandra cluster to its limits when using a local client, so any increase would not be as noticeable.

My assertion is that the benefit of Nagle's algorithm (TCP no delay = off) is really only useful on slow or high-latency links (such as a mobile network). I would consider it the more common case that the client is either in the same datacenter as the Cassandra cluster or has a reasonably fast and reliable connection to it.

Fixed

Assignee

Olivier Michallat

Reporter

Olivier Michallat

Labels

PM Priority

None

Affects versions

Fix versions

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Components

Priority

Major