Load-Balancer doesn't request complete Cluster directly

Description

We are running a Cassandra cluster of 10 nodes without data replication; all nodes are local. Each node owns roughly 10% of the complete data.

Each node owns a part of the token range (without any replication).

We use the Java driver with the default configuration. We observe that one node is never contacted directly by the client (connection: Java client --> node). The client opens no connection to this node, and the node neither receives requests from nor sends responses to the client directly; requests for its data are forwarded and answered through the coordinator of the request, over the cluster's internal node-to-node communication. The other 9 nodes receive requests directly, based on the token map of the driver.

So this looks like a bug in the load balancer: requests are not sent directly to the node owning the token, but to another node acting as coordinator. Such requests take longer to answer.
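To make the effect described above concrete, here is a minimal, self-contained Java sketch. It is a toy model and not the driver's implementation: the modulo partitioner, the fallback rule (next connected node on the ring) and the class name `TokenRoutingSketch` are assumptions made only for illustration. It counts how many requests need an extra hop when the client has no connection to one of 10 nodes and the replication factor is 1.

```java
import java.util.Set;
import java.util.TreeSet;

public class TokenRoutingSketch {

    // Toy partitioner: with replication factor 1, exactly one node owns each token.
    static int owner(long token, int nodeCount) {
        return (int) Math.floorMod(token, (long) nodeCount);
    }

    // Prefer the token owner as coordinator; otherwise fall back to the
    // next node on the ring that the client is connected to.
    static int coordinator(long token, int nodeCount, Set<Integer> connected) {
        int own = owner(token, nodeCount);
        for (int step = 0; step < nodeCount; step++) {
            int candidate = (own + step) % nodeCount;
            if (connected.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("client is not connected to any node");
    }

    // Counts requests whose coordinator is not the token owner (one extra hop each).
    static int forwarded(int requests, int nodeCount, Set<Integer> connected) {
        int count = 0;
        for (long token = 0; token < requests; token++) {
            if (coordinator(token, nodeCount, connected) != owner(token, nodeCount)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Integer> all = new TreeSet<>();
        for (int i = 0; i < 10; i++) {
            all.add(i);
        }
        Set<Integer> withoutOne = new TreeSet<>(all);
        withoutOne.remove(3); // hypothetical: the client has no connection to node 3

        // prints "forwarded, all connected: 0"
        System.out.println("forwarded, all connected: " + forwarded(1000, 10, all));
        // prints "forwarded, node 3 missing: 100"
        System.out.println("forwarded, node 3 missing: " + forwarded(1000, 10, withoutOne));
    }
}
```

In this model, roughly 10% of the requests (those whose token belongs to the unconnected node) pay the additional hop, which matches the share of data each node owns in the setup described here.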

What we have tried in order to change this behavior:

  • Shrinking or growing the cluster: each time, one node is still not contacted directly.

  • Increasing -Ddatastax-java-driver.advanced.connection.pool.local.size to 20 or more: one node still does not have 20 connections (per Java client), while the other nodes have more than 20 connections per Java client.

Environment

Cassandra 3.11.7 nodes on Linux
DataStax Java Driver 4.8

Pull Requests

None

Activity

Alexandre Dutra
November 17, 2020, 2:38 PM

How do you know the driver is avoiding 1 node out of ten? The snippet you included above is just the output of nodetool status if I'm not mistaken, it doesn't really show evidence that something is wrong with the driver.

Also which load balancing policy are you using, the default one? Can you post your load balancing policy configuration please?

If you are absolutely certain that one node is being avoided, we need to understand why. My guess is that this node is being assigned a distance of IGNORED for some reason, maybe because it is configured with the wrong datacenter name. Could you please enable debug logs and run your application again? Thanks!
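As an illustration of how a mismatched datacenter name could lead to a node being ignored, here is a small self-contained Java sketch. It is a simplified model, not driver code: the `Distance` enum stands in for the driver's node distance, and the node and datacenter names are hypothetical. The assumed rule is that a node whose reported datacenter does not exactly match the configured local datacenter is treated as IGNORED and never receives direct requests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DistanceSketch {

    // Simplified stand-in for the driver's node distance (only two values modeled here).
    enum Distance { LOCAL, IGNORED }

    // Assumed rule: exact, case-sensitive match against the local datacenter name.
    static Distance distanceOf(String nodeDatacenter, String localDatacenter) {
        return localDatacenter.equals(nodeDatacenter) ? Distance.LOCAL : Distance.IGNORED;
    }

    public static void main(String[] args) {
        String localDatacenter = "dc1"; // hypothetical local datacenter configured on the client

        // Hypothetical nodes and the datacenter name each one reports.
        Map<String, String> nodes = new LinkedHashMap<>();
        nodes.put("node-a", "dc1");
        nodes.put("node-b", "dc1");
        nodes.put("node-c", "DC1"); // differs only in case, still a mismatch

        for (Map.Entry<String, String> node : nodes.entrySet()) {
            System.out.println(node.getKey() + " -> " + distanceOf(node.getValue(), localDatacenter));
        }
        // prints:
        // node-a -> LOCAL
        // node-b -> LOCAL
        // node-c -> IGNORED
    }
}
```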

Arne Voß
November 24, 2020, 3:09 PM

Hi Alexandre,

thank you for the response.

  • For load balancing we use the default policy (which should use the token information to choose the best node, since we use prepared statements).

  • The nodes are all in the same network; we have not configured any node to be ignored.

  • The nodes are all in the same datacenter (only one cluster containing one datacenter).

  • We use the JMX interface of each node to read its connection count (org.apache.cassandra.metrics.Client.connectedNativeClients).
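For reference, the JMX check mentioned in the last point could look roughly like the following Java sketch, which uses only the JDK's `javax.management` API. The assumptions are that JMX is reachable on Cassandra's default port 7199 without authentication or SSL, and that the gauge is exposed under the MBean name `org.apache.cassandra.metrics:type=Client,name=connectedNativeClients` with a `Value` attribute. The class and method names are made up for this example.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NativeClientCount {

    // Builds the MBean name of a Cassandra client metric (assumed naming scheme).
    static ObjectName clientMetric(String name) {
        try {
            return new ObjectName("org.apache.cassandra.metrics:type=Client,name=" + name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid metric name: " + name, e);
        }
    }

    // Standard RMI-based JMX service URL for the given host and port.
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Reads the number of native-protocol client connections from one node.
    static Object connectedNativeClients(String host, int port) throws Exception {
        JMXServiceURL url = new JMXServiceURL(jmxUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            return mbeans.getAttribute(clientMetric("connectedNativeClients"), "Value");
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the node addresses as arguments; 7199 is assumed as the JMX port.
        for (String host : args) {
            System.out.println(host + " -> " + connectedNativeClients(host, 7199));
        }
    }
}
```

Note that this gauge counts connections from all clients to a node, so per-client numbers can only be derived from it when a single client application is running.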

Which debug information do you need exactly?

Alexandre Dutra
December 3, 2020, 2:33 PM

I was asking if you could enable DEBUG or TRACE levels in your logging library. I think that if you set the following loggers to DEBUG it should be enough to understand why the driver is skipping one node:

Also, could you please run the following snippet on your cluster:

This should show if one of the nodes is being considered DOWN or IGNORED.

Arne Voß
December 9, 2020, 3:30 PM

Hi Alexandre,

I attached the log files.

As you can see, the driver detects all nodes:

But during processing, the node with the IP 10.0.10.36 is not contacted (here is the output of your code snippet):

Do you have any idea why 10.0.10.36 is not listed as a node?

Alexandre Dutra
December 9, 2020, 4:43 PM

Hi Arne,

First off, I see 2 sessions being created:

  • “s0” is being initialized at 15:35:21.920 with contact point 10.0.10.7:9042. It is being closed at 15:35:23.012.

  • “s1” is being initialized at 15:35:25.065 with a contact point of type com.ipsb.platform.storage.cassandra.HostnameEndPoint.

Why 2 sessions? In any case, to summarize: session s0 discovers and uses 10.0.10.36, but s1 seems to have a problem with it.

There are several things that I don’t understand. First, what is com.ipsb.platform.storage.cassandra.HostnameEndPoint? Is it a custom implementation of EndPoint? If so, I cannot know what this implementation does; maybe it’s not resolving the IP address of the node correctly.

Secondly, it seems that this contact point is 10.0.10.36, since they have the same host id: 248fccb0-aa3a-440f-8c16-89b4ba26a263. Which means that your special implementation is maybe making the driver believe that this endpoint is not reachable.

If you can’t find what’s going on you could try with even more logs; set these to DEBUG:

Good luck!

Assignee

Unassigned

Reporter

Arne Voß

Labels

None

PM Priority

A

Reproduced in

None

Affects versions

Fix versions

None

Pull Request

None

Doc Impact

None

Size

None

External issue ID

None

Priority

Major