CCM timing issues

Description

https://github.com/joaquincasares/java-driver/compare/master...retrypolicies_dev_bugs

Searching for "// BUG:" shows situations where an additional Thread.sleep() is needed in order to ensure the node has been marked as down. Can I get this automated in the waitFor() code? That way it's less racy?

Also, do note that in one case I must use:

instead of leaving the "40" out of the call (so it defaults to 20). Not sure why this is needed in this one situation. Any thoughts?

Environment

None

Pull Requests

None

Linked work items

is duplicated by

JAVA-246

Delayed Host availability detection

Activity

Andy Tolbert
October 9, 2015 at 6:48 PM

We are refactoring the tests with the 'FIXME' tags and utility code that has 'FIXME' as part of so marking this as DUPLICATE and assigning to . The timing problem will continue to exist, but we'll have fewer tests that depend on it, and it doesn't seem to be that visible in Jenkins anymore.

Joaquin Casares
July 18, 2014 at 6:54 PM

This is now a placeholder until CASSANDRA-7012 is resolved. In the meantime, these commits to ccm are our workarounds:

We can probably remove the FIXME comments today through the ccm workaround, but I'm uncertain how stable Jenkins will become. I'd advise waiting until either CASSANDRA-7012 is resolved or until we have additional resources to keep up with Jenkins stability.

Joaquin Casares
May 6, 2013 at 6:51 PM

Okay, cool. I'll move the Thread.sleep() into the waitFor code and have it switchable by a boolean. This way when ccm has that integration, it will make for a quick fix.
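A minimal sketch of what such a switchable wait could look like. This is not the driver's actual test utility: the class name, the `extraSettleSleep` flag, the `settleMillis` value, and the assumption that the default of 20 is in seconds are all hypothetical, chosen only for illustration.

```java
import java.util.function.BooleanSupplier;

class WaitForSketch {
    // Hypothetical switch: turn off once ccm can wait for the cluster's own view.
    public static boolean extraSettleSleep = true;
    // Hypothetical settle time added after the condition is first observed.
    public static long settleMillis = 1000;

    /** Polls the condition until it holds or the timeout expires. */
    public static boolean waitFor(BooleanSupplier condition, int maxTrySeconds) {
        long deadline = System.nanoTime() + maxTrySeconds * 1_000_000_000L;
        try {
            while (!condition.getAsBoolean()) {
                if (System.nanoTime() >= deadline) {
                    return false; // condition never became true in time
                }
                Thread.sleep(50);
            }
            if (extraSettleSleep) {
                // The workaround sleep, kept in one place so it is easy to remove.
                Thread.sleep(settleMillis);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Overload that uses the default of 20 mentioned in the description. */
    public static boolean waitFor(BooleanSupplier condition) {
        return waitFor(condition, 20);
    }
}
```

Keeping the sleep behind a single flag means the tests themselves do not change when the workaround is dropped; only `extraSettleSleep` flips to `false`.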

+1 on not submitting BUG comments. I'll move them over to FIXME's and create a new, fresh branch before I submit a pull request.

Sylvain Lebresne
May 6, 2013 at 2:55 PM

> Can I get this automated in the waitFor() code?

Not really, no. The problem is that the C* nodes themselves take time to discover that others are dead, and that is somewhat independent of the vision the driver has (but will obviously impact tests that care about the consistency level). This is especially true when nodes are started/stopped quickly; the C* failure detector can then get particularly slow. We have the same problem with the C* dtests, and there we use a CCM flag that waits until a dead node has been detected as such by the other members of the cluster. The implementation of said flag is ugly as hell, since it watches the system log file of the other nodes, but that's vaguely better than sleeps. So we can probably do the same here, though the flag I mention is not yet exposed through the command line 'stop' command, so we'd need to add that to ccm first. I'll do that when I have a minute.

In the meantime, sleeps will have to do. But please let's not commit a "// BUG:" comment for such things. We can put a FIXME so we don't forget to change it later, but these are not bugs, and marking them as such makes it annoying to locate real problems with the driver.
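For illustration, the log-watching idea might be sketched in Java roughly as follows. This is not ccm's implementation: the class and method names are hypothetical, and the pattern assumes the other node's system log contains a line of the form "InetAddress /127.0.0.2 is now DOWN", which may differ between C* versions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;

class LogDeathWatcher {
    /** True if any line reports the given address as DOWN (assumed log format). */
    public static boolean logReportsDown(List<String> lines, String address) {
        // (?!\d) keeps "127.0.0.2" from matching inside "127.0.0.20".
        Pattern down = Pattern.compile(Pattern.quote(address) + "(?!\\d).*is now DOWN");
        for (String line : lines) {
            if (down.matcher(line).find()) {
                return true;
            }
        }
        return false;
    }

    /** Re-reads another node's system log until it reports the address as DOWN. */
    public static boolean waitForDown(Path systemLog, String address, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (true) {
                if (Files.exists(systemLog)
                        && logReportsDown(Files.readAllLines(systemLog), address)) {
                    return true;
                }
                if (System.nanoTime() >= deadline) {
                    return false; // the other node never logged the death in time
                }
                Thread.sleep(100);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Note that this sketch scans the whole file on every poll, so a DOWN line left over from an earlier stop of the same node would also match; a more careful version would record the log's length before stopping the node and only scan what is appended afterwards.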


Duplicate

Details

Assignee

Clint Ascencio (Deactivated)

Reporter

Joaquin Casares

Labels

Sprint

Java 4.x

Priority

Minor

Created May 2, 2013 at 11:10 AM

Updated August 17, 2020 at 7:36 AM

Resolved October 9, 2015 at 6:49 PM