We've been running into hangs in various tests, both on Jenkins and in the old duration test environments. By sshing into a hung node on Jenkins and attaching with gdb, I was able to source the error:
Here's my interpretation:
on_up gets called in normal operation during a test, and on_up calls debug
because we run with %(thread) in our logging code, debug calls threading.current_thread()
threading.current_thread() (inside debug) acquires the _active_limbo_lock
before it releases that lock, Python starts garbage collection on the same thread (marked with (Garbage-collecting in the backtrace)
the garbage collector finds a Cluster that can be __del__ 'd, and does so. This calls connection.close, which calls debug, which calls threading.current_thread, which, like the other debug call, tries to acquire _active_limbo_lock
_active_limbo_lock is a Lock object and not a RLock, so the multiple acquire attempts from the same thread put us in a deadlock. The interpreter won't return to the non-GC context and continue the first debug call until garbage collection is over. GC can't complete until the first debug call has released it.
Alan and I discussed privately and decided that this something that is unlikely to happen on production systems – it requires that users 1) use debug-level logging 2) configure debug logging to print thread IDs or names. So, we're inclined to categorize this as a test bug, not a product bug.
We'll fix it by removing %(thread) from our test config. If we see hangs still happening, then we'll know there's a proper product bug, but until then we have a quick fix for the test issue.