Remove %(thread) from test debugging

Description

We've been running into hangs in various tests, both on Jenkins and in the old duration test environments. By sshing into a hung node on Jenkins and attaching with gdb, I was able to source the error:

Here's my interpretation:

  • on_up gets called in normal operation during a test, and on_up calls debug

  • because we run with %(thread) in our logging code, debug calls threading.current_thread()

  • threading.current_thread() (inside debug) acquires the _active_limbo_lock

  • before it releases that lock, Python starts garbage collection on the same thread (marked with (Garbage-collecting in the backtrace)

  • the garbage collector finds a Cluster that can be __del__ 'd, and does so. This calls connection.close, which calls debug, which calls threading.current_thread, which, like the other debug call, tries to acquire _active_limbo_lock

_active_limbo_lock is a Lock object and not a RLock, so the multiple acquire attempts from the same thread put us in a deadlock. The interpreter won't return to the non-GC context and continue the first debug call until garbage collection is over. GC can't complete until the first debug call has released it.

Alan and I discussed privately and decided that this something that is unlikely to happen on production systems – it requires that users 1) use debug-level logging 2) configure debug logging to print thread IDs or names. So, we're inclined to categorize this as a test bug, not a product bug.

We'll fix it by removing %(thread) from our test config. If we see hangs still happening, then we'll know there's a proper product bug, but until then we have a quick fix for the test issue.

Environment

None

Pull Requests

None
Won't Do

Assignee

Unassigned

Reporter

Jim Witschey

Fix versions

None

Labels

None

Reproduced in

None

PM Priority

None

External issue ID

None

Doc Impact

None

Reviewer

None

Size

None

Pull Request

None

Sprint

Py P-NEXT

Priority

Major