The client runs a stress test against an 18-node Cassandra cluster. All the Cassandra nodes are shut down one by one, with a gap of 30 seconds between each. The client logs the open connections and the in-flight queries on each connection every 10 seconds.
The client did not receive an onDown event for one specific node, so it continued to send requests to that node as coordinator. Every request sent to that node failed with "Write attempt on defunct connection", and the load balancing policy (LBP) routed the request to another available Cassandra node. However, the inFlight request count on the connection for that node kept going up. Once it reached 128, it stayed there forever. It looks like there is a request count leak on "Write attempt on defunct connection".
See the attached log: request-leak.txt. The request leak happens on host1.domain.com
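The suspected leak can be illustrated with a minimal sketch. It is not the driver's actual code: SketchConnection, its fields, and the exception type are hypothetical, and 128 is only the value at which the count plateaued in this report, assumed here to be the per-connection limit. The point it shows is that a slot reserved before a write has to be released when the write is rejected, otherwise every failed write on a defunct connection permanently consumes one slot.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a driver connection; not the driver's real class.
class SketchConnection {
    // Assumed limit: the value at which inFlight plateaued in the report.
    static final int MAX_IN_FLIGHT = 128;

    final AtomicInteger inFlight = new AtomicInteger();
    volatile boolean defunct;

    // Reserves a request slot, then writes. If the write is rejected, the slot
    // is released before the exception propagates, so the counter cannot leak.
    void write(String request) {
        if (inFlight.incrementAndGet() > MAX_IN_FLIGHT) {
            inFlight.decrementAndGet();
            throw new IllegalStateException("Connection is busy");
        }
        try {
            if (defunct) {
                throw new IllegalStateException("Write attempt on defunct connection");
            }
            // A real implementation would hand the request to the network layer here.
        } catch (RuntimeException e) {
            inFlight.decrementAndGet(); // release the slot on failure
            throw e;
        }
    }

    // Called when a response arrives for a request that was written successfully.
    void onResponse() {
        inFlight.decrementAndGet();
    }
}
```

Without the decrement in the catch block, repeated writes to a defunct connection would push the counter up to the limit and leave it there, which matches the behaviour described above.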
Fixed a bug where the connection was not marked as "defunct" immediately when the USE query to set the keyspace failed.
Hardened the code for a corner case where a defunct connection was not removed from the connection pool.
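The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: KsConnection, KsPool, and the queryExecutor callback are hypothetical names invented for the example. The connection is flagged defunct and removed from its pool as soon as the USE query fails, and the pool also evicts any defunct connection it still encounters when one is borrowed.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical connection; the real driver's classes and methods differ.
class KsConnection {
    private final KsPool pool;
    private volatile boolean defunct;

    KsConnection(KsPool pool) { this.pool = pool; }

    boolean isDefunct() { return defunct; }

    // Switches keyspace by running a USE query through the supplied executor.
    // On failure the connection is marked defunct and removed from the pool
    // before the error is rethrown.
    void setKeyspace(String keyspace, Consumer<String> queryExecutor) {
        try {
            queryExecutor.accept("USE " + keyspace);
        } catch (RuntimeException e) {
            defunct = true;
            pool.remove(this);
            throw e;
        }
    }
}

// Hypothetical pool holding the connections for one host.
class KsPool {
    private final List<KsConnection> connections = new CopyOnWriteArrayList<>();

    KsConnection open() {
        KsConnection c = new KsConnection(this);
        connections.add(c);
        return c;
    }

    void remove(KsConnection c) { connections.remove(c); }

    // Borrowing skips and evicts defunct connections as a second line of defence.
    KsConnection borrow() {
        for (KsConnection c : connections) {
            if (c.isDefunct()) {
                connections.remove(c); // safe: the iterator works on a snapshot
                continue;
            }
            return c;
        }
        return null; // no usable connection
    }

    int size() { return connections.size(); }
}
```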