Log exception when Cluster.Init() can not recover from

Description

Update

We should log exceptions when the driver cannot recover while trying to initialize the Cluster instance.


There is a recommendation to use only one Cluster instance per (physical) cluster (per application lifetime) (source: http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra).

However, in the Init method of the Cluster class, when an exception is thrown in "_controlConnection.Init();", it is usually cached (that is, if it is not a NoHostAvailableException). Therefore, an occurrence of a TimeoutException is cached and, if the recommendation is followed, there will be no retries.

Suggestion 1: Handle System.TimeoutException as NoHostAvailableException.
Suggestion 2: Add logging when a) caching the exception and b) throwing the cached exception. Currently there is no such information, and the stack trace is misleading:

{{System.TimeoutException: The task didn't complete before timeout.
at Cassandra.Cluster.Init()
at Cassandra.Cluster.Connect(String keyspace)}}

Question: Why is caching the exceptions needed at all in Cluster.Init?

Environment

None

Activity

Jorge Bay Gondra
February 8, 2016, 10:21 AM

Question: Why is caching the exceptions needed at all in Cluster.Init?

Cluster.Init() covers various steps that should be done once. Internally, Init() is called when trying to connect a new session or retrieve metadata for the first time. If there is an exception (that we can't recover from) while trying to initialize, it's expected that all subsequent calls get the same exception. Otherwise, if you get an exception in one of the steps, it is likely that the next time you get a different exception at step n+1, making it very hard for users to find the underlying issue.

About suggestion 2: sounds good to me.
About suggestion 1: TimeoutException is an "artificial" exception that says that a certain amount of time passed and the task didn't complete. It doesn't carry much information: increasing SocketOptions.ConnectTimeoutMillis would generally help, but we are not able to determine the actual issue. We could wrap the exception in Cluster.Init() to tell the user to do so.

We plan to provide an async counterpart of the Cluster.Connect() method, which could help mitigate the lack of information in TimeoutException by allowing the user to control the Task.
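
For readers following the thread, the caching behavior described above, combined with the logging from suggestion 2, can be sketched roughly as follows. This is a minimal illustration and not the driver's actual code: `OneTimeInitializer`, `RecoverableInitException` (a stand-in for NoHostAvailableException) and the `log` callback are hypothetical names introduced only for this example.

```csharp
using System;

// Hypothetical stand-in for a recoverable error such as NoHostAvailableException.
public class RecoverableInitException : Exception
{
    public RecoverableInitException(string message) : base(message) { }
}

// Hypothetical helper: runs the init steps once and caches a non-recoverable failure.
public class OneTimeInitializer
{
    private readonly object _lock = new object();
    private readonly Action _initSteps;
    private readonly Action<string> _log;
    private bool _initialized;
    private Exception _cachedException;

    public OneTimeInitializer(Action initSteps, Action<string> log)
    {
        _initSteps = initSteps;
        _log = log;
    }

    public void Init()
    {
        lock (_lock)
        {
            if (_initialized)
            {
                return;
            }
            if (_cachedException != null)
            {
                // Log b): tell the user this failure comes from an earlier attempt.
                _log("Rethrowing cached initialization exception: " + _cachedException.Message);
                throw _cachedException;
            }
            try
            {
                _initSteps();
                _initialized = true;
            }
            catch (RecoverableInitException)
            {
                // Not cached: a later call may run the init steps again.
                throw;
            }
            catch (Exception ex)
            {
                // Log a): record the original failure, including its stack trace.
                _cachedException = ex;
                _log("Caching initialization exception: " + ex);
                throw;
            }
        }
    }
}
```

With this shape, a second call to `Init()` after a TimeoutException does not rerun the init steps; it rethrows the cached exception, and the two log entries make it clear where the exception originally came from.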

PW
February 8, 2016, 11:08 AM

Thanks for the quick answer.

I agree with the statement that initialization should result in the same exception, which makes debugging easier.
On the other hand, Init is used only in places like getting Metadata and getting an ISession for a keyspace (Connect). Therefore, even if the exception is wrapped in a message with a note about SocketOptions.ConnectTimeoutMillis, people may want to try again without rebuilding the whole Cluster object. In fact, the nature of TimeoutException (I understand it's "artificial", but it may also be a real one from the socket) is similar to NoHostAvailableException: at a given time we couldn't connect, for various reasons (internet connection, high CPU, thread starvation), but we might want to try again.

Let's take an example: a business requirement that says "we want to finish all business operations within 30 seconds". Part of that business operation is a log insertion into the Cassandra database. Normally it's immensely fast. However, sometimes, for various reasons, it can take longer. We can add retry logic for the connection timeout and decide on the fly what to do if we fail. Increasing ConnectTimeoutMillis isn't an option, though: we cannot spend the majority of the time waiting for the connection.

Jorge Bay Gondra
February 8, 2016, 3:12 PM

Cluster.Connect() should be called on application startup, outside of execution SLAs, as it is a one-time call.
Each session maintains a connection pool to each host selected by the load balancing policy, so creating the session is an expensive operation and they are designed to be long lived instances.

It's not possible to control the execution time SLAs if you count the Cluster.Connect() call, even if we add a retry (the retry would happen after a timed-out attempt).
About handling the TimeoutException: there are several actions within Init() that may throw it, so it is not easy to determine which "step" we are in.

Creating the session on app startup and setting a connection timeout according to your connection speed should solve all the possible issues. Remember that all other exceptions will be wrapped by NoHostAvailableException.

I agree that having it logged would allow users to understand what is happening.
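
As an illustration of the recommendation in this comment, a minimal startup sketch might look like the following. The contact point `127.0.0.1` and the 10000 ms timeout are assumptions chosen for the example, not values from this ticket, and the snippet needs the DataStax C# driver package and a reachable node to run.

```csharp
using System;
using Cassandra;

public static class Program
{
    public static void Main()
    {
        // Assumption: a node is reachable at 127.0.0.1; 10000 ms is an arbitrary example value.
        Cluster cluster = Cluster.Builder()
            .AddContactPoint("127.0.0.1")
            .WithSocketOptions(new SocketOptions().SetConnectTimeoutMillis(10000))
            .Build();

        try
        {
            // Connect once at startup and keep the session for the application lifetime.
            ISession session = cluster.Connect();
            Console.WriteLine("Connected to cluster: " + session.Cluster.Metadata.ClusterName);
        }
        catch (NoHostAvailableException ex)
        {
            // Per the discussion above, this exception is not cached, so Connect() can be retried.
            Console.WriteLine("No host available: " + ex.Message);
        }
        catch (TimeoutException ex)
        {
            // Per the discussion above, this exception is cached by Cluster.Init();
            // a new Cluster instance would be needed to try again.
            Console.WriteLine("Initialization timed out: " + ex.Message);
        }
    }
}
```

Doing this at startup keeps the one-time connection cost outside any per-operation time budget, which is the point made in the comment above.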

PW
February 9, 2016, 10:03 AM

Alright, now I get it, that's reasonable. Thank you for the explanation.

Logging will be enough then.

Jorge Bay Gondra
February 9, 2016, 10:59 AM

OK, I changed the ticket subject.

Assignee

Unassigned

Reporter

PW

Priority

Major