SslTests.Integration_Cassandra_VerifyPeerMultipleCerts fails intermittently

Description

An example:

/home/jenkins/workspace/drivers_cpp_oss_master/tests/src/integration/objects/ssl.hpp:62
Expected: CASS_OK
To be equal to: cass_ssl_add_trusted_cert(get(), cert.c_str())
Which is: CASS_ERROR_SSL_INVALID_CERT [Unable to load certificate]

 

When running locally:

$ ./cassandra-integration-tests --gtest_filter=SslTests* --version=3.11.15 --category cassandra
Missing Category: All applicable tests will run
DSE Category Will be Ignored: DSE is not enabled [--dse]
Starting DataStax C/C++ Driver Integration Test v2.16.2
libuv v1.44.2
Logging driver messages
Apache Cassandra Version: 3.11.15
CCM Cluster Prefix: cpp-driver
Category: Cassandra
Note: Google Test filter = SslTests*:-*_DSE_*
...
[ RUN ] SslTests.Integration_Cassandra_VerifyPeerMultipleCerts
/work/git/cpp-driver/tests/src/integration/objects/ssl.hpp:62: Failure
Expected: CASS_OK
To be equal to: cass_ssl_add_trusted_cert(get(), cert.c_str())
Which is: CASS_ERROR_SSL_PROTOCOL_ERROR [Protocol error]
unknown file: Failure
C++ exception with description "Unable to Establish Session Connection: Invalid peer certificate" thrown in the test body.
[ FAILED ] SslTests.Integration_Cassandra_VerifyPeerMultipleCerts (184 ms)

Environment

None

Pull Requests

None

Activity


Bret McGuire 
July 19, 2023 at 9:03 PM
(edited)

Some investigation confirms that we’re actually seeing the error in this code in ssl_openssl_impl.cpp:

// Iterate over the bio, reading out as many certificates as possible.
for (X509* cert = PEM_read_bio_X509(bio, NULL, pem_password_callback, NULL);
     cert != NULL;
     cert = PEM_read_bio_X509(bio, NULL, pem_password_callback, NULL)) {
  X509_STORE_add_cert(trusted_store_, cert);
  X509_free(cert);
  num_certs++;
}

 

PEM_read_bio_X509() is returning NULL unexpectedly, ending this loop early and leading to the following case, which returns the underlying error:

// If no certificates were read from the bio, that is an error.
if (num_certs == 0) {
  ssl_log_errors("Unable to load certificate(s)");
  return CASS_ERROR_SSL_INVALID_CERT;
}

 

Some additional research suggests that PEM_read_bio_X509() can be a bit finicky about what it’s reading, particularly if the format of the PEM files has some subtle errors.
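
Since the suspicion here is PEM formatting, a rough structural pre-check on the test certs might help separate "the PEM text is malformed" from "OpenSSL rejected well-formed PEM". The following is only a sketch under that assumption: count_pem_certificate_blocks is a hypothetical helper, not part of the driver or of OpenSSL, it uses only the standard library, and it checks nothing beyond the pairing of the BEGIN/END CERTIFICATE markers. Passing this check does not guarantee that PEM_read_bio_X509() will accept the input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the driver or OpenSSL): counts properly
// paired "-----BEGIN CERTIFICATE-----" / "-----END CERTIFICATE-----" markers.
// Returns the number of blocks found, or -1 if the markers are unbalanced,
// nested, or out of order. It does not validate the base64 body.
int count_pem_certificate_blocks(const std::string& pem) {
  static const std::string kBegin = "-----BEGIN CERTIFICATE-----";
  static const std::string kEnd = "-----END CERTIFICATE-----";

  int count = 0;
  std::size_t pos = 0;
  while (true) {
    const std::size_t begin = pem.find(kBegin, pos);
    const std::size_t end = pem.find(kEnd, pos);

    if (begin == std::string::npos) {
      // No further BEGIN marker; a stray END marker means malformed input.
      return end == std::string::npos ? count : -1;
    }

    const std::size_t body_start = begin + kBegin.size();
    // The END marker must exist and must come after the whole BEGIN marker.
    if (end == std::string::npos || end < body_start) return -1;

    // A second BEGIN marker before the END marker means a block was not closed.
    const std::size_t next_begin = pem.find(kBegin, body_start);
    if (next_begin != std::string::npos && next_begin < end) return -1;

    ++count;
    pos = end + kEnd.size();
  }
}
```

A test could call this on each cert string before cass_ssl_add_trusted_cert() and log the result; a return of 0 or -1 would point at the cert text itself, while a positive count alongside CASS_ERROR_SSL_INVALID_CERT would point back at the OpenSSL read loop.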

 

Testing showed a significant reduction in the number of errors observed when we used the existing invalid PEM-encoded cert rather than the dummy cert created in the follow-up commit referenced below. Short-term fix is to use the existing invalid cert for the first cert in the multi-cert case and avoid using the dummy one. This doesn’t eliminate the error cited above, but it does reduce its frequency pretty significantly.

Bret McGuire 
July 19, 2023 at 7:42 PM
(edited)

The PR mentioned in the previous comment introduced the relevant behaviour, but the test in question was actually introduced in a follow-up commit from yours truly. The test might still be identifying a real issue in the original PR, or it might just be identifying a formatting issue in the test itself.

Bret McGuire 
July 18, 2023 at 10:16 PM

Appears to be caused (or at least exacerbated) by this PR. Based on some research it looks like PEM_read_bio_X509 can be a little bit flaky [1], and as a result of the change in question we’re calling it more.

 

[1] Not necessarily the fault of the implementation; it seems like some of this flakiness is due to the formatting of the relevant PEM certs. Local testing hasn’t been able to determine a version of the cert which evaluates consistently for my local OpenSSL version, however, and even if I were able to reproduce something consistent locally there’s no guarantee it would behave the same way elsewhere with other OpenSSL versions.

Fixed


Created July 18, 2023 at 10:11 PM
Updated July 26, 2023 at 5:28 PM
Resolved July 19, 2023 at 9:38 PM