SslTests.Integration_Cassandra_VerifyPeerMultipleCerts fails intermittently
Activity

Bret McGuire July 19, 2023 at 9:03 PM (edited)
Some investigation confirms that we’re actually seeing the error in this code in ssl_openssl_impl.cpp:
// Iterate over the bio, reading out as many certificates as possible.
for (X509* cert = PEM_read_bio_X509(bio, NULL, pem_password_callback, NULL); cert != NULL;
     cert = PEM_read_bio_X509(bio, NULL, pem_password_callback, NULL)) {
  X509_STORE_add_cert(trusted_store_, cert);
  X509_free(cert);
  num_certs++;
}
PEM_read_bio_X509() is returning null unexpectedly, killing this loop and leading to the following case which returns the underlying error:
// If no certificates were read from the bio, that is an error.
if (num_certs == 0) {
  ssl_log_errors("Unable to load certificate(s)");
  return CASS_ERROR_SSL_INVALID_CERT;
}
Some additional research suggests that PEM_read_bio_X509() can be a bit finicky about what it’s reading, particularly if the format of the PEM files has some subtle errors.
Testing showed a significant reduction in the number of errors observed when we used the existing invalid PEM-encoded cert rather than the dummy cert created in the follow-up commit referenced below. The short-term fix is to use the existing invalid cert for the first cert in the multi-cert case and avoid using the dummy one. This doesn’t eliminate the error cited above, but it does significantly reduce how often it occurs.
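Since the suspicion above is that subtle PEM formatting problems are involved, a purely textual framing check on the test certs can help rule out the obvious cases before they reach PEM_read_bio_X509(). The following is a minimal sketch only: scan_pem_framing and PemScan are hypothetical names, not part of the driver or of OpenSSL, and the check is deliberately stricter than OpenSSL (it flags leading or trailing whitespace on marker lines and any label other than CERTIFICATE). It does not decode or validate the certificate contents.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Result of a textual scan only; the certificate body is never decoded.
struct PemScan {
  std::size_t complete_blocks = 0;  // BEGIN ... END pairs seen in order
  bool well_framed = true;          // false on unmatched, nested, or inexact markers
};

// Hypothetical diagnostic helper (an assumption for illustration, not driver code):
// checks that every certificate block is delimited by exact marker lines.
inline PemScan scan_pem_framing(const std::string& pem) {
  static const std::string kBegin = "-----BEGIN CERTIFICATE-----";
  static const std::string kEnd = "-----END CERTIFICATE-----";
  PemScan result;
  bool inside = false;
  std::istringstream in(pem);
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
    if (line == kBegin) {
      if (inside) result.well_framed = false;  // BEGIN while a block is still open
      inside = true;
    } else if (line == kEnd) {
      if (inside) {
        ++result.complete_blocks;
      } else {
        result.well_framed = false;  // END without a matching BEGIN
      }
      inside = false;
    } else if (line.find("-----BEGIN") != std::string::npos ||
               line.find("-----END") != std::string::npos) {
      // Marker-like line that is not an exact match (extra whitespace,
      // trailing text, or a different label): flag it rather than guess.
      result.well_framed = false;
    }
  }
  if (inside) result.well_framed = false;  // input ended inside a block
  return result;
}
```

Running this over each cert string the test passes to cass_ssl_add_trusted_cert() and logging complete_blocks and well_framed would show whether the dummy cert differs structurally from the existing invalid cert. A clean result here does not mean OpenSSL will accept the input; it only narrows down where to look.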

Bret McGuire July 19, 2023 at 7:42 PM (edited)
PR mentioned in previous comment introduced the relevant behaviour but the test in question was actually introduced in a follow-up commit from yours truly. Might still be identifying a real issue in the original PR, or it might just be identifying a formatting issue in the test.

Bret McGuire July 18, 2023 at 10:16 PM
Appears to be caused (or at least exacerbated) by this PR. Based on some research it looks like PEM_read_bio_X509 can be a little bit flaky [1], and as a result of the change in question we’re calling it more often.
[1] Not necessarily the fault of the implementation; it seems like some of this flakiness is due to formatting of the relevant PEM certs. Local testing hasn’t been able to determine a version which evals consistently for my local OpenSSL version, however, and even if I were able to reproduce something consistent locally there’s no guarantee it would behave the same way elsewhere with other OpenSSL versions.

An example:
/home/jenkins/workspace/drivers_cpp_oss_master/tests/src/integration/objects/ssl.hpp:62
Expected: CASS_OK
To be equal to: cass_ssl_add_trusted_cert(get(), cert.c_str())
Which is: CASS_ERROR_SSL_INVALID_CERT [Unable to load certificate]
When running locally:
$ ./cassandra-integration-tests --gtest_filter=SslTests* --version=3.11.15 --category cassandra
Missing Category: All applicable tests will run
DSE Category Will be Ignored: DSE is not enabled [--dse]
Starting DataStax C/C++ Driver Integration Test v2.16.2
libuv v1.44.2
Logging driver messages
Apache Cassandra Version: 3.11.15
CCM Cluster Prefix: cpp-driver
Category: Cassandra
Note: Google Test filter = SslTests*:-*_DSE_*
...
[ RUN ] SslTests.Integration_Cassandra_VerifyPeerMultipleCerts
/work/git/cpp-driver/tests/src/integration/objects/ssl.hpp:62: Failure
Expected: CASS_OK
To be equal to: cass_ssl_add_trusted_cert(get(), cert.c_str())
Which is: CASS_ERROR_SSL_PROTOCOL_ERROR [Protocol error]
unknown file: Failure
C++ exception with description "Unable to Establish Session Connection: Invalid peer certificate" thrown in the test body.
[ FAILED ] SslTests.Integration_Cassandra_VerifyPeerMultipleCerts (184 ms)