Description

During the OpenSSL 1.1.1 integration it was necessary to use a per-worker SSL_CTX, instantiated in tls/tls_mod.c. This is still required for the OpenSSL 3.x integration.

This is a retrospective root cause analysis of why this duplicated SSL_CTX is needed.

Reproduction

Instead of creating one SSL_CTX per worker, have all workers share a single SSL_CTX
Observation: intermittent connection failures
Observation: if TLS is using only EC keys, the connections succeed

Root Cause Analysis

OpenSSL RSA BN operations are thread-safe (they can be used in single-process, multi-threaded applications). However, the BN operations depend crucially on each thread reporting a different pthread_self() value. pthread_self() values are only guaranteed to be unique among the threads currently running in a single process; at runtime a value can be reused once its thread has terminated, and it carries no meaning across processes.
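The uniqueness guarantee can be checked with plain pthreads. The following is a minimal sketch, not code from tls/tls_mod.c; the helper names `live_threads_have_distinct_ids` and `sequential_threads_reuse_id` are made up for illustration. The first helper compares two threads that have not yet been joined, so their IDs must differ. The second compares a joined thread with one created afterwards; whether the ID is reused depends on the pthread implementation, so treat its result as an observation rather than a guarantee.

```c
#include <pthread.h>

/* Thread body that does nothing; we only care about the thread IDs. */
static void *noop(void *arg) { return arg; }

/* Returns 1 if two not-yet-joined threads have different IDs,
 * 0 if they compare equal, -1 if a thread could not be created. */
int live_threads_have_distinct_ids(void)
{
    pthread_t a, b;
    if (pthread_create(&a, NULL, noop, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, noop, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    /* Neither thread has been joined yet, so both IDs are still valid. */
    int distinct = !pthread_equal(a, b);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return distinct;
}

/* Returns 1 if a thread created after another was joined got the same ID,
 * 0 if not, -1 on error. Comparing against an ID whose thread was already
 * joined is done here for illustration only. */
int sequential_threads_reuse_id(void)
{
    pthread_t first, second;
    if (pthread_create(&first, NULL, noop, NULL) != 0)
        return -1;
    pthread_join(first, NULL);
    if (pthread_create(&second, NULL, noop, NULL) != 0)
        return -1;
    int reused = pthread_equal(first, second) ? 1 : 0;
    pthread_join(second, NULL);
    return reused;
}
```

Calling both helpers from a small test program (linked with the pthread library where required) shows the difference between "unique while running" and "unique forever".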

When rank 0 forks the worker processes, each worker inherits a copy of the parent's thread state, so their pthread_self() values overlap. This results in invalid BN computations and leads to failure of RSA connections. In a sense the workers commit “identity theft”.
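The overlap after fork() can be observed directly. Below is a minimal sketch, assuming a POSIX system; the helper name `child_inherits_thread_id` is hypothetical and the code is unrelated to the actual worker startup path. The parent records its pthread_self() value, forks, and the child sends its own value back over a pipe for comparison.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child reports the same pthread_self() value as
 * its parent, 0 if the values differ, -1 on error. */
int child_inherits_thread_id(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pthread_t parent_id = pthread_self();

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: report our own thread ID to the parent and exit. */
        close(fds[0]);
        pthread_t child_id = pthread_self();
        ssize_t n = write(fds[1], &child_id, sizeof child_id);
        _exit(n == (ssize_t)sizeof child_id ? 0 : 1);
    }

    /* Parent: read the child's thread ID and compare it with ours. */
    close(fds[1]);
    pthread_t child_id;
    ssize_t n = read(fds[0], &child_id, sizeof child_id);
    close(fds[0]);

    int status;
    waitpid(pid, &status, 0);

    if (n != (ssize_t)sizeof child_id)
        return -1;
    return pthread_equal(parent_id, child_id) ? 1 : 0;
}
```

On common implementations such as glibc the helper is expected to return 1, because the thread handle lives in memory that fork() copies into the child. Every worker therefore reports the same ID as rank 0 and as every other worker.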

There is no mechanism in pthreads to reset the thread ids; they are opaque handles.

In contrast, OpenSSL ECDSA operations do not invoke pthread_self() and do not require unique thread IDs.

Notes

No action is required; this is purely a historical note
I have added a code comment: 29007ad
I will leave this issue up for a few days for knowledge sharing
