Hi there,
We are encountering consistent segfaults after restarting our Kamailio instance while it is receiving traffic, specifically with Kamailio 5.7.4. We believe this issue did not occur with version 5.7.2, so it appears to have been introduced in either 5.7.3 or 5.7.4.
Due to team bandwidth constraints and the potential impact on production traffic, we don't want to spend time trying to reproduce the issue, so we have decided to downgrade to 5.6.4, which we have confirmed to be stable. (5.7.2 would probably be fine too, but we didn't try it.)
Unfortunately, our logging was only set to WARNING level, and we did not capture a core dump, so we cannot provide additional details beyond the following logs:
This was with tcp_reuse_port=yes:
2024-05-17T15:42:55.582475541Z Listening on
2024-05-17T15:42:55.582512370Z [redacted]
2024-05-17T15:42:55.582538161Z tls: 10.X.X.X:5061 advertise Y.Y.Y:5061
2024-05-17T15:42:55.582543750Z Aliases:
2024-05-17T15:42:55.582549081Z tls: [redacted]:5061
2024-05-17T15:42:55.582574890Z
2024-05-17T15:42:55.587876630Z 0(1) WARNING: tls [tls_init.c:978]: tls_h_mod_init_f(): openssl bug #1491 (crash/mem leaks on low memory) workaround enabled (on low memory tls operations will fail preemptively) with free memory thresholds 18874368 and 9437184 bytes
2024-05-17T15:42:55.703927049Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 23
2024-05-17T15:42:55.703972029Z 0(1) ALERT: <core> [main.c:791]: handle_sigs(): child process 15 exited by a signal 11
2024-05-17T15:42:55.703978409Z 0(1) ALERT: <core> [main.c:795]: handle_sigs(): core was generated
2024-05-17T15:42:55.705049839Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 17
2024-05-17T15:42:55.705074209Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 21
2024-05-17T15:42:55.705081209Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 22
2024-05-17T15:42:55.705085879Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 20
2024-05-17T15:42:55.705090319Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 18
2024-05-17T15:42:55.705094649Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 19
2024-05-17T15:42:55.705098879Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 16
2024-05-17T15:42:55.705207399Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 15
2024-05-17T15:42:55.705459439Z 35(41) CRITICAL: <core> [core/pass_fd.c:281]: receive_fd(): EOF on 27
Without tcp_reuse_port=yes, the segfault was always preceded by the following line if any existing TLS connections were stuck in TIME_WAIT:
2024-05-16T19:18:51.654447639Z 9(14) WARNING: {1 1 INVITE XXX@0.0.0.0} <core> [core/tcp_main.c:1301]: find_listening_sock_info(): binding to source address 10.X.X.X:5061 failed: Address already in use [98]
2024-05-16T19:18:51.746994728Z 0(1) ALERT: <core> [main.c:791]: handle_sigs(): child process 14 exited by a signal 11
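For reference, the only difference between the two scenarios above is that one core parameter. A minimal sketch of the relevant part of our kamailio.cfg (the addresses are placeholders, not our real ones):

# minimal sketch, not our full configuration; addresses are placeholders
enable_tls=yes

# TLS listener with an advertised address, as seen in the startup log above
listen=tls:10.0.0.10:5061 advertise 203.0.113.10:5061

# with SO_REUSEPORT enabled, the restarted process could bind its source
# port even while old TLS connections were still in TIME_WAIT; without it
# we saw the "Address already in use" warning right before the crash
tcp_reuse_port=yes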
When the server wasn't handling any traffic, the issue didn't occur even in 5.7.4.
Does anyone have any insights or suggestions on how to address this issue?
Kind regards,
Stefan
For additional context:
- Our Kamailio setup receives SIP messages from one endpoint over UDP and forwards them to another endpoint over TLS, with rtpengine for RTP proxying.
- The issue only occurs on startup, e.g. after a reboot of the VM hosting our Kamailio container, after deploying an updated Docker container onto the VM, or simply after restarting the Docker container.
- Out of business hours, when the instance doesn't handle any traffic, we can't reproduce the issue.
- We've been running 5.7.4 for about two months. We did the upgrade out of hours, so the initial upgrade didn't trigger the issue. The issue occurred during redeployments yesterday and today, while the instance was handling traffic; we saw a few dozen segfaults while troubleshooting during those incidents. After business hours, we couldn't reproduce the issue.
- We've been doing regular Kamailio upgrades to the latest 5.4.x-5.7.x versions, and this instance has been around for years without any similar issues or significant configuration changes.
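Since the WARNING-level logging and the missing core dump limited what we could share above, capturing more detail on a future occurrence would look roughly like this in kamailio.cfg (a sketch only; the values are illustrative, not our actual settings):

# sketch only - illustrative values for capturing more detail next time
debug=3                # debug-level logging while trying to reproduce (noisy)
log_stderror=no        # keep logging via syslog
disable_core_dump=no   # keep core dumps enabled; in a container this also needs
                       # "ulimit -c unlimited" and a writable kernel.core_pattern path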
Hello,
In the releases after 5.7.2 there have been a lot of TLS-related changes. These were necessary due to several critical memory corruption bugs caused by implementation decisions of the OpenSSL team in version 3.x. The changes were somewhat larger than usually expected for minor releases, but ultimately necessary because of the mentioned problems. Due to their complexity, several iterations were needed to solve them completely.
So, without looking too much into the details of your issue, I suspect that the problems you are observing might be caused by these changes. It might be that, for example, a security package update changed some memory layout and triggered it.
I think these TLS changes are now complete in releases 5.7.5 and 5.8.1, and these releases should be stable again. This has been confirmed by multiple reports on our issue tracker and also in some of our customer environments.
So, I would suggest you give 5.7.5 a try. If there are still crashes on startup, please provide an update on the list or on the issue tracker.
Cheers,
Henning