This bug report concerns a Kamailio instance that's functioning as an edge proxy, accepting requests from clients over TLS and forwarding them to internal components over UDP.
I've observed occasional "deadlocks" that all take place after there were connectivity issues between Kamailio and one or more of the clients (logging messages such as "tcp_send failed", "handle_tcpconn_ev(): connect x failed", "tcp_read_data(): error reading: Connection reset by peer").
Environment:
* Ubuntu 16.04 * Kamailio 5.2.7 (5.2.7+ubuntu16.04)
Relevant config:
``` # for TLS clients socket_workers=10 listen=tls:wan.ip:TLS_PORT
# to be able to forward incoming TCP requests to UDP destinations socket_workers=10 listen=udp:wan.ip:TLS_PORT
# for RPC commands socket_workers=2 listen=tcp:127.0.0.1:HTTP_PORT ```
## Reproduction
After a lot of trial and error, I've found a way to somewhat reproducibly trigger the problem using sipp. It involves setting up a large amount of simultaneous calls dialing into the edge proxy using TLS, to a destination UA that will answer the call and keep it open for a few seconds, then using `ctrl-c` to suddenly quit sipp which will ungracefully terminate the TCP/TLS connections.
It takes me a couple of iterations to trigger the problem, e.g repeat until Kamailio starts failing:
* start sipp * terminate sipp with ctrl-c * (quickly) restart sipp * terminate sipp * etc
The amount of simultaneous calls has to be sufficiently large. In my environment I'm unable to reproduce the problem at all with 10-20 calls, while 50 calls always triggers the problem after a couple of iterations.
Example sipp commandline: `sipp -nd -sf caller.xml -d 2000 -r 50 -l 500 -m 5000 -max_socket 1000 -t ln -tls_cert cert.crt -tls_key key.key kamailio.edge.proxy:5061`
## Failure mode 1
In one of the failure modes, it appears that all available TCP/TLS workers are permanently bricked; sipp's calls will simply timeout and not trigger any dialplan handling.
Interestingly, even though I have dedicated workers for TLS (SIP client) and TCP (RPC commands), and my reproduction scenario doesn't connect to the HTTP_PORT at all, this failure scenario bricks the TCP (RPC) listeners as well; any curl requests to it will simply timeout as well.
Kamailio will hang for a while on (systemd) restart.
## Failure mode 2
In another failure mode, it looks like the TCP workers are unaffected but the UDP perform forwarding to UDP destinations are broken. Sipp's calls will enter Kamailio, but forwarding (`relay()`) to the next hop will timeout even though the destination is functioning just fine. Since this failure mode doesn't break RPC I was able to enable debug logging:
``` 2020-07-07T17:26:58.770411+02:00 kamailio[24056]: INFO: From Customer 2020-07-07T17:26:58.770697+02:00 kamailio[24056]: DEBUG: pv [pv_trans.c:590]: tr_eval_string(): i=0 j=2 2020-07-07T17:26:58.770971+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2055]: ds_manage_routes(): set [4] 2020-07-07T17:26:58.771248+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2161]: ds_manage_routes(): using alg [0] hash [250849449] 2020-07-07T17:26:58.771535+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2205]: ds_manage_routes(): selected [0-4-0/1] sip:wan.ip:8065 2020-07-07T17:26:58.771817+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2226]: ds_manage_routes(): using first entry [4/1] 2020-07-07T17:26:58.772090+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2269]: ds_manage_routes(): using entry [4/0] 2020-07-07T17:26:58.772362+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2016]: ds_select_dst_limit(): selected target destinations: 2 2020-07-07T17:26:58.773001+02:00 kamailio[24056]: DEBUG: pv [pv_core.c:2611]: pv_set_force_sock(): trying to set send-socket to [udp:wan.ip:5061] 2020-07-07T17:26:58.773353+02:00 kamailio[24056]: DEBUG: <core> [core/socket_info.c:559]: grep_sock_info(): checking if host==us: 13==13 && [wan.ip] == [wan.ip] 2020-07-07T17:26:58.773829+02:00 kamailio[24056]: DEBUG: <core> [core/socket_info.c:566]: grep_sock_info(): checking if port 5061 (advertise 0) matches port 5061
2020-07-07T17:26:58.774122+02:00 kamailio[24056]: NOTICE: Relaying (cid: 1-9151@sipp.ip). 2020-07-07T17:26:58.774417+02:00 kamailio[24056]: DEBUG: siputils [checks.c:120]: has_totag(): no totag 2020-07-07T17:26:58.774708+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:1344]: t_newtran(): msg (0x7f69d1cc4980) id=1265/24056 global id=1265/24056 T start=(nil) 2020-07-07T17:26:58.774989+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:499]: t_lookup_request(): start searching: hash=43232, isACK=0 2020-07-07T17:26:58.775315+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:457]: matching_3261(): RFC3261 transaction matching failed - via branch [z9hG4bK-9151-1-1] 2020-07-07T17:26:58.775612+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:682]: t_lookup_request(): no transaction found 2020-07-07T17:26:58.775908+02:00 kamailio[24056]: DEBUG: <core> [core/md5utils.c:67]: MD5StringArray(): MD5 calculated: 0a5d4a8d13f9606523da9c4d193b2261 2020-07-07T17:26:58.776270+02:00 kamailio[24056]: DEBUG: <core> [core/dns_cache.c:3197]: dns_srv_sip_resolve(): (wan.ip, 0, 0), ip, ret=0 2020-07-07T17:26:58.776701+02:00 kamailio[24056]: DEBUG: <core> [core/msg_translator.c:1765]: check_boundaries(): no multi-part body 2020-07-07T17:26:58.777012+02:00 kamailio[24056]: DEBUG: <core> [core/msg_translator.c:2935]: create_via_hf(): id added: <;i=a701>, rcv proto=3 2020-07-07T17:26:58.777596+02:00 kamailio[24056]: DEBUG: tm [t_funcs.c:375]: t_relay_to(): new transaction forwarded 2020-07-07T17:26:58.777948+02:00 kamailio[24056]: DEBUG: <core> [core/receive.c:354]: receive_msg(): request-route executed in: 1927 usec 2020-07-07T17:26:58.778243+02:00 kamailio[24056]: DEBUG: <core> [core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil) 2020-07-07T17:26:58.779265+02:00 kamailio[24056]: message repeated 5 times: [ DEBUG: <core> [core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)] 2020-07-07T17:26:58.779431+02:00 kamailio[24056]: DEBUG: <core> [core/xavp.c:507]: xavp_destroy_list(): destroying xavp list (nil) 2020-07-07T17:26:58.779753+02:00 kamailio[24056]: DEBUG: <core> [core/receive.c:458]: receive_msg(): cleaning up
2020-07-07T17:27:00.705303+02:00 kamailio[24053]: DEBUG: tm [t_reply.c:1266]: t_should_relay_response(): ->>>>>>>>> T_code=0, new_code=408 2020-07-07T17:27:00.705917+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]: t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T start=0x7f69b10167c0 2020-07-07T17:27:00.706187+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]: t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)! 2020-07-07T17:27:00.706533+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:539]: pv_get_tm_reply_code(): reply code is <408> 2020-07-07T17:27:00.706879+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]: t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T start=0x7f69b10167c0 2020-07-07T17:27:00.707198+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]: t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)! 2020-07-07T17:27:00.707522+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:594]: pv_get_tm_reply_reason(): reply reason is [Request Timeout]
{ "DSC": "tcp receiver (tls:wan.ip:5061) child=0", "IDX": 14, "PID": 24056 },
{ "DSC": "slow timer", "IDX": 11, "PID": 24053 }, ```
Kamailio will restart (systemd) normally and everything will function normally again afterwards.
If the RPC worker is not blocked, can you get the output of `kamctl trap`? It is written in a text file, path being printed at the end of command execution.
Is the Ubuntu 16.04 with latest security fixes? Iirc, on another deployment I had access to, it somehow messed up the use of libssl, because it ships normally with libssl 1.0, but I think libmysqlclient-dev enforced the installation of libssl 1.1.
All libssl packages are `1.0.2g-1ubuntu4.16`. Attached is a trap file for both failure modes (with some manual hacking to retrieve pids for failure mode 1). Although I suspect (but didn't verify yet as I don't normally use plaintext SIP over TCP) the issue is with the TCP handling rather than anything SSL specific.
[trap_failuremode1.txt](https://github.com/kamailio/kamailio/files/4886100/trap_failuremode1.txt) [trap_failuremode2.txt](https://github.com/kamailio/kamailio/files/4886103/trap_failuremode2.txt)
I looked at one trace and the blocking leads to the libmysqlclient:
``` #0 0x00007fa250dd3730 in futex_wait (private=<optimized out>, expected=0, futex_word=0x7fa2523a2d1c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61 __ret = -512 err = <optimized out> #1 futex_wait_simple (private=<optimized out>, expected=0, futex_word=0x7fa2523a2d1c) at ../sysdeps/nptl/futex-internal.h:135 No locals. #2 __pthread_rwlock_wrlock_slow (rwlock=0x7fa2523a2d10) at pthread_rwlock_wrlock.c:67 waitval = 0 result = 0 #3 0x00007fa251b11d30 in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20 No symbol table info available. #4 0x00007fa2510468e7 in CRYPTO_add_lock () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 No symbol table info available. #5 0x00007fa251466cfc in SSL_CTX_free () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 No symbol table info available. #6 0x00007fa251468ff0 in SSL_free () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 No symbol table info available. #7 0x00007fa24ef6a38a in tls_h_tcpconn_clean (c=0x7fa2535b8998) at tls_server.c:655 extra = 0x7fa2535ca9c8 __func__ = "tls_h_tcpconn_clean" #8 0x00000000004ed791 in _tcpconn_free (c=0x7fa2535b8998) at core/tcp_main.c:1352 __func__ = "_tcpconn_free" #9 0x00000000004ff708 in tcpconn_put_destroy (tcpconn=0x7fa2535b8998) at core/tcp_main.c:3030 __func__ = "tcpconn_put_destroy" ```
Can you get the output for:
``` ldd /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20 ```
There were couple of similar reports for ubuntu 16.04 with security updates, having issues when connecting to mysql server via IP sockets. You can check the latest commits to db_mysql module done in master branch (backported to 5.3, iirc), among them leading to addition of opt_ssl_mode parameter. You can try to connect to mysql via unix socket file, if you have the mysql server on the same system as kamailio (use `localhost` instead of `127.0.0.1` in db url). I spent quite some time analyzing such a similar issue on ubuntu 16.04 with security updates, trying to track changes in libssl/libcrypto and libmysqlclient, but at the end the solution was to upgrade ubuntu not to have a mixture of libraries built with libssl1.0 and couple of them with 1.1.
Thanks so far!
``` $ ldd /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20 linux-vdso.so.1 => (0x00007ffdb0fc4000) libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff76e52c000) libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ff76e312000) libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007ff76e0a9000) libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007ff76dc64000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff76da47000) libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff76d6c5000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff76d3bc000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff76d1a6000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff76cddc000) /lib64/ld-linux-x86-64.so.2 (0x00007ff76ece7000) ```
In failure mode 1, I see only 1 process stuck in libmysqlclient, in failure mode 2 only 3. Could that explain Kamailio to be deadlocked completely (on either TLS or UDP level) even though there are more workers available?
Do those commits in the db_mysql module have any chance of being backported to 5.2 as well?
I see that libmysqlclient is linked against libssl 1.0, so doesn't seem to be mixed libs.
A blocked process can block the instance, if they acquired other mutexes (e.g., a mutex used to protect other tcp/tls connections). In the traces, only one process seems to be blocked, so for some cases it should still work, but not easy to say which should work and what is locked.
Probably the db_mysql commits will be backported before the next 5.2.x release.
In master branch I also aded `kamctl pstrap` command to use ps for getting the list of pids for kamailio and write the gdb backtrace file.
You can also try installing the packages with debugging symbols for libssl and libmysqlclient, then get the trap again to see more details about what happens inside libssl and libmysqlclient. Now it seems that some function is overwritten from inside libmysqlclient.
Looks like there's actually no libmysqlclient-dbg package :-/ I'll see what I can do to try and reproduce/pinpoint this further and report back.
OK, this turns out to be a lot more complicated. I'll try to describe as accurately as possible.
Turns out SSL support was already disabled on the server-side, so in all above examples there was no SSL used for the database connection. Although that doesn't rule out libssl still being involved in the db code, to _try_ to negotiate a SSL connection.
I was able to upgrade to Kamailio 5.3 without any problems, with exactly the same behavior as before; able to reproduce both failure modes.
With `modparam("db_mysql", "opt_ssl_mode", 1)` I can still easily reproduce failure mode 1.
I'm struggling to generate "accurate" backtraces for that, because either I'm crazy, or Kamailio keeps on "fixing itself" at _exactly_ the moment I decide to run `pstrap`. It can be frozen for minutes, not responding to any TCP socket at all, then suddenly in the middle of the gdb process print a bunch of messages like "tcpconn_1st_send(): connect 157.245.69.206:40573 failed (RST) Connection refused" then continue to function again.
Nonetheless, even with `modparam("db_mysql", "opt_ssl_mode", 1)` I see libmysqlclient in the backtrace coming from various ssl functions, e.g.
``` #3 0x00007fe9a9c87d30 in ?? () from /usr/lib/x86_64-linux-gnu/libmysqlclient.so.20 #4 0x00007fe9a9275c63 in ssleay_rand_add (buf=0x7ffcf0a04b18, num=8, add=0) at md_rand.c:230 ```
Finally, so far I'm completely unable to reproduce the issue when `db_mysql.so` is not loaded! Looks like having both db_mysql.so and tls.so loaded causes some kind of conflict..
That's a lot of input - does this help and/or is there any avenue you'd like me to debug further?
Nonetheless, even with `modparam("db_mysql", "opt_ssl_mode", 1)` I see libmysqlclient in the backtrace coming from various ssl functions, e.g.
Note that libmysqlclient is somehow intertwining itself with a SIP-TLS connection here. This Kamailio instance only uses db_mysql for the dispatcher module, so it only actually queries the database once on startup. All subsequent TCP/TLS connections, as seen in my backtraces, are SIP clients.
To be sure: are you using mysql-client or maridb-client library?
MySQL client.
ii libmysqlclient20:amd64 5.7.30-0ubuntu0.16.04.1 amd64 MySQL database client library
Something seems to be strange with libmysqlclient and libssl on ubuntu 16.04, especially after latest security updates. I spent a bit of time and could track it down, not being familiar with its code. At the end, the system I troubleshooted was upgraded to a newer Ubuntu. Libssl was in a rather fluid state during past few years, between 1.0 and 1.1 releases, then libmysqlclient also was sometimes compiled with another ssl lib. New versions of distros shifted to mariadb, which doesn't seem to expose any similar issue.
I think I figured out why all these issues are popping up all of a sudden.
A couple of months ago, Xenial's libmysqlclient 5.7.28-0ubuntu0.16.04.2 made the following change:
``` * debian/rules: removed -DWITH_SSL=bundled, option no longer works. * debian/control: add libssl-dev to Build-Depends. ```
Before that, the `WITH_SSL=bundled` option was the default, and "bundled" didn't actually use OpenSSL but yaSSL (see https://mysqlserverteam.com/mysql-is-openssl-only-now/)!
So in the past, the db_mysql module would use YaSSL and the tls module would use OpenSSL and everyone was happy. But now suddenly both modules use OpenSSL (1.0.2, in Xenial) and there's a conflict with how both of them handle locking callbacks.
Thanks fo digging further on this one and updating here
Does it depend on libssl 1.0.2? Because, iirc, I noticed dependency on libssl 1.1.x where the most of the other packages were using libssl 1.0.x, but maybe that was on Ubuntu 18.04
I guess that was Ubuntu 18 indeed, on Ubuntu 16 I've never seen anything other than 1.0.2. See https://packages.ubuntu.com/search?keywords=libssl&searchon=names&su...
I can confirm that after downgrading libmysqlclient20 to `5.7.26-0ubuntu0.16.04.1` the problem seems to be gone - at the very least my sipp stress test no longer manages to cause any deadlocks.
Related to report: https://lists.kamailio.org/pipermail/sr-users/2020-August/110179.html
Around Aug 24 2020, we've upgraded to Ubuntu 18 + mysql + all updates + kamailio 5.3. and the issue is gone.
Closing, based on @arkadiam comment, there were updates of dependency packages that seem to get rid of this behaviour.
Closed #2395.