### Description
After upgrading to 5.3.0 from 5.2.2 (standard packages on FreeBSD 12.0), I am experiencing intermittent crashes related to handling of BYE messages.
### Troubleshooting
#### Reproduction
This happens ~ weekly and I have not found a good way to reproduce it.
#### Debugging Data
An example backtrace is below from the last dumped core (the SIGSEGV one); unfortunately it overwrote the earlier one: ``` * thread #1, name = 'kamailio', stop reason = signal SIGSEGV * frame #0: 0x00000008009c5b79 libc.so.7`___lldb_unnamed_symbol403$$libc.so.7 + 41 frame #1: 0x00000008009ed63e libc.so.7`__free + 990 frame #2: 0x000000080271562b libthr.so.3`pthread_rwlock_destroy + 59 frame #3: 0x0000000802bedbf6 libcrypto.so.111`CRYPTO_THREAD_lock_free + 22 frame #4: 0x0000000802aef3c4 libcrypto.so.111`RSA_free + 100 frame #5: 0x0000000802b10c32 libcrypto.so.111`EVP_PKEY_free + 66 frame #6: 0x000000080296ed86 libssl.so.111`___lldb_unnamed_symbol646$$libssl.so.111 + 134 frame #7: 0x000000080295f93c libssl.so.111`SSL_CTX_free + 236 frame #8: 0x00000008028aee42 tls.so`tls_free_domain + 114 frame #9: 0x00000008028af1d7 tls.so`tls_free_cfg + 199 frame #10: 0x00000008028af2df tls.so`tls_destroy_cfg + 191 frame #11: 0x00000008028ad1f1 tls.so`destroy_tls_h + 1185 frame #12: 0x000000000041adea kamailio`destroy_tls + 26 frame #13: 0x00000000002e36fd kamailio`cleanup + 269 frame #14: 0x00000000002eb5b7 kamailio`___lldb_unnamed_symbol5$$kamailio + 1351 frame #15: 0x00000000002ea5e5 kamailio`handle_sigs + 21669 frame #16: 0x00000000002fb83e kamailio`main_loop + 40014 frame #17: 0x0000000000307d2b kamailio`main + 50267 frame #18: 0x00000000002e311b kamailio`_start + 283 ```
This is with OpenSSL 1.1 With the LD_PRELOAD hack to 5.2.2, things were completely stable; I am trying to use kamailio without the LD_PRELOAD'ed mutex wrapper now, which I believe is no longer required. It looks like the SSL-related stuff in the TLS crash (which was 5 minutes later!) is unrelated to the initial problem and may just be an artifact of one of the kamailio processes crashing earlier.
#### Log Messages
``` Nov 2 08:11:31 home /usr/local/sbin/kamailio[94702]: CRITICAL: {1 527440 BYE 973470944-5061-16392@BA.A.B.I} <core> [core/mem/q_malloc.c:149]: qm_debug_check_frag(): BUG: qm: prev. fragm. tail overwritten(c0c0c000, abcdefed)[0x801544c58:0x801544c90]! Memory allocator was called from core: core/action.c:754. Fragment marked by core: core/dset.c:733. Exec from core/mem/q_malloc.c:504. Nov 2 08:13:41 home /usr/local/sbin/kamailio[94703]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 22 Nov 2 08:13:41 home kernel: pid 94702 (kamailio), uid 0: exited on signal 6 (core dumped) Nov 2 08:13:41 home /usr/local/sbin/kamailio[94692]: ALERT: <core> [main.c:767]: handle_sigs(): child process 94702 exited by a signal 6 Nov 2 08:13:41 home /usr/local/sbin/kamailio[94692]: ALERT: <core> [main.c:770]: handle_sigs(): core was generated Nov 2 08:14:56 home login[8284]: ROOT LOGIN (root) ON ttyu0 Nov 2 08:16:26 home kernel: pid 94692 (kamailio), uid 0: exited on signal 11 (core dumped) ```
I had an identical problem a week ago, also with a crash on a BYE for an active call:
``` Oct 27 13:00:02 home /usr/local/sbin/kamailio[79819]: CRITICAL: {1 598425 BYE 649761149-5061-291@BA.A.B.I} <core> [core/mem/q_malloc.c:149]: qm_debug_check_frag(): BUG: qm: prev. fragm. tail overwritten(c0c0c000, abcdefed)[0x801551808:0x801551840]! Memory allocator was called from core: core/action.c:754. Fragment marked by core: core/dset.c:733. Exec from core/mem/q_malloc.c:504. Oct 27 13:02:09 home /usr/local/sbin/kamailio[79820]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 22 Oct 27 13:02:09 home kernel: pid 79819 (kamailio), uid 0: exited on signal 6 (core dumped) Oct 27 13:02:09 home /usr/local/sbin/kamailio[79809]: ALERT: <core> [main.c:767]: handle_sigs(): child process 79819 exited by a signal 6 Oct 27 13:02:09 home /usr/local/sbin/kamailio[79809]: ALERT: <core> [main.c:770]: handle_sigs(): core was generated Oct 27 13:04:55 home kernel: pid 79809 (kamailio), uid 0: exited on signal 11 (core dumped) ```
### Possible Solutions
<!-- If you found a solution or workaround for the issue, describe it. Ideally, provide a pull request with a fix. -->
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 5.3.0 (x86_64/freebsd) 4cc67a flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, select, kqueue. id: 4cc67a compiled on 18:51:34 Oct 25 2019 with cc 6.0 ```
* **Operating System**:
FreeBSD 12.0
``` FreeBSD home.XXX 12.0-RELEASE-p10 FreeBSD 12.0-RELEASE-p10 GENERIC amd64 ```
The backtrace is from shutdown procedure after the crash happens, not revealing the real reason for crash, because the first core is overwritten by the one produced by shutdown cleanup. You have to enable one core file per pid (per process). I am not sure how is done in freebsd, on linux is done with:
``` echo "1" > /proc/sys/kernel/core_uses_pid ``` The you should get at least another corefile with the issue from runtime.
The log messages suggest a buffer overflow is done somewhere in kamailio code (not related to tls at all).
I agree that it is probably not TLS related, but it was all I had. I'll wait for a better core, will probably have one in ~ a week again; on FreeBSD, this is the relevant setting in case it comes up again:
``` sysctl kern.corefile='%N.%P.core' ```
Here is the actual backtrace, after ~ half a week this time:
``` (lldb) bt * thread #1, name = 'kamailio', stop reason = signal SIGABRT * frame #0: 0x0000000800a8645a libc.so.7`_thr_kill + 10 frame #1: 0x0000000800a84844 libc.so.7`_raise + 52 frame #2: 0x00000008009f7079 libc.so.7`abort + 73 frame #3: 0x0000000000706ffe kamailio`___lldb_unnamed_symbol817$$kamailio + 5934 frame #4: 0x0000000000708b32 kamailio`qm_free + 6418 frame #5: 0x00000000005ab45c kamailio`do_action + 38716 frame #6: 0x0000000801ea45a9 pv.so`pv_set_ruri + 2777 frame #7: 0x000000000057e80c kamailio`___lldb_unnamed_symbol454$$kamailio + 6172 frame #8: 0x000000000057ae98 kamailio`lval_assign + 3176 frame #9: 0x00000000005bf4e6 kamailio`do_action + 120774 frame #10: 0x00000000005c278d kamailio`run_actions + 3245 frame #11: 0x00000000005af198 kamailio`do_action + 54392 frame #12: 0x00000000005c278d kamailio`run_actions + 3245 frame #13: 0x00000000005af198 kamailio`do_action + 54392 frame #14: 0x00000000005c278d kamailio`run_actions + 3245 frame #15: 0x00000000005a9823 kamailio`do_action + 31491 frame #16: 0x00000000005c278d kamailio`run_actions + 3245 frame #17: 0x00000000005c3493 kamailio`run_top_route + 179 frame #18: 0x000000000036553c kamailio`receive_msg + 24332 frame #19: 0x000000000064b938 kamailio`receive_tcp_msg + 312 frame #20: 0x000000000065132c kamailio`tcp_read_req + 13740 frame #21: 0x0000000000664971 kamailio`___lldb_unnamed_symbol650$$kamailio + 10817 frame #22: 0x000000000065e671 kamailio`___lldb_unnamed_symbol637$$kamailio + 4561 frame #23: 0x0000000000657954 kamailio`tcp_receive_loop + 1556 frame #24: 0x00000000004b9fed kamailio`tcp_init_children + 3805 frame #25: 0x00000000002fa5df kamailio`main_loop + 35311 frame #26: 0x0000000000307d2b kamailio`main + 50267 frame #27: 0x00000000002e311b kamailio`_start + 283 ```
This time the crash happened on ACK, rather than BYE:
``` Nov 5 19:13:03 home /usr/local/sbin/kamailio[16259]: CRITICAL: {1 952554 ACK !! :R7p-RGMbyGRCR9j0aFpFaFhbakp0yAK0yAKYR7pVRcNL} <core> [core/mem/q_malloc.c:149]: qm_debug_check_frag(): BUG: qm: prev. fragm. tail overwritten(c0c0c000, abcdefed)[0x801544380:0x8015443b8]! Memory allocator was called from core: core/action.c:754. Fragment marked by core: core/dset.c:733. Exec from core/mem/q_malloc.c:504. Nov 5 19:14:56 home /usr/local/sbin/kamailio[16260]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 22 Nov 5 19:14:56 home kernel: pid 16259 (kamailio), uid 0: exited on signal 6 (core dumped) ```
I am not familiar with lldb, in case next command doesn't work, then maybe you can get gdb and use it on your system ... so I need the output for the next command in gdb:
``` p *( (struct qm_frag*) ( ((char*)(0x801544380)-sizeof(struct qm_frag_end)) - ((struct qm_frag_end*)((char*)(0x801544380)-sizeof(struct qm_frag_end)))->size - sizeof(struct qm_frag) ) ) ```
Here you are, apologies for the delay:
``` (gdb) p *( (struct qm_frag*) ( ((char*)(0x801544380)-sizeof(struct qm_frag_end)) - ((struct qm_frag_end*)((char*)(0x801544380)-sizeof(struct qm_frag_end)))->size - sizeof(struct qm_frag) ) ) $1 = {size = 32, u = {nxt_free = 0x0 <tsl>, is_free = 0}, file = 0x208a40 "core: core/parser/msg_parser.c", func = 0x2059c3 "set_dst_uri", mname = 0x2067f2 "core", line = 752, check = 4042322160} ```
I've had several more samples of this crash in the last few days, all in the same place, so it doesn't appear to be random. Please let me know if there is anything else I can provide -- I'll have to roll back to 5.2.4 soon.
Do you do any transformations or other operations with `$du` variable in kamailio.cfg?
No. It is checked against `$null` in one place, but that's it.
Can you list the modules you are loading in the configuration file? It seems there is an operation writing an extra 0 in the dst_uri.
Here is the list. This seems to only happen on BYE and ACK, usually BYE for whatever reason. Out of curiosity, why were you asking about `$du` earlier rather than `$ru`? Isn't setting `$ru` what calls `pv_set_ruri()`?
``` loadmodule "jsonrpcs.so" loadmodule "kex.so" loadmodule "corex.so" loadmodule "tm.so" loadmodule "tmx.so" loadmodule "sl.so" loadmodule "rr.so" loadmodule "pv.so" loadmodule "maxfwd.so" loadmodule "usrloc.so" loadmodule "registrar.so" loadmodule "textops.so" loadmodule "siputils.so" loadmodule "xlog.so" loadmodule "sanity.so" loadmodule "ctl.so" loadmodule "cfg_rpc.so" loadmodule "acc.so" loadmodule "counters.so"
loadmodule "db_text.so" loadmodule "enum.so"
loadmodule "auth.so" loadmodule "auth_db.so"
loadmodule "alias_db.so"
loadmodule "nathelper.so" loadmodule "rtpproxy.so"
loadmodule "tls.so" ```
I've reverted back to 5.2.4 for the time being, but could have another shot at debugging later.
It is about dst uri ($du) because the error message is about previous memory fragment:
``` qm_debug_check_frag(): BUG: qm: prev. fragm. tail overwritten(c0c0c000, abcdefed)[0x801544380:0x8015443b8] ``` And looking at the previous fragment, the details show it was allocated by set_dst_uri():
``` $1 = {size = 32, u = {nxt_free = 0x0 <tsl>, is_free = 0}, file = 0x208a40 "core: core/parser/msg_parser.c", func = 0x2059c3 "set_dst_uri", mname = 0x2067f2 "core", line = 752, check = 4042322160} ```
Can you print the sip_msg structure that is given as parameter to pv_set_ruri() in frame 6? It should be named msg or so ... the backtrace you pasted doesn't have the usual format printed by GNU gdb. Can you use GNU gdb and give the backtrace printed by it? Because I am familiar with that one and I can assist better with troubleshooting.
This is still present with 5.3.1, at least. I missed your last comment earlier, but will have a go at it soon.
Any chance to get soon the requested details? Trying to solve asap the old reported issues for releasing a new version...
If still an issue with latest versions, reopen and attach the requested details. Closing now due to 2 months of no follow up, during which there were many code updates.
Closed #2121.
I finally isolated the problem here. I was trying to deal with some NAT nonsense involving dropped Contact headers from my SIP trunk provider and had a block like this in the WITHINDLG route:
``` if (uri==myself && $ru=~";alias=10") { # Deal with BS on ACKs handle_ruri_alias(); if($du != $null) { $ru = $du; } } ```
The crash was happening in here. I've since switched to a different method of dealing with this and the problem has gone away.