### Description
Kamailio suddenly stops working (this is a production server so debug logging is disabled), when we restart it a core is generated and a segfault is logged.
### Troubleshooting
The only thing I have seen is some reference to IPv6 in the backtrace. But no idea where to start looking. I'm pretty sure we will need more information, but I don't know how to read or understand backtraces, please let me know what you see in it so I can dig deeper into this problem.
#### Reproduction
I don't know how to reproduce yet, first I have to understand the cause.
#### Debugging Data
Backtrace: ``` [New LWP 12052] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/kamailio -P /var/run/kamailio/kamailio.pid -f /etc/kamailio/sbc.cfg -'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007efcf3b5df69 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 #0 0x00007efcf3b5df69 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. #1 0x00007efcf3b5e234 in OPENSSL_cleanup () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. #2 0x00007efcf7721910 in __run_exit_handlers (status=0, listp=0x7efcf7a855d8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:83 atfct = <optimized out> onfct = <optimized out> cxafct = <optimized out> f = <optimized out> #3 0x00007efcf772196a in __GI_exit (status=<optimized out>) at exit.c:105 No locals. #4 0x000055f0a9483d54 in handle_sigs () at main.c:699 chld = 845005216 chld_status = 845005248 any_chld_stopped = 32766 memlog = 0 __func__ = "handle_sigs" #5 0x000055f0a948ea43 in main_loop () at main.c:1758 i = 8 pid = 12091 si = 0x0 si_desc = "udp receiver child=7 sock=[2602:FF37:0:1:0:0:C601:377A]:5060\000\367v\215\200\262+\363\372~\000\000\025\067\236\003\000\000\000\000\300\017H\251\360U\000\000\340\305]2\376\177", '\000' <repeats 18 times>, "\020\303]2\376\177\000\000a\202g\251\360U\000" nrprocs = 8 woneinit = 1 __func__ = "main_loop" #6 0x000055f0a9494fb9 in main (argc=13, argv=0x7ffe325dc5e8) at main.c:2646 cfg_stream = 0x55f0aa7fb010 c = -1 r = 0 tmp = 0x7ffe325dceb6 "" tmp_len = -133383536 port = 32508 proto = 845006016 options = 0x55f0a97d80e8 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:" ret = -1 seed = 3679304743 rfd = 4 debug_save = 0 debug_flag = 0 dont_fork_cnt = 0 n_lst = 0x0 p = 0xffffffff <error: Cannot access memory at address 0xffffffff> st = {st_dev = 19, st_ino = 14769, st_nlink = 2, st_mode = 16832, st_uid = 109, st_gid = 113, __pad0 = 0, st_rdev = 0, st_size = 40, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1498523393, tv_nsec = 731915583}, st_mtim = {tv_sec = 1498533865, tv_nsec = 274210743}, st_ctim = {tv_sec = 1498533865, tv_nsec = 274210743}, __glibc_reserved = {0, 0, 0}} __func__ = "main" ```
Info locals: ``` [New LWP 12052] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/kamailio -P /var/run/kamailio/kamailio.pid -f /etc/kamailio/sbc.cfg -'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007efcf3b5df69 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. ```
List: ``` [New LWP 12052] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/kamailio -P /var/run/kamailio/kamailio.pid -f /etc/kamailio/sbc.cfg -'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007efcf3b5df69 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 ```
#### Log Messages
We have regular logging with no issues, then suddenly we get:
``` Jun 28 01:57:51 kam-1 sbc[12075]: WARNING: <script>: [DISPATCHER] - Destination down: OPTIONS sip:A.A.A.A:5060 (<null>) Jun 28 01:57:51 kam-1 sbc[12075]: WARNING: <script>: [DISPATCHER] - Destination down: OPTIONS sip:B.B.B.B:5060 (<null>) Jun 28 01:58:40 kam-1 sbc[12076]: NOTICE: <script>: Node unavailable, trying next sip:A.A.A.A:5060 - M=INVITE R=sip:+12345678900@A.A.A.A:5060 ID=334f3vqDtj Jun 28 01:58:40 kam-1 sbc[12076]: NOTICE: <script>: Forwarding request - M=INVITE R=sip:+12345678900@A.A.A.A:5060 ID=334f3vqDtj Jun 28 01:58:55 kam-1 sbc[12075]: WARNING: dialog [dlg_handlers.c:1574]: dlg_ontimeout(): timeout for dlg with CallID '00g4gNy5KZ' and tags 'FHmrD-CM3' 'SZ3a5Sg58c5pB' Jun 28 01:58:55 kam-1 sbc[12075]: WARNING: dialog [dlg_handlers.c:1574]: dlg_ontimeout(): timeout for dlg with CallID 'Kj7uLRzJfd' and tags 'wRSVjfGaT' 'yQHvjNaZ5Z3rc' Jun 28 01:58:55 kam-1 sbc[12075]: WARNING: dialog [dlg_handlers.c:1574]: dlg_ontimeout(): timeout for dlg with CallID 'S5QUIXtS6-' and tags 'FV4ySEouz' 'eS7F910ycNg8H' Jun 28 01:59:10 kam-1 sbc[12075]: ERROR: <script>: No available gateways - M=INVITE R=sip:+12345678900@B.B.B.B:5060 ID=334f3vqDtj Jun 28 02:01:45 kam-1 sbc[12080]: NOTICE: dialog [dlg_hash.c:248]: dlg_clean_run(): dialog in delete state is too old (0x7efaf3904008 ref 1) Jun 28 02:03:15 kam-1 sbc[12080]: NOTICE: dialog [dlg_hash.c:248]: dlg_clean_run(): dialog in delete state is too old (0x7efaf40f0f50 ref 1) Jun 28 02:04:45 kam-1 sbc[12080]: NOTICE: dialog [dlg_hash.c:248]: dlg_clean_run(): dialog in delete state is too old (0x7efaf4d92108 ref 1) Jun 28 02:55:59 kam-1 sbc[12071]: INFO: <core> [main.c:823]: sig_usr(): signal 15 received Jun 28 02:55:59 kam-1 sbc[12052]: NOTICE: <core> [main.c:696]: handle_sigs(): Thank you for flying kamailio!!! ```
NOTE: The process stops logging at 02:04:45, then at 02:55:59 we manually restart kamailio and a core is generated.
Also, in `dmesg` we have:
``` [138953.744199] kamailio[12052]: segfault at 7efaf32f51b8 ip 00007efcf3b5df69 sp 00007ffe325dc0c0 error 4 in libcrypto.so.1.1[7efcf39f6000+265000] ```
#### SIP Traffic
I don't have sip traffic, If required I can start capturing it.
### Additional Information
``` # kamailio -v version: kamailio 5.0.2 (x86_64/linux) flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: unknown compiled with gcc 6.3.0 # ```
``` # dpkg -l | grep kamailio ii kamailio 5.0.2+stretch amd64 very fast, dynamic and configurable SIP server ii kamailio-dbg:amd64 5.0.2+stretch amd64 very fast and configurable SIP server [debug symbols] ii kamailio-extra-modules:amd64 5.0.2+stretch amd64 Extra modules for the Kamailio SIP Server ii kamailio-tls-modules:amd64 5.0.2+stretch amd64 TLS support for the Kamailio SIP server (authentication, transport) # ```
``` # dpkg -l |grep libssl ii libssl1.0.2:amd64 1.0.2l-2 amd64 Secure Sockets Layer toolkit - shared libraries ii libssl1.1:amd64 1.1.0f-3 amd64 Secure Sockets Layer toolkit - shared libraries# ```
* **Operating System**:
``` # uname -a Linux kam-1 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) x86_64 GNU/Linux ```
``` # lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 9.0 (stretch) Release: 9.0 Codename: stretch # ```
I'm now capturing traffic in case it happens again.
I'm wondering, is this a kamailio issue or a debian stretch libssl1.1 issue?
The same problem occurred on the second SBC, I have the packet capture from that moment. Please let me know what information you need.
The segfault happens at shutdown, which should be avoided if possible, but it is not relevant for what happens at runtime.
From the log messages, it seems that you use dispatcher (or some other routing module) and there is no active gateway where the sip request should be sent.
What you have seen in the packet capture? Are the SIP requests replied? Or no response from Kamailio?
Do you have any monitoring tool that may stop kamailio? The signal 15 (TERM) is the usual way to stop an application.
In the capture at some point Kamailio stops responding to any request. We have Nagios sending OPTIONS requests, when we get the CRITICAL because Kamailio doesn't reply, then we manually restart it (so the signal 15 TERM is caused in the restart).
We have a dispatcher group with 2 nodes, if both are down, Kamailio should reply with a 503.
I can send you the pcap in private, I would have to hide a lot of data to make post it here.
Also, I'm pretty sure the problem will reproduce, when it is happening how can I get some traces or something useful to send to you before we restart it?
Have you enabled one core per process (pid)? If not, core files from other processes may be overwritten by the main process doing cleanup at shut down. Anyhow, if a worker process crashes, you should see a log message in syslog. not only the one about signal 15.
If you don't get a response, then the sip trace is not that useful for the moment.
First thing to double check -- if you enabled pike module, be sure you don't block the traffic that is trusted.
Otherwise, if you can reproduce, do following:
* when kamailio runs fine, do `kamctl ps` to get the list of running processes * when kamailio stops routing sip traffic, attach with gdb to a worker process for your kind of traffic (UDP, TCP ... for example for UDP, attach to the PID listed for the process with ID 1) and get the backtrace:
``` gdb /path/to/kamailio PID bt full ```
Also see if the CPU is at high load. That could be an indication of a dead (spin) lock. If CPU is low, maybe there is some blocking operation dns, db query, ...
Ok! I'll wait until it happens again.
Some extra info:
1- Pike is enabled, but we have logging when it blocks something (and no logs appeared). 2- CPU is always < 1% usage. 3- I have enabled kernel.core_uses_pid=1. 4- Our traffic is:
[clients]--> IPv4/IPv6 tls_443 [Kamailio] IPv4 udp_5060 --> IPv4 udp_5060 [dispatcher setid 4].
If I run the bt full right now to a worker process, should we get random stuff?
Becase right now I'm getting mostly the same on each process:
``` root:~# for i in `kamctl ps | grep PID | awk '{print $2}' | tr -d ","`; do gdb /usr/sbin/kamailio -ex "bt full" --batch $i >> $i.txt 2>&1; done root:~#
root:~# ls -l total 140 -rw-r--r-- 1 root root 1769 Jun 29 01:04 23585.txt -rw-r--r-- 1 root root 3300 Jun 29 01:04 23586.txt -rw-r--r-- 1 root root 3300 Jun 29 01:04 23587.txt -rw-r--r-- 1 root root 3301 Jun 29 01:04 23588.txt -rw-r--r-- 1 root root 3300 Jun 29 01:04 23589.txt -rw-r--r-- 1 root root 3301 Jun 29 01:04 23590.txt -rw-r--r-- 1 root root 3300 Jun 29 01:04 23591.txt -rw-r--r-- 1 root root 3301 Jun 29 01:04 23592.txt -rw-r--r-- 1 root root 3301 Jun 29 01:04 23593.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23594.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23595.txt -rw-r--r-- 1 root root 3431 Jun 29 01:04 23596.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23597.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23598.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23599.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23600.txt -rw-r--r-- 1 root root 3432 Jun 29 01:04 23601.txt -rw-r--r-- 1 root root 2388 Jun 29 01:04 23602.txt -rw-r--r-- 1 root root 1836 Jun 29 01:04 23603.txt -rw-r--r-- 1 root root 2245 Jun 29 01:04 23604.txt -rw-r--r-- 1 root root 4023 Jun 29 01:04 23605.txt -rw-r--r-- 1 root root 2950 Jun 29 01:04 23606.txt -rw-r--r-- 1 root root 2837 Jun 29 01:04 23607.txt -rw-r--r-- 1 root root 4691 Jun 29 01:04 23608.txt -rw-r--r-- 1 root root 2740 Jun 29 01:04 23609.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23610.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23611.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23612.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23613.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23614.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23615.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23616.txt -rw-r--r-- 1 root root 2691 Jun 29 01:04 23617.txt -rw-r--r-- 1 root root 2151 Jun 29 01:04 23618.txt root:~#
root:~# grep -n ffff * 23585.txt:31: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23586.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23587.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23588.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23589.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23590.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23591.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23592.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23593.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23594.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23595.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23596.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23597.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23598.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23599.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23600.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23601.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23602.txt:40: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23603.txt:33: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23604.txt:39: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23605.txt:75: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23606.txt:51: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23607.txt:49: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23608.txt:75: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23609.txt:50: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23610.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23611.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23612.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23613.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23614.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23615.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23616.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23617.txt:47: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> 23618.txt:41: p = 0xffffffff <error: Cannot access memory at address 0xffffffff> root:~# ```
Am I missing something or am I doing anything wrong here?
Can you paste the content of a bactrace for one udp and one tcp worker?
Hi Daniel,
Here is the output:
``` root:~/bt# kamctl ps | grep -B1 "tcp receiver" "PID": 23610, "DSC": "tcp receiver (generic) child=0" -- "PID": 23611, "DSC": "tcp receiver (generic) child=1" -- "PID": 23612, "DSC": "tcp receiver (generic) child=2" -- "PID": 23613, "DSC": "tcp receiver (generic) child=3" -- "PID": 23614, "DSC": "tcp receiver (generic) child=4" -- "PID": 23615, "DSC": "tcp receiver (generic) child=5" -- "PID": 23616, "DSC": "tcp receiver (generic) child=6" -- "PID": 23617, "DSC": "tcp receiver (generic) child=7" root:~/bt#
root:~/bt# gdb /usr/sbin/kamailio -ex "bt full" --batch 23610 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007f73f4cbd0d3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 ../sysdeps/unix/syscall-template.S: No such file or directory. #0 0x00007f73f4cbd0d3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 No locals. #1 0x000055a667819c13 in io_wait_loop_epoll (h=0x55a667cbee80 <io_w>, t=2, repeat=0) at core/io_wait.h:1034 n = 2 r = 1 fm = 0x7ffddf805cf0 revents = 1734426560 __func__ = "io_wait_loop_epoll" #2 0x000055a66782afd3 in tcp_receive_loop (unix_sock=40) at core/tcp_read.c:1789 __func__ = "tcp_receive_loop" #3 0x000055a66770f088 in tcp_init_children () at core/tcp_main.c:4796 r = 0 i = 7 reader_fd_1 = 40 pid = 0 si_desc = "tcp receiver (generic)\000\000\360\\200\337\375\177", '\000' <repeats 18 times>, "\320X\200\337\375\177\000\000u\340tg\246U\000\000\201\016\234g\246U\000\000\001\000\000\000\000\000\000\000p\022\230g\000\000\000\000\070\352C\364s\177\000\000\300?ag\246U\000\000\f\372\000s\000\000\000\000\360X\200\337\375\177\000\000\f\372\000s\000\000\000" si = 0x0 __func__ = "tcp_init_children" #4 0x000055a667620ff7 in main_loop () at main.c:1715 i = 8 pid = 23603 si = 0x0 si_desc = "udp receiver child=7 sock=[2602:FF37:0:1:0:0:C601:3779]:5060\000\257$\366\200:z\360q\177\000\000\f\372\000s\000\000\000\000\300?ag\246U\000\000\360\\200\337\375\177", '\000' <repeats 18 times>, " Z\200\337\375\177\000\000a\262\200g\246U\000" nrprocs = 8 woneinit = 1 __func__ = "main_loop" #5 0x000055a667627fb9 in main (argc=13, argv=0x7ffddf805cf8) at main.c:2646 cfg_stream = 0x55a668b42010 c = -1 r = 0 tmp = 0x7ffddf806eb5 "" tmp_len = -178570608 port = 32627 proto = -545236016 options = 0x55a66796b0e8 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:" ret = -1 seed = 83065607 rfd = 4 debug_save = 0 debug_flag = 0 dont_fork_cnt = 0 n_lst = 0x0 p = 0xffffffff <error: Cannot access memory at address 0xffffffff> st = {st_dev = 19, st_ino = 15809, st_nlink = 2, st_mode = 16832, st_uid = 109, st_gid = 113, __pad0 = 0, st_rdev = 0, st_size = 40, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1498523393, tv_nsec = 31134570}, st_mtim = {tv_sec = 1498687722, tv_nsec = 986617235}, st_ctim = {tv_sec = 1498687722, tv_nsec = 986617235}, __glibc_reserved = {0, 0, 0}} __func__ = "main" root:~/bt#
root:~/bt# kamctl ps | grep -B1 "udp receiver" "PID": 23586, "DSC": "udp receiver child=0 sock=198.1.55.121:5060" -- "PID": 23587, "DSC": "udp receiver child=1 sock=198.1.55.121:5060" -- "PID": 23588, "DSC": "udp receiver child=2 sock=198.1.55.121:5060" -- "PID": 23589, "DSC": "udp receiver child=3 sock=198.1.55.121:5060" -- "PID": 23590, "DSC": "udp receiver child=4 sock=198.1.55.121:5060" -- "PID": 23591, "DSC": "udp receiver child=5 sock=198.1.55.121:5060" -- "PID": 23592, "DSC": "udp receiver child=6 sock=198.1.55.121:5060" -- "PID": 23593, "DSC": "udp receiver child=7 sock=198.1.55.121:5060" -- "PID": 23594, "DSC": "udp receiver child=0 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23595, "DSC": "udp receiver child=1 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23596, "DSC": "udp receiver child=2 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23597, "DSC": "udp receiver child=3 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23598, "DSC": "udp receiver child=4 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23599, "DSC": "udp receiver child=5 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23600, "DSC": "udp receiver child=6 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" -- "PID": 23601, "DSC": "udp receiver child=7 sock=[2602:FF37:0:1:0:0:C601:3779]:5060" root:~/bt#
root:~/bt# gdb /usr/sbin/kamailio -ex "bt full" --batch 23586 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007f73f4cbd9f3 in __recvfrom_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 ../sysdeps/unix/syscall-template.S: No such file or directory. #0 0x00007f73f4cbd9f3 in __recvfrom_nocancel () at ../sysdeps/unix/syscall-template.S:84 No locals. #1 0x000055a6676ad0f2 in udp_rcv_loop () at core/udp_server.c:441 len = 392 buf = "SIP/2.0 200 canceling\r\nVia: SIP/2.0/UDP 198.1.55.121;branch=z9hG4bK1b16.c1f15688f268122a121deccea3eb8c11.0;i=eff9\r\nFrom: "Deniz Altunta?" sip:b9c825417f684a93a03cff1dde3f6c44@app.nextplus.me;tag=Zcf"... tmp = 0x55a66774e70c <init_child+52> "\311\303UH\211\345AWAVAUATSH\203\354\070H\211}\270H\203}\270" from = 0x7f73f44b6348 fromlen = 16 ri = {src_ip = {af = 2, len = 4, u = {addrl = {2067202502, 140135995956936}, addr32 = {2067202502, 0, 4097990344, 32627}, addr16 = {454, 31543, 0, 0, 24264, 62530, 32627, 0}, addr = "\306\001\067{\000\000\000\000\310^B\364s\177\000"}}, dst_ip = {af = 2, len = 4, u = {addrl = {2033648070, 0}, addr32 = {2033648070, 0, 0, 0}, addr16 = {454, 31031, 0, 0, 0, 0, 0, 0}, addr = "\306\001\067y", '\000' <repeats 11 times>}}, src_port = 5060, dst_port = 5060, proto_reserved1 = 0, proto_reserved2 = 0, src_su = {s = {sa_family = 2, sa_data = "\023\304\306\001\067{\000\000\000\000\000\000\000"}, sin = {sin_family = 2, sin_port = 50195, sin_addr = {s_addr = 2067202502}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 50195, sin6_flowinfo = 2067202502, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}, bind_address = 0x7f73f4425d48, proto = 1 '\001'} __func__ = "udp_rcv_loop" #2 0x000055a66761ff1e in main_loop () at main.c:1625 i = 0 pid = 0 si = 0x7f73f4425d48 si_desc = "udp receiver child=0 sock=198.1.55.121:5060\000o\000\000\000H֘g\246U\000\000\000*r\331D\257$\366\200:z\360q\177\000\000\f\372\000s\000\000\000\000\300?ag\246U\000\000\360\\200\337\375\177", '\000' <repeats 18 times>, " Z\200\337\375\177\000\000a\262\200g\246U\000" nrprocs = 8 woneinit = 0 __func__ = "main_loop" #3 0x000055a667627fb9 in main (argc=13, argv=0x7ffddf805cf8) at main.c:2646 cfg_stream = 0x55a668b42010 c = -1 r = 0 tmp = 0x7ffddf806eb5 "" tmp_len = -178570608 port = 32627 proto = -545236016 options = 0x55a66796b0e8 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:" ret = -1 seed = 83065607 rfd = 4 debug_save = 0 debug_flag = 0 dont_fork_cnt = 0 n_lst = 0x0 p = 0xffffffff <error: Cannot access memory at address 0xffffffff> st = {st_dev = 19, st_ino = 15809, st_nlink = 2, st_mode = 16832, st_uid = 109, st_gid = 113, __pad0 = 0, st_rdev = 0, st_size = 40, st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1498523393, tv_nsec = 31134570}, st_mtim = {tv_sec = 1498687722, tv_nsec = 986617235}, st_ctim = {tv_sec = 1498687722, tv_nsec = 986617235}, __glibc_reserved = {0, 0, 0}} __func__ = "main" root:~/bt# ```
The backtraces show that the worker processes are waiting to receive traffic from network, which is the normal situation. When you get it to the state of not processing sip traffic, grab again the backtraces.
Hi Daniel,
Today it happened again. I have more information:
We have currently running 4x Debian 9 servers with Kamailio v5:
KamailioA & KamailioB KamailioC & KamailioD
Servers A/B only take care of SSL offloading and loadbalance/failover between servers C/D. (using dispatcher with minimal config)
Flow would be:
User <-> KamailioA/B <-> KamailioC/D <-> ...
Today KamailioB stop replying... and I got the backtraces.
From KamailioA: 0 problems From KamailioB:
``` root:~/bt2# grep DISPATCHER /var/log/kamailio/kamailio.log Jul 1 15:03:06 13cn4 sbc[14833]: WARNING: <script>: [DISPATCHER] - Destination down: OPTIONS sip:A.A.A.A:5060 (<null>) Jul 1 15:03:06 13cn4 sbc[14833]: WARNING: <script>: [DISPATCHER] - Destination down: OPTIONS sip:B.B.B.B:5060 (<null>) Jul 1 15:39:50 13cn4 sbc[14818]: WARNING: <script>: [DISPATCHER] - Destination up: OPTIONS sip:A.A.A.A:5060 Jul 1 15:39:50 13cn4 sbc[14818]: WARNING: <script>: [DISPATCHER] - Destination up: OPTIONS sip:B.B.B.B:5060 root:~/bt2# ``` A.A.A.A and B.B.B.B would be KamailioC/D.
Note, this only happened in KamailioB, KamailioA had 0 problems.
When I connected to the server, I did the following:
``` # for i in `kamctl ps | grep PID | awk '{print $2}' | tr -d ","`; do gdb /usr/sbin/kamailio -ex "bt full" --batch $i >> $i.txt 2>&1; done ```
Well, I got the backtraces per PID, the thing is, right after running that command, traffic started flowing again (I didn't restart or anything), hence the timestamps:
``` root:~/bt2# ls -lh total 356K -rw-r--r-- 1 root root 1.8K Jul 1 15:39 14816.txt -rw-r--r-- 1 root root 15K Jul 1 15:39 14817.txt -rw-r--r-- 1 root root 15K Jul 1 15:39 14818.txt -rw-r--r-- 1 root root 15K Jul 1 15:39 14819.txt -rw-r--r-- 1 root root 48K Jul 1 15:39 14820.txt -rw-r--r-- 1 root root 47K Jul 1 15:39 14821.txt -rw-r--r-- 1 root root 15K Jul 1 15:39 14822.txt -rw-r--r-- 1 root root 15K Jul 1 15:39 14823.txt -rw-r--r-- 1 root root 15K Jul 1 15:39 14824.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14825.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14826.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14827.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14828.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14829.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14830.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14831.txt -rw-r--r-- 1 root root 3.4K Jul 1 15:39 14832.txt -rw-r--r-- 1 root root 42K Jul 1 15:39 14833.txt -rw-r--r-- 1 root root 1.8K Jul 1 15:39 14834.txt -rw-r--r-- 1 root root 2.2K Jul 1 15:39 14835.txt -rw-r--r-- 1 root root 4.0K Jul 1 15:39 14836.txt -rw-r--r-- 1 root root 2.9K Jul 1 15:39 14837.txt -rw-r--r-- 1 root root 2.8K Jul 1 15:39 14838.txt -rw-r--r-- 1 root root 4.7K Jul 1 15:39 14839.txt -rw-r--r-- 1 root root 2.7K Jul 1 15:39 14840.txt -rw-r--r-- 1 root root 2.7K Jul 1 15:39 14841.txt -rw-r--r-- 1 root root 2.7K Jul 1 15:39 14842.txt -rw-r--r-- 1 root root 7.5K Jul 1 15:39 14843.txt -rw-r--r-- 1 root root 2.7K Jul 1 15:39 14844.txt -rw-r--r-- 1 root root 2.7K Jul 1 15:39 14845.txt -rw-r--r-- 1 root root 7.5K Jul 1 15:39 14846.txt -rw-r--r-- 1 root root 2.7K Jul 1 15:39 14847.txt -rw-r--r-- 1 root root 7.5K Jul 1 15:39 14848.txt -rw-r--r-- 1 root root 4.2K Jul 1 15:39 14849.txt root:~/bt2# ``` [backtrace_20170701_1539.tar.gz](https://github.com/kamailio/kamailio/files/1117494/backtrace_20170701_1539.t...)
I have tried to look at the backtraces, but to me they seem ok (kind of like the previous ones, where Kamailio is just waiting for new requests).
So at this point my assumption is that something is triggering dispatcher to see both nodes down and therefor stops processing traffic, when dispatcher sees the nodes up again, all starts working.
Let me know what you think Daniel and how I can investigate this further.
I also have tcpdump captures at the time, I will check to see if OPTIONS are actually being sent out or not to try to narrow this a little more.
Thanks! Joel.
Hi again, unfortunately the tcpdump where only capturing the external side, I have restarted tcpdump now capturing internal too so we can see the OPTIONS between the servers.
Hi Daniel,
More information. I'm attaching the log (with debug=2) between the dispatcher nodes down & up at the time of the problem.
[kamailio_sbc_2.log.txt](https://github.com/kamailio/kamailio/files/1117574/kamailio_sbc_2.log.txt)
Some extra stuff:
1. There is a gap a little bit after the dispatcher sets the nodes down, until a little before the dispatcher sets the nodes up, hope this helps. 2. We use VoIPmonitor and the sensor receives the traffic from a port mirror on the switch, in the timeframe of the issue, there is no traffic detected from the affected server. 3. We pull performance metrics from Kamailio every 10 seconds using xhttp and json-rpc modules, during the issue our tool can't pool metric either:
``` 2017-07-01T22:02:15Z E! Error in plugin [inputs.httpjson]: Get http://127.0.0.1/statistics: net/http: timeout awaiting response headers (Client.Timeout exceeded while awaiting headers) 2017-07-01T22:02:25Z E! Error in plugin [inputs.httpjson]: Get http://127.0.0.1/statistics: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2017-07-01T22:02:35Z E! Error in plugin [inputs.httpjson]: Get http://127.0.0.1/statistics: net/http: request canceled (Client.Timeout exceeded while awaiting headers) [...] [...] [...] 2017-07-01T22:39:25Z E! Error in plugin [inputs.httpjson]: Get http://127.0.0.1/statistics: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2017-07-01T22:39:35Z E! Error in plugin [inputs.httpjson]: Get http://127.0.0.1/statistics: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2017-07-01T22:39:45Z E! Error in plugin [inputs.httpjson]: Get http://127.0.0.1/statistics: net/http: request canceled (Client.Timeout exceeded while awaiting headers) ```
I will leave the tcpdumps enabled anyway in case next time it happens we can get more or better information.
Joel.
Hi,
It happened again, I checked the captures and this is what is happening during the issue:
From KamailioB to KamailioC/D
1. Kamailio B is sending OPTIONS to KamailioC/D. 2. KamailioC/D are replying 200. 3. Kamailio B is not acting to those 200 and has the nodes DOWN (probably because it is in that "hung" state).
At the same time:
From KamailioC/D to KamailioB
1. KamailioC/D are sending OPTIONS to KamailioB. 2. KamailioB is NOT replying anything.
This time I had to restart Kamailio and it started working right away again.
[backtrace_20170701_2001.tar.gz](https://github.com/kamailio/kamailio/files/1117612/backtrace_20170701_2001.t...)
Sorry for so many posts, but I update this as things happen.
It looks like something blocked on tls processing. The process 14848 is doing:
``` #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 No locals. #1 0x00007f4fdee39c50 in __GI___pthread_rwlock_unlock (rwlock=0x7f4dde916288) at pthread_rwlock_unlock.c:42 futex_shared = 0 #2 0x00007f4fdf207f09 in CRYPTO_THREAD_unlock () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. #3 0x00007f4fdf1d6cb9 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. #4 0x00007f4fdf5216a0 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 No symbol table info available. #5 0x00007f4fdf5152cf in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 No symbol table info available. #6 0x00007f4fdf50d9f1 in SSL_do_handshake () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 No symbol table info available. #7 0x00007f4fdf77d73c in tls_accept (c=0x7f4ddf75e2f0, error=0x7ffd3522dc90) at tls_server.c:411 ```
The 14847 and many others:
``` #0 0x00007f4fdee39450 in futex_wait (private=<optimized out>, expected=1446580, futex_word=0x7f4dde916294) at ../sysdeps/unix/sysv/linux/futex-internal.h:61 __ret = -512 err = <optimized out> #1 futex_wait_simple (private=<optimized out>, expected=1446580, futex_word=0x7f4dde916294) at ../sysdeps/nptl/futex-internal.h:135 No locals. #2 __pthread_rwlock_wrlock_slow (rwlock=0x7f4dde916288) at pthread_rwlock_wrlock.c:67 waitval = 1446580 result = 0 futex_shared = <optimized out> #3 0x00007f4fdf207ee9 in CRYPTO_THREAD_write_lock () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. #4 0x00007f4fdf1d6658 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 No symbol table info available. #5 0x00007f4fdf51500f in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 No symbol table info available. #6 0x00007f4fdf50d9f1 in SSL_do_handshake () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 No symbol table info available. #7 0x00007f4fdf77d73c in tls_accept (c=0x7f4ddf1993a8, error=0x7ffd3522dc90) at tls_server.c:411 ```
As I assume by function names, one is trying to do unlock, but still waiting for something, while the others try to acquire the lock.
The thing is that openssl (libssl) 1.1.0 don't use anymore the custom lock functions for their internal stuff, being thread safe implicitely. I pushed a patch to master branch, disabling the part where we set the lock/unlock functions for libssl, but that mainly avoids compile warnings, otherwise it doesn't affect how libssl works internally for v1.1.0+.
If that unlock takes too long inside libssl/libcrypto, either is bug there in the libs or somehow the pointer to the lock (or the lock structure content) got corrupted. For the first option, you would have to use a libssl less than 1.1.0 -- maybe you can search on the web and see if there are others reporting issues with locking/deadlocks for openssl 1.1.0+.
The second option could be also the fault of kamailio, if there is a buffer overflow somewhere. But this should pop up as a bug with openssl less than 1.1.0 (in that case our internal lock structure should be overwritten/corrupted) and nobody reported any deadlock/crash related to tls for very long time, although it is used extensively for very long time.
I would try to use first openssl less than 1.1.0 and see if the same situation occurs. If not, then it is likely something related to the new locking system done inside openssl 1.1.0.
Hi Daniel, thanks for helping troubleshoot the issue.
I have Kamailio v4.4 with Debian 8 working with TLS on other servers and we have had 0 problems (OpenSSL v1.0)
If I install the kamailio 5 nightly deb build from tomorrow, would your latest patch be applied? Or should I manually compile from master branch?
Also, for some reason I had a feeling it could be OpenSSL 1.1, Debian 9 seems to have OpenSSL 1.0 also:
``` root:~# dpkg -l | grep libssl ii libssl1.0.2:amd64 1.0.2l-2 amd64 Secure Sockets Layer toolkit - shared libraries ii libssl1.1:amd64 1.1.0f-3 amd64 Secure Sockets Layer toolkit - shared libraries root:~# ```
Can I build Kamailio deb package from the sources, compiling with the ssl v1.0 version instead of v1.1, and then install that resulting deb package and wait and see if the problem reproduces?
Let me know what you think.
Yes, the patches to tls module will be in nightly builds. But again, they just skipped setting the locking functions for libssl/libcrypto, being not used and the compiler throwing unused functions warnings.
Probably you can link against libssl 1.0, but you need to update the deb specs and/or Makefile for tls module.
Hi Daniel, just an update... I have done the following:
1 - Download src package for v5.0.2: `# apt-src install kamailio`. 2- Edit the control file: `./kamailio-5.0.2+stretch/debian/control`. 3- In the "Build-Depends" section, replace `libssl-dev` with `libssl1.0-dev`. 4- Build: `dpkg-buildpackage -b`
I checked the resulting TLS module package:
``` # dpkg -I kamailio-tls-modules_5.0.2+stretch_amd64.deb new debian package, version 2.0. size 362518 bytes: control archive=1025 bytes. 22 bytes, 1 lines conffiles 757 bytes, 18 lines control 842 bytes, 10 lines md5sums Package: kamailio-tls-modules Source: kamailio Version: 5.0.2+stretch Architecture: amd64 Maintainer: Debian VoIP Team pkg-voip-maintainers@lists.alioth.debian.org Installed-Size: 995 Depends: kamailio (= 5.0.2+stretch), libc6 (>= 2.14), libcurl3 (>= 7.16.2), libssl1.0.2 (>= 1.0.2d) Section: net Priority: optional Multi-Arch: same Homepage: http://www.kamailio.org/ Description: TLS support for the Kamailio SIP server (authentication, transport) Kamailio is a very fast and flexible SIP (RFC3261) server. Written entirely in C, Kamailio can handle thousands calls per second even on low-budget hardware. . This package provides TLS support for encrypted and authenticated SIP connections as well as generic TLS support for many Kamailio modules. root@debian9build:/opt/kamailio-5.0.2-deb# ```
So it seems it is correctly linked to libssl1.0.2.
I will install this version of Kamailio on one of the SBCs tonight and keep it in close observation. I will update here as soon as I have more info.
Hi Daniel,
Just an update, we have just restarted the SBC with the new Kamailio build using OpenSSL v1.0:
``` Jul 25 00:59:09 voicesbc1 sbc[1112]: INFO: tls [tls_init.c:639]: init_tls_h(): tls: _init_tls_h: compiled with openssl version "OpenSSL 1.0.2l 25 May 2017" (0x100020cf), kerberos support: off, compression: on ```
Now to wait and see if it reproduces. We have 2 SBCs, both Debian9+Kamailio5.0.2:
voicesbc1: Kamailio compiled with OpenSSL v1.0 voicesbc2: Kamailio compiled with OpenSSL v1.1
Until now, we have had to restart kamailio every time the issue appeared on any/both servers, let's see what happens in the next couple days.
OK, thanks for updates, let's see the results...
Hi Daniel, just an update... So far, since we repackaged v5.0.2 compiling with libssl1.0 we haven't had a single problem...
What should we try next to get this fixed with libssl1.1 ?
The core dump at restart seems to be the same like the #1189, which should be fixed now.
To troubleshoot better the other issue, install dbg symbols package for openssl libraries to see what is executed when it locks via the backtraces.
Closed #1172.
I am closing this one, provided the fix for the core. If you get again stuck with libssl1.1, open a new issue and attach the backtrace, so we work with latest source code.
Will do Daniel. Thank you for the help.