It would be good to add some SIP keepalive monitoring (e.g., a cron job
with sipsak sending OPTIONS requests) that alerts or restarts Kamailio
when there is no response. The monit tool can also send SIP keepalives
and take action on no response.
With a deadlock, checking the process table is not enough. There should
have been high CPU usage, though, if you monitored that.
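As a rough illustration of the keepalive idea, here is a minimal sketch
in Python that sends a SIP OPTIONS request over UDP and reports whether
any SIP reply came back. It is not what sipsak or monit do internally;
the function names, the "keepalive" user in the From header and the
default timeout are all made up for the example, and it assumes the
proxy listens on UDP and answers OPTIONS with some SIP response.

```python
import socket
import uuid


def build_options(host, port, local_ip, local_port):
    """Build a minimal SIP OPTIONS request (RFC 3261) as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch magic cookie
    tag = uuid.uuid4().hex[:8]
    call_id = uuid.uuid4().hex
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        # "keepalive" is a hypothetical user name, not a Kamailio convention
        f"From: <sip:keepalive@{local_ip}>;tag={tag}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {call_id}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def sip_options_ping(host, port=5060, timeout=2.0):
    """Return the SIP status code of the first reply, or None if none."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            # connect() on UDP only fixes the peer and selects a local address
            sock.connect((host, port))
            local_ip, local_port = sock.getsockname()
            sock.send(build_options(host, port, local_ip, local_port))
            data = sock.recv(65535)
        except OSError:
            # covers timeouts, ICMP port unreachable and resolution errors
            return None
    parts = data.split(b"\r\n", 1)[0].split()
    # A status line looks like: SIP/2.0 200 OK
    if len(parts) >= 2 and parts[0] == b"SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None
```

A cron job or watchdog could call sip_options_ping("127.0.0.1") and
alert or restart when it returns None several times in a row. Treating
any status code as "alive" is a design choice for this sketch: it only
checks that the SIP workers still process messages, which is exactly
what a process-table check cannot tell you.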
Cheers,
Daniel
On 27/03/15 12:47, Alex Balashov wrote:
This was a rather peculiar crash:
From the logs, it would appear that Kamailio simply stopped processing
messages at some point. There's about 8 minutes of zero log output at
a time of constantly incoming traffic.
This situation ends when all Kamailio processes exit on a normal
SIGTERM, after someone manually restarted the service:
Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27498]: NOTICE: <core>
[main.c:739]: handle_sigs(): Thank you for flying kamailio!!!
Mar 26 20:40:10 Proxy1 /usr/local/sbin/kamailio[27535]: INFO: <core>
[main.c:850]: sig_usr(): signal 15 received.
...
But there are a few things here that are difficult to explain from the
log:
1. Why was there no SIP stack response for 8 minutes, no logging
activity, etc?
2. We have a script that checks every second whether Kamailio processes
are running, and restarts Kamailio if they are not. It also sends an
e-mail informing us of that development.
It's a rather naive check:
ps aux | grep kamailio | grep -v 'grep kamailio' | wc -l
But in this case, the script was not triggered, which would imply that
some Kamailio processes--perhaps all--remained running.
There is no indication in the logs that any process died for any
reason, except for the 'signal 15' received by all processes at the
time of manual restart.
3. Why was a core dump generated at the time of the restart, if
nothing crashed?
#3 is most interesting to me, because if it were some other problem,
e.g. blocking of SIP worker threads for some reason, then I wouldn't
expect a core dump upon service shutdown.
There is no other indication of any child process dying with SIGSEGV
or SIGABRT.
-- Alex
On 03/27/2015 06:17 AM, Alex Balashov wrote:
Hello,
The system experienced another crash yesterday, but unfortunately the
core dump is not very insightful, possibly due to being incomplete:
BFD: Warning: /tmp/./core.kamailio.500.1427402410.27498 is truncated:
expected core file size >= 8602058752, found: 1769852928.
[New Thread 27498]
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Cannot access memory at address 0x7f52891e3168
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/sbin/kamailio -P /var/run/kamailio.pid
-m 8192 -u evaristesys -g eva'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f5286d97e45 in ?? ()
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.149.el6_6.5.x86_64
(gdb) where
#0 0x00007f5286d97e45 in ?? ()
Cannot access memory at address 0x7fffbe32a210
That's not much help at all, so I cannot possibly say it is for the same
reasons as before.
--
Daniel-Constantin Mierla
http://twitter.com/#!/miconda -
http://www.linkedin.com/in/miconda
Kamailio World Conference, May 27-29, 2015
Berlin, Germany -
http://www.kamailioworld.com