Hi, I'm experiencing an annoying issue on a 32-bit Debian Lenny running on a DELL 850 (2 cores). The same occurred before with Debian Etch (32 and 64 bits) on another host (also a DELL 850).
Kamailio (1.5 rev 5834) behaves as a load balancer in front of 2 PSTN gateways. There are ~200 calls and for all of them rtpproxy is applied.
Sometimes, for no apparent reason (and not only at moments of highest traffic), kamailio freezes completely, that is, it neither replies to SIP messages nor relays them. Kamailio then cannot be killed (except with -9). Sometimes it occurs when running a fifo command (e.g. "kamctl fifo address_reload").
After killing it, kamailio cannot be started again. It starts and writes the usual messages (to syslog), but after "rtp proxy unix:/var/run/rtpproxy.sock found, support for it enabled" there are no more logs. Then if I run "ps aux | grep kamailio" I see just a single process. It's as if the children fail when connecting to rtpproxy and die, but the master process doesn't notice and remains alive, showing no error at all. So I suspect rtpproxy, since even after kamailio was killed the rtpproxy process consumes a lot of CPU.
I'll try rtpproxy-1.2.1 (the latest stable version), as until now I was using an old "trunk" version (unfortunately I have no idea how to find out the CVS revision of a CVS working directory...).
Anyhow, I also suspect that the problem could be in the server itself, as sometimes the above problem occurs when reloading iptables and so on. Very, very strange, with no definitive cause or reason, and very annoying.
So nothing is clear to me. But I know that the problem occurs:
- On two DELL 850, one 32-bit and the other 64-bit.
- On both Debian Lenny and Etch.
- With kamailio 1.5 rev 5834.
- With rtpproxy rev XXX (how do I get the last modification date in a CVS working directory?)
So, the fact is that I'm completely lost, no idea of what is happening. Any suggestion? Thanks a lot.
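On the CVS question above: one local way to see which revisions a working directory holds, without contacting the server, is to read the CVS/Entries files that CVS keeps in every checked-out directory. The following is only a sketch under that assumption; the function name cvs_entries is made up for illustration, and the timestamps are whatever CVS recorded for each file, printed as-is.

```shell
#!/bin/sh
# List every file tracked in a CVS working directory together with the
# timestamp and revision stored in the CVS/Entries files.
# File entries look like: /name/revision/timestamp/options/tagdate
# Directory entries start with "D" and are skipped.
# cvs_entries is a hypothetical helper name, not part of CVS itself.
cvs_entries() {
    dir=${1:-.}
    find "$dir" -type f -path '*/CVS/Entries' |
    while IFS= read -r entries; do
        # Output columns: timestamp <TAB> revision <TAB> Entries-file:name
        awk -F/ -v src="$entries" \
            '$1 == "" && NF >= 4 { print $4 "\t" $3 "\t" src ":" $2 }' \
            "$entries"
    done
}
```

Running "cvs_entries /path/to/rtpproxy" (path hypothetical) and looking for the most recent timestamp gives a rough idea of how old the checkout is.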
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Sometimes, for no apparent reason (and not only at moments of highest traffic), kamailio freezes completely, that is, it neither replies to SIP messages nor relays them. Kamailio then cannot be killed (except with -9).
And the worst: when this occurs kamailio logs *nothing*.
On 03/01/2010 11:52 AM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Sometimes, for no apparent reason (and not only at moments of highest traffic), kamailio freezes completely, that is, it neither replies to SIP messages nor relays them. Kamailio then cannot be killed (except with -9).
And the worst: when this occurs kamailio logs *nothing*.
is it eating a lot of cpu? Can you attach to it with gdb and do a backtrace?
Might be some race in permissions module induced by address_reload MI command.
Cheers, Daniel
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
On 03/01/2010 11:52 AM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Sometimes, for no apparent reason (and not only at moments of highest traffic), kamailio freezes completely, that is, it neither replies to SIP messages nor relays them. Kamailio then cannot be killed (except with -9).
And the worst: when this occurs kamailio logs *nothing*.
is it eating a lot of cpu? Can you attach to it with gdb and do a backtrace?
Do you mean running kamailio as usual and later running: gdb> attach KAMAILIO_MASTER_PID ?
or running directly:
$ gdb --args /usr/sbin/kamailio -P /var/run/kamailio/kamailio.pid -m 64 -u kamailio -g kamailio
In the second case kamailio doesn't fork and doesn't bind to the listening address. So I assume you mean the first case.
Could it affect the performance of the service? If not, when should I run "bt"? After the problem occurs? Also note that when Kamailio gets frozen there is no coredump. I should also mention that after running "kamctl fifo address_reload" it takes several minutes until it gets frozen, but it doesn't always occur.
Might be some race in permissions module induced by address_reload MI command.
I suspected it. However it doesn't explain the fact that later kamailio cannot be started again (even after killing all the kamailio processes with -9). Would it make sense?
Thanks a lot.
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Do you mean running kamailio as usual and later running: gdb> attach KAMAILIO_MASTER_PID ?
Should it be the PID of the master process (attendant) or the "MI FIFO" process of kamailio?
Thanks.
On 03/01/2010 12:39 PM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Do you mean running kamailio as usual and later running: gdb> attach KAMAILIO_MASTER_PID ?
Should it be the PID of the master process (attendant) or the "MI FIFO" process of kamailio?
Not the master; it would be good if you can attach to a sip worker process. That will reveal why it is no longer processing sip messages.
Cheers, Daniel
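A rough sketch of how such backtraces could be collected in one go once kamailio stops answering. It assumes gdb and pgrep are installed, that it is run as root, and that the PID file is the one passed with -P elsewhere in this thread. The output location under /tmp and the function names are made up for illustration. It dumps every kamailio process except the master rather than trying to pick out only the SIP workers.

```shell
#!/bin/sh
# Read PIDs on stdin, one per line, and drop the master PID given as $1.
drop_master() {
    while IFS= read -r pid; do
        [ "$pid" = "$1" ] || printf '%s\n' "$pid"
    done
}

# Attach gdb to one PID non-interactively, print a full backtrace, detach.
# The process is stopped only while gdb is attached.
backtrace_pid() {
    gdb -p "$1" -batch -ex 'bt full' 2>&1
}

# Write one backtrace file per non-master kamailio process.
# Assumption: the master PID is stored in the file given as $1.
dump_all() {
    pidfile=${1:-/var/run/kamailio/kamailio.pid}
    master=$(cat "$pidfile" 2>/dev/null)
    for pid in $(pgrep -x kamailio | drop_master "$master"); do
        backtrace_pid "$pid" > "/tmp/kamailio-bt-$pid.txt"
    done
}
```

Calling dump_all after the freeze leaves one file per process in /tmp, which can then be attached to a reply on the list.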
On 03/01/2010 12:35 PM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
On 03/01/2010 11:52 AM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Sometimes, for no apparent reason (and not only at moments of highest traffic), kamailio freezes completely, that is, it neither replies to SIP messages nor relays them. Kamailio then cannot be killed (except with -9).
And the worst: when this occurs kamailio logs *nothing*.
is it eating a lot of cpu? Can you attach to it with gdb and do a backtrace?
Do you mean running kamailio as usual and later running: gdb> attach KAMAILIO_MASTER_PID ?
yes, run as usual and when it is no longer working attach to a SIP worker process and grab the backtrace.
The master process does not do much, just supervising the other processes in order to clean.
or running directly:
$ gdb --args /usr/sbin/kamailio -P /var/run/kamailio/kamailio.pid -m 64 -u kamailio -g kamailio
In the second case kamailio doesn't fork and doesn't bind to the listening address. So I assume you mean the first case.
Right.
Could it affect the performance of the service? If not, when should I run "bt"? After the problem occurs?
Yes, after the problem occurs. Only the process you attach to is going to be blocked in gdb, the others should function normally (but they don't anyhow, as I understood).
Also note that when Kamailio gets frozen there is no coredump. I should also mention that after running "kamctl fifo address_reload" it takes several minutes until it gets frozen, but it doesn't always occur.
Might be some race in permissions module induced by address_reload MI command.
I suspected it. However it doesn't explain the fact that later kamailio cannot be started again (even after killing all the kamailio processes with -9). Would it make sense?
Is the fifo/pid file still there? Is there no error saying why it is not starting again?
Cheers, Daniel
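As a sketch of the fifo/pid check asked about above: the snippet below reports whether a PID file points at a live process and whether a FIFO is still present on disk. The PID file path is the one from the command line shown earlier in the thread; the FIFO path /tmp/kamailio_fifo is only an assumption and should be replaced with whatever the mi_fifo module is configured to use. The function names are made up for illustration.

```shell
#!/bin/sh
# Print "missing", "stale" or "alive" for the PID file given as $1.
# kill -0 only tests whether the process can be signalled, so run this
# as root (or as the kamailio user) to avoid false "stale" results.
pidfile_state() {
    [ -f "$1" ] || { echo missing; return; }
    pid=$(cat "$1")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo alive
    else
        echo stale
    fi
}

# Print "present", "not-a-fifo" or "missing" for the path given as $1.
fifo_state() {
    if [ -p "$1" ]; then
        echo present
    elif [ -e "$1" ]; then
        echo not-a-fifo
    else
        echo missing
    fi
}
```

For example, after a kill -9: pidfile_state /var/run/kamailio/kamailio.pid; fifo_state /tmp/kamailio_fifo. A "stale" PID file or a leftover FIFO would be worth mentioning when reporting that kamailio does not start again.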
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
Do you mean running kamailio as usual and later running: gdb> attach KAMAILIO_MASTER_PID ?
yes, run as usual and when it is no longer working attach to a SIP worker process and grab the backtrace.
ok
Could it affect the performance of the service? If not, when should I run "bt"? After the problem occurs?
Yes, after the problem occurs. Only the process you attach to is going to be blocked in gdb, the others should function normally (but they don't anyhow, as I understood).
Yes, no worker works after the problem occurs.
Might be some race in permissions module induced by address_reload MI command.
I suspected it. However it doesn't explain the fact that later kamailio cannot be started again (even after killing all the kamailio processes with -9). Would it make sense?
Is the fifo/pid file still there? Is there no error saying why it is not starting again?
I must check it. Anyhow, as I said, it's a very strange problem, as one day it occurred after reloading the iptables rules (without changing anything important)! Very, very strange, but as it occurs on two servers I want to believe that it has something to do with the kamailio or rtpproxy version.
I'll prepare a SIPp scenario to check it with some load.
Thanks a lot.
On 03/01/2010 12:57 PM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
Do you mean running kamailio as usual and later running: gdb> attach KAMAILIO_MASTER_PID ?
yes, run as usual and when it is no longer working attach to a SIP worker process and grab the backtrace.
ok
Could it affect the performance of the service? If not, when should I run "bt"? After the problem occurs?
Yes, after the problem occurs. Only the process you attach to is going to be blocked in gdb, the others should function normally (but they don't anyhow, as I understood).
Yes, no worker works after the problem occurs.
Might be some race in permissions module induced by address_reload MI command.
I suspected it. However it doesn't explain the fact that later kamailio cannot be started again (even after killing all the kamailio processes with -9). Would it make sense?
Is the fifo/pid file still there? Is there no error saying why it is not starting again?
I must check it. Anyhow, as I said, it's a very strange problem, as one day it occurred after reloading the iptables rules (without changing anything important)!
hmmm ... are you using tcp or just udp?
Cheers, Daniel
Very, very strange, but as it occurs on two servers I want to believe that it has something to do with the kamailio or rtpproxy version.
I'll prepare a SIPp scenario to check it with some load.
Thanks a lot.
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
I must check it. Anyhow, as I said, it's a very strange problem, as one day it occurred after reloading the iptables rules (without changing anything important)!
hmmm ... are you using tcp or just udp?
Just udp, no tcp enabled.
Also, I cannot reproduce the problem with SIPp, as I call wrong numbers, so the PSTN gateways reply 503 and there are no RtpProxy sessions or dialogs (I use the dialog module just to get statistics).
On 03/01/2010 01:17 PM, Iñaki Baz Castillo wrote:
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
I must check it. Anyhow, as I said, it's a very strange problem, as one day it occurred after reloading the iptables rules (without changing anything important)!
hmmm ... are you using tcp or just udp?
Just udp, no tcp enabled.
udp should not be affected by a firewall. I asked because in the past I had some trouble with tcp connections and firewall updates (not related to sip).
Also, I cannot reproduce the problem with SIPp, as I call wrong numbers, so the PSTN gateways reply 503 and there are no RtpProxy sessions or dialogs (I use the dialog module just to get statistics).
It might be a corner case, when you get it next time, grab some backtraces. I don't think it is related to rtpproxy or dialog module, i will look at permissions a bit when I get some time, maybe I'll spot something.
Cheers, Daniel
On Monday, 1 March 2010, Daniel-Constantin Mierla wrote:
Also, I cannot reproduce the problem with SIPp, as I call wrong numbers, so the PSTN gateways reply 503 and there are no RtpProxy sessions or dialogs (I use the dialog module just to get statistics).
It might be a corner case, when you get it next time, grab some backtraces. I don't think it is related to rtpproxy or dialog module, i will look at permissions a bit when I get some time, maybe I'll spot something.
Ok, I'll try to reproduce the problem in a testing scenario (no luck yet). If not, I'll wait until it occurs again in production and I'll attach any UDP worker to gdb and run "bt".
Thanks.
On Monday, 1 March 2010, Iñaki Baz Castillo wrote:
Might be some race in permissions module induced by address_reload MI command.
Unfortunately I cannot reproduce the problem (running "kamctl fifo address_reload" several times and modifying the address table) in a no-traffic scenario (just a single phone making calls). I have the master process attached to gdb and nothing wrong occurs; in fact kamailio keeps working properly. There must be something else somewhere...