On Jan 28, 2010 at 14:56, Daniel-Constantin Mierla <miconda(a)gmail.com> wrote:
I am cc-ing sr-dev, since the tcp code is from ser and Andrei may have
more insights...
Is this kamailio 1.5 or kamailio 3.0 (looks like <3.0 to me)?
On 1/28/10 2:41 PM, Aymeric Moizard wrote:
>
>
>On Thu, 28 Jan 2010, Henning Westerholt wrote:
>
>>On Thursday 28 January 2010, Aymeric Moizard wrote:
>>>Here is the backtrace I have, unfortunately without debug symbols!
>>>I found the same for many of the kamailio processes: "sched_yield"
>>>is pending forever. My system is Debian etch.
>>>
>>>#0 0xffffe424 in __kernel_vsyscall ()
>>>#1 0xb7cef4ac in sched_yield () from /lib/tls/i686/cmov/libc.so.6
>>>#2 0x080a93fd in tcp_send ()
>>>#3 0xb7975679 in send_pr_buffer () from /usr/lib/kamailio/modules/tm.so
>>>#4 0xb79789ac in t_forward_nonack () from /usr/lib/kamailio/modules/tm.so
>>>#5 0xb7974784 in t_relay_to () from /usr/lib/kamailio/modules/tm.so
>>>#6 0xb7983a11 in load_tm () from /usr/lib/kamailio/modules/tm.so
>>>#7 0x081cf810 in mem_pool ()
>>>#8 0x00000000 in ?? ()
The backtrace looks strange (mem_pool() and load_tm() for example).
It would help greatly to have it compiled with debug symbols from
source (or have around the exact source used for the compiled code).
>>>
>>>I guess most t_relay operations towards my "mobipouce.com" domain,
>>>with one IP being down, break the kamailio processes one after the
>>>other... I'm not sure that every such t_relay operation breaks
>>>exactly one thread each time.
>>>
>>>I went through the lock/unlock of tcp_main.c but it seems every
>>>lock has an unlock at least...
>>
>>Hi Aymeric,
>>
>>I remember that we observed these "sched_yield" problems on one old
>>0.9 system after some time (like weeks or months). We did not find the
>>solution in this case; after a restart it was gone again.
If it's kamailio 1.5 (or any version < 3.0) then the sched_yield() means
spinning on a lock. However I'm not sure we can trust the backtrace.
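
To illustrate what "spinning on a lock" with sched_yield() looks like, here
is a minimal sketch in Python. It is not the kamailio lock code; the name
spin_acquire and the max_spins bound are hypothetical, added only so the
example terminates. A real spin lock of this kind has no bound, so a lock
that is never released leaves the process sitting in sched_yield()
indefinitely.

```python
import os
import threading


def spin_acquire(lock, max_spins=None):
    """Try to take `lock`, yielding the CPU between failed attempts.

    Returns the number of failed attempts before the lock was obtained,
    or None if `max_spins` attempts failed (hypothetical safety bound).
    """
    spins = 0
    while not lock.acquire(blocking=False):
        if max_spins is not None and spins >= max_spins:
            return None
        spins += 1
        # Give up the time slice, then retry. With no bound and no
        # release, a backtrace would always show the process here.
        os.sched_yield()
    return spins
```

If the holder never calls release(), spin_acquire(lock) without a bound
never returns; with max_spins set it gives up and returns None.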
You mentioned in an earlier mail that you see this related to UDP
traffic, but in the log file and also in your investigations you think
it's related to TCP?
This is the exact case:
1-> A SUBSCRIBE is sent over UDP to kamailio, which receives it.
2-> kamailio does an SRV record lookup for "mobipouce.com".
3-> kamailio tries sip2.mobipouce.com (91.199.234.47) over TCP first.
4-> The connection fails with these logs:
Jan 27 12:56:38 ns26829 /usr/sbin/kamailio[9763]: ERROR:core:tcp_blocking_connect: poll error: flags 18
Jan 27 12:56:38 ns26829 /usr/sbin/kamailio[9763]: ERROR:core:tcp_blocking_connect: failed to retrieve SO_ERROR (111) Connection refused
Jan 27 12:56:38 ns26829 /usr/sbin/kamailio[9763]: ERROR:core:tcpconn_connect: tcp_blocking_connect failed
Jan 27 12:56:38 ns26829 /usr/sbin/kamailio[9763]: ERROR:core:tcp_send: connect failed
Jan 27 12:56:38 ns26829 /usr/sbin/kamailio[9763]: ERROR:tm:msg_send: tcp_send failed
Jan 27 12:56:38 ns26829 /usr/sbin/kamailio[9763]: ERROR:tm:t_forward_nonack: sending request failed
5-> I guess kamailio is supposed to try the other SRV record value,
sip2.mobipouce.com (91.199.234.46), but it doesn't.
Thus, I'm guessing the issue is related to SRV records with failover OR
just to the TCP failure. It is not related to UDP at all.
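
The failover expected in step 5 can be sketched as follows. This is only an
illustration of the general idea (try SRV targets in priority order and move
on when a connect fails), not how kamailio or tm implements it. SrvRecord,
connect_with_failover and the injected connect callable are hypothetical
names, and the weight-based ordering among records of equal priority is
left out for brevity.

```python
import socket
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def tcp_connect(host, port, timeout=3.0):
    """Default connector: a plain TCP connect with a timeout."""
    return socket.create_connection((host, port), timeout=timeout)


def connect_with_failover(records, connect=tcp_connect):
    """Try each SRV target, lowest priority value first.

    Returns (connection, errors). `connection` is None when every
    target failed; `errors` lists (target, exception) for each failure.
    """
    errors = []
    for rec in sorted(records, key=lambda r: r.priority):
        try:
            return connect(rec.target, rec.port), errors
        except OSError as exc:
            # e.g. ConnectionRefusedError (errno 111): note it and
            # fall through to the next record instead of giving up.
            errors.append((rec.target, exc))
    return None, errors
```

Passing a fake connect callable makes it possible to exercise the failover
path without a real server that refuses connections.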
So the TCP connect failed, the tcp worker returned (as it prints the
message) and, to be sure I got it right, the UDP worker (the one that
received the request) got blocked?
My guess (assuming kamailio 1.5) is that the problem is in tm somewhere
and not in the TCP code. There are hardly any TCP locks used in this
case (connect failure). Most likely the backtrace is bogus.
It's definitely possible to reproduce the issue now!
I guess anyone can try your version of kamailio and t_relay a message
to "mobipouce.com" and you'll fall into that case! Sending plenty of
those messages will eventually lock all kamailio processes.
All? tcp and udp?
Andrei