Hi All,
I have been experiencing a deadlock when a timeout occurs on a t_relay()'d INVITE. Going through the code I have noticed a possible deadlock (without re-entrant locks enabled). Here is my thinking:
t_should_relay_response() is called with REPLY_LOCK held when the fr_inv_timer fires in the timer process (no response to the relayed INVITE other than a 100 provisional) and a 408 is generated. However, from within that function there are calls to run_failure_handlers(), which in turn *could* try to take the reply lock again (e.g. somebody calling t_reply() in the cfg file, in a failure_route block). That would be a second lock on the same transaction's REPLY_LOCK...
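Roughly, the sequence I suspect (a simplified sketch from reading the code; indentation shows the call chain):

timer process: fr_inv_timer fires, a 408 is generated
  -> REPLY_LOCK is acquired on the transaction
  -> t_should_relay_response()
       -> run_failure_handlers()        (runs the failure_route from the cfg)
            -> t_reply() called in the cfg
                 -> tries to acquire the same REPLY_LOCK again
                    => the timer process deadlocks on itself (the lock is not re-entrant)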
Has anybody else experienced something like this?
This is on master, btw.
Cheers Jason
Hello,
that should not be a very rare case and I would have expected it to be caught by now; anyhow ... this looks easy to reproduce, have you tried it?
You can run two Kamailio instances, one relaying the INVITE to the second, which replies with 100 and nothing else; then wait for the timeout on the first instance. You can add some debug messages in the code to see if the lock is taken twice.
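For the second instance, a config sketch like this should be enough (untested, and assuming the sl and textops modules are loaded; it sends only the provisional and never a final reply):

request_route {
    if (is_method("INVITE")) {
        sl_send_reply("100", "Trying");  # stateless provisional only
        exit;                            # never relay, never send a final reply
    }
}

The first instance then needs a failure_route with a t_reply() call, as described, so that the 408 handling runs into it.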
Cheers, Daniel
Hey Daniel,
Yes, I did a test with a very basic config file and I am not able to re-create. However, with my *complex* cfg file I can re-create it every time. Tomorrow I will compare what is different and report back... hopefully with a fix ;)
here is the backtrace of the timer process deadlocking itself. Reading the frames: relay_reply() (#18) is entered from the timer with the reply lock held; the failure_route's t_reply() takes the no-re-lock path t_reply_unsafe() (#7), but the resulting 500 triggers cancel_uacs()/cancel_branch() (#4/#3), which then blocks in _lock() at t_cancel.c:250 on what appears to be the same transaction's lock:
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
#1  0x00007f5009f22004 in futex_get (lock=0x7f4fc55030d8) at ../../mem/../futexlock.h:123
#2  0x00007f5009f223e1 in _lock (s=0x7f4fc55030d8, file=0x7f5009f90fd1 "t_cancel.c", function=0x7f5009f91980 "cancel_branch", line=250) at lock.h:99
#3  0x00007f5009f23271 in cancel_branch (t=0x7f4fc5501b40, branch=0, reason=0x7fff646d03a8, flags=3) at t_cancel.c:250
#4  0x00007f5009f22c02 in cancel_uacs (t=0x7f4fc5501b40, cancel_data=0x7fff646d03a0, flags=1) at t_cancel.c:123
#5  0x00007f5009f718c4 in _reply_light (trans=0x7f4fc5501b40, buf=0x7f500a24dc68 "SIP/2.0 500 Server error on LIR select next S-CSCF\r\nVia: SIP/2.0/UDP 10.0.1.167:6060;branch=z9hG4bKb7.2ae09f29ffbd0034cd6d58483053603b.1\r\nVia: SIP/2.0/UDP 10.0.1.166:4060;branch=z9hG4bKb7.3faa03ddea80"..., len=778, code=500, to_tag=0x7f500a1c7ae0 "c82b15d7f12ef185f95fe4945457d449-8bab", to_tag_len=37, lock=0, bm=0x7fff646d0b60) at t_reply.c:660
#6  0x00007f5009f7244c in _reply (trans=0x7f4fc5501b40, p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server error on LIR select next S-CSCF", lock=0) at t_reply.c:795
#7  0x00007f5009f76436 in t_reply_unsafe (t=0x7f4fc5501b40, p_msg=0x7f500a1c6bc0, code=500, text=0x7f500a249a48 "Server error on LIR select next S-CSCF") at t_reply.c:1643
#8  0x00007f5009f57621 in w_t_reply (msg=0x7f500a1c6bc0, p1=0x7f500a2497d8 "\340\332$\nP\177", p2=0x7f500a249870 "h\321$\nP\177") at tm.c:1324
#9  0x000000000041a700 in do_action (h=0x7fff646d1d30, a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1119
#10 0x0000000000423831 in run_actions (h=0x7fff646d1d30, a=0x7f500a24cee8, msg=0x7f500a1c6bc0) at action.c:1607
#11 0x000000000041a5a4 in do_action (h=0x7fff646d1d30, a=0x7f500a24d478, msg=0x7f500a1c6bc0) at action.c:1102
#12 0x0000000000423831 in run_actions (h=0x7fff646d1d30, a=0x7f500a249148, msg=0x7f500a1c6bc0) at action.c:1607
#13 0x000000000041a54e in do_action (h=0x7fff646d1d30, a=0x7f500a24c500, msg=0x7f500a1c6bc0) at action.c:1098
#14 0x0000000000423831 in run_actions (h=0x7fff646d1d30, a=0x7f500a247a28, msg=0x7f500a1c6bc0) at action.c:1607
#15 0x0000000000423fdf in run_top_route (a=0x7f500a247a28, msg=0x7f500a1c6bc0, c=0x0) at action.c:1693
#16 0x00007f5009f73815 in run_failure_handlers (t=0x7f4fc5501b40, rpl=0xffffffffffffffff, code=408, extra_flags=96) at t_reply.c:1061
#17 0x00007f5009f7527a in t_should_relay_response (Trans=0x7f4fc5501b40, new_code=408, branch=1, should_store=0x7fff646d201c, should_relay=0x7fff646d2018, cancel_data=0x7fff646d2070, reply=0xffffffffffffffff) at t_reply.c:1416
#18 0x00007f5009f76ede in relay_reply (t=0x7f4fc5501b40, p_msg=0xffffffffffffffff, branch=1, msg_status=408, cancel_data=0x7fff646d2070, do_put_on_wait=0) at t_reply.c:1819
#19 0x00007f5009f44c88 in fake_reply (t=0x7f4fc5501b40, branch=1, code=408) at timer.c:354
#20 0x00007f5009f450e7 in final_response_handler (r_buf=0x7f4fc5501e60, t=0x7f4fc5501b40) at timer.c:526
#21 0x00007f5009f4518d in retr_buf_handler (ticks=260027386, tl=0x7f4fc5501e80, p=0x3e8) at timer.c:584
#22 0x0000000000544119 in timer_list_expire (t=260027386, h=0x7f4fc527cbe0, slow_l=0x7f4fc527cdf0, slow_mark=0) at timer.c:894
#23 0x0000000000544418 in timer_handler () at timer.c:959
#24 0x00000000005446b2 in timer_main () at timer.c:998
#25 0x0000000000471ddf in main_loop () at main.c:1689
Hello,
what tm parameters are you using?
Cheers, Daniel
Hey Daniel,
nothing extraordinary...
# -- TM params -- (values are in milliseconds)
modparam("tm", "fr_timer", 20000)
modparam("tm", "fr_inv_timer", 10000)
Cheers Jason
Hello Jason,
I pushed a patch trying to fix this case; it is only on the git master branch. Can you test it? If all goes fine, we can consider backporting it.
Cheers, Daniel
oh excellent, I will look at it right away - was just getting ready to jump in myself ;)
Cheers Jason
Nice work, Daniel!! Works perfectly! ;)
Cheers Jason
OK. I will leave it in master for a bit to see if there are any new reports, then I will backport. I will also have to review the tm reply functions that can be used from config, to align them with the new check.
Cheers, Daniel
Hey Daniel,
which reply functions are you referring to? API functions?
Cheers Jason
Hello,
iirc, there are several functions the script writer can use, like t_reply_callid() from tmx. The idea is to analyze a bit in order to detect whether a forced reply may end up canceling some pending branches; in that case the reply on the branch no longer matters and should not be considered for relaying upstream, because the script writer has already decided what to send out.
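As an illustration only (a hypothetical sketch, not the committed patch; reply_lock_held() is an invented helper), the shape of such a check around the tm reply entry points could be:

/* sketch: pick the locked or unlocked tm entry point depending on
 * whether this process already holds the transaction's reply lock,
 * instead of blindly re-locking; struct cell, t_reply() and
 * t_reply_unsafe() come from tm's t_reply.h */
#include "t_reply.h"

int cfg_reply_safe(struct cell *t, struct sip_msg *msg, int code, char *reason)
{
    if (reply_lock_held(t))   /* invented helper: inside a REPLY_LOCK section? */
        return t_reply_unsafe(t, msg, code, reason);  /* lock already held */
    return t_reply(t, msg, code, reason);             /* takes REPLY_LOCK */
}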
Cheers, Daniel
Yes, okay - got it. I will test and analyse.
Thanks Daniel!
On Thu, Apr 10, 2014 at 4:35 PM, Daniel-Constantin Mierla <miconda@gmail.com> wrote:
Hello,
IIRC, there are several functions that a script writer can use, like t_reply_callid() from tmx. The idea is to analyze a bit in order to detect whether a forced reply may end up canceling some pending branches -- the reply on the branch doesn't matter anymore and should not be considered for relaying upstream, because the script writer has already decided what to send out.
Cheers, Daniel
On 10/04/14 13:24, Jason Penton wrote:
Hey Daniel,
Which reply functions are you referring to? API functions?
Cheers Jason
On Thu, Apr 10, 2014 at 12:53 PM, Daniel-Constantin Mierla <miconda@gmail.com> wrote:
OK. I will leave it a bit in master to see if there are any new reports, then I will backport. I will also have to review the tm reply functions that can be used from config to align them to the new check.
Cheers, Daniel
On 10/04/14 09:06, Jason Penton wrote:
Oh excellent, I will look at it right away - I was just getting ready to jump in myself ;)
Cheers Jason
On Thu, Apr 10, 2014 at 9:01 AM, Daniel-Constantin Mierla <miconda@gmail.com> wrote:
Hello Jason,
I pushed a patch trying to fix this case; it is only on the git master branch. Can you test it? If all goes fine, we can consider backporting it.
Cheers, Daniel
On 11/04/14 14:45, Jason Penton wrote:
Yes, okay - got it. I will test and analyse.
I meant I want to analyze before deciding to backport, but a second analysis is never bad :-)
Cheers, Daniel
Hi Daniel,
I have found another instance of a similar bug, where a process deadlocks itself. The sequence of events:
1. An INVITE request is relayed downstream.
2. An error response or a locally generated 408 occurs.
3. The failure route in the cfg is armed and tries to relay to an alternate destination (voicemail/announcement server, etc.) using t_relay().
4. t_relay() fails to the destination for whatever reason (in this case because the destination is blacklisted in TM).
5. The original reply is then relayed upstream (this is all done safely, i.e. using t_reply_unsafe(), as the REPLY_LOCK has already been acquired in the failure route).
6. At or around line 562 of t_reply.c we check if the reply is > 200 (final response). If it is, and if it is an INVITE transaction, we clean up timers and cancel all UAC branches. This is where we hit a problem:

cancel_uacs( trans, &cancel_data, F_CANCEL_B_KILL, lock );

7. cancel_uacs() calls cancel_branch() for every branch that has not received a response.
8. cancel_branch() calls LOCK_REPLIES(t) on the transaction and you hit deadlock (the original LOCK_REPLIES(t) happened in the failure route processing; see the sketch below).
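To make step 8 concrete, here is a tiny stand-alone illustration of the failure mode (plain pthreads, not Kamailio's futex-based gen_lock_t, and the mapping to the failure-route path lives only in the comments); an error-checking mutex reports the self-deadlock instead of hanging:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t m;
    pthread_mutexattr_t attr;
    int rc;

    /* an ERRORCHECK mutex returns EDEADLK on a re-acquire by the same
     * owner; a NORMAL mutex (like the non-reentrant REPLY_LOCK) would
     * simply block forever */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);       /* outer acquire: failure route holds the reply lock */
    rc = pthread_mutex_lock(&m);  /* inner acquire: cancel_branch() tries to lock again */
    printf("second acquire: %s\n", rc ? strerror(rc) : "ok");

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

With the non-error-checking lock used in production, that second acquire is the forever-blocked futex_get() visible at the top of the backtraces earlier in this thread.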
This use case is slightly different from the one you fixed relating to the retransmission timers (at the beginning of this thread).
Suggestion: what about a pkg/process variable that can be used to check whether the reply lock has already been acquired, to combat these non-reentrant deadlocks? A side effect of this would be saving on passing variables around in various functions to indicate the intention to lock or not.
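A minimal sketch of that suggestion, assuming hypothetical names (reply_lock_enter/exit, reply_lock_depth) and with a pthread mutex standing in for the shared-memory gen_lock_t; because each worker is a single-threaded forked process, a plain static counter is naturally per-process:

#include <pthread.h>

/* stand-in for the transaction's REPLY_LOCK; in Kamailio this is a
 * gen_lock_t living in shared memory, not a pthread mutex */
static pthread_mutex_t reply_lock = PTHREAD_MUTEX_INITIALIZER;

/* per-process nesting depth: each forked worker gets its own copy */
static int reply_lock_depth = 0;

static void reply_lock_enter(void)
{
    /* only the outermost caller in this process takes the real lock */
    if (reply_lock_depth++ == 0)
        pthread_mutex_lock(&reply_lock);
}

static void reply_lock_exit(void)
{
    /* only the outermost caller releases it again */
    if (--reply_lock_depth == 0)
        pthread_mutex_unlock(&reply_lock);
}

Inner code paths such as cancel_branch() could then call reply_lock_enter() unconditionally, instead of having a lock/unsafe flag threaded through every function signature.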
Thoughts?
Cheers Jason
Hi Jason,
What about making the reply lock re-entrant? I think it should be safer in the long term to allow parts of code executed by the same process to safely acquire the lock without worrying whether the same process took it before. Global variables in pkg would need more complex management to set, test and reset.
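For comparison, a minimal sketch of the re-entrant variant (hypothetical rec_lock_t / rec_lock_get / rec_lock_release names; a pthread mutex again stands in for the shared-memory gen_lock_t): the lock remembers its owning process, so a repeated acquire by the same process only bumps a depth counter:

#include <unistd.h>
#include <pthread.h>

typedef struct rec_lock {
    pthread_mutex_t mtx;  /* stand-in for gen_lock_t in shared memory */
    pid_t owner;          /* pid of the current holder, 0 when free */
    int count;            /* recursion depth of the holder */
} rec_lock_t;

void rec_lock_get(rec_lock_t *l)
{
    pid_t self = getpid();
    if (l->owner == self) {
        /* same process re-entering: no futex operation needed */
        l->count++;
        return;
    }
    pthread_mutex_lock(&l->mtx);
    l->owner = self;
    l->count = 1;
}

void rec_lock_release(rec_lock_t *l)
{
    if (--l->count == 0) {
        /* outermost release: hand the lock back */
        l->owner = 0;
        pthread_mutex_unlock(&l->mtx);
    }
}

Checking l->owner without holding the mutex is safe in this scheme: only the holder writes the field, and a process that does not hold the lock can never find its own pid there, so in the worst case it just falls through to the blocking acquire.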
Cheers, Daniel
Hi Jason,
I made the reply lock re-entrant; this should handle the situation you were exposing. If you have a chance to simulate it, let me know the result.
On a slightly different note, I wonder whether async_mutex should be made re-entrant as well -- what do you think? I guess it is unlikely to get two async attempts in the same process, but you know better how the mutex is planned to be used.
Cheers, Daniel
Hey Daniel,
OK, great, I will test. No, I don't see a need to make async_mutex re-entrant, but let me check.
I will test your change too - do you have a commit ref for it?
Cheers Jason
--
Daniel-Constantin Mierla
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
Kamailio World Conference, May 27-29, 2015
Berlin, Germany - http://www.kamailioworld.com