Description

I've been simulating calls while there's (faked) issues against the diameter server using TCP.
Ro_CCR() uses t_suspend() to suspend the SIP transaction while waiting for a response from the charging server.
A cdp transaction is also created with a callback to resume_on_initial_ccr() in the ims charging module.

When a response to a charging request is received, this is handled in cdp/api_process.c where cdp_take_trans() find the given transaction and returns it without a lock held. The callback is then executed, and everything is fine.

The problem occurs when a CCR times out, and the callback is done by the cdp_trans_timer() function. This hold a lock for the entire transaction list while executing callbacks for every timed out entry. In my case, the charging module does t_continue and the execution of that routing block is started again.

Here we reply a cause indicating it was an issue with the charging server. This will trigger a new callback from the ims_dialog module which again will try to start a new cdp transaction (CCR Stop) - still while the cdp_trans_timer function holds a lock.

Since the cdp timers now are blocked, no new connection attempts to the failed diameter server is done either.

Troubleshooting

Reproduction

Ro_CCR("RO_CONTINUE_DISPATCHER", "orig", 30, "", "")
this times out, and will do continue in the routing block specified (based on the cdp transaction timeout timer)

route[RO_CONTINUE_DISPATCHER] {
send_reply("500", "Internal error");
exit;
}

Debugging Data

cdp_add_trans() on line 2 is blocked while waiting for lock held by cdp_trans_timer() on line 26 to be freed. The line numbers are a bit off because of some debugging.

#0  0x00007f6d9a3e0e29 in syscall () from /lib64/libc.so.6
#1  0x00007f6d979622f6 in futex_get (lock=0x7f6d706622a0) at ../../core/mem/../futexlock.h:121
#2  0x00007f6d979632cd in cdp_add_trans (msg=0x7f6d70921280, cb=0x7f6d9084caf5 <resume_on_termination_ccr>, ptr=0x0, timeout=5, auto_drop=1) at transaction.c:126
#3  0x00007f6d97980740 in AAASendMessage (message=0x7f6d70921280, callback_f=0x7f6d9084caf5 <resume_on_termination_ccr>, callback_param=0x0) at diameter_comm.c:158
#4  0x00007f6d9084c71b in send_ccr_stop_with_param (ro_session=0x7f6d70921dc0, code=0, reason=0x7ffe9b6cd400) at ims_ro.c:1183
#5  0x00007f6d90829f6d in dlg_terminated (dlg=0x7f6d7091f5d0, type=4, termcode=0, reason=0x7f6d9086f998 "call failed", _params=0x7f6d90d20280 <params>) at dialog.c:249
#6  0x00007f6d908216cd in dlg_callback_received (dlg=0x7f6d7091f5d0, type=4, _params=0x7f6d90d20280 <params>) at dialog.c:29
#7  0x00007f6d90aed1b9 in run_dlg_callbacks (type=4, dlg=0x7f6d7091f5d0, req=0x7f6d7091fb68, rpl=0xffffffffffffffff, dir=2, dlg_data=0x0) at dlg_cb.c:271
#8  0x00007f6d90abced4 in dlg_onreply (t=0x7f6d70931e10, type=1048576, param=0x7ffe9b6cda80) at dlg_handlers.c:1363
#9  0x00007f6d94252f2c in run_trans_callbacks_internal (cb_lst=0x7f6d70931eb8, type=1048576, trans=0x7f6d70931e10, params=0x7ffe9b6cda80) at t_hooks.c:236
#10 0x00007f6d942530e4 in run_trans_callbacks_with_buf (type=1048576, rbuf=0x7f6d70931f10, req=0x7f6d7091fb68, repl=0xffffffffffffffff, flags=0) at t_hooks.c:272
#11 0x00007f6d941d2860 in _reply_light (trans=0x7f6d70931e10, buf=0x7f6d97d10768 "SIP/2.0 480 Temporarily Unavailable\r\nVia: SIP/2.0/UDP 192.168.10.15:5060;branch=z9hG4bK427d31b9\r\nFrom: \"sipp\" <sip:+4749484948@192.168.10.15;user=phone>;tag=as22aa85b4\r\nTo: <sip:+4749494848@192.168.10.20"..., len=386, code=480, to_tag=0x7f6d944bb5e0 <tm_tags> "b239b52f382f2203913409401a079c72-9b859c73", to_tag_len=41, lock=0, bm=0x7ffe9b6ce5d0)
    at t_reply.c:520
#12 0x00007f6d941d3e8e in _reply (trans=0x7f6d70931e10, p_msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce750, lock=0) at t_reply.c:689
#13 0x00007f6d941dfdcf in t_reply_str_unsafe (t=0x7f6d70931e10, p_msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce750) at t_reply.c:1753
#14 0x00007f6d941fd400 in ki_t_reply (msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce750) at tm.c:1538
#15 0x00007f6d941fdcee in w_t_reply_wrp (msg=0x7f6d70946878, code=480, txt=0x7f6d97d106e8 "Temporarily Unavailable") at tm.c:1598
#16 0x00007f6d93d24c0e in send_reply (msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce880) at sl.c:287
#17 0x00007f6d93d25728 in w_send_reply (msg=0x7f6d70946878, p1=0x7f6d97cee2e8 "x)\320\227m\177", p2=0x7f6d97cee388 "\350)\320\227m\177") at sl.c:329
#18 0x0000000000466fb3 in do_action (h=0x7ffe9b6cf9b0, a=0x7f6d97d025d8, msg=0x7f6d70946878) at core/action.c:1131
#19 0x0000000000474188 in run_actions (h=0x7ffe9b6cf9b0, a=0x7f6d97d02210, msg=0x7f6d70946878) at core/action.c:1618
#20 0x000000000047028d in do_action (h=0x7ffe9b6cf9b0, a=0x7f6d97d02f00, msg=0x7f6d70946878) at core/action.c:1241
#21 0x0000000000474188 in run_actions (h=0x7ffe9b6cf9b0, a=0x7f6d97cfdb68, msg=0x7f6d70946878) at core/action.c:1618
#22 0x00000000004748cc in run_top_route (a=0x7f6d97cfdb68, msg=0x7f6d70946878, c=0x0) at core/action.c:1701
#23 0x00007f6d9425802a in t_continue_helper (hash_index=39149, label=1245761144, rtact=0x7f6d97cfdb68, cbname=0x0, cbparam=0x0, skip_timer=1) at t_suspend.c:307
#24 0x00007f6d9425c6d8 in t_continue (hash_index=39149, label=1245761144, route=0x7f6d97cfdb68) at t_suspend.c:603
#25 0x00007f6d9085b372 in resume_on_initial_ccr (is_timeout=1, param=0x7f6d70921cc8, cca=0x0, elapsed_msecs=1) at ims_ro.c:1786
#26 0x00007f6d979646d1 in cdp_trans_timer (now=1699966590, ptr=0x0) at transaction.c:223
#27 0x00007f6d9798e3b0 in timer_loop () at timer.c:117
#28 0x00007f6d9798f639 in timer_process (returns=0) at timer.c:217
#29 0x00007f6d97942cf8 in diameter_peer_start (blocking=0) at diameter_peer.c:350
#30 0x00007f6d97932bb2 in cdp_child_init (rank=0) at cdp_mod.c:272

Log Messages


SIP Traffic


Possible Solutions

I'm considering changing cdp_trans_timer() to iterate through the list, remove expired items while keeping them in a new list so all callbacks can be executed without the lock held.

Could also be fixed within the ims_charging module. For example by not sending a CCR-T when there's no answer to the CCR-I, but I feel that's more of a band aid rather than a real fix.

Any feedback or suggestions would be nice.

Additional Information

master
centos 7.8


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <kamailio/kamailio/issues/3641@github.com>