I've been simulating calls while there's (faked) issues against the diameter server using TCP.
Ro_CCR() uses t_suspend() to suspend the SIP transaction while waiting for a response from the charging server.
A cdp transaction is also created with a callback to resume_on_initial_ccr() in the ims charging module.
When a response to a charging request is received, this is handled in cdp/api_process.c where cdp_take_trans() find the given transaction and returns it without a lock held. The callback is then executed, and everything is fine.
The problem occurs when a CCR times out, and the callback is done by the cdp_trans_timer() function. This hold a lock for the entire transaction list while executing callbacks for every timed out entry. In my case, the charging module does t_continue and the execution of that routing block is started again.
Here we reply a cause indicating it was an issue with the charging server. This will trigger a new callback from the ims_dialog module which again will try to start a new cdp transaction (CCR Stop) - still while the cdp_trans_timer function holds a lock.
Since the cdp timers now are blocked, no new connection attempts to the failed diameter server is done either.
Ro_CCR("RO_CONTINUE_DISPATCHER", "orig", 30, "", "")
this times out, and will do continue in the routing block specified (based on the cdp transaction timeout timer)
route[RO_CONTINUE_DISPATCHER] {
send_reply("500", "Internal error");
exit;
}
cdp_add_trans() on line 2 is blocked while waiting for lock held by cdp_trans_timer() on line 26 to be freed. The line numbers are a bit off because of some debugging.
#0 0x00007f6d9a3e0e29 in syscall () from /lib64/libc.so.6
#1 0x00007f6d979622f6 in futex_get (lock=0x7f6d706622a0) at ../../core/mem/../futexlock.h:121
#2 0x00007f6d979632cd in cdp_add_trans (msg=0x7f6d70921280, cb=0x7f6d9084caf5 <resume_on_termination_ccr>, ptr=0x0, timeout=5, auto_drop=1) at transaction.c:126
#3 0x00007f6d97980740 in AAASendMessage (message=0x7f6d70921280, callback_f=0x7f6d9084caf5 <resume_on_termination_ccr>, callback_param=0x0) at diameter_comm.c:158
#4 0x00007f6d9084c71b in send_ccr_stop_with_param (ro_session=0x7f6d70921dc0, code=0, reason=0x7ffe9b6cd400) at ims_ro.c:1183
#5 0x00007f6d90829f6d in dlg_terminated (dlg=0x7f6d7091f5d0, type=4, termcode=0, reason=0x7f6d9086f998 "call failed", _params=0x7f6d90d20280 <params>) at dialog.c:249
#6 0x00007f6d908216cd in dlg_callback_received (dlg=0x7f6d7091f5d0, type=4, _params=0x7f6d90d20280 <params>) at dialog.c:29
#7 0x00007f6d90aed1b9 in run_dlg_callbacks (type=4, dlg=0x7f6d7091f5d0, req=0x7f6d7091fb68, rpl=0xffffffffffffffff, dir=2, dlg_data=0x0) at dlg_cb.c:271
#8 0x00007f6d90abced4 in dlg_onreply (t=0x7f6d70931e10, type=1048576, param=0x7ffe9b6cda80) at dlg_handlers.c:1363
#9 0x00007f6d94252f2c in run_trans_callbacks_internal (cb_lst=0x7f6d70931eb8, type=1048576, trans=0x7f6d70931e10, params=0x7ffe9b6cda80) at t_hooks.c:236
#10 0x00007f6d942530e4 in run_trans_callbacks_with_buf (type=1048576, rbuf=0x7f6d70931f10, req=0x7f6d7091fb68, repl=0xffffffffffffffff, flags=0) at t_hooks.c:272
#11 0x00007f6d941d2860 in _reply_light (trans=0x7f6d70931e10, buf=0x7f6d97d10768 "SIP/2.0 480 Temporarily Unavailable\r\nVia: SIP/2.0/UDP 192.168.10.15:5060;branch=z9hG4bK427d31b9\r\nFrom: \"sipp\" <sip:+4749484948@192.168.10.15;user=phone>;tag=as22aa85b4\r\nTo: <sip:+4749494848@192.168.10.20"..., len=386, code=480, to_tag=0x7f6d944bb5e0 <tm_tags> "b239b52f382f2203913409401a079c72-9b859c73", to_tag_len=41, lock=0, bm=0x7ffe9b6ce5d0)
at t_reply.c:520
#12 0x00007f6d941d3e8e in _reply (trans=0x7f6d70931e10, p_msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce750, lock=0) at t_reply.c:689
#13 0x00007f6d941dfdcf in t_reply_str_unsafe (t=0x7f6d70931e10, p_msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce750) at t_reply.c:1753
#14 0x00007f6d941fd400 in ki_t_reply (msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce750) at tm.c:1538
#15 0x00007f6d941fdcee in w_t_reply_wrp (msg=0x7f6d70946878, code=480, txt=0x7f6d97d106e8 "Temporarily Unavailable") at tm.c:1598
#16 0x00007f6d93d24c0e in send_reply (msg=0x7f6d70946878, code=480, reason=0x7ffe9b6ce880) at sl.c:287
#17 0x00007f6d93d25728 in w_send_reply (msg=0x7f6d70946878, p1=0x7f6d97cee2e8 "x)\320\227m\177", p2=0x7f6d97cee388 "\350)\320\227m\177") at sl.c:329
#18 0x0000000000466fb3 in do_action (h=0x7ffe9b6cf9b0, a=0x7f6d97d025d8, msg=0x7f6d70946878) at core/action.c:1131
#19 0x0000000000474188 in run_actions (h=0x7ffe9b6cf9b0, a=0x7f6d97d02210, msg=0x7f6d70946878) at core/action.c:1618
#20 0x000000000047028d in do_action (h=0x7ffe9b6cf9b0, a=0x7f6d97d02f00, msg=0x7f6d70946878) at core/action.c:1241
#21 0x0000000000474188 in run_actions (h=0x7ffe9b6cf9b0, a=0x7f6d97cfdb68, msg=0x7f6d70946878) at core/action.c:1618
#22 0x00000000004748cc in run_top_route (a=0x7f6d97cfdb68, msg=0x7f6d70946878, c=0x0) at core/action.c:1701
#23 0x00007f6d9425802a in t_continue_helper (hash_index=39149, label=1245761144, rtact=0x7f6d97cfdb68, cbname=0x0, cbparam=0x0, skip_timer=1) at t_suspend.c:307
#24 0x00007f6d9425c6d8 in t_continue (hash_index=39149, label=1245761144, route=0x7f6d97cfdb68) at t_suspend.c:603
#25 0x00007f6d9085b372 in resume_on_initial_ccr (is_timeout=1, param=0x7f6d70921cc8, cca=0x0, elapsed_msecs=1) at ims_ro.c:1786
#26 0x00007f6d979646d1 in cdp_trans_timer (now=1699966590, ptr=0x0) at transaction.c:223
#27 0x00007f6d9798e3b0 in timer_loop () at timer.c:117
#28 0x00007f6d9798f639 in timer_process (returns=0) at timer.c:217
#29 0x00007f6d97942cf8 in diameter_peer_start (blocking=0) at diameter_peer.c:350
#30 0x00007f6d97932bb2 in cdp_child_init (rank=0) at cdp_mod.c:272
I'm considering changing cdp_trans_timer() to iterate through the list, remove expired items while keeping them in a new list so all callbacks can be executed without the lock held.
Could also be fixed within the ims_charging module. For example by not sending a CCR-T when there's no answer to the CCR-I, but I feel that's more of a band aid rather than a real fix.
Any feedback or suggestions would be nice.
kamailio -v
master
centos 7.8
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.