<!-- Kamailio Pull Request Template -->
<!-- IMPORTANT:
  - for detailed contributing guidelines, read:
    https://github.com/kamailio/kamailio/blob/master/.github/CONTRIBUTING.md
  - pull requests must be done to master branch, unless they are backports
    of fixes from master branch to a stable branch
  - backports to stable branches must be done with 'git cherry-pick -x ...'
  - code is contributed under BSD for core and main components (tm, sl, auth, tls)
  - code is contributed GPLv2 or a compatible license for the other components
  - GPL code is contributed with OpenSSL licensing exception
-->
#### Pre-Submission Checklist
<!-- Go over all points below, and after creating the PR, tick all the checkboxes that apply -->
<!-- All points should be verified, otherwise, read the CONTRIBUTING guidelines from above -->
<!-- If you're unsure about any of these, don't hesitate to ask on sr-dev mailing list -->
- [x] Commit message has the format required by CONTRIBUTING guide
- [x] Commits are split per component (core, individual modules, libs, utils, ...)
- [x] Each component has a single commit (if not, squash them into one commit)
- [x] No commits to README files for modules (changes must be done to docbook files in `doc/` subfolder, the README file is autogenerated)
#### Type Of Change
- [x] Small bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds new functionality)
- [ ] Breaking change (fix or feature that would change existing functionality)
#### Checklist:
<!-- Go over all points below, and after creating the PR, tick the checkboxes that apply -->
- [ ] PR should be backported to stable branches
- [x] Tested changes locally
- [ ] Related to issue #XXXX (replace XXXX with an open issue number)
#### Description
We've been running the ims_charging module in production for quite some time without any issues. A couple of days ago the whole instance froze, but I was able to get a dump of the stuck processes before restarting. The traffic volume has gradually increased over time, so it's likely related to that.
I then made a summary of the different types of stuck processes: which locks they're holding, and which ones they're waiting for:
```
holding lock: AAASessionsLock
waiting for lock: lock_get(peer_list_lock);
#1 0x00007fe70eba8cde in futex_get (lock=0x7fe6e4a5c680) at ../../core/futexlock.h:108
#2 0x00007fe70ebaaca2 in get_peer_by_fqdn (fqdn=0x7fe6e4a5ba30) at peermanager.c:259
#3 0x00007fe70ebb64b7 in get_first_connected_route (cdp_session=0x7fe6e5cc5350, r=0x7fe6e4a5ba30, app_id=4, vendor_id=10415) at routing.c:115
#4 0x00007fe70ebb9a37 in get_routing_peer (cdp_session=0x7fe6e5cc5350, m=0x7fe6e5447ab0) at routing.c:293
#5 0x00007fe70ebcaf8c in AAASendMessage (message=0x7fe6e5447ab0, callback_f=0x7fe707a9e323 <resume_on_initial_ccr>, callback_param=0x7fe6e5297110) at diameter_comm.c:139
#6 0x00007fe707a9d3e0 in Ro_Send_CCR (msg=0x7fe70f011100, dlg=0x7fe6e5c571f0, dir=0, reservation_units=30, incoming_trunk_id=0x7ffe50efb060, outgoing_trunk_id=0x7ffe50efb050, pani=0x7ffe50efaee0, action=0x7fe70efa4730, tindex=10484, tlabel=677172593) at ims_ro.c:1511
#7 0x00007fe707a8285d in ki_ro_ccr (msg=0x7fe70f011100, s_route_name=0x7ffe50efb080, s_direction=0x7ffe50efb070, reservation_units=30, s_incoming_trunk_id=0x7ffe50efb060, s_outgoing_trunk_id=0x7ffe50efb050) at ims_charging_mod.c:742
#8 0x00007fe707a7bf01 in w_ro_ccr (msg=0x7fe70f011100, c_route_name=0x7fe70ef8b8d0 "\220\311\371\016\347\177", c_direction=0x7fe70ef8b980 "p\240\371\016\347\177", reservation_units=30, c_incoming_trunk_id=0x7fe70ef8ba30 "p\241\371\016\347\177", c_outgoing_trunk_id=0x7fe70ef8bae0 "\360\241\371\016\347\177") at ims_charging_mod.c:507
#9 0x00000000004858d8 in do_action (h=0x7ffe50efb970, a=0x7fe70ef99e20, msg=0x7fe70f011100) at core/action.c:1144
#10 0x00000000004928d6 in run_actions (h=0x7ffe50efb970, a=0x7fe70ef99e20, msg=0x7fe70f011100) at core/action.c:1618
#11 0x0000000000492f52 in run_actions_safe (h=0x7ffe50eff1e0, a=0x7fe70ef99e20, msg=0x7fe70f011100) at core/action.c:1681
#12 0x0000000000450156 in rval_get_long (h=0x7ffe50eff1e0, msg=0x7fe70f011100, i=0x7ffe50efbec8, rv=0x7fe70ef9c1d8, cache=0x0) at core/rvalue.c:973
#13 0x0000000000454d24 in rval_expr_eval_long (h=0x7ffe50eff1e0, msg=0x7fe70f011100, res=0x7ffe50efbec8, rve=0x7fe70ef9c1d0) at core/rvalue.c:1854
#14 0x0000000000454d52 in rval_expr_eval_long (h=0x7ffe50eff1e0, msg=0x7fe70f011100, res=0x7ffe50efc448, rve=0x7fe70ef9b920) at core/rvalue.c:1864
#15 0x00000000004850ce in do_action (h=0x7ffe50eff1e0, a=0x7fe70ef9b070, msg=0x7fe70f011100) at core/action.c:1097
--
holding lock: AAASessionsLock
waiting for lock: lock_get(peer_list_lock);
#1 0x00007fe70eba8d4f in futex_get (lock=0x7fe6e4a5c680) at ../../core/futexlock.h:121
#2 0x00007fe70ebaaca2 in get_peer_by_fqdn (fqdn=0x7ffe50efab90) at peermanager.c:259
#3 0x00007fe70ebb8e89 in get_routing_peer (cdp_session=0x7fe6e5ab6910, m=0x7fe6e5435be0) at routing.c:252
#4 0x00007fe70ebcaf8c in AAASendMessage (message=0x7fe6e5435be0, callback_f=0x7fe707a95edc <resume_on_termination_ccr>, callback_param=0x0) at diameter_comm.c:139
#5 0x00007fe707a95b02 in send_ccr_stop_with_param (ro_session=0x7fe6e5ab65e0, code=0, reason=0x7ffe50efb060) at ims_ro.c:1181
#6 0x00007fe707a72ff7 in dlg_terminated (dlg=0x7fe6e623d7a0, type=64, termcode=0, reason=0x7fe707ab72b3 "normal call clearing", _params=0x7fe707f67280 <params>) at dialog.c:249
#7 0x00007fe707a6a729 in dlg_callback_received (dlg=0x7fe6e623d7a0, type=64, _params=0x7fe707f67280 <params>) at dialog.c:25
#8 0x00007fe707d341b9 in run_dlg_callbacks (type=64, dlg=0x7fe6e623d7a0, req=0x7fe70f011100, rpl=0x0, dir=1, dlg_data=0x0) at dlg_cb.c:271
#9 0x00007fe707cf4db4 in dlg_terminated (req=0x7fe70f011100, dlg=0x7fe6e623d7a0, dir=1) at dlg_handlers.c:413
#10 0x00007fe707cfddeb in dlg_onroute (req=0x7fe70f011100, route_params=0x7ffe50efb6d0, param=0x0) at dlg_handlers.c:1097
#11 0x00007fe70ad285f6 in run_rr_callbacks (req=0x7fe70f011100, rr_param=0x7ffe50efb7c0) at rr_cb.c:96
#12 0x00007fe70ad3ae92 in after_loose (_m=0x7fe70f011100, preloaded=0) at loose.c:1021
#13 0x00007fe70ad3b5ce in loose_route_mode (_m=0x7fe70f011100, _mode=0) at loose.c:1056
#14 0x00007fe70ad3f74f in w_loose_route (msg=0x7fe70f011100, p1=0x0, p2=0x0) at rr_mod.c:273
#15 0x00000000004855ff in do_action (h=0x7ffe50efc390, a=0x7fe70efe0d40, msg=0x7fe70f011100) at core/action.c:1121
--
holding lock: lock_get(peer_list_lock);
waiting for lock: lock_get(p->lock);
#1 0x00007fe70eba8d4f in futex_get (lock=0x7fe6e4a5cbd0) at ../../core/futexlock.h:121
#2 0x00007fe70ebab0ae in peer_timer (now=1742807320, ptr=0x0) at peermanager.c:286
#3 0x00007fe70ebd0f39 in timer_loop () at timer.c:116
#4 0x00007fe70ebd21b2 in timer_process (returns=0) at timer.c:216
#5 0x00007fe70eb8ccf8 in diameter_peer_start (blocking=0) at diameter_peer.c:350
#6 0x00007fe70eb7cbb2 in cdp_child_init (rank=0) at cdp_mod.c:272
--
holding lock: lock_get(p->lock)
waiting for lock: AAASessionsLock
#1 0x00007fe70ebf2597 in futex_get (lock=0x7fe6e4a5d490) at ../../core/futexlock.h:108
#2 0x00007fe70ebf26f1 in AAASessionsLock (hash=0) at session.c:79
#3 0x00007fe70ebf5e6e in cdp_get_session (id=...) at session.c:316
#4 0x00007fe70eba6892 in Snd_Message (p=0x7fe6e4a5c880, msg=0x7fe6e63998d0) at peerstatemachine.c:1237
#5 0x00007fe70eba003e in sm_process (p=0x7fe6e4a5c880, event=Send_Message, msg=0x7fe6e63998d0, peer_locked=0, sock=0) at peerstatemachine.c:429
#6 0x00007fe70ebcbdc6 in AAASendMessage (message=0x7fe6e63998d0, callback_f=0x7fe707a8f208 <resume_on_interim_ccr>, callback_param=0x7fe6e6190a90) at diameter_comm.c:166
#7 0x00007fe707a8edb3 in send_ccr_interim (ro_session=0x7fe6e5399160, used=60, reserve=30) at ims_ro.c:847
#8 0x00007fe707a68bd6 in ro_session_ontimeout (tl=0x7fe6e5399200) at ro_timer.c:513
#9 0x00007fe707a63078 in ro_timer_routine (ticks=114862426, attr=0x0) at ro_timer.c:279
#10 0x00000000004fd33e in compat_old_handler (ti=1837798827, tl=0x7fe6e4cf6260, data=0x7fe6e4cf6260) at core/timer.c:980
#11 0x00000000004fde7a in slow_timer_main () at core/timer.c:1103
#12 0x000000000042e4e7 in main_loop () at main.c:1911
#13 0x000000000043876c in main (argc=10, argv=0x7ffe50f001c8) at main.c:3236
```
In `get_first_connected_route()` in `routing.c` of the cdp module there are two places where `get_peer_by_fqdn()` is called. One of them has an unlock/relock of the session list before and after (and a comment about holding two locks at a time), while the other doesn't.
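For context, that guard looks roughly like this (paraphrased, not the exact upstream code; `dest_fqdn` is just a placeholder for the FQDN taken from the route entry):

```c
/* Paraphrased sketch of the guarded get_peer_by_fqdn() call in
 * get_first_connected_route(): the session hash slot is unlocked around the
 * peer lookup, so peer_list_lock is never acquired while the session lock is
 * held. The second call site has no such guard, which is where the
 * lock-order inversion can happen. */
if(cdp_session)
	AAASessionsUnlock(cdp_session->hash); /* drop the session lock first */

p = get_peer_by_fqdn(&dest_fqdn); /* peer_list_lock is taken on its own */

if(cdp_session)
	AAASessionsLock(cdp_session->hash); /* then re-acquire the session lock */
```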
I'm currently testing two versions of this. The first just adds the same unlock/relock around the other `get_peer_by_fqdn()` call, but this PR takes an approach that might fix more latent issues. `sm_process()` also has some strange handling where `Rcv_Process()` is only run after the peer lock is released, which seems to be about the same thing. My deadlock is in the `Snd_Message()` equivalent.
I've now removed this queueing behaviour and instead do a re-lock of the peer, to (hopefully) get the same locking order as the other operations.
The problem with this is how rarely it occurs. I just wanted to share my findings so that others with more knowledge of the cdp module can offer comments, suggestions and hopefully some extra testing. You can view, comment on, or merge this pull request online at:
https://github.com/kamailio/kamailio/pull/4191
-- Commit Summary --
* cdp: restructure locking order to prevent rare deadlock
-- File Changes --
M src/modules/cdp/peerstatemachine.c (25)
M src/modules/cdp/routing.c (6)
-- Patch Links --
https://github.com/kamailio/kamailio/pull/4191.patch
https://github.com/kamailio/kamailio/pull/4191.diff
@herlesupreeth commented on this pull request.
Thanks for the fix. I left a few comments. Let me know what you think
```diff
@@ -1241,8 +1233,12 @@ void Snd_Message(peer *p, AAAMessage *msg)
 	int send_message_before_session_sm = 0;
 	LM_DBG("called to peer [%.*s] for %s with code %d \n", p->fqdn.len,
 			p->fqdn.s, is_req(msg) ? "request" : "response", msg->commandCode);
-	if(msg->sessionId)
+	if(msg->sessionId) {
+		// Ensure proper locking order
+		lock_release(p->lock);
```
I am not sure adding the locking part here is the right way. I would prefer this function be called with the lock held (as mentioned in the comments above at line 1225). I will let others comment on this.
```diff
@@ -1351,8 +1347,13 @@ void Rcv_Process(peer *p, AAAMessage *msg)
 		return;
 	}

-	if(msg->sessionId)
+	if(msg->sessionId) {
+		// Ensure proper locking order
+		lock_release(p->lock);
```
Same comment as above about having the locking logic inside the function.
vingarzan left a comment (kamailio/kamailio#4191)
Hey @mtryfoss!
I think I wrote some big parts of this code, but that was a very long time ago and it may have seen a lot of changes by now that I'm not fully aware of. A further developed and heavily modified version of this code is in production in the closed-source project it originally came from, so hopefully it should be fixable. But there were also many changes here.
In hindsight, having Diameter session logic in the stack was a mistake, since in real life it's better to let the "application" handle it cleanly outside, but it might be too late to change that here. It's also not a huge amount of logic though, so let me try to help.
Seems like there is a dead-lock between the lock on a peer and the one on sessions. There is maybe also a 3rd lock involved, but I don't see that in your changes. And it might be just a side-effect, which would clear if the root cause is fixed. But the idea is the same: **we should try to eliminate any overlap between critical zones, in order to have some good guarantees about avoiding a dead/live-lock situation.**
So IMHO a proper fix/change would be to not have `peer *p` as a parameter to `Snd_Message()`. That would mean that you wouldn't hold `p->lock`. One way could be to use the FQDN instead, the other would be to copy (before unlocking) what the deeper functions need and avoid re-getting/re-locking in `peer_send_msg()`. In short: I'm advocating to actually have the same treatment to `Snd_Message()` as `Rcv_Process()` and call it also without a lock after the peer state machine big switch in `sm_process()`.
To explain why `Rcv_Process()` on purpose does not hold a lock while we call the handler in the application layer: imagine that you want to send back an answer to a request that you have received, to the same peer. The handler will dead-lock while sending, trying to get a lock on a peer which the handler's caller already has a lock on. For this reason I'd also eliminate the peer pointer from the `Rcv_Process()` call. Yet... that's exported, so it seems like a much deeper change is needed to fix it thoroughly.
Back to your dead-lock though: I think that the hack/fix-attempt in `get_first_connected_route()` which you are eliminating is/was a bad idea in the first place, so I encourage its removal. But I see now that the session pointer is all over the place in calls that route messages. It seems like someone tried to add a mechanism of session-stickiness-to-peers, but there was no consideration for avoiding overlapping locks... I'm actually surprised that it doesn't crash&burn much more often, because again: if you release a lock and acquire it again a few lines below, you are just asking for trouble, since someone might have already freed that session or peer pointer and you would be reading/writing from bad addresses. And of course, it's a bad design, which breaks code modularity between Diameter peers and sessions.
A better design for the unlock is to have a function with a double pointer peer/session parameter, which resets the pointer in the process, forcing a seg-fault if someone tries such unlock/do-something/relock-without-get patterns.
Anyway, long story short, it seems like a deeper change might be required. If you wish, I'd offer a call (hopefully time-boxed) to help you navigate `cdp` or maybe to discuss a deeper change for peers and sessions.
Cheers, -Dragos
@vingarzan commented on this pull request.
```diff
@@ -367,9 +366,7 @@ int sm_process(
 			p->state = R_Open;
 			break;
 		case R_Rcv_Message:
-			// delayed processing until out of the critical zone
-			//Rcv_Process(p,msg);
-			msg_received = 1;
+			Rcv_Process(p, msg);
```
Calling the handlers with the lock is maybe a bad idea. Sure, there is a "task-queue" in-between this process and the one actually doing the handling and potentially sending a message out to the same peer, but you are changing the "contract" which other functions relied on, that the peer wasn't locked.
I left a bigger description, though, of why having a pointer to the peer in `Rcv_Process()` is also bad in the absence of a locked peer p. Hence we probably need to change the APIs, not the locking strategy here.
@vingarzan commented on this pull request.
```diff
+	// Ensure proper locking order
+	lock_release(p->lock);
 	session = cdp_get_session(msg->sessionId->data);
+	lock_get(p->lock);
```
I think this actually adds more to the current issue, maybe just in the reversed order of operations.
IMHO the proper fix is to avoid overlapping critical zones, not to temporarily release the locks. If you do this release here, to avoid memory corruption, you should do something like this suggestion:
```suggestion
	// Avoid overlapping locks
	// TODO copy first some peer identifier, like the p->fqdn (deep!)
	tmp_fqdn = ....;
	lock_release(p->lock);
	p = 0;
	session = cdp_get_session(msg->sessionId->data);
	p = get_peer_by_fqdn(tmp_fqdn);
	if p == nil {
		// oh-oh, the peer is actually gone, so ... how do we resume and unfold this now??
		...
	}
```
@vingarzan commented on this pull request.
```diff
@@ -1351,8 +1347,13 @@ void Rcv_Process(peer *p, AAAMessage *msg)
 		return;
 	}

-	if(msg->sessionId)
+	if(msg->sessionId) {
+		// Ensure proper locking order
+		lock_release(p->lock);
```
yup, I agree... it's the same thing as removed from `routing.c`, just that the peer and the session are swapped. Removing that is a good idea (but might break other things).
vingarzan left a comment (kamailio/kamailio#4191)
The problem with "to (hopefully) have the same locking order as the other operations" is that in theory it might solve the immediate issue... but in practice the poison remains in the fridge :stuck_out_tongue_closed_eyes: ... How long until someone won't observe this and mess it back up again? So the proper solution I think is to unwind the need to hold both locks at the same time.
@vingarzan commented on this pull request.
```diff
+	// Ensure proper locking order
+	lock_release(p->lock);
 	session = cdp_get_session(msg->sessionId->data);
+	lock_get(p->lock);
```
I see now that you tried to at least have the same order of locking...
mtryfoss left a comment (kamailio/kamailio#4191)
First of all, thanks a lot for the comprehensive feedback!
> Seems like there is a dead-lock between the lock on a peer and the one on sessions. There is maybe also a 3rd lock involved, but I don't see that in your changes. And it might be just a side-effect, which would clear if the root cause is fixed.
Yes, there's a lock on the peer list as well. So there are 3 processes involved, but according to my analysis, changing this order should avoid the deadlock - though, as you say, it might introduce a segfault instead.
> I'm actually surprised that it doesn't crash&burn much more often
Indeed, but this seems to be very load and timing sensitive - combined with some specific behaviour of the ims_charging module. For example, the interim timer needs to be active. All in all, a nightmare to debug and/or confirm that the fix actually worked 😅
> Anyway, long story short, it seems like a deeper change might be required. If you wish, I'd offer a call (hopefully time-boxed) to help you navigate cdp or maybe to discuss a deeper change for peers and sessions.
I really appreciate this, but I'm probably not the correct person for a bigger/deeper change in this and other related modules.
I'll take a look at the code again based on your feedback, and see if it could be improved a bit as a short-term fix.
--
We're currently running a custom build with the change below on one of the clusters, as an immediate workaround. In my case, I'd take a segfault over a deadlock any day, since the service would just be restarted and continue operating.
https://github.com/mtryfoss/kamailio/commit/a573e107b81b3975daf78103b3dc2a58...
That's just a variant where we copy the hack used for sessions with a sticky peer, for all requests. It's been doing fine with a fair amount of traffic for several days, for our specific use case.
vingarzan left a comment (kamailio/kamailio#4191)
Hey Morten!
A seg-fault is indeed better. From my perspective it gives you the core-dump/back-trace needed to fix the code, right when the abuse happens. IMHO the peer pointer should always be reset on lock release, because any use beyond that is dangerous (something might free the memory while we're not holding the lock). The optimization of not having to call `get_peer_by_fqdn()` again and re-lock the proper way is not worth the risk. That's why I'm advocating for hiding `p->lock` and instead having an `unlock_peer(peer **p)` which forces `*p = 0` inside.
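A minimal sketch of what I mean (hypothetical, this helper does not exist in cdp today):

```c
/* Hypothetical helper, not in cdp today: release the peer lock and clear the
 * caller's pointer in one step, so any later use without a fresh
 * get_peer_by_fqdn() segfaults immediately instead of silently touching
 * memory that may already have been freed. */
static inline void unlock_peer(peer **p)
{
	if(p && *p) {
		lock_release((*p)->lock);
		*p = 0; /* force a visible crash on use-after-unlock */
	}
}
```

Callers would then have to do `unlock_peer(&p);` and later `p = get_peer_by_fqdn(&fqdn);` explicitly, which makes the forgotten re-get pattern impossible to hide.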
I was thinking that problems are most likely to happen around timers, simply because otherwise the probability of 2+ other operations happening in parallel is much lower, typically thanks to normal networking delays. One way to accelerate it for testing is to reduce the interval at which the timer runs, even to something ridiculously low. Good code should never block or crash.
Re sticky peers by session, I wasn't really sure why that would be needed. Normally I would expect a production topology to have some DRAs (>1 for redundancy) in between Kamailio and some Charging Servers, with the message paths being redundant and just a matter of random distribution. The `cdp` stack's task is just to pick the next hop on those paths.
But if you don't have a DRA, then I can see why you'd want stickiness. Yet I might have implemented it differently. I guess what you'd want, and what would be well aligned with the Diameter and 3GPP specs, would be to set the Destination-Host AVP in the requests (responses are routed differently, so they're not affected). Then let the `cdp` handle routing to peers on its own: it would see the Destination-Host as a hint, check whether that's an immediate next hop and prefer it, before actually trying to use the realm/app-id routes to decide on another egress connection. Or in short: I don't think the `cdp` message routing routines need to care about sessions at all, just leave the standard "marker" in the message for them to follow.
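Just to illustrate the idea (a rough sketch only; it assumes the generic AVP lookup helpers are usable here the way they are elsewhere, and the real integration into `get_routing_peer()` would look different):

```c
/* Rough sketch: prefer a connected peer matching the request's
 * Destination-Host AVP before falling back to realm/app-id routing.
 * How this plugs into get_routing_peer() is left out on purpose. */
AAA_AVP *dest_host = AAAFindMatchingAVP(
		msg, 0, AVP_Destination_Host, 0, AAA_FORWARD_SEARCH);
if(dest_host) {
	peer *p = get_peer_by_fqdn(&dest_host->data);
	if(p && (p->state == I_Open || p->state == R_Open))
		return p; /* the hinted next hop is connected: stick to it */
}
/* otherwise fall through to the realm / application-id routing table */
```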
I'm not using this code in production myself, so I'm mostly helping the community here a bit. I'd love to overhaul the `cdp` stack a bit though, since I think some parts right now need quite a bit of cleaning to keep them safe. That's why I was offering a call, to maybe see if we could hatch a plan. But to be direct, I'm also a bit short on free time to spend on it.
Cheers, -Dragos
mtryfoss left a comment (kamailio/kamailio#4191)
> But to be direct, I'm also a bit short on free time to spend on it.
That's my main issue as well :(
Regarding the peer and the locks: it seems like this is the only situation where the peer might be invalidated under normal operation? I guess that's the reason why calling `Rcv_Process()` after the big switch without a lock (the code in master) works so well. I assume most users run a static peer setup.
```
if(p->is_dynamic && config->drop_unknown_peers) {
	remove_peer(p);
	free_peer(p, 1);
	break;
}
```
As a quick fix/workaround, would you prefer that approach for `Snd_Message()` too, over the proposed change in this PR? That is, deferring the call until we're out of the `critical zone` described earlier.
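Roughly what I have in mind (a sketch only, mirroring the `msg_received` pattern from master; the function and flag names are made up, and it still touches `p` after the lock is released, which is the concern you raised):

```c
/* Sketch (not actual cdp code): defer Snd_Message() until after the peer
 * state machine's critical zone, mirroring how master defers Rcv_Process()
 * via msg_received. Caveat: p is still used after lock_release(), which is
 * only safe if the peer is re-fetched or otherwise guaranteed to stay valid. */
static void sm_send_deferred(peer *p, AAAMessage *msg, int peer_locked)
{
	int msg_to_send = 0; /* hypothetical flag, analogous to msg_received */

	if(!peer_locked)
		lock_get(p->lock);

	/* ... the big state machine switch runs here, under p->lock;
	 * instead of calling Snd_Message() inside it, just remember the work: */
	msg_to_send = 1;

	if(!peer_locked)
		lock_release(p->lock);

	if(msg_to_send)
		Snd_Message(p, msg); /* now called without holding the peer lock */
}
```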