mtryfoss left a comment (kamailio/kamailio#4191)

First of all, thanks a lot for the comprehensive feedback!

Seems like there is a dead-lock between the lock on a peer and the one on sessions. There is maybe also a 3rd lock involved, but I don't see that in your changes. And it might be just a side-effect, which would clear if the root cause is fixed.

Yes, there's a lock on the peer list as well. So 3 processes involved, but according to my analysis having this order changed should avoid the deadlock - but as you say, it might introduce a segfault instead.

I'm actually surprised that it doesn't crash&burn much more often

Indeed, but this seems to be very load and timing sensitive - combined with some specific behaviour of the ims_charging module. For example, the interim timer needs to be active. All in all, a nightmare to debug and/or confirm that the fix actually worked 😅

Anyway, long-story short, this seems like a deeper change might be required. If you wish, I'd offer a call (hopefully time-boxed) to help you navigate cdp or maybe to discuss a deeper change for peers and sessions.

I really appreciate this, but I'm probably not the correct person for a bigger/deeper change in this and other related modules.

I'll take look again on the code based on your feedback, and see if it could be improved a bit for a short term fix.

We're currently running a custom build with this change below on one of the clusters, as an immediate work around. In my case, I'd take a segfault over a deadlock any day, since the service would just be restarted and continue operation.

mtryfoss@a573e10

That's just a variant where we copy the hack used for sessions with a sticky peer, for all requests. It's been doing fine with a fair amount of traffic for several days, for our specific use case.

—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.