First of all, thanks a lot for the comprehensive feedback!

Seems like there is a dead-lock between the lock on a peer and the one on sessions. There is maybe also a 3rd lock involved, but I don't see that in your changes. And it might be just a side-effect, which would clear if the root cause is fixed.

Yes, there's a lock on the peer list as well. So 3 processes involved, but according to my analysis having this order changed should avoid the deadlock - but as you say, it might introduce a segfault instead.

I'm actually surprised that it doesn't crash&burn much more often

Indeed, but this seems to be very load and timing sensitive - combined with some specific behaviour of the ims_charging module. For example, the interim timer needs to be active. All in all, a nightmare to debug and/or confirm that the fix actually worked 😅

Anyway, long-story short, this seems like a deeper change might be required. If you wish, I'd offer a call (hopefully time-boxed) to help you navigate cdp or maybe to discuss a deeper change for peers and sessions.

I really appreciate this, but I'm probably not the correct person for a bigger/deeper change in this and other related modules.

I'll take look again on the code based on your feedback, and see if it could be improved a bit for a short term fix.

--

We're currently running a custom build with this change below on one of the clusters, as an immediate work around. In my case, I'd take a segfault over a deadlock any day, since the service would just be restarted and continue operation.

mtryfoss@a573e10

That's just a variant where we copy the hack used for sessions with a sticky peer, for all requests. It's been doing fine with a fair amount of traffic for several days, for our specific use case.


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <kamailio/kamailio/pull/4191/c2765538934@github.com>

mtryfossmtryfoss left a comment (kamailio/kamailio#4191)

First of all, thanks a lot for the comprehensive feedback!

Seems like there is a dead-lock between the lock on a peer and the one on sessions. There is maybe also a 3rd lock involved, but I don't see that in your changes. And it might be just a side-effect, which would clear if the root cause is fixed.

Yes, there's a lock on the peer list as well. So 3 processes involved, but according to my analysis having this order changed should avoid the deadlock - but as you say, it might introduce a segfault instead.

I'm actually surprised that it doesn't crash&burn much more often

Indeed, but this seems to be very load and timing sensitive - combined with some specific behaviour of the ims_charging module. For example, the interim timer needs to be active. All in all, a nightmare to debug and/or confirm that the fix actually worked 😅

Anyway, long-story short, this seems like a deeper change might be required. If you wish, I'd offer a call (hopefully time-boxed) to help you navigate cdp or maybe to discuss a deeper change for peers and sessions.

I really appreciate this, but I'm probably not the correct person for a bigger/deeper change in this and other related modules.

I'll take look again on the code based on your feedback, and see if it could be improved a bit for a short term fix.

--

We're currently running a custom build with this change below on one of the clusters, as an immediate work around. In my case, I'd take a segfault over a deadlock any day, since the service would just be restarted and continue operation.

mtryfoss@a573e10

That's just a variant where we copy the hack used for sessions with a sticky peer, for all requests. It's been doing fine with a fair amount of traffic for several days, for our specific use case.


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <kamailio/kamailio/pull/4191/c2765538934@github.com>