vingarzan left a comment (kamailio/kamailio#4191)
Hey @mtryfoss!
I think I wrote some big parts of this code, but a very long time ago and it might have a lot of changes by now that I could not be fully aware of. I also further developed and heavily modified version of this code is in production in the closed source project from where it originally came, so hopefully it should be fixable. But there were also many changes here.
In hindsight, having Diameter session logic in the stack was a mistake, since IRL it's better to let the "application" handle it cleanly outside, yet that might be too late here. But, it's also not a huge amount of logic that it does, so let me try and help hopefully.
Seems like there is a dead-lock between the lock on a peer and the one on sessions. There is maybe also a 3rd lock involved, but I don't see that in your changes. And it might be just a side-effect, which would clear if the root cause is fixed. But the idea is the same: **we should try to eliminate any overlap between critical zones, in order to have some good guarantees about avoiding a dead/live-lock situation.**
So IMHO a proper fix/change would be to not have `peer *p` as a parameter to `Snd_Message()`. That would mean that you wouldn't hold `p->lock`. One way could be to use the FQDN instead, the other would be to copy (before unlocking) what the deeper functions need and avoid re-getting/re-locking in `peer_send_msg()`. In short: I'm advocating to actually have the same treatment to `Snd_Message()` as `Rcv_Process()` and call it also without a lock after the peer state machine big switch in `sm_process()`.
To explain why `Rcv_Process()` on purpose does not hold a lock while we call the handler in the application layer: imagine that you would want to send back an answer to a request that you have received, to the same peer. The handler will dead-lock while sending, trying to get a lock on a peer which the handler's caller already has a lock on. For this reason I'd also eliminate the peer pointer from the `Rcv_Process()` call. Yet... that's exported, so seems like a much deeper change needed to fix it thoroughly.
Back to your dead-lock though: I think that the hack/fix-attempt in `get_first_connected_route()` which you are eliminating is/was a bad idea in the first place, so I encourage your removal. But I see now that the session pointer is all over the place in calls to route messages. Seems like someone tried to add a mechanism of session-stickiness-to-peers, but there was no consideration for avoiding overlapping locks... I'm actually surprised that it doesn't crash&burn much more often, because again: if you release a lock and you'd be acquiring it again a few lines below, you are just asking for trouble, since someone might've already free that session or peer pointer and you would be reading/writing from bad addresses. And of course, it's a bad design, which breaks code modularity between Diameter peers and sessions.
A better design for the unlock is to have a function with a double pointer peer/session parameter, which resets the pointer in the process, forcing a seg-fault if someone tries such unlock/do-something/relock-without-get patterns.
Anyway, long-story short, this seems like a deeper change might be required. If you wish, I'd offer a call (hopefully time-boxed) to help you navigate `cdp` or maybe to discuss a deeper change for peers and sessions.
Cheers, -Dragos