Hi Jason,
On 03/02/15 08:29, Jason Penton wrote:
Hi Daniel,
I have found another instance of a similar bug where the a process gets itself into deadlock. The sequence of events:
- INVITE request is relayed downstream
- error response or locally generated 408 occurs
- Failure route in cfg is armed and tries to relay to alternate
destination (voicemail/annoucnement server, etc) - using t_relay() 3. t_relay fails to destination for whatever reason (in this case the result of the destination being blacklisted in TM) 4. Original reply is attempted to be relayed upstream (this is all done safely - ie using t_reply_unsafe() as the REPLY_LOCK has already been acquired in failure route 5. At or around line 562 of t_reply.c we check if the reply is > 200 (final response). If it is and if its an INVITE transaction, we clean up timers and cancel all UAC branches. This is where we hit a problem:
cancel_uacs( trans, &cancel_data, F_CANCEL_B_KILL, lock );
- cancel_uacs calls cancel_branch if there was no response on the
branch by calling cancel_branch() 7. cancel branch calls LOCK_REPLIES(t) on the transaction and you hit deadlock.... (original call to LOCK_REPLIES(t) happened in failure route processing)
This use case is slightly different to the one you fixed related to retransmission timers... (the beginning of this thread)
Suggestion: What about a pkg/process variable that can be used to check if you have already acquired the reply lock to combat these non-reentrant deadlocks. A side-effect of this would save on passing variables around in various functions indicating intention to lock or not.
Thoughts?
What about making the reply lock re-entrant? I think it should be safer in long term to allow parts of code executed by same process to safely acquire the lock without worrying if the same process did it before. Global variables in pkg would need more complex management to set, test and reset.
Cheers, Daniel