Hey Daniel,
Ok great, I will test. No I don't see a need to make async_mutex re-entrant but let me check.
I will test your change too. Do you have a commit ref for your change?
Cheers Jason
On Mon, 30 Mar 2015 at 17:03 Daniel-Constantin Mierla miconda@gmail.com wrote:
Hi Jason,
On 03/02/15 15:27, Daniel-Constantin Mierla wrote:
Hi Jason,
On 03/02/15 08:29, Jason Penton wrote:
Hi Daniel,
I have found another instance of a similar bug where the a process gets itself into deadlock. The sequence of events:
- INVITE request is relayed downstream
- error response or locally generated 408 occurs
- Failure route in cfg is armed and tries to relay to alternate
destination (voicemail/annoucnement server, etc) - using t_relay() 3. t_relay fails to destination for whatever reason (in this case the result of the destination being blacklisted in TM) 4. Original reply is attempted to be relayed upstream (this is all done safely - ie using t_reply_unsafe() as the REPLY_LOCK has already been acquired in failure route 5. At or around line 562 of t_reply.c we check if the reply is > 200 (final response). If it is and if its an INVITE transaction, we clean up timers and cancel all UAC branches. This is where we hit a problem:
cancel_uacs( trans, &cancel_data, F_CANCEL_B_KILL, lock );
- cancel_uacs calls cancel_branch if there was no response on the
branch by calling cancel_branch() 7. cancel branch calls LOCK_REPLIES(t) on the transaction and you hit deadlock.... (original call to LOCK_REPLIES(t) happened in failure route processing)
This use case is slightly different to the one you fixed related to retransmission timers... (the beginning of this thread)
Suggestion: What about a pkg/process variable that can be used to check if you have already acquired the reply lock to combat these non-reentrant deadlocks. A side-effect of this would save on passing variables around in various functions indicating intention to lock or not.
Thoughts?
What about making the reply lock re-entrant? I think it should be safer in long term to allow parts of code executed by same process to safely acquire the lock without worrying if the same process did it before. Global variables in pkg would need more complex management to set, test and reset.
I made the reply lock re-entrant, this should handle the situation you were exposing it. if you have a chance to simulate it, then let me know the result.
On a slightly different idea, I wonder if async_mutex should be made re-entrant as well, what do you think? I guess it is unlikely to get two async attempts in the same process, but you know better how the mutex is planned to be used.
Cheers, Daniel
-- Daniel-Constantin Mierla http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda Kamailio World Conference, May 27-29, 2015 Berlin, Germany - http://www.kamailioworld.com