Hi Jason,
On 03/02/15 15:27, Daniel-Constantin Mierla wrote:
Hi Jason,
On 03/02/15 08:29, Jason Penton wrote:
Hi Daniel,
I have found another instance of a similar bug where the a process
gets itself into deadlock. The sequence of events:
1. INVITE request is relayed downstream
2. error response or locally generated 408 occurs
2. Failure route in cfg is armed and tries to relay to alternate
destination (voicemail/annoucnement server, etc) - using t_relay()
3. t_relay fails to destination for whatever reason (in this case the
result of the destination being blacklisted in TM)
4. Original reply is attempted to be relayed upstream (this is all
done safely - ie using t_reply_unsafe() as the REPLY_LOCK has already
been acquired in failure route
5. At or around line 562 of t_reply.c we check if the reply is > 200
(final response). If it is and if its an INVITE transaction, we clean
up timers and cancel all UAC branches. This is where we hit a problem:
cancel_uacs( trans, &cancel_data, F_CANCEL_B_KILL, lock );
6. cancel_uacs calls cancel_branch if there was no response on the
branch by calling cancel_branch()
7. cancel branch calls LOCK_REPLIES(t) on the transaction and you hit
deadlock.... (original call to LOCK_REPLIES(t) happened in failure
route processing)
This use case is slightly different to the one you fixed related to
retransmission timers... (the beginning of this thread)
Suggestion: What about a pkg/process variable that can be used to
check if you have already acquired the reply lock to combat these
non-reentrant deadlocks. A side-effect of this would save on passing
variables around in various functions indicating intention to lock or
not.
Thoughts?
What about making the reply lock re-entrant? I think it should be safer
in long term to allow parts of code executed by same process to safely
acquire the lock without worrying if the same process did it before.
Global variables in pkg would need more complex management to set, test
and reset.
I made the reply lock re-entrant, this should handle the situation you
were exposing it. if you have a chance to simulate it, then let me know
the result.
On a slightly different idea, I wonder if async_mutex should be made
re-entrant as well, what do you think? I guess it is unlikely to get two
async attempts in the same process, but you know better how the mutex is
planned to be used.
Cheers,
Daniel
--
Daniel-Constantin Mierla
http://twitter.com/#!/miconda -
http://www.linkedin.com/in/miconda
Kamailio World Conference, May 27-29, 2015
Berlin, Germany -
http://www.kamailioworld.com