Hi Jason,
On 03/02/15 15:27, Daniel-Constantin Mierla wrote:
> Hi Jason,
>
> On 03/02/15 08:29, Jason Penton wrote:
>> Hi Daniel,
>>
>> I have found another instance of a similar bug where the a process
>> gets itself into deadlock. The sequence of events:
>>
>> 1. INVITE request is relayed downstream
>> 2. error response or locally generated 408 occurs
>> 2. Failure route in cfg is armed and tries to relay to alternate
>> destination (voicemail/annoucnement server, etc) - using t_relay()
>> 3. t_relay fails to destination for whatever reason (in this case the
>> result of the destination being blacklisted in TM)
>> 4. Original reply is attempted to be relayed upstream (this is all
>> done safely - ie using t_reply_unsafe() as the REPLY_LOCK has already
>> been acquired in failure route
>> 5. At or around line 562 of t_reply.c we check if the reply is > 200
>> (final response). If it is and if its an INVITE transaction, we clean
>> up timers and cancel all UAC branches. This is where we hit a problem:
>>
>> cancel_uacs( trans, &cancel_data, F_CANCEL_B_KILL, lock );
>>
>> 6. cancel_uacs calls cancel_branch if there was no response on the
>> branch by calling cancel_branch()
>> 7. cancel branch calls LOCK_REPLIES(t) on the transaction and you hit
>> deadlock.... (original call to LOCK_REPLIES(t) happened in failure
>> route processing)
>>
>> This use case is slightly different to the one you fixed related to
>> retransmission timers... (the beginning of this thread)
>>
>> Suggestion: What about a pkg/process variable that can be used to
>> check if you have already acquired the reply lock to combat these
>> non-reentrant deadlocks. A side-effect of this would save on passing
>> variables around in various functions indicating intention to lock or
>> not.
>>
>> Thoughts?
> What about making the reply lock re-entrant? I think it should be safer
> in long term to allow parts of code executed by same process to safely
> acquire the lock without worrying if the same process did it before.
> Global variables in pkg would need more complex management to set, test
> and reset.
>
I made the reply lock re-entrant, this should handle the situation you
were exposing it. if you have a chance to simulate it, then let me know
the result.
On a slightly different idea, I wonder if async_mutex should be made
re-entrant as well, what do you think? I guess it is unlikely to get two
async attempts in the same process, but you know better how the mutex is
planned to be used.
Cheers,
Daniel
--
Daniel-Constantin Mierla
http://twitter.com/#!/miconda - http://www.linkedin.com/in/miconda
Kamailio World Conference, May 27-29, 2015
Berlin, Germany - http://www.kamailioworld.com