At 14:48 30/03/2007, Bogdan-Andrei Iancu wrote:
wrong again :)
I wish it would be.
Operational experience shows us that former versions had race
conditions which do cause trouble under hard-to-reproduce
conditions. Based on surface knowledge, it appears that openser
inherited those from ser, before ser's overhaul of those.
as I mentioned in my previous email, the "detached
timer" was more a marker that something else was going wrong - there was no
amplification.
lucky those who haven't been affected by the race conditions. My point
is though, this particular warning correlates with nondeterminism.
and as TR clearly said, the problem was with DB
connectivity and had nothing to do with TM timers.
Well, as a matter of fact, I have witnessed several failures which coincided
with this warning. Studying the code will reveal to you and anyone else
that this warning is actually just a hack which helps to ignore erroneous conditions
and survive them, but doesn't heal the cause of the problem, which may still
generate dysfunctional service.
Again -- I don't mean to demonize it; with this ignore-the-problem hack, things
have been running mostly fine.
-jiri
regards,
bogdan
Jiri Kuthan wrote:
>Actually more likely it has been both. The root problem lies in the timer subsystem
>and may be amplified by other troubles (or amplify those).
>
>-jiri
>
>At 01:35 30/03/2007, T.R. Missner wrote:
>
>>FYI All
>>
>>This turned out to be a database write ( acc ) that was blocking due to a raid card problem.
>>
>>
>>
>>T.R. Missner wrote:
>>
>>>Is it possible the locked state I am seeing with openser leads to the "detached" timer?
>>>Since the "detached" timer is a race, it would make sense to see the race condition after openser locks up and messages buffer up in the stack.
>>>When a bunch of messages are processed all at once by multiple threads the race condition would occur.
>>>Does this make sense?
>>>
>>>Maybe I have been focusing on the wrong place.
>>>
>>>Ignoring the "detached" timer what could cause openser to hang for a couple seconds then clear every 5 - 10 minutes?
>>>
>>>Ideas?
>>>
>>>We are seeing this on 3 different production servers.
>>>
>>>Thanks
>>>
>>>TR
>>>
>>>using openser1.1.1
>>>
>>>
>>>
>>>T.R. Missner wrote:
>>>
>>>>Bogdan,
>>>>
>>>>I have been chasing this for days and done lots of debugging.
>>>>using 1.1.1
>>>>While looking at the network trace at the time of these messages ( I usually see at least 5 in a row with differing hex values ) I see many incoming packets coming into the box and no response from the proxy for somewhere between 5 - 10 seconds, then a flood of responses from the proxy.
>>>>I can email you a sample pcap file if you like.
>>>>As part of my debugging I forced a 100 reply at the very top of my cfg file.
>>>>The forced 100 was not sent during the locked up time leading me to believe openser was not processing incoming packets.
>>>>I have now seen this on multiple servers in different locations. Likely a particular customer call flow is causing this but I have not been able to pin it down to the exact customer. These proxies run pretty fast during the day so finding a pattern leading up to this issue is difficult. What could I add to the Log output to identify the offending sip-callid? Is sip-callid or branch tag or anything similar easily accessible in any of the data structs in timer.c?
>>>>
>>>>TR
>>>>
>>>>Bogdan-Andrei Iancu wrote:
>>>>
>>>>>Hi TR,
>>>>>
>>>>>it is a race between the expire event (from the timer) and inserting again on a timer list.
>>>>> 1 is the final response timer list (fr_timer)
>>>>> 3 is the wait timer list (wt_timer)
>>>>>
>>>>>I would say there is no way this could lead to any kind of lock.
>>>>>
>>>>>what version are you using? what makes you say it locks?
>>>>>
>>>>>regards,
>>>>>bogdan
>>>>>
>>>>>T.R. Missner wrote:
>>>>>
>>>>>>Does anyone know what causes this?
>>>>>>
>>>>>>*/set_timer for 1 list called on a "detached" timer -- ignoring /*
>>>>>>
>>>>>>I also see
>>>>>>
>>>>>>*/set_timer for 3 list called on a "detached" timer -- ignoring /*
>>>>>>
>>>>>>
>>>>>>
>>>>>>When this happens Openser seems to lock up for 10 seconds or so.
>>>>>>
>>>>>>From searching it appears this is caused by a race but I am not sure what the race is or why this results in an unresponsive openser instance for multiple seconds.
>>>>>>
>>>>>>Transaction expiration racing reply?
>>>>>>
>>>>>>
>>>>>>Desperately need to understand how this could be triggered so I can get customer to adjust system.
>>>>>>
>>>>>>Any way to adjust?
>>>>>>
>>>>>>tried tweaking fr_inv_timer but no joy.
>>>>>>
>>>>>>
>>>>>>
>>>>>>TR
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>Jiri Kuthan http://iptel.org/~jiri/
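
For readers following the thread, the race Bogdan describes (a timer expiring while something else tries to insert it on a timer list again) can be sketched roughly as below. This is a toy, single-threaded model that simply replays one unlucky ordering of events. It is not openser's code: the names Timer, TimerList, DETACHED, detach_expired and simulate_race are all made up for illustration, and Python is used only for brevity (openser itself is written in C). Only the wording of the log line is taken from the messages quoted above.

```python
# Hypothetical sentinel: marks a timer that the timer routine has already
# taken off its list because it expired (i.e. the timer is "detached").
DETACHED = object()


class Timer:
    """Illustrative timer attached to one transaction (not an openser struct)."""

    def __init__(self, name):
        self.name = name
        self.owner = None        # None, a TimerList, or DETACHED
        self.expires_at = None


class TimerList:
    """Illustrative timer list, e.g. list 1 (fr_timer) or list 3 (wt_timer)."""

    def __init__(self, list_id):
        self.list_id = list_id
        self.timers = []
        self.log = []

    def set_timer(self, timer, now, timeout):
        # The guard discussed in the thread: if the timer was already
        # detached by the expiry path, log a warning and do nothing.
        if timer.owner is DETACHED:
            self.log.append(
                'set_timer for %d list called on a "detached" timer -- ignoring'
                % self.list_id
            )
            return False
        # Otherwise unlink it from whatever list holds it and (re)insert it.
        if isinstance(timer.owner, TimerList):
            timer.owner.timers.remove(timer)
        timer.owner = self
        timer.expires_at = now + timeout
        self.timers.append(timer)
        return True

    def detach_expired(self, now):
        # Expiry path: remove every expired timer and mark it detached
        # before its handler runs.
        expired = [t for t in self.timers if t.expires_at <= now]
        for t in expired:
            self.timers.remove(t)
            t.owner = DETACHED
        return expired


def simulate_race():
    """Replay one interleaving: expiry wins, the re-insert comes second."""
    fr_list = TimerList(1)
    timer = Timer("transaction-A")
    fr_list.set_timer(timer, now=0, timeout=5)

    # Step 1: the timer routine sees the timer as expired and detaches it.
    expired = fr_list.detach_expired(now=5)

    # Step 2: before the expiry handler is done, another code path
    # (for example reply processing) tries to arm the same timer again.
    rearmed = fr_list.set_timer(timer, now=5, timeout=5)
    return expired, rearmed, fr_list.log
```

In this sketch the second set_timer call is refused, so the list stays consistent, but the request to re-arm the timer is silently dropped. That is one way to read Jiri's remark that the warning lets the server survive the condition without removing whatever caused it.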