On 25.02.19 19:05, Richard Fuchs wrote:
On 25/02/2019 12.34, Daniel-Constantin Mierla wrote:
Hello,
that's strange, but a while ago someone else reported an issue with
the same backtrace.
So the crash happens at the last line of the following snippet from
the reply_received() function in the tm module:
uac=&t->uac[branch];
LM_DBG("org. status uas=%d, uac[%d]=%d local=%d is_invite=%d)\n",
t->uas.status, branch, uac->last_received,
is_local(t), is_invite(t));
last_uac_status=uac->last_received;
The backtrace and info locals say that uac is null (0x0). According
to my knowledge, the address of a field in a structure cannot be null
and uac is set to &t->uac[branch]. Moreover, uac->last_received is
printed in the LM_DBG() above the line of the crash; if uac were 0x0,
the crash should have happened there.
t->uac is a pointer to an array, not a static array contained in the
struct. So, if t->uac was null, then &t->uac[branch] would also yield
null if branch was zero. (For a non-zero branch, it would yield a
pointer to somewhere just past null. &t->uac[branch] is the same as
t->uac + branch.)
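
A minimal sketch of that address computation, assuming cut-down stand-ins
for the tm structures (the struct layouts and the branch_addr() helper are
hypothetical, not the real tm definitions). It uses integer arithmetic so
that the null case can be shown without performing pointer arithmetic on a
null pointer:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-ins; the real tm structures have many
 * more fields. */
struct ua_client { int last_received; };
struct cell { struct ua_client *uac; };  /* pointer to an array, not an embedded array */

/* Numeric address that &t->uac[branch] would evaluate to, i.e.
 * t->uac + branch, computed on integers so a null t->uac is safe here. */
static uintptr_t branch_addr(const struct cell *t, size_t branch)
{
    return (uintptr_t)t->uac + branch * sizeof(struct ua_client);
}
```

With a null t->uac, branch 0 gives address 0 on common platforms, and
branch 1 gives sizeof(struct ua_client), i.e. a pointer just past null.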
The t->uac should never be null for a valid t; it is allocated at the
same time as t, in the same shm_malloc(). The operation is done under
lock - LOCK_REPLIES(t) - but indeed, if there is a race somehow, or an
operation is done without taking the lock, the memory space for it can
be overwritten.
As for LM_DBG, I'm not too familiar with the logging macros, but if
they're defined in such a way to check the log level first and then
skip calling the actual logging function if the log level is too low,
then the LM_DBG arguments would never be evaluated and so no null
dereference would occur there.
Yes, the macro checks the log level first and evaluates the rest only
if the message is actually going to be printed.
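
To illustrate why a level-guarded macro never evaluates its arguments, here
is a rough sketch. SKETCH_DBG, log_level, touch() and run_with_level() are
made-up names for illustration only; this is not the actual LM_DBG
definition:

```c
#include <stdio.h>

static int log_level = 0;   /* hypothetical threshold; 3 stands for "debug" */
static int eval_count = 0;  /* counts how often the argument is evaluated */

/* Argument with a visible side effect. */
static int touch(int v) { eval_count++; return v; }

/* The level check wraps the whole call, so the arguments are only
 * evaluated when the message is really going to be printed. */
#define SKETCH_DBG(fmt, ...) \
    do { \
        if (log_level >= 3) \
            fprintf(stderr, fmt, __VA_ARGS__); \
    } while (0)

/* Returns how many times the argument was evaluated at the given level. */
static int run_with_level(int level)
{
    log_level = level;
    eval_count = 0;
    SKETCH_DBG("value=%d\n", touch(42));
    return eval_count;
}
```

At a low log level the argument expression is skipped entirely, so a
uac->last_received inside such a macro would not fault even with a null uac.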
I was debugging a similar core dump just the other day, although in a
different location. That one was in t_should_relay_response(), line
1282, and also had Trans->uac == null. The strange part about this one
was that according to gdb, Trans->uac was valid:
#0 0x00007f3f11d5b5e8 in t_should_relay_response
(Trans=Trans@entry=0x7f3e14a551f8, new_code=new_code@entry=200,
branch=branch@entry=0,
should_store=should_store@entry=0x7fffb0353408,
should_relay=should_relay@entry=0x7fffb0353404,
cancel_data=cancel_data@entry=0x7fffb0353670,
reply=0x7f3f160aa6e8) at t_reply.c:1282
1282 in t_reply.c
(gdb) p Trans->uac[branch].last_received
$11 = 0
even though the asm instruction definitely was a null dereference into
->uac:
0x00007f3f11d5b5de <+718>: add 0x170(%rbx),%r8
=> 0x00007f3f11d5b5e8 <+728>: mov 0x190(%r8),%eax
(gdb) p $r8
$2 = 0
%rbx had Trans and so %r8 had Trans->uac. At this point, %r8 ==
Trans->uac == null, even though:
(gdb) p (long int) Trans->uac
$18 = 139904611079176
Investigating further, we found that Trans resided in shared memory
and so we (tentatively) concluded that this looks to be a race
condition with another process overwriting the Trans shm. First
Trans->uac was null and got assigned to %r8, then another process
changed it to something valid in shm, then the segfault happened
through %r8. We didn't have a chance to investigate further and I
can't say for sure if these two crashes are related.
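
The suspected sequence can be sketched in a single process as follows. This
only simulates the ordering of events (the write by "another process" is an
ordinary assignment here), and the struct layouts and the
snapshot_went_stale() helper are hypothetical:

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the tm structures. */
struct ua_client { int last_received; };
struct cell { struct ua_client *uac; };

/* Returns 1 if the copy taken before the write is null while the
 * memory itself holds a valid pointer afterwards. */
static int snapshot_went_stale(struct cell *shared, struct ua_client *late_value)
{
    struct ua_client *snap = shared->uac;  /* like loading Trans->uac into %r8 */
    shared->uac = late_value;              /* stands in for the other process's write */
    return snap == NULL && shared->uac != NULL;
}
```

In the core dump, gdb reads memory (the later, valid value) while the
faulting instruction used the register copy (the earlier null), which would
match what was observed.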
I will look in this direction as well; something was also reported for
t_should_relay_response() some time ago.
You were running 5.2.1?
Cheers,
Daniel
--
Daniel-Constantin Mierla --
www.asipto.com
www.twitter.com/miconda --
www.linkedin.com/in/miconda