These Cisco ATAs are really old. We maintain around 300 of them, and we often
encounter corrupted SIP messages; some are fixed by a simple reboot, others
by upgrading the firmware.
Most of the time, Kamailio manages to stay afloat in the midst
of all this corruption, but in some instances, like the one I'm reporting, it didn't.
I am currently getting the latest git and will test.
The last crash we encountered before yesterday was on March 30, and the one before that on January 5.
If I upgrade now, we won't know whether it really works unless we monitor this
for at least 3 months :-)
Kelvin Chua
On Thu, Apr 22, 2010 at 4:47 PM, Daniel-Constantin Mierla
<miconda@gmail.com> wrote:
Hi Timo,
thanks for troubleshooting. I committed the patch that moves the setting of bind_addr before any error case in populate_leg_info(). I backported it to the kamailio_3.0 branch as well.
Kelvin, can you get the latest git version of branch kamailio_3.0 and test?
Thanks,
Daniel
On 4/22/10 1:21 AM, Timo Reimann wrote:
Hello,
Kelvin Chua wrote:
(gdb) bt
#0 0x00002ab61b62779a in update_dialog_dbinfo (cell=0x2ab61c9100f8) at dlg_db_handler.c:501
This corresponds to
SET_STR_VALUE(values+8, cell->bind_addr[DLG_CALLEE_LEG]->sock_str);
so presumably sip-router crashes when it tries to access the sock_str of
the callee's bound address...
#1 0x00002ab61b628ea8 in dlg_onreply (t=0x7d5228, type=<value optimized out>, param=<value optimized out>) at dlg_handlers.c:361
#2 0x00002ab617965505 in run_trans_callbacks_internal
(cb_lst=0x2ab61c938830, type=128, trans=0x2ab61c9387c0,
(gdb) print cell
$1 = (struct dlg_cell *) 0x2ab61c9100f8
(gdb) print *cell
0}}, bind_addr = {0x88c580, 0x0},
cbs = {first = 0x0, types = 0}, profile_links = 0x0}
... as supported by the fact that bind_addr's second element
(DLG_CALLEE_LEG) is 0.
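To make the state shown by gdb concrete, here is a minimal sketch. The struct definitions are simplified stand-ins that I made up for illustration, not the real sip-router types, and leg_sock_str() is a hypothetical helper that does not exist in the dialog module. It only shows that with bind_addr = {ptr, 0x0} the caller leg can be read safely while the callee leg cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real types (assumptions, for illustration only). */
typedef struct { const char *s; int len; } str;
struct socket_info { str sock_str; };

#define DLG_CALLER_LEG 0
#define DLG_CALLEE_LEG 1

struct dlg_cell {
    struct socket_info *bind_addr[2];   /* one bound address per leg */
};

/* Hypothetical helper: returns the leg's sock_str, or NULL if the leg's
 * bound address was never assigned. */
static const str *leg_sock_str(const struct dlg_cell *cell, int leg)
{
    assert(leg == DLG_CALLER_LEG || leg == DLG_CALLEE_LEG);
    if (cell == NULL || cell->bind_addr[leg] == NULL)
        return NULL;    /* evaluating ->sock_str here is what segfaults */
    return &cell->bind_addr[leg]->sock_str;
}
```

With a cell initialised like the one in the gdb dump (caller set, callee 0x0), leg_sock_str() returns a valid pointer for DLG_CALLER_LEG and NULL for DLG_CALLEE_LEG.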
Why does the segfault happen?
Let's trace the code path: The initial error message
"bad sip message or missing Contact hdr"
occurred in dlg_handlers.c, line 218, which makes the surrounding
function "populate_leg_info" return prematurely (by means of
"goto error0"). Specifically, this implies that the code at the end of
the function on line 272
dlg->bind_addr[leg] = msg->rcv.bind_address;
isn't carried out anymore, leaving the callee's bound address associated
with the given dialog unassigned. (This happens to be the only occasion
where the bound address is assigned.) Instead, execution drops back to
the "dlg_onreply" function and proceeds to line 361, thereby calling the
database update function:
update_dialog_dbinfo(dlg);
which directly leads to the segfaulting code location.
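The code path traced above can be modelled roughly as follows. This is only a sketch under assumptions: the types are stripped-down stand-ins, the has_contact flag replaces the real Contact header parsing, and populate_leg_info_sketch() and leg_would_crash() are hypothetical names, not the functions in dlg_handlers.c.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-ins for the real types (assumptions). */
struct socket_info { int unused; };
struct receive_info { struct socket_info *bind_address; };
struct sip_msg {
    int has_contact;            /* stand-in for the real Contact parsing */
    struct receive_info rcv;
};

#define DLG_CALLER_LEG 0
#define DLG_CALLEE_LEG 1

struct dlg_cell { struct socket_info *bind_addr[2]; };

/* Same ordering as described above: checks first, assignment last. */
static int populate_leg_info_sketch(struct dlg_cell *dlg,
                                    const struct sip_msg *msg, int leg)
{
    assert(leg == DLG_CALLER_LEG || leg == DLG_CALLEE_LEG);
    if (!msg->has_contact)
        goto error0;    /* "bad sip message or missing Contact hdr" */

    dlg->bind_addr[leg] = msg->rcv.bind_address;  /* skipped on error */
    return 0;
error0:
    return -1;
}

/* Returns 1 if a later bind_addr[leg]->sock_str access would hit NULL. */
static int leg_would_crash(const struct dlg_cell *dlg, int leg)
{
    return dlg->bind_addr[leg] == NULL;
}
```

Feeding a message with has_contact == 0 for the callee leg makes the function return -1 and leaves bind_addr[DLG_CALLEE_LEG] at NULL, which is the state the database update then trips over.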
AFAICS, this is the only place where "update_dialog_dbinfo" dereferences a
possibly null pointer in the dialog data, so one way to prevent the
segfault is to move the bound address assignment before
any code in "populate_leg_info" that can fail. This should make sure that some
accessible bound address is stored in any case.
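A rough sketch of that reordering, to show the idea only. It is not the actual patch: the types are simplified stand-ins, has_contact replaces the real Contact header parsing, and populate_leg_info_fixed() is a hypothetical name. It also assumes that msg->rcv.bind_address itself is always set for a received message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real types (assumptions). */
struct socket_info { int unused; };
struct receive_info { struct socket_info *bind_address; };
struct sip_msg {
    int has_contact;            /* stand-in for the real Contact parsing */
    struct receive_info rcv;
};

#define DLG_CALLER_LEG 0
#define DLG_CALLEE_LEG 1

struct dlg_cell { struct socket_info *bind_addr[2]; };

/* Assignment moved to the top, ahead of every check that can fail. */
static int populate_leg_info_fixed(struct dlg_cell *dlg,
                                   const struct sip_msg *msg, int leg)
{
    assert(leg == DLG_CALLER_LEG || leg == DLG_CALLEE_LEG);
    dlg->bind_addr[leg] = msg->rcv.bind_address;  /* always executed */

    if (!msg->has_contact)
        goto error0;    /* still an error, but bind_addr[leg] is set */

    return 0;
error0:
    return -1;
}
```

With this ordering, a message without a usable Contact header still makes the function return -1, but bind_addr[leg] now points at the receiving socket, so a later bind_addr[leg]->sock_str access has something valid to read.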
Cheers,
--Timo