On Fri, Jan 13, 2017 at 2:57 PM, Joshua Colp <jcolp(a)digium.com> wrote:
Greetings all,
I'm currently investigating a crash in a Kamailio 4.1.3 deployment. The
backtrace is as follows:
(I've commented out phone numbers and IP addresses using X but have
retained the correct length for each)
(gdb) bt
#0 0x00007f017f2ac96e in memcpy () from /lib64/libc.so.6
#1 0x00007f0174e1bb50 in rc_avpair_assign () from
/usr/lib64/libradiusclient-ng.so.2
#2 0x00007f0174e1bc92 in rc_avpair_new () from
/usr/lib64/libradiusclient-ng.so.2
#3 0x00007f0174e1bdd1 in rc_avpair_add () from
/usr/lib64/libradiusclient-ng.so.2
#4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0,
inf=0x7fff113e9cc0) at acc_radius_mod.c:365
#5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0,
reset=0x0) at acc.c:957
#6 0x00007f0175239b49 in acc_onreply (t=0x7f0161a12f00,
req=0x7f0162b6c1f0, reply=0x7f017ca845c8, code=200) at acc_logic.c:485
#7 0x00007f017523a07d in tmcb_func (t=0x7f0161a12f00, type=512,
ps=0x7fff113e9ee0) at acc_logic.c:559
#8 0x00007f0177434478 in run_trans_callbacks_internal
(cb_lst=0x7f0161a12f70, type=512, trans=0x7f0161a12f00,
params=0x7fff113e9ee0) at t_hooks.c:290
#9 0x00007f017743468a in run_trans_callbacks_with_buf (type=512,
rbuf=0x7f0161a12fc0, req=0x7f0162b6c1f0, repl=0x7f017ca845c8, flags=200)
at t_hooks.c:336
#10 0x00007f0177466c06 in relay_reply (t=0x7f0161a12f00,
p_msg=0x7f017ca845c8, branch=0, msg_status=200,
cancel_data=0x7fff113ea240, do_put_on_wait=1)
at t_reply.c:2001
#11 0x00007f01774690b7 in reply_received (p_msg=0x7f017ca845c8) at
t_reply.c:2499
#12 0x000000000045d837 in do_forward_reply (msg=0x7f017ca845c8, mode=0)
at forward.c:777
#13 0x000000000045e0f8 in forward_reply (msg=0x7f017ca845c8) at
forward.c:860
#14 0x00000000004a5887 in receive_msg (
buf=0x9245e0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP
X.XX.XX.XX:5060;rport=5060;received=X.XX.XX.XX;branch=z9hG4bK9981.
5d2e8535c8eb94d43283f83230b5011e.0\r\nVia:
SIP/2.0/UDP X.XX.XX.XX;rport=5060;branch=z9hG4bK9981.78d8b40"...,
len=1161, rcv_info=0x7fff113ea5c0) at receive.c:273
#15 0x000000000053c838 in udp_rcv_loop () at udp_server.c:536
#16 0x000000000046d42b in main_loop () at main.c:1617
#17 0x00000000004704d3 in main (argc=11, argv=0x7fff113ea8f8) at
main.c:2533
As you can see it's crashing when interacting with the radius client.
Looking at frame 4 we can examine what is being provided to radius:
(gdb) frame 4
#4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0,
inf=0x7fff113e9cc0) at acc_radius_mod.c:365
365 ADD_RAD_AVPAIR(offset+i, inf->varr[i].s,
inf->varr[i].len);
(gdb) print inf->varr[i].s
$12 = 0x0
(gdb) print inf->varr[i].len
$13 = 3
(gdb) print inf->tarr[i]
$14 = 2 '\002'
This is a TYPE_STR with a length of 3, but no actual string. Needless to
say this upsets radius. Based on the configuration of the system and the
value of 'i' this is actually for the Remote-Party-ID header:
(gdb) print i
$15 = 8
(gdb) print *rad_extra->next->next->next->next
$16 = {name = {s = 0x7f017b27fcf4 "Sip-RPid", len = 8}, spec = {type =
PVT_OTHER, getf = 0x7f0176b79883 <pv_get_rpid>, setf = 0, pvp = {pvn =
{type = 0,
nfree = 0, u = {isname = {type = 0, name = {n = 0, s = {s = 0x0,
len = 0}, re = 0x0}}, dname = 0x0}}, pvi = {type = 0, u = {ival
= 0, dval = 0x0}}},
trans = 0x0}, next = 0x7f017b3ae4b0}
Going up a frame and looking at the RPID URI on the message it matches
the bad TYPE_STR:
(gdb) frame 5
#5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0,
reset=0x0) at acc.c:957
957 e->acc_req(msg, &inf);
(gdb) print ((struct to_body*)(msg)->rpid->parsed)->uri
$17 = {s = 0x0, len = 3}
And digging into the full parsed information shows that it's not correct
at all:
(gdb) print *((struct to_body*)(msg)->rpid->parsed)
$30 = {error = 3, body = {s = 0x0, len = 11064280}, uri = {s = 0x0, len
= 3}, display = {s = 0x0, len = 11064313}, tag_value = {s = 0x0, len =
5},
parsed_uri = {user = {s = 0x0, len = 1484108237}, passwd = {s = 0x0,
len = 2}, host = {s = 0x0, len = 0}, port = {s = 0x0, len = 3}, params
= {s = 0x0,
len = 11064395}, sip_params = {s = 0x0, len = 0}, headers = {s =
0x0, len = 8619}, port_no = 0, proto = 0, type = ERROR_URI_T,
flags = 0, transport = {
s = 0x0, len = 0}, ttl = {s = 0x0, len = 0}, user_param = {s =
0x0, len = 192}, maddr = {s = 0x0, len = 3}, method = {s = 0x0,
len = 11064439}, lr = {
s = 0x0, len = 3}, r2 = {s = 0x0, len = 11064464}, gr = {s = 0x0,
len = 3}, transport_val = {s = 0x0, len = 2039735278}, ttl_val =
{s = 0x0, len = 3},
user_param_val = {s = 0x0, len = 11064487}, maddr_val = {s = 0x0,
len = 0}, method_val = {s = 0x0, len = 16367}, lr_val = {s = 0x0,
len = 5}, r2_val = {
s = 0x0, len = 1484104637}, gr_val = {s = 0x0, len = 3}},
param_lst = 0x0, last_param = 0xa8d4d6}
Looking at the actual Remote-Party-ID header on the other hand itself
shows nothing out of the ordinary:
(gdb) print *msg->rpid
$18 = {type = HDR_RPID_T, name = {
s = 0x7f0162b6cba0 "Remote-Party-ID:
<sip:XXXXXXXXXX@XXX.XX.XX.XX:5060>;privacy=off\r\nP-Charge-Info:
sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported:
timer,replaces\r\nSession-Expires: 3600\r\nMin-SE:
90\r\nContent-Length: 260\r\nCo"..., len = 15}, body = {
s = 0x7f0162b6cbb1
"<sip:XXXXXXXXXX@XXX.XX.XX.XX:5060>;privacy=off\r\nP-Charge-Info:
sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported:
timer,replaces\r\nSession-Expires: 3600\r\nMin-SE:
90\r\nContent-Length: 260\r\nContent-Disposition"..., len = 46},
len = 65, parsed = 0x7f017b33c710, next = 0x7f0162b6d720}
I should also add that the value before and after RPID appears to be
correct:
(gdb) print inf->varr[i-1].s
$24 = 0x7f0161ce8427 "sip:XXXXXXXXXX@X.XX.XX.XXX:5060
SIP/2.0\r\nRecord-Route:
<sip:X.XX.XX.XX;lr=on;ftag=gK0c7f2820;did=438.50e1>\r\nRecord-Route:
<sip:X.XX.XX.XX;lr=on;did=ee5.4da5ae1>\r\nVia: SIP/2.0/UDP
X.XX.XX.XX:5060;branc"...
(gdb) print inf->varr[i-1].len
$25 = 31
(gdb) print inf->varr[i+1].s
$26 = 0x7f0176dbe720 "X.XX.XX.XX"
(gdb) print inf->varr[i+1].len
$27 = 10
Has anyone seen anything like this before or have a suggestion on where
to look further? I've traced through the code in question and it all
seems to be correct.
We have a theory that we hope to test out later today. If anyone more
familiar with the inner workings of Kamailio could comment on it, that'd be
hugely appreciated.
Looking through the Kamailio code base, not many things manipulate
rpid->parsed once it has been set. One place that it could be manipulated
is in sip_msg_shm_clone. Comparing the handling of the rpid header in that
routine to the from/to headers is instructive - the from/to header handling
directly handles the case of parsed being set, while rpid does not:
from/to:
case HDR_TO_T:
case HDR_FROM_T:
if (hdr->type == HDR_TO_T) {
if (!HOOK_SET(to)) new_msg->to = new_hdr;
} else {
if (!HOOK_SET(from)) new_msg->from = new_hdr;
}
/* From header might be unparsed */
if (!hdr->parsed) break;
new_hdr->parsed = p;
p +=ROUND4(sizeof(struct to_body));
memcpy(new_hdr->parsed, hdr->parsed, sizeof(struct to_body));
((struct to_body*)new_hdr->parsed)->body.s =
translate_pointer( new_msg->buf , org_msg->buf ,
((struct to_body*)hdr->parsed)->body.s );
((struct to_body*)new_hdr->parsed)->display.s =
translate_pointer( new_msg->buf, org_msg->buf,
((struct to_body*)hdr->parsed)->display.s);
((struct to_body*)new_hdr->parsed)->uri.s =
translate_pointer( new_msg->buf , org_msg->buf ,
((struct to_body*)hdr->parsed)->uri.s );
if ( ((struct to_body*)hdr->parsed)->tag_value.s )
((struct to_body*)new_hdr->parsed)->tag_value.s =
translate_pointer( new_msg->buf , org_msg->buf ,
((struct to_body*)hdr->parsed)->tag_value.s );
if ( (((struct to_body*)new_hdr->parsed)->parsed_uri.user.s)
|| (((struct to_body*)new_hdr->parsed)->parsed_uri.host.s) )
uri_trans(new_msg->buf, org_msg->buf,
&((struct to_body*)new_hdr->parsed)->parsed_uri);
etc.
rpid:
case HDR_RPID_T:
if (!HOOK_SET(rpid)) {
new_msg->rpid = new_hdr;
}
break;
Since the rpid->parsed pointer points to a struct to_body, I would expect
it would need the same handling as the from/to headers.
It's interesting to note in our backtrace that the crash occurs in a reply
route, and - I think - Kamailio will clone the SIP message prior to
executing the reply route (again - I think). Since the crash only happens
if you evaluate the rpid pseudo variable, and since the Radius
configuration we're using maps an 'extra field' to said psuedo variable,
that might explain the issue we're seeing.
Does this sound plausible, or are we potentially going down the wrong path
here?
--
Matthew Jordan
Digium, Inc. | CTO
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at:
http://digium.com &
http://asterisk.org