Greetings all,
I'm currently investigating a crash in a Kamailio 4.1.3 deployment. The backtrace is as follows:
(I've commented out phone numbers and IP addresses using X but have retained the correct length for each)
(gdb) bt #0 0x00007f017f2ac96e in memcpy () from /lib64/libc.so.6 #1 0x00007f0174e1bb50 in rc_avpair_assign () from /usr/lib64/libradiusclient-ng.so.2 #2 0x00007f0174e1bc92 in rc_avpair_new () from /usr/lib64/libradiusclient-ng.so.2 #3 0x00007f0174e1bdd1 in rc_avpair_add () from /usr/lib64/libradiusclient-ng.so.2 #4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0, inf=0x7fff113e9cc0) at acc_radius_mod.c:365 #5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0, reset=0x0) at acc.c:957 #6 0x00007f0175239b49 in acc_onreply (t=0x7f0161a12f00, req=0x7f0162b6c1f0, reply=0x7f017ca845c8, code=200) at acc_logic.c:485 #7 0x00007f017523a07d in tmcb_func (t=0x7f0161a12f00, type=512, ps=0x7fff113e9ee0) at acc_logic.c:559 #8 0x00007f0177434478 in run_trans_callbacks_internal (cb_lst=0x7f0161a12f70, type=512, trans=0x7f0161a12f00, params=0x7fff113e9ee0) at t_hooks.c:290 #9 0x00007f017743468a in run_trans_callbacks_with_buf (type=512, rbuf=0x7f0161a12fc0, req=0x7f0162b6c1f0, repl=0x7f017ca845c8, flags=200) at t_hooks.c:336 #10 0x00007f0177466c06 in relay_reply (t=0x7f0161a12f00, p_msg=0x7f017ca845c8, branch=0, msg_status=200, cancel_data=0x7fff113ea240, do_put_on_wait=1) at t_reply.c:2001 #11 0x00007f01774690b7 in reply_received (p_msg=0x7f017ca845c8) at t_reply.c:2499 #12 0x000000000045d837 in do_forward_reply (msg=0x7f017ca845c8, mode=0) at forward.c:777 #13 0x000000000045e0f8 in forward_reply (msg=0x7f017ca845c8) at forward.c:860 #14 0x00000000004a5887 in receive_msg ( buf=0x9245e0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP X.XX.XX.XX:5060;rport=5060;received=X.XX.XX.XX;branch=z9hG4bK9981.5d2e8535c8eb94d43283f83230b5011e.0\r\nVia: SIP/2.0/UDP X.XX.XX.XX;rport=5060;branch=z9hG4bK9981.78d8b40"..., len=1161, rcv_info=0x7fff113ea5c0) at receive.c:273 #15 0x000000000053c838 in udp_rcv_loop () at udp_server.c:536 #16 0x000000000046d42b in main_loop () at main.c:1617 #17 0x00000000004704d3 in main (argc=11, argv=0x7fff113ea8f8) at main.c:2533
As you can see it's crashing when interacting with the radius client.
Looking at frame 4 we can examine what is being provided to radius:
(gdb) frame 4 #4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0, inf=0x7fff113e9cc0) at acc_radius_mod.c:365 365 ADD_RAD_AVPAIR(offset+i, inf->varr[i].s, inf->varr[i].len); (gdb) print inf->varr[i].s $12 = 0x0 (gdb) print inf->varr[i].len $13 = 3 (gdb) print inf->tarr[i] $14 = 2 '\002'
This is a TYPE_STR with a length of 3, but no actual string. Needless to say this upsets radius. Based on the configuration of the system and the value of 'i' this is actually for the Remote-Party-ID header:
(gdb) print i $15 = 8 (gdb) print *rad_extra->next->next->next->next $16 = {name = {s = 0x7f017b27fcf4 "Sip-RPid", len = 8}, spec = {type = PVT_OTHER, getf = 0x7f0176b79883 <pv_get_rpid>, setf = 0, pvp = {pvn = {type = 0, nfree = 0, u = {isname = {type = 0, name = {n = 0, s = {s = 0x0, len = 0}, re = 0x0}}, dname = 0x0}}, pvi = {type = 0, u = {ival = 0, dval = 0x0}}}, trans = 0x0}, next = 0x7f017b3ae4b0}
Going up a frame and looking at the RPID URI on the message it matches the bad TYPE_STR:
(gdb) frame 5 #5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0, reset=0x0) at acc.c:957 957 e->acc_req(msg, &inf); (gdb) print ((struct to_body*)(msg)->rpid->parsed)->uri $17 = {s = 0x0, len = 3}
And digging into the full parsed information shows that it's not correct at all: (gdb) print *((struct to_body*)(msg)->rpid->parsed) $30 = {error = 3, body = {s = 0x0, len = 11064280}, uri = {s = 0x0, len = 3}, display = {s = 0x0, len = 11064313}, tag_value = {s = 0x0, len = 5}, parsed_uri = {user = {s = 0x0, len = 1484108237}, passwd = {s = 0x0, len = 2}, host = {s = 0x0, len = 0}, port = {s = 0x0, len = 3}, params = {s = 0x0, len = 11064395}, sip_params = {s = 0x0, len = 0}, headers = {s = 0x0, len = 8619}, port_no = 0, proto = 0, type = ERROR_URI_T, flags = 0, transport = { s = 0x0, len = 0}, ttl = {s = 0x0, len = 0}, user_param = {s = 0x0, len = 192}, maddr = {s = 0x0, len = 3}, method = {s = 0x0, len = 11064439}, lr = { s = 0x0, len = 3}, r2 = {s = 0x0, len = 11064464}, gr = {s = 0x0, len = 3}, transport_val = {s = 0x0, len = 2039735278}, ttl_val = {s = 0x0, len = 3}, user_param_val = {s = 0x0, len = 11064487}, maddr_val = {s = 0x0, len = 0}, method_val = {s = 0x0, len = 16367}, lr_val = {s = 0x0, len = 5}, r2_val = { s = 0x0, len = 1484104637}, gr_val = {s = 0x0, len = 3}}, param_lst = 0x0, last_param = 0xa8d4d6}
Looking at the actual Remote-Party-ID header on the other hand itself shows nothing out of the ordinary:
(gdb) print *msg->rpid $18 = {type = HDR_RPID_T, name = { s = 0x7f0162b6cba0 "Remote-Party-ID: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060;privacy=off\r\nP-Charge-Info: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported: timer,replaces\r\nSession-Expires: 3600\r\nMin-SE: 90\r\nContent-Length: 260\r\nCo"..., len = 15}, body = { s = 0x7f0162b6cbb1 "sip:XXXXXXXXXX@XXX.XX.XX.XX:5060;privacy=off\r\nP-Charge-Info: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported: timer,replaces\r\nSession-Expires: 3600\r\nMin-SE: 90\r\nContent-Length: 260\r\nContent-Disposition"..., len = 46}, len = 65, parsed = 0x7f017b33c710, next = 0x7f0162b6d720}
I should also add that the value before and after RPID appears to be correct:
(gdb) print inf->varr[i-1].s $24 = 0x7f0161ce8427 "sip:XXXXXXXXXX@X.XX.XX.XXX:5060 SIP/2.0\r\nRecord-Route: sip:X.XX.XX.XX;lr=on;ftag=gK0c7f2820;did=438.50e1\r\nRecord-Route: sip:X.XX.XX.XX;lr=on;did=ee5.4da5ae1\r\nVia: SIP/2.0/UDP X.XX.XX.XX:5060;branc"... (gdb) print inf->varr[i-1].len $25 = 31 (gdb) print inf->varr[i+1].s $26 = 0x7f0176dbe720 "X.XX.XX.XX" (gdb) print inf->varr[i+1].len $27 = 10
Has anyone seen anything like this before or have a suggestion on where to look further? I've traced through the code in question and it all seems to be correct.
Thanks a lot,
On Fri, Jan 13, 2017 at 2:57 PM, Joshua Colp jcolp@digium.com wrote:
Greetings all,
I'm currently investigating a crash in a Kamailio 4.1.3 deployment. The backtrace is as follows:
(I've commented out phone numbers and IP addresses using X but have retained the correct length for each)
(gdb) bt #0 0x00007f017f2ac96e in memcpy () from /lib64/libc.so.6 #1 0x00007f0174e1bb50 in rc_avpair_assign () from /usr/lib64/libradiusclient-ng.so.2 #2 0x00007f0174e1bc92 in rc_avpair_new () from /usr/lib64/libradiusclient-ng.so.2 #3 0x00007f0174e1bdd1 in rc_avpair_add () from /usr/lib64/libradiusclient-ng.so.2 #4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0, inf=0x7fff113e9cc0) at acc_radius_mod.c:365 #5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0, reset=0x0) at acc.c:957 #6 0x00007f0175239b49 in acc_onreply (t=0x7f0161a12f00, req=0x7f0162b6c1f0, reply=0x7f017ca845c8, code=200) at acc_logic.c:485 #7 0x00007f017523a07d in tmcb_func (t=0x7f0161a12f00, type=512, ps=0x7fff113e9ee0) at acc_logic.c:559 #8 0x00007f0177434478 in run_trans_callbacks_internal (cb_lst=0x7f0161a12f70, type=512, trans=0x7f0161a12f00, params=0x7fff113e9ee0) at t_hooks.c:290 #9 0x00007f017743468a in run_trans_callbacks_with_buf (type=512, rbuf=0x7f0161a12fc0, req=0x7f0162b6c1f0, repl=0x7f017ca845c8, flags=200) at t_hooks.c:336 #10 0x00007f0177466c06 in relay_reply (t=0x7f0161a12f00, p_msg=0x7f017ca845c8, branch=0, msg_status=200, cancel_data=0x7fff113ea240, do_put_on_wait=1) at t_reply.c:2001 #11 0x00007f01774690b7 in reply_received (p_msg=0x7f017ca845c8) at t_reply.c:2499 #12 0x000000000045d837 in do_forward_reply (msg=0x7f017ca845c8, mode=0) at forward.c:777 #13 0x000000000045e0f8 in forward_reply (msg=0x7f017ca845c8) at forward.c:860 #14 0x00000000004a5887 in receive_msg ( buf=0x9245e0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP X.XX.XX.XX:5060;rport=5060;received=X.XX.XX.XX;branch=z9hG4bK9981. 5d2e8535c8eb94d43283f83230b5011e.0\r\nVia: SIP/2.0/UDP X.XX.XX.XX;rport=5060;branch=z9hG4bK9981.78d8b40"..., len=1161, rcv_info=0x7fff113ea5c0) at receive.c:273 #15 0x000000000053c838 in udp_rcv_loop () at udp_server.c:536 #16 0x000000000046d42b in main_loop () at main.c:1617 #17 0x00000000004704d3 in main (argc=11, argv=0x7fff113ea8f8) at main.c:2533
As you can see it's crashing when interacting with the radius client.
Looking at frame 4 we can examine what is being provided to radius:
(gdb) frame 4 #4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0, inf=0x7fff113e9cc0) at acc_radius_mod.c:365 365 ADD_RAD_AVPAIR(offset+i, inf->varr[i].s, inf->varr[i].len); (gdb) print inf->varr[i].s $12 = 0x0 (gdb) print inf->varr[i].len $13 = 3 (gdb) print inf->tarr[i] $14 = 2 '\002'
This is a TYPE_STR with a length of 3, but no actual string. Needless to say this upsets radius. Based on the configuration of the system and the value of 'i' this is actually for the Remote-Party-ID header:
(gdb) print i $15 = 8 (gdb) print *rad_extra->next->next->next->next $16 = {name = {s = 0x7f017b27fcf4 "Sip-RPid", len = 8}, spec = {type = PVT_OTHER, getf = 0x7f0176b79883 <pv_get_rpid>, setf = 0, pvp = {pvn = {type = 0, nfree = 0, u = {isname = {type = 0, name = {n = 0, s = {s = 0x0, len = 0}, re = 0x0}}, dname = 0x0}}, pvi = {type = 0, u = {ival = 0, dval = 0x0}}}, trans = 0x0}, next = 0x7f017b3ae4b0}
Going up a frame and looking at the RPID URI on the message it matches the bad TYPE_STR:
(gdb) frame 5 #5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0, reset=0x0) at acc.c:957 957 e->acc_req(msg, &inf); (gdb) print ((struct to_body*)(msg)->rpid->parsed)->uri $17 = {s = 0x0, len = 3}
And digging into the full parsed information shows that it's not correct at all: (gdb) print *((struct to_body*)(msg)->rpid->parsed) $30 = {error = 3, body = {s = 0x0, len = 11064280}, uri = {s = 0x0, len = 3}, display = {s = 0x0, len = 11064313}, tag_value = {s = 0x0, len = 5}, parsed_uri = {user = {s = 0x0, len = 1484108237}, passwd = {s = 0x0, len = 2}, host = {s = 0x0, len = 0}, port = {s = 0x0, len = 3}, params = {s = 0x0, len = 11064395}, sip_params = {s = 0x0, len = 0}, headers = {s = 0x0, len = 8619}, port_no = 0, proto = 0, type = ERROR_URI_T, flags = 0, transport = { s = 0x0, len = 0}, ttl = {s = 0x0, len = 0}, user_param = {s = 0x0, len = 192}, maddr = {s = 0x0, len = 3}, method = {s = 0x0, len = 11064439}, lr = { s = 0x0, len = 3}, r2 = {s = 0x0, len = 11064464}, gr = {s = 0x0, len = 3}, transport_val = {s = 0x0, len = 2039735278}, ttl_val = {s = 0x0, len = 3}, user_param_val = {s = 0x0, len = 11064487}, maddr_val = {s = 0x0, len = 0}, method_val = {s = 0x0, len = 16367}, lr_val = {s = 0x0, len = 5}, r2_val = { s = 0x0, len = 1484104637}, gr_val = {s = 0x0, len = 3}}, param_lst = 0x0, last_param = 0xa8d4d6}
Looking at the actual Remote-Party-ID header on the other hand itself shows nothing out of the ordinary:
(gdb) print *msg->rpid $18 = {type = HDR_RPID_T, name = { s = 0x7f0162b6cba0 "Remote-Party-ID: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060;privacy=off\r\nP-Charge-Info: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported: timer,replaces\r\nSession-Expires: 3600\r\nMin-SE: 90\r\nContent-Length: 260\r\nCo"..., len = 15}, body = { s = 0x7f0162b6cbb1 "sip:XXXXXXXXXX@XXX.XX.XX.XX:5060;privacy=off\r\nP-Charge-Info: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported: timer,replaces\r\nSession-Expires: 3600\r\nMin-SE: 90\r\nContent-Length: 260\r\nContent-Disposition"..., len = 46}, len = 65, parsed = 0x7f017b33c710, next = 0x7f0162b6d720}
I should also add that the value before and after RPID appears to be correct:
(gdb) print inf->varr[i-1].s $24 = 0x7f0161ce8427 "sip:XXXXXXXXXX@X.XX.XX.XXX:5060 SIP/2.0\r\nRecord-Route: sip:X.XX.XX.XX;lr=on;ftag=gK0c7f2820;did=438.50e1\r\nRecord-Route: sip:X.XX.XX.XX;lr=on;did=ee5.4da5ae1\r\nVia: SIP/2.0/UDP X.XX.XX.XX:5060;branc"... (gdb) print inf->varr[i-1].len $25 = 31 (gdb) print inf->varr[i+1].s $26 = 0x7f0176dbe720 "X.XX.XX.XX" (gdb) print inf->varr[i+1].len $27 = 10
Has anyone seen anything like this before or have a suggestion on where to look further? I've traced through the code in question and it all seems to be correct.
We have a theory that we hope to test out later today. If anyone more familiar with the inner workings of Kamailio could comment on it, that'd be hugely appreciated.
Looking through the Kamailio code base, not many things manipulate rpid->parsed once it has been set. One place that it could be manipulated is in sip_msg_shm_clone. Comparing the handling of the rpid header in that routine to the from/to headers is instructive - the from/to header handling directly handles the case of parsed being set, while rpid does not:
from/to:
case HDR_TO_T: case HDR_FROM_T: if (hdr->type == HDR_TO_T) { if (!HOOK_SET(to)) new_msg->to = new_hdr; } else { if (!HOOK_SET(from)) new_msg->from = new_hdr; } /* From header might be unparsed */ if (!hdr->parsed) break; new_hdr->parsed = p; p +=ROUND4(sizeof(struct to_body)); memcpy(new_hdr->parsed, hdr->parsed, sizeof(struct to_body)); ((struct to_body*)new_hdr->parsed)->body.s = translate_pointer( new_msg->buf , org_msg->buf , ((struct to_body*)hdr->parsed)->body.s ); ((struct to_body*)new_hdr->parsed)->display.s = translate_pointer( new_msg->buf, org_msg->buf, ((struct to_body*)hdr->parsed)->display.s); ((struct to_body*)new_hdr->parsed)->uri.s = translate_pointer( new_msg->buf , org_msg->buf , ((struct to_body*)hdr->parsed)->uri.s ); if ( ((struct to_body*)hdr->parsed)->tag_value.s ) ((struct to_body*)new_hdr->parsed)->tag_value.s = translate_pointer( new_msg->buf , org_msg->buf , ((struct to_body*)hdr->parsed)->tag_value.s ); if ( (((struct to_body*)new_hdr->parsed)->parsed_uri.user.s) || (((struct to_body*)new_hdr->parsed)->parsed_uri.host.s) ) uri_trans(new_msg->buf, org_msg->buf, &((struct to_body*)new_hdr->parsed)->parsed_uri);
etc.
rpid:
case HDR_RPID_T: if (!HOOK_SET(rpid)) { new_msg->rpid = new_hdr; } break;
Since the rpid->parsed pointer points to a struct to_body, I would expect it would need the same handling as the from/to headers.
It's interesting to note in our backtrace that the crash occurs in a reply route, and - I think - Kamailio will clone the SIP message prior to executing the reply route (again - I think). Since the crash only happens if you evaluate the rpid pseudo variable, and since the Radius configuration we're using maps an 'extra field' to said psuedo variable, that might explain the issue we're seeing.
Does this sound plausible, or are we potentially going down the wrong path here?
Hello,
I think you were hit by an issue solved with commit 15fc8b9c59aaf31f005e38f54d363f1e9d0a068e :
https://github.com/kamailio/kamailio/commit/15fc8b9c59aaf31f005e38f54d363f1e...
The 4.1.3 was released before, in April 2014:
- https://www.kamailio.org/pub/kamailio/4.1.3/README
I am not sure if it was backported to 4.1 branch, but should not be hard to backport.
The issues was with many processing handling the same transaction, which has the sip_msg in shared memory, but then parsing of some headers created pointers to private memory of the process doing the parsing. Another process coming shortly after would see the pointer in sip_msg, but it would be to another process private memory and accessing it does a seg fault as expected.
Not all headers are cloned in shared memory, like it's done for To/From, and results in this kind of processing with parsing again in private memory.
Anyhow, I didn't have much time to dig in too much, so it's mu quick guess based on the fact you run a rather old release.
Cheers, Daniel
On 14/01/2017 17:53, Matthew Jordan wrote:
On Fri, Jan 13, 2017 at 2:57 PM, Joshua Colp <jcolp@digium.com mailto:jcolp@digium.com> wrote:
Greetings all, I'm currently investigating a crash in a Kamailio 4.1.3 deployment. The backtrace is as follows: (I've commented out phone numbers and IP addresses using X but have retained the correct length for each) (gdb) bt #0 0x00007f017f2ac96e in memcpy () from /lib64/libc.so.6 #1 0x00007f0174e1bb50 in rc_avpair_assign () from /usr/lib64/libradiusclient-ng.so.2 #2 0x00007f0174e1bc92 in rc_avpair_new () from /usr/lib64/libradiusclient-ng.so.2 #3 0x00007f0174e1bdd1 in rc_avpair_add () from /usr/lib64/libradiusclient-ng.so.2 #4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0, inf=0x7fff113e9cc0) at acc_radius_mod.c:365 #5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0, reset=0x0) at acc.c:957 #6 0x00007f0175239b49 in acc_onreply (t=0x7f0161a12f00, req=0x7f0162b6c1f0, reply=0x7f017ca845c8, code=200) at acc_logic.c:485 #7 0x00007f017523a07d in tmcb_func (t=0x7f0161a12f00, type=512, ps=0x7fff113e9ee0) at acc_logic.c:559 #8 0x00007f0177434478 in run_trans_callbacks_internal (cb_lst=0x7f0161a12f70, type=512, trans=0x7f0161a12f00, params=0x7fff113e9ee0) at t_hooks.c:290 #9 0x00007f017743468a in run_trans_callbacks_with_buf (type=512, rbuf=0x7f0161a12fc0, req=0x7f0162b6c1f0, repl=0x7f017ca845c8, flags=200) at t_hooks.c:336 #10 0x00007f0177466c06 in relay_reply (t=0x7f0161a12f00, p_msg=0x7f017ca845c8, branch=0, msg_status=200, cancel_data=0x7fff113ea240, do_put_on_wait=1) at t_reply.c:2001 #11 0x00007f01774690b7 in reply_received (p_msg=0x7f017ca845c8) at t_reply.c:2499 #12 0x000000000045d837 in do_forward_reply (msg=0x7f017ca845c8, mode=0) at forward.c:777 #13 0x000000000045e0f8 in forward_reply (msg=0x7f017ca845c8) at forward.c:860 #14 0x00000000004a5887 in receive_msg ( buf=0x9245e0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP X.XX.XX.XX:5060;rport=5060;received=X.XX.XX.XX;branch=z9hG4bK9981.5d2e8535c8eb94d43283f83230b5011e.0\r\nVia: SIP/2.0/UDP X.XX.XX.XX;rport=5060;branch=z9hG4bK9981.78d8b40"..., len=1161, rcv_info=0x7fff113ea5c0) at receive.c:273 #15 0x000000000053c838 in udp_rcv_loop () at udp_server.c:536 #16 0x000000000046d42b in main_loop () at main.c:1617 #17 0x00000000004704d3 in main (argc=11, argv=0x7fff113ea8f8) at main.c:2533 As you can see it's crashing when interacting with the radius client. Looking at frame 4 we can examine what is being provided to radius: (gdb) frame 4 #4 0x00007f0175026806 in acc_radius_send_request (req=0x7f0162b6c1f0, inf=0x7fff113e9cc0) at acc_radius_mod.c:365 365 ADD_RAD_AVPAIR(offset+i, inf->varr[i].s, inf->varr[i].len); (gdb) print inf->varr[i].s $12 = 0x0 (gdb) print inf->varr[i].len $13 = 3 (gdb) print inf->tarr[i] $14 = 2 '\002' This is a TYPE_STR with a length of 3, but no actual string. Needless to say this upsets radius. Based on the configuration of the system and the value of 'i' this is actually for the Remote-Party-ID header: (gdb) print i $15 = 8 (gdb) print *rad_extra->next->next->next->next $16 = {name = {s = 0x7f017b27fcf4 "Sip-RPid", len = 8}, spec = {type = PVT_OTHER, getf = 0x7f0176b79883 <pv_get_rpid>, setf = 0, pvp = {pvn = {type = 0, nfree = 0, u = {isname = {type = 0, name = {n = 0, s = {s = 0x0, len = 0}, re = 0x0}}, dname = 0x0}}, pvi = {type = 0, u = {ival = 0, dval = 0x0}}}, trans = 0x0}, next = 0x7f017b3ae4b0} Going up a frame and looking at the RPID URI on the message it matches the bad TYPE_STR: (gdb) frame 5 #5 0x00007f017522ff13 in acc_run_engines (msg=0x7f0162b6c1f0, type=0, reset=0x0) at acc.c:957 957 e->acc_req(msg, &inf); (gdb) print ((struct to_body*)(msg)->rpid->parsed)->uri $17 = {s = 0x0, len = 3} And digging into the full parsed information shows that it's not correct at all: (gdb) print *((struct to_body*)(msg)->rpid->parsed) $30 = {error = 3, body = {s = 0x0, len = 11064280}, uri = {s = 0x0, len = 3}, display = {s = 0x0, len = 11064313}, tag_value = {s = 0x0, len = 5}, parsed_uri = {user = {s = 0x0, len = 1484108237}, passwd = {s = 0x0, len = 2}, host = {s = 0x0, len = 0}, port = {s = 0x0, len = 3}, params = {s = 0x0, len = 11064395}, sip_params = {s = 0x0, len = 0}, headers = {s = 0x0, len = 8619}, port_no = 0, proto = 0, type = ERROR_URI_T, flags = 0, transport = { s = 0x0, len = 0}, ttl = {s = 0x0, len = 0}, user_param = {s = 0x0, len = 192}, maddr = {s = 0x0, len = 3}, method = {s = 0x0, len = 11064439}, lr = { s = 0x0, len = 3}, r2 = {s = 0x0, len = 11064464}, gr = {s = 0x0, len = 3}, transport_val = {s = 0x0, len = 2039735278}, ttl_val = {s = 0x0, len = 3}, user_param_val = {s = 0x0, len = 11064487}, maddr_val = {s = 0x0, len = 0}, method_val = {s = 0x0, len = 16367}, lr_val = {s = 0x0, len = 5}, r2_val = { s = 0x0, len = 1484104637}, gr_val = {s = 0x0, len = 3}}, param_lst = 0x0, last_param = 0xa8d4d6} Looking at the actual Remote-Party-ID header on the other hand itself shows nothing out of the ordinary: (gdb) print *msg->rpid $18 = {type = HDR_RPID_T, name = { s = 0x7f0162b6cba0 "Remote-Party-ID: <sip:XXXXXXXXXX@XXX.XX.XX.XX:5060>;privacy=off\r\nP-Charge-Info: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported: timer,replaces\r\nSession-Expires: 3600\r\nMin-SE: 90\r\nContent-Length: 260\r\nCo"..., len = 15}, body = { s = 0x7f0162b6cbb1 "<sip:XXXXXXXXXX@XXX.XX.XX.XX:5060>;privacy=off\r\nP-Charge-Info: sip:XXXXXXXXXX@XXX.XX.XX.XX:5060\r\nSupported: timer,replaces\r\nSession-Expires: 3600\r\nMin-SE: 90\r\nContent-Length: 260\r\nContent-Disposition"..., len = 46}, len = 65, parsed = 0x7f017b33c710, next = 0x7f0162b6d720} I should also add that the value before and after RPID appears to be correct: (gdb) print inf->varr[i-1].s $24 = 0x7f0161ce8427 "sip:XXXXXXXXXX@X.XX.XX.XXX:5060 SIP/2.0\r\nRecord-Route: <sip:X.XX.XX.XX;lr=on;ftag=gK0c7f2820;did=438.50e1>\r\nRecord-Route: <sip:X.XX.XX.XX;lr=on;did=ee5.4da5ae1>\r\nVia: SIP/2.0/UDP X.XX.XX.XX:5060;branc"... (gdb) print inf->varr[i-1].len $25 = 31 (gdb) print inf->varr[i+1].s $26 = 0x7f0176dbe720 "X.XX.XX.XX" (gdb) print inf->varr[i+1].len $27 = 10 Has anyone seen anything like this before or have a suggestion on where to look further? I've traced through the code in question and it all seems to be correct.
We have a theory that we hope to test out later today. If anyone more familiar with the inner workings of Kamailio could comment on it, that'd be hugely appreciated.
Looking through the Kamailio code base, not many things manipulate rpid->parsed once it has been set. One place that it could be manipulated is in sip_msg_shm_clone. Comparing the handling of the rpid header in that routine to the from/to headers is instructive - the from/to header handling directly handles the case of parsed being set, while rpid does not:
from/to:
case HDR_TO_T: case HDR_FROM_T: if (hdr->type == HDR_TO_T) { if (!HOOK_SET(to)) new_msg->to = new_hdr; } else { if (!HOOK_SET(from)) new_msg->from = new_hdr; } /* From header might be unparsed */ if (!hdr->parsed) break; new_hdr->parsed = p; p +=ROUND4(sizeof(struct to_body)); memcpy(new_hdr->parsed, hdr->parsed, sizeof(struct to_body)); ((struct to_body*)new_hdr->parsed)->body.s = translate_pointer( new_msg->buf , org_msg->buf , ((struct to_body*)hdr->parsed)->body.s ); ((struct to_body*)new_hdr->parsed)->display.s = translate_pointer( new_msg->buf, org_msg->buf, ((struct to_body*)hdr->parsed)->display.s); ((struct to_body*)new_hdr->parsed)->uri.s = translate_pointer( new_msg->buf , org_msg->buf , ((struct to_body*)hdr->parsed)->uri.s ); if ( ((struct to_body*)hdr->parsed)->tag_value.s ) ((struct to_body*)new_hdr->parsed)->tag_value.s = translate_pointer( new_msg->buf , org_msg->buf , ((struct to_body*)hdr->parsed)->tag_value.s ); if ( (((struct to_body*)new_hdr->parsed)->parsed_uri.user.s) || (((struct to_body*)new_hdr->parsed)->parsed_uri.host.s) ) uri_trans(new_msg->buf, org_msg->buf, &((struct to_body*)new_hdr->parsed)->parsed_uri);
etc.
rpid:
case HDR_RPID_T: if (!HOOK_SET(rpid)) { new_msg->rpid = new_hdr; } break;
Since the rpid->parsed pointer points to a struct to_body, I would expect it would need the same handling as the from/to headers.
It's interesting to note in our backtrace that the crash occurs in a reply route, and - I think - Kamailio will clone the SIP message prior to executing the reply route (again - I think). Since the crash only happens if you evaluate the rpid pseudo variable, and since the Radius configuration we're using maps an 'extra field' to said psuedo variable, that might explain the issue we're seeing.
Does this sound plausible, or are we potentially going down the wrong path here?
-- Matthew Jordan Digium, Inc. | CTO 445 Jan Davis Drive NW - Huntsville, AL 35806 - USA Check us out at: http://digium.com & http://asterisk.org
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
On Sat, Jan 14, 2017, at 01:25 PM, Daniel-Constantin Mierla wrote:
Hello,
I think you were hit by an issue solved with commit 15fc8b9c59aaf31f005e38f54d363f1e9d0a068e :
https://github.com/kamailio/kamailio/commit/15fc8b9c59aaf31f005e38f54d363f1e...
The 4.1.3 was released before, in April 2014:
I am not sure if it was backported to 4.1 branch, but should not be hard to backport.
The issues was with many processing handling the same transaction, which has the sip_msg in shared memory, but then parsing of some headers created pointers to private memory of the process doing the parsing. Another process coming shortly after would see the pointer in sip_msg, but it would be to another process private memory and accessing it does a seg fault as expected.
Thanks Daniel! Based on some logging I added I can confirm that the parsing did happen in another process, so I think you are right that this will fix the issue. I'm going to work on backporting the change and testing it out.
Cheers,
On 14/01/2017 18:45, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 01:25 PM, Daniel-Constantin Mierla wrote:
Hello,
I think you were hit by an issue solved with commit 15fc8b9c59aaf31f005e38f54d363f1e9d0a068e :
https://github.com/kamailio/kamailio/commit/15fc8b9c59aaf31f005e38f54d363f1e...
The 4.1.3 was released before, in April 2014:
I am not sure if it was backported to 4.1 branch, but should not be hard to backport.
The issues was with many processing handling the same transaction, which has the sip_msg in shared memory, but then parsing of some headers created pointers to private memory of the process doing the parsing. Another process coming shortly after would see the pointer in sip_msg, but it would be to another process private memory and accessing it does a seg fault as expected.
Thanks Daniel! Based on some logging I added I can confirm that the parsing did happen in another process, so I think you are right that this will fix the issue. I'm going to work on backporting the change and testing it out.
Do the testing, because as a first thought now looking at the acc code, the fix might have just narrowed the race window. But not having any other related report since the patch, nobody checked further. I would need to see how the callback is executed in the tm for a proper resolution, but no time right now.
Cheers, Daniel
On Sat, Jan 14, 2017, at 01:55 PM, Daniel-Constantin Mierla wrote:
<snip>
The issues was with many processing handling the same transaction, which has the sip_msg in shared memory, but then parsing of some headers created pointers to private memory of the process doing the parsing. Another process coming shortly after would see the pointer in sip_msg, but it would be to another process private memory and accessing it does a seg fault as expected.
Thanks Daniel! Based on some logging I added I can confirm that the parsing did happen in another process, so I think you are right that this will fix the issue. I'm going to work on backporting the change and testing it out.
Do the testing, because as a first thought now looking at the acc code, the fix might have just narrowed the race window. But not having any other related report since the patch, nobody checked further. I would need to see how the callback is executed in the tm for a proper resolution, but no time right now.
Testing is in progress and so far so good. I do think the code in acc_onreply that cleans up the parsed header is not correct, though. It's referencing the shared memory memory instead of the locally scoped one where any parsed headers (should) live.
On Sat, Jan 14, 2017, at 02:43 PM, Joshua Colp wrote:
Testing is in progress and so far so good. I do think the code in acc_onreply that cleans up the parsed header is not correct, though. It's referencing the shared memory memory instead of the locally scoped one where any parsed headers (should) live.
10,000 calls and zero problems. Going to push it further. I did try implementing the change to free the alloced headers on preq and it went very very poorly. Something still doesn't feel right with this over all, even with the fix.
On Sat, Jan 14, 2017, at 04:09 PM, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 02:43 PM, Joshua Colp wrote:
Testing is in progress and so far so good. I do think the code in acc_onreply that cleans up the parsed header is not correct, though. It's referencing the shared memory memory instead of the locally scoped one where any parsed headers (should) live.
10,000 calls and zero problems. Going to push it further. I did try implementing the change to free the alloced headers on preq and it went very very poorly. Something still doesn't feel right with this over all, even with the fix.
Another crash occurred when freeing the Remote-Party-ID parsed value:
(gdb) bt #0 0x000000000055adf6 in parse_addr_spec (buffer=0x7fff0d5d04a0 "\365?", end=0x7fff0d5d0410 "`8J\245\360\177", to_b=0x7fff0d5d04a0, allow_comma_sep=32767) at parser/parse_addr_spec.c:912 #1 0x000000000055ae62 in free_to_params (tb=0x55ae62) at parser/parse_addr_spec.c:921 #2 0x000000000054fde2 in clean_hdr_field (hf=0x7ff08e7e6248) at parser/hf.c:142 #3 0x00007ff0a5274ec5 in acc_onreply (t=0x7ff08e902830, req=0x7ff08e7e5230, reply=0x7ff0ac9f01a8, code=200) at acc_logic.c:511 #4 0x00007ff0a52753ec in tmcb_func (t=0x7ff08e902830, type=512, ps=0x7fff0d5d0d40) at acc_logic.c:583 #5 0x00007ff0a746f478 in run_trans_callbacks_internal (cb_lst=0x7ff08e9028a0, type=512, trans=0x7ff08e902830, params=0x7fff0d5d0d40) at t_hooks.c:290 #6 0x00007ff0a746f68a in run_trans_callbacks_with_buf (type=512, rbuf=0x7ff08e9028f0, req=0x7ff08e7e5230, repl=0x7ff0ac9f01a8, flags=200) at t_hooks.c:336 #7 0x00007ff0a74a1c06 in relay_reply (t=0x7ff08e902830, p_msg=0x7ff0ac9f01a8, branch=0, msg_status=200, cancel_data=0x7fff0d5d10a0, do_put_on_wait=1) at t_reply.c:2001 #8 0x00007ff0a74a40b7 in reply_received (p_msg=0x7ff0ac9f01a8) at t_reply.c:2499 #9 0x000000000045d837 in do_forward_reply (msg=0x7ff0ac9f01a8, mode=0) at forward.c:777 #10 0x000000000045e0f8 in forward_reply (msg=0x7ff0ac9f01a8) at forward.c:860 #11 0x00000000004a5887 in receive_msg ( buf=0x9245e0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 128.177.41.40:5060;branch=z9hG4bK16.def8e32b85c1f9a00abfd00f56e513ff.0, SIP/2.0/UDP 128.177.41.62:5061;rport=5061;branch=z9hG4bK-13531-8230-0\r\nFrom: sipp <sip:sipp@128"..., len=603, rcv_info=0x7fff0d5d1420) at receive.c:273 #12 0x000000000053c838 in udp_rcv_loop () at udp_server.c:531 #13 0x000000000046d42b in main_loop () at main.c:1617 #14 0x00000000004704d3 in main (argc=11, argv=0x7fff0d5d1758) at main.c:2533
(gdb) print *hf $2 = {type = HDR_RPID_T, name = { s = 0x7ff08e7e5a54 "Remote-Party-ID: "XXXXXXXXXX" sip:XXXXXXXXXX@XX.XXX.XXX.XXX;party=calling;privacy=off;screen=no\r\nContact: sip:sipp@XXX.XXX.XX.XX:5061\r\nMax-Forwards: 69\r\nSubject: Performance Test\r\nContent-Type: appl"..., len = 15}, body = { s = 0x7ff08e7e5a65 ""XXXXXXXXXX" sip:XXXXXXXXXX@XX.XXX.XXX.XXX;party=calling;privacy=off;screen=no\r\nContact: sip:sipp@XXX.XXX.XX.XX:5061\r\nMax-Forwards: 69\r\nSubject: Performance Test\r\nContent-Type: application/sdp\r\nCont"..., len = 80}, len = 99, parsed = 0x0, next = 0x7ff08e7e6288}
So it does indeed appear as though the change is incomplete. I'm going to see if I can figure out an additional fix, and if there's any additional information I can provide I'll try my best.
If you want a solution to avoid the issue via config: in the request_route, store the RPID in an avp and update the acc parameter to use the avp instead of the rpid variable.
As for code, a solution could be resetting the 'hdr->parsed' pointers to NULL that are in private memory after memcpy(&tmsg, req, sizeof(sip_msg_t)); -- using same kind of loop like at the end of acc_onreply(), but wihout clean_hdr_field(). In this way it ensure that the acc is using only what it parses and it is freeing only those headers.
Cheers, Daniel
On 14/01/2017 22:59, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 04:09 PM, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 02:43 PM, Joshua Colp wrote:
Testing is in progress and so far so good. I do think the code in acc_onreply that cleans up the parsed header is not correct, though. It's referencing the shared memory memory instead of the locally scoped one where any parsed headers (should) live.
10,000 calls and zero problems. Going to push it further. I did try implementing the change to free the alloced headers on preq and it went very very poorly. Something still doesn't feel right with this over all, even with the fix.
Another crash occurred when freeing the Remote-Party-ID parsed value:
(gdb) bt #0 0x000000000055adf6 in parse_addr_spec (buffer=0x7fff0d5d04a0 "\365?", end=0x7fff0d5d0410 "`8J\245\360\177", to_b=0x7fff0d5d04a0, allow_comma_sep=32767) at parser/parse_addr_spec.c:912 #1 0x000000000055ae62 in free_to_params (tb=0x55ae62) at parser/parse_addr_spec.c:921 #2 0x000000000054fde2 in clean_hdr_field (hf=0x7ff08e7e6248) at parser/hf.c:142 #3 0x00007ff0a5274ec5 in acc_onreply (t=0x7ff08e902830, req=0x7ff08e7e5230, reply=0x7ff0ac9f01a8, code=200) at acc_logic.c:511 #4 0x00007ff0a52753ec in tmcb_func (t=0x7ff08e902830, type=512, ps=0x7fff0d5d0d40) at acc_logic.c:583 #5 0x00007ff0a746f478 in run_trans_callbacks_internal (cb_lst=0x7ff08e9028a0, type=512, trans=0x7ff08e902830, params=0x7fff0d5d0d40) at t_hooks.c:290 #6 0x00007ff0a746f68a in run_trans_callbacks_with_buf (type=512, rbuf=0x7ff08e9028f0, req=0x7ff08e7e5230, repl=0x7ff0ac9f01a8, flags=200) at t_hooks.c:336 #7 0x00007ff0a74a1c06 in relay_reply (t=0x7ff08e902830, p_msg=0x7ff0ac9f01a8, branch=0, msg_status=200, cancel_data=0x7fff0d5d10a0, do_put_on_wait=1) at t_reply.c:2001 #8 0x00007ff0a74a40b7 in reply_received (p_msg=0x7ff0ac9f01a8) at t_reply.c:2499 #9 0x000000000045d837 in do_forward_reply (msg=0x7ff0ac9f01a8, mode=0) at forward.c:777 #10 0x000000000045e0f8 in forward_reply (msg=0x7ff0ac9f01a8) at forward.c:860 #11 0x00000000004a5887 in receive_msg ( buf=0x9245e0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 128.177.41.40:5060;branch=z9hG4bK16.def8e32b85c1f9a00abfd00f56e513ff.0, SIP/2.0/UDP 128.177.41.62:5061;rport=5061;branch=z9hG4bK-13531-8230-0\r\nFrom: sipp <sip:sipp@128"..., len=603, rcv_info=0x7fff0d5d1420) at receive.c:273 #12 0x000000000053c838 in udp_rcv_loop () at udp_server.c:531 #13 0x000000000046d42b in main_loop () at main.c:1617 #14 0x00000000004704d3 in main (argc=11, argv=0x7fff0d5d1758) at main.c:2533
(gdb) print *hf $2 = {type = HDR_RPID_T, name = { s = 0x7ff08e7e5a54 "Remote-Party-ID: "XXXXXXXXXX" sip:XXXXXXXXXX@XX.XXX.XXX.XXX;party=calling;privacy=off;screen=no\r\nContact: sip:sipp@XXX.XXX.XX.XX:5061\r\nMax-Forwards: 69\r\nSubject: Performance Test\r\nContent-Type: appl"..., len = 15}, body = { s = 0x7ff08e7e5a65 ""XXXXXXXXXX" sip:XXXXXXXXXX@XX.XXX.XXX.XXX;party=calling;privacy=off;screen=no\r\nContact: sip:sipp@XXX.XXX.XX.XX:5061\r\nMax-Forwards: 69\r\nSubject: Performance Test\r\nContent-Type: application/sdp\r\nCont"..., len = 80}, len = 99, parsed = 0x0, next = 0x7ff08e7e6288}
So it does indeed appear as though the change is incomplete. I'm going to see if I can figure out an additional fix, and if there's any additional information I can provide I'll try my best.
On Sat, Jan 14, 2017, at 06:20 PM, Daniel-Constantin Mierla wrote:
If you want a solution to avoid the issue via config: in the request_route, store the RPID in an avp and update the acc parameter to use the avp instead of the rpid variable.
As for code, a solution could be resetting the 'hdr->parsed' pointers to NULL that are in private memory after memcpy(&tmsg, req, sizeof(sip_msg_t)); -- using same kind of loop like at the end of acc_onreply(), but wihout clean_hdr_field(). In this way it ensure that the acc is using only what it parses and it is freeing only those headers.
Thanks for the configuration suggestion Daniel! I'm going to try that as it should do the job. I'll follow up if I encounter the same issue despite the configuration change (which shouldn't be possible - but I don't know if I'm that lucky).
Cheers,
On Sat, Jan 14, 2017, at 06:39 PM, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:20 PM, Daniel-Constantin Mierla wrote:
If you want a solution to avoid the issue via config: in the request_route, store the RPID in an avp and update the acc parameter to use the avp instead of the rpid variable.
As for code, a solution could be resetting the 'hdr->parsed' pointers to NULL that are in private memory after memcpy(&tmsg, req, sizeof(sip_msg_t)); -- using same kind of loop like at the end of acc_onreply(), but wihout clean_hdr_field(). In this way it ensure that the acc is using only what it parses and it is freeing only those headers.
We've been testing the configuration change and as expected it has resolved the problem. Thanks Daniel!
On 16/01/2017 00:48, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:39 PM, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:20 PM, Daniel-Constantin Mierla wrote:
If you want a solution to avoid the issue via config: in the request_route, store the RPID in an avp and update the acc parameter to use the avp instead of the rpid variable.
As for code, a solution could be resetting the 'hdr->parsed' pointers to NULL that are in private memory after memcpy(&tmsg, req, sizeof(sip_msg_t)); -- using same kind of loop like at the end of acc_onreply(), but wihout clean_hdr_field(). In this way it ensure that the acc is using only what it parses and it is freeing only those headers.
We've been testing the configuration change and as expected it has resolved the problem. Thanks Daniel!
Welcome! I will try to fix it properly in the code as well.
Cheers, Daniel
On 16/01/2017 09:33, Daniel-Constantin Mierla wrote:
On 16/01/2017 00:48, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:39 PM, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:20 PM, Daniel-Constantin Mierla wrote:
If you want a solution to avoid the issue via config: in the request_route, store the RPID in an avp and update the acc parameter to use the avp instead of the rpid variable.
As for code, a solution could be resetting the 'hdr->parsed' pointers to NULL that are in private memory after memcpy(&tmsg, req, sizeof(sip_msg_t)); -- using same kind of loop like at the end of acc_onreply(), but wihout clean_hdr_field(). In this way it ensure that the acc is using only what it parses and it is freeing only those headers.
We've been testing the configuration change and as expected it has resolved the problem. Thanks Daniel!
Welcome! I will try to fix it properly in the code as well.
I pushed a commit to do a deep cloning instead of just memcpy for the request structure. It should give a fresh structure without the headers that are not cloned in shared memory.
The solution with avp is still good, specially for older versions in order to avoid backporting, but if you get a chance to test and report the results, it will be appreciated. Besides seeing if there is still a crash, monitor the memory at the beginning and start of the tests just to see if there is a risk of a leak -- next commands can be used to see stats for private memory and shared memory:
kamcmd pkg.stats kamctl stats shmem
Cheers, Daniel
On Mon, Jan 16, 2017, at 10:26 AM, Daniel-Constantin Mierla wrote:
On 16/01/2017 09:33, Daniel-Constantin Mierla wrote:
On 16/01/2017 00:48, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:39 PM, Joshua Colp wrote:
On Sat, Jan 14, 2017, at 06:20 PM, Daniel-Constantin Mierla wrote:
If you want a solution to avoid the issue via config: in the request_route, store the RPID in an avp and update the acc parameter to use the avp instead of the rpid variable.
As for code, a solution could be resetting the 'hdr->parsed' pointers to NULL that are in private memory after memcpy(&tmsg, req, sizeof(sip_msg_t)); -- using same kind of loop like at the end of acc_onreply(), but wihout clean_hdr_field(). In this way it ensure that the acc is using only what it parses and it is freeing only those headers.
We've been testing the configuration change and as expected it has resolved the problem. Thanks Daniel!
Welcome! I will try to fix it properly in the code as well.
I pushed a commit to do a deep cloning instead of just memcpy for the request structure. It should give a fresh structure without the headers that are not cloned in shared memory.
The solution with avp is still good, specially for older versions in order to avoid backporting, but if you get a chance to test and report the results, it will be appreciated. Besides seeing if there is still a crash, monitor the memory at the beginning and start of the tests just to see if there is a risk of a leak -- next commands can be used to see stats for private memory and shared memory:
kamcmd pkg.stats kamctl stats shmem
We've gone with the config change but I'll see what I can do with testing the above in the future.