when doing jitsi presence tests, i also managed to get kamailio 3.3 to crash with a core dump. the events leading to the crash were a pua publish via xmlrpc, which resulted in a bunch of notifies, some of which the sip proxy failed to deliver due to missing tcp connections.
after generating the notifies, presence server crashed like this:
#0 0x00007fa1e9b17f7b in core_hash () from /usr/lib/pres-serv/modules_k/pua.so
#1 0x00007fa1e9b1933f in publ_cback_func () from /usr/lib/pres-serv/modules_k/pua.so
#2 0x00007fa1ee59b907 in run_trans_callbacks_internal () from /usr/lib/pres-serv/modules/tm.so
#3 0x00007fa1ee59ba19 in run_trans_callbacks () from /usr/lib/pres-serv/modules/tm.so
#4 0x00007fa1ee5c36fa in local_reply () from /usr/lib/pres-serv/modules/tm.so
#5 0x00007fa1ee5c4b30 in reply_received () from /usr/lib/pres-serv/modules/tm.so
#6 0x000000000044fee5 in forward_reply ()
#7 0x0000000000489180 in receive_msg ()
#8 0x0000000000501a8c in receive_tcp_msg ()
#9 0x0000000000502740 in tcp_read_req ()
#10 0x0000000000503759 in handle_io ()
#11 0x00000000004fe363 in io_wait_loop_epoll ()
#12 0x0000000000504431 in tcp_receive_loop ()
#13 0x00000000004f9920 in tcp_init_children ()
#14 0x000000000045c9fb in main_loop ()
#15 0x000000000045f29c in main ()
it would be nice to get this fixed before 3.3 release.
-- juha
Hello,
the backtrace is missing debug symbols, not showing detail of file and line, nor the parameters. Can you install with debug symbols in and reproduce it? A detailed backtrace will help to find the issue -- installing from source will keep the symbols, for debs there is an option to build a dedicated package for the symbols, I don't remember which one.
Cheers, Daniel
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
Daniel-Constantin Mierla writes:
the backtrace is missing debug symbols, not showing detail of file and line, nor the parameters. Can you install with debug symbols in and reproduce it? A detailed backtrace will help to find the issue -- installing from source will keep the symbols, for debs there is an option to build a dedicated package for the symbols, I don't remember which one.
i noticed that too and i'm currently building a new debian package with DEB_BUILD_OPTIONS including "nostrip".
-- juha
i have not managed to build debian package with debug symbols for gdb. file claims that the binary is not stripped:
$ file /usr/sbin/pres-serv
/usr/sbin/pres-serv: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0xffa2eda9f71031ebb06946541b7aeae02ffce982, not stripped
but gdb claims that there are no symbols:
# gdb /usr/sbin/pres-serv /var/cores/core.pres-serv.sig11.1
core.pres-serv.sig11.15119 core.pres-serv.sig11.17616
# gdb /usr/sbin/pres-serv /var/cores/core.pres-serv.sig11.17616
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/sbin/pres-serv...(no debugging symbols found)...done.
in debian rules i have
# force no stripping
export DEB_BUILD_OPTIONS:="$(DEB_BUILD_OPTIONS) nostrip"
...
ifeq (,$(findstring nostrip,$(DEB_BUILD_OPTIONS)))
INSTALL_PROGRAM += -s
endif
any clues?
-- juha
Juha Heinanen writes:
in debian rules i have
# force no stripping
export DEB_BUILD_OPTIONS:="$(DEB_BUILD_OPTIONS) nostrip"
...
ifeq (,$(findstring nostrip,$(DEB_BUILD_OPTIONS)))
INSTALL_PROGRAM += -s
endif
problem solved. i got debug symbols when i added "debug" in debian rules line:
export DEB_BUILD_OPTIONS:="$(DEB_BUILD_OPTIONS) debug nostrip"
-- juha
Hello, I found an issue in presence yesterday which caused a crash in core_hash.
My issue was caused by sending a null string to core_hash, which did not check that s1->s != NULL before doing some pointer arithmetic. (The fix to presence will be committed very soon!)
I see that core_hash has moved from /lib/kcore/hash_func.h (in 3.2) to /hashes.h but I think the algorithm is the same one. Maybe pua is passing a null or uninitialised string?
Regards, Hugh
Hugh Waite writes:
I see that core_hash has moved from /lib/kcore/hash_func.h (in 3.2) to /hashes.h but I think the algorithm is the same one. Maybe pua is passing a null or uninitialised string?
hugh,
when sending publish, there are two places in the pua module where core_hash is called:
hash_code= core_hash(hentity->pres_uri, NULL, HASH_SIZE);
and
hash_code= core_hash(publ->pres_uri, NULL, HASH_SIZE);
is it so that the second param cannot be NULL?
-- juha
Hi Juha,
The second parameter (s2) is allowed to be NULL, but there is no check that s1->s or s2->s != NULL.
It looks like you have an uninitialised or corrupt value as the first string, which causes the same problem: a segfault.
Regards, Hugh
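The missing check Hugh describes can be illustrated with a small stand-alone sketch. The str type below mirrors the pointer-plus-length layout visible in the gdb output; str_is_sane and toy_hash are hypothetical names used only for illustration, and the hash is a placeholder rather than the algorithm in hashes.h. A check of this kind catches NULL pointers and negative lengths only; it cannot detect a dangling or overwritten pointer.

```c
#include <stddef.h>

/* Pointer-plus-length string, mirroring the layout seen in the gdb dumps. */
typedef struct {
    char *s;
    int len;
} str;

/* Hypothetical helper: returns 1 when the string is safe to iterate over. */
static int str_is_sane(const str *v)
{
    return v != NULL && v->s != NULL && v->len >= 0;
}

/* Placeholder hash (djb2-style), NOT the algorithm from hashes.h.
 * s2 may be NULL, as in the pua calls; size must be a power of two. */
static unsigned int toy_hash(const str *s1, const str *s2, unsigned int size)
{
    unsigned int h = 5381;
    const str *parts[2];
    int i, k;

    parts[0] = s1;
    parts[1] = s2;
    for (i = 0; i < 2; i++) {
        if (!str_is_sane(parts[i]))
            continue;               /* skip NULL or malformed input */
        for (k = 0; k < parts[i]->len; k++)
            h = h * 33u + (unsigned char)parts[i]->s[k];
    }
    return h & (size - 1);
}
```

Whether a malformed first argument should be skipped silently, as here, or reported as an error is a design choice; skipping keeps the sketch short.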
Hugh Waite writes:
I see that core_hash has moved from /lib/kcore/hash_func.h (in 3.2) to /hashes.h but I think the algorithm is the same one. Maybe pua is passing a null or uninitialised string?
after getting the debug symbols, i see this:
(gdb) where
#0 0x00007fe3b4db0f9f in core_hash (s1=0x7fe3b1866a01, s2=0x0, size=512) at ../../hashes.h:279
#1 0x00007fe3b4db233f in publ_cback_func (t=0x7fe3b1866d20, type=1024, ps=0x7fffa7086230) at send_publish.c:246
#2 0x00007fe3b9834907 in run_trans_callbacks_internal (cb_lst=0x7fe3b1866d90, type=1024, trans=0x7fe3b1866d20, params=0x7fffa7086230) at t_hooks.c:290
#3 0x00007fe3b9834a19 in run_trans_callbacks (type=1024, trans=0x7fe3b1866d20, req=0x0, rpl=0x7fe3b9d78c88, code=200) at t_hooks.c:317
#4 0x00007fe3b985c6fa in local_reply (t=0x7fe3b1866d20, p_msg=0x7fe3b9d78c88, branch=0, msg_status=200, cancel_data=0x7fffa7086490) at t_reply.c:2001
#5 0x00007fe3b985db30 in reply_received (p_msg=0x7fe3b9d78c88) at t_reply.c:2350
#6 0x000000000044fee5 in forward_reply (msg=0x7fe3b9d78c88) at forward.c:790
#7 0x0000000000489180 in receive_msg (buf=0x7fe3b1868350 "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP 192.98.103.10:8080;branch=z9hG4bK2d38.cbac1124", '0' <repeats 24 times>, ".0;received=127.0.0.1\r\nTo: sip:jh@vm.test.fi;tag=4a664ec84c547b2d0bc0fe8965f834e4-d075\r\nFrom: sip"..., len=461, rcv_info=0x7fe3b1868088) at receive.c:270
#8 0x0000000000501a8c in receive_tcp_msg (tcpbuf=0x7fe3b1868350 "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP 192.98.103.10:8080;branch=z9hG4bK2d38.cbac1124", '0' <repeats 24 times>, ".0;received=127.0.0.1\r\nTo: sip:jh@vm.test.fi;tag=4a664ec84c547b2d0bc0fe8965f834e4-d075\r\nFrom: sip"..., len=461, rcv_info=0x7fe3b1868088, con=0x7fe3b1868070) at tcp_read.c:1044
#9 0x0000000000502740 in tcp_read_req (con=0x7fe3b1868070, bytes_read=0x7fffa708693c, read_flags=0x7fffa7086938) at tcp_read.c:1231
#10 0x0000000000503759 in handle_io (fm=0x7fe3b9d66740, events=1, idx=-1) at tcp_read.c:1403
#11 0x00000000004fe363 in io_wait_loop_epoll (h=0x897ce0, t=2, repeat=0) at io_wait.h:1092
#12 0x0000000000504431 in tcp_receive_loop (unix_sock=20) at tcp_read.c:1572
#13 0x00000000004f9920 in tcp_init_children () at tcp_main.c:4952
#14 0x000000000045c9fb in main_loop () at main.c:1718
#15 0x000000000045f29c in main (argc=16, argv=0x7fffa7086de8) at main.c:2546
at frame #1, i see:
(gdb) frame 1
#1 0x00007fe3b4db233f in publ_cback_func (t=0x7fe3b1866d20, type=1024, ps=0x7fffa7086230) at send_publish.c:246
246 hash_code= core_hash(hentity->pres_uri, NULL, HASH_SIZE);
(gdb) print hentity->pres_uri
$1 = (str *) 0x7fe3b1866a01
looks like that str does not point to anything real:
(gdb) print hentity->pres_uri.len
$3 = 1835890035
(gdb) print hentity->pres_uri.s
$4 = 0x2d6567617373656d <Address 0x2d6567617373656d out of bounds>
-- juha
now i got a pua.c crash without any publish request close by:
Program terminated with signal 11, Segmentation fault.
#0 0x00007f4bc97766a6 in db_update (ticks=28098484, param=0x0) at pua.c:992
992 switch(p->db_flag)
(gdb) where
#0 0x00007f4bc97766a6 in db_update (ticks=28098484, param=0x0) at pua.c:992
#1 0x0000000000507b6a in compat_old_handler (ti=449575751, tl=0x7f4bc6212050, data=0x7f4bc6212050) at timer.c:1017
#2 0x00000000005080ac in slow_timer_main () at timer.c:1151
#3 0x000000000045c7f8 in main_loop () at main.c:1688
#4 0x000000000045f29c in main (argc=16, argv=0x7fff1c7fa078) at main.c:2546
there was a pua publish at 09:18:45 but there were no subscribers for the presentity at that time. then at 09:19:42, there was a subscription for this presentity and the crash followed right after the notify was generated.
i have not set any presence related db_mode parameters.
i don't remember seeing these crashes in may when i did similar presence tests.
-- juha
Hello,
can you print p and content of p in frame 0?
Cheers, Daniel
Daniel-Constantin Mierla writes:
Hello,
can you print p and content of p in frame 0?
(gdb) where
#0 0x00007f4bc97766a6 in db_update (ticks=28098484, param=0x0) at pua.c:992
#1 0x0000000000507b6a in compat_old_handler (ti=449575751, tl=0x7f4bc6212050, data=0x7f4bc6212050) at timer.c:1017
#2 0x00000000005080ac in slow_timer_main () at timer.c:1151
#3 0x000000000045c7f8 in main_loop () at main.c:1688
#4 0x000000000045f29c in main (argc=16, argv=0x7fff1c7fa078) at main.c:2546
(gdb) frame 0
#0 0x00007f4bc97766a6 in db_update (ticks=28098484, param=0x0) at pua.c:992
992 switch(p->db_flag)
(gdb) print p
$1 = (ua_pres_t *) 0x5
but looks like i cannot print any of its contents:
(gdb) print p->etag.s
Cannot access memory at address 0x4d
(gdb) print p->body.len
Cannot access memory at address 0x6d
-- juha
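The two failing addresses are what one would expect if p itself holds the bogus value 0x5, since gdb computes each field address as base plus field offset. The sketch below rebuilds a struct with the same field order as the ua_pres_t dump shown elsewhere in this thread. The integer field types are assumptions (the dump does not show them), struct ua_pres_sketch and field_addr are hypothetical names, and the offsets quoted in the comments hold for a typical LP64 platform such as x86-64 Linux.

```c
#include <stddef.h>

typedef struct {
    char *s;
    int len;
} str;

/* Field order copied from the gdb dump of *hentity; integer types are
 * assumptions. Only the fields up to `body` are reproduced. */
struct ua_pres_sketch {
    str id;
    str *pres_uri;
    int event;
    unsigned int expires;
    unsigned int desired_expires;
    int flag;
    int db_flag;
    void *cb_param;
    struct ua_pres_sketch *next;
    int ua_flag;
    str etag;       /* offset 0x48 on LP64 */
    str tuple_id;
    str *body;      /* offset 0x68 on LP64 */
};

/* Address a debugger would try to read for a field at `offset` when the
 * struct pointer holds the (invalid) value `base`. */
static unsigned long field_addr(unsigned long base, size_t offset)
{
    return base + (unsigned long)offset;
}
```

With base 0x5, the etag field lands at 0x4d and the body pointer (which has to be loaded before body.len can be read) at 0x6d, matching the two error messages.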
after the previous timer based crash, i tried again by sending pua publish when subscription for the presentity was already active and got the same crash as before:
Program terminated with signal 11, Segmentation fault.
#0 0x00007ff739269f7b in core_hash (s1=0x504953203a616956, s2=0x0, size=512) at ../../hashes.h:277
277 end=s1->s+s1->len;
(gdb) where
#0 0x00007ff739269f7b in core_hash (s1=0x504953203a616956, s2=0x0, size=512) at ../../hashes.h:277
#1 0x00007ff73926b33f in publ_cback_func (t=0x7ff735d4f8f0, type=1024, ps=0x7fffcaef1e10) at send_publish.c:246
#2 0x00007ff73dced907 in run_trans_callbacks_internal (cb_lst=0x7ff735d4f960, type=1024, trans=0x7ff735d4f8f0, params=0x7fffcaef1e10) at t_hooks.c:290
#3 0x00007ff73dceda19 in run_trans_callbacks (type=1024, trans=0x7ff735d4f8f0, req=0x0, rpl=0x7ff73e231998, code=200) at t_hooks.c:317
#4 0x00007ff73dd156fa in local_reply (t=0x7ff735d4f8f0, p_msg=0x7ff73e231998, branch=0, msg_status=200, cancel_data=0x7fffcaef2070) at t_reply.c:2001
#5 0x00007ff73dd16b30 in reply_received (p_msg=0x7ff73e231998) at t_reply.c:2350
#6 0x000000000044fee5 in forward_reply (msg=0x7ff73e231998) at forward.c:790
#7 0x0000000000489180 in receive_msg (buf=0x7ff735d713c0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP 192.98.103.10:8080;branch=z9hG4bKa5de.bd555264", '0' <repeats 24 times>, ".0;received=127.0.0.1\r\nTo: sip:jh@vm.test.fi;tag=4a664ec84c547b2d0bc0fe8965f834e4-02e8\r\nFrom: sip"..., len=461, rcv_info=0x7ff735d710f8) at receive.c:270
#8 0x0000000000501a8c in receive_tcp_msg (tcpbuf=0x7ff735d713c0 "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP 192.98.103.10:8080;branch=z9hG4bKa5de.bd555264", '0' <repeats 24 times>, ".0;received=127.0.0.1\r\nTo: sip:jh@vm.test.fi;tag=4a664ec84c547b2d0bc0fe8965f834e4-02e8\r\nFrom: sip"..., len=461, rcv_info=0x7ff735d710f8, con=0x7ff735d710e0) at tcp_read.c:1044
#9 0x0000000000502740 in tcp_read_req (con=0x7ff735d710e0, bytes_read=0x7fffcaef251c, read_flags=0x7fffcaef2518) at tcp_read.c:1231
#10 0x0000000000503759 in handle_io (fm=0x7ff73e21f3f0, events=1, idx=-1) at tcp_read.c:1403
#11 0x00000000004fe363 in io_wait_loop_epoll (h=0x897ce0, t=2, repeat=0) at io_wait.h:1092
#12 0x0000000000504431 in tcp_receive_loop (unix_sock=16) at tcp_read.c:1572
#13 0x00000000004f9920 in tcp_init_children () at tcp_main.c:4952
#14 0x000000000045c9fb in main_loop () at main.c:1718
#15 0x000000000045f29c in main (argc=16, argv=0x7fffcaef29c8) at main.c:2546
i need to go now for a few hours, but will be back in the evening for more tests if someone before that has been able to figure out what is going on.
-- juha
Hello,
can you print hentity and *hentity in frame 1?
Cheers, Daniel
Daniel-Constantin Mierla writes:
can you print hentity and *hentity in frame 1?
here you go:
(gdb) frame 1
#1 0x00007ff73926b33f in publ_cback_func (t=0x7ff735d4f8f0, type=1024, ps=0x7fffcaef1e10) at send_publish.c:246
246 hash_code= core_hash(hentity->pres_uri, NULL, HASH_SIZE);
(gdb) print hentity
$1 = (ua_pres_t *) 0x7ff735d1eed8
(gdb) print *hentity
$2 = {id = {s = 0x20302e322f504953 <Address 0x20302e322f504953 out of bounds>, len = 540028978},
  pres_uri = 0x504953203a616956,
  event = 808333871,
  expires = 1346589743,
  desired_expires = 842608928,
  flag = 775436590,
  db_flag = 775106609,
  cb_param = 0x6e6172623b303830,
  next = 0x344768397a3d6863,
  ua_flag = 895568738,
  etag = {s = 0x3034363235353564 <Address 0x3034363235353564 out of bounds>, len = 808464432},
  tuple_id = {s = 0x3030303030303030 <Address 0x3030303030303030 out of bounds>, len = 808464432},
  body = 0x7669656365723b30,
  content_type = {s = 0x302e3732313d6465 <Address 0x302e3732313d6465 out of bounds>, len = 825110574},
  watcher_uri = 0x686a3a706973203a,
  call_id = {s = 0x747365742e6d7640 <Address 0x747365742e6d7640 out of bounds>, len = 996763182},
  to_tag = {s = 0x3863653436366134 <Address 0x3863653436366134 out of bounds>, len = 875914036},
  from_tag = {s = 0x3938656630636230 <Address 0x3938656630636230 out of bounds>, len = 946222390},
  cseq = 1697787949,
  version = 1175063864,
  outbound_proxy = 0x706973203a6d6f72,
  extra_headers = 0x742e6d7640686a3a,
  record_route = {s = 0x743b69662e747365 <Address 0x743b69662e747365 out of bounds>, len = 926771041},
  remote_contact = {s = 0x3564303638333834 <Address 0x3564303638333834 out of bounds>, len = 962815330},
  contact = {s = 0x6266393266373734 <Address 0x6266393266373734 out of bounds>, len = 758265909}}
-- juha
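One observation about the dump above: read as little-endian bytes, the "pointer" values are printable ASCII. That suggests (it does not prove) that the memory behind hentity had been released and reused for SIP message text. The helper below is a minimal sketch for checking this by hand; word_to_ascii is a hypothetical name, and the byte order assumes an x86-64 core like the one in this thread.

```c
#include <stdint.h>

/* Writes the 8 bytes of `word` to `out` in little-endian (x86-64 memory)
 * order, replacing non-printable bytes with '.', and NUL-terminates.
 * `out` must have room for 9 chars. */
static void word_to_ascii(uint64_t word, char out[9])
{
    int i;

    for (i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(word >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[8] = '\0';
}
```

Applied to values from the dump, pres_uri (0x504953203a616956) decodes to "Via: SIP" and id.s (0x20302e322f504953) to "SIP/2.0 ", both of which look like fragments of a SIP reply.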
Hello,
I think I spotted the reason (checking also the logs from the issue reported by Charles Chance on sr-users some weeks ago, it is the same case) -- cc-ed Peter and Hugh because it is from a commit coming from them, respectively:
commit ea2fab792425bf30197d47ae08f806a908fc3681
Author: Peter Dunkley peter.dunkley@crocodile-rcs.com
Date: Wed May 9 13:55:01 2012 +0100
There were a few issues IMO added by this commit (in function int send_publish( publ_info_t* publ ) from modules_k/pua/send_publish.c), caused by letting the execution go through the error: label even when all is ok. Before this commit, when all was ok the function returned before the error: label.
First is the shm_free() of cb_param -- this variable is given to TM for returning it in the callback function, where it is accessed but with invalid content at that time -- the reason for the crash reported here.
The second is related to DB transaction operations, that's why I wanted to discuss it here:
- if all is ok, pua_dbf.end_transaction(pua_db) is executed
- but then it goes through the error: label and does pua_dbf.abort_transaction(pua_db)
It might be harmless, but does not look 'ok' IMO.
I pushed a commit to fix it, Hugh and Peter should check it not to break something that they had in mind with the commit:
http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commit;h=1d89d7be...
If feedback and testing is ok, then it will be backported.
Cheers, Daniel
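The control-flow problem Daniel describes (a success path that falls through into the error: label and frees a pointer that was already handed to a callback) can be sketched in isolation. This is not the code from send_publish.c nor the actual fix commit: struct tm_stub, send_fallthrough and send_fixed are hypothetical names, and plain malloc/free stand in for the shared-memory allocator.

```c
#include <stdlib.h>

/* Records what a (hypothetical) transaction layer would still reference. */
struct tm_stub {
    void *cb_param;   /* pointer kept for the later reply callback */
};

/* Shape of the bug: the success path falls through into `error:`. */
static int send_fallthrough(struct tm_stub *tm, int *freed)
{
    int ret = -1;
    void *cb_param = malloc(16);

    *freed = 0;
    if (cb_param == NULL)
        goto error;
    tm->cb_param = cb_param;   /* the callback will use this later */
    ret = 0;
    /* no return here, so execution continues into the cleanup code */
error:
    if (cb_param != NULL) {
        free(cb_param);        /* also runs on success: dangling pointer */
        *freed = 1;
    }
    return ret;
}

/* One possible repair: clear the local pointer once ownership has moved,
 * so the shared cleanup code leaves the buffer alone. */
static int send_fixed(struct tm_stub *tm, int *freed)
{
    int ret = -1;
    void *cb_param = malloc(16);

    *freed = 0;
    if (cb_param == NULL)
        goto error;
    tm->cb_param = cb_param;
    cb_param = NULL;           /* the callback owns it now */
    ret = 0;
error:
    if (cb_param != NULL) {
        free(cb_param);
        *freed = 1;
    }
    return ret;
}
```

Returning before the label on success, as the function did before the commit in question, would be another way to get the same effect.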
Hi,
abort_transaction() has no effect if there is no transaction in progress. It is deliberately called this way to make sure any unclosed transactions at this point are caught and rolled-back.
It's a belts-and-braces thing to make sure that - even if there is a coding error in the function and the transaction is not closed when it should be - the transaction is closed before the function ends. Otherwise future DB operations that occur on the same DB connection could end up uncommitted within the same transaction.
Regards,
Peter
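The belts-and-braces pattern Peter describes relies on the abort call being a no-op when nothing is open. A minimal sketch of that idea follows, assuming a connection object that tracks whether a transaction is in progress; struct db_conn_sketch, its fields and the return values are hypothetical and are not the pua_dbf API.

```c
/* Hypothetical connection state; a real DB driver tracks this internally. */
struct db_conn_sketch {
    int in_transaction;   /* 1 while a transaction is open */
    int rollbacks;        /* number of rollbacks actually performed */
};

static int start_transaction(struct db_conn_sketch *c)
{
    if (c->in_transaction)
        return -1;        /* nested transactions not modelled here */
    c->in_transaction = 1;
    return 0;
}

static int end_transaction(struct db_conn_sketch *c)
{
    if (!c->in_transaction)
        return -1;
    c->in_transaction = 0; /* commit */
    return 0;
}

/* Safe to call unconditionally: returns 0 and does nothing when no
 * transaction is open, returns 1 after rolling one back. */
static int abort_transaction(struct db_conn_sketch *c)
{
    if (!c->in_transaction)
        return 0;
    c->in_transaction = 0;
    c->rollbacks++;
    return 1;
}
```

With this shape, calling abort_transaction at the end of a function costs nothing after a clean commit but still rolls back a transaction that an earlier code path forgot to close.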
I pushed a commit to fix it, Hugh and Peter should check it not to break something that they had in mind with the commit:
http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commit;h=1d89d7be...
If feedback and testing is ok, then it will be backported.
daniel,
i just built kamailio from the latest 3.3 branch and i'm still getting the core_hash crash.
-- juha
Hello,
On 6/14/12 6:17 PM, Juha Heinanen wrote:
I pushed a commit to fix it, Hugh and Peter should check it not to break something that they had in mind with the commit:
http://git.sip-router.org/cgi-bin/gitweb.cgi/sip-router/?a=commit;h=1d89d7be...
If feedback and testing is ok, then it will be backported.
daniel,
i just built kamailio from the latest 3.3 branch and i'm still getting the core_hash crash.
it was not backported to 3.3, the patch is committed only in master so far.
Cheers, Daniel
Hi,
Hugh tested the fix here and it works.
I am also about to commit another fix to presence and to pua shortly.
Peter
after the pua bug fix was committed to the 3.3 branch, i get this kind of message in syslog related to pua publish/notify. i don't remember ever seeing it before and it looks very dangerous:
Jun 14 20:02:52 siika /usr/sbin/pres-serv[4339]: INFO: <core> [mem/f_malloc.c:529]: freeing a free fragment (0x7f5da71031a0/0x7f5da71031b0) - ignore
-- juha
Hello,
overall, it's pretty much harmless. I added the message to spot possible double frees even in f_malloc. When q_malloc (the memory debugging manager) is used, an abort() is executed.
But what is in use now is the fast malloc (f_malloc), which didn't try to catch these situations at all, ignoring them. Maybe we should make it DBG.
Ideally, the double free should be found and avoided, which will be easier to do with q_malloc on. For that you have to set MEMDBG=1 in Makefile.defs or in command line when compiling.
Cheers, Daniel
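The behaviour behind the "freeing a free fragment ... - ignore" message can be sketched as a flag check in the free path. This is a simplified illustration under assumed names (struct frag_sketch, frag_free), not the code in mem/f_malloc.c.

```c
#include <stddef.h>

/* Hypothetical fragment header; a real allocator keeps more bookkeeping. */
struct frag_sketch {
    size_t size;
    int is_free;
};

enum free_result { FREE_OK = 0, FREE_DOUBLE = 1 };

/* Marks the fragment free. A second call on the same fragment is detected
 * and ignored instead of corrupting the free list. */
static enum free_result frag_free(struct frag_sketch *f)
{
    if (f->is_free)
        return FREE_DOUBLE;   /* report to the caller, change nothing */
    f->is_free = 1;
    return FREE_OK;
}
```

A debugging allocator could call abort() in the FREE_DOUBLE branch instead, which is the q_malloc behaviour described above and makes the offending caller visible in a core dump.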