SER crash : Segmentation fault


inge

13 Aug 2009

11:47 a.m.

Hi all,

My SER process crashed today with the following entries in /var/log/messages:

ser[378]: child process 418 exited by a signal 11
ser[378]: core was generated
ser[378]: INFO: terminating due to SIGCHLD
ser[421]: INFO: signal 15 received
...

Can someone help me determine what kind of problem this is? I think I need to use gdb to extract some information from the core dump. How can I use it to extract the useful information?

Regards,

Adrien
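
The signal numbers in these log lines can be translated into names, which makes the sequence easier to read. The following is a minimal sketch, assuming the usual Linux signal numbering; the `LOG` text is copied from the message above and the function name `signal_names` is only illustrative.

```python
import re
import signal

# Log lines quoted in the message above.
LOG = """\
ser[378]: child process 418 exited by a signal 11
ser[378]: core was generated
ser[378]: INFO: terminating due to SIGCHLD
ser[421]: INFO: signal 15 received
"""


def signal_names(log_text):
    """Return (number, name) for every 'signal N' mentioned in the log."""
    found = []
    for match in re.finditer(r"\bsignal (\d+)\b", log_text):
        number = int(match.group(1))
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown"  # number not defined on this platform
        found.append((number, name))
    return found


if __name__ == "__main__":
    for number, name in signal_names(LOG):
        print(f"signal {number} = {name}")
```

On Linux, signal 11 is SIGSEGV (the segmentation fault in the subject line) and signal 15 is SIGTERM, which fits a child process crashing and the remaining processes then being told to terminate.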


Klaus Darilion

13 Aug

11:53 a.m.

Locate the core file (either in the working dir or /tmp or /), then execute:

gdb /usr/local/sbin/ser /path/to/core
(gdb) bt

regards
klaus
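
The same backtrace can be captured without an interactive session, which is convenient when the output has to be attached to a mail. This is a sketch only: it relies on gdb's `-batch` and `-ex` options, reuses the binary and core paths from the message above as placeholders, and the helper names are hypothetical.

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a non-interactive gdb invocation that prints a backtrace."""
    # -batch makes gdb exit once the commands have run; -ex runs one command.
    return ["gdb", "-batch", "-ex", "bt", binary, core]


def run_backtrace(binary, core):
    """Run gdb and return its combined stdout/stderr as text."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # gdb's exit status is not checked here
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths; replace them with the real binary and core file.
    print(" ".join(gdb_backtrace_command("/usr/local/sbin/ser", "/path/to/core")))
```

Calling `run_backtrace("/usr/local/sbin/ser", "/path/to/core")` would return the text to save or attach, provided gdb is installed and the core file matches the binary.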

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

inge

1:32 p.m.

Hi Klaus,

Thanks.

I have attached the gdb output.

I hope someone can decipher it. Thank you.

Regards,

Adrien


Andrei Pelinescu-Onciul

14 Aug

12:45 p.m.

On Aug 13, 2009 at 15:32, inge <inge@legos.fr> wrote:

> I have attached the gdb output.
>
> I hope someone can decipher it. Thank you.

If you are using ser 2.1/latest cvs or sip-router then just update to the latest cvs or git. It's a known fixed bug (sip router git 6fcd5e or ser 2.1 commit starting with "rr: fix from header access").

If you are using another version then tell me which one (ser -V) and I'll fix it.

Andrei

gdb output attached to inge's message of 13 Aug, 1:32 p.m.:

#0  0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
222     if (memcmp(get_to(ack)->tag_value.s,p_cell->uas.local_totag.s,
(gdb) bt
#0  0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
#1  0x00e96aff in t_lookup_request (p_msg=0x81647e8, leave_new_locked=1) at t_lookup.c:421
#2  0x00e992a0 in t_newtran (p_msg=0x81647e8) at t_lookup.c:1085
#3  0x00e9116a in t_relay_to (p_msg=0x81647e8, proxy=0x0, proto=0, replicate=0) at t_funcs.c:224
#4  0x00e9c410 in w_t_relay (p_msg=0x81647e8, _foo=0x0, _bar=0x0) at tm.c:889
#5  0x0804fc81 in do_action (a=0x8117818, msg=0x81647e8) at action.c:610
#6  0x0805099d in run_actions (a=0x8117818, msg=0x81647e8) at action.c:718
#7  0x08073f08 in eval_elem (e=0x8117840, msg=0x81647e8) at route.c:605
#8  0x08074392 in eval_expr (e=0x8117840, msg=0x81647e8) at route.c:654
#9  0x080743ce in eval_expr (e=0x8117860, msg=0x81647e8) at route.c:670
#10 0x0804ec95 in do_action (a=0x8117bc8, msg=0x81647e8) at action.c:586
#11 0x0805099d in run_actions (a=0x8117630, msg=0x81647e8) at action.c:718
#12 0x0804ffdf in do_action (a=0x8114f70, msg=0x81647e8) at action.c:375
#13 0x0805099d in run_actions (a=0x8114f70, msg=0x81647e8) at action.c:718
#14 0x0804ecd3 in do_action (a=0x8114fc0, msg=0x81647e8) at action.c:603
#15 0x0805099d in run_actions (a=0x8114fc0, msg=0x81647e8) at action.c:718
#16 0x0804ecd3 in do_action (a=0x8114fe8, msg=0x81647e8) at action.c:603
#17 0x0805099d in run_actions (a=0x8114fe8, msg=0x81647e8) at action.c:718
#18 0x0804ecd3 in do_action (a=0x8115010, msg=0x81647e8) at action.c:603
#19 0x0805099d in run_actions (a=0x8115010, msg=0x81647e8) at action.c:718
#20 0x0804ecd3 in do_action (a=0x8115038, msg=0x81647e8) at action.c:603
#21 0x0805099d in run_actions (a=0x8115038, msg=0x81647e8) at action.c:718
#22 0x0804ecd3 in do_action (a=0x8115060, msg=0x81647e8) at action.c:603
#23 0x0805099d in run_actions (a=0x810fe88, msg=0x81647e8) at action.c:718
#24 0x0806d062 in receive_msg (buf=0x80d61e0 "ACK sip:0389719641@domain.tld:5060 SIP/2.0\r\nMax-Forwards: 16\r\nContent-Length: 0\r\nVia: SIP/2.0/UDP 10.0.140.147:5060;branch=z9hG4bK4f1b8571c\r\nCall-ID: bf85c76a5e2066256679e3945f6b4e36@10.0.140.147\r\nF"..., len=592, rcv_info=0xbff76340) at receive.c:165
#25 0x080843cc in udp_rcv_loop () at udp_server.c:472
#26 0x0805cdaf in main_loop () at main.c:1056
#27 0x0805e40b in main (argc=1, argv=0xbff76504) at main.c:1592
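
A backtrace like this is easier to scan once each frame is reduced to its function and source location. The sketch below parses a three-frame excerpt copied from the output above; the regular expression and helper names are illustrative and only cover frames of the form shown in this thread.

```python
import re

# Three frames copied from the backtrace above; the remaining frames are omitted.
BACKTRACE = """\
#0 0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
#1 0x00e96aff in t_lookup_request (p_msg=0x81647e8, leave_new_locked=1) at t_lookup.c:421
#2 0x00e992a0 in t_newtran (p_msg=0x81647e8) at t_lookup.c:1085
"""

# "#N 0xADDR in FUNC (ARGS) at FILE:LINE"
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+0x[0-9a-f]+ in (?P<func>\w+) \(.*\) "
    r"at (?P<file>[\w.]+):(?P<line>\d+)$"
)


def parse_frames(text):
    """Return (frame number, function, file, line) for each matching line."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            frames.append(
                (int(match["num"]), match["func"], match["file"], int(match["line"]))
            )
    return frames


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    return value - (1 << 32) if value >= (1 << 31) else value


if __name__ == "__main__":
    for num, func, file, line in parse_frames(BACKTRACE):
        print(f"#{num} {func} ({file}:{line})")
    print(hex(4294967294), as_signed32(4294967294))
```

Frame #0 is where the process crashed: `matching_3261` at `t_lookup.c:222`, on the `memcmp` line that gdb prints. Frame #24 of the full backtrace shows that the request being processed was an ACK. The `skip_method` value 4294967294 is 0xfffffffe (-2 as a signed 32-bit integer), which looks like a bit mask rather than a count; the thread does not say how the tm code interprets it.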


inge

1:01 p.m.

Hi Andrei,

Thanks for your reply.

I use ser 0.9.5-pre4.

I don't really understand the bug you have identified. Where can I find a description?

Regards,

Adrien


Klaus Darilion

1:42 p.m.

inge wrote:

> I use ser 0.9.5-pre4.

ser 0.9 is very, very old.

I recommend updating to a newer version, e.g.
- Kamailio 1.5 (stable version of the ser fork), or
- sip-router (development version of the joint ser/Kamailio code)

regards klaus


Andrei Pelinescu-Onciul

2:34 p.m.

On Aug 14, 2009 at 15:01, inge <inge@legos.fr> wrote:

> I use ser 0.9.5-pre4.
>
> I don't really understand the bug you have identified. Where can I find a description?

Sorry, I was wrong (that bug was in RR and appears only in newer code).

Could you run gdb on the core again, type "frame 0" and then send me the output of the following commands:

print p_cell
print p_msg
print p_msg->buf
print p_cell->uas.local_totag.len
print p_cell->uas.local_totag.s
print p_msg->to
print p_msg->to->parsed
print *((struct to_body*)(p_msg->to->parsed))
print ((struct to_body*)(p_msg->to->parsed))->tag_value.len
print ((struct to_body*)(p_msg->to->parsed))->tag_value.s

Andrei

P.S.: you could also try upgrading to ser 2.0, 2.1 or sip-router.
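
The requested commands can also be run in one non-interactive pass. The sketch below passes each command to gdb with its own `-ex` option; to my understanding an error in one `-ex` command (for example a `print` through a null pointer) does not prevent the following ones from running, but that is worth confirming on the gdb version in use. The paths are the placeholders used earlier in the thread and the function name is hypothetical.

```python
import subprocess

# Commands requested in the message above; "frame 0" selects the crashing frame.
GDB_COMMANDS = [
    "frame 0",
    "print p_cell",
    "print p_msg",
    "print p_msg->buf",
    "print p_cell->uas.local_totag.len",
    "print p_cell->uas.local_totag.s",
    "print p_msg->to",
    "print p_msg->to->parsed",
    "print *((struct to_body*)(p_msg->to->parsed))",
    "print ((struct to_body*)(p_msg->to->parsed))->tag_value.len",
    "print ((struct to_body*)(p_msg->to->parsed))->tag_value.s",
]


def gdb_inspect_command(binary, core, commands=GDB_COMMANDS):
    """Build a gdb command line that runs every command, then exits."""
    argv = ["gdb", "-batch"]
    for command in commands:
        argv += ["-ex", command]  # one -ex option per gdb command
    return argv + [binary, core]


def inspect_core(binary, core):
    """Run gdb and return everything it printed."""
    result = subprocess.run(
        gdb_inspect_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return result.stdout
```

The two pointers compared on the crashing line are `get_to(ack)->tag_value.s` and `p_cell->uas.local_totag.s`, so the `.s` and `.len` values in the output are the ones to look at first.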


inge

3:03 p.m.

Please find the requested information attached.

I'm aware of the need for an update. It's on the list of tasks to be done; however, the priority is to troubleshoot the problem and maybe find a workaround.

Regards,

Adrien


inge

17 Aug

12:42 p.m.

Hi Andrei,

I hope you are well. Do you have any update on our crash? Is there anything we can do to find the cause of the segmentation fault (perhaps a well-known bug) without bothering you?

Sincerely,

Adrien


Andrei Pelinescu-Onciul

18 Aug

7 a.m.

On Aug 17, 2009 at 14:42, inge <inge@legos.fr> wrote:

> Do you have any update on our crash? Is there anything we can do to find the cause of the segmentation fault (perhaps a well-known bug) without bothering you?

There are lots of changes between 0.9.5-pre and the latest 0.9.x version. You should try updating to the latest code on the rel_0_9_0 branch and see if you run into this problem again. To get the latest 0.9.x code, do one of the following:

- get the latest snapshot from http://ftp.iptel.org/pub/ser/daily-snapshots/stable/
- use cvs to get the rel_0_9_0 branch:
  CVSROOT=:pserver:anonymous@cvs.berlios.de:/cvsroot/ser ; export CVSROOT ; cvs co -r rel_0_9_0 sip_router
- use git and the ser repository (see http://sip-router.org/wiki/git/ser-repository)

Here's a short changelog for tm, between 0.9.5 and 0.9.7+ (git log --oneline v_0_9_5..origin/rel_0_9_0 modules/tm):

- tm: fix delete_cell() when the transaction is referenced
- variable timer fix: variable timers (avps) won't be exteneded anymore
- fix for free_rdata_list() which used to access the "next" pointer af
- deadlock when t_relay-ing a message from the failure_route fixed (e2e
- added sems specific patch. This patch is present in the ser version ship
- added diversion and rpid header cloning
- bug fix: tm insert_timer used to eat too much cpu, decreasing dramatic
- fixed misplaced set_avp list, courtesy of cesc.santa@gmail.com
- int2reverse_hex/reverse_hex2int fixes (tm with large "labels" was aff
- fix of local ACK matching provided by cesc.santa@gmail.com
- avp race condition fix (backported from HEAD)
- CANCEL terminates retransmission timers properly (backported)

Andrei
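
Since the crash happened in `matching_3261` while an ACK was being processed, one quick way to read a changelog like this is to filter it for related words. The sketch below does that over the entries listed above, copied as they appear (including the truncated ones); the keyword list and function name are assumptions for illustration. A match is only a hint about where to look, and nothing in the thread confirms that any of these commits fixes this particular crash.

```python
import re

# Changelog entries as listed in the message above (truncated entries kept as-is).
CHANGELOG = [
    "tm: fix delete_cell() when the transaction is referenced",
    "variable timer fix: variable timers (avps) won't be exteneded anymore",
    'fix for free_rdata_list() which used to access the "next" pointer af',
    "deadlock when t_relay-ing a message from the failure_route fixed (e2e",
    "added sems specific patch. This patch is present in the ser version ship",
    "added diversion and rpid header cloning",
    "bug fix: tm insert_timer used to eat too much cpu, decreasing dramatic",
    "fixed misplaced set_avp list, courtesy of cesc.santa@gmail.com",
    'int2reverse_hex/reverse_hex2int fixes (tm with large "labels" was aff',
    "fix of local ACK matching provided by cesc.santa@gmail.com",
    "avp race condition fix (backported from HEAD)",
    "CANCEL terminates retransmission timers properly (backported)",
]


def entries_mentioning(entries, words):
    """Return entries containing any of the words as whole words, ignoring case."""
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [entry for entry in entries if pattern.search(entry)]


if __name__ == "__main__":
    for entry in entries_mentioning(CHANGELOG, ["ACK", "matching"]):
        print(entry)
```

Whole-word matching matters here: a plain substring search for "ack" would also pick up the two "backported" entries.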

...

Le vendredi 14 ao??t 2009 ?? 17:03 +0200, inge a ??crit :

...
Please find the requested information in attached.

I'm aware of the need for an update. It's in the list of tasks to be done, however, the priority is to troubleshoot the problem and maybe find a workaround.

Regards,

Adrien

On Friday, 14 August 2009 at 16:34 +0200, Andrei Pelinescu-Onciul wrote:

...
On Aug 14, 2009 at 15:01, inge inge@legos.fr wrote:

...
Hi Andrei,

Thanks for your reply.

I use ser 0.9.5-pre4.

I don't really understand the bug you have identified. Where can I find a description?

Sorry, I was wrong (that bug was in RR and appears only in newer code).

Could you run gdb on the core again, type "frame 0" and then send me the output of the following commands:

print p_cell
print p_msg
print p_msg->buf
print p_cell->uas.local_totag.len
print p_cell->uas.local_totag.s
print p_msg->to
print p_msg->to->parsed
print *((struct to_body*)(p_msg->to->parsed))
print ((struct to_body*)(p_msg->to->parsed))->tag_value.len
print ((struct to_body*)(p_msg->to->parsed))->tag_value.s
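The commands listed above can also be run non-interactively. A minimal sketch, assuming gdb's batch mode (`-batch`, `-x`) is available; the file names `ser-frame0.gdb` and `ser-frame0.txt` are made up for illustration, and the binary and core paths are the placeholders used earlier in the thread.

```shell
#!/bin/sh
# Hypothetical file name for the gdb command list.
CMDS=ser-frame0.gdb

# Quoted heredoc: nothing inside is expanded by the shell.
cat > "$CMDS" <<'EOF'
frame 0
print p_cell
print p_msg
print p_msg->buf
print p_cell->uas.local_totag.len
print p_cell->uas.local_totag.s
print p_msg->to
print p_msg->to->parsed
print *((struct to_body*)(p_msg->to->parsed))
print ((struct to_body*)(p_msg->to->parsed))->tag_value.len
print ((struct to_body*)(p_msg->to->parsed))->tag_value.s
EOF

# Placeholder paths; override them from the environment.
SER_BIN=${SER_BIN:-/usr/local/sbin/ser}
CORE=${CORE:-/path/to/core}

if command -v gdb >/dev/null 2>&1 && [ -r "$CORE" ]; then
    # Run the command file against the core and keep the output for mailing.
    gdb -batch -x "$CMDS" "$SER_BIN" "$CORE" > ser-frame0.txt 2>&1
else
    echo "gdb or core file not available; wrote $CMDS only" >&2
fi
```

If one of the print commands fails (for example on an invalid pointer), gdb may stop reading the command file at that point; the remaining commands can then be entered by hand at the (gdb) prompt.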

Andrei

P.S.: you could also try upgrading to ser 2.0, 2.1 or sip-router.

...
Regards,

Adrien

On Friday, 14 August 2009 at 14:45 +0200, Andrei Pelinescu-Onciul wrote:

On Aug 13, 2009 at 15:32, inge inge@legos.fr wrote:

Hi Klaus,

Thanks.

I have attached the gdb output.

I hope someone can decipher it. Thank you.

If you are using ser 2.1/latest cvs or sip-router then just update to the latest cvs or git. It's a known fixed bug (sip router git 6fcd5e or ser 2.1 commit starting with "rr: fix from header access").

If you are using another version then tell me which one (ser -V) and I'll fix it.

Andrei

...
#0 0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
222 if (memcmp(get_to(ack)->tag_value.s,p_cell->uas.local_totag.s,
(gdb) bt
#0 0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
#1 0x00e96aff in t_lookup_request (p_msg=0x81647e8, leave_new_locked=1) at t_lookup.c:421
#2 0x00e992a0 in t_newtran (p_msg=0x81647e8) at t_lookup.c:1085
#3 0x00e9116a in t_relay_to (p_msg=0x81647e8, proxy=0x0, proto=0, replicate=0) at t_funcs.c:224
#4 0x00e9c410 in w_t_relay (p_msg=0x81647e8, _foo=0x0, _bar=0x0) at tm.c:889
#5 0x0804fc81 in do_action (a=0x8117818, msg=0x81647e8) at action.c:610
#6 0x0805099d in run_actions (a=0x8117818, msg=0x81647e8) at action.c:718
#7 0x08073f08 in eval_elem (e=0x8117840, msg=0x81647e8) at route.c:605
#8 0x08074392 in eval_expr (e=0x8117840, msg=0x81647e8) at route.c:654
#9 0x080743ce in eval_expr (e=0x8117860, msg=0x81647e8) at route.c:670
#10 0x0804ec95 in do_action (a=0x8117bc8, msg=0x81647e8) at action.c:586
#11 0x0805099d in run_actions (a=0x8117630, msg=0x81647e8) at action.c:718
#12 0x0804ffdf in do_action (a=0x8114f70, msg=0x81647e8) at action.c:375
#13 0x0805099d in run_actions (a=0x8114f70, msg=0x81647e8) at action.c:718
#14 0x0804ecd3 in do_action (a=0x8114fc0, msg=0x81647e8) at action.c:603
#15 0x0805099d in run_actions (a=0x8114fc0, msg=0x81647e8) at action.c:718
#16 0x0804ecd3 in do_action (a=0x8114fe8, msg=0x81647e8) at action.c:603
#17 0x0805099d in run_actions (a=0x8114fe8, msg=0x81647e8) at action.c:718
#18 0x0804ecd3 in do_action (a=0x8115010, msg=0x81647e8) at action.c:603
#19 0x0805099d in run_actions (a=0x8115010, msg=0x81647e8) at action.c:718
#20 0x0804ecd3 in do_action (a=0x8115038, msg=0x81647e8) at action.c:603
#21 0x0805099d in run_actions (a=0x8115038, msg=0x81647e8) at action.c:718
#22 0x0804ecd3 in do_action (a=0x8115060, msg=0x81647e8) at action.c:603
#23 0x0805099d in run_actions (a=0x810fe88, msg=0x81647e8) at action.c:718
#24 0x0806d062 in receive_msg (buf=0x80d61e0 "ACK sip:0389719641@domain.tld:5060 SIP/2.0\r\nMax-Forwards: 16\r\nContent-Length: 0\r\nVia: SIP/2.0/UDP 10.0.140.147:5060;branch=z9hG4bK4f1b8571c\r\nCall-ID: bf85c76a5e2066256679e3945f6b4e36@10.0.140.147\r\nF"..., len=592, rcv_info=0xbff76340) at receive.c:165
#25 0x080843cc in udp_rcv_loop () at udp_server.c:472
#26 0x0805cdaf in main_loop () at main.c:1056
#27 0x0805e40b in main (argc=1, argv=0xbff76504) at main.c:1592

...


inge

20 Aug

8:40 a.m.

Hi Andrei,

As I understand it, this changelog applies only to the tm module. Is there any indication that this module caused the crash we experienced?

We would like to determine which of the known, already fixed bugs could have caused the crash, so that we can find a short-term workaround that gives us time to deploy an upgrade to the latest release in the 0.9.0 branch.

Adrien


inge

4 Sep

4:06 p.m.

Hello Andrei,

I wonder what is involved in migrating from 0.9.5 to SER 2.0. I read, for example, that a UID is now used to identify users instead of the username, so we can expect an impact on our internal processes and on SERWeb.

Is there a detailed list of these changes somewhere?

Would it not be more reasonable to migrate to 0.9.7, mainly to fix our original segmentation fault and optionally to gain new features?

Thank you for your opinion on these issues.

Regards,

Adrien


Henning Westerholt

9 Sep

10:38 a.m.

On Friday, 4 September 2009, inge wrote:

> I wonder what is involved in migrating from 0.9.5 to SER 2.0. I read, for example, that a UID is now used to identify users instead of the username, so we can expect an impact on our internal processes and on SERWeb.
>
> Is there a detailed list of these changes somewhere?

Hi Adrien,

I don't know if there is a single changelog file covering all the differences between 0.9.5 and SER 2.0. But I think you could look at the changelogs of each major release in between (e.g. 1.X up to 2.0), and also at the NEWS file in the 2.0 release.

Cheers,

Henning

Andrei Pelinescu-Onciul

11:11 a.m.

On Sep 04, 2009 at 18:06, inge inge@legos.fr wrote:

> Hello Andrei,
>
> I wonder what is involved in migrating from 0.9.5 to SER 2.0. I read, for example, that a UID is now used to identify users instead of the username, so we can expect an impact on our internal processes and on SERWeb.

Yes, uid and did (for domains) are now used internally. There is a script which might help in migrating a database: http://www.iptel.org/ser/migrate_db

> Is there a detailed list of these changes somewhere?

Try
http://www.iptel.org/basic_changes_in_configuration_file_0
http://git.sip-router.org/cgi-bin/gitweb.cgi?p=ser;a=blob;f=ser/NEWS;h=02543...
http://git.sip-router.org/cgi-bin/gitweb.cgi?p=ser;a=blob;f=ser/ChangeLog;h=...

I don't remember exactly; hopefully there is a more complete migration guide somewhere (maybe somebody else can help if I missed anything).

> Would it not be more reasonable to migrate to 0.9.7, mainly to fix our original segmentation fault and optionally to gain new features?

Yes, in your case migration to 0.9.7 should be painless (there are no config or db changes, only bugfixes, so you wouldn't need to change anything).

Andrei

inge

10 Sep

4:36 p.m.

Hi Andrei,

Thank you for this detailed answer.

Indeed, updating to 0.9.7 will be easier than moving to SER 2.0. Our priority is to avoid as many bugs as possible.

We are not able to reproduce the crash, so we'll see. We will probably update to 0.9.7.

Regards,

Adrien


inge

11 Sep

5:01 p.m.

Hi Andrei,

A new crash happened today!

It was impossible to restart SER. During that time I was able to collect syslog output by enabling SER debugging (i.e. facility local0, etc.).

The fix was to flush the location table so that SER could be restarted.

I have attached the gdb output and the syslog trace to this email. Can you tell whether the problem is the same as the previous one?

Regards,

Adrien


Andrei Pelinescu-Onciul

16 Sep

10:39 a.m.

On Sep 11, 2009 at 19:01, inge inge@legos.fr wrote:

> Hi Andrei,
>
> A new crash happened today!
>
> It was impossible to restart SER. During that time I was able to collect syslog output by enabling SER debugging (i.e. facility local0, etc.).
>
> The fix was to flush the location table so that SER could be restarted.
>
> I have attached the gdb output and the syslog trace to this email. Can you tell whether the problem is the same as the previous one?

Yes, it's the same.

Andrei

Andrei Pelinescu-Onciul

9 Sep

10:26 a.m.

On Aug 20, 2009 at 10:40, inge inge@legos.fr wrote:

> Hi Andrei,
>
> As I understand it, this changelog applies only to the tm module. Is there any indication that this module caused the crash we experienced?

Yes, according to the backtrace it crashed in tm. It looks like the tag value was corrupted (one possible explanation is that matching against a deleted transaction was attempted). It's also possible, but much less likely, that despite the backtrace info the crash is not related to tm (e.g. some other module corrupting shared memory).
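To see at a glance which source files the innermost frames belong to, the backtrace can be summarised with a few lines of shell. This is a sketch: the frames are copied from the gdb output posted earlier in the thread, while the file name `bt-top.txt` and the function `summarize_bt` are illustrative.

```shell
#!/bin/sh
# Innermost frames, copied from the gdb output posted earlier in the thread.
cat > bt-top.txt <<'EOF'
#0 0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
#1 0x00e96aff in t_lookup_request (p_msg=0x81647e8, leave_new_locked=1) at t_lookup.c:421
#2 0x00e992a0 in t_newtran (p_msg=0x81647e8) at t_lookup.c:1085
#3 0x00e9116a in t_relay_to (p_msg=0x81647e8, proxy=0x0, proto=0, replicate=0) at t_funcs.c:224
#4 0x00e9c410 in w_t_relay (p_msg=0x81647e8, _foo=0x0, _bar=0x0) at tm.c:889
#5 0x0804fc81 in do_action (a=0x8117818, msg=0x81647e8) at action.c:610
EOF

# Hypothetical helper: print "frame function file:line" for every frame line.
# Field 4 is the function name; the last field is the source location.
summarize_bt() {
    awk '/^#[0-9]+ / && $3 == "in" { print $1, $4, $NF }' "$1"
}

summarize_bt bt-top.txt
```

Frames #0 to #4 are in t_lookup.c, t_funcs.c and tm.c, which are tm module sources, while action.c in frame #5 belongs to the SER core.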

> We would like to determine which of the known, already fixed bugs could have caused the crash, so that we can find a short-term workaround that gives us time to deploy an upgrade to the latest release in the 0.9.0 branch.

That would be quite hard, since we don't yet know whether the crash is really fixed in the latest 0.9.x. If you can reproduce the crash, you could try a test installation of the latest 0.9.x and see whether the crash is fixed. It's very easy to upgrade between 0.9.x versions: there are no config or db changes, the only differences are bug fixes.

If it still crashes with the latest 0.9.x, then the next step would be to compile it with debugging info, in an attempt to get more meaningful backtraces.

Andrei

...

Le mardi 18 ao??t 2009 ?? 09:00 +0200, Andrei Pelinescu-Onciul a ??crit :

...
On Aug 17, 2009 at 14:42, inge inge@legos.fr wrote:

...
Hi Andrei,

Hope you are fine. Do you have any update on our crash ? Is there anything we can do to find the segmentation fault cause, maybe as a well-known bug, without bothering you ?

There are lots of changes between 0.9.5-pre and the latest 0.9.x version. You should try updating to the latest code on the rel_0_9_0 branch and see if you run into this problem again. To get the latest 0.9.x code either get the latest snapshot from http://ftp.iptel.org/pub/ser/daily-snapshots/stable/ , use cvs to get the rel_0_9_0 branch (CVSROOT=:pserver:anonymous@cvs.berlios.de:/cvsroot/ser ; export CVSROOT ; cvs co -r rel_0_9_0 sip_router ), or use git and the ser repository (see http://sip-router.org/wiki/git/ser-repository).

Here's a short changelog for tm, between 0.9.5 and 0.9.7+ (git log --oneline v_0_9_5..origin/rel_0_9_0 modules/tm):

tm: fix delete_cell() when the transaction is referenced

variable timer fix: variable timers (avps) won't be exteneded anymore

fix for free_rdata_list() which used to access the "next" pointer af

deadlock when t_relay-ing a message from the failure_route fixed (e2e

added sems specific patch. This patch is present in the ser version ship

added diversion and rpid header cloning

-bug fix: tm insert_timer used to eat too much cpu, decreasing dramatic

fixed misplaced set_avp list, courtesy of cesc.santa@gmail.com

int2reverse_hex/reverse_hex2int fixes (tm with large "labels" was aff

fix of local ACK matching provided by cesc.santa@gmail.com

avp race condition fix (backported from HEAD)

CANCEL terminates retransmission timers properly (backported)

Andrei

...
Le vendredi 14 ao??t 2009 ?? 17:03 +0200, inge a ??crit :

...
Please find the requested information in attached.

I'm aware of the need for an update. It's in the list of tasks to be done, however, the priority is to troubleshoot the problem and maybe find a workaround.

Regards,

Adrien

Le vendredi 14 ao??t 2009 ?? 16:34 +0200, Andrei Pelinescu-Onciul a ??crit :

...
On Aug 14, 2009 at 15:01, inge inge@legos.fr wrote:

...
Hi Andrei,

Thanks for your reply.

I use ser 0.9.5-pre4.

I don't really understand the bug you have identify, where can I find a description ?

Sorry, I was wrong (that bug was in RR and appears only in newer code).

Could you run gdb on the core again , type "frame 0" and then send me the output of the following commands:

print p_cell print p_msg print p_msg->buf print p_cell->uas.local_totag.len print p_cell->uas.local_totag.s print p_msg->to print p_msg->to->parsed print *((struct to_body*)(p_msg->to->parsed)) print ((struct to_body*)(p_msg->to->parsed))->tag_value.len print ((struct to_body*)(p_msg->to->parsed))->tag_value.s

Andrei P.S.: you could try also upgrading to ser 2.0, 2.1 or sip-router.

...
Regards,

Adrien

On Friday 14 August 2009 at 14:45 +0200, Andrei Pelinescu-Onciul wrote:
> On Aug 13, 2009 at 15:32, inge inge@legos.fr wrote:
> > Hi Klaus,
> >
> > Thanks.
> >
> > I put the output of gdb in attached.
> >
> > I hope someone can decrypt this. Thank you.
>
> If you are using ser 2.1/latest cvs or sip-router then just update to
> the latest cvs or git. It's a known fixed bug (sip router
> git 6fcd5e or ser 2.1 commit starting with "rr: fix from header
> access").
>
> If you are using another version then tell me which one (ser -V)
> and I'll fix it.
>
> Andrei
>
> > On Thursday 13 August 2009 at 13:53 +0200, Klaus Darilion wrote:
> > > locate the core file (either in the working dir or /tmp or /)
> > > then execute:
> > >
> > > gdb /usr/local/sbin/ser /path/to/core
> > > (gdb) bt
> > >
> > > regards
> > > klaus
> > >
> > > inge wrote:
> > > > Hi all,
> > > >
> > > > My SER process had crashed today with the following logs
> > > > in /var/log/messages :
> > > >
> > > > ser[378]: child process 418 exited by a signal 11
> > > > ser[378]: core was generated
> > > > ser[378]: INFO: terminating due to SIGCHLD
> > > > ser[421]: INFO: signal 15 received
> > > > ...
> > > >
> > > > Can someone help me to determine what kind of problem is it ? I think I
> > > > need to use gdb to extract some information from the core dump. How can
> > > > I use it to extract the uses informations ?
> > > >
> > > > Regards,
> > > >
> > > > Adrien
>
> > #0 0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
> > 222 if (memcmp(get_to(ack)->tag_value.s,p_cell->uas.local_totag.s,
> > (gdb) bt
> > #0 0x00e964d3 in matching_3261 (p_msg=0x81647e8, trans=0xbff74f38, skip_method=4294967294) at t_lookup.c:222
> > #1 0x00e96aff in t_lookup_request (p_msg=0x81647e8, leave_new_locked=1) at t_lookup.c:421
> > #2 0x00e992a0 in t_newtran (p_msg=0x81647e8) at t_lookup.c:1085
> > #3 0x00e9116a in t_relay_to (p_msg=0x81647e8, proxy=0x0, proto=0, replicate=0) at t_funcs.c:224
> > #4 0x00e9c410 in w_t_relay (p_msg=0x81647e8, _foo=0x0, _bar=0x0) at tm.c:889
> > #5 0x0804fc81 in do_action (a=0x8117818, msg=0x81647e8) at action.c:610
> > #6 0x0805099d in run_actions (a=0x8117818, msg=0x81647e8) at action.c:718
> > #7 0x08073f08 in eval_elem (e=0x8117840, msg=0x81647e8) at route.c:605
> > #8 0x08074392 in eval_expr (e=0x8117840, msg=0x81647e8) at route.c:654
> > #9 0x080743ce in eval_expr (e=0x8117860, msg=0x81647e8) at route.c:670
> > #10 0x0804ec95 in do_action (a=0x8117bc8, msg=0x81647e8) at action.c:586
> > #11 0x0805099d in run_actions (a=0x8117630, msg=0x81647e8) at action.c:718
> > #12 0x0804ffdf in do_action (a=0x8114f70, msg=0x81647e8) at action.c:375
> > #13 0x0805099d in run_actions (a=0x8114f70, msg=0x81647e8) at action.c:718
> > #14 0x0804ecd3 in do_action (a=0x8114fc0, msg=0x81647e8) at action.c:603
> > #15 0x0805099d in run_actions (a=0x8114fc0, msg=0x81647e8) at action.c:718
> > #16 0x0804ecd3 in do_action (a=0x8114fe8, msg=0x81647e8) at action.c:603
> > #17 0x0805099d in run_actions (a=0x8114fe8, msg=0x81647e8) at action.c:718
> > #18 0x0804ecd3 in do_action (a=0x8115010, msg=0x81647e8) at action.c:603
> > #19 0x0805099d in run_actions (a=0x8115010, msg=0x81647e8) at action.c:718
> > #20 0x0804ecd3 in do_action (a=0x8115038, msg=0x81647e8) at action.c:603
> > #21 0x0805099d in run_actions (a=0x8115038, msg=0x81647e8) at action.c:718
> > #22 0x0804ecd3 in do_action (a=0x8115060, msg=0x81647e8) at action.c:603
> > #23 0x0805099d in run_actions (a=0x810fe88, msg=0x81647e8) at action.c:718
> > #24 0x0806d062 in receive_msg (
> >     buf=0x80d61e0 "ACK sip:0389719641@domain.tld:5060 SIP/2.0\r\nMax-Forwards: 16\r\nContent-Length: 0\r\nVia: SIP/2.0/UDP 10.0.140.147:5060;branch=z9hG4bK4f1b8571c\r\nCall-ID: bf85c76a5e2066256679e3945f6b4e36@10.0.140.147\r\nF"..., len=592, rcv_info=0xbff76340) at receive.c:165
> > #25 0x080843cc in udp_rcv_loop () at udp_server.c:472
> > #26 0x0805cdaf in main_loop () at main.c:1056
> > #27 0x0805e40b in main (argc=1, argv=0xbff76504) at main.c:1592


inge

16 Sep 16 Sep

8:47 a.m.

Hi Andrei,

I'm Nicolas and I'm working with Adrien on crashes experienced on our SER server during the last months.

We had 4 crashes, on 11 Jun 2009, 13 Aug 2009, 11 Sep 2009 and 12 Sep 2009. Each of these crashes has a similar call flow, as seen in the one attached: SER crashes when trying to process an ACK from the CPE for the previously relayed "482 Loop Detected" from the gateway.

...

From the coredump analysis, the crash occurs when trying to match the ACK to-tag against the out-of-bounds local_totag of the corresponding tm entry (see attached coredump analysis).

It seems to me that there is a bug, and I didn't find any patch for it, even in the latest 2.0 versions.

Do you have any idea about this problem? Is this bug already known?

Sincerely,

Nicolas LEROY

On Wednesday 9 September 2009 at 12:26 +0200, Andrei Pelinescu-Onciul wrote:

...

On Aug 20, 2009 at 10:40, inge inge@legos.fr wrote:

...
Hi Andrei,

As I understand it, this changelog only applies to the tm module. Is there any clue that this module caused the crash we experienced?

Yes, according to the backtrace it crashed in tm. It looks like the tag value was corrupted (one possible explanation is that matching against a deleted transaction was attempted). It's also possible, but much less likely, that despite the backtrace info the crash is not related to tm (e.g. some other module corrupting shared memory).

...
We would like to determine which of the known and corrected bugs could have caused the crash, in order to find a short-term workaround that gives us some time to deploy an upgrade to the latest release in the 0.9.0 branch.

That would be quite hard, since we don't know yet if the crash is really fixed in the latest 0.9.x. If you can reproduce the crash, then you could try a test installation of the latest 0.9.x and see if the crash is fixed. It's very easy to upgrade between 0.9.x versions. There are no config or db changes; the only differences are bug fixes.

If it still crashes with the latest 0.9.x, then the next step would be to compile it with debugging info, in an attempt to get more meaningful backtraces.

Andrei

...
On Tuesday 18 August 2009 at 09:00 +0200, Andrei Pelinescu-Onciul wrote:

...
On Aug 17, 2009 at 14:42, inge inge@legos.fr wrote:

...
Hi Andrei,

Hope you are fine. Do you have any update on our crash ? Is there anything we can do to find the segmentation fault cause, maybe as a well-known bug, without bothering you ?

There are lots of changes between 0.9.5-pre and the latest 0.9.x version. You should try updating to the latest code on the rel_0_9_0 branch and see if you run into this problem again. To get the latest 0.9.x code either get the latest snapshot from http://ftp.iptel.org/pub/ser/daily-snapshots/stable/ , use cvs to get the rel_0_9_0 branch (CVSROOT=:pserver:anonymous@cvs.berlios.de:/cvsroot/ser ; export CVSROOT ; cvs co -r rel_0_9_0 sip_router ), or use git and the ser repository (see http://sip-router.org/wiki/git/ser-repository).

Here's a short changelog for tm, between 0.9.5 and 0.9.7+ (git log --oneline v_0_9_5..origin/rel_0_9_0 modules/tm):

tm: fix delete_cell() when the transaction is referenced

variable timer fix: variable timers (avps) won't be exteneded anymore

fix for free_rdata_list() which used to access the "next" pointer af

deadlock when t_relay-ing a message from the failure_route fixed (e2e

added sems specific patch. This patch is present in the ser version ship

added diversion and rpid header cloning

-bug fix: tm insert_timer used to eat too much cpu, decreasing dramatic

fixed misplaced set_avp list, courtesy of cesc.santa@gmail.com

int2reverse_hex/reverse_hex2int fixes (tm with large "labels" was aff

fix of local ACK matching provided by cesc.santa@gmail.com

avp race condition fix (backported from HEAD)

CANCEL terminates retransmission timers properly (backported)

Andrei

...
On Friday 14 August 2009 at 17:03 +0200, inge wrote:

...
Please find the requested information attached.

I'm aware of the need for an update. It's in the list of tasks to be done, however, the priority is to troubleshoot the problem and maybe find a workaround.

Regards,

Adrien

On Friday 14 August 2009 at 16:34 +0200, Andrei Pelinescu-Onciul wrote:

...
On Aug 14, 2009 at 15:01, inge inge@legos.fr wrote:
> Hi Andrei,
>
> Thanks for your reply.
>
> I use ser 0.9.5-pre4.
>
> I don't really understand the bug you have identified. Where can I find a
> description?

Sorry, I was wrong (that bug was in RR and appears only in newer code).

Could you run gdb on the core again, type "frame 0" and then send me the output of the following commands:

print p_cell
print p_msg
print p_msg->buf
print p_cell->uas.local_totag.len
print p_cell->uas.local_totag.s
print p_msg->to
print p_msg->to->parsed
print *((struct to_body*)(p_msg->to->parsed))
print ((struct to_body*)(p_msg->to->parsed))->tag_value.len
print ((struct to_body*)(p_msg->to->parsed))->tag_value.s

Andrei

P.S.: you could also try upgrading to ser 2.0, 2.1 or sip-router.



Andrei Pelinescu-Onciul

10:49 a.m.

On Sep 16, 2009 at 10:47, inge inge@legos.fr wrote:

...

Hi Andrei,

I'm Nicolas and I'm working with Adrien on crashes experienced on our SER server during the last months.

We had 4 crashes, on 11 Jun 2009, 13 Aug 2009, 11 Sep 2009 and 12 Sep 2009. Each of these crashes has a similar call flow, as seen in the one attached: SER crashes when trying to process an ACK from the CPE for the previously relayed "482 Loop Detected" from the gateway.

...
From the coredump analysis, the crash occurs when trying to match the ACK to-tag against the out-of-bounds local_totag of the corresponding tm entry (see attached coredump analysis).

Yes, I saw the same thing.

...

It seems to me that there is a bug, and I didn't find any patch for this, even in the last 2.0 versions.

Yes, it's a bug, but things changed a lot between versions. It might be fixed even in 0.9.7.

...

Do you have any idea about this problem ?

No.

...

Is this bug already known ?

No.

If you can reproduce it easily, try it with 0.9.7 (it will work with the same config as 0.9.4, you don't have to change anything). If you can still see it, try compiling with debug support (make proper; make mode=debug all and also don't forget to recompile with mode=debug any other module you might be using that is not covered by make all). After this the coredumps should be "better" (more info, no variables will be optimized to registers).

Hopefully 0.9.7 will solve your problems. If it doesn't then send me again some backtraces and/or the coredump + binaries (unfortunately the code is very old and I'm not any longer familiar with it).

Andrei

inge

2:11 p.m.

Andrei,

Thanks for your update on this. As the conditions for the bug to appear are quite erratic (it seems to take several INVITEs sent to the gateway in a row, which makes it send a "482 Loop Detected" leading to the crashing ACK), we have not managed to reproduce it.

I never saw this on our 0.9.7pre1 lab version, but that one is almost idle and relays only a few test calls, compared to our production 0.9.5 one serving several thousand end users.

Is there any explanation for how we can end up with an <out of bound memory address> for the local_totag of the 100-ed tm entry?

Is there any way to check that the memory address is OK before passing it to memcmp? This has not been modified in t_lookup.c in either v0.9.7 or v2.0 (attached).

Sincerely,

Nicolas

Andrei Pelinescu-Onciul

7:48 p.m.

On Sep 16, 2009 at 16:11, inge inge@legos.fr wrote:

...

Andrei,

Thanks for your update on this. As the conditions for the bug to appear are quite erratic (it seems to take several INVITEs sent to the gateway in a row, which makes it send a "482 Loop Detected" leading to the crashing ACK), we have not managed to reproduce it.

I never saw this on our 0.9.7pre1 lab version, but that one is almost idle and relays only a few test calls, compared to our production 0.9.5 one serving several thousand end users.

Is there any explanation for how we can end up with an <out of bound memory address> for the local_totag of the 100-ed tm entry?

Either some module corrupts shared memory somehow (very hard to find out), or a deleted transaction is somehow used (e.g. race condition, the transaction is found in the list, but deleted immediately before being accessed).
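The second explanation (a transaction that is found in the list but freed just before it is accessed) is the classic case that reference counting is meant to prevent. A minimal, single-process C sketch of the idea follows. It is not SER's tm code: `struct demo_cell` and all `demo_cell_*` functions are hypothetical names used only for illustration, and a real multi-process server would have to update the counter under a lock or atomically.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical transaction cell; SER's real cell has many more fields. */
struct demo_cell {
    int ref_count;      /* number of users currently holding the cell */
    int marked_deleted; /* set when deletion was requested while in use */
};

/* Allocate a zeroed cell; returns NULL if allocation fails. */
static struct demo_cell *demo_cell_new(void)
{
    return calloc(1, sizeof(struct demo_cell));
}

/* Take a reference before touching any field of the cell. */
static void demo_cell_ref(struct demo_cell *c)
{
    assert(c != NULL);
    c->ref_count++;
}

/* Request deletion: free now if unused, otherwise defer.
 * Returns 1 if the cell was freed, 0 if the free was deferred. */
static int demo_cell_delete(struct demo_cell *c)
{
    assert(c != NULL);
    if (c->ref_count > 0) {
        c->marked_deleted = 1;
        return 0;
    }
    free(c);
    return 1;
}

/* Drop a reference; frees a cell whose deletion was deferred.
 * Returns 1 if the cell was freed, 0 otherwise. */
static int demo_cell_unref(struct demo_cell *c)
{
    assert(c != NULL && c->ref_count > 0);
    c->ref_count--;
    if (c->ref_count == 0 && c->marked_deleted) {
        free(c);
        return 1;
    }
    return 0;
}
```

With this pattern a lookup that holds a reference never sees freed memory, because the delete request is only carried out when the last user lets go.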

...

Is there any way to check if the memory address is OK before passing it to memcmp ?

Yes, you could check that, but it wouldn't help. That bad address means that things are very wrong. You could avoid the memcmp, but you'll most likely only delay the crash. What you could do is add some LOG() statements and log all the fields of the transaction when the address is wrong (but the information is about the same that you would get from a coredump when ser is compiled with debugging).
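A rough C sketch of such a check-and-log guard is shown here, under stated assumptions: the `str` type is a simplified stand-in for SER's counted string, `MAX_SANE_TAG_LEN` and `totag_matches` are hypothetical names, and `fprintf` stands in for LOG() so that the example is self-contained. As noted in the reply, it can only reject NULL pointers and absurd lengths; a dangling pointer into freed memory passes these checks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for SER's counted string: pointer plus length. */
typedef struct {
    const char *s;
    int len;
} str;

/* Hypothetical upper bound for a plausible tag length; not a SER constant. */
#define MAX_SANE_TAG_LEN 256

/*
 * Returns 1 if the tags match, 0 if they differ, and -1 if the stored
 * tag looks implausible (NULL pointer, negative or very large length).
 * In the -1 case the fields are logged instead of being compared.
 */
static int totag_matches(const str *ack_tag, const str *local_totag)
{
    assert(ack_tag != NULL && local_totag != NULL);

    if (local_totag->s == NULL || local_totag->len < 0 ||
        local_totag->len > MAX_SANE_TAG_LEN) {
        fprintf(stderr, "suspicious local_totag: s=%p len=%d\n",
                (const void *)local_totag->s, local_totag->len);
        return -1;
    }
    if (ack_tag->s == NULL || ack_tag->len != local_totag->len)
        return 0;
    return memcmp(ack_tag->s, local_totag->s,
                  (size_t)local_totag->len) == 0;
}
```

The -1 return gives the caller a place to dump the rest of the transaction, which is the only real benefit over letting the process crash and reading the coredump.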

...

This has not been modified in t_lookup.c from v0.9.7 nor v2.0 (attached)

Yes, but the problem is not in the place where it crashes, it's somewhere else. Both 0.9.7 and 0.9.4 are very old and I don't really remember all the fixes that went in (the cvs log is helpful, but does not tell the whole story). I wouldn't want to spend a lot of time debugging 0.9.4, just to find out in the end that the bug was fixed in 0.9.7, especially since there were non-trivial tm changes between the two.

Andrei

Andrei Pelinescu-Onciul

17 Sep 17 Sep

10:38 a.m.

On Sep 16, 2009 at 21:48, Andrei Pelinescu-Onciul andrei@iptel.org wrote: [...]

...

...
This has not been modified in t_lookup.c from v0.9.7 nor v2.0 (attached)

Yes, but the problem is not in the place where it crashes, it's somewhere else. Both 0.9.7 and 0.9.4 are very old and I don't really remember all the fixes that went in (the cvs log is helpful, but does not tell the whole story). I wouldn't want to spend a lot of time debugging 0.9.4, just to find out in the end that the bug was fixed in 0.9.7, especially since there were non-trivial tm changes between the two.

I think I found the problem. It is fixed in 0.9.7. Could you either try the attached patch, or upgrading?

If you upgrade, try the latest rel_0_9_0 branch, it has 3 additional fixes for 0.9.7 (one being a fixed "fix" from 0.9.6). I'll add a 0.9.8 tag shortly, just to avoid confusion.

Andrei

Andrei Pelinescu-Onciul

11:45 a.m.

On Sep 17, 2009 at 12:38, Andrei Pelinescu-Onciul andrei@iptel.org wrote:

...

On Sep 16, 2009 at 21:48, Andrei Pelinescu-Onciul andrei@iptel.org wrote: [...]

...
...
This has not been modified in t_lookup.c from v0.9.7 nor v2.0 (attached)

Yes, but the problem is not in the place where it crashes, it's somewhere else. Both 0.9.7 and 0.9.4 are very old and I don't really remember all the fixes that went in (the cvs log is helpful, but does not tell the whole story). I wouldn't want to spend a lot of time debugging 0.9.4, just to find out in the end that the bug was fixed in 0.9.7, especially since there were non-trivial tm changes between the two.

I think I found the problem. It is fixed in 0.9.7. Could you either try the attached patch, or upgrading?

If you upgrade, try the latest rel_0_9_0 branch, it has 3 additional fixes for 0.9.7 (one being a fixed "fix" from 0.9.6). I'll add a 0.9.8 tag shortly, just to avoid confusion.

In the meantime, I've uploaded a tar.gz, so you could also get http://ftp.iptel.org/pub/ser/0.9.8/src/ser-0.9.8_src.tar.gz or use the v_0_9_8 tag with cvs.

Andrei

inge

24 Sep 24 Sep

1:49 p.m.

Hi Andrei,

Thanks for the 0.9.8 release. We upgraded our proxy yesterday with success.

But we had another SER crash this morning. The cause seems to be different from the previous one.

Attached is a backtrace from the coredump. Is there a bug related to that in the latest 0.9.x version?

Sincerely,

Nicolas

Andrei Pelinescu-Onciul

9:09 p.m.

On Sep 24, 2009 at 15:49, inge inge@legos.fr wrote:

...

Hi Andrei,

Thanks for the 0.9.8 release. We upgraded our proxy yesterday with success.

But we had another SER crash this morning. The cause seems to be different from the previous one.

Attached is a backtrace from the coredump. Is there a bug related to that in the latest 0.9.x version?

No, this is a new one. It was introduced in 0.9.7. A very safe looking optimization was backported from unstable (at that time). Unfortunately the optimization had a bug and although it was quickly fixed in unstable, the fix was not backported to 0.9.7.

Try the attached patch (patch -p2 < auth_db_empty_usernames_fix.patch ).

It seems I'll have to release 0.9.9 :-(

Andrei

inge

28 Sep 28 Sep

9:02 a.m.

Hello Andrei,

Do we need to wait for 0.9.9?

Or should we just patch the current 0.9.8 to avoid the bug?

Separately, do you have an idea of the value we should set for "PKG_MEM_POOL_SIZE", so that we no longer have to truncate the location table in order to restart SER?

Regards,

Adrien.

Henning Westerholt

12:32 p.m.

On Monday, 28 September 2009, inge wrote:

...

[..] Separately, do you have an idea of the value we should set for "PKG_MEM_POOL_SIZE", so that we no longer have to truncate the location table in order to restart SER?

Hi Inge,

just increase the memory pool until you can load your subscriber base. You can even use bigger values such as 20 or 40 MB without any problems, at least in my experience.

Regards,

Henning

inge

4:53 p.m.

Hi Henning,

That would be 40 MB x the number of child processes, wouldn't it?

So the parameter can be set to "40 x 1024 x 1024"?

The database contains about 4500 users. The server runs with 4 GB of memory. How many users do you have on your server?

Regards,

Adrien

Henning Westerholt

5:07 p.m.

On Monday, 28 September 2009, inge wrote:

...

That would be 40 MB x the number of child processes, wouldn't it?

So the parameter can be set to "40 x 1024 x 1024"?

The database contains about 4500 users. The server runs with 4 GB of memory.

Hi Inge,

OK, then I think that 40 MB is a bit too much for you; 10 MB should really be more than enough. But you could nevertheless set it to this value. FYI, in later versions of kamailio (and I think SER as well) the usrloc module contains logic to partition the data loading, so that it's no longer necessary to increase the memory pool, regardless of how many subscribers you have.

...

How many users do you have on your server?

Somewhat more than two million users (not subscribers). ;) But we use a patched usrloc for this.

Henning

Andrei Pelinescu-Onciul

6:24 p.m.

On Sep 28, 2009 at 18:53, inge inge@legos.fr wrote:

...

Hi Henning,

That would be 40 MB x the number of child processes, wouldn't it?

It doesn't use 40 MB x number of processes of physical memory. In your case it will use as much as needed to temporarily load the location table on startup, but only in _one_ process. The rest of the processes will use < 1 MB each, so the total physical memory used will be something like (no_of_processes-1) x 1 MB + size_needed_for_location. Think of PKG_MEM_POOL_SIZE as an upper bound and not as actually used memory. You need to worry about PKG_MEM_POOL_SIZE only if you have memory overcommit disabled (and even in that case only if you exceed the physical memory configured on the box).

The reason for the low 1 MB default in 0.9.x (and 4 MB in newer versions) is debugging. Most people don't need more than 1 MB, and memory leaks are caught much earlier this way. I'll increase the default for 0.9.9.
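The arithmetic above can be written out as a small C sketch. The names `DEMO_PKG_MEM_POOL_SIZE` and `estimate_pkg_usage` are hypothetical, and the inputs are illustrative assumptions rather than measurements from a SER installation; the point is only that the pool size is a per-process ceiling while the memory actually touched is much smaller.

```c
#include <assert.h>
#include <stddef.h>

#define MB ((size_t)1024 * 1024)

/* Per-process upper bound, written as in the thread: 40 x 1024 x 1024. */
#define DEMO_PKG_MEM_POOL_SIZE (40 * MB)

/*
 * Rough estimate of private memory actually used across all processes:
 * one process loads the location table (location_bytes), every other
 * process is assumed to touch about other_bytes.
 */
static size_t estimate_pkg_usage(size_t n_processes,
                                 size_t location_bytes,
                                 size_t other_bytes)
{
    assert(n_processes >= 1);
    return (n_processes - 1) * other_bytes + location_bytes;
}
```

For example, with 9 processes, an assumed 10 MB needed for the location table and 1 MB for each of the others, `estimate_pkg_usage(9, 10 * MB, 1 * MB)` gives 18 MB, far below 9 x 40 MB.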

...

So the parameter can be set to "40 x 1024 x 1024"?

The database contains about 4500 users. The server runs with 4 GB of memory. How many users do you have on your server?

Andrei

inge

29 Sep 29 Sep

8:25 a.m.

Hello Andrei, Henning,

So we will raise PKG_MEM_POOL_SIZE to 4 MB.

Andrei: we have updated to the latest CVS snapshot, ser-0.9.9+cvs20090925. Is that sufficient to avoid the latest problem?

Thanks for your detailed answers !

Regards,

Adrien.

Andrei Pelinescu-Onciul

8:34 a.m.

On Sep 29, 2009 at 10:25, inge inge@legos.fr wrote:

...

Hello Andrei, Henning,

So we will raise PKG_MEM_POOL_SIZE to 4 MB.

Andrei: we have updated to the latest CVS snapshot, ser-0.9.9+cvs20090925. Is that sufficient to avoid the latest problem?

Yes, it is.

Andrei

inge

9 Oct 9 Oct

6:33 p.m.

Hello Andrei,

We have now been running 0.9.9+cvs20090925 for 10 days.

Today SER crashed/stopped without a coredump. Do you know if we need to configure something to enable this option?

With the previous crashes, /var/log/messages showed a child process crashing first, followed by all the other processes. But this time SER stopped as if by "service ser stop", logging only "INFO : signal 15 received..."

Do you have any idea?

Regards,

Adrien.

Andrei Pelinescu-Onciul

6:59 p.m.

On Oct 09, 2009 at 20:33, inge inge@legos.fr wrote:

...

Hello Andrei,

We have now been running 0.9.9+cvs20090925 for 10 days.

Today SER crashed/stopped without a coredump. Do you know if we need to configure something to enable this option?

No, core dumping has been enabled by default since 0.9.3. However, note that if you don't start ser as root, it cannot enable core dumping itself (and you have to do it by hand before starting ser). Note also that even if started as root, if it's supposed to change its uid (e.g. started with -u <some_user> or with uid in the .cfg), it won't be able to dump core on any modern Linux kernel (in this case you would need to set /proc/sys/fs/suid_dumpable to 1 or remove the -u from the ser command line).
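At the process level, "enabling core dumping" comes down to the RLIMIT_CORE resource limit. The following C sketch shows the general POSIX mechanism, not SER's actual startup code; `raise_core_limit` and `core_dumps_allowed` are hypothetical helper names. It only raises the soft limit up to the existing hard limit, which any user may do; raising the hard limit itself needs root, which is presumably why a non-root start cannot do this on its own.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft core-size limit to the current hard limit.
 * Returns 0 on success, -1 on error. */
static int raise_core_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}

/* Returns 1 if the soft limit currently permits a core file, else 0. */
static int core_dumps_allowed(void)
{
    struct rlimit lim;
    int rc = getrlimit(RLIMIT_CORE, &lim);

    assert(rc == 0); /* valid resource and pointer, so this should not fail */
    (void)rc;
    return lim.rlim_cur != 0;
}
```

Doing it "by hand" from a shell before starting ser would typically be `ulimit -c unlimited`, which sets the same limit for the shell and its children.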

You can check if it dumps core, by sending SIGABRT to one of the ser processes (e.g kill -SIGABRT <pid_of_ser>).
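The same test can be sketched in C: a parent forks a child, the child terminates itself with SIGABRT, and the parent inspects the wait status. This is how a supervising process can tell which signal ended a child and whether a core file was written, which corresponds to log lines such as "child process 418 exited by a signal 11" and "core was generated". The struct and function names are hypothetical, and `WCOREDUMP` is a common extension rather than strict POSIX, hence the `#ifdef`.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct child_exit {
    int by_signal;   /* 1 if the child was terminated by a signal */
    int signo;       /* terminating signal, valid if by_signal is 1 */
    int core_dumped; /* 1 = core written, 0 = not written, -1 = unknown */
};

/* Fork a child that aborts, then report how it ended.
 * Returns 0 on success, -1 if fork or waitpid failed. */
static int run_aborting_child(struct child_exit *out)
{
    int status = 0;
    pid_t pid;

    assert(out != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGABRT, SIG_DFL); /* make sure the default action applies */
        raise(SIGABRT);
        _exit(127);               /* only reached if the signal did not kill us */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    out->by_signal = WIFSIGNALED(status) ? 1 : 0;
    out->signo = out->by_signal ? WTERMSIG(status) : 0;
#ifdef WCOREDUMP
    out->core_dumped = (out->by_signal && WCOREDUMP(status)) ? 1 : 0;
#else
    out->core_dumped = -1;
#endif
    return 0;
}
```

If `signo` comes back as SIGABRT but `core_dumped` is 0, the core limit or the dumpable setting is what needs fixing.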

...

With the previous crashes, /var/log/messages showed a child process crashing first, followed by all the other processes. But this time SER stopped as if by "service ser stop", logging only "INFO : signal 15 received..."

Do you have any idea ?

Are you sure somebody hasn't stopped it? If it crashed and couldn't dump core, there should be a message logged (something like ... core was not generated...). Also you should see messages about the signal that caused the first child process to terminate and if it's really a problem it will be different from 15.

Another possibility is that the kernel killed some ser processes due to low memory (check dmesg for OOM).
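When scanning kernel log output for this, the match should be case-insensitive and use the letters "OOM", not zeros. A small C sketch of such a line filter follows; the function names are hypothetical, and the two patterns ("oom-killer" and "out of memory") are assumptions about typical Linux kernel wording that may vary between kernel versions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive substring search; returns 1 if needle is in haystack. */
static int contains_nocase(const char *haystack, const char *needle)
{
    size_t i, j;

    assert(haystack != NULL && needle != NULL);
    for (i = 0; haystack[i] != '\0'; i++) {
        for (j = 0; needle[j] != '\0'; j++) {
            if (haystack[i + j] == '\0' ||
                tolower((unsigned char)haystack[i + j]) !=
                tolower((unsigned char)needle[j]))
                break;
        }
        if (needle[j] == '\0')
            return 1;
    }
    return needle[0] == '\0';
}

/* Returns 1 if a kernel log line looks like an OOM-killer report.
 * The patterns are assumed typical wording, not an exhaustive list. */
static int line_mentions_oom(const char *line)
{
    return contains_nocase(line, "oom-killer") ||
           contains_nocase(line, "out of memory");
}
```

Feeding each line of the kernel log through `line_mentions_oom` is roughly what a case-insensitive grep for those two phrases would do.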

Andrei

inge

12 Oct 12 Oct

7:57 a.m.

Hello Andrei,

Thanks for replying.

SER runs as the root user, with nothing in the config file corresponding to a "-u" option; in "ps aux" as well, all the processes are shown as started without options.

We have checked, and it seems that no one killed the process at the time the crash happened.

I have attached /var/log/messages to this email. As you will see, we just received a "signal 15".

As for "dmesg | grep 00M", this command produces no output.

Regards,

Adrien.

Andrei Pelinescu-Onciul

8:47 a.m.

On Oct 12, 2009 at 09:57, inge inge@legos.fr wrote:

...

Hello Andrei,

Thanks for replying.

SER runs as root, with no uid setting in the config file and no "-u" option; "ps aux" shows all the processes started without options.

We have checked, and it seems that no one killed the process at the time of the crash.

I have attached /var/log/messages to this email. As you will see, we just received a "signal 15".

If a process had crashed or exited (which would be another bug), you would have one of the following lines in the log:
child process %d exited normally, status=
child process %d exited by a signal
child process %d stopped by a signal

After that you would have another line:
INFO: terminating due to SIGCHLD

(see main.c:506, in handle_sigs()).

The log fragment you sent shows that L_INFO messages are printed (so debug>=L_INFO), so you would see one of the above lines if a child had exited.
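The reasoning above can be turned into a small log check. This is a sketch, assuming the log text matches the message formats quoted above; the function name classify_ser_shutdown and its output labels are invented for this example.

```shell
#!/bin/sh
# Classify how ser went down, based on the lines present in a log file.
# Prints "child-exit", "external-signal" or "unknown".
classify_ser_shutdown() {
    log_file=$1
    if grep -Eq 'child process [0-9]+ (exited normally|exited by a signal|stopped by a signal)' "$log_file"; then
        # A child died or stopped first; the main process then shut down.
        echo "child-exit"
    elif grep -q 'signal 15 received' "$log_file"; then
        # Only the SIGTERM notice: something outside ser asked it to stop.
        echo "external-signal"
    else
        echo "unknown"
    fi
}

# Example: classify_ser_shutdown /var/log/messages
```

It is best run on the log fragment for a single incident, since a file covering several incidents can contain both kinds of lines.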

If you don't see them, the only explanation is that someone or some other program sent it a SIGTERM or SIGINT. A message is printed any time the main process gets a SIGTERM, but it's printed only at high debug levels. You could try either increasing the debug level (e.g. debug=5, but then you'll have a _lot_ logged) or changing main.c:478 from
DBG("SIGTERM received, program terminates\n");
to
LOG(L_CRIT, "SIGTERM received, program terminates\n");
and main.c:476 from
DBG("INT received, program terminates\n");
to
LOG(L_CRIT, "INT received, program terminates\n");

This way, if it happens a second time, you'll at least see whether the main process was manually killed (which is the only explanation I can offer).
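The two source changes can also be applied mechanically. This is a sketch using sed, assuming the DBG() calls appear in main.c exactly as quoted above; the function name promote_signal_logs is hypothetical, and the result is written to stdout so the original file is left untouched.

```shell
#!/bin/sh
# Rewrite the two DBG() calls as LOG(L_CRIT, ...) so that they show up
# at normal debug levels. Reads C source on stdin, writes to stdout.
promote_signal_logs() {
    sed -e 's/DBG("SIGTERM received, program terminates/LOG(L_CRIT, "SIGTERM received, program terminates/' \
        -e 's/DBG("INT received, program terminates/LOG(L_CRIT, "INT received, program terminates/'
}

# Example: promote_signal_logs < main.c > main.c.new
```

Comparing main.c and main.c.new with diff before rebuilding confirms that only the two intended lines changed.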

Andrei

inge

8:54 a.m.

Could a command issued over the FIFO cause a restart like the one we see in the "messages" log?

Regards,

Adrien.

Andrei Pelinescu-Onciul

9:10 a.m.

On Oct 12, 2009 at 10:54, inge inge@legos.fr wrote:

...

Could a command issued over the FIFO cause a restart like the one we see in the "messages" log?

If "kill" is sent over the fifo, you will see only one of the DBG()s: "SIGTERM received, program terminates". Note that you will see it in the log only if debug>4, or if you replace it with a LOG(L_CRIT, ...).

Andrei

inge

9:40 a.m.

I tried kill -ABRT on a preproduction platform running ser 0.9.9+cvs, and a core dump was generated.

The only difference I see is the operating system: preproduction runs on Red Hat 4, whereas production runs on Red Hat 5.

Can that make any difference?

Regards,

Adrien.

Andrei Pelinescu-Onciul

28 Sep

6:10 p.m.

On Sep 28, 2009 at 11:02, inge inge@legos.fr wrote:

...

Hello Andrei,

Do we need to wait for 0.9.9?

Or should we just patch the current 0.9.8 to avoid the bug?

No, just apply the patch (the same patch will be in 0.9.9).

...

Also, do you have an idea of what value we should set for "PKG_MEM_POOL_SIZE" so that we do not have to truncate the location table in order to restart SER?

It depends on your location table size, but having it too big has no adverse side effects (it doesn't use more physical memory than needed even if you set it to a high value). I would start with 8 (or 4) MB and increase it if you have restart problems (but you could as well use 20 MB from the start and not worry about it in the future).

Andrei

sr-dev@lists.kamailio.org

participants (4)

Andrei Pelinescu-Onciul
Henning Westerholt
inge
Klaus Darilion