Hello,
Any further insight any of you can provide is very much appreciated.
Here is the core dump syslog:
Aug 9 13:42:10 kernel: [80782620.026999] kamailio[21378]: segfault at 7f269f026e5c ip 7f1c24f72969 sp 7fff3452bbe0 error 4 in dialog.so[7f1c24f53000+3a000] Aug 9 13:42:10 /usr/local/sbin/kamailio[21368]: INFO: <core> [main.c:756]: INFO: terminating due to SIGCHLD Aug 9 13:42:10 /usr/local/sbin/kamailio[21369]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21370]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21376]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21371]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21372]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21377]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21379]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21374]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21373]: INFO: <core> [main.c:807]: INFO: signal 15 received Aug 9 13:42:10 /usr/local/sbin/kamailio[21375]: INFO: <core> [main.c:807]: INFO: signal 15 received
Here is the full backtrace from GDB:
Program terminated with signal 11, Segmentation fault. [New process 21378] #0 unref_dlg (dlg=0x7f1c15232238, cnt=1) at dlg_hash.c:599 599 dlg_lock( d_table, d_entry); (gdb) bt full #0 unref_dlg (dlg=0x7f1c15232238, cnt=1) at dlg_hash.c:599 d_entry = (struct dlg_entry *) 0x7f269f026e48 #1 0x00007f1c29fd4a82 in run_trans_callbacks_internal (cb_lst=0x7f1c1538b6e0, type=32768, trans=0x7f1c1538b670, params=0x7fff3452bd40) at t_hooks.c:290 cbp = (struct tm_callback *) 0x7f1c1529a170 backup_from = (avp_list_t *) 0x8b4350 backup_to = (avp_list_t *) 0x8b4358 backup_dom_from = (avp_list_t *) 0x8b4360 backup_dom_to = (avp_list_t *) 0x8b4368 backup_uri_from = (avp_list_t *) 0x8b4340 backup_uri_to = (avp_list_t *) 0x8b4348 #2 0x00007f1c29fd4cd6 in run_trans_callbacks (type=32768, trans=<value optimized out>, req=0x1, rpl=0x7f269f026e48, code=622390088) at t_hooks.c:317 params = {req = 0x0, rpl = 0x0, param = 0x7f1c1529a180, code = 0, flags = 0, branch = 0, t_rbuf = 0x0, dst = 0x0, send_buf = {s = 0x0, len = 0}} #3 0x00007f1c29fbab46 in free_cell (dead_cell=0x7f1c1538b670) at h_table.c:152 b = <value optimized out> i = <value optimized out> rpl = <value optimized out> tt = <value optimized out> foo = <value optimized out> cbs = <value optimized out> __FUNCTION__ = "free_cell" #4 0x00007f1c29fd6a71 in wait_handler (ti=<value optimized out>, wait_tl=<value optimized out>, data=<value optimized out>) at timer.c:645 p_cell = (struct cell *) 0x7f1c1538b670 #5 0x000000000051463f in timer_main () at timer.c:894 No locals. #6 0x0000000000465101 in main_loop () at main.c:1630 i = 8 pid = <value optimized out> si = (struct socket_info *) 0x0 si_desc = "udp receiver child=7 sock=67.XXX.XXX.XXX:5060\000\000\000\000\000HÂ\232\000\000\000\000\000\f\000\000\000\000\000\000\000X\023\220", '\0' <repeats 13 times>, "t\000\000\000ÿ\177\000\000\030\205ø\024\001\000\000\000\001\000\000\000\000\000\000\000\002", '\0' <repeats 22 times> #7 0x00000000004680f3 in main (argc=<value optimized out>, argv=0x7fff3452c138) at main.c:2410 cfg_stream = (FILE *) 0x1c86010 c = <value optimized out> r = <value optimized out> tmp = 0x7fff3452de5f "" tmp_len = 32540 port = <value optimized out> proto = <value optimized out> ret = <value optimized out> seed = 3355938686 rfd = 4 debug_save = <value optimized out> debug_flag = 0 dont_fork_cnt = 0 n_lst = <value optimized out> p = <value optimized out>
Sincerely, Brandon Armstead
Hello Brandon,
On 09.08.2011 16:17, Brandon Armstead wrote:
Hello,
Any further insight any of you can provide is very much appreciated.
Here is the core dump syslog:
[snip!]
A few months ago, Anton Roman provided a core dump looking very similar to yours. I wasn't exactly able to pin down the cause but suspected the dlg_end_dlg() function.
Is there a chance you used that function around the time the crash happened?
Cheers,
--Timo
Timo,
I have actually been researching that thread - it does look *similar* however it does not look 100% related. I am checking out that commit now however - and will see if it resolves the same issue.
As for dlg_end_dlg - I am not calling this via FIFO or anything - I am simply calling dlg_manage() in the routing config - however I am not sure if this is being called internally (I assume that it is) upon a dialog cleanup?
Sincerely, Brandon Armstead
On Tue, Aug 9, 2011 at 8:08 AM, Timo Reimann timo.reimann@1und1.de wrote:
Hello Brandon,
On 09.08.2011 16:17, Brandon Armstead wrote:
Hello,
Any further insight any of you can provide is very much appreciated.
Here is the core dump syslog:
[snip!]
A few months ago, Anton Roman provided a core dump looking very similar to yours. I wasn't exactly able to pin down the cause but suspected the dlg_end_dlg() function.
Is there a chance you used that function around the time the crash happened?
Cheers,
--Timo
Timo,
Looks like that worked - going to keep watching it and see what happens.
However I am now getting:
Aug 9 15:34:00 /usr/local/sbin/kamailio[3040]: CRITICAL: dialog [dlg_timer.c:138]: Trying to insert a bogus dlg tl=0x7f8dc8089368 tl->next=0x7f8dc80202e0 tl->prev=0x7f8dc8108b68 Aug 9 15:34:00 /usr/local/sbin/kamailio[3040]: CRITICAL: dialog [dlg_handlers.c:373]: Unable to insert dlg 0x7f8dc8089318 [603:994585481] on event 3 [2->3] with clid 'CINMGC0320110809153354004076@XXX.XXX.XXX.XXX' and tags 'VPSF506071629460' 'gK0cc7be82'
Which looks similar to the original thread? Not sure if there is still an underlying issue that I should be wary of?
Sincerely, Brandon Armrstead
On Tue, Aug 9, 2011 at 8:16 AM, Brandon Armstead brandon@cryy.com wrote:
Timo,
I have actually been researching that thread - it does look *similar* however it does not look 100% related. I am checking out that commit now however - and will see if it resolves the same issue.
As for dlg_end_dlg - I am not calling this via FIFO or anything - I am simply calling dlg_manage() in the routing config - however I am not sure if this is being called internally (I assume that it is) upon a dialog cleanup?
Sincerely, Brandon Armstead
On Tue, Aug 9, 2011 at 8:08 AM, Timo Reimann timo.reimann@1und1.dewrote:
Hello Brandon,
On 09.08.2011 16:17, Brandon Armstead wrote:
Hello,
Any further insight any of you can provide is very much appreciated.
Here is the core dump syslog:
[snip!]
A few months ago, Anton Roman provided a core dump looking very similar to yours. I wasn't exactly able to pin down the cause but suspected the dlg_end_dlg() function.
Is there a chance you used that function around the time the crash happened?
Cheers,
--Timo
Timo,
Looks like I spoke too soon! It is still happening.
Any additional thoughts? All and any help is greatly appreciated.
Sincerely, Brandon Armstead
On Tue, Aug 9, 2011 at 8:37 AM, Brandon Armstead brandon@cryy.com wrote:
Timo,
Looks like that worked - going to keep watching it and see what happens.
However I am now getting:
Aug 9 15:34:00 /usr/local/sbin/kamailio[3040]: CRITICAL: dialog [dlg_timer.c:138]: Trying to insert a bogus dlg tl=0x7f8dc8089368 tl->next=0x7f8dc80202e0 tl->prev=0x7f8dc8108b68 Aug 9 15:34:00 /usr/local/sbin/kamailio[3040]: CRITICAL: dialog [dlg_handlers.c:373]: Unable to insert dlg 0x7f8dc8089318 [603:994585481] on event 3 [2->3] with clid 'CINMGC0320110809153354004076@XXX.XXX.XXX.XXX' and tags 'VPSF506071629460' 'gK0cc7be82'
Which looks similar to the original thread? Not sure if there is still an underlying issue that I should be wary of?
Sincerely, Brandon Armrstead
On Tue, Aug 9, 2011 at 8:16 AM, Brandon Armstead brandon@cryy.com wrote:
Timo,
I have actually been researching that thread - it does look *similar* however it does not look 100% related. I am checking out that commit now however - and will see if it resolves the same issue.
As for dlg_end_dlg - I am not calling this via FIFO or anything - I am simply calling dlg_manage() in the routing config - however I am not sure if this is being called internally (I assume that it is) upon a dialog cleanup?
Sincerely, Brandon Armstead
On Tue, Aug 9, 2011 at 8:08 AM, Timo Reimann timo.reimann@1und1.dewrote:
Hello Brandon,
On 09.08.2011 16:17, Brandon Armstead wrote:
Hello,
Any further insight any of you can provide is very much appreciated.
Here is the core dump syslog:
[snip!]
A few months ago, Anton Roman provided a core dump looking very similar to yours. I wasn't exactly able to pin down the cause but suspected the dlg_end_dlg() function.
Is there a chance you used that function around the time the crash happened?
Cheers,
--Timo
Hey Brandon,
On 09.08.2011 17:54, Brandon Armstead wrote:
Looks like I spoke too soon! It is still happening.
Any additional thoughts? All and any help is greatly appreciated.
My original theory with Anton's issue was (and still is) that the dialog module is trying to touch a dialog which has already terminated. When provoking things like that through modifications in the dialog module, we encountered crashes at similar locations in the code.
Is there any chance you can provide me with some (anonymized) SIP traces? Dissecting this problem has proven to be very hard even when given a lot of details, such as Anton's core dumps. Not being able to reconstruct the call flow has been a show stopper so far.
On Tue, Aug 9, 2011 at 8:37 AM, Brandon Armstead <brandon@cryy.com mailto:brandon@cryy.com> wrote:
Timo, Looks like that worked - going to keep watching it and see what happens. However I am now getting: Aug 9 15:34:00 /usr/local/sbin/kamailio[3040]: CRITICAL: dialog [dlg_timer.c:138]: Trying to insert a bogus dlg tl=0x7f8dc8089368 tl->next=0x7f8dc80202e0 tl->prev=0x7f8dc8108b68 Aug 9 15:34:00 /usr/local/sbin/kamailio[3040]: CRITICAL: dialog [dlg_handlers.c:373]: Unable to insert dlg 0x7f8dc8089318 [603:994585481] on event 3 [2->3] with clid 'CINMGC0320110809153354004076@XXX.XXX.XXX.XXX' and tags 'VPSF506071629460' 'gK0cc7be82' Which looks similar to the original thread? Not sure if there is still an underlying issue that I should be wary of?
This seems to be related to setting up the dialog timer on reception of a 200 OK message when the given dialog was about to transition from the "early" state (18x reply seen) to the "confirmed without ACK" state (200 OK seen, ACK still outstanding).
Not quite sure what this means but it could possibly be just another manifestation of the same bug.
Cheers,
--Timo
On Tue, Aug 9, 2011 at 8:16 AM, Brandon Armstead <brandon@cryy.com <mailto:brandon@cryy.com>> wrote: Timo, I have actually been researching that thread - it does look *similar* however it does not look 100% related. I am checking out that commit now however - and will see if it resolves the same issue. As for dlg_end_dlg - I am not calling this via FIFO or anything - I am simply calling dlg_manage() in the routing config - however I am not sure if this is being called internally (I assume that it is) upon a dialog cleanup? Sincerely, Brandon Armstead On Tue, Aug 9, 2011 at 8:08 AM, Timo Reimann <timo.reimann@1und1.de <mailto:timo.reimann@1und1.de>> wrote: Hello Brandon, On 09.08.2011 16:17, Brandon Armstead wrote: > Hello, > > Any further insight any of you can provide is very much appreciated. > > Here is the core dump syslog: [snip!] A few months ago, Anton Roman provided a core dump looking very similar to yours. I wasn't exactly able to pin down the cause but suspected the dlg_end_dlg() function. Is there a chance you used that function around the time the crash happened? Cheers, --Timo
Hey,
On 09.08.2011 18:25, Timo Reimann wrote:
On 09.08.2011 17:54, Brandon Armstead wrote:
Looks like I spoke too soon! It is still happening.
Any additional thoughts? All and any help is greatly appreciated.
My original theory with Anton's issue was (and still is) that the dialog module is trying to touch a dialog which has already terminated. When provoking things like that through modifications in the dialog module, we encountered crashes at similar locations in the code.
Git commit 7afb2cf8e fixes the issue. The cause was a race condition (most likely only manifesting when storing dialogs in the database in real-time) in conjunction with some awkward call flow behavior that would lead to the dialog timer list being messed up in rare situations. In consequence, this let to incorrect reference counting and, in turn, premature dialog termination.
Along the way, we found another related bug that would cause the "dlg_list" fifo command to crash Kamailio in certain situations when dialogs were restored from database. Commit ab21b2565 fixes this.
Major credits go to Brendon Armstead for giving out constant information, feedback, and overall assistance!
Cheers,
--Timo
Timo,
Thanks for solving this one!
On Aug 15, 2011, at 2:31 AM, Timo Reimann timo.reimann@1und1.de wrote:
Hey,
On 09.08.2011 18:25, Timo Reimann wrote:
On 09.08.2011 17:54, Brandon Armstead wrote:
Looks like I spoke too soon! It is still happening.
Any additional thoughts? All and any help is greatly appreciated.
My original theory with Anton's issue was (and still is) that the dialog module is trying to touch a dialog which has already terminated. When provoking things like that through modifications in the dialog module, we encountered crashes at similar locations in the code.
Git commit 7afb2cf8e fixes the issue. The cause was a race condition (most likely only manifesting when storing dialogs in the database in real-time) in conjunction with some awkward call flow behavior that would lead to the dialog timer list being messed up in rare situations. In consequence, this let to incorrect reference counting and, in turn, premature dialog termination.
Along the way, we found another related bug that would cause the "dlg_list" fifo command to crash Kamailio in certain situations when dialogs were restored from database. Commit ab21b2565 fixes this.
Major credits go to Brendon Armstead for giving out constant information, feedback, and overall assistance!
Cheers,
--Timo