Re: [OpenSER-Users] OpenSER Randomly crashes.

6 Mar 2008

Hi Sergio,
Please upload your report to the tracker (bug section).
Regards,
Bogdan
PS: please enable the memory debug support (DBG_QM_MALLOC) and run it 
like this; it might provide more information when it crashes.
Sergio Gutierrez wrote:
...
Hi Henning.
I apologize in advance for the long post.
OpenSER has still been crashing randomly these days.
Using GDB on one of the generated core files, I found something curious:
#0  0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
    func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
#1  0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62
#2  0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167
#3  0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209
#4  0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68,
    _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447
#5  0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1)
    at dlist.c:128
#6  0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356
#7  0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60
#8  0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275
#9  0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at timer.c:357
#10 0x000aa6fc in start_timer_processes () at timer.c:386
#11 0x00036788 in main_loop () at main.c:873
#12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372
Inspecting frame 0 in detail, in particular the qm variable:
(gdb) print qm
$1 = (struct fm_block *) 0x185320
This is the fm_block structure defined in mem/f_malloc.h.
(gdb) frame 0
#0  0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
    func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
267                             if ((*f)->size>=size) goto found;
(gdb) list
262             /*search for a suitable free frag*/
263
264             for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){
265                     f=&(qm->free_hash[hash].first);
266                     for(;(*f); f=&((*f)->u.nxt_free))
267                             if ((*f)->size>=size) goto found;
268                     /* try in a bigger bucket */
269             }
270             /* not found, bad! */
271             return 0;
If I print the qm->free_hash array, I find that it is mainly empty. In 
the particular case of my core file, hash has a value of three; 
printing that position gives the following:
(gdb) print qm->free_hash[hash]
$1 = {first = 0x69703a31, no = 1}
(gdb) print qm->free_hash
$2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x69703a31, no = 1}, {
    first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0,
    no = 0}, {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641}, {first = 0x0, no = 0} <repeats 21 times>, {
    first = 0x1ced90, no = 1}, {first = 0x0, no = 0} <repeats 679 times>, {first = 0x1cef40, no = 1}, {
    first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no = 1}, {first = 0x0, no = 0}, {
    first = 0x24de38, no = 1}, {first = 0x0, no = 0} <repeats 11 times>, {first = 0x21d100, no = 1}, {
    first = 0x0, no = 0}, {first = 0x0, no = 0}}
(gdb) print qm->free_hash.no
$3 = 0
(gdb) print qm->free_hash[hash].first
$4 = (struct fm_frag *) 0x69703a31
(gdb) x/s 0x69703a31
0x69703a31:      <Address 0x69703a31 out of bounds>
So the crash happened because the list of free memory fragments 
referred to an invalid one.
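
A side note on the value printed above: what follows is a minimal sketch, assuming the 32-bit big-endian byte order of SPARC, of how one might check whether a suspicious pointer such as 0x69703a31 is really text, and whether it is word-aligned. The helper names ptr_as_ascii and is_aligned are hypothetical and not part of OpenSER.

```c
#include <assert.h>
#include <stdint.h>

/* Copy the four bytes of a 32-bit value into out[] in big-endian order
 * (the order in which SPARC stores them in memory) and NUL-terminate.
 * Non-printable bytes are shown as '.'. */
void ptr_as_ascii(uint32_t value, char out[5])
{
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(value >> (24 - 8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}

/* Return 1 if addr is a multiple of align, 0 otherwise.
 * align must be a non-zero power of two. */
int is_aligned(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

Calling ptr_as_ascii(0x69703a31, buf) leaves "ip:1" in buf, and is_aligned(0x69703a31, 4) returns 0. A pointer whose bytes are printable text would be consistent with string data having been written over the allocator's bookkeeping, although that alone does not identify what wrote it. The address is also odd, so dereferencing it as a struct pointer is a misaligned access, one of the SIGBUS causes Henning mentions further down this thread.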
I have two questions:

1. Is it normal for the free_hash array in fm_block to have some 
positions pointing to invalid locations?
2. I can see that the fm_frag_lnk struct has a member called no. In the 
output above it is 0 for most entries of the array and 1 for some, 
including the one that causes the crash. Would it be possible to use 
that member for a check before attempting the allocation? What exactly 
does the no member mean? I also see that for some entries it has a 
value higher than 1.
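
On the second question, here is a hedged sketch rather than a patch. Judging only from the output above (for example no = 4641 in one bucket), no looks like a count of the free fragments in a bucket; that reading is an assumption. If it holds, the corrupted bucket has a plausible count (1) and a bad pointer, so a check on no alone would not catch it. A check that could catch it is validating each link against the bounds of the memory pool before dereferencing it. The types and the names frag_ptr_ok and check_bucket below are simplified, hypothetical stand-ins, not the actual OpenSER structures or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins: only the members visible in the gdb output
 * are modelled. */
struct frag {
    unsigned long size;
    struct frag *nxt_free;
};

struct bucket {
    struct frag *first;
    unsigned long no;   /* assumed: number of fragments in this bucket */
};

/* Return 1 if p lies inside [pool, pool + pool_len), leaves room for a
 * whole struct frag, and is suitably aligned (C11 _Alignof); else 0. */
int frag_ptr_ok(const void *p, const void *pool, size_t pool_len)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)pool;

    if (pool_len < sizeof(struct frag))
        return 0;
    if (a < lo || a - lo > pool_len - sizeof(struct frag))
        return 0;
    return a % _Alignof(struct frag) == 0;
}

/* Walk one bucket. Return the number of fragments visited, or -1 if a
 * link is invalid or the chain length disagrees with the recorded count.
 * Each pointer is validated before it is dereferenced, and the count
 * bounds the walk so a cyclic list cannot loop forever. */
long check_bucket(const struct bucket *b, const void *pool, size_t pool_len)
{
    unsigned long seen = 0;

    assert(b != NULL);
    for (const struct frag *f = b->first; f != NULL; f = f->nxt_free) {
        if (seen >= b->no || !frag_ptr_ok(f, pool, pool_len))
            return -1;
        seen++;
    }
    return seen == b->no ? (long)seen : -1;
}
```

A check like this would report the damage earlier, but it does not explain what overwrote the pointer; that is where the memory debug build suggested by Bogdan would help.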
Thanks in advance for any help, and again, I apologize for the long post.
Best regards.
Sergio Gutierrez
On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <saguti@gmail.com> wrote:
Hi Henning.

Thanks a lot for your answer.

Currently, the machine does not report any hardware problem;
Solaris 10 has a service called Fault Manager, which is running on
my machine, and it has not reported any error or problem related
to it.

At the moment, I am testing an OpenSER installation compiled using
an optimized version of GCC released by Sun for use on SPARC
systems; this release is based on gcc 4, and so far OpenSER
has been running for almost 18 hours without a crash.

I will inspect the core file again, and I will be posting what I find.

Best regards, and thanks again.

Sergio Gutierrez.

On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt
<henning.westerholt@1und1.de> wrote:

    On Thursday 28 February 2008, Sergio Gutierrez wrote:
    > My OpenSER 1.3 installation running on Solaris Sparc is facing random and
    > unexpected crashes, apparently related to the timer process.
    >
    > The last core presents the following backtrace
    >
    > #0  0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194
    > #1  0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210
    > #2  0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275
    > #3  0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357
    > #4  0x000a8668 in start_timer_processes () at timer.c:386
    > #5  0x00035ea8 in main_loop () at main.c:873
    > #6  0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372
    >
    >
    > Thanks in advance for any hint you can give me.

    Hi Sergio,

    signal 10 is SIGBUS on Solaris. This could be caused by an invalid address
    alignment, a segmentation fault on a physical address, or an object hardware
    error (Wikipedia).

    The first crashes were both caused by get_all_ucontacts, triggered by a
    timer. This crash is now in another timer, the deletion of expired dialogs.
    Strange. Is this machine otherwise stable? When (with which OpenSER release)
    did these crashes start?

    Have you already inspected, with the debugger, the data structures in the
    code of the get_expired_dlgs function? Perhaps there is something wrong in
    there.

    Cheers,

    Henning

Users mailing list
Users@lists.openser.org
http://lists.openser.org/cgi-bin/mailman/listinfo/users
