Hi Henning.
I apologize in advance for the long post.
OpenSER has still been crashing randomly over the last few days.
Using GDB on one of the generated core files, I found something curious:
#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c", func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
#1 0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62
#2 0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167
#3 0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209
#4 0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68, _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447
#5 0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1) at dlist.c:128
#6 0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356
#7 0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60
#8 0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275
#9 0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at timer.c:357
#10 0x000aa6fc in start_timer_processes () at timer.c:386
#11 0x00036788 in main_loop () at main.c:873
#12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372
Inspecting frame 0 in detail, in particular the qm variable:
(gdb) print qm
$1 = (struct fm_block *) 0x185320
This is the fm_block structure defined in mem/f_malloc.h.
(gdb) frame 0
#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c",
func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
267 if ((*f)->size>=size) goto found;
(gdb) list
262 /*search for a suitable free frag*/
263
264 for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){
265 f=&(qm->free_hash[hash].first);
266 for(;(*f); f=&((*f)->u.nxt_free))
267 if ((*f)->size>=size) goto found;
268 /* try in a bigger bucket */
269 }
270 /* not found, bad! */
271 return 0;
When I print the qm->free_hash array, I find that it is mostly empty. In my
core file, hash has a value of 3, and printing that position gives the
following:
(gdb) print qm->free_hash[hash]
$1 = {first = 0x69703a31, no = 1}
(gdb) print qm->free_hash
$2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0},
  {first = 0x69703a31, no = 1}, {first = 0x0, no = 0}, {first = 0x0, no = 0},
  {first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no = 0},
  {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641},
  {first = 0x0, no = 0} <repeats 21 times>, {first = 0x1ced90, no = 1},
  {first = 0x0, no = 0} <repeats 679 times>, {first = 0x1cef40, no = 1},
  {first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no = 1},
  {first = 0x0, no = 0}, {first = 0x24de38, no = 1},
  {first = 0x0, no = 0} <repeats 11 times>, {first = 0x21d100, no = 1},
  {first = 0x0, no = 0}, {first = 0x0, no = 0}}
(gdb) print qm->free_hash.no
$3 = 0
(gdb) print qm->free_hash[hash].first
$4 = (struct fm_frag *) 0x69703a31
(gdb) x/s 0x69703a31
0x69703a31: <Address 0x69703a31 out of bounds>
So the crash happened because the head of one free-fragment list points to
an invalid address.
I have two questions:
1. Is it normal for the free_hash array in fm_block to have some entries
pointing to invalid locations?
2. I can see that the fm_frag_lnk struct has a member called no. In the
output above it is 0 for most entries of the array and 1 for a few,
including the one that causes the crash. Would it be possible to use that
member for a check before attempting the allocation? What exactly does the
no member mean? I also see that for some entries it has a value higher
than 1.
Thanks in advance for any help, and again, I apologize for the long post.
Best regards.
Sergio Gutierrez
On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <saguti(a)gmail.com> wrote:
Hi Henning.
Thanks a lot for your answer.
Currently, the machine does not report any hardware problem; Solaris 10
has a service called Fault Manager, which is running on my machine, and it
has not reported any error or problem related to it.
At the moment I am testing an OpenSER installation compiled with an
optimized version of GCC released by Sun for Sparc systems. This release
is based on gcc 4, and so far OpenSER has been running for almost 18 hours
without a crash.
I will inspect the core file again, and I will be posting what I find.
Best regards, and thanks again.
Sergio Gutierrez.
On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt <henning.westerholt(a)1und1.de> wrote:
On Thursday 28 February 2008, Sergio Gutierrez wrote:
My OpenSER 1.3 installation running on Solaris Sparc is facing random and
unexpected crashes, apparently related to the timer process.
The last core presents the following backtrace:
#0 0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194
#1 0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210
#2 0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275
#3 0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357
#4 0x000a8668 in start_timer_processes () at timer.c:386
#5 0x00035ea8 in main_loop () at main.c:873
#6 0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372
Thanks in advance for any hint you can give me.
Hi Sergio,
Signal 10 is SIGBUS on Solaris. It can be caused by an invalid address
alignment, a segmentation fault on a physical address, or an object
hardware error (Wikipedia).
The first crashes were both caused by a get_all_ucontacts call, triggered
by a timer. This crash now comes from another timer, the deletion of
expired dialogs, which is strange. Is this machine otherwise stable? When
(with which OpenSER release) did these crashes start?
Have you already inspected the data structures in the get_expired_dlgs
function with the debugger? Perhaps something is wrong in there.
Cheers,
Henning