Hi Sergio,
please upload your report on the tracker (bug section).
Regards,
Bogdan
PS: please enable the memory debug support (DBG_QM_MALLOC) and run it
like this - it might provide more information when it crashes.
Sergio Gutierrez wrote:
Hi Henning.
I apologize in advance for the long post.
These days, OpenSER has still been crashing randomly.
Using GDB on one of the generated core files, I found something curious:
#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c", func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
#1 0xfedb74b0 in db_mysql_get_columns (_h=0x1cbf68, _r=0x24dde8) at res.c:62
#2 0xfedb79f0 in db_mysql_convert_result (_h=0x1cbf68, _r=0x24dde8) at res.c:167
#3 0xfedb28c4 in db_mysql_store_result (_h=0x1cbf68, _r=0xffbff830) at dbase.c:209
#4 0xfedb40e8 in db_mysql_raw_query (_h=0x1cbf68, _s=0xff07e668 "select received, contact, socket, cflags, path from location where expires > '2008-03-04 13:37:51' and cflags & 64 = 64 and id % 1 = 0", _r=0xffbff830) at dbase.c:447
#5 0xff053260 in get_all_db_ucontacts (buf=0x1ceec0, len=320054, flags=64, part_idx=0, part_max=1) at dlist.c:128
#6 0xff0528c8 in get_all_ucontacts (buf=0x1ceec0, len=320058, flags=64, part_idx=0, part_max=1) at dlist.c:356
#7 0xfee57c6c in pingClients (ticks=60, param=0x0) at functions.h:60
#8 0x000aa430 in timer_ticker (timer_list=0x163c00) at timer.c:275
#9 0x000aa180 in run_timer_process (tpl=0x1c5808, do_jiffies=1) at
timer.c:357
#10 0x000aa6fc in start_timer_processes () at timer.c:386
#11 0x00036788 in main_loop () at main.c:873
#12 0x0003a0c4 in main (argc=1137536, argv=0x155f1c) at main.c:1372
Inspecting frame 0 in detail, in particular the qm variable:
(gdb) print qm
$1 = (struct fm_block *) 0x185320
This is the fm_block structure defined in mem/f_malloc.h.
(gdb) frame 0
#0 0x000bcfbc in fm_malloc (qm=0x185320, size=24, file=0xfedbac10 "res.c", func=0xfedbac70 "db_mysql_get_columns", line=62) at mem/f_malloc.c:267
267 if ((*f)->size>=size) goto found;
(gdb) list
262 /*search for a suitable free frag*/
263
264 for(hash=GET_HASH(size);hash<F_HASH_SIZE;hash++){
265 f=&(qm->free_hash[hash].first);
266 for(;(*f); f=&((*f)->u.nxt_free))
267 if ((*f)->size>=size) goto found;
268 /* try in a bigger bucket */
269 }
270 /* not found, bad! */
271 return 0;
When I print the qm->free_hash array, I find that it is mostly empty. In
my core file, hash has the value 3, and printing that position gives the
following:
(gdb) print qm->free_hash[hash]
$1 = {first = 0x69703a31, no = 1}
(gdb) print qm->free_hash
$2 = {{first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no =
0}, {first = 0x69703a31, no = 1}, {
first = 0x0, no = 0}, {first = 0x0, no = 0}, {first = 0x0, no =
0}, {first = 0x0, no = 0}, {first = 0x0,
no = 0}, {first = 0x0, no = 0}, {first = 0x24dd68, no = 4641},
{first = 0x0, no = 0} <repeats 21 times>, {
first = 0x1ced90, no = 1}, {first = 0x0, no = 0} <repeats 679
times>, {first = 0x1cef40, no = 1}, {
first = 0x0, no = 0} <repeats 1337 times>, {first = 0x1cef40, no =
1}, {first = 0x0, no = 0}, {
first = 0x24de38, no = 1}, {first = 0x0, no = 0} <repeats 11
times>, {first = 0x21d100, no = 1}, {
first = 0x0, no = 0}, {first = 0x0, no = 0}}
(gdb) print qm->free_hash.no
$3 = 0
(gdb) print qm->free_hash[hash].first
$4 = (struct fm_frag *) 0x69703a31
(gdb) x/s 0x69703a31
0x69703a31: <Address 0x69703a31 out of bounds>
So the crash happened because the free list of memory fragments contained
an invalid pointer, which fm_malloc then dereferenced at line 267.
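
One thing that may be worth checking is whether the bogus pointer value is really text. The following is a minimal sketch, not taken from the OpenSER sources, that splits 0x69703a31 into its four bytes in big-endian order (the in-memory byte order on SPARC). The helper name decode_be32 is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of OpenSER): write the four bytes of
 * value into out, most significant byte first, followed by a NUL.
 * On a big-endian machine this is the order the bytes have in memory. */
static void decode_be32(uint32_t value, char out[5])
{
    assert(out != NULL);
    out[0] = (char)((value >> 24) & 0xff);
    out[1] = (char)((value >> 16) & 0xff);
    out[2] = (char)((value >> 8) & 0xff);
    out[3] = (char)(value & 0xff);
    out[4] = '\0';
}
```

Calling decode_be32(0x69703a31u, buf) fills buf with the printable ASCII characters "ip:1". That would be consistent with a string (for example part of a SIP URI) having been written over the free_hash entry, although the core dump shown here does not prove it.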
I have two questions:
1. Is it normal for the free_hash array in fm_block to have some entries
pointing to invalid locations?
2. I can see that the fm_frag_lnk struct has a member called no. In the
output above it is 0 for most entries of the array and 1 for some,
including the one that causes the crash. Would it be possible to use
that member for a check before attempting the allocation? What exactly
does the no member mean? I also see that for some entries it has a
value higher than 1.
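
To make the second question concrete, here is a rough sketch of what such a check could look like. It uses simplified stand-in structures, not the real ones from mem/f_malloc.h, and it assumes that no counts the fragments currently linked in a bucket, which the dump suggests but which should be confirmed in the source. The names struct pool, frag_ptr_plausible and check_bucket are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real fm_frag / fm_frag_lnk may differ. */
struct frag {
    size_t size;
    struct frag *nxt_free;
};

struct frag_lnk {
    struct frag *first;
    unsigned long no;   /* assumed: number of fragments in this bucket */
};

/* Hypothetical description of the memory region the allocator manages. */
struct pool {
    char *start;        /* first byte of the region */
    char *end;          /* one past the last byte */
};

/* Returns 1 if a whole fragment header at p would lie inside the pool. */
static int frag_ptr_plausible(const struct pool *pool, const struct frag *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)pool->start;
    uintptr_t hi = (uintptr_t)pool->end;

    if (a < lo || a > hi)
        return 0;
    return hi - a >= sizeof(struct frag);
}

/* Walks one bucket without trusting its links. Returns the number of
 * fragments seen, or -1 if a link leaves the pool or the chain length
 * disagrees with the bucket's counter. */
static long check_bucket(const struct pool *pool, const struct frag_lnk *b)
{
    unsigned long seen = 0;
    const struct frag *f = b->first;

    while (f != NULL) {
        if (!frag_ptr_plausible(pool, f))
            return -1;
        if (++seen > b->no)
            return -1;
        f = f->nxt_free;
    }
    return seen == b->no ? (long)seen : -1;
}
```

Note that in the dump the bad entry has no = 1 together with a non-NULL first, so a check on the counter alone would not have flagged it. A range check on the pointer would, since gdb reports 0x69703a31 as out of bounds.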
Thanks in advance for any help, and again, I apologize for the long post.
Best regards.
Sergio Gutierrez
On Thu, Feb 28, 2008 at 11:49 AM, Sergio Gutierrez <saguti(a)gmail.com> wrote:
Hi Henning.
Thanks a lot for your answer.
Currently, the machine does not report any hardware problem;
Solaris 10 has a service called Fault Manager, which is running on
my machine, and it has not reported any error or problem related
to it.
At this moment, I am testing an OpenSER installation compiled using
an optimized version of GCC released by Sun to be used on Sparc
systems; this release is based on gcc 4, and at this time, OpenSER
has been running for almost 18 hours without a crash.
I will inspect the core file again, and I will be posting what I find.
Best regards, and thanks again.
Sergio Gutierrez.
On Thu, Feb 28, 2008 at 5:19 AM, Henning Westerholt <henning.westerholt(a)1und1.de> wrote:
On Thursday 28 February 2008, Sergio Gutierrez wrote:
My OpenSER 1.3 installation running on Solaris Sparc is facing random
and unexpected crashes, apparently related to the timer process.
The last core presents the following backtrace:
#0 0xfe977a04 in get_expired_dlgs (time=4233810208) at dlg_timer.c:194
#1 0xfe977540 in dlg_timer_routine (ticks=7980, attr=0x0) at dlg_timer.c:210
#2 0x000a839c in timer_ticker (timer_list=0x15ec00) at timer.c:275
#3 0x000a80ec in run_timer_process (tpl=0x1b8088, do_jiffies=1) at timer.c:357
#4 0x000a8668 in start_timer_processes () at timer.c:386
#5 0x00035ea8 in main_loop () at main.c:873
#6 0x000397c4 in main (argc=-4195024, argv=0x150e9c) at main.c:1372
Thanks in advance for any hint you can give me.
Hi Sergio,
signal 10 is SIGBUS on Solaris. This can be caused by an invalid address
alignment, a segmentation fault on a physical address, or an object
hardware error (Wikipedia).
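
To illustrate the alignment case: SPARC requires multi-byte loads and stores to be naturally aligned, and a misaligned access is reported as SIGBUS. The following is a minimal sketch of how an address can be tested for alignment. The helper name is_aligned is hypothetical, and whether misalignment is what triggered this particular crash cannot be told from the backtrace alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if addr is a multiple of alignment.
 * alignment must be a non-zero power of two. */
static int is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}
```

For example, is_aligned(0x1b8088, 8) holds for the tpl pointer in frame #3, while an odd address such as 0x1b8089 (made up for illustration) would fail even a 4-byte check.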
The first crashes were both caused by get_all_ucontacts, triggered by a
timer. This crash is in another timer, the deletion of expired dialogs,
which is strange. Is this machine otherwise stable, and when (with which
OpenSER release) did these crashes start?
Have you already inspected the data structures in the code of the
get_expired_dlgs function with the debugger? Perhaps there is something
wrong in there.
Cheers,
Henning
------------------------------------------------------------------------
_______________________________________________
Users mailing list
Users(a)lists.openser.org
http://lists.openser.org/cgi-bin/mailman/listinfo/users