Hi again!
This time I started openser with only one child and attached gdb to the
UDP thread, hoping that gdb would show more information about the signal
9. No luck:
# gdb /usr/sbin/openser 2855
...
Reading symbols from /lib/tls/i686/cmov/libnsl.so.1...done.
Loaded symbols for /lib/tls/i686/cmov/libnsl.so.1
Reading symbols from /lib/tls/i686/cmov/libnss_nis.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libnss_nis.so.2
Reading symbols from /lib/tls/i686/cmov/libnss_files.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libnss_files.so.2
Failed to read a valid object file image from memory.
0xb7f73410 in ?? ()
(gdb) c
Continuing.
<-- here the thread terminates by signal 9
Couldn't get registers: No such process.
(gdb)
Continuing.
Couldn't get registers: No such process.
(gdb)
Continuing.
Couldn't get registers: No such process.
(gdb)
Continuing.
Couldn't get registers: No such process.
(gdb)
The logfile also shows no hints:
May 8 09:04:24 debian /usr/sbin/openser[2855]: PRESENCE:
get_subs_dialog:The query for subscribtion for [user]= klaus,[domain]=
pernau.at for [event]= presence returned no result
May 8 09:04:24 debian /usr/sbin/openser[2855]:
PRESENCE:query_db_notify: Could not get subs_dialog from database
May 8 09:04:24 debian /usr/sbin/openser[2855]:
PRESENCE:update_presentity: Could not send Notify
May 8 09:04:24 debian /usr/sbin/openser[2855]:
e17197948e006b8865b78e750537073b.8e1b///2-2877@88.198.53.113 PUBLISH
detected, handle_publish ... done
May 8 09:04:24 debian /usr/sbin/openser[2854]: child process 2855
exited by a signal 9
May 8 09:04:24 debian /usr/sbin/openser[2854]: core was not generated
I want to track down the signal 9. Who sent the signal 9 to the UDP
thread? I searched for tools that monitor signals globally but didn't
find any. strace only shows which signals are sent or received by the
traced process, but not who sent the signal.
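To illustrate what I mean by "who sent the signal": for signals that can be caught, the kernel does record the sender's PID, and a process can read it. Below is a minimal Python sketch of that, using SIGTERM. The function name who_sent_sigterm is made up for this example, and it does not work for signal 9 itself, because SIGKILL cannot be caught, blocked or waited for.

```python
import os
import signal

def who_sent_sigterm():
    """Fork a child that sends SIGTERM to us; return (sender_pid, child_pid)."""
    # Block SIGTERM so it stays pending instead of terminating this process.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        child = os.fork()
        if child == 0:
            # Child: signal the parent, then exit without running cleanup code.
            os.kill(os.getppid(), signal.SIGTERM)
            os._exit(0)
        # Parent: fetch the pending signal together with its siginfo.
        info = signal.sigwaitinfo({signal.SIGTERM})
        os.waitpid(child, 0)  # reap the child
        return info.si_pid, child
    finally:
        # Restore the signal mask we had before.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Here si_pid is the PID of the sending process. For SIGKILL the receiver never gets a chance to look at it, so the sender has to be found from outside the killed process.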
I have run out of ideas for debugging this, so please send me yours.
regards
klaus
Klaus Darilion wrote:
Hi Bogdan!
I've attached with strace to all openser threads and waited for the
crash. Here is the strace log of the "attendant" process (ID=0):
Process 2340 attached - interrupt to quit
pause() = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
waitpid(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], WNOHANG) = 2344
time([1178548261]) = 1178548261
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
send(3, "<134>May 7 16:31:01 /usr/sbin/o"..., 86, MSG_NOSIGNAL) = 86
time([1178548262]) = 1178548262
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
send(3, "<134>May 7 16:31:02 /usr/sbin/o"..., 69, MSG_NOSIGNAL) = 69
waitpid(-1, 0xbfd58ecc, WNOHANG) = 0
time([1178548262]) = 1178548262
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
send(3, "<134>May 7 16:31:02 /usr/sbin/o"..., 79, MSG_NOSIGNAL) = 79
kill(0, SIGTERM) = 0
--- SIGTERM (Terminated) @ 0 (0) ---
sigreturn() = ? (mask now [])
rt_sigaction(SIGALRM, {0x8067830, [ALRM], SA_RESTART}, {SIG_DFL}, 8) = 0
alarm(60) = 0
wait4(-1, NULL, 0, NULL) = 2350
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2345
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2349
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2341
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2347
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2346
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2348
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = 2342
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])
wait4(-1, NULL, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
kill(0, SIGKILL) = 0
+++ killed by SIGKILL +++
Process 2340 detached
If I read it correctly, the SIGKILL is sent by this process after it has
sent SIGTERM to all its children. The SIGTERM is sent because a child
exited. But which child? And why?
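The waitpid() line in the strace above is where the attendant learns that a child was killed by SIGKILL. As a minimal Python sketch of that check (the function names describe_exit and kill_and_reap are made up for this example, and this is not openser's code): a parent forks a child, kills it with signal 9, and decodes the wait status.

```python
import os
import signal

def describe_exit(status):
    """Turn a raw wait status into a short description."""
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    if os.WIFEXITED(status):
        return "exited with status %d" % os.WEXITSTATUS(status)
    return "other status change"

def kill_and_reap():
    """Fork an idle child, SIGKILL it, and report what waitpid() says."""
    child = os.fork()
    if child == 0:
        signal.pause()  # child just sleeps until a signal arrives
        os._exit(0)
    os.kill(child, signal.SIGKILL)
    pid, status = os.waitpid(child, 0)
    return pid == child, describe_exit(status)
```

The wait status carries the PID of the child and the number of the signal that ended it, but nothing about the sender. That matches what the strace shows: the attendant can tell that 2344 died from signal 9, but not where the signal came from.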
The openser log says:
May 7 16:31:02 debian /usr/sbin/openser[2340]: child process 2344
exited by a signal 9
May 7 16:31:02 debian /usr/sbin/openser[2340]: core was not generated
May 7 16:31:02 debian /usr/sbin/openser[2340]: INFO: terminating due to
SIGCHLD
To me this looks like 2344 (a UDP thread) was terminated by signal 9.
The main thread then receives SIGCHLD and sends SIGTERM and, afterwards,
SIGKILL to all other threads and itself.
But why did thread 2344 receive a SIGKILL, and who sent it?
I need some more debugging tips.
Bogdan, you mentioned gdb - how can I debug this with gdb?
regards
klaus
Bogdan-Andrei Iancu wrote:
Hi Klaus,
I committed the fix for the TM memory leak to SVN - it should not happen
anymore now, even if you do not use t_release()...
regarding openser no longer reacting - can you attach with gdb to see
what the processes are doing?
regards,
bogdan
Klaus Darilion wrote:
Hi Daniel!
Summary:
- Without t_release() (no modifications to source code) openser leaks
memory.
- With t_release() openser does not leak. But after some time there
is strange behaviour, e.g.:
-: openser stops reacting for some minutes and afterwards gets
terminated with signal 9. When openser stops working, the load
increases to > 40. This has happened 3 times now.
-: openser stops reacting for some minutes and the Linux PC
where openser is running becomes unresponsive. No login. Open
SSH sessions are unresponsive. I had to reboot the PC. This
happened once.
Maybe this is not purely openser-related, but a problem between openser
and Linux (as I had to reboot the server once).
Any hints on how to debug this?
regards
klaus