---------- Forwarded message ---------- From: Zunnun zunnun@gmail.com Date: Wed, Mar 30, 2011 at 10:51 AM Subject: crash & 100% CPU usage problem with kamailio 3.1.2 To: sr-dev@lists.sip-router.org
*kamailio 3.1.2 issues*
*Problem 1:*
Running heavy stress (for few hours), we have seen 100 % CPU usage Reason: the linked list is circular. next pointer points itself & the loop never breaks.
file: tcp_main.c
function:
inline static int _tcpconn_add_alias_unsafe(struct tcp_connection* c, int port,struct ip_addr* l_ip, int l_port,int flags)
for (a=tcpconn_aliases_hash[hash], nxt=0; a; a=nxt){ nxt=a->next;
here a->next points to a & loop never breaks
*Problem 2: * kamailio process terminates (heavy stress for over 24 hours)
Reason: it calls abort()
file: tcp_main.c function
inline static int tcpconn_chld_put(struct tcp_connection* tcpconn) { if (unlikely(atomic_dec_and_test(&tcpconn->refcnt))){ DBG("tcpconn_chld_put: destroying connection %p (%d, %d) " "flags %04x\n", tcpconn, tcpconn->id, tcpconn->s, tcpconn->flags); /* sanity checks */ membar_read_atomic_op(); /* make sure we see the current flags */ if (unlikely(!(tcpconn->flags & F_CONN_FD_CLOSED) || (tcpconn->flags & (F_CONN_HASHED|F_CONN_MAIN_TIMER| F_CONN_READ_W|F_CONN_WRITE_W)) )){ LOG(L_CRIT, "BUG: tcpconn_chld_put: %p bad flags = %0x\n", tcpconn, tcpconn->flags); abort(); //CALLS abort } _tcpconn_free(tcpconn); /* destroys also the wbuf_q if still present*/ return 1; } return 0; }
*Problem 3: * kamailio crashed (heavy stress, seen it twice after 4 days 8 hours) Reason: Circular link list is bad, prev pointer is NULL & kamailio access it
#0 local_timer_list_expire (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:221 221 _timer_rm_list(tl); /* detach */ (gdb) bt #0 local_timer_list_expire (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:221 #1 local_timer_expire (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:250 #2 local_timer_run (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:274 #3 0x0000000000510c3e in tcp_timer_run () at tcp_main.c:4384 #4 tcp_main_loop () at tcp_main.c:4564 #5 0x0000000000469eba in main_loop () at main.c:1641 #6 0x000000000046c04f in main (argc=<value optimized out>, argv=0x7fff3d28a3c8) at main.c:2398
(gdb) print tl $1 = <value optimized out> (gdb) print h $2 = (struct timer_head *) 0x855eb8 (gdb) print *h $3 = {next = 0x0, prev = 0x2acbac5398b8} (gdb) print *h->prev $4 = {next = 0x0, prev = 0x855eb8, expire = 1295476481, initial_timeout = 1920, data = 0x2acbac5397d0, f = 0x4f8310 <tcpconn_main_timeout>, flags = 512, slow_idx = 0} (gdb)
once prev pointer was NULL & next crash next pointer was NULL
*Problem 4:* kamailio process terminated (heavy stress, found it twice, after 16 hours) Reason: it calls abort()
file : mem/q_malloc.c
function
void qm_free(struct qm_block* qm, void* p)
partial code:
#ifdef DBG_QM_MALLOC qm_debug_frag(qm, f); if (f->u.is_free){ LOG(L_CRIT, "BUG: qm_free: freeing already freed pointer," " first free: %s: %s(%ld) - aborting\n", f->file, f->func, f->line); abort(); //CALLS ABORT
} MDBG("qm_free: freeing frag. %p alloc'ed from %s: %s(%ld)\n", f, f->file, f->func, f->line); #endif
*problem 5*: infinite loop - log file is full of these messages 100% CPU at that time /kamailio[21562]: : <core> [io_wait.h:617]: BUG: io_watch_del: invalid fd -1, not in [0, 2) //kamailio[21562]: : <core> [tcp_read.c:1218]: ERROR: tcpconn_receive: handle_io: io_watch_del failed for 0x2acbac5397d0
Is this really kamailio 3.1.2? There are some line numbers that do not match.
How was it compiled? Could you send me the output of kamailio -V (or ser -V, whichever name you use)
Could you describe a bit the stress tests? Do you use an open source tool (if so which, I would like to try to reproduce it) or something proprietary? If the later, do you open lots of connections all the time, or only a few connections with heavy traffic? How many cores has the machine on which you run kamailio?
Do you still have the coredumps? (I would be interested in some variable values if you do).
At first sight it could either be some kind of race (that's why I asked about the number of cores) or something completely unrelated to the tcp code, that overwrites something in a memory area that happens to belong to a tcp connections (especially the double free hints towards this).
Andrei
On Apr 01, 2011 at 09:32, Zunnun zunnun@gmail.com wrote:
---------- Forwarded message ---------- From: Zunnun zunnun@gmail.com Date: Wed, Mar 30, 2011 at 10:51 AM Subject: crash & 100% CPU usage problem with kamailio 3.1.2 To: sr-dev@lists.sip-router.org
*kamailio 3.1.2 issues*
*Problem 1:*
Running heavy stress (for few hours), we have seen 100 % CPU usage Reason: the linked list is circular. next pointer points itself & the loop never breaks.
file: tcp_main.c
function:
inline static int _tcpconn_add_alias_unsafe(struct tcp_connection* c, int port,struct ip_addr* l_ip, int l_port,int flags)
for (a=tcpconn_aliases_hash[hash], nxt=0; a; a=nxt){ nxt=a->next;
here a->next points to a & loop never breaks
*Problem 2: * kamailio process terminates (heavy stress for over 24 hours)
Reason: it calls abort()
file: tcp_main.c function
inline static int tcpconn_chld_put(struct tcp_connection* tcpconn) { if (unlikely(atomic_dec_and_test(&tcpconn->refcnt))){ DBG("tcpconn_chld_put: destroying connection %p (%d, %d) " "flags %04x\n", tcpconn, tcpconn->id, tcpconn->s, tcpconn->flags); /* sanity checks */ membar_read_atomic_op(); /* make sure we see the current flags */ if (unlikely(!(tcpconn->flags & F_CONN_FD_CLOSED) || (tcpconn->flags & (F_CONN_HASHED|F_CONN_MAIN_TIMER| F_CONN_READ_W|F_CONN_WRITE_W)) )){ LOG(L_CRIT, "BUG: tcpconn_chld_put: %p bad flags = %0x\n", tcpconn, tcpconn->flags); abort(); //CALLS abort } _tcpconn_free(tcpconn); /* destroys also the wbuf_q if still present*/ return 1; } return 0; }
*Problem 3: * kamailio crashed (heavy stress, seen it twice after 4 days 8 hours) Reason: Circular link list is bad, prev pointer is NULL & kamailio access it
#0 local_timer_list_expire (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:221 221 _timer_rm_list(tl); /* detach */ (gdb) bt #0 local_timer_list_expire (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:221 #1 local_timer_expire (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:250 #2 local_timer_run (lt=0x82eea0, saved_ticks=1295476481) at local_timer.c:274 #3 0x0000000000510c3e in tcp_timer_run () at tcp_main.c:4384 #4 tcp_main_loop () at tcp_main.c:4564 #5 0x0000000000469eba in main_loop () at main.c:1641 #6 0x000000000046c04f in main (argc=<value optimized out>, argv=0x7fff3d28a3c8) at main.c:2398
(gdb) print tl $1 = <value optimized out> (gdb) print h $2 = (struct timer_head *) 0x855eb8 (gdb) print *h $3 = {next = 0x0, prev = 0x2acbac5398b8} (gdb) print *h->prev $4 = {next = 0x0, prev = 0x855eb8, expire = 1295476481, initial_timeout = 1920, data = 0x2acbac5397d0, f = 0x4f8310 <tcpconn_main_timeout>, flags = 512, slow_idx = 0} (gdb)
once prev pointer was NULL & next crash next pointer was NULL
*Problem 4:* kamailio process terminated (heavy stress, found it twice, after 16 hours) Reason: it calls abort()
file : mem/q_malloc.c
function
void qm_free(struct qm_block* qm, void* p)
partial code:
#ifdef DBG_QM_MALLOC qm_debug_frag(qm, f); if (f->u.is_free){ LOG(L_CRIT, "BUG: qm_free: freeing already freed pointer," " first free: %s: %s(%ld) - aborting\n", f->file, f->func, f->line); abort(); //CALLS ABORT
} MDBG("qm_free: freeing frag. %p alloc'ed from %s: %s(%ld)\n", f, f->file, f->func, f->line);
#endif
*problem 5*: infinite loop - log file is full of these messages 100% CPU at that time /kamailio[21562]: : <core> [io_wait.h:617]: BUG: io_watch_del: invalid fd -1, not in [0, 2) //kamailio[21562]: : <core> [tcp_read.c:1218]: ERROR: tcpconn_receive: handle_io: io_watch_del failed for 0x2acbac5397d0
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
Hi Andrei,
Yes this is kamailio 3.1.2 code. I have put some DBG & LOG inside the tcp code to debug this problem that’s why the line number does not match.
It happens frequently – you need to run stress for a day or more. We are using Microsoft OCS tools for the stress.
Here’s a problem – it creates a circular linked list
0x00000000004e8d5b in _tcpconn_find (id=<value optimized out>, ip=0x7fff6d7393c0, port=5061, l_ip=0x7fff6d739280, l_port=0) at tcp_main.c:1491
(gdb) list
1486 #ifdef EXTRA_DEBUG
1487 DBG("a=%p, c=%p, c->id=%d, alias port= %d port=%d\n", a, a->parent,
1488 a->parent->id, a->port, a->parent->rcv.src_port);
1489 print_ip("ip=",&a->parent->rcv.src_ip,"\n");
1490 #endif
(gdb) print a
$1 = (struct tcp_conn_alias *) 0x2aed7ed00220
(gdb) print *a
$2 = {parent = 0x2aed7ed000e0, next = 0x2aed7ebfc568, prev = 0x2aed7ebfc568, port = 5061, hash = 2739}
(gdb) print a->next
$3 = (struct tcp_conn_alias *) 0x2aed7ebfc568
(gdb) print *a->next
$4 = {parent = 0x2aed7ebfc428, next = 0x2aed7ed00220, prev = 0x0, port = 5061, hash = 2739}
(gdb) print *a->next->next
$5 = {parent = 0x2aed7ed000e0, next = 0x2aed7ebfc568, prev = 0x2aed7ebfc568, port = 5061, hash = 2739}
(gdb) print *a->next->next->next
$6 = {parent = 0x2aed7ebfc428, next = 0x2aed7ed00220, prev = 0x0, port = 5061, hash = 2739}
(gdb) print *a->next->next->next->next
$7 = {parent = 0x2aed7ed000e0, next = 0x2aed7ebfc568, prev = 0x2aed7ebfc568, port = 5061, hash = 2739}
(gdb) print *a->next->next->next->next->next
$8 = {parent = 0x2aed7ebfc428, next = 0x2aed7ed00220, prev = 0x0, port = 5061, hash = 2739}
(gdb) print *a->next->next->next->next->next->next
$9 = {parent = 0x2aed7ed000e0, next = 0x2aed7ebfc568, prev = 0x2aed7ebfc568, port = 5061, hash = 2739}
(gdb) print c->aliases
$10 = 3
Hope this helps.
I have to check whether I have the core dumps . If I have I’ll send you.
Regards
-Zunnun
Hi Andrei,
Sorry I forgot to answer all your questions in my last email.
This is the version of kamailio (we are running it on CentOS 5.5, 64 bit with 8 Cores Xenon processor, 8 GB RAM), & we are starting kamailio with 2 GB shared memory.
version: kamailio 3.1.2 (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, PKG_SIZE 4MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled on 10:32:48 Mar 7 2011 with gcc 4.1.2
We are using kamailio as a router for SIP messages. We connect kamailio with OCS Access proxy & sends traffic.
We have two Access proxies, each creates up to 32 TCP /TLS connections. And sends over 5000 messages/sec. So there is not enough connected socketa for kamailio but we frequently close the sockets for testing.
We are running kamailio with 8 TCP children. & when it crashes it generates mostly 2 core files.
There is defiantly a race condition inside the code which I have fixed it but it was not related to the tcp connection management. I will send you the code soon or submit it to the dev group. It is inside timer code.
For this 100% cpu problem, the linked list is circular. May be the code that adds/removes TCP connection to linked list is not perfect.
Regards
-Zunnun