Sorry, this was originally posted incorrectly, so I’m reposting.
I have been having problems with TCP under load. The TCP write buffers are not being serviced and, once wr_timeout exceeds the configured value for tcp_send_timeout, Kamailio kills the connection. Increasing tcp_send_timeout doesn't help: even a large value (such as 45 seconds) just delays the disconnection.
Putting some tracing into the code shows that wbufq_add() is called repeatedly, but wbufq_run() is called for that connection far less often than I would expect, even though it is called frequently for other connections. It looks as though wbufq_run() doesn't get called while lots of wbufq_add() calls are happening for a connection; it only appears to be called for that connection once some time has passed since the last wbufq_add().
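For reference, the kind of per-connection tracing described above can be sketched roughly as follows. This is only an illustration, not the actual tracing patch: the wq_trace structure and the trace_add(), trace_run() and trace_report() helpers are hypothetical names, the timestamps are assumed to be passed in as milliseconds by the caller, and the sketch assumes a run drains the queue, which may not always be true.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-connection trace record; not a Kamailio structure. */
struct wq_trace {
    int conn_id;
    unsigned long adds;                  /* wbufq_add() calls seen */
    unsigned long runs;                  /* wbufq_run() calls seen */
    unsigned long adds_since_run;        /* adds since the last run */
    unsigned long max_adds_between_runs; /* worst backlog, counted in adds */
    long pending_since_ms;               /* time of first add after a run */
    long max_wait_ms;                    /* longest wait from first add to run */
};

/* Call next to wbufq_add() for the traced connection. */
void trace_add(struct wq_trace *t, long now_ms)
{
    if (t->adds_since_run == 0)
        t->pending_since_ms = now_ms;
    t->adds++;
    t->adds_since_run++;
    if (t->adds_since_run > t->max_adds_between_runs)
        t->max_adds_between_runs = t->adds_since_run;
}

/* Call next to wbufq_run() for the traced connection. */
void trace_run(struct wq_trace *t, long now_ms)
{
    t->runs++;
    if (t->adds_since_run > 0) {
        long waited = now_ms - t->pending_since_ms;
        assert(waited >= 0); /* caller must supply a monotonic clock */
        if (waited > t->max_wait_ms)
            t->max_wait_ms = waited;
    }
    /* Assumption: the run empties the queue. */
    t->adds_since_run = 0;
}

/* Print one summary line for the connection. */
void trace_report(const struct wq_trace *t)
{
    printf("conn %d: adds=%lu runs=%lu max_adds_between_runs=%lu "
           "max_wait_ms=%ld\n",
           t->conn_id, t->adds, t->runs,
           t->max_adds_between_runs, t->max_wait_ms);
}
```

A large max_wait_ms on the busy connection, next to small values on the others, would match the behaviour described above.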
The connection in question is a local loopback between the RLS and Presence modules (both running in the same Kamailio instance). However, it may just be a coincidence that this is the affected connection as it is also the one with the most traffic.
My suspicion is that the bug is in the io_wait_loop_epoll() routine.
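To make the expectation concrete, here is a minimal, generic sketch of the behaviour I would expect from an epoll-based loop. It is not Kamailio code and does not reproduce io_wait_loop_epoll(). It assumes Linux, uses a socketpair in place of the loopback TCP connection, and assumes that the write queue would be flushed when the poller reports the socket as writable. The function names fill_until_blocked() and epollout_demo() are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write to a non-blocking fd until the kernel send buffer is full.
 * Returns the number of bytes accepted, or -1 on an unexpected error. */
long fill_until_blocked(int fd)
{
    char chunk[4096];
    long total = 0;

    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = write(fd, chunk, sizeof(chunk));
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;
        return -1;
    }
}

/* Returns 0 if EPOLLOUT is withheld while the peer is not reading and
 * is reported once the peer has drained the data. Error paths do not
 * clean up, to keep the sketch short. */
int epollout_demo(void)
{
    int sv[2], ep, n, flags;
    struct epoll_event ev, out;
    long queued, got = 0;
    char buf[4096];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    queued = fill_until_blocked(sv[0]);
    if (queued <= 0) return -1;

    ep = epoll_create1(0);
    if (ep < 0) return -1;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLOUT;          /* level-triggered write interest */
    ev.data.fd = sv[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) < 0) return -1;

    /* Send buffer is full: no writable event should be pending. */
    n = epoll_wait(ep, &out, 1, 0);
    if (n != 0) return -2;

    /* The peer reads everything that was written. */
    while (got < queued) {
        ssize_t r = read(sv[1], buf, sizeof(buf));
        if (r < 0 && errno == EINTR) continue;
        if (r <= 0) return -1;
        got += r;
    }
    assert(got == queued);

    /* The writer should now be reported as writable; this is the point
     * where a queued-write flush would be expected to run. */
    n = epoll_wait(ep, &out, 1, 1000);
    if (n != 1 || !(out.events & EPOLLOUT)) return -3;

    close(ep);
    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

If the real loop behaves like this for the other connections but not for the busy one, that would fit what the tracing shows.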
Can anybody with experience of this part of the code help?
Paul Pankhurst
Engineering Director
Crocodile RCS Ltd