Re: [sr-dev] Problem with TCP and EPOLL

Paul Pankhurst

14 Feb 2012

5:06 p.m.

Sorry this was originally posted incorrectly, so I’m reposting....

I have been having problems with TCP under load. What I have been seeing is TCP buffers failing to be serviced and, when wr_timeout exceeds the configured value for tcp_send_timeout, kamailio kills the connection. Increasing tcp_send_timeout doesn't help; even setting it to a large value (such as 45 seconds) just delays the disconnection.

Putting some tracing into the code shows that wbufq_add() is repeatedly called, but wbufq_run() is called for that connection far less than I would expect. wbufq_run() is frequently called for other connections. It looks like wbufq_run() doesn't get called when lots of wbufq_add()s are happening for a connection? wbufq_run() only appears to be called for a connection after some time has passed from the last wbufq_add().

The connection in question is a local loopback between the RLS and Presence modules (both running in the same Kamailio instance). However, it may just be a coincidence that this is the affected connection as it is also the one with the most traffic.

My suspicion is that the bug is in the io_wait_loop_epoll() routine.

Can anybody with experience of this part of the code help?

Paul Pankhurst Engineering Director Crocodile RCS Ltd

Daniel-Constantin Mierla

15 Feb 2012

11:05 a.m.

New subject: Problem with TCP and EPOLL

Hello,

I am cc-ing Andrei since he authored that part; maybe he is available these days and can give a quick answer regarding the issue.

Cheers, Daniel

sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev

-- Daniel-Constantin Mierla -- http://www.asipto.com http://linkedin.com/in/miconda -- http://twitter.com/miconda

Andrei Pelinescu-Onciul

16 Feb 2012

1:53 p.m.

On Feb 15, 2012 at 12:05, Daniel-Constantin Mierla miconda@gmail.com wrote:

> On 2/14/12 6:06 PM, Paul Pankhurst wrote:
>
>> Putting some tracing into the code shows that wbufq_add() is repeatedly called, but wbufq_run() is called for that connection far less than I would expect. wbufq_run() is frequently called for other connections. It looks like wbufq_run() doesn't get called when lots of wbufq_add()s are happening for a connection? wbufq_run() only appears to be called for a connection after some time has passed from the last wbufq_add().

It's called when the kernel says it can write again on the respective socket. It might be that your consumer cannot read fast enough and so the buffers fill on ser/kamailio side.

>> The connection in question is a local loopback between the RLS and Presence modules (both running in the same Kamailio instance). However, it may just be a coincidence that this is the affected connection as it is also the one with the most traffic.

You might do something much more resource intensive on the receive side and it might not be able to keep up with the traffic (one connection is handled by one process, so if that process is too slow for some reason it might not read fast enough => on the transmit side the send buffers will fill-up).
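
The backpressure described above can be reproduced outside Kamailio. The following is a minimal sketch, assuming a plain Python socket pair as a stand-in for the loopback connection; the function name `demo_backpressure` and the chunk size are illustrative and not taken from the Kamailio source. It fills the kernel buffers while the peer does not read, then shows that the sender is reported writable again only after the peer drains the data.

```python
import select
import socket


def demo_backpressure(chunk_size=65536):
    """Show that a socket stops being writable until the peer reads.

    Illustration only: a local socket pair stands in for the loopback
    connection; this is not Kamailio code.
    """
    sender, receiver = socket.socketpair()
    sender.setblocking(False)
    receiver.setblocking(False)
    chunk = b"x" * chunk_size
    sent = 0
    try:
        # Phase 1: the receiver never reads, so the kernel buffers fill up
        # and send() eventually fails with EAGAIN (BlockingIOError).
        while True:
            try:
                sent += sender.send(chunk)
            except BlockingIOError:
                break
        # The poller now reports the sender as not writable; an event loop
        # would have to queue further data in user space at this point.
        _, writable_before, _ = select.select([], [sender], [], 0)

        # Phase 2: the receiver drains everything that was buffered.
        drained = 0
        while True:
            try:
                data = receiver.recv(chunk_size)
            except BlockingIOError:
                break
            drained += len(data)
        # Only now is the sender reported writable again, which is the
        # moment a queued-write flush could run.
        _, writable_after, _ = select.select([], [sender], [], 0)
        return sent, drained, bool(writable_before), bool(writable_after)
    finally:
        sender.close()
        receiver.close()
```

If the reading side is slow, the second phase is delayed, and anything queued by the writer in the meantime has to wait, which matches the symptom of a flush that only happens some time after the last queue add.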

>> My suspicion is that the bug is in the io_wait_loop_epoll() routine.

You could try changing the poll method and see if that makes any difference, e.g. tcp_poll_method = sigio_rt in the .cfg file. The default is epoll_lt, so try "epoll_et", "sigio_rt" and maybe "poll" (slow for lots of connections).

Andrei
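
For readers unfamiliar with the difference between the level-triggered and edge-triggered epoll modes mentioned in this reply, here is a small Python sketch (Linux only). The helper name `count_reports` is hypothetical, and this is not how Kamailio's io_wait code is written. It leaves one unread byte on a socket and polls twice: level-triggered mode keeps reporting the socket, while edge-triggered mode reports it only once.

```python
import select
import socket


def count_reports(edge_triggered, polls=2):
    """Return how many of `polls` epoll waits report unread data.

    Illustration only (Linux): one byte is written and never read.
    """
    writer, reader = socket.socketpair()
    ep = select.epoll()
    try:
        flags = select.EPOLLIN
        if edge_triggered:
            flags |= select.EPOLLET
        ep.register(reader.fileno(), flags)
        writer.send(b"x")  # data arrives and is deliberately left unread
        reports = 0
        for _ in range(polls):
            # Timeout 0: only ask what is ready right now.
            if ep.poll(0):
                reports += 1
        return reports
    finally:
        ep.close()
        writer.close()
        reader.close()
```

With level-triggered polling, a handler that leaves work undone is called again on the next wait; with edge-triggered polling it must finish the work each time, because no further event arrives until the socket state changes again.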

Paul Pankhurst

5:27 p.m.

Thanks for that information Andrei - it has opened up a new path of investigation and we think we understand what is going wrong now. I will post an update when I get a bit further.

Paul

Paul Pankhurst

17 Feb 2012

10:35 a.m.

I now understand what is going wrong....

To make the xcap server work with the size of documents generated by the SIP client, I had to significantly increase tcp_rd_buf_size. Increasing this value is what causes the problem described. Returning tcp_rd_buf_size to its default size resolves the problem, but causes the upload of documents to the xcap server to fail.

One way of solving this would be to allow the buffer size to be settable on a per connection basis, or perhaps separately for local connections. Does anyone have any thoughts, or other suggestions?

Thanks

Paul

Daniel-Constantin Mierla

1:10 p.m.

Hello,

On 2/17/12 11:35 AM, Paul Pankhurst wrote:

> I now understand what is going wrong....
>
> To make the xcap server work with the size of documents generated by the SIP client, I had to significantly increase tcp_rd_buf_size. Increasing this value is what causes the problem described. Returning tcp_rd_buf_size to its default size resolves the problem, but causes the upload of documents to the xcap server to fail.
>
> One way of solving this would be to allow the buffer size to be settable on a per connection basis, or perhaps separately for local connections. Does anyone have any thoughts, or other suggestions?

Perhaps the buffer has to stay big in order to be able to receive mixed SIP/XCAP traffic. However, detection of whether the traffic is SIP or HTTP is done in the TCP read code, so maybe the solution is to have a limit on the read size and set it lower for SIP and larger for HTTP. I haven't checked the source code to see if that is possible, though.

If not, the only way I see now is to use different listen sockets (e.g., ports) and based on that, set the read buffer size, maybe similar to the new option I added for worker processes per socket.

Cheers, Daniel
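
The first idea above, a read-size limit chosen per protocol, could look roughly like the following Python sketch. Everything in it is hypothetical: the function names, the two limit values, and the detection rule (looking for an `HTTP/` or `SIP/` version token in the first line) are assumptions for illustration, not Kamailio's TCP read code.

```python
SIP_READ_LIMIT = 4096      # hypothetical limit for SIP messages
HTTP_READ_LIMIT = 65536    # hypothetical larger limit for HTTP/XCAP


def detect_protocol(first_bytes):
    """Guess 'sip', 'http', or None (undecided) from the first line."""
    line, sep, _ = first_bytes.partition(b"\r\n")
    if not sep:
        return None  # first line incomplete, cannot decide yet
    tokens = line.split()
    if not tokens:
        return None
    # Requests end with the version token, responses start with it.
    for token in (tokens[-1], tokens[0]):
        if token.startswith(b"HTTP/"):
            return "http"
        if token.startswith(b"SIP/"):
            return "sip"
    return None


def read_limit_for(first_bytes, default=SIP_READ_LIMIT):
    """Pick a read-size limit based on the detected protocol."""
    proto = detect_protocol(first_bytes)
    if proto == "http":
        return HTTP_READ_LIMIT
    if proto == "sip":
        return SIP_READ_LIMIT
    return default
```

For example, `read_limit_for(b"PUT /doc HTTP/1.1\r\n")` selects the larger HTTP limit, while a SIP request or an undecided partial line falls back to the smaller one. The per-socket alternative would replace the detection step with a lookup keyed on the listening port.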

Andrei Pelinescu-Onciul

21 Feb 2012

4:20 p.m.

On Feb 17, 2012 at 10:35, Paul Pankhurst paul@crocodile-rcs.com wrote:

> I now understand what is going wrong....
>
> To make the xcap server work with the size of documents generated by the SIP client, I had to significantly increase tcp_rd_buf_size. Increasing this value is what causes the problem described. Returning tcp_rd_buf_size to its default size resolves the problem, but causes the upload of documents to the xcap server to fail.
>
> One way of solving this would be to allow the buffer size to be settable on a per connection basis, or perhaps separately for local connections. Does anyone have any thoughts, or other suggestions?

It's very strange that increasing the _receive_ buffer size would cause problems. Have you also tried increasing tcp_conn_wq_max (e.g. 128k) and possibly tcp_wq_max?

What's the output of sercmd core.tcp_info when you start to see problems (best just before that)?

Andrei
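
As a rough mental model of how the settings discussed in this thread interact, the following Python sketch queues data per connection, enforces a per-connection cap and a global cap (in the spirit of tcp_conn_wq_max and tcp_wq_max), and reports a timeout when a queue has made no progress for longer than a send timeout (in the spirit of tcp_send_timeout). The class, its method names, and the exact rules are assumptions for illustration; Kamailio's actual wbufq_add()/wbufq_run() logic may differ.

```python
class WriteQueueModel:
    """Toy model of per-connection write queues; not Kamailio code."""

    def __init__(self, conn_max, total_max, send_timeout):
        self.conn_max = conn_max          # max queued bytes per connection
        self.total_max = total_max        # max queued bytes overall
        self.send_timeout = send_timeout  # seconds a queue may stay blocked
        self.queued = {}                  # conn id -> queued bytes
        self.blocked_since = {}           # conn id -> time of last progress

    def add(self, conn, nbytes, now):
        """Queue nbytes for conn; return 'ok', 'timeout' or 'full'."""
        current = self.queued.get(conn, 0)
        if current and now - self.blocked_since[conn] > self.send_timeout:
            return "timeout"  # no progress in time: connection is dropped
        if current + nbytes > self.conn_max:
            return "full"
        if sum(self.queued.values()) + nbytes > self.total_max:
            return "full"
        if current == 0:
            self.blocked_since[conn] = now
        self.queued[conn] = current + nbytes
        return "ok"

    def run(self, conn, writable_bytes, now):
        """Flush up to writable_bytes once the socket is writable again."""
        current = self.queued.get(conn, 0)
        flushed = min(current, writable_bytes)
        remaining = current - flushed
        if remaining:
            self.queued[conn] = remaining
            if flushed:
                self.blocked_since[conn] = now  # progress restarts the timer
        else:
            self.queued.pop(conn, None)
            self.blocked_since.pop(conn, None)
        return flushed
```

In this model, raising the send timeout only postpones the "timeout" result when `run()` is never called, and raising the caps only postpones "full"; neither helps unless the peer actually reads, which is consistent with the behaviour reported at the start of the thread.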
