Hi,
I recently ran into an issue with one TCP connection between two Kamailio servers:
1. KAM1 sends two forked INVITEs to KAM2.
2. KAM2 starts config route processing for INVITE1, but blocks for ~1 s because the rtpengine module is pinging some inactive IPs.
3. KAM1 retransmits forked INVITE2.
Worth mentioning:
1. KAM2 also receives KDMQ messages over the same TCP connection. During that period, I noticed the KDMQ default_callback error being triggered due to a timeout, so clearly no KDMQs were processed during that time.
2. No errors related to the TCP connection were logged.
3. Kamailio version 5.8, tcp_reuse_port=yes, route_locks_size not set, and 4 socket workers used for that specific TCP connection.
I spent quite a while in tcp_main.c and tcp_read.c trying to figure out what happens to TCP connections in general, and came to the following conclusion: the TCP connection structure is held by the TCP socket worker process until the SIP request has been completely received into the buffer, parsed *and* run through the routing config. Afterwards, the TCP socket worker releases the connection structure by signalling this back to the TCP_MAIN process, so another TCP socket worker can then handle the *next* SIP request on *the same* TCP connection. ...but while one TCP socket worker is executing the config route, no other TCP socket worker can handle the *next* SIP request on *the same* TCP connection.
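To make the consequence of that conclusion concrete, here is a small timing model in Python. It is not Kamailio code: `Msg`, `route_start_times` and the `hold_connection` flag are hypothetical names, it assumes the conclusion above is correct, and only the ~1 s INVITE1 routing time comes from the report; the other timings are made up. All messages arrive on a single connection, and times are integer milliseconds.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    name: str
    arrival_ms: int  # when the message is complete in the connection buffer
    route_ms: int    # time spent executing the routing config for it


def route_start_times(msgs, workers, hold_connection=True):
    """Return {name: start_ms} for messages that all arrive on ONE connection.

    With hold_connection=True (the behaviour concluded above), the worker that
    reads a message keeps the connection until its routing finishes, so the
    next message on that connection cannot start earlier, however many
    workers are idle. With hold_connection=False the connection is assumed
    (hypothetically) to be released as soon as the message has been read.
    """
    worker_free = [0] * workers  # time at which each worker becomes idle
    conn_free = 0                # time the connection is back with tcp_main
    starts = {}
    for m in sorted(msgs, key=lambda m: m.arrival_ms):
        # pick the worker that becomes idle first
        w = min(range(workers), key=lambda i: worker_free[i])
        start = max(m.arrival_ms, worker_free[w])
        if hold_connection:
            start = max(start, conn_free)  # wait for the connection itself
        end = start + m.route_ms
        worker_free[w] = end
        if hold_connection:
            conn_free = end  # released only after routing completes
        starts[m.name] = start
    return starts


msgs = [
    Msg("INVITE1", 0, 1000),  # ~1 s blocked in the rtpengine ping
    Msg("INVITE2", 10, 2),
    Msg("KDMQ", 20, 1),
]

if __name__ == "__main__":
    print(route_start_times(msgs, workers=4))   # {'INVITE1': 0, 'INVITE2': 1000, 'KDMQ': 1002}
    print(route_start_times(msgs, workers=16))  # same as with 4 workers
    print(route_start_times(msgs, workers=4, hold_connection=False))  # {'INVITE1': 0, 'INVITE2': 10, 'KDMQ': 20}
```

Under this model, going from 4 to 16 workers does not move INVITE2 or the KDMQ at all; only releasing the connection before the slow routing work finishes does.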
My questions are:
1. Is the above conclusion correct? It would explain the issue above, and I want to double-check that I understood the core TCP code correctly.
2. Can async socket workers solve this?
Thank you, Stefan
On 13 Jun 2025, at 10:29, Stefan Mititelu via sr-dev <sr-dev@lists.kamailio.org> wrote:
This is one of the known problems with TCP. You need to offload messages from the incoming buffer and process them in a background process so that they do not block other messages. That’s why there’s a lot of infrastructure in TM(x) for suspending transactions and resuming them in another process.
You do not want to do anything with HTTP API calls, SQL queries, or IP pings in the listener process.
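The offloading pattern can be sketched in Python as follows. This is only an illustration, not Kamailio code: `serve_connection` and `demo` are hypothetical names, a `ThreadPoolExecutor` stands in for the TM(x) suspend/resume infrastructure (Kamailio itself uses separate processes, not threads), and the slow INVITE1 step is a placeholder for something like the rtpengine ping. The reader hands the slow message off and immediately moves on to the next message on the same connection.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def serve_connection(messages, slow, background, log):
    """Route the messages of ONE connection in arrival order.

    Messages named in `slow` are handed to the `background` executor (a
    stand-in for suspending the transaction and resuming it elsewhere);
    the reader does not wait for them. All other messages are routed
    inline. Returns the futures of the offloaded work, keyed by name.
    """
    futures = {}
    for name, work in messages:
        if name in slow:
            futures[name] = background.submit(work)  # returns immediately
            log.append(f"offloaded {name}")
        else:
            work()  # cheap routing stays in the reader
            log.append(f"routed {name}")
    return futures


def demo():
    log = []  # only the reader appends here, so its order is deterministic
    kdmq_handled = threading.Event()

    def invite1():
        # Hypothetical slow step, e.g. waiting on an unreachable rtpengine.
        # It finishes early once the KDMQ has been routed, so its return
        # value tells us whether the KDMQ got through while it was pending.
        kdmq_handled.wait(timeout=2.0)
        return kdmq_handled.is_set()

    messages = [
        ("INVITE1", invite1),
        ("INVITE2", lambda: None),
        ("KDMQ", kdmq_handled.set),
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = serve_connection(messages, {"INVITE1"}, pool, log)
        kdmq_first = futures["INVITE1"].result()
    return log, kdmq_first


if __name__ == "__main__":
    log, kdmq_first = demo()
    print(log)  # ['offloaded INVITE1', 'routed INVITE2', 'routed KDMQ']
    print("KDMQ routed while INVITE1 was pending:", kdmq_first)  # True
```

If INVITE1 were routed inline instead, the reader would sit in the 2 s wait and the KDMQ would only be seen afterwards, which is the situation described in the original report.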
Good observation!
/O