This bug report concerns a Kamailio instance functioning as an edge proxy: it accepts requests from clients over TLS and forwards them to internal components over UDP.
I've observed occasional "deadlocks", all occurring after connectivity issues between Kamailio and one or more of the clients (log messages such as "tcp_send failed", "handle_tcpconn_ev(): connect x failed", "tcp_read_data(): error reading: Connection reset by peer").
Environment:
Relevant config:
# for TLS clients
socket_workers=10
listen=tls:wan.ip:TLS_PORT
# to be able to forward incoming TCP requests to UDP destinations
socket_workers=10
listen=udp:wan.ip:TLS_PORT
# for RPC commands
socket_workers=2
listen=tcp:127.0.0.1:HTTP_PORT
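The routing logic isn't shown here, but based on the debug log below it roughly amounts to the following (a sketch only: the dispatcher set and algorithm are taken from the log; the route name, reply codes and limit parameter are assumed):

# pick a destination from dispatcher set 4 using algorithm 0 (hash),
# then force the UDP listener as send socket before relaying
route[RELAY] {
    if (!ds_select_dst("4", "0", "2")) {
        sl_send_reply("503", "No destination available");
        exit;
    }
    # corresponds to pv_set_force_sock() in the debug log
    $fs = "udp:wan.ip:TLS_PORT";
    if (!t_relay()) {
        sl_reply_error();
    }
}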
After a lot of trial and error, I've found a way to somewhat reproducibly trigger the problem using sipp. It involves setting up a large number of simultaneous calls dialing into the edge proxy over TLS, to a destination UA that answers each call and keeps it open for a few seconds, then pressing ctrl-c to abruptly quit sipp, which ungracefully terminates the TCP/TLS connections.
It takes me a couple of iterations to trigger the problem, i.e. I repeat the scenario until Kamailio starts failing (a rough loop is sketched below):
The number of simultaneous calls has to be sufficiently large. In my environment I'm unable to reproduce the problem at all with 10-20 calls, while 50 calls always triggers it after a couple of iterations.
Example sipp commandline: sipp -nd -sf caller.xml -d 2000 -r 50 -l 500 -m 5000 -max_socket 1000 -t ln -tls_cert cert.crt -tls_key key.key kamailio.edge.proxy:5061
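The reproduction loop looks roughly like this (a sketch: the iteration count and sleep duration are arbitrary, and kill -9 stands in for the second ctrl-c that force-quits sipp):

for i in $(seq 1 5); do
    sipp -nd -sf caller.xml -d 2000 -r 50 -l 500 -m 5000 -max_socket 1000 \
         -t ln -tls_cert cert.crt -tls_key key.key kamailio.edge.proxy:5061 &
    sleep 10     # let a batch of calls establish
    kill -9 $!   # ungraceful exit, tearing down the TCP/TLS connections mid-call
    wait
done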
In one of the failure modes, all available TCP/TLS workers appear to be permanently bricked; sipp's calls simply time out and never trigger any dialplan handling.
Interestingly, even though I have dedicated workers for TLS (SIP clients) and TCP (RPC commands), and my reproduction scenario doesn't connect to HTTP_PORT at all, this failure mode bricks the TCP (RPC) listeners as well; any curl request to that port simply times out too.
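For example, a probe like the following just hangs until the client-side timeout fires (assuming the jsonrpcs module is exported over HTTP on HTTP_PORT; the /RPC path and the method are illustrative):

curl -m 5 -s -X POST http://127.0.0.1:HTTP_PORT/RPC \
     -H 'Content-Type: application/json' \
     -d '{"jsonrpc": "2.0", "method": "core.version", "id": 1}'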
Kamailio will hang for a while on (systemd) restart.
In the other failure mode, the TCP workers appear to be unaffected, but the UDP workers that forward to UDP destinations are broken. Sipp's calls enter Kamailio, but relaying (relay()) to the next hop times out even though the destination is functioning just fine. Since this failure mode doesn't break RPC, I was able to enable debug logging.
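The log level can be raised at runtime via the still-working RPC interface, e.g. with kamcmd (assuming access to the RPC socket):

kamcmd cfg.set_now_int core debug 3

The resulting trace for one failing call: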
2020-07-07T17:26:58.770411+02:00 kamailio[24056]: INFO: From Customer
2020-07-07T17:26:58.770697+02:00 kamailio[24056]: DEBUG: pv [pv_trans.c:590]: tr_eval_string(): i=0 j=2
2020-07-07T17:26:58.770971+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2055]: ds_manage_routes(): set [4]
2020-07-07T17:26:58.771248+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2161]: ds_manage_routes(): using alg [0] hash [250849449]
2020-07-07T17:26:58.771535+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2205]: ds_manage_routes(): selected [0-4-0/1] <sip:wan.ip:8065>
2020-07-07T17:26:58.771817+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2226]: ds_manage_routes(): using first entry [4/1]
2020-07-07T17:26:58.772090+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2269]: ds_manage_routes(): using entry [4/0]
2020-07-07T17:26:58.772362+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2016]: ds_select_dst_limit(): selected target destinations: 2
2020-07-07T17:26:58.773001+02:00 kamailio[24056]: DEBUG: pv [pv_core.c:2611]: pv_set_force_sock(): trying to set send-socket to [udp:wan.ip:5061]
2020-07-07T17:26:58.773353+02:00 kamailio[24056]: DEBUG: <core> [core/socket_info.c:559]: grep_sock_info(): checking if host==us: 13==13 && [wan.ip] == [wan.ip]
2020-07-07T17:26:58.773829+02:00 kamailio[24056]: DEBUG: <core> [core/socket_info.c:566]: grep_sock_info(): checking if port 5061 (advertise 0) matches port 5061
2020-07-07T17:26:58.774122+02:00 kamailio[24056]: NOTICE: Relaying (cid: 1-9151@sipp.ip).
2020-07-07T17:26:58.774417+02:00 kamailio[24056]: DEBUG: siputils [checks.c:120]: has_totag(): no totag
2020-07-07T17:26:58.774708+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:1344]: t_newtran(): msg (0x7f69d1cc4980) id=1265/24056 global id=1265/24056 T start=(nil)
2020-07-07T17:26:58.774989+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:499]: t_lookup_request(): start searching: hash=43232, isACK=0
2020-07-07T17:26:58.775315+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:457]: matching_3261(): RFC3261 transaction matching failed - via branch [z9hG4bK-9151-1-1]
2020-07-07T17:26:58.775612+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:682]: t_lookup_request(): no transaction found
2020-07-07T17:26:58.775908+02:00 kamailio[24056]: DEBUG: <core> [core/md5utils.c:67]: MD5StringArray(): MD5 calculated: 0a5d4a8d13f9606523da9c4d193b2261
2020-07-07T17:26:58.776270+02:00 kamailio[24056]: DEBUG: <core> [core/dns_cache.c:3197]: dns_srv_sip_resolve(): (wan.ip, 0, 0), ip, ret=0
2020-07-07T17:26:58.776701+02:00 kamailio[24056]: DEBUG: <core> [core/msg_translator.c:1765]: check_boundaries(): no multi-part body
2020-07-07T17:26:58.777012+02:00 kamailio[24056]: DEBUG: <core> [core/msg_translator.c:2935]: create_via_hf(): id added: <;i=a701>, rcv proto=3
2020-07-07T17:26:58.777596+02:00 kamailio[24056]: DEBUG: tm [t_funcs.c:375]: t_relay_to(): new transaction forwarded
2020-07-07T17:26:58.777948+02:00 kamailio[24056]: DEBUG: <core> [core/receive.c:354]: receive_msg(): request-route executed in: 1927 usec
2020-07-07T17:26:58.778243+02:00 kamailio[24056]: DEBUG: <core> [core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)
2020-07-07T17:26:58.779265+02:00 kamailio[24056]: message repeated 5 times: [ DEBUG: <core> [core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)]
2020-07-07T17:26:58.779431+02:00 kamailio[24056]: DEBUG: <core> [core/xavp.c:507]: xavp_destroy_list(): destroying xavp list (nil)
2020-07-07T17:26:58.779753+02:00 kamailio[24056]: DEBUG: <core> [core/receive.c:458]: receive_msg(): cleaning up
2020-07-07T17:27:00.705303+02:00 kamailio[24053]: DEBUG: tm [t_reply.c:1266]: t_should_relay_response(): ->>>>>>>>> T_code=0, new_code=408
2020-07-07T17:27:00.705917+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]: t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T start=0x7f69b10167c0
2020-07-07T17:27:00.706187+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]: t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)!
2020-07-07T17:27:00.706533+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:539]: pv_get_tm_reply_code(): reply code is <408>
2020-07-07T17:27:00.706879+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]: t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T start=0x7f69b10167c0
2020-07-07T17:27:00.707198+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]: t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)!
2020-07-07T17:27:00.707522+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:594]: pv_get_tm_reply_reason(): reply reason is [Request Timeout]
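For reference, the two PIDs appearing in the trace map to these entries in Kamailio's process table (e.g. as reported by the core.psx RPC command):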
{
"DSC": "tcp receiver (tls:wan.ip:5061) child=0",
"IDX": 14,
"PID": 24056
},
{
"DSC": "slow timer",
"IDX": 11,
"PID": 24053
},
In this failure mode Kamailio restarts (via systemd) without hanging, and everything functions normally again afterwards.