This bug report concerns a Kamailio instance that's functioning as an edge proxy,
accepting requests from clients over TLS and forwarding them to internal components over
UDP.
I've observed occasional "deadlocks" that all take place after there were
connectivity issues between Kamailio and one or more of the clients (logging messages such
as "tcp_send failed", "handle_tcpconn_ev(): connect x failed",
"tcp_read_data(): error reading: Connection reset by peer").
Environment:
* Ubuntu 16.04
* Kamailio 5.2.7 (5.2.7+ubuntu16.04)
Relevant config:
```
# for TLS clients
socket_workers=10
listen=tls:wan.ip:TLS_PORT
# to be able to forward incoming TCP requests to UDP destinations
socket_workers=10
listen=udp:wan.ip:TLS_PORT
# for RPC commands
socket_workers=2
listen=tcp:127.0.0.1:HTTP_PORT
```
## Reproduction
After a lot of trial and error, I've found a way to somewhat reproducibly trigger the
problem using sipp. It involves setting up a large amount of simultaneous calls dialing
into the edge proxy using TLS, to a destination UA that will answer the call and keep it
open for a few seconds, then using `ctrl-c` to suddenly quit sipp which will ungracefully
terminate the TCP/TLS connections.
It takes me a couple of iterations to trigger the problem, e.g repeat until Kamailio
starts failing:
* start sipp
* terminate sipp with ctrl-c
* (quickly) restart sipp
* terminate sipp
* etc
The amount of simultaneous calls has to be sufficiently large. In my environment I'm
unable to reproduce the problem at all with 10-20 calls, while 50 calls always triggers
the problem after a couple of iterations.
Example sipp commandline: `sipp -nd -sf caller.xml -d 2000 -r 50 -l 500 -m 5000
-max_socket 1000 -t ln -tls_cert cert.crt -tls_key key.key kamailio.edge.proxy:5061`
## Failure mode 1
In one of the failure modes, it appears that all available TCP/TLS workers are permanently
bricked; sipp's calls will simply timeout and not trigger any dialplan handling.
Interestingly, even though I have dedicated workers for TLS (SIP client) and TCP (RPC
commands), and my reproduction scenario doesn't connect to the HTTP_PORT at all, this
failure scenario bricks the TCP (RPC) listeners as well; any curl requests to it will
simply timeout as well.
Kamailio will hang for a while on (systemd) restart.
## Failure mode 2
In another failure mode, it looks like the TCP workers are unaffected but the UDP perform
forwarding to UDP destinations are broken. Sipp's calls will enter Kamailio, but
forwarding (`relay()`) to the next hop will timeout even though the destination is
functioning just fine. Since this failure mode doesn't break RPC I was able to enable
debug logging:
```
2020-07-07T17:26:58.770411+02:00 kamailio[24056]: INFO: From Customer
2020-07-07T17:26:58.770697+02:00 kamailio[24056]: DEBUG: pv [pv_trans.c:590]:
tr_eval_string(): i=0 j=2
2020-07-07T17:26:58.770971+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2055]:
ds_manage_routes(): set [4]
2020-07-07T17:26:58.771248+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2161]:
ds_manage_routes(): using alg [0] hash [250849449]
2020-07-07T17:26:58.771535+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2205]:
ds_manage_routes(): selected [0-4-0/1] <sip:wan.ip:8065>
2020-07-07T17:26:58.771817+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2226]:
ds_manage_routes(): using first entry [4/1]
2020-07-07T17:26:58.772090+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2269]:
ds_manage_routes(): using entry [4/0]
2020-07-07T17:26:58.772362+02:00 kamailio[24056]: DEBUG: dispatcher [dispatch.c:2016]:
ds_select_dst_limit(): selected target destinations: 2
2020-07-07T17:26:58.773001+02:00 kamailio[24056]: DEBUG: pv [pv_core.c:2611]:
pv_set_force_sock(): trying to set send-socket to [udp:wan.ip:5061]
2020-07-07T17:26:58.773353+02:00 kamailio[24056]: DEBUG: <core>
[core/socket_info.c:559]: grep_sock_info(): checking if host==us: 13==13 &&
[wan.ip] == [wan.ip]
2020-07-07T17:26:58.773829+02:00 kamailio[24056]: DEBUG: <core>
[core/socket_info.c:566]: grep_sock_info(): checking if port 5061 (advertise 0) matches
port 5061
2020-07-07T17:26:58.774122+02:00 kamailio[24056]: NOTICE: Relaying (cid:
1-9151(a)sipp.ip).
2020-07-07T17:26:58.774417+02:00 kamailio[24056]: DEBUG: siputils [checks.c:120]:
has_totag(): no totag
2020-07-07T17:26:58.774708+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:1344]:
t_newtran(): msg (0x7f69d1cc4980) id=1265/24056 global id=1265/24056 T start=(nil)
2020-07-07T17:26:58.774989+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:499]:
t_lookup_request(): start searching: hash=43232, isACK=0
2020-07-07T17:26:58.775315+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:457]:
matching_3261(): RFC3261 transaction matching failed - via branch [z9hG4bK-9151-1-1]
2020-07-07T17:26:58.775612+02:00 kamailio[24056]: DEBUG: tm [t_lookup.c:682]:
t_lookup_request(): no transaction found
2020-07-07T17:26:58.775908+02:00 kamailio[24056]: DEBUG: <core>
[core/md5utils.c:67]: MD5StringArray(): MD5 calculated: 0a5d4a8d13f9606523da9c4d193b2261
2020-07-07T17:26:58.776270+02:00 kamailio[24056]: DEBUG: <core>
[core/dns_cache.c:3197]: dns_srv_sip_resolve(): (wan.ip, 0, 0), ip, ret=0
2020-07-07T17:26:58.776701+02:00 kamailio[24056]: DEBUG: <core>
[core/msg_translator.c:1765]: check_boundaries(): no multi-part body
2020-07-07T17:26:58.777012+02:00 kamailio[24056]: DEBUG: <core>
[core/msg_translator.c:2935]: create_via_hf(): id added: <;i=a701>, rcv proto=3
2020-07-07T17:26:58.777596+02:00 kamailio[24056]: DEBUG: tm [t_funcs.c:375]:
t_relay_to(): new transaction forwarded
2020-07-07T17:26:58.777948+02:00 kamailio[24056]: DEBUG: <core>
[core/receive.c:354]: receive_msg(): request-route executed in: 1927 usec
2020-07-07T17:26:58.778243+02:00 kamailio[24056]: DEBUG: <core>
[core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)
2020-07-07T17:26:58.779265+02:00 kamailio[24056]: message repeated 5 times: [ DEBUG:
<core> [core/usr_avp.c:636]: destroy_avp_list(): destroying list (nil)]
2020-07-07T17:26:58.779431+02:00 kamailio[24056]: DEBUG: <core> [core/xavp.c:507]:
xavp_destroy_list(): destroying xavp list (nil)
2020-07-07T17:26:58.779753+02:00 kamailio[24056]: DEBUG: <core>
[core/receive.c:458]: receive_msg(): cleaning up
2020-07-07T17:27:00.705303+02:00 kamailio[24053]: DEBUG: tm [t_reply.c:1266]:
t_should_relay_response(): ->>>>>>>>> T_code=0, new_code=408
2020-07-07T17:27:00.705917+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]:
t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T
start=0x7f69b10167c0
2020-07-07T17:27:00.706187+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]:
t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)!
2020-07-07T17:27:00.706533+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:539]:
pv_get_tm_reply_code(): reply code is <408>
2020-07-07T17:27:00.706879+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1041]:
t_check_msg(): msg (0x7f69b1017e80) id=1264/24056 global id=1264/24056 T
start=0x7f69b10167c0
2020-07-07T17:27:00.707198+02:00 kamailio[24053]: DEBUG: tm [t_lookup.c:1116]:
t_check_msg(): T (0x7f69b10167c0) already found for msg (0x7f69b1017e80)!
2020-07-07T17:27:00.707522+02:00 kamailio[24053]: DEBUG: tmx [t_var.c:594]:
pv_get_tm_reply_reason(): reply reason is [Request Timeout]
{
"DSC": "tcp receiver (tls:wan.ip:5061) child=0",
"IDX": 14,
"PID": 24056
},
{
"DSC": "slow timer",
"IDX": 11,
"PID": 24053
},
```
Kamailio will restart (systemd) normally and everything will function normally again
afterwards.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2395