One server replicates to others with t_replicate_tcp. The connection was mistakenly blocked by a firewall, and after a few hours Kamailio ran out of memory, with over 40000 open transactions in tm.stats on a server with around 300 registrations and no calls since restart.
t_replicate transactions need to time out somehow.
Needs to be investigated, indeed. By chance, did you also test with udp? If yes, same behaviour, or is it just for tcp?
I can check with UDP next week.
Changed to t_replicate_udp with a random port and it seems to not behave in the same way. I see that transactions are freed after a timeout.
```
# kamcmd tm.stats
{
    current: 119
    waiting: 17
    total: 462
    total_local: 0
    rpl_received: 0
    rpl_generated: 360
    rpl_sent: 360
    6xx: 0
    5xx: 0
    4xx: 360
    3xx: 0
    2xx: 0
    created: 462
    freed: 343
    delayed_free: 0
}
```
If the transaction relies on the tcp stack, then it could depend on whether the firewall dropped or rejected connections in your initial case. As for "UDP with a random port", that was most probably rejected with ICMP.
I bet the problem was with established connections being dropped by a firewall. TCP takes ages to time out on Linux.
What happened was that the firewall closed the TCP connection after a while; Kamailio continued sending, but the packets got dropped.
If the FW was iptables - it sends neither FIN nor RST to existing connections.
Not iptables
Regardless of the TCP connection, it seems like the transactions should time out anyway...
Note for the future: Maybe an event_route for failed replications would be a good thing.
@oej - were those transactions counted in `current` field of the stats?
After fixing the firewall we still have issues. I see the TCP keepalives coming, but replication has stopped, no SIP messages are flowing, and tm.stats has a big pile of open transactions; no more shm memory is available. Current and waiting seem to follow each other. No calls during the night, so this is only registrations. 300 devices connected. I will have to go back to replication over UDP or stateless replication using forward to get stability.
```
# kamcmd tm.stats
{
    current: 19965
    waiting: 19965
    total: 42187
    total_local: 0
    rpl_received: 42207
    rpl_generated: 10
    rpl_sent: 42211
    6xx: 0
    5xx: 1
    4xx: 4
    3xx: 0
    2xx: 42191
    created: 42187
    freed: 22221
    delayed_free: 0
}
```
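For reference, the stateless fallback via forward() that I mean would look roughly like this (a sketch only; the peer address and port are placeholders, not our real config):

```
# send a stateless copy of the REGISTER to the peer; no transaction is
# created, so nothing can pile up in tm (peer address/port are placeholders)
if (is_method("REGISTER")) {
    $du = "sip:192.0.2.10:5060;transport=udp";
    forward();
    $du = $null;
}
```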
So connections are now ok, but the transactions still live forever, right?
Do you know if the replicated requests were replied? Or the replicated registrations were handled in the location table of the pair server?
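A rough way to check both, assuming the standard usrloc RPC and that replication goes over port 5060 (adjust table and port to your setup):

```
# on the pair server: were the replicated registrations saved in usrloc?
kamcmd ul.dump

# on the replicating server: watch the REGISTER going out and the reply (if any) coming back
ngrep -d any -q -W byline "REGISTER" port 5060
```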
I tried to reproduce using master, but all ok. Based on default kamailio.cfg, I have added t_replicate_tcp:
```
# Handle SIP registrations
route[REGISTRAR] {
	if (!is_method("REGISTER")) return;

	if(isflagset(FLT_NATS)) {
		setbflag(FLB_NATB);
#!ifdef WITH_NATSIPPING
		# do SIP NAT pinging
		setbflag(FLB_NATSIPPING);
#!endif
	}
	if (!save("location")) {
		sl_reply_error();
	}
	t_replicate_tcp("127.0.0.1", 9);
	exit;
}
```
In logs I get:
``` 39(81014) ERROR: <core> [core/tcp_main.c:4135]: handle_tcpconn_ev(): connect 127.0.0.1:9 failed ```
and after a while (the delete/wait time) the transaction is cleaned:
```
25(81000) DEBUG: tm [t_reply.c:1263]: t_should_relay_response(): ->>>>>>>>> T_code=0, new_code=408
25(81000) DEBUG: tm [t_reply.c:2072]: local_reply(): branch=0, save=0, winner=0
25(81000) DEBUG: tm [t_reply.c:2111]: local_reply(): local transaction completed 408/0 (totag retr: 0/0)
26(81001) DEBUG: tm [h_table.c:129]: free_cell_helper(): freeing transaction 0x105712170 from timer.c:651
```
`tm.stats` showing 0 current transactions.
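To see whether `current` drains once the wait timer fires, a simple shell loop is enough (just a sketch):

```
# refresh tm.stats every 5 seconds; "current" should drop back down
# after the failed replication transactions pass the wait/delete timer
watch -n 5 kamcmd tm.stats
```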
What version are you using? What are the parameters you set for tm module?
Latest release. Can't access the parameters until Monday.
Latest release as snapshot/tarball 5.0.2, or latest branch 5.0 at some point in time?
After a bit more testing of t_replicate_udp, I am sorry to say that we still have a problem with that. The firewall is open. We see the same locked transactions and replication has stopped; the other side has an empty usrloc. Kamailio from the 5.0 branch of git before the 5.0.3 release. Version:
```
# kamailio -v
version: kamailio 5.0.2 (x86_64/linux)
```
```
# kamcmd tm.stats
{
    current: 19974
    waiting: 19974
    total: 534594
    total_local: 0
    rpl_received: 544239
    rpl_generated: 577
    rpl_sent: 534606
    6xx: 0
    5xx: 4533
    4xx: 584
    3xx: 0
    2xx: 529484
    created: 534594
    freed: 514619
    delayed_free: 0
}
```
One server replicating with udp, the other with tcp. I see TCP keepalives flowing, so no blocking any more.
No strange tm modparams:
```
# ----- tm params -----
# auto-discard branches from previous serial forking leg
modparam("tm", "failure_reply_mode", 3)
# default retransmission timeout: 30sec
modparam("tm", "fr_timer", 30000)
# default invite retransmission timeout after 1xx: 120sec
modparam("tm", "fr_inv_timer", 120000)
modparam("tm", "retr_timer1", 100)
modparam("tm", "retr_timer2", 4000)
```
After the TCP problems started, we added some tcp params to the config, just to verify the settings. Mostly defaults.
```
tcp_connection_lifetime=3200
tcp_accept_no_cl=yes
# Maximum write queue. If this is reached, the connection is killed.
tcp_conn_wq_max=10000
# Enables keepalive for tcp (sets SO_KEEPALIVE socket option)
tcp_keepalive = yes
# Time before starting to send keepalives, if the connection is idle (TCP_KEEPIDLE socket option). Linux only.
# Seconds
tcp_keepidle = 10
# Number of keepalives sent before dropping the connection (TCP_KEEPCNT socket option). Linux only.
tcp_keepcnt = 20
# Time interval between keepalive probes, when the previous probe failed (TCP_KEEPINTVL socket option). Linux only.
# Seconds
tcp_keepintvl = 20
# Time in seconds after which a TCP connection will be closed if it is not available for writing in this
# interval (and Kamailio wants to send something on it). Lower this value for faster detection of broken
# TCP connections. The default value is 10s.
tcp_send_timeout=10
# Number of SYN retransmissions before aborting a connect attempt (see linux tcp(7) TCP_SYNCNT). Linux only.
# Default in Linux seems to be 5, which corresponds to 180 seconds
tcp_syncnt = 2
```
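As a rough sanity check on those values (my arithmetic, assuming the options are applied as set): with tcp_keepidle=10, tcp_keepcnt=20 and tcp_keepintvl=20, a dead peer is only declared after about 10 + 20*20 = 410 seconds. The timers actually applied to a live connection can be inspected with ss:

```
# show established connections on the SIP port together with their timer state;
# the timer column reflects the keepalive settings in effect on the socket
ss -tno state established '( sport = :5060 or dport = :5060 )'
```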
Could you spot whether it appears from the first replication or only starts after a while? Do you have any test scenario that could be reused to reproduce it? As I wrote previously, I tested with the default config + t_replicate_tcp() for each register and it worked.
After three hours I have 14726 transactions waiting using around 300 accounts for REGISTER. I will restart now and see how fast I notice a new queue.
After 30 minutes, everything looks fine... So it takes a while before the TM queue starts to build
Started the test at 12:42; all seems quiet and current goes up and down around 20. At 13:14 it begins to climb again, so around 30 minutes of quiet. We have two-minute registration timeouts, so by the time it starts going astray we have processed 6762 transactions. All over UDP.
Quite strange ... any relevant log messages in syslog around the time when it goes astray? Are any new registrations still replicated (sent out) and replied?
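If you want something concrete to grep for, this is roughly what I would look at (the log file path is an assumption, it depends on your syslog setup):

```
# tm and shared-memory related messages around the time the queue starts growing
grep -E 'kamailio.*(tm \[|shm|memory)' /var/log/syslog | tail -n 100
```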
Would you be able to attach with gdb to one of the kamailio processes and look at the transaction structures?
Replication stops. I can't find anything unusual in the syslogs. I can attach GDB, but have no clue how to look at the transaction structures. Need some help :-)
Here is a gdb script snippet to print transaction structures from memory -- hopefully works ok, not tested here yet:
```
set $i=0
while($i<TABLE_ENTRIES)
  set $ehead = &_tm_table->entries[$i]
  set $tcell = _tm_table->entries[$i].next_c
  while($ehead!=(void*)$tcell)
    p $tcell
    p *$tcell
    set $tcell = $tcell->next_c;
  end
  set $i = $i + 1
end
```
Try to attach to the last udp worker, so traffic routing still goes ok. You may need to replace `TABLE_ENTRIES` with its value, 65536 (i.e. 2^16, but write it out, since `^` is XOR in gdb), if gdb complains that it cannot find it.
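A possible way to run it (the PID is a placeholder, adjust to your system; save the script above as, say, /tmp/tm_walk.gdb):

```
# find the PID of the last udp receiver process
kamcmd ps | grep "udp receiver"

# attach gdb to it and walk the transaction table
gdb -p <PID>
(gdb) source /tmp/tm_walk.gdb
(gdb) detach
(gdb) quit
```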
There is something fishy going on here. I removed t_replicate and the server has been running for six days without getting into trouble. However, when I run kamcmd tm.stats there are 767 waiting transactions that are not freed. It's a very basic configuration using the dialog module and the dialog event routes. A few calls, but not much over the weekend. I will try to debug a bit harder and find out where these come from. t_replicate really forced the issue and the server bogged down quickly, but it seems like ordinary calls also produce transactions that hang around. That's not good.
```
{
    current: 780
    waiting: 767
    total: 792
    total_local: 0
    rpl_received: 1389
    rpl_generated: 339
    rpl_sent: 1514
    6xx: 4
    5xx: 285
    4xx: 172
    3xx: 0
    2xx: 519
    created: 792
    freed: 11
    delayed_free: 0
}
```
Did you get the chance to run the gdb script from my previous comment?
Is the shm memory usage still growing without t_replicate?
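For the shm question, a periodic snapshot is enough to see the trend (core.shmmem is the core shared-memory RPC):

```
# shared memory usage; compare used/real_used between snapshots over time
kamcmd core.shmmem
```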
In master I added an rpc command that prints some details of the transactions in memory; that should make it easier in the future to see what's there.
I couldn't find a cause after looking around; likely something is triggered at some point, under certain conditions.
As a workaround, I am considering adding a hard cleanup on timer for transactions that live too long past their lifetime, with a log message to gather some useful information for troubleshooting.
If you can still get the output of the gdb script for such transactions, it would still be useful.
Any chance that you can test with the latest master branch? I pushed some new features to list transactions via rpc as well as to trigger a clean via rpc. An alternative would be to backport the latest tm module patches into your clone.
I will also add an option to do a hard clean on timer, but it would be good to know if it works when triggered via rpc. Also, listing the transactions via rpc will be useful when the issue shows up, as an alternative to the gdb script.
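A sketch of how to exercise them, assuming the new commands are exported as tm.list and tm.clean (check with `kamcmd system.listMethods` on your build):

```
# list the transactions currently held in memory
kamcmd tm.list

# force a cleanup of lingering transactions, then see if the counters drop
kamcmd tm.clean
kamcmd tm.stats
```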
I will try installing latest branch. This is the current stats, something is definitely keeping transactions open.
```
# kamcmd tm.stats
{
    current: 5060
    waiting: 5055
    total: 5065
    total_local: 0
    rpl_received: 5415
    rpl_generated: 266
    rpl_sent: 5543
    6xx: 1
    5xx: 4614
    4xx: 53
    3xx: 0
    2xx: 447
    created: 5065
    freed: 4
    delayed_free: 0
}
# kamcmd core.uptime
{
    now: Mon Sep 18 09:39:51 2017
    up_since: Tue Sep 12 12:12:25 2017
    uptime: 509246
}
```
Sorry, can't disturb the system until the middle of the week. Will test then.
Have you tried with gdb script? That should work with any version.
This is still a problem, even though we managed to push it back a little. After a few weeks of service, Kamailio stopped working properly because it was out of memory. We had to restart quickly because of other dependencies (important tests of devices), but I will try to isolate the problem with small configuration files this week.
Finally have a day where it's possible to run some tests. Will try to fire up GDB :-) Building latest 5.0 from git.
The server was running over Christmas with no issues (but no calls either). Registrations don't seem to be the major problem (and thus not replication). I can force a memory issue by running a lot of calls with SIPP.
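For reference, the kind of sipp run that triggers it looks roughly like this (target address, rate and call count are placeholders):

```
# built-in UAC scenario: 10 calls per second, 5000 calls in total
sipp -sn uac -r 10 -m 5000 192.0.2.20:5060
```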
Another observation is that the timer process runs at up to 99% of the CPU core it's running on (observed with top). I have moved dialog, usrloc, rtimer and everything else I can out of the generic timer process, so this points to a problem with TM and timers. I will review our settings there and keep testing.
What version are you running now there?
If the issue is reproducible with a lot of calls via sipp, then it sounds like it is not related to t_replicate() of registrations...
I mentioned that, but did not change the topic of the issue... Latest 5.0 from git being used here, not 5.1. Wanted to find this issue before upgrading.
While running SIPP and checking statistics, I see that inuse_transactions keeps going up, but active_transactions stays low:
```
tmx:active_transactions = 5
tmx:inuse_transactions = 1054
```
What is the difference between these?
```
dialog:active_dialogs = 638
dialog:early_dialogs = 5
dialog:expired_dialogs = 577
dialog:failed_dialogs = 4
dialog:processed_dialogs = 80
```
Running at 80 calls, it seems like dialog has over 600 active dialogs. Something fishy here.
After a while, I can clearly see inuse_transactions growing:
```
tmx:2xx_transactions = 3472
tmx:3xx_transactions = 0
tmx:4xx_transactions = 1078
tmx:5xx_transactions = 2913
tmx:6xx_transactions = 2
tmx:UAC_transactions = 5554
tmx:UAS_transactions = 7323
tmx:active_transactions = 2
tmx:inuse_transactions = 7322
tmx:rpl_absorbed = 463
tmx:rpl_generated = 1679
tmx:rpl_received = 8119
tmx:rpl_relayed = 7656
tmx:rpl_sent = 9335
```
Can you walk the transaction hash table with gdb? I provided the gdb script in a comment above.
This server was running on VMware. We tested with another server and could not see the same behaviour. Still trying to find any differences while installing a third server to test with... Life.
Must have been a third-party library or something that messed up Kamailio. We reinstalled the server, built a new Kamailio and the problem is totally gone. Will continue testing a few days more, but will likely just close this issue and go hide somewhere.
OK, good that it was sorted out; tm is quite important and used pretty much everywhere.
Closed #1220.
Closing it for now. If something new related appears, reopen or create a new item.