One server replicates to others with t_replicate_tcp. The connection was mistakenly blocked by a firewall, and after a few hours Kamailio ran out of memory, with over 40000 open transactions in tm.stats on a server with around 300 registrations and no calls since restart.
t_replicate transactions need to time out somehow.
Needs to be investigated, indeed. By chance, did you also test with udp? If yes, same behaviour, or is it just for tcp?
I can check with UDP next week.
Changed to t_replicate_udp with a random port and it seems to not behave in the same way. I see that transactions are freed after a timeout.
```
# kamcmd tm.stats
{
    current: 119
    waiting: 17
    total: 462
    total_local: 0
    rpl_received: 0
    rpl_generated: 360
    rpl_sent: 360
    6xx: 0
    5xx: 0
    4xx: 360
    3xx: 0
    2xx: 0
    created: 462
    freed: 343
    delayed_free: 0
}
```
If the transaction relies on the tcp stack, then it could depend on whether the firewall dropped or rejected connections in your initial case. As for "UDP with a random port", that was most probably rejected with ICMP.
I bet the problem was with established connections being dropped by a firewall. TCP takes ages to time out on Linux.
What happened was that the firewall closed the TCP connection after a while; Kamailio continued sending, but the packets got dropped.
If the FW was iptables - it sends neither FIN nor RST to existing connections.
Not iptables
Regardless of the TCP connection, it seems like the transactions should time out anyway...
Note for the future: Maybe an event_route for failed replications would be a good thing.
@oej - were those transactions counted in `current` field of the stats?
After fixing the firewall we still have issues. I see the TCP keepalives coming, but replication has stopped, no SIP messages are flowing, and tm.stats has a big pile of open transactions; no more shm memory is available. Current and waiting seem to follow each other. No calls during the night, so this is only registrations. 300 devices connected. I will have to go back to replication over UDP or stateless replication using forward to get stability.
```
# kamcmd tm.stats
{
    current: 19965
    waiting: 19965
    total: 42187
    total_local: 0
    rpl_received: 42207
    rpl_generated: 10
    rpl_sent: 42211
    6xx: 0
    5xx: 1
    4xx: 4
    3xx: 0
    2xx: 42191
    created: 42187
    freed: 22221
    delayed_free: 0
}
```
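For reference, the stateless fallback via forward() that I mean would look roughly like this (a sketch only; the peer address and port are placeholders, not our real config):

```
# send a stateless copy of the REGISTER to the peer; no transaction is
# created, so nothing can pile up in tm (peer address/port are placeholders)
if (is_method("REGISTER")) {
    $du = "sip:192.0.2.10:5060;transport=udp";
    forward();
    $du = $null;
}
```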
So connections are now ok, but the transactions still live forever, right?
Do you know if the replicated requests were replied? Or the replicated registrations were handled in the location table of the pair server?
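A rough way to check both, assuming the standard usrloc RPC and that replication goes over port 5060 (adjust table and port to your setup):

```
# on the pair server: were the replicated registrations saved in usrloc?
kamcmd ul.dump

# on the replicating server: watch the REGISTER going out and the reply (if any) coming back
ngrep -d any -q -W byline "REGISTER" port 5060
```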
I tried to reproduce using master, but all ok. Based on default kamailio.cfg, I have added t_replicate_tcp:
```
# Handle SIP registrations
route[REGISTRAR] {
	if (!is_method("REGISTER")) return;

	if(isflagset(FLT_NATS)) {
		setbflag(FLB_NATB);
#!ifdef WITH_NATSIPPING
		# do SIP NAT pinging
		setbflag(FLB_NATSIPPING);
#!endif
	}
	if (!save("location")) {
		sl_reply_error();
	}
	t_replicate_tcp("127.0.0.1", 9);
	exit;
}
```
In logs I get:
``` 39(81014) ERROR: <core> [core/tcp_main.c:4135]: handle_tcpconn_ev(): connect 127.0.0.1:9 failed ```
and after a while (the delete/wait time) the transaction is cleaned:
```
25(81000) DEBUG: tm [t_reply.c:1263]: t_should_relay_response(): ->>>>>>>>> T_code=0, new_code=408
25(81000) DEBUG: tm [t_reply.c:2072]: local_reply(): branch=0, save=0, winner=0
25(81000) DEBUG: tm [t_reply.c:2111]: local_reply(): local transaction completed 408/0 (totag retr: 0/0)
26(81001) DEBUG: tm [h_table.c:129]: free_cell_helper(): freeing transaction 0x105712170 from timer.c:651
```
`tm.stats` showing 0 current transactions.
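To see whether `current` drains once the wait timer fires, a simple shell loop is enough (just a sketch):

```
# refresh tm.stats every 5 seconds; "current" should drop back down
# after the failed replication transactions pass the wait/delete timer
watch -n 5 kamcmd tm.stats
```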
What version are you using? What are the parameters you set for tm module?
Latest release. Can't access the parameters until Monday.
Latest release as snapshot/tarball 5.0.2, or latest branch 5.0 at some point in time?
After a bit more testing of t_replicate_udp, I am sorry to say that we still have a problem with that. The firewall is open. We see the same locked transactions and replication has stopped; the other side has an empty usrloc. Kamailio from the 5.0 branch of git before the 5.0.3 release. Version:
```
# kamailio -v
version: kamailio 5.0.2 (x86_64/linux)
```
```
# kamcmd tm.stats
{
    current: 19974
    waiting: 19974
    total: 534594
    total_local: 0
    rpl_received: 544239
    rpl_generated: 577
    rpl_sent: 534606
    6xx: 0
    5xx: 4533
    4xx: 584
    3xx: 0
    2xx: 529484
    created: 534594
    freed: 514619
    delayed_free: 0
}
```
One server replicating with udp, the other with tcp. I see TCP keepalives flowing, so no blocking any more.
No strange tm modparams:
```
# ----- tm params -----
# auto-discard branches from previous serial forking leg
modparam("tm", "failure_reply_mode", 3)
# default retransmission timeout: 30sec
modparam("tm", "fr_timer", 30000)
# default invite retransmission timeout after 1xx: 120sec
modparam("tm", "fr_inv_timer", 120000)
modparam("tm", "retr_timer1", 100)
modparam("tm", "retr_timer2", 4000)
```
After the TCP problems started, we added some tcp params to the config, just to verify the settings. Mostly defaults.
```
tcp_connection_lifetime=3200
tcp_accept_no_cl=yes
# Maximum write queue. If this is reached, the connection is killed.
tcp_conn_wq_max=10000
# Enables keepalive for tcp (sets SO_KEEPALIVE socket option)
tcp_keepalive = yes
# Time before starting to send keepalives, if the connection is idle (TCP_KEEPIDLE socket option). Linux only.
# Seconds
tcp_keepidle = 10
# Number of keepalives sent before dropping the connection (TCP_KEEPCNT socket option). Linux only.
tcp_keepcnt = 20
# Time interval between keepalive probes, when the previous probe failed (TCP_KEEPINTVL socket option). Linux only.
# Seconds
tcp_keepintvl = 20
# Time in seconds after which a TCP connection will be closed if it is not available for writing in this
# interval (and Kamailio wants to send something on it). Lower this value for faster detection of broken
# TCP connections. The default value is 10s.
tcp_send_timeout=10
# Number of SYN retransmissions before aborting a connect attempt (see linux tcp(7) TCP_SYNCNT). Linux only.
# Default in Linux seems to be 5, which corresponds to 180 seconds
tcp_syncnt = 2
```
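As a rough sanity check on those values (my arithmetic, assuming the options are applied as set): with tcp_keepidle=10, tcp_keepcnt=20 and tcp_keepintvl=20, a dead peer is only declared after about 10 + 20*20 = 410 seconds. The timers actually applied to a live connection can be inspected with ss:

```
# show established connections on the SIP port together with their timer state;
# the timer column reflects the keepalive settings in effect on the socket
ss -tno state established '( sport = :5060 or dport = :5060 )'
```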
Could you spot whether it appears from the first replication or only starts after a while? Do you have any test scenario that could be reused to reproduce it? As I wrote previously, I tested with the default config + t_replicate_tcp() for each register and it worked.
After three hours I have 14726 transactions waiting using around 300 accounts for REGISTER. I will restart now and see how fast I notice a new queue.
After 30 minutes, everything looks fine... So it takes a while before the TM queue starts to build
Started the test at 12:42; all seems quiet and current goes up and down around 20. At 13:14 it begins to climb again, so around 30 minutes of quiet. We have two-minute registration timeouts, so by the time it starts going astray we have processed 6762 transactions. All over UDP.
Quite strange ... any relevant log messages in syslog around the time when it goes astray? Are any new registrations still replicated (sent out) and replied?
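If you want something concrete to grep for, this is roughly what I would look at (the log file path is an assumption, it depends on your syslog setup):

```
# tm and shared-memory related messages around the time the queue starts growing
grep -E 'kamailio.*(tm \[|shm|memory)' /var/log/syslog | tail -n 100
```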
Would you be able to attach with gdb to one of the kamailio processes and look at the transaction structures?
Replication stops. I can't find anything unusual in the syslogs. I can attach GDB, but have no clue how to look at the transaction structures. Need some help :-)
Here is a gdb script snippet to print transaction structures from memory -- hopefully works ok, not tested here yet:
```
set $i=0
while($i<TABLE_ENTRIES)
  set $ehead = &_tm_table->entries[$i]
  set $tcell = _tm_table->entries[$i].next_c
  while($ehead!=(void*)$tcell)
    p $tcell
    p *$tcell
    set $tcell = $tcell->next_c;
  end
  set $i = $i + 1
end
```
Try to attach to the last udp worker, so traffic routing still goes ok. You may need to replace `TABLE_ENTRIES` with its value, 65536 (i.e. 2^16, but write it out, since `^` is XOR in gdb), if gdb complains that it cannot find it.
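A possible way to run it (the PID is a placeholder, adjust to your system; save the script above as, say, /tmp/tm_walk.gdb):

```
# find the PID of the last udp receiver process
kamcmd ps | grep "udp receiver"

# attach gdb to it and walk the transaction table
gdb -p <PID>
(gdb) source /tmp/tm_walk.gdb
(gdb) detach
(gdb) quit
```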
There is something fishy going on here. I removed t_replicate and the server has been running for six days without getting into trouble. However, when I run kamcmd tm.stats there are 767 waiting transactions that are not freed. It's a very basic configuration using the dialog module and the dialog event routes. A few calls, but not much over the weekend. I will try to debug a bit harder and find out where these come from. t_replicate really forced the issue and the server bogged down quickly, but it seems like ordinary calls also produce transactions that hang around. That's not good.
```
{
    current: 780
    waiting: 767
    total: 792
    total_local: 0
    rpl_received: 1389
    rpl_generated: 339
    rpl_sent: 1514
    6xx: 4
    5xx: 285
    4xx: 172
    3xx: 0
    2xx: 519
    created: 792
    freed: 11
    delayed_free: 0
}
```
Did you get the chance to run the gdb script from my previous comment?
Is the shm memory usage still growing without t_replicate?
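For the shm question, a periodic snapshot is enough to see the trend (core.shmmem is the core shared-memory RPC):

```
# shared memory usage; compare used/real_used between snapshots over time
kamcmd core.shmmem
```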
In master I added an rpc command that prints some details of the transactions in memory; that should make it easier in the future to see what's there.
I couldn't find a cause after looking around; likely something is triggered at some point, under certain conditions.
As a workaround, I am considering adding a hard cleanup on timer for transactions that live too long past their lifetime, with a log message to gather some useful information for troubleshooting.
If you can still get the output of the gdb script for such transactions, it would still be useful.
Any chance that you can test with the latest master branch? I pushed some new features to list transactions via rpc as well as to trigger a clean via rpc. An alternative would be to backport the latest tm module patches into your clone.
I will also add an option to do a hard clean on timer, but it would be good to know if it works when triggered via rpc. Also, listing the transactions via rpc will be useful when the issue shows up, as an alternative to the gdb script.
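A sketch of how to exercise them, assuming the new commands are exported as tm.list and tm.clean (check with `kamcmd system.listMethods` on your build):

```
# list the transactions currently held in memory
kamcmd tm.list

# force a cleanup of lingering transactions, then see if the counters drop
kamcmd tm.clean
kamcmd tm.stats
```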
I will try installing latest branch. This is the current stats, something is definitely keeping transactions open.
```
# kamcmd tm.stats
{
    current: 5060
    waiting: 5055
    total: 5065
    total_local: 0
    rpl_received: 5415
    rpl_generated: 266
    rpl_sent: 5543
    6xx: 1
    5xx: 4614
    4xx: 53
    3xx: 0
    2xx: 447
    created: 5065
    freed: 4
    delayed_free: 0
}
# kamcmd core.uptime
{
    now: Mon Sep 18 09:39:51 2017
    up_since: Tue Sep 12 12:12:25 2017
    uptime: 509246
}
```
Sorry, can't disturb the system until the middle of the week. Will test then.
Have you tried with gdb script? That should work with any version.
This is still a problem, even though we managed to push it back a little. After a few weeks of service, Kamailio stopped working properly because it was out of memory. We had to restart quickly because of other dependencies (important tests of devices), but I will try to isolate the problem with small configuration files this week.
Finally have a day where it's possible to run some tests. Will try to fire up GDB :-) Building latest 5.0 from git.
The server was running over Christmas with no issues (but no calls either). Registrations don't seem to be the major problem (and thus not replication). I can force a memory issue by running a lot of calls with SIPP.
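For reference, the kind of sipp run that triggers it looks roughly like this (target address, rate and call count are placeholders):

```
# built-in UAC scenario: 10 calls per second, 5000 calls in total
sipp -sn uac -r 10 -m 5000 192.0.2.20:5060
```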
Another observation is that the timer process runs at up to 99% of the CPU core it's running on (observed with top). I have moved dialog, usrloc, rtimer and everything else I can out of the generic timer process, so this points to a problem with TM and timers. I will review our settings there and keep testing.
What version are you running now there?
If the issue is reproducible with a lot of calls via sipp, then it sounds like it is not related to t_replicate() of registrations...
I mentioned that, but did not change the topic of the issue... Latest 5.0 from git being used here, not 5.1. Wanted to find this issue before upgrading.
While running SIPP and checking statistics, I see that inuse_transactions keeps going up, but active_transactions stays low:
```
tmx:active_transactions = 5
tmx:inuse_transactions = 1054
```
What is the difference between these?
```
dialog:active_dialogs = 638
dialog:early_dialogs = 5
dialog:expired_dialogs = 577
dialog:failed_dialogs = 4
dialog:processed_dialogs = 80
```
Running at 80 calls, it seems like dialog has over 600 active dialogs. Something fishy here.
After a while, I can clearly see inuse_transactions growing:
```
tmx:2xx_transactions = 3472
tmx:3xx_transactions = 0
tmx:4xx_transactions = 1078
tmx:5xx_transactions = 2913
tmx:6xx_transactions = 2
tmx:UAC_transactions = 5554
tmx:UAS_transactions = 7323
tmx:active_transactions = 2
tmx:inuse_transactions = 7322
tmx:rpl_absorbed = 463
tmx:rpl_generated = 1679
tmx:rpl_received = 8119
tmx:rpl_relayed = 7656
tmx:rpl_sent = 9335
```
Can you walk the transaction hash table with gdb? I provided the gdb script in a comment above.
This server was running on VMware. We tested with another server and could not see the same behaviour. Still trying to find any differences while installing a third server to test with... Life.
Must have been a third-party library or something that messed up Kamailio. We reinstalled the server, built a new Kamailio and the problem is totally gone. Will continue testing a few days more, but will likely just close this issue and go hide somewhere.
OK, good that it was sorted out; tm is quite important and used pretty much everywhere.
Closed #1220.
Closing it for now. If something new related appears, reopen or create a new item.