### Description

While running a simple load test with one UAC (sipp) and two UAS (both sipp), with Kamailio in the middle acting as a call-stateful proxy, I observed that the dispatcher module reports one call as stuck even though the dialog module reports no active calls.
Setup:

    UAC (sipp) --> Kamailio (Call Stateful Proxy) --+--> UAS1 (sipp)
                                                    +--> UAS2 (sipp)
Scenario: a simple INVITE -- 200 OK -- ACK flow, with the BYE sent by the UAC after 60 seconds.
### Troubleshooting
It is worth noting that this is purely a timing issue: it generally shows up only after very long runs, around 24 hours or so, but it can also crop up within a few hours.
#### Reproduction
Just run the calls via sipp as described in the setup.
#### Debugging Data
**OUTPUT after cleanly stopping the test:**
Output of "kamctl stats dialog" command
{ "jsonrpc": "2.0", "result": [ "dialog:active_dialogs = 0", "dialog:early_dialogs = 0", "dialog:expired_dialogs = 0", "dialog:failed_dialogs = 959", "dialog:processed_dialogs = 1695662" ], "id": 21282 }
Output of "kamcmd dispatcher.list"

    {
        NRSETS: 1
        RECORDS: {
            SET: {
                ID: 1
                TARGETS: {
                    DEST: {
                        URI: sip:10.214.3.20:5060;transport=udp
                        FLAGS: AP
                        PRIORITY: 0
                        ATTRS: {
                            BODY: duid=sample-cas-1;maxload=1000
                            DUID: sample-cas-1
                            MAXLOAD: 1000
                            WEIGHT: 0
                            RWEIGHT: 0
                            SOCKET:
                        }
                        LATENCY: {
                            AVG: 2.409000
                            STD: 108.999000
                            EST: 1.124000
                            MAX: 9503
                            TIMEOUT: 22
                        }
                        RUNTIME: {
                            DLGLOAD: 997
                        }
                    }
                    DEST: {
                        URI: sip:10.214.3.19:5060;transport=udp
                        FLAGS: AP
                        PRIORITY: 0
                        ATTRS: {
                            BODY: duid=sample-cas-0;maxload=1000
                            DUID: sample-cas-0
                            MAXLOAD: 1000
                            WEIGHT: 0
                            RWEIGHT: 0
                            SOCKET:
                        }
                        LATENCY: {
                            AVG: 1.429000
                            STD: 40.279000
                            EST: 0.999000
                            MAX: 3502
                            TIMEOUT: 22
                        }
                        RUNTIME: {
                            DLGLOAD: 985
                        }
                    }
                }
            }
        }
    }
NOTE: The dispatcher shows a large number of calls stuck after the overnight load, and as a result Kamailio's dispatcher module started answering most calls with 404. We have since kept a close eye on the tests and caught an iteration where, initially, only a single call was stuck.
### Possible Solutions
The problem most likely lies in the dispatcher module, which does not guard the per-destination load counter (the 'dload' variable, to be precise) against concurrent access. Concurrent increments/decrements of 'dload' from different worker processes can therefore be lost, and the dispatcher ends up reporting spurious values.
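To make the suspected race concrete, here is a minimal standalone sketch (my own illustration, not Kamailio code): two processes doing an unguarded read-modify-write on a counter in shared memory lose updates, which is exactly what would leave 'dload' out of sync with the real number of active calls.

```c
/*
 * Minimal standalone sketch of the suspected race (not Kamailio code):
 * two processes doing an unguarded read-modify-write on a shared counter.
 * With enough iterations the final value is usually lower than expected,
 * i.e. some increments are lost.
 */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 1000000

int main(void)
{
	/* counter in shared memory, like dset->dlist[dst].dload */
	int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if(counter == MAP_FAILED)
		return 1;
	*counter = 0;

	pid_t pid = fork();
	for(int i = 0; i < ITERATIONS; i++) {
		(*counter)++; /* unguarded ++ on a plain int: not atomic */
	}
	if(pid == 0)
		return 0; /* child is done */

	waitpid(pid, NULL, 0);
	/* expected 2*ITERATIONS, but lost updates make it smaller */
	printf("expected %d, got %d\n", 2 * ITERATIONS, *counter);
	return 0;
}
```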
FIX: The following needs to be done (a consolidated sketch follows this list).

1. File: **dispatch.c**, function: **ds_load_add()**: take a lock just before the `dset->dlist[dst].dload++;` line and release it right after, like:

   **lock_get(&dset->lock);** dset->dlist[dst].dload++; **lock_release(&dset->lock);**

2. File: **dispatch.c**, function: **ds_load_replace()**: the decrement needs a lock as well, like:

   **lock_get(&idx->lock);** if(idx->dlist[olddst].dload > 0) { idx->dlist[olddst].dload--; **lock_release(&idx->lock);** ...... } else { **lock_release(&idx->lock);** ...... }

3. File: **dispatch.c**, function: **ds_load_remove_byid()**: the decrement needs a lock as well, like:

   **lock_get(&idx->lock);** if(idx->dlist[olddst].dload > 0) { idx->dlist[olddst].dload--; **lock_release(&idx->lock);** ........... } else { **lock_release(&idx->lock);** ........ }
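For readability, here are the same proposed changes as one consolidated sketch (variable and field names are taken from my local copy of dispatch.c and may differ slightly between versions):

```c
/* Proposed guard in ds_load_add(): increment under the set lock. */
lock_get(&dset->lock);
dset->dlist[dst].dload++;
lock_release(&dset->lock);

/* Proposed guard for the decrement path, used in both ds_load_replace()
 * and ds_load_remove_byid(): the check and the decrement must happen
 * under the same lock. */
lock_get(&idx->lock);
if(idx->dlist[olddst].dload > 0) {
	idx->dlist[olddst].dload--;
	lock_release(&idx->lock);
	/* ... continue with the normal path ... */
} else {
	lock_release(&idx->lock);
	/* ... error / cleanup path ... */
}
```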
NOTE: I believe the lock should also be taken in the `ds_get_leastloaded()` API in dispatch.c, because otherwise we may select an incorrect destination, especially when the destinations are at roughly the same load and that load is close to the 'maxload' value configured in the dispatcher list file. But I will wait for the community's comments on this.
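As a rough illustration of what I mean for the least-loaded selection (my own sketch, assuming the usual `ds_set_t` fields; the actual function body may differ), the whole scan would run under the same lock so the comparison against 'maxload' sees consistent 'dload' values:

```c
/* Rough sketch only -- not the actual ds_get_leastloaded() body. */
static int ds_get_leastloaded_sketch(ds_set_t *dset)
{
	int j;
	int k = -1;          /* index of the least loaded destination */
	int t = 0x7fffffff;  /* current best (lowest) load seen */

	lock_get(&dset->lock);
	for(j = 0; j < dset->nr; j++) {
		/* skip destinations that already reached their configured maxload */
		if(dset->dlist[j].attrs.maxload != 0
				&& dset->dlist[j].dload >= dset->dlist[j].attrs.maxload)
			continue;
		if(dset->dlist[j].dload < t) {
			k = j;
			t = dset->dlist[j].dload;
		}
	}
	lock_release(&dset->lock);
	return k;
}
```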
The above fix solved my issue, so please review it; if you deem it suitable, I can go ahead and deliver the fix.
### Additional Information
Kamailio Version:

    [root@CPaaSVM ~]# kamailio -v
    version: kamailio 5.3.2 (x86_64/linux) 7ba545
    flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
    ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
    poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
    id: 7ba545
    compiled on 19:01:39 May 1 2020 with gcc 4.8.5
NOTE: I do not think this is version specific; it probably exists in later versions as well.
* **Operating System**: CentOS 7.7
Thanks for the detailed report. Your description sounds reasonable, as the variable is indeed in shared memory and is accessed from different processes. The same problem should also apply to the `ds_get_leastloaded()` call.
Could you create a pull request with the changes against git master? It can then be applied and also backported to the stable versions.
Thanks for the quick look and review henningw. I'll try to do this tonight after work, or by tomorrow at the latest. I am in the IST timezone.
I am pretty sure the increment/decrement operations on integers are atomic, so using locks only to guard the increment/decrement should not be needed.
For the if-check-then-update case it makes sense. I will try to analyse your report in more detail and see what would be a good solution from my point of view. Probably we can use an atomic variable for the counter field.
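To illustrate the atomic-variable option (a generic C11 sketch for discussion, not the Kamailio core atomic API), the plain increment becomes a fetch-and-add, while the check-then-decrement needs a compare-and-swap loop so the test and the update are a single atomic step:

```c
/* Generic illustration using C11 atomics -- not the Kamailio core API.
 * Works across processes as long as the atomic_int is lock-free and
 * lives in shared memory. */
#include <stdatomic.h>

typedef struct sketch_dest {
	atomic_int dload; /* would replace the plain int counter */
} sketch_dest_t;

static void load_inc(sketch_dest_t *d)
{
	atomic_fetch_add(&d->dload, 1);
}

static void load_dec(sketch_dest_t *d)
{
	/* decrement only if still positive: the CAS loop makes the check
	 * and the update one atomic step */
	int cur = atomic_load(&d->dload);
	while(cur > 0
			&& !atomic_compare_exchange_weak(&d->dload, &cur, cur - 1)) {
		/* cur was refreshed by the failed CAS; retry */
	}
}
```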
Thanks Daniel, you know this code better. About the atomic integers, I thought the same. But since the reporter had issues I did some quick research, and it seems it may also depend on the architecture and on alignment. But maybe the issue was just caused by the if-check-then-update case.
Thanks Daniel, I will await your response on the issue then, regarding using the lock or atomic integers. I presume atomic integers would be faster than the locking approach. FWIW, similar locking is already used in the 'dp_init_relative_weights()' and 'ds_update_latency()' APIs in dispatch.c.
I have little experience with this algorithm, but looking at the dispatcher latency stats I noticed some timeouts. I am not sure about your config, but I wonder whether a destination went into inactive probing at some point and came back. It may be worth looking at what could go wrong when a destination goes active, inactive, and active again in such a scenario. Or reload the dispatcher before your test to start with clean stats, just to be sure.
Hi kamailio-sync, thanks for the inputs. However, I was referring to the lock used in the function that calculates the dispatcher latency stats, and suggesting to use the same lock while accessing the 'dload' parameter, so as to avoid concurrent access issues, since 'dload' is a non-atomic integer as of today. Daniel, however, said he would take a further look and suggest what suits better, i.e. using the lock or making 'dload' an atomic integer.
@hbilling - I went through the code and for now I pushed a commit based on your remarks, defining macros for increment/decrement guarded by the destination group lock. The increment is done in a single place and the decrement in two places, so I am not sure whether using an atomic variable there would give any significant gain. When I get some time for more testing, I will play with variants. `ds_get_leastloaded()` was also updated to use the dst group lock.
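Roughly, the guarded counter updates look like the following (a simplified sketch for discussion; the exact code and macro names in the commit may differ):

```c
/* Simplified sketch of lock-guarded load counter updates; macro names
 * here are illustrative only, not necessarily those in the commit. */
#define DS_LOAD_INC(dset, i) do { \
		lock_get(&(dset)->lock); \
		(dset)->dlist[i].dload++; \
		lock_release(&(dset)->lock); \
	} while(0)

#define DS_LOAD_DEC(dset, i) do { \
		lock_get(&(dset)->lock); \
		if((dset)->dlist[i].dload > 0) { \
			(dset)->dlist[i].dload--; \
		} \
		lock_release(&(dset)->lock); \
	} while(0)
```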
Test, and if all is OK, report back so it can be backported.
Closed #2322.
Thanks Daniel, the changes seem to be fine. Please go ahead and make them available for the previous releases as well.