[kamailio/kamailio] Kamailio 5.4.x dialog profiles functionality dead-lock on certain high-load scenarios (Issue #2923) - sr-dev

8 Nov 2021


      ### Description
The Kamailio 5.4.x dialog profiles functionality can lead to dead-lock on certain high-load scenarios.
The Kamailio dialog profiles are used to track parallel channels for about 200 outgoing PSTN carrier interconnections. During high traffic times (like several thousands parallel calls) the Kamailio server will frequently (e.g. hourly) goes into an end-less loop while executing get_profile_size in the configuration script. This causes the locking for the dialog profiles never be released and Kamailio will stop serving traffic. Internal monitoring tools and RPC commands stay working, as long as they do not touch the dialog functionality.
A similar (dedicated) Kamailio setup is used for tracking parallel channels for customers. Here the dead-lock is not observed that frequently, but aparentely also some crashes happens in a much longer time interval.
### Troubleshooting
After analysis of the back-traces with GDB the get_profile_size() function was removed from the configuration script. After this change the crash did not happened anymore for several days.
#### Reproduction
Issue could not be reproduced so far.
#### Debugging Data
##### bt 1 (some data removed)
(gdb) bt
# 0  0x00007f57cf3b00da in get_profile_size (profile=0x7f50ccbc7e80, value=0x7ffd9928f300) at dlg_profile.c:859
        n = 364
        i = 12
        ph = 0x7f50d3e4b7d0
# 1  0x00007f57cf419c67 in w_get_profile_size_helper (msg=0x7f57d699d418, profile=0x7f50ccbc7e80, value=0x7ffd9928f300, spd=0x7f57d6916960) at dialog.c:941
# 2  0x00007f57cf41a459 in w_get_profile_size3 (msg=0x7f57d699d418, profile=0x7f50ccbc7e80, value=0x7f57d6935118, result=0x7f57d6916960) at dialog.c:982
# 3  0x0000000000463fea in do_action (h=0x7ffd99293610, a=0x7f57d6936488, msg=0x7f57d699d418) at core/action.c:1094
# 4  0x00000000004711ee in run_actions (h=0x7ffd99293610, a=0x7f57d6936488, msg=0x7f57d699d418) at core/action.c:1581
# 5  0x000000000046058b in do_action (h=0x7ffd99293610, a=0x7f57d690fda8, msg=0x7f57d699d418) at core/action.c:700
The first back-trace was taking from a running process with gdb. The counter in f0 does not increased that much during this time, probably due the overflow of the loop counter.
##### bt2 (analysis with data structure with gdb scripts)
Here the loop counter in f0 showed a really high value. Expected size of dialog profiles hash table:
(gdb) p profile->entries[3]
$4 = {first = 0x7f9bfd4aad98, content = 2068}
(gdb) p profile->entries[7]
$3 = {first = 0x7f9c12079f70, content = 784}
(gdb) p profile->entries[12]
$6 = {first = 0x7f9c02be5d50, content = 7600}
(gdb) p profile->entries[14]
$2 = {first = 0x7f9bff636de8, content = 6764}
hash table bucket 14 shows a lot of corruption and the loop never ends (carrier names and IPs replaced). The list for hash bucket 7 got linked to the list for hash bucket 14:
counter 6755: prev 0x7f9c0b9dcde0 - current 0x7f9c02e5b378 - next 0x7f9c0a5f9ba0 - value carrier1-XX.XX - hash 14
counter 6756: prev 0x7f9c02e5b378 - current 0x7f9c0a5f9ba0 - next 0x7f9c0860b968 - value carrier1-XX.XX▒▒▒▒ - hash 14
counter 6757: prev 0x7f9c0a5f9ba0 - current 0x7f9c0860b968 - next 0x7f9bfe3f3a78 - value carrier1-XX.XX▒▒▒▒ - hash 14
counter 6758: prev 0x7f9c0860b968 - current 0x7f9bfe3f3a78 - next 0x7f9c10d977f0 - value carrier1-XX.XX - hash 14
counter 6759: prev 0x7f9bfe3f3a78 - current 0x7f9c10d977f0 - next 0x7f9c0ae198b0 - value carrier2-XX.XX▒▒▒▒ - hash 7
counter 6760: prev 0x7f9c10d977f0 - current 0x7f9c0ae198b0 - next 0x7f9c12079f70 - value carrier3-XX.XX - hash 7
counter 6761: prev 0x7f9c0ae198b0 - current 0x7f9c12079f70 - next 0x7f9c011f2540 - value-carrier2-XX.XX▒▒▒▒ - hash 7
counter 6762: prev 0x7f9c12079f70 - current 0x7f9c011f2540 - next 0x7f9bfff886f0 - value carrier2-XX.XX▒▒▒▒ - hash 7
counter 6763: prev 0x7f9c011f2540 - current 0x7f9bfff886f0 - next 0x7f9c05db00a8 - value carrier3-XX.XX= - hash 7
[...]
counter 28270: prev 0x7f9c019d06e8 - current 0x7f9bfaf18290 - next 0x7f9c12c90680 - value carrier2-XX.XX▒▒▒▒ - hash 7
counter 28271: prev 0x7f9bfaf18290 - current 0x7f9c12c90680 - next 0x7f9c086a2b58 - value-carrier2-XX.XX▒▒▒▒ - hash 7
counter 28272: prev 0x7f9c12c90680 - current 0x7f9c086a2b58 - next 0x7f9c0b4f09e8 - value carrier2-XX.XX▒▒▒▒ - hash 7
[...]
hash table bucket 7 is still consistent regarding the loop, but already shows initial sign of corruption. There is one item of the list for hash bucket 14 visible:
counter 780: prev 0x7f9c0db57ac8 - current 0x7f9c02225700 - next 0x7f9bfbf7db08 - value carrier2-XX.XX▒▒▒▒ - hash 7
counter 781: prev 0x7f9c02225700 - current 0x7f9bfbf7db08 - next 0x7f9c10d977f0 - value carrier1-XX.XX- hash 14
counter 782: prev 0x7f9bfe3f3a78 - current 0x7f9c10d977f0 - next 0x7f9c0ae198b0 - value carrier2-XX.XX▒▒▒▒ - hash 7
counter 783: prev 0x7f9c10d977f0 - current 0x7f9c0ae198b0 - next 0x7f9c12079f70 - value carrier3-XX.XX - hash 7
total size of hash table is 784
#### Log Messages
No special log messages observed.
#### SIP Traffic
SIP traffic looked ok during analysis of the core dumps.
### Possible Solutions
* adding additional safe-guards for the get_profile_size function to not access data from other hash buckets
* stopping the loop counter after some threshold
* finding and fixing the source of the internal data corruption (obviously)
* refactoring the dialog modules to use another approach for storing the dialog profile information
### Additional Information
* **Kamailio version**:
Kamailio 5.4.7, compiled from git repository
* **Operating System**:
CentOS 7.9
-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2923