Fixes for SEGV in dialog module.
<!-- Kamailio Pull Request Template -->
<!-- IMPORTANT: - for detailed contributing guidelines, read: https://github.com/kamailio/kamailio/blob/master/.github/CONTRIBUTING.md - pull requests must be done to master branch, unless they are backports of fixes from master branch to a stable branch - backports to stable branches must be done with 'git cherry-pick -x ...' - code is contributed under BSD for core and main components (tm, sl, auth, tls) - code is contributed GPLv2 or a compatible license for the other components - GPL code is contributed with OpenSSL licensing exception -->
#### Pre-Submission Checklist <!-- Go over all points below, and after creating the PR, tick all the checkboxes that apply --> <!-- All points should be verified, otherwise, read the CONTRIBUTING guidelines from above--> <!-- If you're unsure about any of these, don't hesitate to ask on sr-dev mailing list --> - [x] Commit message has the format required by CONTRIBUTING guide - [x] Commits are split per component (core, individual modules, libs, utils, ...) - [x] Each component has a single commit (if not, squash them into one commit) - [x] No commits to README files for modules (changes must be done to docbook files in `doc/` subfolder, the README file is autogenerated)
#### Type Of Change - [ ] Small bug fix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds new functionality) - [ ] Breaking change (fix or feature that would change existing functionality)
#### Checklist: <!-- Go over all points below, and after creating the PR, tick the checkboxes that apply --> - [ ] PR should be backported to stable branches - [ ] Tested changes locally - [ ] Related to issue #XXXX (replace XXXX with an open issue number)
#### Description Hi !
The problem - periodically, 2-3 times in а month, kamailio dumps core with SEGV.
Different OS versions (Ubuntu 18,20,22), different kamailio versions from 5.7.1 up to 5.8.4, high call load, python kemi, actively used dialog module.
Analyzing core files shows more or less the same picture - SEGV happends in dialog mododule around the dialog vars, can be in rpc call (dlg.list), in destroy dlg, in accessing dlg variable.
Backtrace can look like:
``` Core was generated by `/usr/sbin/kamailio -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg'. Program terminated with signal SIGSEGV, Segmentation fault. #0 get_dlg_variable_unsafe (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:236 236 dlg_var.c: No such file or directory. (gdb) bt full #0 get_dlg_variable_unsafe (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:236 var = 0x8 var_list = <optimized out> #1 0x00007fa2a2ce0397 in get_dlg_varref (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:297 var = 0x0 __func__ = "get_dlg_varref" #2 0x00007fa2a2ccd429 in ki_dlg_var_get_mode (msg=<optimized out>, name=0x7ffe67990d68, rmode=0) at dialog.c:2462 dlg = 0x7fa2818849b8 pval = <optimized out> #3 0x000055e2d5b87cbb in sr_kemi_exec_func (ket=ket@entry=0x7fa2a2d0c0c0 <sr_kemi_dialog_exports+1728>, msg=msg@entry=0x7fa2a34577c0, pno=pno@entry=1, vps=vps@entry=0x7ffe67990d60) at core/kemiexec.c:82 ret = <optimized out> __func__ = "sr_kemi_exec_func" #4 0x00007fa2a2930376 in sr_apy_kemi_exec_func_ex (ket=ket@entry=0x7fa2a2d0c0c0 <sr_kemi_dialog_exports+1728>, self=self@entry=0x7fa280ce2b80, args=args@entry=0x7fa2804ba3a0, idx=idx@entry=485) at apy_kemi.c:327 fname = <optimized out> i = <optimized out> ret = <optimized out> vps = {{vtype = 2, v = {n = -2142122656, l = 140335914274144, s = {s = 0x7fa28051cd60 "rtpe_setid", len = 10}, dict = 0x7fa28051cd60}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = { n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}} env_P = <optimized out> lmsg = 0x7fa2a34577c0 xret = <optimized out> slen = 10 alen = 1 pobj = <optimized out> __func__ = "sr_apy_kemi_exec_func_ex" #5 0x00007fa2a2931f45 in sr_apy_kemi_exec_func (self=0x7fa280ce2b80, args=0x7fa2804ba3a0, idx=485) at apy_kemi.c:356 ket = 0x7fa2a2d0c0c0 <sr_kemi_dialog_exports+1728> ret = 0x0 pstate = 0x0 pframe = 0x0 tvb = {tv_sec = 0, tv_usec = 0} tve = {tv_sec = 0, tv_usec = 0} tz = {tz_minuteswest = 2, tz_dsttime = 0} tdiff = <optimized out> __func__ = "sr_apy_kemi_exec_func" #6 0x00007fa2a266f697 in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0
```
Inspecting dlg vars, the chain look like this:
``` (gdb) frame 0 #0 get_dlg_variable_unsafe (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:236 236 in dlg_var.c (gdb) p *dlg.vars $1 = {key = {s = 0x7fa2833ea040 "mos", len = 3}, value = {s = 0x7fa2833ea0b0 "4.3", len = 3}, vflags = 1, next = 0x7fa2825fa3d0} (gdb) p *$.next $2 = {key = {s = 0x7fa200332e34 <error: Cannot access memory at address 0x7fa200332e34>, len = -1061109568}, value = {s = 0xabcdefed <error: Cannot access memory at address 0xabcdefed>, len = 49}, vflags = 1, next = 0x8} (gdb) ```
Additionaly, inspecting logs i found that this is happends in SIP race conditions, f.e. when caller and calle sends BYE at the same time and kamailio processing two different transactions of same dialog in two different workers, so present some competition from identical scenarios. Long time i can't reproduce this in lab, until insert an artificial delay into the scenario which process BYE request to increase competition. This code quickly causes SEGV in race conditions with BYE:
``` #kemi get/set dlg var while time.time()-start < 0.5: for i in ('bbb','ccc'): dlgvar=KSR.dialog.var_get(i) KSR.dialog.var_sets(i,str(random.randint(1,500000))) ```
SEGV not fire if i change dialog var_get/set to kamscript variant:
``` while time.time()-start < 0.5: for i in ('bbb','ccc'): dlgvar=KSR.pv.get(f"$dlg_var({i})") KSR.pv.sets(f"$dlg_var({i})", str(random.randint(1,500000)))
```
By analyzing source i found some places which are not properly covered by locks, and some dangerous places where direct pointer to dialog var structure used, which can change in concurrent process, so in first case i insert the locks, in second - it is better to use the clone of variable in private PV buffer. Also found some mistakes with dialog flags, which can potentionaly affect DMQ and DB dialog behavior.
You can view, comment on, or merge this pull request online at:
https://github.com/kamailio/kamailio/pull/4151
-- Commit Summary --
* dialog: covering by locks accessing/changing variables in kemi functions
-- File Changes --
M src/modules/dialog/dialog.c (36) M src/modules/dialog/dlg_hash.h (2) M src/modules/dialog/dlg_var.c (2)
-- Patch Links --
https://github.com/kamailio/kamailio/pull/4151.patch https://github.com/kamailio/kamailio/pull/4151.diff
henningw left a comment (kamailio/kamailio#4151)
Thanks for the PR, good point with the additional locks and also the error related to the DMQ flags. There have been similar reports (e.g. #2923) for crashes related to the dialog profiles. What was the reasoning by changing the string memory to static and also passing str by value instead of reference? Is this also related to the crash fix?
Den4t left a comment (kamailio/kamailio#4151)
What was the reasoning by changing the string memory to static and also passing str by value instead of reference? Is this also related to the crash fix?
Hi ! I think you mean this in ki_dlg_get_var_helper: static str val;
This is because of natute of function get_dlg_varval, whiсh i reuse it to clone dlg var in private pv buffer, we can't use here stack variable (without static), because no garanty the stack is changed after return from ki_dlg_get_var_helper (if I understood the question correctly), using get_dlg_varref as before is unsafe, because of data under the pointer can be changed in concurrent process.
Den4t left a comment (kamailio/kamailio#4151)
Thanks for the PR, good point with the additional locks and also the error related to the DMQ flags. There have been similar reports (e.g. #2923) for crashes related to the dialog profiles.
I do not analyze dialog profile functions, but think it is possible that they can have similar problems.
Closed #4151.
miconda left a comment (kamailio/kamailio#4151)
Thanks! I split manually the PR in 2, doing differently the part for dlg variables to use directly the static kemi global var, the second being about fixes to the flags.
If you find any related problems, open a bug on the tracker.
Den4t left a comment (kamailio/kamailio#4151)
Thanks! I split manually the PR in 2, doing differently the part for dlg variables to use directly the static kemi global var, the second being about fixes to the flags.
If you find any related problems, open a bug on the tracker.
Thanks !
Should we expect backports?