Fixes for SEGV in dialog module.

Pre-Submission Checklist

Type Of Change

Checklist:

Description

Hi !

The problem - periodically, 2-3 times in а month, kamailio dumps core with SEGV.

Different OS versions (Ubuntu 18,20,22), different kamailio versions from 5.7.1 up to 5.8.4,
high call load, python kemi, actively used dialog module.

Analyzing core files shows more or less the same picture - SEGV happends in dialog mododule
around the dialog vars, can be in rpc call (dlg.list), in destroy dlg, in accessing dlg variable.

Backtrace can look like:

Core was generated by `/usr/sbin/kamailio -P /run/kamailio/kamailio.pid -f /etc/kamailio/kamailio.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_dlg_variable_unsafe (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:236
236     dlg_var.c: No such file or directory.
(gdb) bt full
#0  get_dlg_variable_unsafe (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:236
        var = 0x8
        var_list = <optimized out>
#1  0x00007fa2a2ce0397 in get_dlg_varref (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:297
        var = 0x0
        __func__ = "get_dlg_varref"
#2  0x00007fa2a2ccd429 in ki_dlg_var_get_mode (msg=<optimized out>, name=0x7ffe67990d68, rmode=0) at dialog.c:2462
        dlg = 0x7fa2818849b8
        pval = <optimized out>
#3  0x000055e2d5b87cbb in sr_kemi_exec_func (ket=ket@entry=0x7fa2a2d0c0c0 <sr_kemi_dialog_exports+1728>, msg=msg@entry=0x7fa2a34577c0, pno=pno@entry=1, vps=vps@entry=0x7ffe67990d60) at core/kemiexec.c:82
        ret = <optimized out>
        __func__ = "sr_kemi_exec_func"
#4  0x00007fa2a2930376 in sr_apy_kemi_exec_func_ex (ket=ket@entry=0x7fa2a2d0c0c0 <sr_kemi_dialog_exports+1728>, self=self@entry=0x7fa280ce2b80, args=args@entry=0x7fa2804ba3a0, idx=idx@entry=485) at apy_kemi.c:327
        fname = <optimized out>
        i = <optimized out>
        ret = <optimized out>
        vps = {{vtype = 2, v = {n = -2142122656, l = 140335914274144, s = {s = 0x7fa28051cd60 "rtpe_setid", len = 10}, dict = 0x7fa28051cd60}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {
              n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {n = 0, l = 0, s = {s = 0x0, len = 0}, dict = 0x0}}, {vtype = 0, v = {n = 0, l = 0,
              s = {s = 0x0, len = 0}, dict = 0x0}}}
        env_P = <optimized out>
        lmsg = 0x7fa2a34577c0
        xret = <optimized out>
        slen = 10
        alen = 1
        pobj = <optimized out>
        __func__ = "sr_apy_kemi_exec_func_ex"
#5  0x00007fa2a2931f45 in sr_apy_kemi_exec_func (self=0x7fa280ce2b80, args=0x7fa2804ba3a0, idx=485) at apy_kemi.c:356
        ket = 0x7fa2a2d0c0c0 <sr_kemi_dialog_exports+1728>
        ret = 0x0
        pstate = 0x0
        pframe = 0x0
        tvb = {tv_sec = 0, tv_usec = 0}
        tve = {tv_sec = 0, tv_usec = 0}
        tz = {tz_minuteswest = 2, tz_dsttime = 0}
        tdiff = <optimized out>
        __func__ = "sr_apy_kemi_exec_func"
#6  0x00007fa2a266f697 in ?? () from /lib/x86_64-linux-gnu/libpython3.8.so.1.0

Inspecting dlg vars, the chain look like this:

(gdb) frame 0
#0  get_dlg_variable_unsafe (dlg=dlg@entry=0x7fa2818849b8, key=key@entry=0x7ffe67990d68) at dlg_var.c:236
236     in dlg_var.c
(gdb) p *dlg.vars
$1 = {key = {s = 0x7fa2833ea040 "mos", len = 3}, value = {s = 0x7fa2833ea0b0 "4.3", len = 3}, vflags = 1, next = 0x7fa2825fa3d0}
(gdb) p *$.next
$2 = {key = {s = 0x7fa200332e34 <error: Cannot access memory at address 0x7fa200332e34>, len = -1061109568}, value = {s = 0xabcdefed <error: Cannot access memory at address 0xabcdefed>, len = 49}, vflags = 1, next = 0x8}
(gdb)

Additionaly, inspecting logs i found that this is happends in SIP race conditions, f.e. when caller and calle sends BYE at the same time and
kamailio processing two different transactions of same dialog in two different workers, so present some competition from identical scenarios.
Long time i can't reproduce this in lab, until insert an artificial delay into the scenario which process BYE request to increase competition.
This code quickly causes SEGV in race conditions with BYE:

#kemi get/set dlg var
while time.time()-start < 0.5:
    for i in ('bbb','ccc'):
        dlgvar=KSR.dialog.var_get(i)
        KSR.dialog.var_sets(i,str(random.randint(1,500000)))

SEGV not fire if i change dialog var_get/set to kamscript variant:

while time.time()-start < 0.5:
    for i in ('bbb','ccc'):
        dlgvar=KSR.pv.get(f"$dlg_var({i})")
        KSR.pv.sets(f"$dlg_var({i})", str(random.randint(1,500000)))

By analyzing source i found some places which are not properly covered by locks, and some dangerous places
where direct pointer to dialog var structure used, which can change in concurrent process,
so in first case i insert the locks, in second - it is better to use the clone of variable in private PV buffer.
Also found some mistakes with dialog flags, which can potentionaly affect DMQ and DB dialog behavior.


You can view, comment on, or merge this pull request online at:

  https://github.com/kamailio/kamailio/pull/4151

Commit Summary

File Changes

(3 files)

Patch Links:


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <kamailio/kamailio/pull/4151@github.com>