dmq_usrloc: fix deadlock

Deadlock caused by acquiring the same lock twice, via `get_urecord_by_ruid` and `lock_udomain`.

You can view, comment on, or merge this pull request online at:
https://github.com/kamailio/kamailio/pull/911
-- Commit Summary --
* dmq_usrloc: fix deadlock
-- File Changes --
M src/modules/dmq_usrloc/usrloc_sync.c (5)
-- Patch Links --
https://github.com/kamailio/kamailio/pull/911.patch
https://github.com/kamailio/kamailio/pull/911.diff
jchavanton commented on this pull request.
```
@@ -72,6 +70,7 @@ static int add_contact(str aor, ucontact_info_t* ci)
 	// Search by ruid, if possible
 	res = dmq_ul.get_urecord_by_ruid(_d, dmq_ul.get_aorhash(&aor), &ci->ruid, &r, &c);
```
`get_urecord_by_ruid` also acquires the domain lock; this is why we must call `lock_udomain` only after it returns.
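A self-contained sketch of that locking convention (hypothetical names, with a plain pthread mutex standing in for the usrloc domain lock): the lookup returns holding the lock on a hit, so the caller must take the lock itself only on the miss path.

```
#include <pthread.h>
#include <string.h>

static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;
static const char *stored_ruid = "ruid-1";

/* Models the convention of get_urecord_by_ruid(): take the lock,
 * and on a hit return 0 with the lock STILL HELD; on a miss,
 * release the lock and return -1. Hypothetical stand-in, not the
 * real usrloc API. */
static int find_record(const char *ruid)
{
	pthread_mutex_lock(&domain_lock);
	if (strcmp(ruid, stored_ruid) == 0)
		return 0;    /* found: caller now owns the lock */
	pthread_mutex_unlock(&domain_lock);
	return -1;           /* not found: lock released */
}

static void add_contact(const char *ruid)
{
	if (find_record(ruid) != 0) {
		/* miss: the lookup released the lock, take it ourselves */
		pthread_mutex_lock(&domain_lock);
	}
	/* hit: the lock is already held; locking again here would
	 * self-deadlock, which is the bug this PR fixes */

	/* ... update or insert the record ... */

	pthread_mutex_unlock(&domain_lock);
}

int main(void)
{
	add_contact("ruid-1");  /* hit path */
	add_contact("ruid-2");  /* miss path */
	return 0;
}
```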
While running the CI tests, kamctl / the MI FIFO was hanging; after further investigation, it seems it was trying to acquire a lock that the DMQ module was stuck holding.
Processes
```
Process:: ID=0 PID=6 Type=main process - attendant
Process:: ID=1 PID=168 Type=udp receiver child=0 sock=0.0.0.0:5060 ! (172.19.0.12:5060) !
Process:: ID=2 PID=169 Type=udp receiver child=1 sock=0.0.0.0:5060 (172.19.0.12:5060) !
Process:: ID=3 PID=170 Type=slow timer
Process:: ID=4 PID=171 Type=timer
Process:: ID=5 PID=172 Type=secondary timer
Process:: ID=6 PID=173 Type=MI FIFO
Process:: ID=7 PID=174 Type=ctl handler !
Process:: ID=8 PID=175 Type=RTIMER SEC EXEC
Process:: ID=9 PID=176 Type=RTIMER SEC EXEC
Process:: ID=10 PID=177 Type=HTable Timer
Process:: ID=11 PID=178 Type=USRLOC Timer
Process:: ID=12 PID=179 Type=USRLOC Timer
Process:: ID=13 PID=180 Type=TIMER NH
Process:: ID=14 PID=181 Type=TIMER NH
Process:: ID=15 PID=182 Type=DMQ WORKER
Process:: ID=16 PID=183 Type=DMQ WORKER
Process:: ID=17 PID=184 Type=tcp main process
```
gdb /usr/sbin/kamailio 173
```
(gdb) bt
#0  0x0000785ee4f2f042 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=202) at ./arch/x86_64/syscall_arch.h:59
#1  syscall (n=n@entry=202) at src/misc/syscall.c:18
#2  0x0000785eb6ec670b in futex_get (lock=0x785eb79215f4) at ../../lib/kcore/../../mem/../futexlock.h:108
#3  0x0000785eb6ecc200 in lock_ulslot (_d=_d@entry=0x785eb7955690, i=i@entry=472189) at udomain.c:1077
#4  0x0000785eb6ec1947 in mi_usrloc_dump (cmd=<optimized out>, param=<optimized out>) at ul_mi.c:451
#5  0x0000785edc4c0fa1 in run_mi_cmd (t=0xa9bbcf24100, cmd=0xa9bbce892e0) at ../../lib/kmi/mi.h:77
#6  mi_fifo_server (fifo_stream=fifo_stream@entry=0xa9bbce8a5a0) at fifo_fnc.c:509
#7  0x0000785edc4c3317 in fifo_process (rank=rank@entry=1) at mi_fifo.c:270
#8  0x0000785edc4c53b4 in mi_child_init (rank=<optimized out>) at mi_fifo.c:234
#9  0x00000a9bb8e9b6e6 in init_mod_child (m=0x785edcf22008, rank=rank@entry=0) at sr_module.c:921
#10 0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcf22270, rank=rank@entry=0) at sr_module.c:918
#11 0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcf227f0, rank=rank@entry=0) at sr_module.c:918
#12 0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcf22d40, rank=rank@entry=0) at sr_module.c:918
#13 0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcf246d0, rank=rank@entry=0) at sr_module.c:918
#14 0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcf252f8, rank=rank@entry=0) at sr_module.c:918
```
gdb /usr/sbin/kamailio 182
```
(gdb) bt
#0  0x0000785ee4f2f042 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=202) at ./arch/x86_64/syscall_arch.h:59
#1  syscall (n=n@entry=202) at src/misc/syscall.c:18
#2  0x0000785eb6ec674e in futex_get (lock=0x785eb79215f4) at ../../lib/kcore/../../mem/../futexlock.h:121
#3  0x0000785eb6ece5f7 in get_urecord_by_ruid (_d=0x785eb7955690, _aorhash=1819751549, _ruid=_ruid@entry=0x785eb68201c0 <ci>, _r=_r@entry=0x7d7464b7e400, _c=_c@entry=0x7d7464b7e410) at udomain.c:1178
#4  0x0000785eb66187b3 in add_contact (ci=0x785eb68201c0 <ci>, aor=...) at usrloc_sync.c:72
#5  usrloc_dmq_handle_msg (msg=<optimized out>, resp=0x7d7464b7e520, node=0x785eb8d5b418) at usrloc_sync.c:429
#6  0x0000785eb682adcc in worker_loop (id=id@entry=0) at worker.c:105
#7  0x0000785eb682c56b in child_init (rank=<optimized out>) at dmq.c:293
#8  0x00000a9bb8e9b6e6 in init_mod_child (m=0x785edcf9fe78, rank=rank@entry=0) at sr_module.c:921
#9  0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcfa02e0, rank=0) at sr_module.c:918
#10 0x00000a9bb8eacbde in main_loop () at main.c:1682
#11 0x00000a9bb8e6249b in main (argc=<optimized out>, argv=<optimized out>) at main.c:2616
```
gdb /usr/sbin/kamailio 183
```
(gdb) bt
#0  0x0000785ee4f2f042 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=202) at ./arch/x86_64/syscall_arch.h:59
#1  syscall (n=n@entry=202) at src/misc/syscall.c:18
#2  0x0000785eb6ec670b in futex_get (lock=0x785eb79215f4) at ../../lib/kcore/../../mem/../futexlock.h:108
#3  0x0000785eb661807c in add_contact (ci=0x785eb68201c0 <ci>, aor=...) at usrloc_sync.c:64
#4  usrloc_dmq_handle_msg (msg=<optimized out>, resp=0x7d7464b7e520, node=0x785eb8d5b418) at usrloc_sync.c:429
#5  0x0000785eb682adcc in worker_loop (id=id@entry=1) at worker.c:105
#6  0x0000785eb682c56b in child_init (rank=<optimized out>) at dmq.c:293
#7  0x00000a9bb8e9b6e6 in init_mod_child (m=0x785edcf9fe78, rank=rank@entry=0) at sr_module.c:921
#8  0x00000a9bb8e9b4f0 in init_mod_child (m=0x785edcfa02e0, rank=0) at sr_module.c:918
#9  0x00000a9bb8eacbde in main_loop () at main.c:1682
#10 0x00000a9bb8e6249b in main (argc=<optimized out>, argv=<optimized out>) at main.c:2616
```
No time to look at the code, but if it locks two times and unlocks only once, then it is indeed a deadlock. If any other developer using dmq_usrloc does the review, it can be committed. If not, I will check when I get a bit of time.
The same process is requesting the lock a second time before unlocking it. `futex_get()` is never able to acquire the lock and does not return. I tested this in a very minimalist program using `pthread_mutex_lock()` and it does get stuck.
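For reference, a minimal stand-alone version of such a test (an approximation, not the exact program from this report): with a non-recursive `PTHREAD_MUTEX_NORMAL` mutex, the second `pthread_mutex_lock()` by the same thread never returns.

```
#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_mutex_t lock;
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* NORMAL (non-recursive): relocking by the owner blocks forever;
	 * with the DEFAULT type this would be undefined behavior */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
	pthread_mutex_init(&lock, &attr);

	pthread_mutex_lock(&lock);
	puts("first lock acquired");
	pthread_mutex_lock(&lock);   /* self-deadlock: never returns */
	puts("never reached");
	return 0;
}
```

Kamailio's futex-based locks are not recursive either, which matches what the backtraces above show: the fix has to reorder the calls rather than rely on relocking.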
miconda commented on this pull request.
```
@@ -72,6 +70,7 @@ static int add_contact(str aor, ucontact_info_t* ci)
 	// Search by ruid, if possible
 	res = dmq_ul.get_urecord_by_ruid(_d, dmq_ul.get_aorhash(&aor), &ci->ruid, &r, &c);
 	if (res == 0) {
+		dmq_ul.lock_udomain(_d, &aor);
```
This addition to line 73 doesn't seem necessary, because the lock is acquired by get_urecord_by_ruid(). Or did I miss something?
@jchavanton pushed 1 commit.
309873e dmq_usrloc: get_urecord_by_ruid returns
jchavanton commented on this pull request.
```
@@ -72,6 +70,7 @@ static int add_contact(str aor, ucontact_info_t* ci)
 	// Search by ruid, if possible
 	res = dmq_ul.get_urecord_by_ruid(_d, dmq_ul.get_aorhash(&aor), &ci->ruid, &r, &c);
 	if (res == 0) {
+		dmq_ul.lock_udomain(_d, &aor);
```
You are right: if the record is found, the lock is kept by `get_urecord_by_ruid`.
Merged #911.