tsearle created an issue (kamailio/kamailio#4153)
### Description
I use kemi python on arm64. The script is working in 5.8.4 but hangs when running on 6.0.0
#### Trobuleshooting
I reverted the patch https://github.com/kamailio/kamailio/pull/3986
and fully removed the code ref KafkaProducer and calls worked
If I use stock 6.0.0, even if I remove KafkaProducer calls still don't work
#### Reproduction
I have a kemi python script that uses the python module "KafkaProducer" which seems to create threads in the background. I also use the kamailio http_async module and an another async worker pool
#### Debugging Data
``` ---start 1558 ----------------------------------------------------- [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1". __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0xffffd25f2d00, op=137, expected=0, futex_word=0xffff8df076d0 <_PyRuntime+432>) at ./nptl/futex-internal.c:57 57 ./nptl/futex-internal.c: No such file or directory. #0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0xffffd25f2d00, op=137, expected=0, futex_word=0xffff8df076d0 <_PyRuntime+432>) at ./nptl/futex-internal.c:57 _x3tmp = 281474211196160 _x0tmp = 281473063089872 _x0 = 281473063089872 _x3 = 281474211196160 _x4tmp = 0 _x1tmp = -1068905079 _x1 = 137 _x4 = 0 _x5tmp = 4294967295 _x2tmp = 0 _x2 = 0 _x5 = 4294967295 _x8 = 98 _sys_result = <optimized out> sc_ret = <optimized out> _sys_result = <optimized out> _x5tmp = <optimized out> _x4tmp = <optimized out> _x3tmp = <optimized out> _x2tmp = <optimized out> _x1tmp = <optimized out> _x0tmp = <optimized out> _x0 = <optimized out> _x1 = <optimized out> _x2 = <optimized out> _x3 = <optimized out> _x4 = <optimized out> _x5 = <optimized out> _x8 = <optimized out> #1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0xffffd25f2d00, clockid=-473937460, expected=0, futex_word=0xffff8df076d0 <_PyRuntime+432>) at ./nptl/futex-internal.c:87 err = <optimized out> clockbit = 3226062080 op = 137 err = <optimized out> clockbit = <optimized out> op = <optimized out> #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xffff8df076d0 <_PyRuntime+432>, expected=expected@entry=0, clockid=clockid@entry=1, abstime=abstime@entry=0xffffd25f2d00, private=private@entry=0) at ./nptl/futex-internal.c:139 No locals. #3 0x0000ffff90ede520 in __pthread_cond_wait_common (abstime=0xffffd25f2d00, clockid=1, mutex=0xffff8df076d8 <_PyRuntime+440>, cond=0xffff8df076a8 <_PyRuntime+392>) at ./nptl/pthread_cond_wait.c:503 spin = 0 buffer = {__routine = 0xffff90eddf74 <__condvar_cleanup_waiting>, __arg = 0xffffd25f2c38, __canceltype = -471327168, __prev = 0x0} cbuffer = {wseq = 9322, cond = 0xffff8df076a8 <_PyRuntime+392>, mutex = 0xffff8df076d8 <_PyRuntime+440>, private = 0} err = <optimized out> g = 0 flags = <optimized out> g1_start = <optimized out> maxspin = 0 signals = <optimized out> result = 0 wseq = 9322 seq = 4661 private = 0 maxspin = <optimized out> err = <optimized out> result = <optimized out> wseq = <optimized out> g = <optimized out> seq = <optimized out> flags = <optimized out> private = <optimized out> signals = <optimized out> done = <optimized out> g1_start = <optimized out> spin = <optimized out> buffer = <optimized out> cbuffer = <optimized out> s = <optimized out> #4 ___pthread_cond_timedwait64 (cond=0xffff8df076a8 <_PyRuntime+392>, mutex=0xffff8df076d8 <_PyRuntime+440>, abstime=0xffffd25f2d00) at ./nptl/pthread_cond_wait.c:643 flags = <optimized out> clockid = 1 #5 0x0000ffff8da3f538 in ?? () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available. #6 0x0000ffff8da3fa28 in PyEval_RestoreThread () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available.``` #7 0x0000ffff8df99658 in apy_exec (_msg=0xffff857912f0, fname=0xaaaae3e8fda8 "ksr_onsend_route", fparam=fparam@entry=0x0, emode=emode@entry=0) at ./src/modules/app_python3/python_exec.c:75 pFunc = <optimized out> pArgs = <optimized out> pValue = <optimized out> pResult = <optimized out> pmsg = <optimized out> rval = -1 bmsg = <optimized out> locked = 0 __func__ = "apy_exec" #8 0x0000ffff8df90c00 in sr_kemi_config_engine_python (msg=<optimized out>, rtype=16, rname=0x0, rparam=<optimized out>) at ./src/modules/app_python3/apy_kemi.c:92 ret = -1 __func__ = "sr_kemi_config_engine_python" #9 0x0000aaaae3bcb578 in sr_kemi_route (keng=keng@entry=0xaaaae3f7f8e0 <_sr_kemi_eng_list>, msg=msg@entry=0xffff857912f0, rtype=rtype@entry=16, ename=ename@entry=0x0, edata=edata@entry=0x0) at core/kemi.c:3853 sfbk = 0 ret = <optimized out> #10 0x0000aaaae3c09c0c in run_onsend (orig_msg=orig_msg@entry=0xffff857912f0, dst=dst@entry=0xffff8578c6d0, buf=0xffff85792f00 "INVITE sip:+32485112970@54.220.60.194:5060;transport=UDP SIP/2.0\r\nRecord-Route: sip:10.151.1.119;lr=on;did=f84.06f1\r\nRecord-Route: sip:10.151.0.41:5080;r2=on;lr=on\r\nRecord-Route: <sip:176.34.141.1"..., len=<optimized out>) at core/onsend.c:79 onsnd_info = {dst = 0x0, to = 0xffff8578c6d8, send_sock = 0xffff8ffe89e0, buf = 0xffff85792f00 "INVITE sip:+32485112970@54.220.60.194:5060;transport=UDP SIP/2.0\r\nRecord-Route: sip:10.151.1.119;lr=on;did=f84.06f1\r\nRecord-Route: sip:10.151.0.41:5080;r2=on;lr=on\r\nRecord-Route: <sip:176.34.141.1"..., len = 1384, msg = 0xffff857912f0, rmode = 0, rplcode = 0} ret = 1 ra_ctx = {rec_lev = 0, run_flags = 0, last_retcode = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 91, 187650945063344, 187650945067504, 180, 187650944742872, 187650946646016, 1, 281473098000061, 281474211197472, 187650942363668, 281474211197752, 187650945867776, 187650944901120, 281473098000432, 3122766361, 187650944902184}, __mask_was_saved = -471386664, __saved_mask = {__val = {281473098000456, 187650944893376, 2074081191309474304, 281474211197424, 187650943506496, 275, 0, 281472882816896, 281473098000557, 281470681743367, 34, 187650945064960, 187650944741376, 2, 187650945730496, 281474211197472}}}}} bctx = 0x0 backup_route_type = 2 fwd_snd_flags_bak = <optimized out> rpl_snd_flags_bak = <optimized out> keng = <optimized out> __func__ = "run_onsend" #11 0x0000ffff8f75bacc in t_send_branch (t=t@entry=0xffff8578be90, branch=branch@entry=2, p_msg=p_msg@entry=0xffff857912f0, proxy=proxy@entry=0x0, lock_replies=lock_replies@entry=0) at ./src/modules/tm/t_fwd.c:1499 ip = {af = 3529455232, len = 65535, u = {addrl = {187650945761776, 281474211197776}, addr32 = {3824599536, 43690, 3529454416, 65535}, addr16 = {49648, 58358, 43690, 0, 13136, 53855, 65535, 0}, addr = "\360\301\366㪪\000\000P3_\322\377\377\000"}} ret = 2 uac = 0xffff8578c670 __func__ = "t_send_branch" #12 0x0000ffff8f75de14 in t_forward_nonack (t=0xffff8578be90, p_msg=p_msg@entry=0xffff857912f0, proxy=0x0, proto=proto@entry=0) at ./src/modules/tm/t_fwd.c:1786 branch_ret = <optimized out> lowest_ret = <optimized out> added_branches = 4 first_branch = <optimized out> i = 2 t_invite = <optimized out> success_branch = 0 try_new = <optimized out> lock_replies = 0 backup_bflags = 0 obranch = {uri = {s = 0x0, len = 0}, dst_uri = {s = 0x0, len = 0}, q = -1, path = {s = 0x0, len = 0}, flags = 0, force_socket = 0x0, ruid = {s = 0x0, len = 0}, instance = {s = 0x0, len = 0}, location_ua = {s = 0x0, len = 0}, otcpid = 0} __func__ = "t_forward_nonack" #13 0x0000ffff8f7bc41c in _w_t_relay_to (p_msg=0xffff857912f0, proxy=<optimized out>, force_proto=0) at ./src/modules/tm/tm.c:1856 t = <optimized out> res = <optimized out> __func__ = "_w_t_relay_to" #14 0x0000ffff8df91814 in sr_apy_kemi_exec_func_ex (ket=0xffff8f821420 <tm_kemi_exports>, self=<optimized out>, args=0xffff8df15b48 <_PyRuntime+58920>, idx=<optimized out>) at ./src/modules/app_python3/apy_kemi.c:259 fname = {s = 0xffff8f7f6b08 "t_relay", len = 7} i = <optimized out> ret = <optimized out> vps = {{vtype = -1919318472, v = {n = -2062635920, l = 281472914074736, s = {s = 0xffff850eac70 "\003", len = -2062643520}, dict = 0xffff850eac70}}, {vtype = 0, v = {n = -2062621248, l = 281472914089408, s = {s = 0xffff850ee5c0 ";", len = -765511240}, dict = 0xffff850ee5c0}}, {vtype = 0, v = {n = -1914684912, l = 281473062025744, s = {s = 0xffff8de03a10 <PyModule_Type> "*", len = 0}, dict = 0xffff8de03a10 <PyModule_Type>}}, {vtype = 7, v = {n = -1861393856, l = 281473115316800, s = {s = 0xffff910d6240 "\200xN\203\377\377", len = -1914684912}, dict = 0xffff910d6240}}, {vtype = -2062643520, v = {n = -765511888, l = 281474211198768, s = {s = 0xffffd25f3730 "\2607_\322\377\377", len = 1}, dict = 0xffffd25f3730}}, {vtype = -1914655368, v = {n = -765511896, l = 281474211198760, s = {s = 0xffffd25f3728 "\020\346\016\205\377\377", len = -765511888}, dict = 0xffffd25f3728}}} env_P = <optimized out> lmsg = 0xffff857912f0 xret = <optimized out> slen = 841693253507731862 alen = <optimized out> pobj = <optimized out> __func__ = "sr_apy_kemi_exec_func_ex" #15 0x0000ffff8df93440 in sr_apy_kemi_exec_func (self=0xffff850ee5c0, args=0xffff8df15b48 <_PyRuntime+58920>, idx=118) at ./src/modules/app_python3/apy_kemi.c:358 ket = 0xffff8f821420 <tm_kemi_exports> ret = 0x0 pstate = 0x0 pframe = 0x0 pcode = 0x0 tvb = {tv_sec = 0, tv_usec = 0} tve = {tv_sec = 0, tv_usec = 0} tz = {tz_minuteswest = -765511584, tz_dsttime = 65535} tdiff = <optimized out> __func__ = "sr_apy_kemi_exec_func" #16 0x0000ffff8d9a5d88 in ?? () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available. #17 0x0000ffff8d9571d8 in _PyObject_MakeTpCall () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available. #18 0x0000ffff8d8ed03c in _PyEval_EvalFrameDefault () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available. #19 0x0000ffff8da440fc in ?? () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available. #20 0x0000ffff8d95a728 in ?? () from /lib/aarch64-linux-gnu/libpython3.11.so.1.0 No symbol table info available. #21 0x0000ffff8df99924 in apy_exec (_msg=<optimized out>, fname=0xffff85797110 "ksr_rtpengine_http_action", fparam=fparam@entry=0x0, emode=emode@entry=1) at ./src/modules/app_python3/python_exec.c:149 pFunc = 0xffff84f7c180 pArgs = 0xffff85075150 pValue = <optimized out> pResult = <optimized out> pmsg = 0xffff83509e10 rval = -1 bmsg = 0x0 locked = 0 __func__ = "apy_exec" #22 0x0000ffff8df90770 in sr_kemi_config_engine_python (msg=<optimized out>, rtype=2, rname=0xffffd25f3f18, rparam=<optimized out>) at ./src/modules/app_python3/apy_kemi.c:88 ret = -1 __func__ = "sr_kemi_config_engine_python" #23 0x0000aaaae3bcb578 in sr_kemi_route (keng=0xaaaae3f7f8e0 <_sr_kemi_eng_list>, msg=msg@entry=0xffff857912f0, rtype=rtype@entry=2, ename=ename@entry=0xffffd25f3f18, edata=edata@entry=0xffffd25f3e38) at core/kemi.c:3853 sfbk = 0 ret = <optimized out> #24 0x0000ffff8f7ac5c0 in t_continue_helper (hash_index=<optimized out>, label=<optimized out>, rtact=<optimized out>, cbname=<optimized out>, cbparam=0xffffd25f3f28, skip_timer=<optimized out>) at ./src/modules/tm/t_suspend.c:319 t = 0xffff8578be90 backup_T = 0xffffffffffffffff backup_T_branch = -1 faked_req = 0xffff857912f0 brpl = 0x0 erpl = 0x0 faked_req_len = 7064 cancel_data = {cancel_bitmap = 3823579136, reason = {cause = 18336, u = {text = {s = 0xffffd25f40a0 "", len = -765509856}, e2e_cancel = 0xffffd25f40a0, packed_hdrs = {s = 0xffffd25f40a0 "", len = -765509856}}}} branch = 1 uac = <optimized out> ret = <optimized out> cb_type = 2 msg_status = <optimized out> last_uac_status = <optimized out> reply_status = <optimized out> do_put_on_wait = <optimized out> hdr = <optimized out> prev = 0x0 tmp = 0x0 route_type_bk = 2 keng = <optimized out> evname = {s = 0xffff8e0915d8 "async:task-exec", len = 15} __func__ = "t_continue_helper" #25 0x0000ffff8e08a93c in async_exec_task (param=0xffff85797100) at ./src/modules/async/async_sleep.c:392 atp = 0xffff85797100 keng = <optimized out> cbname = {s = 0xffff85797110 "ksr_rtpengine_http_action", len = 25} evname = {s = 0xffff8e0915d8 "async:task-exec", len = 15} __func__ = "async_exec_task" #26 0x0000aaaae3b45a6c in async_task_run (awg=awg@entry=0xffff90020d00, idx=idx@entry=1) at core/async_task.c:523 ptask = 0xffff857970f0 received = <optimized out> __func__ = "async_task_run" #27 0x0000aaaae3b46814 in async_task_child_init (rank=rank@entry=0) at core/async_task.c:225 pid = <optimized out> i = 0 pname = "Async Task Worker - rtpengine\000\000\000$\267\262㪪\000\000 B_\322\377\377\000\000\b\362\313㪪\000\000\000\220\366㪪\000" awg = 0xffff90020d00 __func__ = "async_task_child_init" __llevel = <optimized out> __kld = <optimized out> #28 0x0000aaaae3cbf2f8 in init_child (rank=rank@entry=0) at core/sr_module.c:995 ret = <optimized out> type = <optimized out> __func__ = "init_child" #29 0x0000aaaae3b2cfc0 in main_loop () at ./src/main.c:2022 i = <optimized out> pid = <optimized out> si = <optimized out> sx = <optimized out> si_desc = "udp receiver child=3 sock=10.151.1.119:5090\000\377\377\000\000`}\004\220\377\377\000\000\250\310\360%\210&\f7\023\000\000\000\000\000\000\0000\000\000\000\000\000\000\000(\000\000\000\000\000\000\000\240D_\322\377\377\000\000\030\340\261㪪\000\000\3305\347㪪\000\000\240E_\322\377\377\000\000\344D\270㪪\000" nrprocs = <optimized out> woneinit = 1 agfound = <optimized out> __func__ = "main_loop" error = <optimized out> #30 0x0000aaaae3b1e030 in main (argc=<optimized out>, argv=<optimized out>) at ./src/main.c:3406 cfg_stream = <optimized out> c = <optimized out> r = <optimized out> tmp = 0xffffd25f4e5d "" tmp_len = 65535 port = 5060 proto = 0 aproto = 0 ahost = 0x0 socket_name = 0x0 aport = 0 listen_field_count = <optimized out> listen_fields = {0x0, 0x0, 0x0} options = 0xaaaae3e75ea0 ":f:cm:M:dVIhEeb:B:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:" ret = -1 seed = 379075458 rfd = <optimized out> debug_save = <optimized out> debug_flag = <optimized out> dont_fork_cnt = <optimized out> n_lst = <optimized out> p = <optimized out> tbuf = 0xffff910ed028 <_rtld_global> "\200\343\016\221\377\377" tbuf_tmp = <optimized out> st = {st_dev = 24, st_ino = 1050, st_mode = 16888, st_nlink = 2, st_uid = 106, st_gid = 111, st_rdev = 0, __pad1 = 0, st_size = 40, st_blksize = 4096, __pad2 = 0, st_blocks = 0, st_atim = {tv_sec = 1740037653, tv_nsec = 298823983}, st_mtim = {tv_sec = 1740037653, tv_nsec = 298823983}, st_ctim = {tv_sec = 1740037653, tv_nsec = 298823983}, __glibc_reserved = {0, 0}} l1 = <optimized out> lim = {rlim_cur = 1024, rlim_max = 524288} option_index = 9 long_options = {{name = 0xaaaae3e77a28 "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0xaaaae3e77a30 "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0xaaaae3e8fba8 "alias", has_arg = 1, flag = 0x0, val = 1024}, {name = 0xaaaae3e77a38 "subst", has_arg = 1, flag = 0x0, val = 1025}, {name = 0xaaaae3e77a40 "substdef", has_arg = 1, flag = 0x0, val = 1026}, {name = 0xaaaae3e77a50 "substdefs", has_arg = 1, flag = 0x0, val = 1027}, {name = 0xaaaae3e77a60 "server-id", has_arg = 1, flag = 0x0, val = 1028}, {name = 0xaaaae3e77a70 "loadmodule", has_arg = 1, flag = 0x0, val = 1029}, {name = 0xaaaae3e77a80 "modparam", has_arg = 1, flag = 0x0, val = 1030}, {name = 0xaaaae3e77a90 "log-engine", has_arg = 1, flag = 0x0, val = 1031}, {name = 0xaaaae3e77aa0 "debug", has_arg = 1, flag = 0x0, val = 1032}, {name = 0xaaaae3e77aa8 "cfg-print", has_arg = 0, flag = 0x0, val = 1033}, {name = 0xaaaae3e77ab8 "atexit", has_arg = 1, flag = 0x0, val = 1034}, {name = 0xaaaae3e77ac0 "all-errors", has_arg = 0, flag = 0x0, val = 1035}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}} __func__ = "main"
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
``` version: kamailio 6.0.0 (aarch64/linux) flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, MEM_JOIN_FREE, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT-NOSMP, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_SEND_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: unknown compiled with gcc 12.2.0 ```
* **Operating System**:
``` No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 12 (bookworm) Release: 12 Codename: bookworm
Linux #1 SMP Debian 6.1.128-1 (2025-02-07) aarch64 GNU/Linux ```
space88man left a comment (kamailio/kamailio#4153)
@tsearle - hi Torrey - can I confirm the routing script creates threads in the main process before fork (vs. threading is only done in the workers)?
For testing: - if the routing script uses threads in both master/workers - can you defer master threads until after the fork? Your scripts would need a way to sync with each other. Are you able to run the script with the master process doing no threading/kafka work? - I have encountered something similar with gRPC (even with the old code)
tsearle left a comment (kamailio/kamailio#4153)
I stripped down my python as much as possible, removed all libs that could be making threads and make it forward the call to a hardcoded destination. It seems to deadlock trying to invoke the onsend kemi route
[basic_py_script.tgz](https://github.com/user-attachments/files/18959911/basic_py_script.tgz)
[gdb_kamailio_20250225_075911.txt](https://github.com/user-attachments/files/18960050/gdb_kamailio_20250225_075...)
miconda left a comment (kamailio/kamailio#4153)
It seems to be a Python thread that is stuck somehow. The backtrace points to:
- https://github.com/kamailio/kamailio/blob/master/src/modules/app_python3/pyt...
Added in commit 0ffe157bc13.
If you have time to test, can you try with app_python3s module without a mod-init callback for it? It should indicate if there is eventually an issue with multi-threading state altered by forking, as app_python3 initialises the class in mod-init before forking.
tsearle left a comment (kamailio/kamailio#4153)
[gdb_kamailio_20250226_071321.txt](https://github.com/user-attachments/files/18979030/gdb_kamailio_20250226_071...)
[python3s.tgz](https://github.com/user-attachments/files/18979088/python3s.tgz)
python3s has the same issue
tsearle left a comment (kamailio/kamailio#4153)
I've stripped down my script even more... I removed the rtimers, I can't remove the xhttp as the instance would get killed if I removed monitoring... it still deadlocks
[gdb_kamailio_20250226_074257.txt](https://github.com/user-attachments/files/18979477/gdb_kamailio_20250226_074...)
[python3s.tgz](https://github.com/user-attachments/files/18979383/python3s.tgz)
miconda left a comment (kamailio/kamailio#4153)
Found some time to setup a small testing instance, stripping further from you sample configs attached for app_python3s. Just sending OPTIONS didn't result in a blocking state, so it is not about kamailio's inside code for executing the Python script, but by sending INVITEs, it got blocked, so it should be some impact of the functions used for processing INVITEs.
Maybe @space88man can figure out more if he is more familiar with the matter, I will also try to dig in.
For now, I introduced a new module parameter `threads_mode` which defaults to `0` and used the (old) code before the pr/patch https://github.com/kamailio/kamailio/pull/3986 . It is default, since it is known to have worked. The parameter can be set to 1 to switch to the code introduced by patch https://github.com/kamailio/kamailio/pull/3986.
I did a bit of testing and with `threads_mode = 0`, sending INVITEs didn't result in blocking.
miconda left a comment (kamailio/kamailio#4153)
I added the same parameter to app_python3.