Version: 5.3.3 `065668`
Platform: CentOS 7 (3.10.0-862.14.4.el7.x86_64)
I seem to have a recurring but intermittent crash on all five gateways that use `http_async_client` to query an HTTP API.
In all cases, the crash occurs on an attempt to access or reference an AVP called `$avp(is_subscribe)`:
```
Program terminated with signal 11, Segmentation fault.
#0  0x00000000005d479b in match_by_name (avp=0xabcdefed, id=45, name=0x7fff9379c668) at core/usr_avp.c:378
378         if (id==avp->id && avp->flags&AVP_NAME_STR &&
Missing separate debuginfos, use: debuginfo-install libuuid-2.23.2-52.el7_5.1.x86_64 nss-pem-1.0.3-4.el7.x86_64
(gdb) where
#0  0x00000000005d479b in match_by_name (avp=0xabcdefed, id=45, name=0x7fff9379c668) at core/usr_avp.c:378
#1  0x00000000005d54bf in search_next_avp (s=0x7fff9379c660, val=0x7fff9379c630) at core/usr_avp.c:499
#2  0x00000000005d4f04 in search_avp (ident=..., val=0x7fff9379c630, state=0x7fff9379c660) at core/usr_avp.c:465
#3  0x00000000005d4913 in search_first_avp (flags=1, name=..., val=0x7fff9379c630, s=0x7fff9379c660) at core/usr_avp.c:414
#4  0x00007f3f6ba529fd in ops_is_avp_set (msg=0xb04f60 <_faked_msg>, ap=0x7f3f73296ed8) at avpops_impl.c:1821
#5  0x00007f3f6ba60d47 in w_is_avp_set (msg=0xb04f60 <_faked_msg>, param=0x7f3f73296ed8 "", op=0x0) at avpops.c:1059
#6  0x0000000000480f3d in do_action (h=0x7fff9379ce70, a=0x7f3f732a8b30, msg=0xb04f60 <_faked_msg>) at core/action.c:1077
#7  0x000000000048da7c in run_actions (h=0x7fff9379ce70, a=0x7f3f732a8b30, msg=0xb04f60 <_faked_msg>) at core/action.c:1576
#8  0x000000000048e13d in run_actions_safe (h=0x7fff9379dae0, a=0x7f3f732a8b30, msg=0xb04f60 <_faked_msg>) at core/action.c:1640
#9  0x0000000000438ff3 in rval_get_int (h=0x7fff9379dae0, msg=0xb04f60 <_faked_msg>, i=0x7fff9379d318, rv=0x7f3f732a8d70, cache=0x0) at core/rvalue.c:915
#10 0x000000000043d5a3 in rval_expr_eval_int (h=0x7fff9379dae0, msg=0xb04f60 <_faked_msg>, res=0x7fff9379d318, rve=0x7f3f732a8d68) at core/rvalue.c:1913
#11 0x000000000043d9f6 in rval_expr_eval_int (h=0x7fff9379dae0, msg=0xb04f60 <_faked_msg>, res=0x7fff9379d7cc, rve=0x7f3f732a9598) at core/rvalue.c:1921
#12 0x000000000048097b in do_action (h=0x7fff9379dae0, a=0x7f3f732aa9c0, msg=0xb04f60 <_faked_msg>) at core/action.c:1047
#13 0x000000000048da7c in run_actions (h=0x7fff9379dae0, a=0x7f3f732aa9c0, msg=0xb04f60 <_faked_msg>) at core/action.c:1576
#14 0x000000000048e205 in run_top_route (a=0x7f3f732aa9c0, msg=0xb04f60 <_faked_msg>, c=0x0) at core/action.c:1661
#15 0x00007f3f705f7d0e in async_http_cb (reply=0x7f3f2b7ee620, param=0x7f3f2b7eca18) at async_http.c:256
#16 0x00007f3f705f0dab in check_multi_info (g=0x7f3f2abeb9e8) at http_multi.c:597
#17 0x00007f3f705e84ec in event_cb (fd=6, kind=2, userp=0x1d7cdf0) at http_multi.c:145
#18 0x00007f3f70131a14 in event_process_active_single_queue (activeq=0x1c6f820, base=0x1d3ad90) at event.c:1350
#19 event_process_active (base=<optimized out>) at event.c:1420
#20 event_base_loop (base=0x1d3ad90, flags=0) at event.c:1621
#21 0x00007f3f705f490f in async_http_run_worker (worker=0x7f3f2abd1bb0) at async_http.c:92
#22 0x00007f3f705dcb17 in child_init (rank=0) at http_async_client_mod.c:366
#23 0x000000000057a810 in init_mod_child (m=0x7f3f73245168, rank=0) at core/sr_module.c:780
#24 0x000000000057a4ac in init_mod_child (m=0x7f3f73245d58, rank=0) at core/sr_module.c:776
#25 0x000000000057a4ac in init_mod_child (m=0x7f3f73246340, rank=0) at core/sr_module.c:776
#26 0x000000000057a4ac in init_mod_child (m=0x7f3f732467f8, rank=0) at core/sr_module.c:776
#27 0x000000000057a4ac in init_mod_child (m=0x7f3f73246e38, rank=0) at core/sr_module.c:776
#28 0x000000000057a4ac in init_mod_child (m=0x7f3f73247210, rank=0) at core/sr_module.c:776
#29 0x000000000057a4ac in init_mod_child (m=0x7f3f73247978, rank=0) at core/sr_module.c:776
#30 0x000000000057a4ac in init_mod_child (m=0x7f3f73247da0, rank=0) at core/sr_module.c:776
#31 0x000000000057af5a in init_child (rank=0) at core/sr_module.c:825
#32 0x0000000000426bc1 in main_loop () at main.c:1753
#33 0x000000000042e63a in main (argc=13, argv=0x7fff9379edb8) at main.c:2802
```
Most recently, the crash occurs here:
```
route {
    $avp(is_subscribe) = 0;

    ...

    $http_req(method) = 'GET';
    $http_req(suspend) = 1;
    $http_req(hdr) = "x-api-key: " + $sel(cfg_get.api.whitelist_key);

    t_newtran();

    http_async_query("https://some_host/check_ip_in_whitelist?ip=$(si{s.escape.common})", "WHITELIST_CHECK");
}

...

route[WHITELIST_CHECK] {
    if (!is_avp_set("$avp(is_subscribe)")) {
        xlog("L_INFO", "[R-HTTP-REPLY:$ci] !> AVP is_subscribe is undefined -- cannot process further\n");
        sl_send_reply("500", "Internal Server Error");
        exit;
    }
```
However, adding `is_avp_set()` was my attempt to put a safeguard around unsafe access to this AVP. It didn't work:
```
(gdb) print *name
$1 = {s = 0x7f3f73261835 "is_subscribe)", len = 12}
(gdb) print avp
$2 = (avp_t *) 0xabcdefed
(gdb) print avp->id
Cannot access memory at address 0xabcdefed
```
Note that I am making the assumption above that AVPs created in the main request route will be available in the `http_async_client` callback, because it is just a suspended TM transaction underneath -- right? So, transaction-persistent variables such as AVPs should be visible throughout the lifetime of the transaction, just as with a conventional async TM transaction, I would think.
Digging further into the message being handled in `async_http_cb()`, it is always a normal HTTP `200 OK` reply:
```
(gdb) set print elements 0
(gdb) frame 15
#15 0x00007f3f705f7d0e in async_http_cb (reply=0x7f3f2b7ee620, param=0x7f3f2b7eca18) at async_http.c:256
256         if (run_top_route(act, fmsg, 0)<0) {
(gdb) print *reply
$5 = {retcode = 200, result = 0x7f3f2b7ec560, error = '\000' <repeats 255 times>}
(gdb) print *(reply->result)
$6 = {
  s = 0x7f3f2b7ee798 "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: 16\r\nConnection: keep-alive\r\nDate: Wed, 15 Apr 2020 15:06:52 GMT\r\nx-amzn-RequestId: d3a44941-e0c4-4cb9-9144-ea54d929c300\r\nx-amz-apigw-id: LCJqBF7CIAMFXCw=\r\nX-Amzn-Trace-Id: Root=1-5e97230c-4fe1af7d072275d95c231a4c;Sampled=0\r\nX-Cache: Miss from cloudfront\r\nVia: 1.1 ade18dc841d2e1cc8ef49611c5d4c93e.cloudfront.net (CloudFront)\r\nX-Amz-Cf-Pop: IAD89-C2\r\nX-Amz-Cf-Id: 3s1S2kvOgL7L4ZKMlWwQnfIxHmWFGz0PHbcNFefYA1tZ1j0AGy3JHg==\r\n\r\n{"status":false}\300\300\300\300", len = 504}
```
Any insights would be appreciated! These crashes have been happening roughly hourly, out of nowhere, after over a year of undisturbed operation.
Hi Alex, yes, AVPs should be persistent. I've never seen a crash like this in 5.2.x (the release I'm using in production), and I'm using AVPs in resume routes, even suspending/resuming a single transaction twice (I have a scenario where two HTTP queries are run, roughly as sketched below). Has anything changed in your system recently, e.g. the libcurl version? I'll try to have a look in the next few days at what might have changed in the transaction framework that could explain this.
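For reference, a minimal sketch of that double suspend/resume scenario (hostnames, route names and the AVP are invented for illustration, and it assumes the second query is simply issued from the first callback route):

```
route {
    $avp(step) = 1;
    $http_req(suspend) = 1;
    t_newtran();
    http_async_query("https://api.example.org/first", "FIRST_REPLY");
}

route[FIRST_REPLY] {
    # transaction resumed: $avp(step) set in the request route should be visible here
    $avp(step) = 2;
    $http_req(suspend) = 1;
    http_async_query("https://api.example.org/second", "SECOND_REPLY");
}

route[SECOND_REPLY] {
    # second resume of the same transaction
    xlog("L_INFO", "final step=$avp(step)\n");
}
```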
Hi -- thanks for the response!
Alas, nothing has really changed that I know of.
Actually, this is a scenario where multiple HTTP queries are being run as well; this is the first stage, but subsequently the scenario is more complicated and requires use of the synchronous `http_client`, so it's done manually with mqueue/rtimer. I do not remember why simply suspending the transaction twice didn't work; I think it had something to do with missing transaction state somewhere -- I will need to revisit that.
However, the crash occurs long before any of that occurs.
Just had another crash, with a backtrace identical to the one above; it seems any attempt to access this AVP explodes:
```
(gdb) print *avp
Cannot access memory at address 0xabcdefed
```
Looking at the backtrace, the execution continues with a `faked_msg` structure:
```
#14 0x000000000048e205 in run_top_route (a=0x7f3f732aa9c0, msg=0xb04f60 <_faked_msg>, c=0x0) at core/action.c:1661
#15 0x00007f3f705f7d0e in async_http_cb (reply=0x7f3f2b7ee620, param=0x7f3f2b7eca18) at async_http.c:256
```
which, based on the code, happens when the query is executed without suspending the transaction:
* https://github.com/kamailio/kamailio/blob/master/src/modules/http_async_clie...
Now I see that @abalashov's example is:
```
$http_req(suspend) = 1;
```
So I wonder if there is an issue propagating it, or whether there are cases in @abalashov's config where it is not set.
On the other hand, the crash seems to happen because, in the worker process that continues execution with the faked_msg, the list of AVPs is not valid -- perhaps it was not reset after a previous execution. The AVPs there are definitely not the ones you set for the received SIP request, because, again, no transaction is being resumed in this case, so it doesn't get the AVPs from the transaction.
@abalashov - do you have event_route blocks in your config? Those are the usual suspects running with a faked_msg structure.
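For illustration, a block of that kind runs without a real SIP message, on the faked_msg structure -- a hypothetical example, with the table and key names invented:

```
event_route[htable:mod-init] {
    # executed once at startup, with no SIP message -- the core supplies a faked_msg
    $sht(config=>ready) = 1;
    xlog("L_INFO", "htable initialized, ready=$sht(config=>ready)\n");
}
```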
@abalashov have you checked whether you always set the suspend field? Any new crashes?
I think I found something that could have caused the issue, namely setting the AVP lists from the transaction into the process context inside the http async callback function. That was done for the case when the transaction was suspended and not cleaned up, so the lists lived on in the context of the process; when the async callback was later executed for a non-suspended-transaction case (with faked_msg), the local AVPs were cleaned up, but they belonged to a past transaction that had already been destroyed (along with its AVPs).
So this case could happen when there is mixed usage of http async with suspend=1 as well as with suspend=0. The code was rather old, but maybe nobody had used the module with mixed suspend modes before.
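To make that concrete, a hypothetical config mixing the two modes might look roughly like this (URLs, route names and the surrounding logic are invented for illustration). Per the analysis above, the non-suspended callback runs on faked_msg in the async worker process, where an AVP list left over from an earlier suspended transaction handled by the same worker could still be installed:

```
route[CHECK_SOMETHING] {
    # query A: the transaction is suspended and later resumed in route[REPLY_A]
    $http_req(suspend) = 1;
    t_newtran();
    http_async_query("https://api.example.org/a", "REPLY_A");
}

route[NOTIFY_SOMETHING] {
    # query B: no suspend -- the callback runs on faked_msg in the async worker
    $http_req(suspend) = 0;
    http_async_query("https://api.example.org/b", "REPLY_B");
}

route[REPLY_A] {
    # resumed transaction: AVPs set before the query are expected to be visible
    xlog("L_INFO", "reply A status: $http_rs\n");
}

route[REPLY_B] {
    # faked_msg context: before the fix, a stale AVP list from a previously
    # suspended (and since destroyed) transaction could still be in place here
    xlog("L_INFO", "reply B status: $http_rs\n");
}
```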
IMO, there is no need to set the AVP lists as process-local in the callback, because that is done by faked_env(), which is used inside the t_continue() function before executing the routing block, so the AVPs are available inside the config block.
I pushed commit 1bc3bbd -- @grumvalski, can you check it and analyze whether removing those lines has any side effects? I couldn't spot a case where the AVPs would be needed before faked_env() is used by t_continue(), but I am not that familiar with the http async module internals. If there are unwanted effects, the commit can be reverted; if all looks OK and the tests are fine, it can be backported.
Hi, thank you very much for investigating this!
Unfortunately, the crash does not arise dependably; as luck would have it, there hasn't been one in the days since I submitted this issue. So, all I can do is cherry-pick this patch into the source and hope for the best.
I will be sure to follow up, but unfortunately there is no control group.
@miconda To answer a few other questions:
1. I always set `$http_req(suspend) = 1` on async HTTP client transactions;
2. There are no `event_routes` anywhere;
3. The flow is complex, though. After this async HTTP query, there is another _synchronous_ HTTP query using the normal `http_client` module + `mqueue` + `rtimer` (roughly sketched below) -- it was implemented this way because of some problems with transaction-persistent variables (ironically) when suspending the same TM transaction twice, and a desire for more granular control over the request pipelines, given the high and somewhat unpredictable latency of what's being queried.
Nevertheless, all this takes place long after the crash point.
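For completeness, the second (synchronous) stage mentioned in point 3 follows roughly this pattern -- a heavily simplified sketch, with the queue, timer, route and URL names invented for illustration:

```
# module setup (global section): declare the queue and one rtimer worker to drain it
modparam("mqueue", "mqueue", "name=lookups")
modparam("rtimer", "timer", "name=lookups;interval=1;mode=1;")
modparam("rtimer", "exec", "timer=lookups;route=LOOKUP_WORKER")

# producer side: the SIP worker enqueues work instead of blocking on HTTP
route[QUEUE_LOOKUP] {
    mq_add("lookups", "$ci", "$si");
}

# consumer side: the rtimer process runs the blocking http_client query
route[LOOKUP_WORKER] {
    while (mq_fetch("lookups")) {
        http_client_query("https://api.example.org/lookup?ip=$mqv(lookups)", "$var(result)");
        xlog("L_INFO", "lookup for $mqk(lookups): $var(result)\n");
    }
}
```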
Hi Daniel, you are absolutely right, and that part of the code is certainly wrong. Those lines have been there since the beginning, no doubt due to my lack of understanding of the transaction internals and AVPs (I honestly don't remember why). For the record, I've been using mixed suspended/non-suspended queries from the beginning, but I've always used separate return routes for suspended and non-suspended queries. That could explain why I never hit this. Thanks for looking into it.
Sorry, the previous mail was sent by accident before I finished it.
Thanks for looking into this and fixing it, Daniel. It should definitely be backported.
Cheers,
Federico
@grumvalski - you can backport to the branches relevant for the fix. If not, I will do it before the next releases.
@abalashov can you try the latest version from the stable branch? @grumvalski backported the patch. The AVP list handling should no longer be there.
I manually cherry-picked the patch into my clone of 5.3 before it was backported, and haven’t had a crash since.
However, I also hadn’t had a crash the few days prior.
So, it remains to be seen what happens. But I am optimistic this will fix the issue. Thank you very much to you both!
OK. Reopen if it pops up again.
Closed #2286.