Hi,
we use the dialog-module to keep track of concurrent calls, and then set a treshold for simultanious calls. After we upgraded from 1.5 to 3.2, we've been having a lot of dialogs which are stuck in both the database and memory.
kamailio -V version: kamailio 3.2.1 (i386/linux) 918035-dirty flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 4MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: 918035 -dirty compiled on 10:00:10 Dec 20 2011 with gcc 4.1.2
Our configuration:
modparam("dialog", "dlg_flag", 3) modparam("dialog", "default_timeout", 21600) modparam("dialog", "dlg_match_mode", 1) modparam("dialog", "db_url", VOIP_DATA4) modparam("dialog", "db_mode", 1) #!ifdef VOIP1 modparam("dialog", "table_name", "dialog1") modparam("dialog", "vars_table_name", "dialog_vars1") #!endif #!ifdef VOIP2 modparam("dialog", "table_name", "dialog2") modparam("dialog", "vars_table_name", "dialog_vars2") #!endif modparam("dialog", "profiles_with_value", "busy")
To Kamailio instances, voip1 and voip2, share the same config, but define VOIP1 and VOIP2, respectively.
Then in the main route:
if !allow_trusted() { route(AUTH); }
if (is_method("INVITE")) { setflag(3); dlg_manage();
if ($avp(s:f_uid) != $null) { set_dlg_profile("busy","$avp(s:f_uid)"); get_profile_size("busy", "$avp(s:f_uid)", "$var(dlg_busy)"); xlog("L_INFO", "BUSY: f_uid: $avp(s:f_uid), dlg_busy: $var(dlg_busy)\n"); } }
$avp(s:f_uid) is set in route(AUTH), so it will catch calls originating from one of our users.
Then later in the callforward route:
# Testing dialog $var(dlg_busy) = 0; $var(dlg_busy1) = 0; $var(dlg_busy2) = 0; #!ifdef VOIP1 get_profile_size("busy", "$rU", "$var(dlg_busy1)"); sql_query("data4","select id from dialog2 where caller_contact like 'sip:$rU@%' or callee_contact like 'sip:$rU@%'", "dialog"); $var(dlg_busy2) = $dbr(dialog=>rows); #!endif #!ifdef VOIP2 get_profile_size("busy", "$rU", "$var(dlg_busy2)"); sql_query("data4","select id from dialog1 where caller_contact like 'sip:$rU@%' or callee_contact like 'sip:$rU@%'", "dialog"); $var(dlg_busy1) = $dbr(dialog=>rows); #!endif sql_result_free("dialog"); $var(dlg_busy) = $var(dlg_busy1) + $var(dlg_busy2);
And then check if $var(dlg_busy) is above a users treshold.
Finally when routing to a local user we mark it as busy:
set_dlg_profile("busy","$rU");
As of writing I have 5 entries in the dialog1-table, all valid, but:
# kamctl fifo dlg_list | grep -c "^dialog" 18 # kamctl fifo dlg_list | grep state | sort | uniq -c 13 state:: 1 5 state:: 4
on the other hand in the busy profile, which is the one we use:
# kamctl fifo profile_list_dlgs busy | grep -c "^dialog" 11 # kamctl fifo profile_list_dlgs busy | grep state | sort | uniq -c 6 state:: 1 5 state:: 4
It seems as this happens when, for some reason, kamailio is trying to parse duplicate INVITEs.
From the logs:
# grep "^Jan 17" /var/log/messages | grep BUSY | grep -v "dlg_busy: 1" | cut -d\ -f8- |sort | uniq -c 1 BUSY: f_uid: 2367449, dlg_busy: 2 3 BUSY: f_uid: 2367453, dlg_busy: 2 1 BUSY: f_uid: 2367453, dlg_busy: 3 1 BUSY: f_uid: 2367453, dlg_busy: 4 1 BUSY: f_uid: 2574912, dlg_busy: 2 1 BUSY: f_uid: 2582270, dlg_busy: 2 1 BUSY: f_uid: ahmedma, dlg_busy: 2 1 BUSY: f_uid: josephim, dlg_busy: 2 1 BUSY: f_uid: perla, dlg_busy: 2 1 BUSY: f_uid: rosalinm, dlg_busy: 2 1 BUSY: f_uid: suresha, dlg_busy: 2 1 BUSY: f_uid: sveinb, dlg_busy: 2
I've full debugging logs available, if needed.
Hey Øyvind,
Am 17.01.2012 um 13:41 schrieb Øyvind Kolbu:
we use the dialog-module to keep track of concurrent calls, and then set a treshold for simultanious calls. After we upgraded from 1.5 to 3.2, we've been having a lot of dialogs which are stuck in both the database and memory.
kamailio -V version: kamailio 3.2.1 (i386/linux) 918035-dirty flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 4MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: 918035 -dirty compiled on 10:00:10 Dec 20 2011 with gcc 4.1.2
Our configuration:
modparam("dialog", "dlg_flag", 3) modparam("dialog", "default_timeout", 21600) modparam("dialog", "dlg_match_mode", 1) modparam("dialog", "db_url", VOIP_DATA4) modparam("dialog", "db_mode", 1) #!ifdef VOIP1 modparam("dialog", "table_name", "dialog1") modparam("dialog", "vars_table_name", "dialog_vars1") #!endif #!ifdef VOIP2 modparam("dialog", "table_name", "dialog2") modparam("dialog", "vars_table_name", "dialog_vars2") #!endif modparam("dialog", "profiles_with_value", "busy")
To Kamailio instances, voip1 and voip2, share the same config, but define VOIP1 and VOIP2, respectively.
Then in the main route:
if !allow_trusted() { route(AUTH); } if (is_method("INVITE")) { setflag(3); dlg_manage(); if ($avp(s:f_uid) != $null) { set_dlg_profile("busy","$avp(s:f_uid)"); get_profile_size("busy", "$avp(s:f_uid)", "$var(dlg_busy)"); xlog("L_INFO", "BUSY: f_uid: $avp(s:f_uid), dlg_busy: $var(dlg_busy)\n"); } }
$avp(s:f_uid) is set in route(AUTH), so it will catch calls originating from one of our users.
Then later in the callforward route:
# Testing dialog $var(dlg_busy) = 0; $var(dlg_busy1) = 0; $var(dlg_busy2) = 0;
#!ifdef VOIP1 get_profile_size("busy", "$rU", "$var(dlg_busy1)"); sql_query("data4","select id from dialog2 where caller_contact like 'sip:$rU@%' or callee_contact like 'sip:$rU@%'", "dialog"); $var(dlg_busy2) = $dbr(dialog=>rows); #!endif #!ifdef VOIP2 get_profile_size("busy", "$rU", "$var(dlg_busy2)"); sql_query("data4","select id from dialog1 where caller_contact like 'sip:$rU@%' or callee_contact like 'sip:$rU@%'", "dialog"); $var(dlg_busy1) = $dbr(dialog=>rows); #!endif sql_result_free("dialog"); $var(dlg_busy) = $var(dlg_busy1) + $var(dlg_busy2);
And then check if $var(dlg_busy) is above a users treshold.
Finally when routing to a local user we mark it as busy:
set_dlg_profile("busy","$rU");
As of writing I have 5 entries in the dialog1-table, all valid, but:
# kamctl fifo dlg_list | grep -c "^dialog" 18 # kamctl fifo dlg_list | grep state | sort | uniq -c 13 state:: 1 5 state:: 4
on the other hand in the busy profile, which is the one we use:
# kamctl fifo profile_list_dlgs busy | grep -c "^dialog" 11 # kamctl fifo profile_list_dlgs busy | grep state | sort | uniq -c 6 state:: 1 5 state:: 4
It seems as this happens when, for some reason, kamailio is trying to parse duplicate INVITEs.
From the logs:
# grep "^Jan 17" /var/log/messages | grep BUSY | grep -v "dlg_busy: 1" | cut -d\ -f8- |sort | uniq -c 1 BUSY: f_uid: 2367449, dlg_busy: 2 3 BUSY: f_uid: 2367453, dlg_busy: 2 1 BUSY: f_uid: 2367453, dlg_busy: 3 1 BUSY: f_uid: 2367453, dlg_busy: 4 1 BUSY: f_uid: 2574912, dlg_busy: 2 1 BUSY: f_uid: 2582270, dlg_busy: 2 1 BUSY: f_uid: ahmedma, dlg_busy: 2 1 BUSY: f_uid: josephim, dlg_busy: 2 1 BUSY: f_uid: perla, dlg_busy: 2 1 BUSY: f_uid: rosalinm, dlg_busy: 2 1 BUSY: f_uid: suresha, dlg_busy: 2 1 BUSY: f_uid: sveinb, dlg_busy: 2
I've full debugging logs available, if needed.
Do you possibly start to track dialogs before they are authenticated and reply to unauthenticated INVITEs with 407 statelessly? IIRC, that causes dialogs to dangle in state 1 ("unconfirmed") because the module requires statefulness to operate and conclude dialog tracking properly.
If that's the case, you have two options: Either return 407's statefully, or defer dialog tracking until dialogs are authenticated. The former may impose a security risk, the latter may (depending on your situation) limit your dialog tracking needs.
Cheers,
--Timo
On 2012-01-17 at 23:47, Timo Reimann wrote:
Hey Øyvind, Am 17.01.2012 um 13:41 schrieb Øyvind Kolbu:
Then in the main route:
if !allow_trusted() { route(AUTH); } if (is_method("INVITE")) { setflag(3); dlg_manage(); if ($avp(s:f_uid) != $null) { set_dlg_profile("busy","$avp(s:f_uid)"); get_profile_size("busy", "$avp(s:f_uid)",
"$var(dlg_busy)");
xlog("L_INFO", "BUSY: f_uid: $avp(s:f_uid), dlg_busy:
$var(dlg_busy)\n");
} }
$avp(s:f_uid) is set in route(AUTH), so it will catch calls
originating from one of
our users.
Do you possibly start to track dialogs before they are authenticated and reply to unauthenticated INVITEs with 407 statelessly? IIRC, that causes dialogs to dangle in state 1 ("unconfirmed") because the module requires statefulness to operate and conclude dialog tracking properly.
No, we only do dlg_manage() after route(AUTH), as seen above.
Our route[AUTH] is below, slightly redacted, if you can spot any errors. Worth mentioning is that both trusted calls, from our Nortel pbx, and calls initiated from voip phones get stuck in the dialog memory.
# Route AUTH - LDAP authentication for hardphones REGISTER and INVITE route[AUTH] { # if (isflagset(1)) { # xlog("L_INFO", "ROUTE AUTH: Au length $(aU{s.len}) \n"); # }
$var(digauth) = $aU + "x"; if ($var(digauth) == "x") { if (is_method("REGISTER")) { www_challenge("uio.no","1"); } else { proxy_challenge("uio.no","1"); } exit; }
# ldap search to authenticate the request $var(res) = ldap_search("ldap://ldapprod/cn=clients,cn=voip,dc=uio,dc=no?sipMacAddress,sipSecret,sipVoipAddressDN,uid?one?(&(objectclass=sipClient)(|(sipMacAddress=$aU)(uid=$aU)))"); if (!$var(res)) { switch ($retcode) { case -1: # no LDAP entry found xlog("L_INFO", "LDAPAUTH: no ldap entry found for $aU\n"); sl_send_reply("404", "User Not Found"); exit; case -2: # internal error xlog("L_INFO", "LDAPAUTH: (ldap) internal server error\n"); sl_send_reply("500", "Internal server error"); exit; default: xlog("L_INFO", "LDAPAUTH: (ldap) found $var(res) entries\n"); } }
# Pick from the first entry # First try hardphone if (!ldap_result("sipMacAddress/$avp(s:username)")) { ldap_result("uid/$avp(s:username)"); } ldap_result("sipSecret/$avp(s:password)"); ldap_result("sipVoipAddressDN/$avp(s:voipdn)");
if (is_method("INVITE")) { $avp(s:f_uid) = $(avp(s:voipdn){param.value,uid}); }
if (is_method("REGISTER")) { if(!pv_www_authenticate("uio.no", "$avp(s:password)","0")) { xlog("L_WARN", "ROUTE AUTH: FAILED authentication on REGISTER for username: $avp(s:username)\n"); www_challenge("uio.no","1"); exit; } if (isflagset(1)) { xlog("L_INFO", "LDAPAUTH: Successful authentication. Username $avp(s:username)\n"); } return(1); } if(!pv_proxy_authenticate("uio.no", "$avp(s:password)", "0")) { proxy_challenge("uio.no","1"); exit; } append_hf("UIO-edge-auth-uid: $avp(s:username)\r\n"); /* Now, remove credentials before passing the message on */ consume_credentials();
if (is_method("INVITE")) { ldap_search("ldap://ldapprod/cn=uris,cn=voip,dc=uio,dc=no?voipExtensionUri,cn,voipE164Uri?one?(&(objectClass=voipAddress)(uid=$avp(s:f_uid)))"); ldap_result("voipExtensionUri/$avp(s:f_exten)"); ldap_result("cn/$avp(s:f_name)"); ldap_result("voipE164Uri/$avp(s:f_e164)"); $avp(s:f_num) = "sip:" + $(avp(s:f_e164){uri.user}{s.strip,3}) + "@uio.no"; append_hf("P-Asserted-Identity: $avp(s:f_name) <$avp(s:f_e164)>\r\n"); # append_hf("Remote-Party-ID: "$avp(s:f_name)" <$avp(s:f_exten)>;privacy=off;screen=no\r\n"); } }
If that's the case, you have two options: Either return 407's statefully, or defer dialog tracking until dialogs are authenticated. The former may impose a security risk, the latter may (depending on your situation) limit your dialog tracking needs.
We don't need dialog tracking until either a call is authenticated or it trusted, so that is fine our situation.
Our Kamailio has a tendency to send double replies for a request, e.g. two 200 OK for a single REGISTER. Most likely this is relevant to the dialog-module being confused.
Hey,
Am 18.01.2012 um 13:46 schrieb Øyvind Kolbu:
On 2012-01-17 at 23:47, Timo Reimann wrote:
Hey Øyvind, Am 17.01.2012 um 13:41 schrieb Øyvind Kolbu:
Then in the main route:
if !allow_trusted() { route(AUTH); } if (is_method("INVITE")) { setflag(3); dlg_manage(); if ($avp(s:f_uid) != $null) { set_dlg_profile("busy","$avp(s:f_uid)"); get_profile_size("busy", "$avp(s:f_uid)",
"$var(dlg_busy)");
xlog("L_INFO", "BUSY: f_uid: $avp(s:f_uid), dlg_busy:
$var(dlg_busy)\n");
} }
$avp(s:f_uid) is set in route(AUTH), so it will catch calls
originating from one of
our users.
Do you possibly start to track dialogs before they are authenticated and reply to unauthenticated INVITEs with 407 statelessly? IIRC, that causes dialogs to dangle in state 1 ("unconfirmed") because the module requires statefulness to operate and conclude dialog tracking properly.
No, we only do dlg_manage() after route(AUTH), as seen above.
Our route[AUTH] is below, slightly redacted, if you can spot any errors. Worth mentioning is that both trusted calls, from our Nortel pbx, and calls initiated from voip phones get stuck in the dialog memory.
Looks good to me. I wouldn't expect to spot the error there anyway since it's not related to dialog tracking.
Another possible root cause: After calling dlg_manage() on an INVITE, you do not forward the request (e.g., by calling exit() instead). Could that be the case? If so, the solution would be (again) to defer dialog tracking unless you're sure the INVITE will be routed.
If not, the last thing I can think of to try is to do some tracing (using ngrep or tcpdump, for example) and attempt to catch a dialog that dangles. If you succeed at that, analyzing the trace will probably help in determining the issue.
Cheers,
--Timo
# Route AUTH - LDAP authentication for hardphones REGISTER and INVITE route[AUTH] { # if (isflagset(1)) { # xlog("L_INFO", "ROUTE AUTH: Au length $(aU{s.len}) \n"); # }
$var(digauth) = $aU + "x"; if ($var(digauth) == "x") { if (is_method("REGISTER")) { www_challenge("uio.no","1"); } else { proxy_challenge("uio.no","1"); } exit; } # ldap search to authenticate the request $var(res) =
ldap_search("ldap://ldapprod/cn=clients,cn=voip,dc=uio,dc=no?sipMacAddress,sipSecret,sipVoipAddressDN,uid?one?(&(objectclass=sipClient)(|(sipMacAddress=$aU)(uid=$aU)))"); if (!$var(res)) { switch ($retcode) { case -1: # no LDAP entry found xlog("L_INFO", "LDAPAUTH: no ldap entry found for $aU\n"); sl_send_reply("404", "User Not Found"); exit; case -2: # internal error xlog("L_INFO", "LDAPAUTH: (ldap) internal server error\n"); sl_send_reply("500", "Internal server error"); exit; default: xlog("L_INFO", "LDAPAUTH: (ldap) found $var(res) entries\n"); } }
# Pick from the first entry # First try hardphone if (!ldap_result("sipMacAddress/$avp(s:username)")) { ldap_result("uid/$avp(s:username)"); } ldap_result("sipSecret/$avp(s:password)"); ldap_result("sipVoipAddressDN/$avp(s:voipdn)"); if (is_method("INVITE")) { $avp(s:f_uid) = $(avp(s:voipdn){param.value,uid}); } if (is_method("REGISTER")) { if(!pv_www_authenticate("uio.no", "$avp(s:password)","0")) { xlog("L_WARN", "ROUTE AUTH: FAILED authentication
on REGISTER for username: $avp(s:username)\n"); www_challenge("uio.no","1"); exit; } if (isflagset(1)) { xlog("L_INFO", "LDAPAUTH: Successful authentication. Username $avp(s:username)\n"); } return(1); } if(!pv_proxy_authenticate("uio.no", "$avp(s:password)", "0")) { proxy_challenge("uio.no","1"); exit; } append_hf("UIO-edge-auth-uid: $avp(s:username)\r\n"); /* Now, remove credentials before passing the message on */ consume_credentials();
if (is_method("INVITE")) { ldap_search("ldap://ldapprod/cn=uris,cn=voip,dc=uio,dc=no?voipExtensionUri,cn,voipE164Uri?one?(&(objectClass=voipAddress)(uid=$avp(s:f_uid)))"); ldap_result("voipExtensionUri/$avp(s:f_exten)"); ldap_result("cn/$avp(s:f_name)"); ldap_result("voipE164Uri/$avp(s:f_e164)"); $avp(s:f_num) = "sip:" +
$(avp(s:f_e164){uri.user}{s.strip,3}) + "@uio.no"; append_hf("P-Asserted-Identity: $avp(s:f_name) <$avp(s:f_e164)>\r\n"); # append_hf("Remote-Party-ID: "$avp(s:f_name)" <$avp(s:f_exten)>;privacy=off;screen=no\r\n"); } }
If that's the case, you have two options: Either return 407's statefully, or defer dialog tracking until dialogs are authenticated. The former may impose a security risk, the latter may (depending on your situation) limit your dialog tracking needs.
We don't need dialog tracking until either a call is authenticated or it trusted, so that is fine our situation.
Our Kamailio has a tendency to send double replies for a request, e.g. two 200 OK for a single REGISTER. Most likely this is relevant to the dialog-module being confused.
-- Øyvind Kolbu
On 2012-01-20 at 23:32, Timo Reimann wrote:
Hi and sorry for the late response!
Another possible root cause: After calling dlg_manage() on an INVITE, you do not forward the request (e.g., by calling exit() instead). Could that be the case? If so, the solution would be (again) to defer dialog tracking unless you're sure the INVITE will be routed.
Thanks, this was indeed at least the main problem! We have no replaced a lot of sl_send_reply("123", "message") with t_newtran() + t_reply("123","message") before exit(), and that solved most of the hanging dialogs. These were by the way hanging in state 1.
If not, the last thing I can think of to try is to do some tracing (using ngrep or tcpdump, for example) and attempt to catch a dialog that dangles. If you succeed at that, analyzing the trace will probably help in determining the issue.
We still have a few calls, probably one to two a day, which get stuck in state 4. They have proper INVITE, 200 OK and later BYE. We have tracked the problem down to that once in a while Asterisk, our bridge to PSTN, issues double INVITEs at the extact same time for the same call and in this case there seem to be a race within Kamailio. The call is properly setup and termitated, but the entry is still in the dialog table.
Any ideas how to cope with the double INVITEs? We do btw use dlg_match_mode = 1, as we used that in Kamailio 1.5 and that worked like a charm. Have not tested altering it to either 0 or 2.
On 01/31/2012 09:11 AM, Øyvind Kolbu wrote:
On 2012-01-20 at 23:32, Timo Reimann wrote:
Hi and sorry for the late response!
Another possible root cause: After calling dlg_manage() on an INVITE, you do not forward the request (e.g., by calling exit() instead). Could that be the case? If so, the solution would be (again) to defer dialog tracking unless you're sure the INVITE will be routed.
Thanks, this was indeed at least the main problem! We have no replaced a lot of sl_send_reply("123", "message") with t_newtran() + t_reply("123","message") before exit(), and that solved most of the hanging dialogs. These were by the way hanging in state 1.
If not, the last thing I can think of to try is to do some tracing (using ngrep or tcpdump, for example) and attempt to catch a dialog that dangles. If you succeed at that, analyzing the trace will probably help in determining the issue.
We still have a few calls, probably one to two a day, which get stuck in state 4. They have proper INVITE, 200 OK and later BYE. We have tracked the problem down to that once in a while Asterisk, our bridge to PSTN, issues double INVITEs at the extact same time for the same call and in this case there seem to be a race within Kamailio. The call is properly setup and termitated, but the entry is still in the dialog table.
Any ideas how to cope with the double INVITEs? We do btw use dlg_match_mode = 1, as we used that in Kamailio 1.5 and that worked like a charm. Have not tested altering it to either 0 or 2.
Some extra information:
We also still get some dialogs stuck in state 1 when we see these double invites (but the call is not set up due to busy, hang-up etc).
For these events we also (always?) see one or more of these messages in the logs:
CRITICAL: dialog [dlg_hash.c:650]: bogus event 6 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
CRITICAL: dialog [dlg_hash.c:650]: bogus event 7 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
Also sometimes event 8.
This happens to a very small minority of calls (<1% I'd guess).
Best regards, Marius Pedersen
Hi,
Am 31.01.2012 um 11:32 schrieb Marius Pedersen:
On 01/31/2012 09:11 AM, Øyvind Kolbu wrote:
On 2012-01-20 at 23:32, Timo Reimann wrote:
Hi and sorry for the late response!
Another possible root cause: After calling dlg_manage() on an INVITE, you do not forward the request (e.g., by calling exit() instead). Could that be the case? If so, the solution would be (again) to defer dialog tracking unless you're sure the INVITE will be routed.
Thanks, this was indeed at least the main problem! We have no replaced a lot of sl_send_reply("123", "message") with t_newtran() + t_reply("123","message") before exit(), and that solved most of the hanging dialogs. These were by the way hanging in state 1.
If not, the last thing I can think of to try is to do some tracing (using ngrep or tcpdump, for example) and attempt to catch a dialog that dangles. If you succeed at that, analyzing the trace will probably help in determining the issue.
We still have a few calls, probably one to two a day, which get stuck in state 4. They have proper INVITE, 200 OK and later BYE. We have tracked the problem down to that once in a while Asterisk, our bridge to PSTN, issues double INVITEs at the extact same time for the same call and in this case there seem to be a race within Kamailio. The call is properly setup and termitated, but the entry is still in the dialog table.
Any ideas how to cope with the double INVITEs? We do btw use dlg_match_mode = 1, as we used that in Kamailio 1.5 and that worked like a charm. Have not tested altering it to either 0 or 2.
Some extra information:
We also still get some dialogs stuck in state 1 when we see these double invites (but the call is not set up due to busy, hang-up etc).
For these events we also (always?) see one or more of these messages in the logs:
CRITICAL: dialog [dlg_hash.c:650]: bogus event 6 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
CRITICAL: dialog [dlg_hash.c:650]: bogus event 7 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
Also sometimes event 8.
This happens to a very small minority of calls (<1% I'd guess).
I'm currently in the process of investigating a dialog-related issue together with Uri (see CC). It may be related to your problem, so let's see if I can find something out that helps you as well. If not, I/we should take a dedicated look at your case.
Unfortunately, I'm currently short on time so I cannot give any guarantees as to when I'll find the time to get to these dialog-related things. I promise to get back to you folks ASAP though, so please hang on.
Cheers,
--Timo
On 2012-01-31 at 14:06, Timo Reimann wrote:
I'm currently in the process of investigating a dialog-related issue together with Uri (see CC). It may be related to your problem, so let's see if I can find something out that helps you as well. If not, I/we should take a dedicated look at your case.
Unfortunately, I'm currently short on time so I cannot give any guarantees as to when I'll find the time to get to these dialog-related things. I promise to get back to you folks ASAP though, so please hang on.
Great! We are currently running with full debugging enabled, so don't hesitate to contact if you need some data.
On 2012-01-31 at 14:15, Øyvind Kolbu wrote:
On 2012-01-31 at 14:06, Timo Reimann wrote:
I'm currently in the process of investigating a dialog-related issue together with Uri (see CC). It may be related to your problem, so let's see if I can find something out that helps you as well. If not, I/we should take a dedicated look at your case.
Unfortunately, I'm currently short on time so I cannot give any guarantees as to when I'll find the time to get to these dialog-related things. I promise to get back to you folks ASAP though, so please hang on.
Great! We are currently running with full debugging enabled, so don't hesitate to contact if you need some data.
Hi,
we've found that perhaps the cause of our problems are multiple identical INVITEs at more or less the exact same time, +/- a few ms. This causes a race within either the tm or dialog module, as it doesn't seem to cope well with duplicates ~simultanously. Our fix is that we use the lock and unlock commands from the cfgutils module together with a hashtable to check for duplicates, and if so, drop them.
So our patch is more or less:
modparam("htable", "htable", "cidhist=>size=8;autoexpire=10"); modparam("cfgutils", "lock_set_size", 8)
$avp(s:f_uid) is the username for our authenticated user and is set in route(AUTH), then in main route, after route(AUTH):
if (is_method("INVITE")) { if($(rU{s.len}) == 0) { xlog("L_INFO", "$avp(s:f_uid) tried to call an empty number, dropping call\n"); sl_send_reply("404", "User Not Found"); exit; } lock($ci); if($sht(cidhist=>$ci) != $null) { xlog("L_INFO", "DLG: We've already seen this call-id before, we should drop this invite $ci\n"); unlock($ci); exit; } else { xlog("L_INFO", "DLG: This is a new invite, let's start a dialog! $ci\n"); $sht(cidhist=>$ci) = 1; setflag(3); dlg_manage();
if ($avp(s:f_uid) != $null) { set_dlg_profile("busy","$avp(s:f_uid)"); get_profile_size("busy", "$avp(s:f_uid)", "$var(dlg_busy)"); xlog("L_INFO", "BUSY: f_uid: $avp(s:f_uid), dlg_busy: $var(dlg_busy)\n"); } } unlock($ci); }
Worth mentioning is that we also get REGISTER messages stuck in the dialog-module, in state 1, but we only do setflag(3) and dlg_manage() for INVITE. Againt probably caused by identical REGISTER messages. Any idea of why?
On 2/14/12 10:17 AM, Øyvind Kolbu wrote:
On 2012-01-31 at 14:15, Øyvind Kolbu wrote:
On 2012-01-31 at 14:06, Timo Reimann wrote:
I'm currently in the process of investigating a dialog-related issue together with Uri (see CC). It may be related to your problem, so let's see if I can find something out that helps you as well. If not, I/we should take a dedicated look at your case.
Unfortunately, I'm currently short on time so I cannot give any guarantees as to when I'll find the time to get to these dialog-related things. I promise to get back to you folks ASAP though, so please hang on.
Great! We are currently running with full debugging enabled, so don't hesitate to contact if you need some data.
Hi,
we've found that perhaps the cause of our problems are multiple identical INVITEs at more or less the exact same time, +/- a few ms. This causes a race within either the tm or dialog module, as it doesn't seem to cope well with duplicates ~simultanously. Our fix is that we use the lock and unlock commands from the cfgutils module together with a hashtable to check for duplicates, and if so, drop them.
So our patch is more or less:
modparam("htable", "htable", "cidhist=>size=8;autoexpire=10"); modparam("cfgutils", "lock_set_size", 8)
$avp(s:f_uid) is the username for our authenticated user and is set in route(AUTH), then in main route, after route(AUTH):
if (is_method("INVITE")) { if($(rU{s.len}) == 0) { xlog("L_INFO", "$avp(s:f_uid) tried to call an empty number, dropping call\n"); sl_send_reply("404", "User Not Found"); exit; } lock($ci); if($sht(cidhist=>$ci) != $null) { xlog("L_INFO", "DLG: We've already seen this call-id before, we should drop this invite $ci\n"); unlock($ci); exit; } else { xlog("L_INFO", "DLG: This is a new invite, let's start a dialog! $ci\n"); $sht(cidhist=>$ci) = 1; setflag(3); dlg_manage(); if ($avp(s:f_uid) != $null) { set_dlg_profile("busy","$avp(s:f_uid)"); get_profile_size("busy", "$avp(s:f_uid)", "$var(dlg_busy)"); xlog("L_INFO", "BUSY: f_uid: $avp(s:f_uid), dlg_busy: $var(dlg_busy)\n"); } } unlock($ci); }
can you create the transaction with t_newtran() somewhere before calling dlg manage? It should absorb retransmissions before going into dialog processing. After creating the transaction, use either send_reply() or t_reply() instead of sl_send_reply().
Worth mentioning is that we also get REGISTER messages stuck in the dialog-module, in state 1, but we only do setflag(3) and dlg_manage() for INVITE. Againt probably caused by identical REGISTER messages. Any idea of why?
what version are you using? REGISTER requests should not create any dialog and that is fixed starting wit 3.2.1, iirc.
Cheers, Daniel
On 2012-02-14 at 11:38, Daniel-Constantin Mierla wrote:
can you create the transaction with t_newtran() somewhere before calling dlg manage? It should absorb retransmissions before going into dialog processing. After creating the transaction, use either send_reply() or t_reply() instead of sl_send_reply().
Can try that, but as the problem with duplicates only occur for less than 1% of our conversations, it is not that easy to test in our testrig.
what version are you using? REGISTER requests should not create any dialog and that is fixed starting wit 3.2.1, iirc.
# kamailio -V version: kamailio 3.2.1 (i386/linux) 31c991 flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 4MB poll method support: poll, epoll_lt, epoll_et, sigio_rt, select. id: 31c991 compiled on 08:08:06 Dec 6 2011 with gcc 4.1.2
Willing to upgrade to a later git, if needed.
On 2012-02-14 at 11:38, Daniel-Constantin Mierla wrote:
can you create the transaction with t_newtran() somewhere before calling dlg manage? It should absorb retransmissions before going into dialog processing. After creating the transaction, use either send_reply() or t_reply() instead of sl_send_reply().
Can try that, but as the problem with duplicates only occur for less than 1% of our conversations, it is not that easy to test in our testrig.
We'll try this and see if we can provoke the problem by sending a large amount of retransmissions manually to our testrig.
what version are you using? REGISTER requests should not create any dialog and that is fixed starting wit 3.2.1, iirc.
This part was a misunderstanding. We are not seeing REGISTERS making dialogs and hanging (we just have some INVITES with weird to_uris, but that's not a problem with Kamailio).
-- Marius Pedersen
Hey all,
Am 31.01.2012 um 14:06 schrieb Timo Reimann:
Am 31.01.2012 um 11:32 schrieb Marius Pedersen:
On 01/31/2012 09:11 AM, Øyvind Kolbu wrote:
On 2012-01-20 at 23:32, Timo Reimann wrote:
Hi and sorry for the late response!
Another possible root cause: After calling dlg_manage() on an INVITE, you do not forward the request (e.g., by calling exit() instead). Could that be the case? If so, the solution would be (again) to defer dialog tracking unless you're sure the INVITE will be routed.
Thanks, this was indeed at least the main problem! We have no replaced a lot of sl_send_reply("123", "message") with t_newtran() + t_reply("123","message") before exit(), and that solved most of the hanging dialogs. These were by the way hanging in state 1.
If not, the last thing I can think of to try is to do some tracing (using ngrep or tcpdump, for example) and attempt to catch a dialog that dangles. If you succeed at that, analyzing the trace will probably help in determining the issue.
We still have a few calls, probably one to two a day, which get stuck in state 4. They have proper INVITE, 200 OK and later BYE. We have tracked the problem down to that once in a while Asterisk, our bridge to PSTN, issues double INVITEs at the extact same time for the same call and in this case there seem to be a race within Kamailio. The call is properly setup and termitated, but the entry is still in the dialog table.
Any ideas how to cope with the double INVITEs? We do btw use dlg_match_mode = 1, as we used that in Kamailio 1.5 and that worked like a charm. Have not tested altering it to either 0 or 2.
Some extra information:
We also still get some dialogs stuck in state 1 when we see these double invites (but the call is not set up due to busy, hang-up etc).
For these events we also (always?) see one or more of these messages in the logs:
CRITICAL: dialog [dlg_hash.c:650]: bogus event 6 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
CRITICAL: dialog [dlg_hash.c:650]: bogus event 7 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
Also sometimes event 8.
This happens to a very small minority of calls (<1% I'd guess).
I'm currently in the process of investigating a dialog-related issue together with Uri (see CC). It may be related to your problem, so let's see if I can find something out that helps you as well. If not, I/we should take a dedicated look at your case.
Unfortunately, I'm currently short on time so I cannot give any guarantees as to when I'll find the time to get to these dialog-related things. I promise to get back to you folks ASAP though, so please hang on.
Uri's issue was fixed in the master branch. I don't think it has anything to do with your case, however.
Before getting back to you in detail: Is the problem still persisting?
Cheers,
--Timo
Sorry I missed to see that this thread was already in full action and Daniel got catched on...
Cheers,
--Timo
Am 20.02.2012 um 22:55 schrieb Timo Reimann:
Hey all,
Am 31.01.2012 um 14:06 schrieb Timo Reimann:
Am 31.01.2012 um 11:32 schrieb Marius Pedersen:
On 01/31/2012 09:11 AM, Øyvind Kolbu wrote:
On 2012-01-20 at 23:32, Timo Reimann wrote:
Hi and sorry for the late response!
Another possible root cause: After calling dlg_manage() on an INVITE, you do not forward the request (e.g., by calling exit() instead). Could that be the case? If so, the solution would be (again) to defer dialog tracking unless you're sure the INVITE will be routed.
Thanks, this was indeed at least the main problem! We have no replaced a lot of sl_send_reply("123", "message") with t_newtran() + t_reply("123","message") before exit(), and that solved most of the hanging dialogs. These were by the way hanging in state 1.
If not, the last thing I can think of to try is to do some tracing (using ngrep or tcpdump, for example) and attempt to catch a dialog that dangles. If you succeed at that, analyzing the trace will probably help in determining the issue.
We still have a few calls, probably one to two a day, which get stuck in state 4. They have proper INVITE, 200 OK and later BYE. We have tracked the problem down to that once in a while Asterisk, our bridge to PSTN, issues double INVITEs at the extact same time for the same call and in this case there seem to be a race within Kamailio. The call is properly setup and termitated, but the entry is still in the dialog table.
Any ideas how to cope with the double INVITEs? We do btw use dlg_match_mode = 1, as we used that in Kamailio 1.5 and that worked like a charm. Have not tested altering it to either 0 or 2.
Some extra information:
We also still get some dialogs stuck in state 1 when we see these double invites (but the call is not set up due to busy, hang-up etc).
For these events we also (always?) see one or more of these messages in the logs:
CRITICAL: dialog [dlg_hash.c:650]: bogus event 6 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
CRITICAL: dialog [dlg_hash.c:650]: bogus event 7 in state 1 for dlg 0xb5e88dc4 [2445:97666510] with clid '2adcd4b23355a3aa3a4ae5a73fe72631@pstn-gateway-ip' and tags 'as485e20a7' ''
Also sometimes event 8.
This happens to a very small minority of calls (<1% I'd guess).
I'm currently in the process of investigating a dialog-related issue together with Uri (see CC). It may be related to your problem, so let's see if I can find something out that helps you as well. If not, I/we should take a dedicated look at your case.
Unfortunately, I'm currently short on time so I cannot give any guarantees as to when I'll find the time to get to these dialog-related things. I promise to get back to you folks ASAP though, so please hang on.
Uri's issue was fixed in the master branch. I don't think it has anything to do with your case, however.
Before getting back to you in detail: Is the problem still persisting?
Cheers,
--Timo _______________________________________________ SIP Express Router (SER) and Kamailio (OpenSER) - sr-users mailing list sr-users@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-users