Hi Gang
We are still having major problems safely reloading Kamailio after config changes when using the dialog module and DMQ.
If there are active dialogs, Kamailio corrupts them on restart, even when using MySQL as the dialog backend.
As we use two core nodes for redundancy, I am looking for a way to restart Kamailio gracefully.
I am considering adding a key to a hash table (or anything else I can reload at runtime) that tells Kamailio not to accept any new calls, i.e. to reject INVITEs without a To-tag with 503, so that the registrar or IC peer hopefully resends the INVITE to the other node.
Then I would wait until no more dialogs are active, so Kamailio can be restarted safely.
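A minimal sketch of such a runtime drain switch, assuming an htable named 'maint' with an integer key 'drain' (both names are hypothetical). htable items are only replicated over DMQ when the table has 'dmqreplicate=1', so without it the flag stays local to the node where it is set:

```cfg
# Load these only if the existing config does not already do so.
loadmodule "htable.so"
loadmodule "siputils.so"
loadmodule "sl.so"

# Hypothetical per-node maintenance table. initval=0 makes an unset key
# read as 0 instead of $null; without dmqreplicate=1 the table is not
# replicated, so the flag only affects this node.
modparam("htable", "htable", "maint=>size=1;initval=0;")

request_route {
    # New call (INVITE without To-tag) while this node is draining:
    # reject it so the registrar or IC peer can retry on the other node.
    if (is_method("INVITE") && !has_totag() && $sht(maint=>drain) == 1) {
        sl_send_reply("503", "Service Unavailable");
        exit;
    }

    # ... existing routing logic continues here ...
}
```

In-dialog requests (BYE, re-INVITE) are not affected, so existing calls can finish normally. The flag could be toggled without a restart via the htable RPC commands, e.g. 'kamcmd htable.seti maint drain 1' to start draining and 'kamcmd htable.seti maint drain 0' to resume.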
My issue now: how can I find out that one specific node no longer has any active dialogs?
'kamcmd dlg.stats_active' returns the combined count of all DMQ-synced nodes, not just the local one.
Any suggestions or other ideas on how I can 'reload' the Kamailio config without disrupting active dialogs?
My last resort would be to look into the database:
modparam("dialog", "h_id_start", -1) # Use server_id
modparam("dialog", "h_id_step", 2)
So odd/even h_id values should tell me the count per node. But I see a lot of orphaned dialogs hanging around in the database that are never cleaned up, so I guess that will not be reliable at all.
Yes, I know I will get the question: 'Why do you need to restart Kamailio that often?'
We have gone into production with our Kamailio-based TSP platform. And of course, despite a LOT of testing beforehand, some issue always pops up. At the moment I have to deploy a config change about once or twice a week to fix some new minor issue.
I hope that at some point in the future we will have a stable config that lasts for several months, but at the moment this is the situation.
Hello,
What is the purpose of the DMQ replication? To limit active calls?
What exactly happens? What do you mean by "corrupts"? Which data/fields become corrupted?
Regarding rejecting calls to drain the instance before a restart, check whether the originators of the calls support 305 Use Proxy; it might be more suitable.
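A sketch of the 305 variant, assuming a per-node drain flag like 'maint=>drain' (hypothetical) and a hypothetical address 'core2.example.net' for the other core node. RFC 3261 says the proxy to use is given in the Contact header of the 305 response; whether each originator actually honours it on INVITE is exactly what would need checking:

```cfg
# Load these only if the existing config does not already do so.
loadmodule "htable.so"
loadmodule "siputils.so"
loadmodule "sl.so"
loadmodule "textops.so"

# Hypothetical local drain flag; an unset key reads as 0.
modparam("htable", "htable", "maint=>size=1;initval=0;")

request_route {
    if (is_method("INVITE") && !has_totag() && $sht(maint=>drain) == 1) {
        # Add a Contact header pointing at the other core node
        # (hypothetical address) to the locally generated reply.
        append_to_reply("Contact: <sip:core2.example.net>\r\n");
        sl_send_reply("305", "Use Proxy");
        exit;
    }

    # ... existing routing logic continues here ...
}
```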
Cheers, Daniel
On 05.07.23 11:38, Benoît Panizzon wrote:
-- Kind regards
-Benoît Panizzon- @ home office and reachable as usual
I m p r o W a r e A G - Head of Commerce Customers ______________________________________________________
Zurlindenstrasse 29        Tel +41 61 826 93 00
CH-4133 Pratteln           Fax +41 61 826 93 01
Schweiz                    Web http://www.imp.ch
______________________________________________________
__________________________________________________________
Kamailio - Users Mailing List - Non Commercial Discussions
To unsubscribe send an email to sr-users-leave@lists.kamailio.org
Important: keep the mailing list in the recipients, do not reply only to the sender!
Edit mailing list options or unsubscribe:
Hi Daniel
PS: We are running Kamailio 5.5, so we are not on the latest release yet.
Thank you for helping with this issue, and perhaps for hinting at how it could be improved.
> What is the purpose of the DMQ replication? To limit active calls?
Exactly. Our subscriptions include a certain number of 'channels'. If they are all in use, the customer is busy.
So I use dialog profile counters to track the number of channels in use per customer.
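For reference, a per-customer channel check with dialog profiles could look like the sketch below. It reuses the 'custprofilecounter' profile (a profile with values) from the parameters listed in this message; '$var(cust)' and '$var(channels)' are hypothetical variables assumed to be filled earlier, e.g. from the customer database. With 'enable_dmq' set, the profile size is presumably meant to include dialogs replicated from the other node, which is the reason for using DMQ here:

```cfg
request_route {
    # ... authentication / customer lookup sets $var(cust) and $var(channels) ...

    if (is_method("INVITE") && !has_totag()) {
        # Number of dialogs currently in the profile for this customer.
        get_profile_size("custprofilecounter", "$var(cust)", "$var(used)");
        if ($var(used) >= $var(channels)) {
            # All subscribed channels are in use: the customer is busy.
            sl_send_reply("486", "Busy Here");
            exit;
        }
        # Track the call as a dialog (or setflag(FLT_DLG) as in the
        # existing config) and count it against the customer.
        dlg_manage();
        set_dlg_profile("custprofilecounter", "$var(cust)");
    }

    # ... existing routing logic continues here ...
}
```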
> What exactly happens? What do you mean by "corrupts"? Which data/fields become corrupted?
It looks like some dialogs simply no longer exist after a reload. Or they exist but are not found.
Observed issues:
- Dialog variables that were populated before the restart no longer exist.
- When a call ends, the corresponding dialog is not found, so the dialog module is unable to end the CDR. When the dialog timeout hits, the CDR is then written with a duration equal to the timeout value, which is far longer than the actual call duration.
- Profile counters for dialogs that were not found when the call ended are still present, so 'POTS' customers with a single channel stay 'busy' until the dialog timeout hits.
- The database accumulates data from dialogs that no longer exist.
The specific error I see when a dialog should be ended but Kamailio can no longer find it after a restart is:
ERROR: dialog [dlg_dmq.c:289]: dlg_dmq_handle_msg(): dialog [838:15539] not found
If it helps, I could try to dig out the full log of a dialog experiencing that issue.
Dialog parameters used:
modparam("dialog", "send_bye", 1)
modparam("dialog", "timer_procs", 0)
modparam("dialog", "db_mode", 1 )
modparam("dialog", "db_url", DBLOCAL )
modparam("dialog", "dlg_flag", FLT_DLG )
modparam("dialog", "dlg_match_mode", 1)
modparam("dialog", "dlg_extra_hdrs", "Hint: Initiated by IMP Core Proxy\r\n")
modparam("dialog", "hash_size", 4096 )
# Do not send any keepalive messages in dialog
modparam("dialog", "ka_timer", 0)
modparam("dialog", "ka_interval", 30 )
modparam("dialog", "enable_stats", 1 )
modparam("dialog", "detect_spirals", 1 )
modparam("dialog", "bridge_controller", "sip:controller@imp.ch")
modparam("dialog", "default_timeout", 43200 )
modparam("dialog", "timeout_avp", "$avp(dlgtimeout)") # Needs to be same as sst timeout!
modparam("dialog", "profiles_no_value", "callcounter;total_sbcincoming");
modparam("dialog", "profiles_with_value", "dispatchout;sbcincoming;trunkincoming;cpeincoming;safariincoming;custprofilecounter;legcounter");
modparam("dialog", "enable_dmq", 1)
modparam("dialog", "h_id_start", -1) # Use server_id
modparam("dialog", "h_id_step", 2)
Each node uses a local database (defined as DBLOCAL); the nodes do not access our common 'remote' database, which holds, for example, customer authentication information.
> Regarding rejecting calls to drain the instance before a restart, check whether the originators of the calls support 305 Use Proxy; it might be more suitable.
Our registrar nodes run Kamailio too, so implementing that would be an option. Regarding our ICs with other TSPs and carriers, I would have to check: at the moment they are all connected via a commercial vendor SBC, so if that SBC can handle 305 on INVITEs (it can on REGISTER), that would work.
But one of our goals is to eventually get rid of that SBC, which has some limitations and a not very advantageous 'feature' licensing model, in favour of open source and flexibility by using Kamailio and rtpengine for that task. But then we would have to check with every IC we have. I know that at the moment 503 is understood by all our switches connected to Kamailio, and our registrars also treat a 503 as a failure and move on to the other node in the dispatcher list.
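On a Kamailio registrar, moving on to the other core node after a 503 is commonly done with the dispatcher module and a failure route. A sketch, assuming the two core nodes are in dispatcher set 1 and that the route names 'TO_CORE' and 'CORE_FAILOVER' are free (all hypothetical):

```cfg
# Load these only if the existing config does not already do so.
loadmodule "tm.so"
loadmodule "sl.so"
loadmodule "dispatcher.so"

# Hypothetical list file containing set 1 with both core nodes.
modparam("dispatcher", "list_file", "/etc/kamailio/dispatcher.list")

# Call route(TO_CORE) from request_route for calls towards the core.
route[TO_CORE] {
    # Pick a core node from set 1 (algorithm 4 = round robin); the
    # remaining nodes stay available for ds_next_dst().
    if (!ds_select_dst("1", "4")) {
        sl_send_reply("503", "No core node available");
        exit;
    }
    t_on_failure("CORE_FAILOVER");
    if (!t_relay()) {
        sl_reply_error();
    }
    exit;
}

failure_route[CORE_FAILOVER] {
    if (t_is_canceled()) {
        exit;
    }
    # Core node is draining (503) or not answering: try the next one.
    if ((t_check_status("503") || t_branch_timeout()) && ds_next_dst()) {
        t_on_failure("CORE_FAILOVER");
        t_relay();
        exit;
    }
}
```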
Hello,
while I did some work on the dialog module over time, it is not one of my favourite modules, apart from using it to enforce a maximum call duration (for which it should work fine). Also, I never ended up using it for CDR generation; I prefer the event-based accounting of the acc module, which can record several events for the same call.
That said, for limiting active calls I usually rely on other solutions built in the config file, leveraging htable or various backends. Also, for values that I need during the call, I use htable.
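As an illustration of keeping per-call values in htable rather than in dialog variables, a sketch assuming a hypothetical table 'callvars' keyed by Call-ID, with 'autoexpire' as a safety net for calls whose BYE is never seen (43200 s matches the dialog 'default_timeout' used in this thread). '$var(cust)' is a hypothetical variable:

```cfg
# Load these only if the existing config does not already do so.
loadmodule "htable.so"
loadmodule "siputils.so"

# One item per call; items are dropped after 12 h even without a BYE.
modparam("htable", "htable", "callvars=>size=10;autoexpire=43200;")

request_route {
    # ... existing routing logic ...

    if (is_method("INVITE") && !has_totag()) {
        # Remember a per-call value, keyed by Call-ID.
        $sht(callvars=>$ci) = $var(cust);
    } else if (is_method("BYE")) {
        # Fetch the value for this call, then delete the item
        # (assigning $null removes it from the table).
        $var(cust) = $sht(callvars=>$ci);
        $sht(callvars=>$ci) = $null;
    }

    # ... existing routing logic continues here ...
}
```

These items live in shared memory just like dialog data; the htable 'db_url', 'dbtable' and 'dbmode' settings can persist a table across restarts, and 'dmqreplicate=1' can replicate it to the other node. Both would need testing for the restart scenario discussed here.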
Anyhow, I find it strange that after a restart a request within a dialog does not match the record loaded in memory, because the record is obviously there: you say the dialog times out at some later point. Did you change the value of the hash_size modparam?
Have you captured the SIP traffic, and can you see the 'did' parameter in the Route headers of the BYE?
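Besides a packet capture, one way to check this on the node itself is to log the Route header of each BYE and whether the dialog module matched it. A sketch, assuming a hypothetical route 'LOG_BYE' that is called right after loose_route() in the in-dialog branch of the config; '$dlg(h_entry)' and '$dlg(h_id)' presumably correspond to the '[838:15539]' pair in the error quoted above:

```cfg
# Load these only if the existing config does not already do so.
loadmodule "dialog.so"
loadmodule "xlog.so"

# Hypothetical helper: call route(LOG_BYE) right after loose_route().
route[LOG_BYE] {
    if (is_method("BYE")) {
        # The Route header as received should carry the 'did' parameter.
        xlog("L_INFO", "BYE Call-ID=$ci Route=[$hdr(Route)]\n");
        if (is_known_dlg()) {
            xlog("L_INFO", "BYE matched dialog [$dlg(h_entry):$dlg(h_id)]\n");
        } else {
            xlog("L_WARN", "BYE Call-ID=$ci did not match any dialog\n");
        }
    }
}
```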
Cheers, Daniel
On 05.07.23 15:44, Benoît Panizzon wrote:
Hello,
Have you tried using Redis as a backend for the dialog module? You can count active calls per node like this:
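modparam("htable", "htable", "dialog_counter=>size=8;")

# Executed when a dialog is confirmed (2xx to the INVITE).
event_route[dialog:start] {
    sht_lock("dialog_counter=>count");
    if ($sht(dialog_counter=>count) == $null) {
        $sht(dialog_counter=>count) = 1;
    } else {
        $sht(dialog_counter=>count) = $sht(dialog_counter=>count) + 1;
    }
    sht_unlock("dialog_counter=>count");
}

# Executed when a confirmed dialog is terminated.
event_route[dialog:end] {
    sht_lock("dialog_counter=>count");
    $sht(dialog_counter=>count) = $sht(dialog_counter=>count) - 1;
    sht_unlock("dialog_counter=>count");
}

# dialog:failed runs only for dialogs that were never confirmed, i.e. that
# never passed through dialog:start, so it must not decrement the counter.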
Then you can read the counter with:
kamcmd htable.get dialog_counter count
On Wed, 5 Jul 2023 at 17:28, Daniel-Constantin Mierla miconda@gmail.com wrote:
-- Daniel-Constantin Mierla -- www.asipto.com www.twitter.com/miconda -- www.linkedin.com/in/miconda Kamailio World Conference - www.kamailioworld.com