Hi Henning,
thanks for the input. The problem is that the peer we have to find a workaround for is
actually Deutsche Telekom with their CompanyFlex accounts. I know it's bad to have
non-working servers in their SRV records, especially with the highest priority, but
apparently they are in the middle of some kind of migration, and such outages currently
happen almost every week. And since we enable end users to use our PBX with the
CompanyFlex service, we have to follow the rules of their TR119 spec. That says: if we
detect an outage, we must use the next-priority server from the SRV record and retry the
first-priority server once in a while. But those outages can last for hours, if not days.
Currently, whenever we detect such an outage, we manipulate our DNS forwarder, which is
an evil solution as well.
And since every customer has their own DNS SRV record ($userid.primary.companyflex.de),
the second approach of setting up dispatcher sets is unfortunately not an option.
The dst_blocklist feature blocked traffic to another trunk provider for our customers
twice in as many days, and since that peer is the main end-user registration server, we
cannot route the traffic differently. We had not noticed outages with this peer before,
so the outage probably lasts only a few seconds, but that is enough for the
dst_blocklist module to catch it.
I have already thought about implementing something manually: catching errors in the
failure route and then blocking the destination myself, but I don't know how. There seem
to be no functions to fill the dst_blocklist from the routing logic, and even if I kept
an entry, for example, in a hash table, I wouldn't know how to tell Kamailio to send the
request to the next-best server from the SRV result. Is it even possible to manually
choose a server from an SRV result set?
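Just to illustrate what I have in mind, here is a rough, untested sketch; the htable
name and keys are made up, and I am only assuming that the ipops module's srv_query()
can be combined with $du this way:

```cfg
# rough, untested sketch: remember failed targets in an htable from the
# failure route, and on new requests query the SRV record ourselves
# (ipops srv_query) to skip a target that was recently marked as failed
loadmodule "htable.so"
loadmodule "ipops.so"

# entries expire automatically after 5 minutes
modparam("htable", "htable", "failed=>size=8;autoexpire=300")

request_route {
    # ... normal routing ...
    $var(srvname) = "_sip._udp." + $rd;
    if (srv_query("$var(srvname)", "p") > 1) {
        # if the best-priority target failed recently, pick the next one
        if ($sht(failed=>$srvquery(p=>target[0])) != $null) {
            $du = "sip:" + $srvquery(p=>target[1])
                + ":" + $srvquery(p=>port[1]);
        }
    }
    t_on_failure("TRUNK");
    t_relay();
}

failure_route[TRUNK] {
    if (t_branch_timeout()) {
        # mark the domain of the failed destination URI
        $sht(failed=>$dd) = 1;
    }
}
```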
Regards
Sebastian
________________________________
From: Henning Westerholt <hw(a)gilawa.com>
Sent: Thursday, December 15, 2022 16:59
To: Kamailio (SER) - Users Mailing List <sr-users(a)lists.kamailio.org>
Cc: Sebastian Damm <sdamm(a)pascom.net>
Subject: RE: Dealing with failed SRV peers
Hello Sebastian,
actually, the fault is with the provider, as they do not manage their DNS records
properly. It makes no sense to return non-working systems, but some of them do not care.
I would probably just use the dst_blocklist functionality, perhaps with a shorter
internal TTL.
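If I remember the core parameter names correctly, that would be something along these
lines (the values are only examples):

```cfg
# enable the destination blocklist with a short expiry, so that a
# short hiccup does not take a peer out for a whole minute
use_dst_blocklist = on
dst_blocklist_expire = 30       # seconds an entry stays listed
dst_blocklist_gc_interval = 30  # garbage collection interval
```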
Regarding the peers that have only one server which fails: I would just route to another
provider in this case, if they cannot be bothered to fix it or to provide redundancy.
You could also implement a script that periodically fetches the SRV record, creates a
dispatcher configuration from it, and then use the dispatcher module. You could use
active OPTIONS ping probing, or also manually deactivate the failed hosts for the time
period.
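A minimal sketch of such a script in Python; the set id, the file path, and the use of
dig and kamcmd are only assumptions to illustrate the idea:

```python
#!/usr/bin/env python3
# sketch: periodically resolve the SRV record, rewrite the dispatcher
# list file, then tell Kamailio to reload it via kamcmd
import subprocess

def srv_to_dispatcher(setid, records):
    """Build dispatcher.list content from (priority, weight, port, target)
    tuples, lowest priority first (dispatcher has no priorities itself,
    so we just preserve the SRV order in the file)."""
    lines = []
    for prio, weight, port, target in sorted(records):
        lines.append("%d sip:%s:%d" % (setid, target.rstrip("."), port))
    return "\n".join(lines) + "\n"

def refresh(domain, setid=1, path="/etc/kamailio/dispatcher.list"):
    # parse `dig +short SRV` output lines: "prio weight port target"
    out = subprocess.check_output(
        ["dig", "+short", "SRV", "_sip._udp." + domain], text=True)
    records = []
    for line in out.splitlines():
        prio, weight, port, target = line.split()
        records.append((int(prio), int(weight), int(port), target))
    with open(path, "w") as f:
        f.write(srv_to_dispatcher(setid, records))
    subprocess.run(["kamcmd", "dispatcher.reload"], check=True)

# example: refresh("example.com", setid=1), e.g. from a cron job
```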
Cheers,
Henning
--
Henning Westerholt –
https://skalatan.de/blog/
Kamailio services –
https://gilawa.com
From: Sebastian Damm <sdamm(a)pascom.net>
Sent: Thursday, December 15, 2022 7:29 AM
To: Kamailio (SER) - Users Mailing List <sr-users(a)lists.kamailio.org>
Subject: [SR-Users] Dealing with failed SRV peers
Hi,
we have some Kamailio servers working as outbound proxies. They receive requests from
internal systems and send them to different providers. From time to time, one provider
returns a server as the primary resource that is currently unavailable.
I guess if the internal systems connected directly to the target, they would remember
the failed server and keep using the second-priority server at least until the DNS
records are refreshed. In our setup, since every request is "new" to Kamailio, it
doesn't remember which hosts are reachable.
Example:
target: example.com
_sip._udp.example.com SRV resolves to (priority, target, port):
10 192.0.2.42 5060
20 198.51.100.42 5060
30 203.0.113.42 5060
192.0.2.42 is unavailable. Still, Kamailio uses it for every new request, and failover
to 198.51.100.42 occurs only after the timeout is hit.
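For context, we already have DNS failover enabled in the core, roughly like this
(simplified):

```cfg
# DNS failover is on, so tm tries the next SRV target after a timeout,
# but only within the same transaction; the next request starts over
use_dns_failover = on
dns_srv_lb = on   # honor SRV priorities/weights for load balancing
```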
Is there a best practice for solving this? I have played around with the dst_blocklist
settings, but that caused even more trouble, because Kamailio started blocking requests
to peers that have only one server in the record whenever they had a short hiccup.
Thanks in advance for any input, as this causes trouble every time we run into such a
situation.
Regards,
Sebastian