Hi all,
We use DNS SRV balancing and forward_tcp() to replicate registrations from one SER to the other SERs in the system.
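(For illustration, a minimal ser.cfg sketch of such a replication block; the peer name is hypothetical and only one replica is shown:)

    if (method == "REGISTER") {
        save("location");
        # mirror the REGISTER to a replication peer over TCP; note that
        # forward_tcp() blocks the SER child until the TCP connect
        # succeeds or times out
        forward_tcp("ser2.example.com", 5060);
    };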
Now when one machine completely crashes, all SER processes on all other machines hang when processing a REGISTER until the TCP connect times out, leading to a system load of ~16 per machine (assuming 16 child processes per SER), and no other messages can be processed.
I understand that replicating over UDP would avoid this issue, but then replicated registrations get lost every now and then because of unreliable transmission, and as far as I can tell, t_replicate() can only be used to replicate to *one* other SER.
This really gets me thinking about patching out the internal location cache and looking up every location in the database instead, because this additional lookup really doesn't hurt next to the ~10 other DB queries per call.
IMHO in systems with more than two SERs this cache is just a big pain.
Andy
Hi Andreas, You are probably one of the people on this list with the most experience with replication. AFAIK, you are correct on all statements below. I assume you have SERs in different locations, since the TCP connect timeout is a problem? But I'm not sure why removing the cache would help you?! Unless you want to move to a cluster or DB-layer replication?
IMHO, there are only two valid paths for replication in SER: either develop SIP-layer replication with guaranteed delivery, queuing, non-blocking operation, etc. (which ends up being proprietary to SER), or patch up SER to better handle DB-based replication. I lean towards DB-based replication. Two prominent things must be handled: storing the Path information for proper routing of messages to UAs behind NAT, and a cache that checks the DB if a location is not found in memory.
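(For reference, usrloc can already write registrations through to the database; what is missing is the reverse direction Greger describes, falling back to a DB query on a cache miss. A minimal sketch of the existing write-through setup, with a hypothetical DB URL:)

    loadmodule "/usr/lib/ser/modules/usrloc.so"
    modparam("usrloc", "db_url", "mysql://ser:password@dbhost/ser")
    # db_mode 1 = write-through: every registration is written to the
    # location table immediately, but lookups are still served from memory
    modparam("usrloc", "db_mode", 1)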
I would be very interested in patches for this in the experimental CVS module ;-)
g-)
Andreas Granig wrote:
Hi all,
We use DNS SRV balancing and forward_tcp() to replicate registrations from one SER to the other SERs in the system.
Now when one machine completely crashes, all SER processes on all other machines hang when processing a REGISTER until the TCP connect times out, leading to a system load of ~16 per machine (assuming 16 child processes per SER), and no other messages can be processed.
I understand that replicating over UDP would avoid this issue, but then replicated registrations get lost every now and then because of unreliable transmission, and as far as I can tell, t_replicate() can only be used to replicate to *one* other SER.
This really gets me thinking about patching out the internal location cache and looking up every location in the database instead, because this additional lookup really doesn't hurt next to the ~10 other DB queries per call.
IMHO in systems with more than two SERs this cache is just a big pain.
Andy
Hi Andy!
Another problem: nathelper uses the in-memory location table to ping NATed clients. Thus, nathelper would also have to query the database, and we would need a process to watch the expiry times and delete outdated entries.
regards, klaus
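(For reference, the in-memory pinging Klaus refers to is driven by nathelper's timer; a minimal sketch of the relevant parameters, with illustrative values:)

    loadmodule "/usr/lib/ser/modules/nathelper.so"
    # send a NAT keep-alive every 30 seconds ...
    modparam("nathelper", "natping_interval", 30)
    # ... but only to contacts that were flagged as NATed when they registered
    modparam("nathelper", "ping_nated_only", 1)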
Greger V. Teigre wrote:
Hi Andreas, You are probably one of the people on this list with the most experience with replication. AFAIK, you are correct on all statements below. I assume you have SERs in different locations, since the TCP connect timeout is a problem? But I'm not sure why removing the cache would help you?! Unless you want to move to a cluster or DB-layer replication?
IMHO, there are only two valid paths for replication in SER: either develop SIP-layer replication with guaranteed delivery, queuing, non-blocking operation, etc. (which ends up being proprietary to SER), or patch up SER to better handle DB-based replication. I lean towards DB-based replication. Two prominent things must be handled: storing the Path information for proper routing of messages to UAs behind NAT, and a cache that checks the DB if a location is not found in memory.
I would be very interested in patches for this in the experimental CVS module ;-)
g-)
:-) Jan participated in a discussion on serusers about a new cache system where locations were not loaded at start-up, but rather on demand. I'm not sure how this plays with nathelper, but I'm sure he has been thinking about it. Maybe we could help him with the cache coding? BTW, only the SER acting as the SIP registrar for a given UA needs to ping. This means it will be easier, as no locations outside the cache will have to be pinged. g-)
Klaus Darilion wrote:
Hi Andy!
Another problem: nathelper uses the in-memory location table to ping NATed clients. Thus, nathelper would also have to query the database, and we would need a process to watch the expiry times and delete outdated entries. regards, klaus
Greger V. Teigre wrote:
:-) Jan participated in a discussion on serusers about a new cache system where locations were not loaded at start-up, but rather on demand.
Well, I just read the thread at http://lists.iptel.org/pipermail/serusers/2005-May/019867.html and for me, as a guy who wants to run a scalable system with a minimum of effort and as soon as possible, one question comes up: is it really worth the effort to implement such a caching system?
Don't get me wrong, I'm sure there are a lot of people who benefit from this caching mechanism in terms of performance, but my experience with it in a bigger environment is, well, very bad.
For me (and I think a lot of other people, like Paul, who mentioned it too in the above thread), the performance gained from the cache lookup is so minimal compared to my 10-15 other MySQL queries and 1-2 ENUM queries per call that I really don't care, especially when thinking about the advantages of a consistent location table in the backend DB:
No need to replicate on the SIP level anymore (and there's no really usable mechanism for that for more than two SER nodes), no need to distribute other contacts like aliases to every single SER node using a remote FIFO hack, much less memory needed, much faster startup times and so on, and all that for just a few more milliseconds of latency.
So IMHO the option of disabling the cache absolutely makes sense *in some circumstances*, like large-scale systems.
Andy
Andreas Granig writes:
For me (and I think a lot of other people, like Paul, who mentioned it too in the above thread), the performance gained from the cache lookup is so minimal compared to my 10-15 other MySQL queries and 1-2 ENUM queries per call that I really don't care, especially when thinking about the advantages of a consistent location table in the backend DB:
andreas,
i agree with you that a consistent location table in the db is a good idea, but i don't understand why you need 10-15 mysql queries. switch to radius and return all caller attributes as a result of authentication and all callee attributes as a result of the uri exists check. you end up with only two queries (plus accounting).
-- juha
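(For illustration, the two-query flow Juha describes might look roughly like this in ser.cfg; the realm is hypothetical and error handling is reduced to the bare minimum:)

    # query 1: authenticate the caller; the RADIUS Access-Accept can carry
    # all caller attributes, which auth_radius stores in AVPs
    if (!radius_proxy_authorize("foo.bar")) {
        proxy_challenge("foo.bar", "0");
        break;
    };
    # query 2: check that the callee exists; the callee attributes come
    # back in the same reply and also end up in AVPs
    if (!radius_does_uri_exist()) {
        sl_send_reply("604", "Does Not Exist Anywhere");
        break;
    };
    if (!lookup("location")) {
        sl_send_reply("480", "Temporarily Unavailable");
        break;
    };
    t_relay();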
Juha Heinanen wrote:
i agree with you that a consistent location table in the db is a good idea, but i don't understand why you need 10-15 mysql queries. switch to radius and return all caller attributes as a result of authentication and all callee attributes as a result of the uri exists check. you end up with only two queries (plus accounting).
Well, I have to do some URI mangling depending on the From header, and I use quite a few modules which rely on a database backend (group, lcr, some of my own, just to name a few), so I don't think it makes sense to switch to RADIUS (please correct me if I'm wrong).
Andy
Andreas Granig writes:
Well, I have to do some URI mangling depending on the From header, and I use quite a few modules which rely on a database backend (group, lcr, some of my own, just to name a few), so I don't think it makes sense to switch to RADIUS (please correct me if I'm wrong).
if you use radius, you don't need any separate group lookups, nor do you need any ser group modules. as for lcr, the implementation should be sped up to do everything in memory without the need for per-request db lookups, but i haven't had time for that.
-- juha
Juha Heinanen wrote:
i agree with you that a consistent location table in the db is a good idea, but i don't understand why you need 10-15 mysql queries. switch to radius and return all caller attributes as a result of authentication and all callee attributes as a result of the uri exists check. you end up with only two queries (plus accounting).
I just came across this posting from you (well, a little bit outdated) while digging through Path header documentation, where you argue the same way I did (that one additional query doesn't matter):
http://www1.ietf.org/mail-archive/web/sip/current/msg04087.html
Now you say that this can all be done using RADIUS. I don't know much about RADIUS, but looking at the auth_radius and uri_radius modules I just see the xxx_authorize() and radius_does_uri_exist() functions, and I don't see how these could help me (except for writing my own authentication module which stores the additional attributes in AVPs that I could query subsequently).
Besides that, the URI could be altered by some kind of call-forwarding lookup after some basic checks (call barring and such), and checks against this rewritten URI have to be made again, which requires additional queries.
Are you willing to share your experience doing these things with RADIUS?
Thanks, Andy
I've greatly reduced DB lookups and external scripts using RADIUS and avpops:
1. During RADIUS authentication, some parameters are returned as SIP-AVP attributes, which are used later, e.g. the remote party ID, ...
2. User preferences are loaded for the From URI (caller preferences) and, if necessary, for the called URI (callee preferences, e.g. a voicemail URI) using avp_db_load().
Thus, there is one RADIUS lookup for authentication and, depending on the call direction (incoming, outgoing, local), one or two DB lookups for the user parameters. From then on, everything can be handled (at least in our scenario) internally using avpops.
regards klaus
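(A rough sketch of the flow Klaus describes; the realm, the attribute names and the avp_db_load() arguments are illustrative only and follow the avpops README conventions:)

    # RADIUS authentication; the Access-Accept carries SIP-AVP attributes
    # (e.g. the remote party id) that become AVPs usable later in the script
    if (!radius_proxy_authorize("foo.bar")) {
        proxy_challenge("foo.bar", "0");
        break;
    };
    # caller preferences, keyed on the From URI ...
    avp_db_load("$from", "s:caller_prefs");
    # ... and, for calls to local users, callee preferences such as a voicemail URI
    avp_db_load("$ruri", "s:voicemail");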
Andreas Granig wrote:
Juha Heinanen wrote:
i agree with you that a consistent location table in the db is a good idea, but i don't understand why you need 10-15 mysql queries. switch to radius and return all caller attributes as a result of authentication and all callee attributes as a result of the uri exists check. you end up with only two queries (plus accounting).
I just came across this posting from you (well, a little bit outdated) while digging through Path header documentation, where you argue the same way I did (that one additional query doesn't matter):
http://www1.ietf.org/mail-archive/web/sip/current/msg04087.html
Now you say that this can all be done using RADIUS. I don't know much about RADIUS, but looking at the auth_radius and uri_radius modules I just see the xxx_authorize() and radius_does_uri_exist() functions, and I don't see how these could help me (except for writing my own authentication module which stores the additional attributes in AVPs that I could query subsequently).
Besides that, the URI could be altered by some kind of call-forwarding lookup after some basic checks (call barring and such), and checks against this rewritten URI have to be made again, which requires additional queries.
Are you willing to share your experience doing these things with RADIUS?
Thanks, Andy
Klaus Darilion writes:
- User preferences are loaded for the From URI (caller preferences) and, if necessary, for the called URI (callee preferences, e.g. a voicemail URI) using avp_db_load().
you could also return these from radius as results of the authorize or does_uri_exist checks.
-- juha
Juha Heinanen wrote:
Klaus Darilion writes:
- User preferences are loaded for the From URI (caller preferences) and, if necessary, for the called URI (callee preferences, e.g. a voicemail URI) using avp_db_load().
you could also return these from radius as results of the authorize or does_uri_exist checks.
Yes, technically I could, but in my environment the authentication data and the user preferences are stored in different places maintained by different entities. Thus it is easier this way :-)
klaus
Juha Heinanen wrote:
Klaus Darilion writes:
- User preferences are loaded for the From URI (caller preferences) and, if necessary, for the called URI (callee preferences, e.g. a voicemail URI) using avp_db_load().
you could also return these from radius as results of the authorize or does_uri_exist checks.
Ok, thanks for the hints, I should really give it a try.
So theoretically one could also return the contacts for the R-URI in radius_does_uri_exist(), without the need to modify the registrar module (ignoring the nathelper module for now)? Then there would be no performance penalty for loading the contacts from the DB for each initial request...
Regards, Andy
Andreas Granig writes:
So theoretically one could also return the contacts for the R-URI in radius_does_uri_exist(), without the need to modify the registrar module (ignoring the nathelper module for now)? Then there would be no performance penalty for loading the contacts from the DB for each initial request...
you can return AVPs from multiple tables using a mysql union construct, but i don't know if that would work if the tables are in more than one database.
-- juha
Hi,
Going back to the original discussion... we are also very interested in a cacheless solution: basically, remove the cache from SER and have the save/lookup functions write/read the database directly. I am less concerned about NAT pinging. If the location info is in the DB, it is probably not too hard to have a daemon or cron job send out a ping/OPTIONS periodically outside SER.
We are willing to contribute some money for this feature if that helps.
Thanks, Richard
Andreas Granig writes:
I just came across this posting from you (well, a little bit outdated) while digging through Path header documentation, where you argue the same way I did (that one additional query doesn't matter):
http://www1.ietf.org/mail-archive/web/sip/current/msg04087.html
that was in 2002. things have improved a lot in ser since then, and now the number of db queries can be very few.
Now you say that this can all be done using RADIUS. I don't know much about RADIUS, but looking at the auth_radius and uri_radius modules I just see the xxx_authorize() and radius_does_uri_exist() functions, and I don't see how these could help me (except for writing my own authentication module which stores the additional attributes in AVPs that I could query subsequently).
you don't need to implement your own modules. both radius_www_authorize and radius_does_uri_exist can now store any number of attributes into AVPs as a side effect of the call.
Besides that, the URI could be altered by some kind of call-forwarding lookup after some basic checks (call barring and such), and checks against this rewritten URI have to be made again, which requires additional queries.
call forwarding attributes can also be returned by the radius_does_uri_exist call.
Are you willing to share your experience doing these things with RADIUS?
there is really nothing special. for example, if the called uri has unconditional forwarding on, radius would return the new uri as the value of a "cf_unc" attribute, which you can then test in your ser.cfg.
-- juha
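(For illustration, a rough ser.cfg fragment for the call-forwarding case, assuming the RADIUS dictionary maps the reply attribute into an AVP named "cf_unc" and that avpops is used to act on it:)

    if (!radius_does_uri_exist()) {
        sl_send_reply("604", "Does Not Exist Anywhere");
        break;
    };
    # avp_pushto() fails when the AVP is not set, so it doubles as the test:
    # if unconditional forwarding is on, rewrite the R-URI with the new target
    if (avp_pushto("$ruri", "s:cf_unc")) {
        t_relay();
        break;
    };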
Andreas Granig wrote:
Greger V. Teigre wrote:
:-) Jan participated in a discussion on serusers about a new cache system where locations were not loaded at start-up, but rather on demand.
Well, I just read the thread at http://lists.iptel.org/pipermail/serusers/2005-May/019867.html and for me, as a guy who wants to run a scalable system with a minimum of effort and as soon as possible, one question comes up: is it really worth the effort to implement such a caching system?
I see your point. However, SER has many different uses and many different deployment scenarios. In some scenarios the focus is on high throughput for 20K users (who are divided across registrars), DNS SRV is used, and Linux HA is used for failover. Relying on a database lookup for every single lookup("location") will reduce performance quite a lot.
Don't get me wrong, I'm sure there are a lot of people who benefit from this caching mechanism in terms of performance, but my experience with it in a bigger environment is, well, very bad.
Again, I think it depends on your setup. The setup mentioned above does not use DB replication, but separate databases and a Path-type implementation for routing between registrars for incoming INVITEs.
For me (and I think a lot of other people, like Paul, who mentioned it too in the above thread), the performance gained from the cache lookup is so minimal compared to my 10-15 other MySQL queries and 1-2 ENUM queries per call that I really don't care, especially when thinking about the advantages of a consistent location table in the backend DB:
This I don't understand either. A user preference lookup is done only for the first INVITE, while lookup("location") is done for far more messages than just the first INVITE. And then you have nathelper's pings every 10-30 seconds...
No need to replicate on the SIP level anymore (and there's no really usable mechanism for that for more than two SER nodes), no need to distribute other contacts like aliases to every single SER node using a remote FIFO hack, much less memory needed, much faster startup times and so on, and all that for just a few more milliseconds of latency. So IMHO the option of disabling the cache absolutely makes sense *in some circumstances*, like large-scale systems.
Yes, I agree. It would be interesting to run a test on such a system to see the actual performance impact. g-)
Greger V. Teigre wrote:
I see your point. However, SER has many different uses and many different deployment scenarios. In some scenarios the focus is on high throughput for 20K users (who are divided across registrars), DNS SRV is used, and Linux HA is used for failover. Relying on a database lookup for every single lookup("location") will reduce performance quite a lot.
I fully understand that my approach will only help in some special cases, but I think it's important that SER users are given the option to disable the cache (assuming they know what they're doing), because it *can* reduce pain a lot in some circumstances.
Andy
Greger, Klaus,
Another problem: nathelper uses the in-memory location table to ping NATed clients. Thus, nathelper would also have to query the database, and we would need a process to watch the expiry times and delete outdated entries.
Interesting thoughts. I still don't have a complete overview of the cache and where exactly it is used, but I already suspected this wouldn't be that easy.
However, I'll start writing my diploma thesis in the next few weeks, and it will target SIP clusters using SER, because there's rather little publicly available work in this area.
The basic idea is to use one SER pair (let's say sip1.foo.bar) as some kind of session border controller, secured with some failover protocol (CARP/VRRP/...), which does the NAT pinging and SIP balancing. When this border proxy reaches its max. capacity, you can just add another pair (sip2.foo.bar) and propagate it to new customers.
Behind that border proxy you can dynamically add/remove the proxies which do the routing. They all have a database cluster behind them which they use for location lookups and other things. After experimenting a lot with SIP replication, I think it's best to delegate this to the DB system, where a lot of work has been done in this area for years now.
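(For illustration, the border-proxy side of such a setup might look roughly like this in ser.cfg; host names are hypothetical and REGISTER/Path handling is left out:)

    route {
        # basic NAT handling on the border proxy
        force_rport();
        if (nat_uac_test("3")) {
            fix_nated_contact();
        };
        # stay in the signalling path for subsequent requests
        record_route();
        # hand the request over to one of the routing proxies behind the pair
        t_relay_to_udp("proxy1.internal.foo.bar", "5060");
    };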
I'm fully aware that all this sounds much easier than it is, but IMHO this is the way to go, because it's the only solution I can think of which scales and is also able to handle NAT.
Well, we'll see where this all leads us (or me) to ;o)
Andy
Andreas Granig wrote:
Greger, Klaus,
Another problem: nathelper uses the in-memory location table to ping NATed clients. Thus, nathelper would also have to query the database, and we would need a process to watch the expiry times and delete outdated entries.
Interesting thoughts. I still don't have a complete overview of the cache and where exactly it is used, but I already suspected this wouldn't be that easy.
However, I'll start writing my diploma thesis in the next few weeks, and it will target SIP clusters using SER, because there's rather little publicly available work in this area.
The basic idea is to use one SER pair (let's say sip1.foo.bar) as some kind of session border controller, secured with some failover protocol (CARP/VRRP/...), which does the NAT pinging and SIP balancing. When this border proxy reaches its max. capacity, you can just add another pair (sip2.foo.bar) and propagate it to new customers.
This means you also have to store the contacts of the registered clients in these proxies.
Behind that border proxy you can dynamically add/remove the proxies which do the routing. They all have a database cluster behind them which they use for location lookups and other things. After experimenting a lot with SIP replication, I think it's best to delegate this to the DB system, where a lot of work has been done in this area for years now.
As you have routing proxies and outbound proxies, have you thought about how to route the SIP messages from the routing proxies via the corresponding outbound proxy? Using the Path header, rewriting the Contact in the REGISTER messages on the outbound proxy before forwarding them to the routing proxy, or some other method?
regards, klaus
I'm fully aware that all this sounds much easier than it is, but IMHO this is the way to go, because it's the only solution I can think of which scales and is also able to handle NAT.
Well, we'll see where this all leads us (or me) to ;o)
Andy
Klaus Darilion wrote:
The basic idea is to use one SER pair (let's say sip1.foo.bar) as some kind of session border controller, secured with some failover protocol (CARP/VRRP/...), which does the NAT pinging and SIP balancing. When this border proxy reaches its max. capacity, you can just add another pair (sip2.foo.bar) and propagate it to new customers.
This means you also have to store the contacts of the registered clients in these proxies.
Right, and this is also a reason why I want to go for a single location table inside a DB cluster instead of using the cache. But first I'll have to look up the discussion with Jan about the new cache that Greger pointed out.
As you have routing proxies and outbound proxies, have you thought about how to route the SIP messages from the routing proxies via the corresponding outbound proxy? Using the Path header, rewriting the Contact in the REGISTER messages on the outbound proxy before forwarding them to the routing proxy, or some other method?
I currently favour the Path header over rewriting the Contact, because that's what the Path header is designed for, isn't it? (I haven't fully read the RFC yet.)
Andy
Have you looked at the Path implementation in the experimental CVS module? It should be possible to use it for that purpose. As stated earlier on this list, it's on my to-do list to set up a test implementation with at least two registrars using that module, but... I'm not sure I understand the idea of using a SER pair as a session border controller. Also, with CARP/VRRP, do you plan an active-passive setup? g-)
Andreas Granig wrote:
Klaus Darilion wrote:
The basic idea is to use one SER pair (let's say sip1.foo.bar) as some kind of session border controller, secured with some failover protocol (CARP/VRRP/...), which does the NAT pinging and SIP balancing. When this border proxy reaches its max. capacity, you can just add another pair (sip2.foo.bar) and propagate it to new customers.
This means you also have to store the contacts of the registered clients in these proxies.
Right, and this is also a reason why I want to go for a single location table inside a DB cluster instead of using the cache. But first I'll have to look up the discussion with Jan about the new cache that Greger pointed out.
As you have routing proxies and outbound proxies, have you thought about how to route the SIP messages from the routing proxies via the corresponding outbound proxy? Using the Path header, rewriting the Contact in the REGISTER messages on the outbound proxy before forwarding them to the routing proxy, or some other method?
I currently favour the Path header over rewriting the Contact, because that's what the Path header is designed for, isn't it? (I haven't fully read the RFC yet.)
Andy