Solutions to missing BYEs, accounting for them - sr-dev

21 Apr 2010

[Sorry for cross-posting this from sr-users;  after some reflection 
upon posting, I got the impression this question may be more 
developer-centric than I initially imagined.]

Hi all,

Please forgive the slightly long post, but if you have anything to
contribute on this topic, please consider giving it a read as I could
really use your input.  :-)

Like, I'm sure, many of you who run proxy-based service delivery
platforms of some description, I am faced with the problem of
trying to account for calls with missing BYEs in a realistic way.
There is no shortage of mailing list posts over the years on this
topic.  Inevitably, in a platform with sufficient call volume, NAT'd
endpoints, endpoint diversity and other technical complications, there
will be some calls that are never officially terminated from the point
of view of a proxy.

The ability of the 'dialog' module to spoof bidirectional BYEs on
timeout[1] goes a long way toward addressing this problem
theoretically.  However, there are practical obstacles to relying on
it solely as a solution, mainly because there is not an acceptable
timeout value to use as a trade-off.  If the timeout period is set to
a very low value, users will obviously complain, and in any case,
depending on the destination, the worst-case scenario for maximum call 
billing may still be far too high.  If the timeout period is set
high--perhaps something like 5-8 hours--then all calls that fail to
end in the normal way will be billed some excessively large amount
that certainly will not sit well with users either.
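
To put a number on the high-timeout case, here is a minimal sketch of
the arithmetic.  The function name, the per-minute rate and the simple
pro-rata billing are all assumptions for illustration;  a real rating
engine will have its own increments and rounding.

```python
from decimal import Decimal

def overbilled_amount(actual_seconds, timeout_seconds, rate_per_minute):
    """Extra charge when a call with a lost BYE is only closed by the dialog timeout.

    Assumes pro-rata per-second billing (hypothetical; real rating
    engines usually round up to some billing increment).
    """
    # The dialog is only torn down at the timeout, so everything between
    # the real end of the call and the timeout is billed in error.
    extra_seconds = max(0, timeout_seconds - actual_seconds)
    return (Decimal(extra_seconds) / Decimal(60)) * Decimal(rate_per_minute)

# A 3-minute call whose BYE was lost, under a 6-hour dialog timeout,
# at a hypothetical rate of 0.02 per minute.
print(overbilled_amount(180, 6 * 3600, "0.02"))
```

The same function with a short timeout shows the other side of the
trade-off: the overbilling shrinks, but so does the longest call a
user is allowed to make.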

If either the core delivery element of the platform or the user agent
is tightly controlled by the operator of the proxy from an
administrative point of view, it is indeed probably possible to rely
on RTP timeouts or SIP Session Timers (SSTs) on one of the endpoints.

That doesn't create a satisfying resolution for those of us dealing
with indeterminate call completion scenarios with a great deal of user 
and vendor diversity, though.  For instance, I route to about 15 ITSPs 
and carriers;  I think maybe one of them does 15-minute SSTs, and the 
rest are certainly not going to turn them on just for me, even if 
their SBCs/switches/things have the capability.  The user endpoints 
are mostly Asterisk and do RTP timeout, of course, and in most cases I 
do get the resulting BYE.  However, this discussion is about the 
minute but nontrivial percentage of cases in which I do not get the 
BYE, whether because of NAT statekeeping problems or network
reachability or whatever underlying causes--in truth, I cannot
accurately characterise these.
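
For reference, the timing implied by an SST can be sketched as below.
This follows the recommendations in RFC 4028 as I read them (refresh
at half the session interval;  the non-refreshing side sends BYE
slightly before expiry, by the smaller of 32 seconds and a third of
the interval).  The function name is my own and nothing here is taken
from any particular implementation.

```python
def session_timer_schedule(session_expires):
    """Return (refresh_at, bye_at), in seconds after the last successful refresh.

    refresh_at: when the refresher should send its re-INVITE/UPDATE.
    bye_at:     when the other side should give up and send BYE if no
                refresh has arrived.
    """
    if session_expires < 90:
        # RFC 4028 sets 90 seconds as the floor for Session-Expires.
        raise ValueError("session interval must be at least 90 seconds")
    refresh_at = session_expires // 2
    bye_at = session_expires - min(32, session_expires // 3)
    return refresh_at, bye_at

# A 15-minute session interval, as with the one carrier mentioned above.
print(session_timer_schedule(900))
```

With a 15-minute interval, a dead call is noticed within roughly 15
minutes of the last successful refresh, which bounds the overbilling
far more tightly than a multi-hour dialog timeout.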

So, it seems to me that from a theoretical point of view, there are
basically two directions someone in this position can go from here:

1) Inline B2BUA in the signaling path of all calls;

1a) Make it do SSTs; or
1b) Make it relay media, too, and hang up the call (bidirectional BYE)
on RTP receive timeout;

2) Couple the proxy to an RTP relay and provide some mechanism by
which the proxy can be made aware, in an asynchronous fashion, that an 
RTP timeout was detected by the relay.
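
Whichever element ends up watching the media (the B2BUA in 1b or the
relay in 2), the detection logic itself is simple.  A minimal sketch,
with a class and method names of my own invention and not modelled on
any existing relay:

```python
import time

class RtpLegWatch:
    """Track when each leg of each call last delivered an RTP packet."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = {}             # (call_id, leg) -> time of last packet

    def start(self, call_id):
        # Start both legs' timers when the media session is set up.
        now = self.clock()
        for leg in ("caller", "callee"):
            self.last_seen[(call_id, leg)] = now

    def packet_received(self, call_id, leg):
        self.last_seen[(call_id, leg)] = self.clock()

    def timed_out_calls(self):
        # A call is considered dead if EITHER leg has been silent too long.
        now = self.clock()
        return sorted({call_id
                       for (call_id, _leg), seen in self.last_seen.items()
                       if now - seen >= self.timeout})
```

One caveat: a call on hold, or an endpoint doing silence suppression,
can legitimately send no RTP for a while, so the timeout has to be
generous enough not to cut those off.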

It seems to me from a brief and informal survey of prior mailing list
literature that #1 is the usually recommended option here.

If #1 is pursued, what is the best tool to use in the
Kamailio/SIP-Router-oriented ecosystem?  My default instinct would say
SEMS;  I really like SEMS, and use it a lot for various related chores.

The problem is that the pre-built modules and examples for SEMS mostly 
center on application-level functionality, while low-level
documentation of its powerful C++ API is a bit impoverished, so this
would take a lot of work.

Needless to say, I am interested in the option that requires the least 
work but still solves the problem in an elegant way from a technical 
and--dare I say--aesthetic perspective.

For instance, it seems clear from looking at the SEMS-1.1.1 sources
that SSTs are supported in principle in core/plug-in/session_timer.
But unless I am missing something, I cannot find anywhere in the
sources or examples where it is actually used.

So, I suppose one option is to figure out how to make this stuff work
in SEMS, and make it work.  But for someone who is not attuned to
the universe of its C++ API, it is a rather formidable chore.  I think
the same would hold true of making it observe bidirectional RTP timeout.

Turning attention to option #2, I have looked at rtpproxy (my
preferred default), iptrtpproxy, and mediaproxy modules but have not
found any evidence that the control protocols Kamailio/SR uses to
engage them support any notion of backward asynchronous feedback in
case of RTP timeout.

It would be really nice if one of these stream control protocols was
augmented to kick back a packet to Kamailio that can be caught in a
special event_route, like event_route[nathelper:rtp-stream-timeout],
but that is clearly not the case today.

To be honest, I would not use MediaProxy even if it had this feature,
because, well, let's be blunt and acknowledge what the more
politically aware presumably already conjecture: in light of AG
Projects' zealous OpenSIPS partnership, it's difficult to muster
confidence in future compatibility of MediaProxy with Kamailio.  The
module is there, it works, and I'm sure its maintainers are dedicated
to doing whatever it takes to reverse engineer and keep it working,
lift patches from OpenSIPS as necessary, etc., but who wants to be on
the wrong side of the project ecosystem fence?  Not I.

That leaves iptrtpproxy, whose 'switchboard' concept I do not fully
comprehend due to lack of experience with it, but which holds a
potentially viable, if slightly kludgy/Rube Goldbergian, answer.  Of
the three RTP proxies, it is the only one that provides a ready means
of exporting a list of media streams it is currently tracking,
together with statistics on how many packets have been received, etc.
It is not inconceivable to cook up an external process that will
frequently check this 'switchboard', as it were, and incite
Kamailio/SR to do dlg_bye() via MI if it appears that the media stream 
has disappeared from either side;  the dialog module helpfully exports 
the MI command dlg_end_dlg.
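
The core of such a watchdog might look something like the sketch
below.  Everything in it is an assumption on my part: the snapshot
format (call ID mapped to a pair of packet counters) is made up rather
than being iptrtpproxy's actual export format, and end_dialog is a
hypothetical callback that would issue dlg_end_dlg over MI with
whatever identifiers that command expects.

```python
def stalled_streams(previous, current):
    """Return call IDs whose packet counter failed to advance on either side.

    Both arguments map call_id -> (packets_from_caller, packets_from_callee),
    taken at two consecutive polls of the relay's stream list.
    """
    stalled = []
    for call_id, (caller_now, callee_now) in current.items():
        if call_id not in previous:
            continue  # first sighting; nothing to compare against yet
        caller_before, callee_before = previous[call_id]
        if caller_now <= caller_before or callee_now <= callee_before:
            stalled.append(call_id)
    return stalled

def watchdog_pass(previous, current, end_dialog):
    """Run one polling pass and return the snapshot to keep for the next one."""
    for call_id in stalled_streams(previous, current):
        end_dialog(call_id)  # hypothetical hook: would trigger dlg_end_dlg via MI
    return current
```

In practice one would probably want several consecutive stalled polls
before hanging up, so that a momentary gap in media does not kill a
live call.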

Still, this does not seem nearly as parsimonious and reliable a
solution as simply building some kind of RTP stream leg timeout
notification into the control socket.  After all, the control socket
is open persistently, right, not on-demand?  The various RTP proxies
all seem to have some kind of dead peer detection internally in order
to have some means of gracefully expiring resources allocated to media 
streams that have gone away, so it would just be a matter of passing a 
control frame up the socket to Kamailio/SR and wiring that to a custom 
event_route or a more static callback in the code.

By the way, I should mention that I am aware of and historically very
sympathetic to the perspective that this kind of call control is alien 
to the nature of a proxy, and an appropriate job for UAs and not 
proxies at all.  However, we all have to make pragmatic concessions to 
the realities of real-world operation, which I assume is the 
motivation for dialog timeouts, dlg_bye(), and other perversions from 
the point of view of a purist.  :-)

I welcome your thoughts and suggestions about the easiest and most
technically meritorious approach.

Thanks,

-- Alex

[1] Enabled via $dlg_ctx(timeout_bye) = 1

-- 
Alex Balashov - Principal
Evariste Systems LLC
1170 Peachtree Street
12th Floor, Suite 1200
Atlanta, GA 30309
Tel: +1-678-954-0670
Fax: +1-404-961-1892
Web: http://www.evaristesys.com/