Hello,
I am looking for some recommendations on how to move forward with asynchronous call processing.
Our adoption and standardisation of async processing (Kamailio >= 4.2) in our core product has been a disaster. I don't mean that to sound accusatory; it's open source, and there's no reason to blame anyone. It's just a matter of fact.
1) We can't use async_task_route()/the standard async task worker approach in the 'async' module because of this problem, which both Olle and I have reported:
http://sr-dev.sip-router.narkive.com/5Sfc5cUU/async-module-cpu-load
Most users run Kamailio inside a VM and the problem shows up for ~50% of them.
2) Using t_suspend -> mqueue -> rtimer -> t_continue(), we continue to see deadlocks and occasional crashes. They are rare, and are most likely to happen in high-throughput, short-duration environments, but when they do happen, they're politically disastrous. We've had to roll back use of this method of async processing for pretty much all customers for whom it was enabled.
Since we last visited this issue earlier this spring, the problem has shifted away from crashes and mostly toward deadlocks. Regardless, our customers' enthusiasm for pausing call processing long enough to attach a debugger and grab a backtrace, or anything like that, is exactly 0.0%. I think most of them have more enthusiasm for firing us as a vendor than for doing any diagnostic work.
I know there's a way to invoke a process in such a way that when it crashes, gdb auto-attaches and pulls a backtrace, then restarts the process. I've written such a wrapper script before in the distant past. I just don't remember how to do it, especially with modern versions of GDB; any suggestions would be appreciated.
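Roughly, the shape of the wrapper I have in mind is something like the sketch below. Everything in it -- the binary path, the foreground flags, where core files end up -- is a placeholder from memory rather than a verified recipe, which is exactly the part I'd appreciate suggestions on:

#!/bin/sh
# Sketch of a crash-catching wrapper; all paths and flags here are assumptions,
# not something verified against current Kamailio/gdb versions.

KAM=/usr/sbin/kamailio
CFG=/etc/kamailio/kamailio.cfg
OUT=/var/log/kamailio-crash
CORES=/var/cores

mkdir -p "$OUT" "$CORES"
ulimit -c unlimited        # let crashed processes leave core files
cd "$CORES"                # where cores should land (subject to kernel.core_pattern)

while true; do
    # keep the main process in the foreground so the wrapper notices when it dies
    "$KAM" -DD -E -f "$CFG"
    echo "$(date): kamailio exited, status $?" >> "$OUT/restarts.log"

    # pull a backtrace out of any fresh core file with gdb in batch mode
    for c in core*; do
        [ -f "$c" ] || continue
        gdb -batch -ex 'bt full' "$KAM" "$c" > "$OUT/bt.$c.$(date +%s).txt" 2>&1
        mv "$c" "$c.done"
    done

    sleep 2    # avoid a tight restart loop if startup itself fails
done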
Otherwise, I don't really know what to do. We need async processing for higher-CPS systems, and would like to standardise upon it in principle, but so far it has, from a strictly functional point of view, been an enormous economic blunder.
I still prefer to be an "early adopter" of such novelties - when useful - in high-volume production systems in order to contribute the testing and feedback back to the project. But I have to strike some realistic balance here and not lose the customers. :-)
Thanks!
-- Alex
Hello,
if you keep coming with reports but don't follow up on the requests for further details, nobody will be able to help.
Not being able to gather basic troubleshooting data is not something that makes you look good as a vendor. Moreover, you should know that we can only help if you help with the troubleshooting. Just a few days ago I replied to a similar email from you, and instead of following up there, you come with another thread.
Also, Olle concluded that he needs to investigate more, since there is no real high CPU load, only high system load -- however, there was no follow-up.
I run many instances of Kamailio in VMs with Debian, with async enabled, and see none of the symptoms you reported.
If you run on CentOS: I noticed high CPU usage from rtpproxy on a test box with no or really low traffic -- rtpproxy (the official package installed from the distro RPM, no custom changes) eats 100% CPU, with the backtrace reporting that it is executing recvfrom(). So the problem might be a combination of VM + CentOS. I investigated the rtpproxy case and couldn't find anything different from what I run on Debian (same version, 1.2.1, from 2013). Some similar reports on the web for other applications suggested kernel upgrades.
Regarding the blocking, I asked whether there is high CPU usage or not, because if not, the blocking can be an I/O operation (database, DNS), not the tm module. In my previous response I pointed you at 'kamctl trap', which you can easily look inside to extract the gdb commands for getting a backtrace from the command line.
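For reference, as far as I remember, the part of 'kamctl trap' you would extract comes down to roughly the following (a sketch from memory -- check the actual script for the exact invocation it uses):

OUT=/tmp/kamailio_gdb_$(date +%Y%m%d%H%M%S).txt
# batch mode attaches, dumps a full backtrace and detaches within a second or two
for pid in $(pidof kamailio); do
    echo "=== backtrace of process $pid ===" >> "$OUT"
    gdb -batch -p "$pid" -ex 'bt full' >> "$OUT" 2>&1
done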
The latest version of the async framework in branch 4.2 is practically the same as in 4.1, as the extra dedicated lock was reverted.
Your way of dealing with these reports, not following up when asked for more details, doesn't encourage developers (at least not me) to assist. It is anything but constructive; it is a waste of time to keep requesting the same details and never get the responses.
Daniel
Daniel,
I hear you, and I don't disagree that reports without follow-up are not useful. The limitations on the information I've provided are a reflection of the constraints I face in trying to get it.
My goal was not to criticise, but to ask if there were any suggestions for technical means of gathering the necessary further details while causing negligible downtime after a crash or deadlock was discovered.
-- Alex
You haven't even confirmed whether it is a real deadlock (as I asked again over the past few days), since you didn't report whether there is full CPU usage or not. Again, blocking can happen for different reasons, a lot of them being I/O (e.g., writing to syslog, DB operations, DNS, ...).
Dumping the output of top to a file, as well as running gdb in batch mode to grab a backtrace to a file, does not cost more than a few seconds -- you can put the commands in the script you use for restarting.
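Something along these lines in the restart script would already be enough. This is only a sketch, with placeholder paths and a placeholder restart command -- adjust for whatever init system the box runs:

STAMP=$(date +%Y%m%d%H%M%S)
DIR=/var/log/kamailio-incident
mkdir -p "$DIR"

# one batch iteration of top, so it is visible whether the CPU is actually pegged
top -b -n 1 > "$DIR/top.$STAMP.txt"

# full backtrace of every kamailio process before anything is killed
for pid in $(pidof kamailio); do
    gdb -batch -p "$pid" -ex 'bt full' > "$DIR/bt.$pid.$STAMP.txt" 2>&1
done

# only then restart
systemctl restart kamailio    # placeholder -- use the restart command you already have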
Daniel
On 06/10/2015 12:37 PM, Daniel-Constantin Mierla wrote:
Dumping the output of top to a file, as well as running gdb in batch mode to grab a backtrace to a file, does not cost more than a few seconds -- you can put the commands in the script you use for restarting.
Agreed. It was this suggestion that I was looking for -- I am not intimately familiar with GDB and did not know about batch mode. I'll also follow up on your suggestion to look at the 'kamctl trap' facility for an example.
I didn't mean for this to turn adversarial, and I'm sorry for any offence I may have caused. I think what we are encountering is the typical tension around exotic, difficult-to-reproduce bugs: from a technical and rationalistic perspective, reports without empirical details are useless, while the human and political realities of getting those details -- profoundly irrational -- intervene. For instance, I now have to figure out how to convince one of the users to allow me to turn asynchronous processing back on so that I can utilise your suggestions, a problem that calls for a vastly different skill set and is at times stubbornly impervious to rational appeal.
-- Alex