So, I have a platform that handles on the order of 2,500 or more call attempts per second. There are two servers built to handle this traffic, "old" and "new". The "old" design is a bit heavy: it makes multiple redis calls out to external IO, does a lot of "work" on each INVITE before passing it along, etc. The "new" design is extremely lightweight: only a couple of htable comparisons and a dispatcher check on each INVITE before passing it along. The two servers are running different versions of Kamailio, but both are in the 5.x train, and I've found nothing in the changelogs that I believe would explain a significant difference in how child process CPU scheduling is performed, especially a detrimental change. I'm more than happy to be proven wrong about that, though!
The "old" system is running 5.4.8: This system is running 4 x 8 core CPUs (32 cores) with no hyperthreading (4 x E5-4627 v2 to be exact) and has multiple external IO interactions with things like redis, lots of "work" being done on each invite checking various things before allowing it to proceed. This is currently running 32 child processes and again, has no issues handling the requests whatsoever and has been running at this level and higher for literal years.
The "new" system is running 5.8.3: This system is running 2 x 8 core CPUs (16 cores) *with* hyperthreading (2 x E5-2667 v4 to be exact) and has no external IO interactions with anything, and only does a couple of checks against some hash table values and dispatcher before allowing it to proceed. This is also currently running 32 child processes, however I am experiencing spikes in context switching and odd memory access issues (e.g. backtraces show things like "<error: Cannot access memory at address 0x14944ec60>" in udp_rvc_loop) with similar loads.
What I'm looking for here is empirical knowledge about things like context switching, and specifically whether hyperthreading is detrimental when doing very high-volume, low-latency work (e.g., no waiting on external replies). Is it possible that hyperthreading and the concepts behind it cause additional problems when the individual demands for CPU resources are so "quick" that they overwhelm the scheduler and cause congestion in the CPU's thread scheduling itself?
Have you run into anything like this? Have you found that very low-latency call processing can incur more hyperthreading overhead than the benefit you get from it? Do you have additional data I may be missing that would help me justify to the C-suite changing the system architecture to provide a higher physical core count? Can you share it?
Thanks!
I don't have a great deal of empirical experience with this. However, proceeding from first principles in a rather general way, hyperthreading probably does more harm than good to more purely CPU-bound workloads, and is more beneficial in I/O-bound workloads where there is more of the kind of waiting on external responses that you ascribe to the "old" system.
The basic idea of hyperthreading, as best as I understand it, is a kind of oversubscription of a physical core across two logical threads. That means all the core's resources are shared (e.g. caches, memory bandwidth) and the OS has to manage more threads on the same physical core, increasing scheduler overhead.
I imagine this means that hyperthreading works best in cases where there are execution or memory access patterns that naturally "yield" to the other thread on the same core a lot, with I/O waiting of course coming to mind first and foremost. The "hyper" part, I guess, comes from the idea that the physical cores aren't sitting idle as much as when a single thread of execution waits on something. If there's not much of that kind of waiting, hyperthreading probably isn't of much help, and may even hurt.
The real unknown is whether a seemingly compute-bound Kamailio config has the kinds of execution patterns that create a lot of natural yielding on the given architecture. It's hard to say, under the hood, whether htable lookups and dispatcher checks leave a lot of these yield "gaps". My guess is they don't; it doesn't seem to me to map onto stuff like complex arithmetic, wildly varying memory allocations, cache misses, branch mispredictions and other things that could lead a thread of execution to stall. However, without forensically combing through the underlying machine code, which I of course cannot do, it's hard to say. There are attempts throughout Kamailio's core code to hint branch prediction to the compiler, e.g. through lots of use of unlikely()[1]:
/* likely/unlikely */
#if __GNUC__ >= 3
#define likely(expr)   __builtin_expect(!!(expr), 1)
#define unlikely(expr) __builtin_expect(!!(expr), 0)
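To illustrate (a toy sketch of my own, not Kamailio code): the !!(expr) normalizes the condition to 0 or 1 for __builtin_expect(), and wrapping a rare branch in unlikely() encourages GCC to lay out the common case as the straight-line fall-through path. Something like:

#include <stdio.h>

#if __GNUC__ >= 3
#define likely(expr)   __builtin_expect(!!(expr), 1)
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#else
#define likely(expr)   (expr)
#define unlikely(expr) (expr)
#endif

/* Hypothetical hot path: the error branch is marked unlikely so the
 * compiler keeps the common case as straight-line code. */
static int handle_packet(const char *buf, int len)
{
    if (unlikely(buf == NULL || len <= 0)) {
        fprintf(stderr, "bad packet\n");
        return -1;
    }
    /* ... normal processing ... */
    return 0;
}

int main(void)
{
    char pkt[] = "INVITE sip:alice@example.com SIP/2.0";
    return handle_packet(pkt, (int)sizeof(pkt));
}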
I don't know how you'd profile the real-world effects of this, though.
My guess is that this is, like most things of the kind, splitting very thin hairs, and doesn't have much effect on your ability to process 2500 CPS one way or another.
-- Alex
[1] https://stackoverflow.com/questions/7346929/what-is-the-advantage-of-gccs-bu...
The counterpoint would be:
Kamailio may not wait a lot for outside I/O, but even a pure in-memory config's code path rakes any incoming request over a bunch of kernel and system call interactions, which are, in effect, waits.
That doesn't have much in common with a truly pure-computational workload.
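For a rough picture of what that looks like, here is a deliberately simplified, hypothetical sketch of a blocking UDP receive loop (my illustration, not Kamailio's actual udp_rcv_loop()): every iteration traps into the kernel via recvfrom() and sleeps until a datagram arrives, which is exactly the kind of stall a sibling hyperthread could, in principle, fill.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5060);

    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return 1;
    }

    char buf[65535];
    for (;;) {
        /* Blocks in the kernel until a packet arrives; while blocked, this
         * process occupies no execution resources on the core. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) {
            perror("recvfrom");
            continue;
        }
        /* ... parse and route the message here ... */
    }
    /* not reached */
}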
-- Alex
To be clear, the "new" system is currently handling ~1000-1500 CPS with no issues at all; it only starts exhibiting these "drops" when we push it with the rest of the traffic that's currently on the "old" system. It runs fine for a while, but when the traffic level gets up to a certain point it starts tossing things like "ERROR: tm [tm.c:1846]: _w_t_relay_to(): t_forward_noack failed" into the system journal, and those memory access errors show up in a trap backtrace.
While I get that there are more "things" going on than one might think, the disparity is kind of what I'm looking at. 8 or 9 redis lookups, plus tons of function calls to evaluate message parameters, plus dispatcher lookups, plus .... that all adds up to a lot more "stuff" being put into the pipe to be worked on, and for the CPU scheduler to handle. The new one, conversely, has drastically fewer of these, so it's going to "churn" a lot more since each child worker is doing less overall.
I'm mainly trying to figure out whether the addition of hyperthreading could be part of the reason we're having so much trouble reaching the higher traffic levels this thing should be able to handle, compared to the old one.
Thanks for the insight!
No expert here, just lots of experience 🤣 What Alex describes is exactly what I've heard, even to the point of people completely disabling hyperthreading.
Regards,
David Villasmil email: david.villasmil.work@gmail.com
Hello,
you have to provide the error log messages as printed by Kamailio in the syslog, because they give details about the place in the code (i.e., the one related to udp_rcv_loop()).
Also, there is usually more than one error message, and the last one might not be the most relevant. For example, the "ERROR: tm [tm.c:1846]: _w_t_relay_to(): t_forward_noack failed" should have at least one other related message printed before it.
You should also give some details about the old and new systems (e.g., the operating system, CPU type, ...), and try to figure out whether some kernel modules are loaded, like conntrack, ..., or whether SELinux is enabled with specific limits...
As a long shot, not having the full error message related to udp_rcv_loop(): that one seems to come from recvfrom(), which is a system/libc socket function, so something at the operating system level might need to be tuned.
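For example (an illustrative guess only, not a specific diagnosis): the kernel silently caps a UDP socket's receive buffer at net.core.rmem_max, and a buffer that is too small for traffic bursts shows up as drops. A small standalone check of what the kernel actually grants might look like this hypothetical sketch:

/* Hypothetical check: ask for a large SO_RCVBUF and print what the kernel
 * actually grants (capped by net.core.rmem_max; Linux reports roughly
 * double the effective value to account for bookkeeping overhead). */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int wanted = 8 * 1024 * 1024;  /* request 8 MB */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &wanted, sizeof(wanted)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0)
        printf("requested %d bytes, kernel granted %d bytes\n", wanted, granted);

    close(fd);
    return 0;
}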
Cheers, Daniel
On May 29, 2025, at 1:41 PM, Brooks Bridges via sr-users sr-users@lists.kamailio.org wrote:
when the traffic level gets up to a certain point, it starts tossing things like "ERROR: tm [tm.c:1846]: _w_t_relay_to(): t_forward_noack failed" into the system journal and those memory access errors show up in a trap backtrace.
I'm sure you've had this thought, too, but I wonder if there is any other contemporaneous/contextual/immediately antecedent logged matter that could give further insight.
Any chance you're just running out of SHM or package memory, or that adding either a lot more or a lot fewer child processes tilts the picture one way or another?
-- Alex
It is widely believed (and rightfully so) that hyperthreading (also known as *SMT*) does more harm than good for ultra-low-latency and/or real-time applications, due to *Execution Jitter*, *Resource Contention* and *Unpredictability of Execution Times*.
Let me elaborate a little on each of these before listing possible solutions.
1. Increased Latency Variability, aka Execution Jitter: In SMT, two logical threads share a single CPU core and compete for resources, leading to *unpredictable execution times*, i.e. *execution jitter*.
2. Cache and Memory Contention - Resource Contention: Both threads share the L1 and L2 caches and memory bandwidth, so if one thread is more memory-intensive (like pattern matching or searching through an htable), it can evict cache lines needed by the other thread, thus increasing *cache misses* and *memory latency*.
3. Pipeline Stalls and Resource Starvation: SMT threads compete for *execution ports*, the *reorder buffer (ROB)* and *load/store queues*. So, if one thread stalls (e.g. due to a branch misprediction or a memory access), the other thread may not fully utilize the core, leading to *suboptimal throughput*. This adds *unpredictable delays* in critical code paths.
4. False Sharing and Core Saturation: If two unrelated threads are scheduled on the same physical core (e.g. a UDP listener thread and an RTimer thread in Kamailio), they may *inadvertently share cache lines*, causing *false sharing* and increasing coherence overhead (see the toy example after this list). In extreme cases, a noisy neighbor (e.g. a background task) can *starve the real-time thread* (e.g. a UDP listener) of CPU resources.
5. Interference with CPU Affinity & Isolation: Many real-time services (e.g. media proxy services, especially those that run kernel-mode code, like RTPEngine) use *CPU pinning (affinity)* to isolate critical threads. Hyperthreading complicates this because two logical CPUs map onto one physical CPU core, making it harder to ensure *exclusive access* to compute resources.
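To make the false-sharing point concrete, here is a toy, self-contained example (entirely my own illustration with made-up names, nothing from Kamailio): two threads increment separate counters that happen to sit in the same cache line, so the line bounces between cores; un-commenting the padding puts the counters on separate lines, and the two variants can be compared with something like time(1) after building with gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct counters {
    volatile unsigned long a;
    /* Uncomment the padding to push 'b' onto its own 64-byte cache line
     * and compare the wall-clock time of the two builds. */
    /* char pad[64]; */
    volatile unsigned long b;
};

static struct counters c;

static void *bump_a(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c.a++;   /* on separate cores, each write typically invalidates the
                  * other core's cached copy of the shared line */
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%lu b=%lu\n", c.a, c.b);
    return 0;
}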
Possible Solutions: Here are a few things you can try:
1. *Disable Hyperthreading* in the BIOS if possible.
2. Use *CPU Isolation* for SIP and media proxy services. If I remember correctly, there is an *isolcpus* kernel command-line option in Linux that can do the job.
3. Prioritize real-time processes with *high CPU scheduling priority* by using *SCHED_FIFO* in Linux (see the sketch below).
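For items 2 and 3, a minimal sketch of the underlying Linux calls might look like the hypothetical program below, which pins the calling process to one logical CPU and requests SCHED_FIFO. The CPU number and priority 50 are arbitrary, this is not how Kamailio configures its own workers, the usual command-line equivalents are taskset(1) and chrt(1), and SCHED_FIFO generally needs root or CAP_SYS_NICE.

/* Illustration only: pin this process to one logical CPU and request a
 * real-time SCHED_FIFO priority. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;   /* logical CPU to pin to */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    struct sched_param sp = { .sched_priority = 50 };   /* range 1..99 */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        /* Typically EPERM without root/CAP_SYS_NICE. */
        perror("sched_setscheduler(SCHED_FIFO)");
    }

    printf("pinned to CPU %d, scheduling policy %d\n", cpu, sched_getscheduler(0));
    /* ... the real-time work would run here ... */
    return 0;
}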
Make sure to benchmark with and without SMT, and share the results with us as a case study.
Thank you.
-- Muhammad Shahzad Shafi Tel: +49 176 99 83 10 85
I'm glad MuhammadGPT made an appearance.
On May 29, 2025, at 3:10 PM, M S via sr-users sr-users@lists.kamailio.org wrote:
It is widely believe (and rightfully so) that Hyperthreading (also known as SMT) does more harm then good for ultra-low-latency and/or real-time applications due to Execution Jitter, Resource Contention and Unpredictability of Execution Times.
Let me elaborate a little on each of these before listing possible solutions.
- Increased Latency Variability aka Execution Jitters:
In SMT, two logical threads share a single CPU core and compete for resources, leading to unpredictable execution time that leads to Execution Jitter.
- Cache and Memory Contention - Resource Contention:
Since both threads share the L1& L2 caches and memory bandwidth. If one thread is more memory-intensive (like pattern matching or search through HTable etc.) then it can evict cache lines needed by the other thread, thus increasing cache misses and memory latency.
- Pipeline Stalls and Resource Starvation:
SMT threads compete for execute ports, reorder buffers (ROB) and load/store queues. So, if one thread stalls (e.g. due to a branch misprediction or memory access), the other thread may not fully utilize the core, leading to suboptimal throughput. This adds unpredictable delays in critical code paths.
- False Sharing and Core Saturation:
If two unrelated threads are scheduled on the same physical core (e.g. a UDP listener thread and a RTimer thread in Kamailio), they may inadvertently share cache lines, causing false sharing and increasing coherence overhead. In extreme cases, a noisy neighbor (e.g. a background task) can starve the real-time thread (e.g. a UDP listener) of CPU resources.
- Interference with CPU Affinity & Isolation:
Many real-time services (e.g. media proxy services, especially that use Kernel mode code execution like RTPEngine) use CPU pinning (affinity) to isolate critical threads. Hyperthreading complicates this because two logical CPUs are mapped to one physical CPU core, making it harder to ensure exclusive access to compute resources.
Possible Solution: Here are a few possible solutions that you can try,
- Disable Hyperthreading in BIOS if possible.
- Use CPU Isolation for SIP and Media proxy services. If I remember correctly there is isolcpus command in Linux that can do the job.
- Prioritize real-time processes with high CPU scheduling priority by using SCHED_FIFO in Linux.
Make sure to Benchmark with and without SMT and share the results with us as a case study.
Thank you.
-- Muhammad Shahzad Shafi Tel: +49 176 99 83 10 85
On Thu, May 29, 2025 at 8:05 PM Brooks Bridges via sr-users sr-users@lists.kamailio.org wrote: To be clear, the "new" system is currently handling ~1000-1500 cps with no issues at all, it only starts exhibiting these "drops" when we push it with the rest of the traffic that's currently on the "old" system. It runs fine for a while, but then when the traffic level gets up to a certain point, it starts tossing things like "ERROR: tm [tm.c:1846]: _w_t_relay_to(): t_forward_noack failed" into the system journal and those memory access errors show up in a trap backtrace.
While I get that there are more "things" than one might think, the disparity is kind of what I'm looking at. 8 or 9 redis lookups, plus tons of function calls to evaluate message parameters, plus dispatcher lookups, plus .... that all equals a lot more "stuff" that gets put into the pipe to be worked on, and for the cpu scheduler to handle. Alternately, the new one will have drastically fewer of these, so it's going to "churn" a lot more since each child worker is doing less overall.
I'm mainly trying to figure out if the addition of the hyperthreading could be part of the reason we're having so much trouble reaching the higher levels of traffic that this thing should be able to handle as opposed to the old one.
Thanks for the insight!
On 5/29/2025 10:10 AM, Alex Balashov via sr-users wrote:
The counterpoint would be:
Kamailio may not wait a lot for outside I/O, but even a pure in-memory config's code path largely rakes any incoming request over a bunch of kernel and system call interactions, which are in effect wait.
That doesn't have much in common with a truly pure-computational workload.
-- Alex
On May 29, 2025, at 1:06 PM, Alex Balashov abalashov@evaristesys.com wrote:
I don't have a great deal of empirical experience with this. However, proceeding from first principles in a rather general way, hyperthreading probably does more harm than good to more pure CPU-bound workloads, and is more beneficial in I/O-bound workloads where there is more of the kind of waiting on external responses that you ascribe to the "old" system.
The basic idea of hyperthreading, as best as I understand it, is a kind of oversubscription of a physical core across two logical threads. That means all the core's resources are shared (e.g. caches, memory bandwidth) and the OS has to manage more threads on the same physical core, increasing scheduler overhead.
I imagine this means that hyperthreading works best in cases where there are execution or memory access patterns that naturally "yield" to the other thread on the same core a lot, with I/O waiting of course coming to mind first and foremost. The "hyper" part I guess comes from the idea that the physical cores aren't sitting idle as much as when a single thread of execution waits on something. If there's not much of that kind of waiting, HyperThreading probably isn't of much help, and may even hurt.
The real unknown is whether a seemingly compute-bound Kamailio config has the kinds of execution patterns that create a lot of natural yielding on the given architecture. It's hard to say, under the hood, if htable lookups and dispatcher leaves a lot of these yield "gaps". My guess is they don't; it doesn't seem to me to map onto stuff like complex arithmetic, wildly varying memory allocations, cache misses and branch mispredictions and other things that could lead a thread of execution to stall. However, without forensically combing through the underlying machine code--which I of course cannot do--it's so hard to say. There are attempts throughout Kamailio's core code to modulate branch prediction to the compiler, e.g. through lots of use of unlikely()[1]:
/* likely/unlikely */ #if __GNUC__ >= 3
#define likely(expr) __builtin_expect(!!(expr), 1) #define unlikely(expr) __builtin_expect(!!(expr), 0)
I don't know how you'd profile the real-world effects of this, though.
My guess is that this is, like most things of the kind, splitting very thin hairs, and doesn't have much effect on your ability to process 2500 CPS one way or another.
-- Alex
[1] https://stackoverflow.com/questions/7346929/what-is-the-advantage-of-gccs-bu...
--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
Unfortunately, I can't afford *over-priced* and *expensive shit* like ChatGPT, so I have to resort to more *conventional* and *old-school* means of finding solutions to the technical problems in my field, like reading through the *Linux Kernel Docs* and having experience in building *Real-Time VoIP Apps and Services* for the last 20+ years on hardcore Linux distributions like *Gentoo* and *Arch Linux*. But sure, if it gives you happiness and peace of mind to tag my response as MuhammadGPT, then be my guest.
Regards.
-- Muhammad Shahzad Shafi Tel: +49 176 99 83 10 85
On Thu, May 29, 2025 at 9:59 PM Alex Balashov via sr-users <sr-users@lists.kamailio.org> wrote:
I'm glad MuhammadGPT made an appearance.
On May 29, 2025, at 3:10 PM, M S via sr-users <sr-users@lists.kamailio.org> wrote:
It is widely believed (and rightfully so) that Hyperthreading (also known as SMT) does more harm than good for ultra-low-latency and/or real-time applications, due to Execution Jitter, Resource Contention and Unpredictability of Execution Times.
Let me elaborate a little on each of these before listing possible solutions.
- Increased Latency Variability aka Execution Jitter:
In SMT, two logical threads share a single CPU core and compete for its resources, leading to unpredictable execution times, i.e. execution jitter.
- Cache and Memory Contention - Resource Contention:
Both threads share the L1 and L2 caches and memory bandwidth. If one thread is more memory-intensive (like pattern matching or searching through an htable), it can evict cache lines needed by the other thread, increasing cache misses and memory latency.
- Pipeline Stalls and Resource Starvation:
SMT threads compete for execution ports, reorder buffers (ROB) and load/store queues. So, if one thread stalls (e.g. due to a branch misprediction or memory access), the other thread may not fully utilize the core, leading to suboptimal throughput. This adds unpredictable delays in critical code paths.
- False Sharing and Core Saturation:
If two unrelated threads are scheduled on the same physical core (e.g. a UDP listener thread and an RTimer thread in Kamailio), they may inadvertently share cache lines, causing false sharing and increasing coherence overhead (a small sketch of this follows the list below). In extreme cases, a noisy neighbor (e.g. a background task) can starve the real-time thread (e.g. a UDP listener) of CPU resources.
- Interference with CPU Affinity & Isolation:
Many real-time services (e.g. media proxy services, especially those that use kernel-mode code execution, like RTPEngine) use CPU pinning (affinity) to isolate critical threads. Hyperthreading complicates this because two logical CPUs are mapped to one physical CPU core, making it harder to ensure exclusive access to compute resources.
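To make the false-sharing point concrete, here is a minimal, generic sketch (plain C with pthreads, nothing Kamailio-specific): two counters that land in the same 64-byte cache line force that line to bounce between cores or SMT siblings, while aligning each counter to its own line avoids it. Compile with -pthread and compare runtimes with the alignment attributes removed.

#include <pthread.h>
#include <stdio.h>

/* Each counter gets its own 64-byte cache line. Drop the aligned(64)
 * attributes and both counters share one line, so the two threads keep
 * invalidating each other's cached copy (false sharing). */
static struct {
    unsigned long a __attribute__((aligned(64)));
    unsigned long b __attribute__((aligned(64)));
} counters;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 200000000L; i++) counters.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 200000000L; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%lu %lu\n", counters.a, counters.b);
    return 0;
}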
Possible Solutions: Here are a few possible solutions that you can try:
- Disable Hyperthreading in the BIOS if possible.
- Use CPU isolation for SIP and media proxy services. If I remember correctly, the isolcpus kernel boot parameter in Linux can do the job.
- Prioritize real-time processes with a high CPU scheduling priority, e.g. by using SCHED_FIFO in Linux (a sketch of pinning plus SCHED_FIFO follows this list).
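As a rough illustration of the last two points, here is a minimal generic sketch of pinning a process to one core and giving it FIFO real-time priority (plain Linux C, not something Kamailio does for you; the core number and priority are arbitrary, and SCHED_FIFO needs root/CAP_SYS_NICE and can starve the rest of the box if the task never yields):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Pin the calling process to (arbitrary) CPU 2, ideally one that was
     * also removed from the general scheduler with isolcpus=2 at boot. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Give it real-time FIFO priority. */
    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 50;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... latency-sensitive work happens here ... */
    return 0;
}

For an already-running process, the same thing can be done from the outside with taskset and chrt.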
Make sure to benchmark with and without SMT and share the results with us as a case study.
Thank you.
-- Muhammad Shahzad Shafi Tel: +49 176 99 83 10 85
On Thu, May 29, 2025 at 8:05 PM Brooks Bridges via sr-users <sr-users@lists.kamailio.org> wrote:
To be clear, the "new" system is currently handling ~1000-1500 cps with no issues at all, it only starts exhibiting these "drops" when we push it with the rest of the traffic that's currently on the "old" system. It runs fine for a while, but then when the traffic level gets up to a certain point, it starts tossing things like "ERROR: tm [tm.c:1846]: _w_t_relay_to(): t_forward_noack failed" into the system journal and those memory access errors show up in a trap backtrace.
While I get that there are more "things" than one might think, the disparity is kind of what I'm looking at. 8 or 9 redis lookups, plus tons of function calls to evaluate message parameters, plus dispatcher lookups, plus ... that all equals a lot more "stuff" that gets put into the pipe to be worked on, and for the CPU scheduler to handle. Conversely, the new one will have drastically fewer of these, so it's going to "churn" a lot more since each child worker is doing less overall.
I'm mainly trying to figure out if the addition of the hyperthreading could be part of the reason we're having so much trouble reaching the higher levels of traffic that this thing should be able to handle as opposed to the old one.
Thanks for the insight!
On 5/29/2025 10:10 AM, Alex Balashov via sr-users wrote:
The counterpoint would be:
Kamailio may not wait a lot for outside I/O, but even a pure in-memory config's code path largely rakes any incoming request over a bunch of kernel and system call interactions, which are, in effect, waits.
That doesn't have much in common with a truly pure-computational workload.
-- Alex
I don't doubt your technical depth. However, the text was preponderantly LLM-generated in a fairly obvious way.
It's probably worth asking whether you've tried the "old" config on the "new" system? Given your descriptions it sounds like there's a large difference between the two configs, and it's probably good to isolate against that - alternatively, you could try the "new" config on the old system as well. I don't want to jump to "it's a config issue", but consider the fact that your new configuration will use a lot more shared memory if you're now storing data in htable instead of redis. Your new host might have tons of memory, but if you're not allocating that memory to Kamailio, it won't really matter, etc.
Regards, Kaufman
Ok guys, while I appreciate that the default response is to try to help, I'm not asking for troubleshooting advice at this time, so I'd like to stay focused on the topic at hand if we can.
I'm asking specifically about hyperthreading experience with Kamailio, particularly with very low external IO workloads and high packet throughput, and I want to put that "to bed", as it were, before I move on to other things.
@Alex Balashov I think my writing style must have caused the confusion. I specifically use that style to write long docs, emails and technical manuals, so that readers can quickly read through them.
@Brooks Bridges brooks@firestormnetworks.net Personally, I have never had a problem with Hyperthreading in a production environment, but that's probably because of our custom-built Linux kernel that is optimized for our needs; from CPU scheduling to network congestion control, everything is carefully enabled or disabled at compile time.
Apart from Hyperthreading, there is one potential issue in Kamailio v5.8.x that we haven't quite figured out yet. We have an extensive test suite that we developed in-house for testing our VoIP services, and we have noticed that with Kamailio v5.8 some of our load tests fail with retransmission timeouts, while the same tests work fine with Kamailio v5.6. We are still researching it and will post a detailed problem statement to the Kamailio mailing list, probably by next week. So, the behavior you are observing may be caused by something other than SMT.
Kind regards.
-- Muhammad Shahzad Shafi Tel: +49 176 99 83 10 85
On May 29, 2025, at 7:08 PM, M S via sr-users sr-users@lists.kamailio.org wrote:
@Alex Balashov I think my writing style must have caused the confusion. I specifically use that style to write long docs, emails and technical manuals, so that readers can quickly read through them.
Perhaps so.
@Brooks Bridges Personally, I have never had a problem with Hyperthreading in a production environment, but that's probably because of our custom-built Linux kernel that is optimized for our needs; from CPU scheduling to network congestion control, everything is carefully enabled or disabled at compile time.
It's worth asking, at the risk of offering a troubleshooting suggestion and not an a priori insight:
Why not just disable HyperThreading and see if performance improves?
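For what it's worth, on a reasonably recent kernel (4.19+) you don't even need a BIOS visit to test this: SMT can be toggled via sysfs. A minimal sketch, assuming those sysfs paths exist on your distribution (the same effect is available via the nosmt boot parameter or the BIOS):

#include <stdio.h>

int main(void)
{
    char state[32] = {0};

    /* Report whether SMT is currently active. */
    FILE *f = fopen("/sys/devices/system/cpu/smt/active", "r");
    if (f) {
        if (fgets(state, sizeof(state), f))
            printf("SMT active: %s", state);
        fclose(f);
    }

    /* Turn it off at runtime (needs root); write "on" to re-enable. */
    f = fopen("/sys/devices/system/cpu/smt/control", "w");
    if (f) {
        fputs("off", f);
        fclose(f);
    } else {
        perror("smt/control");
    }
    return 0;
}

Or just echo "off" into that file from a root shell, rerun the same load test, and compare.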
-- Alex