Unfortunately, I can't afford *over-priced* and *expensive shit* like ChatGPT, so I have to resort to more *conventional* and *old-school* means of finding solutions to the technical problems in my field, like reading through the *Linux Kernel Docs* and drawing on 20+ years of experience building *Real-Time VoIP Apps and Services* on hardcore Linux distributions like *Gentoo* and *Arch Linux*. But sure, if it gives you happiness and peace of mind to tag my response as MuhammadGPT, then be my guest.
Regards.
--
Muhammad Shahzad Shafi
Tel: +49 176 99 83 10 85
On Thu, May 29, 2025 at 9:59 PM Alex Balashov via sr-users <sr-users@lists.kamailio.org> wrote:
I'm glad MuhammadGPT made an appearance.
On May 29, 2025, at 3:10 PM, M S via sr-users <sr-users@lists.kamailio.org> wrote:
It is widely believed (and rightly so) that Hyperthreading (also known as SMT) does more harm than good for ultra-low-latency and/or real-time applications, due to Execution Jitter, Resource Contention and Unpredictability of Execution Times.
Let me elaborate a little on each of these before listing possible
solutions.
- Increased Latency Variability, aka Execution Jitter:
In SMT, two logical threads share a single CPU core and compete for its resources, leading to unpredictable execution times, i.e. Execution Jitter.
- Cache and Memory Contention - Resource Contention:
Both threads share the L1 and L2 caches and memory bandwidth, so if one thread is memory-intensive (like pattern matching or searching through an htable), it can evict cache lines needed by the other thread, increasing cache misses and memory latency.
- Pipeline Stalls and Resource Starvation:
SMT threads compete for execution ports, reorder buffers (ROB) and load/store queues. So, if one thread stalls (e.g. due to a branch misprediction or memory access), the other thread may not fully utilize the core, leading to suboptimal throughput. This adds unpredictable delays in critical code paths.
- False Sharing and Core Saturation:
If two unrelated threads are scheduled on the same physical core (e.g. a UDP listener thread and an rtimer thread in Kamailio), they may inadvertently share cache lines, causing false sharing and increasing coherence overhead (a small illustration follows this list). In extreme cases, a noisy neighbor (e.g. a background task) can starve the real-time thread (e.g. a UDP listener) of CPU resources.
- Interference with CPU Affinity & Isolation:
Many real-time services (e.g. media proxy services, especially those that use kernel-mode packet processing, like RTPEngine) use CPU pinning (affinity) to isolate critical threads. Hyperthreading complicates this because two logical CPUs are mapped to one physical CPU core, making it harder to ensure exclusive access to compute resources.
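To make the false-sharing point a bit more concrete, here is a minimal, hypothetical C sketch. It is not Kamailio code; the struct, thread functions and iteration count are made up for illustration. The two counters happen to share one 64-byte cache line, so every write by one thread invalidates that line in the other core's (or SMT sibling's) cache; padding them apart removes the effect.

/* Hypothetical false-sharing demo (not Kamailio source).
 * Build: gcc -O2 -pthread false_sharing.c
 * Time the run as-is, then again with the padding enabled, to see the
 * cache-line ping-pong cost. Fields are volatile so the compiler cannot
 * collapse the loops into a single addition. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 100000000UL

struct shared_counters {
    volatile unsigned long a;   /* written only by thread 1 */
    /* char pad[64]; */         /* uncomment to push b onto its own line */
    volatile unsigned long b;   /* written only by thread 2, same line  */
};

static struct shared_counters counters;

static void *bump_a(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        counters.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        counters.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("a=%lu b=%lu\n", counters.a, counters.b);
    return 0;
}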
Possible solutions: here are a few things you can try (a rough sketch of the last two follows this list).
- Disable Hyperthreading in BIOS if possible.
- Use CPU isolation for SIP and media proxy services. If I remember correctly, there is an isolcpus kernel boot parameter in Linux that can do the job.
- Prioritize real-time processes with high CPU scheduling priority by
using SCHED_FIFO in Linux.
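As a rough sketch of the last two items (the isolated CPU number, the priority value and the isolcpus=2 boot setting below are illustrative assumptions, not recommendations), a process can pin itself to a reserved core and switch to SCHED_FIFO roughly like this:

/* Minimal sketch: pin the calling process to one isolated core and give it
 * a real-time priority. Assumes the core was reserved at boot, e.g. with
 * the kernel parameter isolcpus=2. SMT itself can also be turned off at
 * runtime by writing "off" to /sys/devices/system/cpu/smt/control, or in
 * the BIOS. Needs root or CAP_SYS_NICE for sched_setscheduler(). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t set;
    struct sched_param sp;

    CPU_ZERO(&set);
    CPU_SET(2, &set);                   /* hypothetical isolated core */

    /* Bind this process to CPU 2 only. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Switch to the SCHED_FIFO real-time scheduling class. */
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 50;             /* mid-range RT priority, tune as needed */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    printf("pinned to CPU 2, running under SCHED_FIFO prio %d\n",
           sp.sched_priority);
    return 0;
}

In practice the same thing is usually done from the outside with taskset and chrt on the already-running worker PIDs, which is more convenient for Kamailio's child processes.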
Make sure to benchmark with and without SMT, and share the results with us as a case study.
Thank you.
--
Muhammad Shahzad Shafi
Tel: +49 176 99 83 10 85
On Thu, May 29, 2025 at 8:05 PM Brooks Bridges via sr-users <sr-users@lists.kamailio.org> wrote:
To be clear, the "new" system is currently handling ~1000-1500 cps with no issues at all, it only starts exhibiting these "drops" when we push it with the rest of the traffic that's currently on the "old" system. It runs fine for a while, but then when the traffic level gets up to a certain point, it starts tossing things like "ERROR: tm [tm.c:1846]: _w_t_relay_to(): t_forward_noack failed" into the system journal and those memory access errors show up in a trap backtrace.
While I get that there are more "things" than one might think, the disparity is kind of what I'm looking at. 8 or 9 redis lookups, plus tons of function calls to evaluate message parameters, plus dispatcher lookups, plus .... that all equals a lot more "stuff" that gets put into the pipe to be worked on, and for the CPU scheduler to handle. Conversely, the new one will have drastically fewer of these, so it's going to "churn" a lot more since each child worker is doing less overall.
I'm mainly trying to figure out if the addition of the hyperthreading could be part of the reason we're having so much trouble reaching the higher levels of traffic that this thing should be able to handle as opposed to the old one.
Thanks for the insight!
On 5/29/2025 10:10 AM, Alex Balashov via sr-users wrote:
The counterpoint would be:
Kamailio may not wait a lot for outside I/O, but even a pure in-memory config's code path largely rakes any incoming request over a bunch of kernel and system call interactions, which are, in effect, waiting.
That doesn't have much in common with a truly pure-computational
workload.
-- Alex
On May 29, 2025, at 1:06 PM, Alex Balashov <abalashov@evaristesys.com> wrote:
I don't have a great deal of empirical experience with this. However,
proceeding from first principles in a rather general way, hyperthreading probably does more harm than good to more pure CPU-bound workloads, and is more beneficial in I/O-bound workloads where there is more of the kind of waiting on external responses that you ascribe to the "old" system.
The basic idea of hyperthreading, as best as I understand it, is a
kind of oversubscription of a physical core across two logical threads. That means all the core's resources are shared (e.g. caches, memory bandwidth) and the OS has to manage more threads on the same physical core, increasing scheduler overhead.
I imagine this means that hyperthreading works best in cases where
there are execution or memory access patterns that naturally "yield" to the other thread on the same core a lot, with I/O waiting of course coming to mind first and foremost. The "hyper" part I guess comes from the idea that the physical cores aren't sitting idle as much as when a single thread of execution waits on something. If there's not much of that kind of waiting, HyperThreading probably isn't of much help, and may even hurt.
The real unknown is whether a seemingly compute-bound Kamailio config has the kinds of execution patterns that create a lot of natural yielding on the given architecture. It's hard to say, under the hood, whether htable lookups and dispatcher checks leave a lot of these yield "gaps". My guess is they don't; they don't seem to me to map onto stuff like complex arithmetic, wildly varying memory allocations, cache misses, branch mispredictions and other things that could lead a thread of execution to stall. However, without forensically combing through the underlying machine code--which I of course cannot do--it's hard to say. There are attempts throughout Kamailio's core code to communicate branch predictions to the compiler, e.g. through lots of use of unlikely()[1]:
/* likely/unlikely */
#if __GNUC__ >= 3

#define likely(expr) __builtin_expect(!!(expr), 1)
#define unlikely(expr) __builtin_expect(!!(expr), 0)
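For illustration only (the function below is made up, not taken from Kamailio), a typical use looks like this: the rare error branch is annotated so the compiler can lay out the common path as straight-line code.

/* Hypothetical example of using the macros above; not Kamailio source. */
#include <stddef.h>

#define likely(expr) __builtin_expect(!!(expr), 1)
#define unlikely(expr) __builtin_expect(!!(expr), 0)

int handle_packet(const char *buf, size_t len)
{
    /* The error branch is expected to be rare, so hint the compiler to
     * optimize code layout for the non-error path. */
    if (unlikely(buf == NULL || len == 0))
        return -1;

    /* ... normal, "hot" processing continues here ... */
    return 0;
}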
I don't know how you'd profile the real-world effects of this, though.
My guess is that this is, like most things of the kind, splitting
very thin hairs, and doesn't have much effect on your ability to process 2500 CPS one way or another.
-- Alex
[1]
https://stackoverflow.com/questions/7346929/what-is-the-advantage-of-gccs-bu...
On May 29, 2025, at 11:57 AM, Brooks Bridges via sr-users <sr-users@lists.kamailio.org> wrote:
So, I have a platform that handles on the order of 2500 or more call attempts per second. There are two servers built to handle this traffic, "old" and "new". The "old" design is a bit heavy, has multiple redis calls to external IO, does a lot of "work" on each invite before passing it along, etc. The "new" design is extremely lightweight, only a couple of htable comparisons and a dispatcher check on each invite before passing it along. The two servers are running different versions of Kamailio; however, they're both in the 5.x train and I've found nothing in the changelogs that I believe would explain a significant difference in how child process CPU scheduling is performed, especially a detrimental change. I'm more than happy to be proven wrong about that though!
The "old" system is running 5.4.8: This system is running 4 x 8 core CPUs (32 cores) with no
hyperthreading (4 x E5-4627 v2 to be exact) and has multiple external IO interactions with things like redis, lots of "work" being done on each invite checking various things before allowing it to proceed. This is currently running 32 child processes and again, has no issues handling the requests whatsoever and has been running at this level and higher for literal years.
The "new" system is running 5.8.3: This system is running 2 x 8 core CPUs (16 cores) *with*
hyperthreading (2 x E5-2667 v4 to be exact) and has no external IO interactions with anything, and only does a couple of checks against some hash table values and dispatcher before allowing it to proceed. This is also currently running 32 child processes; however, I am experiencing spikes in context switching and odd memory access issues (e.g. backtraces show things like "<error: Cannot access memory at address 0x14944ec60>" in udp_rcv_loop) with similar loads.
What I'm looking for here is empirical knowledge about things
specifically like context switching and if hyperthreading is detrimental in cases where I'm doing very high volume, low latency (e.g. no waiting on external replies, etc) work. Is it possible there are additional issues with the way hyperthreading works and the concepts behind it if the individual calls for CPU resources are so "quick" as to overwhelm the scheduler and cause congestion in the CPU's thread scheduler itself?
Have you run into anything like this? Have you discovered that very low latency call processing can incur
more hyperthreading overhead than the benefit you get from it?
Do you have additional data I may be missing to help me justify to
the C suite to change out the system architecture to provide a higher physical core count?
Can you share it?
Thanks!
--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
--
Alex Balashov
Principal Consultant
Evariste Systems LLC
Web: https://evaristesys.com
Tel: +1-706-510-6800
Kamailio - Users Mailing List - Non Commercial Discussions -- sr-users@lists.kamailio.org
To unsubscribe send an email to sr-users-leave@lists.kamailio.org
Important: keep the mailing list in the recipients, do not reply only to the sender!