So, I have a platform that handles on the order of 2500+ call attempts per second. Two servers were built to handle this traffic, "old" and "new". The "old" design is fairly heavy: multiple external IO calls to redis, a lot of "work" done on each INVITE before passing it along, etc. The "new" design is extremely lightweight: just a couple of htable comparisons and a dispatcher check on each INVITE before passing it along. The two servers run different versions of Kamailio, but both are in the 5.x train, and I've found nothing in the changelogs that I believe would explain a significant (let alone detrimental) difference in how child process CPU scheduling behaves. I'm more than happy to be proven wrong about that, though!
The "old" system is running 5.4.8: This system is running 4 x 8 core CPUs (32 cores) with no hyperthreading (4 x E5-4627 v2 to be exact) and has multiple external IO interactions with things like redis, lots of "work" being done on each invite checking various things before allowing it to proceed. This is currently running 32 child processes and again, has no issues handling the requests whatsoever and has been running at this level and higher for literal years.
The "new" system is running 5.8.3: This system is running 2 x 8 core CPUs (16 cores) *with* hyperthreading (2 x E5-2667 v4 to be exact) and has no external IO interactions with anything, and only does a couple of checks against some hash table values and dispatcher before allowing it to proceed. This is also currently running 32 child processes, however I am experiencing spikes in context switching and odd memory access issues (e.g. backtraces show things like "<error: Cannot access memory at address 0x14944ec60>" in udp_rvc_loop) with similar loads.
What I'm looking for here is empirical knowledge, specifically about context switching and whether hyperthreading is detrimental for very high volume, low latency work (i.e. no waiting on external replies, etc.). Is it possible that hyperthreading, and the concepts behind it, introduce additional problems when the individual demands for CPU resources are so "quick" that they overwhelm the scheduler, with sibling threads contending for the same physical core and causing congestion in the CPU's own thread scheduling?
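To make that concrete, this is the kind of topology check I have in mind: a sketch (standard Linux sysfs paths, nothing Kamailio-specific) that groups logical CPUs by physical core, so it's obvious which pairs of the 32 hardware threads are actually sharing one core's execution units:

```python
#!/usr/bin/env python3
"""Group logical CPUs by physical core using sysfs thread_siblings_list."""
import glob

core_map = {}  # siblings string (e.g. "0,16") -> list of logical CPU ids
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    cpu_id = int(path.split("/")[5][3:])   # ".../cpu7/..." -> 7
    with open(path) as f:
        core_map.setdefault(f.read().strip(), []).append(cpu_id)

for siblings, cpus in sorted(core_map.items(), key=lambda kv: min(kv[1])):
    label = "SMT pair" if len(cpus) > 1 else "single thread"
    print(f"{label}: logical CPUs {sorted(cpus)} map to one physical core")
```

If the pairs come out the way I expect on this box (e.g. 0/16, 1/17, ...), one experiment would be to drop to 16 children, or pin the children to one logical CPU per physical core with taskset, and compare the context switch numbers; that's part of what I'm hoping someone here has already tried.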
Have you run into anything like this? Have you found that very low latency call processing can incur more overhead from hyperthreading than benefit? Do you have data I may be missing that would help me justify to the C-suite changing the system architecture to one with a higher physical core count? If so, can you share it?
Thanks!