You are right, this limit does not exist: you can run Kamailio with only one UDP worker if you want. All the threads are started from a dedicated forked process which, to be precise, is not even a worker process.
The reason to create all the threads from one specific process, and to centralize the interactions with the threads in that process, is that we are then not forced to use shared memory. Shared memory can be problematic if there is a bug or limitation in the malloc wrapper/abstraction of one of the libraries.
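To illustrate the idea, here is a minimal sketch of how worker processes could hand commands to the process that owns the media threads without touching shared memory. It assumes a plain POSIX pipe as the transport, which is only one possible choice and not necessarily what the module does. The names media_cmd_t, media_cmd_init, media_cmd_send and media_cmd_recv are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical command types, for illustration only. */
typedef enum { MEDIA_CMD_START = 1, MEDIA_CMD_STOP = 2 } media_cmd_type_t;

/* Hypothetical fixed-size command sent from a worker to the media process.
 * It holds no pointers, so it can be copied between processes as-is. */
typedef struct {
    int type;         /* one of media_cmd_type_t */
    int local_port;   /* local RTP port for the stream */
    char call_id[64]; /* NUL-terminated call identifier */
} media_cmd_t;

/* 512 is the minimum PIPE_BUF that POSIX guarantees. A write no larger than
 * PIPE_BUF is atomic, so commands from several workers never interleave. */
_Static_assert(sizeof(media_cmd_t) <= 512, "command must fit in PIPE_BUF");

/* Fill a command; an over-long call_id is truncated to fit. */
void media_cmd_init(media_cmd_t *cmd, media_cmd_type_t type, int local_port,
                    const char *call_id) {
    assert(cmd != NULL && call_id != NULL);
    memset(cmd, 0, sizeof(*cmd));
    cmd->type = (int)type;
    cmd->local_port = local_port;
    strncpy(cmd->call_id, call_id, sizeof(cmd->call_id) - 1);
}

/* Worker side: send one command. Returns 0 on success, -1 on error. */
int media_cmd_send(int fd, const media_cmd_t *cmd) {
    ssize_t n;
    do {
        n = write(fd, cmd, sizeof(*cmd));
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof(*cmd) ? 0 : -1;
}

/* Media process side: read exactly one command.
 * Returns 0 on success, -1 on error or when the pipe is closed. */
int media_cmd_recv(int fd, media_cmd_t *cmd) {
    size_t got = 0;
    while (got < sizeof(*cmd)) {
        ssize_t n = read(fd, (char *)cmd + got, sizeof(*cmd) - got);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return 0;
}
```

In such a design the pipe would be created before forking. The forked media process keeps the read end, loops on media_cmd_recv() and starts or stops its threads accordingly, while the workers only ever call media_cmd_send() on the write end.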
I do not know where the bottlenecks are, since I have not done any load tests yet.
However, I did a lot of tests and some profiling on the filters in the past. One very nice thing about MediaStreamer2 is that you can see how much CPU time is spent in each filter.
One example of a bottleneck that can be surprising is the speex resampler, which can be CPU intensive in some scenarios.
Some encoders, like Opus, can be CPU intensive as well.
If one filter is using too much CPU, it will be easy to find by looking at the logs.
At some point we could use the ticker's filter stats to create overall performance reports showing in which filters the CPU time was spent.
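As a rough sketch of what such a report could look like, the code below sorts per-filter timings by total CPU time and prints each filter's share. The filter_stat_t struct and the filter_stats_* functions are hypothetical stand-ins for this example. They are not the MediaStreamer2 API, and the real numbers would have to be copied from the library's own per-filter statistics.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-filter sample, for illustration only. */
typedef struct {
    const char *name;    /* filter name */
    uint64_t elapsed_ns; /* total time spent in the filter's process step */
    unsigned int count;  /* number of times the filter was run */
} filter_stat_t;

/* qsort comparator: largest elapsed time first. */
static int by_elapsed_desc(const void *a, const void *b) {
    const filter_stat_t *x = a;
    const filter_stat_t *y = b;
    if (x->elapsed_ns < y->elapsed_ns) return 1;
    if (x->elapsed_ns > y->elapsed_ns) return -1;
    return 0;
}

/* Sum of the elapsed time over all filters. */
uint64_t filter_stats_total(const filter_stat_t *s, size_t n) {
    uint64_t total = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += s[i].elapsed_ns;
    return total;
}

/* Share of one filter in the total, in percent (0 if total is 0). */
double filter_stats_share(const filter_stat_t *s, uint64_t total) {
    return total ? 100.0 * (double)s->elapsed_ns / (double)total : 0.0;
}

/* Sort the array in place, most expensive filter first, and print
 * one line per filter: name, total time, call count, share. */
void filter_stats_report(filter_stat_t *s, size_t n, FILE *out) {
    uint64_t total = filter_stats_total(s, n);
    if (n > 0)
        qsort(s, n, sizeof(*s), by_elapsed_desc);
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%-16s %12" PRIu64 " ns %8u calls %5.1f%%\n",
                s[i].name, s[i].elapsed_ns, s[i].count,
                filter_stats_share(&s[i], total));
}
```

Calling filter_stats_report(stats, n, stdout) at regular intervals, or when a call ends, would put the most expensive filter on the first line, which is where a resampler or encoder problem would show up.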
I know that oRTP was used 10 years ago to build server-side applications handling 2000 calls on an x86 server.