Feature #6387: osmo_io / io_uring support for RTP/RTCP
Status: Closed
% Done: 100%
Description
The RTP/RTCP sockets of osmo-mgw are prime candidates for migration to osmo_io, which would let them benefit from the optional io_uring backend.
Given the many small recvfrom/sendto syscalls on those sockets, performance should improve significantly.
Related issues
Updated by laforge 2 months ago
- Related to Feature #5751: io_uring support in libosmocore added
Updated by laforge about 2 months ago
- Status changed from New to In Progress
- Assignee set to laforge
- % Done changed from 0 to 80
The patch is in https://gerrit.osmocom.org/c/osmo-mgw/+/36363 - in my local testing it shows no regression in the TTCN3 test suite. Jenkins, however, does report regressions in the unit tests; I'll investigate.
In a benchmark running 200 concurrent bi-directional voice calls (set up from mncc-python, using rtpsource as RTP generator) with GSM-EFR codec, I am observing:
- the code before this patch uses 40..42% of a single core on a Ryzen 5950X at 200 calls (=> 200 endpoints with two connections each)
- no increase in CPU utilization before/after this patch, i.e. the osmo_io overhead for the osmo_fd backend is insignificant compared to the direct osmo_fd mode before
- an almost exactly 50% reduction of CPU utilization when running the same osmo-mgw build with LIBOSMO_IO_BACKEND=IO_URING - top shows 19..21% for the same workload instead of 40..42% with the OSMO_FD default backend.
- an increase of about 4 Megabytes in both RSS and VIRT size when enabling the IO_URING backend. This is likely due to the memory-mapped rings.
Updated by laforge about 2 months ago
With the IO_URING backend, the syscalls observed are:
- poll (including the eventfd of io_uring)
- the read of said eventfd
- tons of io_uring_enter() syscalls
The latter are the result of us calling io_uring_submit() after every individual read or write operation we add to the submission queue.
I've done another experiment to remove those io_uring_submit() calls and do them just before we enter poll(). This indeed removed the duplicate io_uring_enter() syscalls, and we now have an equal number of poll, read(eventfd) and io_uring_enter calls. The patch is at https://gerrit.osmocom.org/c/libosmocore/+/36364
However, this does not make a visible difference in the CPU utilization reported by top/ps: maybe 1%, but not more than that. So at least at this relatively low overall CPU load of ~20% it doesn't matter; this might change as we get closer to 100% CPU, where more batching might give more benefit.
FYI, in my 200-calls on 200-endpoints with 400-connections load test, I'm seeing the eventfd signalling something like 3..5 completions each time we poll+read it.
Updated by laforge about 2 months ago
- % Done changed from 80 to 90
Finally ported the failing unit test over to the new code; build verification now passes.