revisit fn-advance / rts-advance default settings
We currently use an fn-advance default of 20 frames and an rts-advance of 5, resulting in a total of 25 frames (about 115 ms) of downlink frame number advance. This will cause
- significantly increased RTT for GPRS user plane data
- increased latency of RLC/MAC signaling, specifically TBF establishment
- potential window stalls if we don't poll for ACK/NACK well before our window fills up
- probably problems with LAPDm timing
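For reference, the 115 ms figure follows from the GSM TDMA frame duration of 120/26 ms (about 4.615 ms). A minimal sketch of the arithmetic; the helper name advance_ms is only illustrative and not part of any Osmocom code:

```python
from fractions import Fraction

# One GSM TDMA frame lasts 120/26 ms (26 frames per 120 ms multiframe).
GSM_FRAME_MS = Fraction(120, 26)


def advance_ms(fn_advance: int, rts_advance: int) -> float:
    """Total downlink frame-number advance in milliseconds (illustrative helper)."""
    return float((fn_advance + rts_advance) * GSM_FRAME_MS)


# Current defaults from this ticket: fn-advance 20, rts-advance 5.
print(f"20 + 5 frames = {advance_ms(20, 5):.1f} ms")
```

Since every downlink block is queued this far ahead, the same helper gives a rough idea of how much latency any other combination of the two settings would add.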
I would guess that on modern hardware, particularly with SCHED_RR on TRX + BTS, we can reduce the fn_advance drastically. The rts_advance likely needs to remain in place without too many changes, as this is the amount of time the PCU has to prepare downlink data (i.e. schedule DL).
As a second step, we could possibly even think of something like a dynamically sized fn-advance, similar to how dynamic jitter buffers work in RTP.
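To make the dynamic idea a bit more concrete, here is a rough sketch of one possible control loop: grow the advance by one frame whenever late bursts were seen in a measurement interval, and shrink it by one frame after a run of clean intervals. Everything in it (class name, limits, the "clean intervals" rule) is an assumption for illustration; nothing like this exists in osmo-bts today.

```python
class AdaptiveFnAdvance:
    """Hypothetical controller for a dynamically sized fn-advance (in frames)."""

    def __init__(self, initial=20, minimum=2, maximum=20, clean_intervals_to_shrink=10):
        # All limits here are illustrative assumptions, not recommended values.
        self.value = initial
        self.minimum = minimum
        self.maximum = maximum
        self.clean_intervals_to_shrink = clean_intervals_to_shrink
        self._clean = 0  # consecutive intervals without late bursts

    def update(self, late_bursts: int) -> int:
        """Feed the number of late bursts seen in the last interval."""
        if late_bursts > 0:
            # React quickly to trouble: one more frame of headroom.
            self.value = min(self.maximum, self.value + 1)
            self._clean = 0
        else:
            # Recover slowly: shrink only after a run of clean intervals.
            self._clean += 1
            if self._clean >= self.clean_intervals_to_shrink:
                self.value = max(self.minimum, self.value - 1)
                self._clean = 0
        return self.value


ctrl = AdaptiveFnAdvance(initial=5, minimum=2, maximum=6, clean_intervals_to_shrink=3)
for late in (0, 0, 0, 3, 1, 1):
    print(late, "->", ctrl.update(late))
```

Growing fast and shrinking slowly is the same asymmetry adaptive jitter buffers typically use: a late burst is lost for good, while a slightly too large advance only costs latency.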
- Status changed from In Progress to Feedback
- % Done changed from 0 to 80
So in summary:
- I tested with B200 + osmo-trx-uhd + multi-arfcn with 2 TRX
- I tested with LimeSDR-USB + osmo-trx-lms + 1 TRX
- I had to run osmo-pcu with SCHED_RR too (-r 1) to avoid PDTCH DL blocks not being enqueued quickly enough in the BTS (related to the rts-advance value)
- I also noticed that using more conservative logging levels (I had been using quite verbose and compute-intensive logging for the RLCMAC category) also helps stability.
- "fn-advance" can be decreased to 2 by default; it worked fine. "rts-advance" is on the edge already, so I wouldn't touch that one.
I also submitted patches improving some related scheduler code to provide more information. I also added rate counters in order to display issues related to fn-advance and rts-advance ("show rate-counters" in osmo-bts).
I did some testing with a LimeNET-micro. So far it looks good from the osmo-bts-trx side, but it's not working properly on the osmo-trx-lms side: Tx downlink bursts arrive too late when using fn-advance 2 or 3, and from time to time I get lots of messages like this:
DTRXDDL <0003> Transceiver.cpp:430 [tid=140424023869184][chan=0] dumping STALE burst in TRX->SDR interface (0:2005343 vs 1:2005343), retrans=0
I'm running all through systemd services and they have realtime scheduling set in the service files.
I added some rate counters to monitor that kind of issue in osmo-trx, and also a VTY command to set a threshold at which osmo-trx will exit to flag to the BTS that something is wrong, like we do for other counters (overruns, underruns, dropped packets, etc.):
remote: https://gerrit.osmocom.org/c/osmo-trx/+/19050 Rename device specific rate counter multi-thread helpers
remote: https://gerrit.osmocom.org/c/osmo-trx/+/19051 Introduce rate counter tx_stale_bursts
While at it, I also fixed a bug I observed in the rate counter thresholds.
Using the current default fn values (20 and so on), I ran osmo-bts-trx + osmo-trx-lms on the LimeNet-Micro3 for a few hours with 1 phone attached and pinging an IP address. I then checked the related rate counters over time:
trx_clk:sched_dl_miss_fn: 0 (0/s 0/m 0/h 0/d) Downlink frames scheduled later than expected due to missed timerfd event (due to high system load)
This one didn't change over time, which is good.
But then in osmo-trx-lms:
trx:tx_stale_bursts: 232 (0/s 22/m 232/h 0/d) Number of Tx burts dropped by TRX due to arriving too late
trx:tx_stale_bursts: 1849 (0/s 0/m 1849/h 1793/d) Number of Tx burts dropped by TRX due to arriving too late
trx:tx_stale_bursts: 5067 (0/s 0/m 1890/h 5031/d) Number of Tx burts dropped by TRX due to arriving too late
trx:tx_stale_bursts: 5201 (0/s 0/m 2024/h 5031/d) Number of Tx burts dropped by TRX due to arriving too late
trx:tx_stale_bursts: 6156 (0/s 0/m 1125/h 5998/d) Number of Tx burts dropped by TRX due to arriving too late
So we are dropping roughly 2k bursts per hour with the current settings. I still need to figure out which fn parameter relates to that.
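For anyone who wants to track these counters over a longer run, here is a small sketch that pulls the totals and per-interval rates out of pasted "show rate-counters" output. The line format is inferred only from the samples quoted in this ticket, so treat the regular expression as an assumption rather than a spec of the VTY output:

```python
import re

# Format assumed from the samples above:
#   <name>: <total> (<n>/s <n>/m <n>/h <n>/d) <description>
COUNTER_RE = re.compile(
    r"(?P<name>[\w:]+):\s+(?P<total>\d+)\s+"
    r"\((?P<s>\d+)/s (?P<m>\d+)/m (?P<h>\d+)/h (?P<d>\d+)/d\)"
)


def parse_counters(text):
    """Return one dict per counter sample found in the pasted text."""
    samples = []
    for match in COUNTER_RE.finditer(text):
        samples.append({
            "name": match.group("name"),
            "total": int(match.group("total")),
            "per_hour": int(match.group("h")),
            "per_day": int(match.group("d")),
        })
    return samples


pasted = """
trx:tx_stale_bursts: 232 (0/s 22/m 232/h 0/d) Number of Tx burts dropped by TRX due to arriving too late
trx:tx_stale_bursts: 1849 (0/s 0/m 1849/h 1793/d) Number of Tx burts dropped by TRX due to arriving too late
"""
for sample in parse_counters(pasted):
    print(sample["name"], sample["total"], f'{sample["per_hour"]}/h')
```

Because finditer scans the whole text, this also works when several samples end up pasted on a single line.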
After a few more hours with the test running in the same environment, I continue to have trx_clk:sched_dl_miss_fn at 0 and trx:tx_stale_bursts at around 2425/h
Regarding that counter, I just found out that there may be a race condition between getStaleBurst() and getCurrentBurst(): if a burst is fed into the queue between the two calls, getCurrentBurst() can fail because the required burst may not be the first one in the queue even though it was queued, and getStaleBurst() will later drop it. So we potentially need to refactor that code to avoid those issues and then repeat the test, adding an extra counter for the case where getCurrentBurst() fails; comparing it with the other counter will show whether bursts arrived late or never arrived.
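To illustrate the suspected race, here is a toy single-threaded model of one possible reading of it. This is not the osmo-trx code: BurstQueue, fetch_racy and fetch_atomic are made-up names, get_stale/get_current only mimic what getStaleBurst()/getCurrentBurst() are described as doing, and the "late arrival" parameter stands in for the feeder thread enqueueing at the worst possible moment.

```python
import heapq


class BurstQueue:
    """Toy priority queue of (frame_number, payload), oldest frame first."""

    def __init__(self):
        self._heap = []

    def push(self, fn, payload):
        heapq.heappush(self._heap, (fn, payload))

    def get_stale(self, target_fn):
        """Pop the head if it is older than target_fn, else return None."""
        if self._heap and self._heap[0][0] < target_fn:
            return heapq.heappop(self._heap)
        return None

    def get_current(self, target_fn):
        """Pop the head only if it is exactly target_fn, else return None."""
        if self._heap and self._heap[0][0] == target_fn:
            return heapq.heappop(self._heap)
        return None


def fetch_racy(queue, target_fn, late_arrival=None):
    """Drain stale bursts, then fetch the current one as a separate step."""
    dropped = []
    while (burst := queue.get_stale(target_fn)) is not None:
        dropped.append(burst)
    # Window between the two steps: the feeder may enqueue right here.
    if late_arrival is not None:
        queue.push(*late_arrival)
    return queue.get_current(target_fn), dropped


def fetch_atomic(queue, target_fn):
    """Same logic with no window; in threaded code, both steps under one lock."""
    return fetch_racy(queue, target_fn, late_arrival=None)


q = BurstQueue()
q.push(100, "on time")
# An old burst (fn 99) slips in between the stale check and the fetch.
current, dropped = fetch_racy(q, 100, late_arrival=(99, "late"))
print("racy fetch for fn 100:", current, "dropped:", dropped)
# On the next frame, the on-time burst for fn 100 is discarded as stale.
current, dropped = fetch_racy(q, 101)
print("next frame dropped:", dropped)
```

In this model the burst for frame 100 was queued in time but still ends up counted as stale, which is why a separate counter for failed getCurrentBurst() calls would help tell the two cases apart.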
If that's not introducing an issue, then we need to investigate whether lowering fn_advance further degrades the current situation or not.
I submitted a bunch more patches fixing the potential race condition as well as adding more counters useful to grasp timing issues:
remote: https://gerrit.osmocom.org/c/osmo-trx/+/19206 Transceiver: Fix race condition obtaining Dl burst from Upper layer
remote: https://gerrit.osmocom.org/c/osmo-trx/+/19207 Add rate counter for missing Txbursts when scheduled towards the radioInterface
remote: https://gerrit.osmocom.org/c/osmo-trx/+/19205 Introduce rate counters to detect issues in received Dl bursts from TRXD