Feature #4537
open
OsmoBSC needs strategies to recover broken lchans (lchan state BORKEN)
30%
Description
Currently, there are some situations where broken lchans stick around in osmo-bsc without being recovered.
The reason is that we cannot be sure what state the BTS is in, and that state may differ between BTS models.
There are various reasons for an lchan to reach a broken state:
{act, deact} x {timeout, got a nack} x {for CS, for PS}
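The matrix above spans eight distinct cases. A minimal sketch that enumerates them, so that each can be given its own recovery strategy; the labels come from the matrix, while the tuple layout and variable names are illustrative assumptions:

```python
from itertools import product

# Labels taken from the matrix above; names and layout are hypothetical.
OPERATIONS = ("act", "deact")
FAILURES = ("timeout", "nack")
DOMAINS = ("CS", "PS")

# Every combination of operation, failure kind and domain.
BROKEN_CAUSES = list(product(OPERATIONS, FAILURES, DOMAINS))

for op, failure, domain in BROKEN_CAUSES:
    print(f"{op:5} {failure:7} {domain}")
```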
Each of these potentially has a distinct way in which the BSC should or could try to recover.
For example:
- after a while, try to CHAN ACTIV the lchan (as a probe, not for a subscriber request):
  - if the BTS accepts the activation, deactivate it again, and the lchan becomes usable again.
  - if the activation fails, try to deactivate. If that succeeds, the lchan becomes usable again.
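The probing idea above could be sketched roughly as follows. This is not osmo-bsc code: `FakeBts` is a hypothetical stand-in that only ACKs an activation of an inactive channel and a deactivation of an active one, and real BTS models may behave differently. Leaving the lchan broken when the final deactivation fails is also an assumption.

```python
from enum import Enum


class LchanState(Enum):
    UNUSED = "UNUSED"
    BORKEN = "BORKEN"


class FakeBts:
    """Hypothetical BTS model; the BSC does not know the value of `active`."""

    def __init__(self, active, responsive=True):
        self.active = active
        self.responsive = responsive

    def chan_activ(self):
        # ACK only if the BTS responds and the channel is currently inactive.
        if not self.responsive or self.active:
            return False
        self.active = True
        return True

    def chan_deact(self):
        # ACK only if the BTS responds and the channel is currently active.
        if not self.responsive or not self.active:
            return False
        self.active = False
        return True


def probe_recover(bts):
    """Probe a broken lchan and return the state the BSC should assume."""
    if bts.chan_activ():
        # Activation accepted: release again before declaring it usable.
        if bts.chan_deact():
            return LchanState.UNUSED
        return LchanState.BORKEN  # assumption: a failed release keeps it broken
    # Activation refused or timed out: try to deactivate instead.
    if bts.chan_deact():
        return LchanState.UNUSED
    return LchanState.BORKEN
```

In this model, both a channel stuck inactive and a channel stuck active recover, while a BTS that answers neither request leaves the lchan in BORKEN.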
There are also other, more general approaches to try to recover:
- if a BTS has one or more broken lchans, and there comes a moment where all lchans are unused, drop the OML link to cause a restart of the BTS. After that, all lchans are reset to a clean state.
- ...?
An important aspect: currently, the BSC picks a free lchan, and if that fails, the subscriber gets kicked out. We don't retry with another lchan. So, we cannot risk marking a channel as UNUSED when it is in fact broken: since we always pick the first lchan, if it is broken under the hood, every subscriber gets kicked out, and the entire BTS could become unusable as soon as the first broken lchan shows up.
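A toy simulation of this failure mode, assuming a simplified model in which each subscriber releases its lchan before the next one arrives. The dictionary keys and function names are hypothetical and only illustrate first-free selection without retry:

```python
def pick_first_free(lchans):
    """Return the index of the first lchan marked UNUSED, or None."""
    for i, lchan in enumerate(lchans):
        if lchan["state"] == "UNUSED":
            return i
    return None


def serve(lchans, n_subscribers):
    """Serve subscribers one by one without retry; return how many failed."""
    failed = 0
    for _ in range(n_subscribers):
        i = pick_first_free(lchans)
        if i is None or lchans[i]["really_broken"]:
            # The subscriber is dropped. The lchan keeps whatever state it
            # was marked with, so a wrongly UNUSED lchan is picked again.
            failed += 1
    return failed


# lchan 0 is broken in the BTS but wrongly marked UNUSED in the BSC.
lchans = [
    {"state": "UNUSED", "really_broken": True},
    {"state": "UNUSED", "really_broken": False},
]
failed_when_hidden = serve(lchans, 5)

# Marking the lchan as broken keeps it out of the selection.
lchans[0]["state"] = "BORKEN"
failed_when_marked = serve(lchans, 5)
```

With the broken lchan hidden behind an UNUSED marking, all five subscribers fail. Once it is marked BORKEN, none do.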
We could introduce a way for osmo-bsc to retry with a different lchan for a subscriber if the first one fails.
That would make it less dangerous for the BSC to try an lchan anyway while its state in the BTS might still be broken.
However, this should still remain distinguishable from a clean UNUSED lchan that never had a problem, so that constantly unrecoverable lchans are visible in the BSC's "show lchan summary".
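One way to picture retrying while keeping failures visible is a per-lchan failure counter next to the state. This is only a sketch: the `failures` field, the `max_attempts` limit and the summary format are assumptions and do not reflect the actual output of "show lchan summary".

```python
def assign_with_retry(lchans, try_activate, max_attempts=3):
    """Try UNUSED lchans in order; return the index that activated, or None."""
    attempts = 0
    for i, lchan in enumerate(lchans):
        if attempts >= max_attempts:
            break
        if lchan["state"] != "UNUSED":
            continue
        attempts += 1
        if try_activate(i):
            lchan["state"] = "IN_USE"
            return i
        # Still selectable later, but no longer looks like a clean lchan.
        lchan["failures"] += 1
    return None


def summary(lchans):
    """Hypothetical summary lines that flag lchans with past failures."""
    lines = []
    for i, lchan in enumerate(lchans):
        note = f" (failed {lchan['failures']}x)" if lchan["failures"] else ""
        lines.append(f"lchan {i}: {lchan['state']}{note}")
    return lines


lchans = [{"state": "UNUSED", "failures": 0} for _ in range(3)]
# Pretend that activation of lchan 0 always fails in the BTS.
chosen = assign_with_retry(lchans, try_activate=lambda i: i != 0)
```

Here the subscriber ends up on lchan 1 instead of being dropped, and lchan 0 remains listed as UNUSED with a recorded failure.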
A compound solution of the above could be the gold standard:
- if establishing an lchan fails, try another one instead of dropping the subscriber.
That could be nicer for subscribers when first hitting a broken lchan.
(Counter-argument: does the MS retry establishing an lchan anyway?
Maybe for first access, but finding an lchan for voice call assignment could retry different lchans.)
- an lchan that failed should remain in a broken state for a short minimum time (T3111?).
- when in a broken state, the BSC should regularly try to send CHAN ACTIV and/or CHAN DEACT messages to probe whether the BTS responds with a sane state (probably depending on how the broken state was reached).
- Since recovering could still fail, the BSC should notice when a BTS has lchans that remain broken for a given period of time.
It should then try to reset the BTS completely (drop OML) once it reaches a moment when no lchans are in use.
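The timing side of this combined approach could be sketched as a small decision function. All names and default durations below are hypothetical placeholders, not existing osmo-bsc settings or timer values; the current time is passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryPolicy:
    """Hypothetical per-BTS settings, in seconds (placeholder values)."""
    min_hold: float = 2.0         # minimum time to stay broken before probing
    probe_interval: float = 10.0  # time between probe attempts
    reset_after: float = 300.0    # broken this long: reset the BTS when idle


@dataclass
class BrokenLchan:
    broken_since: float
    last_probe: Optional[float] = None


def next_action(policy, lchan, now, lchans_in_use):
    """Decide what to do for one broken lchan at time `now`."""
    age = now - lchan.broken_since
    if age < policy.min_hold:
        return "wait"
    if age >= policy.reset_after and lchans_in_use == 0:
        # Probing has not helped and nobody is being served: drop OML.
        return "drop_oml"
    if lchan.last_probe is None or now - lchan.last_probe >= policy.probe_interval:
        return "probe"
    return "wait"
```

For example, with the placeholder defaults an lchan broken at t=0 yields "wait" at t=1 and "probe" at t=5. At t=400 it yields "drop_oml" only if no lchan on that BTS is in use.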
These approaches need to be tested on all supported BTS models,
and ideally should be configurable per-bts in the osmo-bsc.cfg.
Related issues