Bug #6359
closedTbfTest: sporadic SEGV in
100%
Description
This error showed up today in raspbian11 test:
https://jenkins.osmocom.org/jenkins/job/master-osmo-pcu/FIRMWARE_VERSION=master,WITH_MANUALS=0,label=rpi4-raspbian11,with_dsp=none,with_vty=False/6448/console
../../../tests/testsuite.at:28: $OSMO_QEMU $abs_top_builddir/tests/tbf/TbfTest --- experr 2024-02-13 07:37:47.768562611 +0000 +++ /build/osmo-pcu-1.4.0.1-b04e1/_build/sub/tests/testsuite.dir/at-groups/4/stderr 2024-02-13 07:37:48.558541306 +0000 @@ -11580,18 +11580,16 @@ TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} Starting timer X2001 [assignment (PACCH)] with 2 sec. 0 microsec DL_TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN}: Received Event ASSIGN_PCUIF_CNF TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} Ignoring event ASSIGN_PCUIF_CNF from BTS (CCCH was not requested on current assignment) -DL_ASS_TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){SEND_ASS}: Received Event CREATE_RLCMAC_MSG -PDCH(bts=0,trx=0,ts=2) POLL scheduled at FN 0 + 13 = 13 -TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} start Packet Downlink Assignment (PACCH) for TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} -+++++++++++++++++++++++++ TX : Packet Downlink Assignment +++++++++++++++++++++++++ -------------------------- TX : Packet Downlink Assignment ------------------------- -PDCH(bts=0,trx=0,ts=2) Reserving FN 13 for type POLL -TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} Scheduled DL Assignment polling on PACCH (FN=13, TS=2) -DL_ASS_TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){SEND_ASS}: state_chg to WAIT_ACK -PDCH(bts=0,trx=0,ts=2) FN=0 Scheduling control message at RTS for TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} -=== end test_ms_merge_dl_tbf_different_trx === -MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:UL:DL) Destroying MS object -MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:UL:DL) Detaching TBF: TBF(UL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} -MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:DL): - tbf: now used by 1 (tbf) -MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0:DL) Detaching TBF: TBF(DL:TFI-0-0-0:G:IMSI-001001000000001:TLLI-0xecc1f953){ASSIGN} -MS(IMSI-001001000000001:TLLI-0xecc1f953:TA-220:MSCLS-11-0): - tbf: now used by 0 (-) +UL_TBF(UL:TFI-0-0-0:G){ASSIGN}: Timeout of X2002 +UL_TBF(UL:TFI-0-0-0:G){ASSIGN}: Received Event ASSIGN_READY_CCCH +../../../src/tbf_ul_fsm.c:173:3: runtime error: member access within null pointer of type 'struct GprsMs' +AddressSanitizer:DEADLYSIGNAL +================================================================= +==1984==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000c (pc 0x00687b8e bp 0x1d52b65d sp 0xbeeb12b0 T0) +==1984==The signal is caused by a READ memory access. +==1984==Hint: address points to the zero page. + #0 0x687b8e in st_assign (/build/osmo-pcu-1.4.0.1-b04e1/_build/sub/tests/tbf/TbfTest+0x227b8e) + +AddressSanitizer can not provide additional info. +SUMMARY: AddressSanitizer: SEGV (/build/osmo-pcu-1.4.0.1-b04e1/_build/sub/tests/tbf/TbfTest+0x227b8e) in st_assign +==1984==ABORTING stdout: ../../../tests/testsuite.at:28: exit code was 1, expected 0 4. testsuite.at:25: 4. tbf (testsuite.at:25): FAILED (testsuite.at:28)
At first sight it seems that something took longer than usual to run (due to load?) and then some event triggered which caused a SEGV.
I marked the job above as "Keep this build forever". It can be unmarked once this issue is looked at.
I also upload here the build artifacts of the job just in case.
Files
Updated by pespin 3 months ago
Seems its failing when doing:
#define tbf_ul_fsm_state_chg(fi, NEXT_STATE) \ osmo_tdef_fsm_inst_state_chg(fi, NEXT_STATE, \ tbf_ul_fsm_timeouts, \ tbf_ms(((struct tbf_dl_fsm_ctx *)(fi->priv))->tbf)->bts->T_defs_bts, \ -1) static void st_assign(struct osmo_fsm_inst *fi, uint32_t event, void *data) { case TBF_EV_ASSIGN_READY_CCCH: /* change state to FLOW, so scheduler will start requesting USF */ tbf_ul_fsm_state_chg(fi, TBF_ST_FLOW);
Fails when accessing "tbf_ms(((struct tbf_dl_fsm_ctx *)(fi->priv))", which returns a NULL pointer.
The cast there is wrong, I fixed it here: https://gerrit.osmocom.org/c/osmo-pcu/+/35957 tbf_ul_fsm: Fix wrong cast
However, that's not really the culprit (tbf is at the same offset in both).
Updated by pespin about 2 months ago
With the following change, the crash can be reproduced easily:
diff --git a/tests/tbf/TbfTest.cpp b/tests/tbf/TbfTest.cpp index 643551b9..b2650518 100644 --- a/tests/tbf/TbfTest.cpp +++ b/tests/tbf/TbfTest.cpp @@ -2524,6 +2524,7 @@ static void test_ms_merge_dl_tbf_different_trx(void) OSMO_ASSERT(osmo_tdef_set(the_pcu->T_defs, -2002, 0, OSMO_TDEF_MS) == 0); osmo_fsm_inst_dispatch(ms_dl_tbf(second_ms)->state_fi, TBF_EV_ASSIGN_PCUIF_CNF, NULL); osmo_select_main(0); + osmo_select_main(0); /* get the PACCH PktDlAss for the DL-TBF, allocated on the UL-TBF from the new MS obj: */ request_dl_rlc_block(dl_tbf->bts, dl_tbf->control_ts, &fn);
Updated by pespin about 2 months ago
The UL TBF crashing is actually coming from another test, because in the test "test_ms_merge_dl_tbf_different_trx" the only UL TBF created happens through PACCH, and X2002 is only used during CCCH assignment.
By commenting all tests and running test_ms_merge_dl_tbf_different_trx() alone, it can be seen that the crash doesn't show up.
After several attempts, the one producing the crash later on when test_ms_merge_dl_tbf_different_trx blocks in osmo_main_select() is:
test_immediate_assign_rej();
Updated by pespin about 2 months ago
The problem seems triggered by the fact that test_immediate_assign_rej_multi_block() is calling "rc = bts_handle_rach(bts, 0x78, rach_fn, qta);", which create a UL TBF, but the UL TBF is never explicitly freed through procedures.
Instead, it was expecting "TALLOC_FREE(the_pcu);" at the end to free it.
However, when tearing down the whole tree of objects, ms_talloc_destructor() is unattaching the TBFs but not freeing them, which seems a bug.
Updated by pespin about 2 months ago
- Status changed from New to Feedback
- % Done changed from 0 to 90
Should be fixed by:
https://gerrit.osmocom.org/c/osmo-pcu/+/36450 Make sure GprsMs free() also frees its tbfs
I run the whole ttcn3 PCU_Tests testsuite with the patch applied and it looked good.
Once merged, we can close the ticket.