Bug #6439
closedttcn3-sgsn-test: SGSN_Tests.TC_attach_rau_a_b crashes osmo-sgsn
100%
Description
Starting from build 2319 we see a significant regression in ttcn3-sgsn-test:
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-sgsn-test/2319/
Interestingly enough, this is also affecting the -latest:
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-sgsn-test-latest/2022/
I was able to reproduce the problem locally by running SGSN_Tests.TC_attach_rau_a_b
two times in a row.
Files
Related issues
Updated by fixeria 15 days ago
20240421165147281 DLGSUP INFO GSUP link to 127.0.0.103:4222 DOWN (gsup_client.c:151) 20240421165148281 DLGSUP NOTICE GSUP connecting to 127.0.0.103:4222 (gsup_client.c:74) 20240421165148281 DLGSUP INFO GSUP link to 127.0.0.103:4222 DOWN (gsup_client.c:151) 20240421165149282 DLGSUP NOTICE GSUP connecting to 127.0.0.103:4222 (gsup_client.c:74) 20240421165149282 DLGSUP INFO GSUP link to 127.0.0.103:4222 DOWN (gsup_client.c:151) 20240421165149841 DLGLOBAL INFO Accept()ed new telnet connection r=127.0.0.1:43888<->l=127.0.0.10:4245 (telnet_interface.c:192) 20240421165149843 DMM INFO MM(262420000000038/e95c24ac) Cancelled, deleting context silently (gprs_gmm.c:1056) 20240421165149843 DMM INFO MM(262420000000038/e95c24ac) Cleaning MM context due to access cancelled (gprs_gmm.c:195) 20240421165149843 DMM DEBUG GMM(gmm_fsm)[0x55555574c700]{Registered.NORMAL}: Received Event E_GMM_CLEANUP (gprs_gmm.c:198) 20240421165149843 DMM DEBUG GMM(gmm_fsm)[0x55555574c700]{Registered.NORMAL}: state_chg to Deregistered (gprs_gmm_fsm.c:223) 20240421165149843 DMM DEBUG MM_STATE_Gb(0)[0x55555574c960]{Ready}: Received Event E_MM_GPRS_DETACH (gprs_gmm.c:205) 20240421165149843 DMM DEBUG MM_STATE_Gb(0)[0x55555574c960]{Ready}: state_chg to Idle (gprs_mm_state_gb_fsm.c:76) 20240421165149843 DLLC NOTICE LLME(527b5d30/36396334){(null)} LLGM Assign pre (36396334 => ffffffff) (gprs_llc.c:1079) 20240421165149843 DLLC NOTICE LLME(00000000/00000000){UNASSIGNED} LLGM Assign post (36396334 => ffffffff) (gprs_llc.c:1125) Program received signal SIGABRT, Aborted. 0x00007ffff75e332c in ?? () from /usr/lib/libc.so.6 (gdb) bt #0 0x00007ffff75e332c in ?? () from /usr/lib/libc.so.6 #1 0x00007ffff75926c8 in raise () from /usr/lib/libc.so.6 #2 0x00007ffff757a4b8 in abort () from /usr/lib/libc.so.6 #3 0x00007ffff78270d3 in ?? () from /usr/lib/libtalloc.so.2 #4 0x000055555557dd74 in llme_free (llme=0x555555753550) at ../../../../src/osmo-sgsn/src/sgsn/gprs_llc.c:605 #5 gprs_llgmm_assign (llme=0x555555753550, old_tlli=909730612, new_tlli=new_tlli@entry=4294967295) at ../../../../src/osmo-sgsn/src/sgsn/gprs_llc.c:1129 #6 0x000055555557e07d in gprs_llgmm_unassign (llme=<optimized out>) at ../../../../src/osmo-sgsn/src/sgsn/gprs_llc.c:1137 #7 0x00005555555673f5 in st_mm_idle_on_enter (fi=<optimized out>, prev_state=<optimized out>) at ../../../../src/osmo-sgsn/src/sgsn/gprs_mm_state_gb_fsm.c:51 #8 0x00007ffff797f7e0 in state_chg (fi=fi@entry=0x55555574c960, new_state=new_state@entry=0, keep_timer=keep_timer@entry=false, timeout_ms=timeout_ms@entry=0, T=0, file=file@entry=0x55555558d838 "../../../../src/osmo-sgsn/src/sgsn/gprs_mm_state_gb_fsm.c", line=76) at ../../../../src/libosmocore/src/core/fsm.c:697 #9 0x00007ffff7980180 in _osmo_fsm_inst_state_chg (fi=fi@entry=0x55555574c960, new_state=new_state@entry=0, timeout_secs=timeout_secs@entry=0, T=<optimized out>, file=file@entry=0x55555558d838 "../../../../src/osmo-sgsn/src/sgsn/gprs_mm_state_gb_fsm.c", line=line@entry=76) at ../../../../src/libosmocore/src/core/fsm.c:746 #10 0x00007ffff799b090 in _osmo_tdef_fsm_inst_state_chg (fi=fi@entry=0x55555574c960, state=state@entry=0, timeouts_array=timeouts_array@entry=0x55555559a8c0 <mm_state_gb_fsm_timeouts>, tdefs=<optimized out>, default_timeout=93824992461304, default_timeout@entry=-1, file=file@entry=0x55555558d838 "../../../../src/osmo-sgsn/src/sgsn/gprs_mm_state_gb_fsm.c", line=76) at ../../../../src/libosmocore/src/core/tdef.c:344 #11 0x0000555555567358 in st_mm_ready (fi=0x55555574c960, event=<optimized out>, data=<optimized out>) at ../../../../src/osmo-sgsn/src/sgsn/gprs_mm_state_gb_fsm.c:76 #12 0x00007ffff79803bc in _osmo_fsm_inst_dispatch (fi=0x55555574c960, event=event@entry=1, data=data@entry=0x0, file=file@entry=0x55555558c1f8 "../../../../src/osmo-sgsn/src/sgsn/gprs_gmm.c", line=line@entry=205) at ../../../../src/libosmocore/src/core/fsm.c:875 #13 0x000055555555efbc in mm_ctx_cleanup_free (ctx=0x55555574c060, log_text=0x55555559768a "access cancelled") at ../../../../src/osmo-sgsn/src/sgsn/gprs_gmm.c:205 #14 0x0000555555575eaa in reset_sgsn_state (self=<optimized out>, vty=0x5555556d5f90, argc=<optimized out>, argv=<optimized out>) at ../../../../src/osmo-sgsn/src/sgsn/sgsn_vty.c:1052 #15 0x00007ffff79c7445 in cmd_execute_command_real (vline=<optimized out>, vty=<optimized out>, cmd=cmd@entry=0x101508d7ae252000) at ../../../../src/libosmocore/src/vty/command.c:2671 #16 0x00007ffff79c7f1d in cmd_execute_command (vline=<optimized out>, vty=<optimized out>, cmd=0x101508d7ae252000, vtysh=<optimized out>) at ../../../../src/libosmocore/src/vty/command.c:2723 #17 0x00007ffff79ca696 in vty_command (vty=0x5555556d5f90) at ../../../../src/libosmocore/src/vty/vty.c:464 #18 vty_execute (vty=0x5555556d5f90) at ../../../../src/libosmocore/src/vty/vty.c:729 #19 vty_read (vty=<optimized out>) at ../../../../src/libosmocore/src/vty/vty.c:1471 #20 0x00007ffff79cd3ae in client_data (fd=0x55555574ba68, what=1) at ../../../../src/libosmocore/src/vty/telnet_interface.c:161 #21 0x00007ffff798f94f in poll_disp_fds (n_fd=<optimized out>) at ../../../../src/libosmocore/src/core/select.c:419 #22 _osmo_select_main (polling=polling@entry=0) at ../../../../src/libosmocore/src/core/select.c:457 #23 0x00007ffff798fa2e in osmo_select_main (polling=polling@entry=0) at ../../../../src/libosmocore/src/core/select.c:496 #24 0x000055555555d4e7 in main (argc=<optimized out>, argv=<optimized out>) at ../../../../src/osmo-sgsn/src/sgsn/sgsn_main.c:498
Updated by fixeria 14 days ago
- Status changed from New to In Progress
git-bisect tells me that one of my recent commits is the culprit:
1ee1edd20e7e9d8e844de1e4826cfa9d9cba10ca is the first bad commit commit 1ee1edd20e7e9d8e844de1e4826cfa9d9cba10ca Author: Vadim Yanitskiy <vyanitskiy@sysmocom.de> Date: Sat Apr 6 05:35:39 2024 +0700 sgsn: fix unused param in f_routing_area_update() Change-Id: I582b2836eba3188cfd4c640b1febbd25c50001e0 Related: SYS#6603, OS#6294 sgsn/SGSN_Tests.ttcn | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
Updated by fixeria 14 days ago
- Related to Feature #6294: Support GN/Gp interoperation procedures between SGSN and MME added
Updated by fixeria 14 days ago
- File SGSN_Tests.TC_attach_rau_a_b_fixed.pcap.gz SGSN_Tests.TC_attach_rau_a_b_fixed.pcap.gz added
- File SGSN_Tests.TC_attach_rau_a_b_bogus.pcap.gz SGSN_Tests.TC_attach_rau_a_b_bogus.pcap.gz added
- Subject changed from ttcn3-sgsn-test: SIGABRT since build 2319 to ttcn3-sgsn-test: SGSN_Tests.TC_attach_rau_a_b crashes osmo-sgsn
- % Done changed from 0 to 20
fixeria wrote in #note-2:
git-bisect tells me that one of my recent commits is the culprit:
[...]
SGSN_Tests.TC_attach_rau_a_b
is the only testcase passing a different RAI thang_pars.ra
tof_routing_area_update()
.- Before my commit, that different RAI would be ignored and
g_pars.ra
would be used internally byf_routing_area_update()
(bug).- ... this was making
SGSN_Tests.TC_attach_rau_a_b
identical toSGSN_Tests.TC_attach_rau_a_a
.
- ... this was making
- So my commit actually made the testcase behave the way it was supposed to behave! And uncovered a problem in osmo-sgsn.
I am attaching two PCAPs:
As can be seen from the bogus PCAP (check frames 399 and 433), the RAI being sent is the same in both RAUs:
Routing Area Identification - Old routing area identification - RAI: 262-42-13135-0 Routing area identification: 262-42-13135-0 Mobile Country Code (MCC): Germany (262) Mobile Network Code (MNC): Vodafone GmbH (42) Location Area Code (LAC): 0x334f (13135) Routing Area Code (RAC): 0x00 (0)
In the "fixed" PCAP, the first and second RAUs carry different RAI values:
Routing Area Identification - Old routing area identification - RAI: 262-42-13135-0 Routing area identification: 262-42-13135-0 Mobile Country Code (MCC): Germany (262) Mobile Network Code (MNC): Vodafone GmbH (42) Location Area Code (LAC): 0x334f (13135) Routing Area Code (RAC): 0x00 (0) Routing Area Identification - Old routing area identification - RAI: 262-42-13200-0 Routing area identification: 262-42-13200-0 Mobile Country Code (MCC): Germany (262) Mobile Network Code (MNC): Vodafone GmbH (42) Location Area Code (LAC): 0x3390 (13200) Routing Area Code (RAC): 0x00 (0)
So, most likely, the problem is in osmo-sgsn and needs to be fixed there.
Updated by fixeria 14 days ago
- Status changed from In Progress to Feedback
- % Done changed from 20 to 100
fixeria wrote in #note-4:
As can be seen from the bogus PCAP (check frames 399 and 433), the RAI being sent is the same in both RAUs:
[...]
In the "fixed" PCAP, the first and second RAUs carry different RAI values:
[...]
I actually misunderstood the testcase scenario. What I was talking above is the Old RAI IE (GMM layer), which should be the same in both RAUs.
There's also RAI at the BSSGP layer, which unlike the Old RAI should obviously indicate the new RAI.
So, most likely, the problem is in osmo-sgsn and needs to be fixed there.
It's actually both:
- regression in the testsuite due to my patch, and
- use-after-free problem in osmo-sgsn, which was uncovered by broken testcase scenario.
Here is a fixup-patch fixing the testsuite:
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/36613 fixup: sgsn: fix unused param in f_routing_area_update() [NEW]
I'll prepare a separate testcase triggering the use-after-free problem and open a separate ticket for it.