Bug #6304
Status: Closed
osmo-gsm-tester_virtual is broken since build 7424
Added by fixeria 5 months ago. Updated 5 months ago.
Description
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7424/ -- was aborted by osmith because it was executing too long (for 3 days 11 hr!)
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7425/ -- broken, executed on Dec 11, took 9 min 11 sec
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7426/ -- broken, executed on Dec 11, took 4 min 46 sec
The last two builds are failing with the following verdicts, respectively:
FAIL: netreg_mass (fail: 1) FAIL: register_default_mass.py (141.8 sec) Exception: Completion ratio of 26.000000% lower than threshold.
FAIL: netreg_mass (fail: 1) FAIL: register_default_mass.py (141.5 sec) Exception: Completion ratio of 18.000000% lower than threshold.
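For context on the verdict: the test spawns a number of mobile instances and compares the percentage of them that completed Location Updating against a configured threshold. A minimal sketch of that kind of check (function name and threshold value are hypothetical; the actual logic lives in register_default_mass.py and may differ):

def check_completion_ratio(num_completed: int, num_ms_total: int,
                           threshold_percent: float = 90.0) -> None:
    # Raise a verdict like the one above if too few MS completed Location Updating.
    ratio = 100.0 * num_completed / num_ms_total
    if ratio < threshold_percent:
        raise Exception('Completion ratio of %f%% lower than threshold.' % ratio)

check_completion_ratio(26, 100)  # -> Exception: Completion ratio of 26.000000% lower than threshold.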
Files
osmo-ms-mob-00000-stderr | 5.72 KB | fixeria, 12/13/2023 04:37 AM
osmo-ms-mob-00006-stderr | 5.06 KB | fixeria, 12/13/2023 04:37 AM
Updated by fixeria 5 months ago
- Status changed from New to In Progress
- Assignee set to fixeria
fixeria wrote:
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7424/ -- was aborted by osmith because it was executing too long (for 3 days 11 hr!)
According to the Console Output, this build used libosmocore 1.9.0.79.cb5ce.202312062026, which predates the logging fix (see #6299), so most likely virtphy just crashed and everything got stuck.
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7425/ -- broken, executed on Dec 11, took 9 min 11 sec
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7426/ -- broken, executed on Dec 11, took 4 min 46 sec
These builds both used libosmocore 1.9.0.94.d2e8.202312102026, which also predates the logging fix (see #6299).
Meanwhile, there was another build on Dec 12th:
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7427/ -- broken, executed on Dec 12, took 5 min 1 sec
which used libosmocore 1.9.0.95.9c603.202312112026, pointing exactly at the bugfix commit, but still failed:
FAIL: netreg_mass (fail: 1) FAIL: register_default_mass.py (141.2 sec) Exception: Completion ratio of 17.000000% lower than threshold.
Updated by fixeria 5 months ago
I compared the output of osmo-msc between run 7423 (the last passing one) and run 7427 (the latest failing one), and indeed the number of LOCATION UPDATING REQUESTs differs:
fixeria@LEGION:/home/fixeria/ws/run.2023-12-06_22-50-16/netreg_mass/register_default_mass.py$ less -R osmo-msc_127.0.42.6/stderr | grep --text "LOCATION UPDATING REQUEST" | wc -l
100
fixeria@LEGION:/home/fixeria/ws/run.2023-12-12_22-50-22/netreg_mass/register_default_mass.py$ less -R osmo-msc_127.0.42.6/stderr | grep --text "LOCATION UPDATING REQUEST" | wc -l
17
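The same count can be obtained directly in Python; a small sketch, assuming the osmo-msc stderr paths from the run directories above:

def count_lu_requests(stderr_path: str) -> int:
    # Count log lines mentioning a LOCATION UPDATING REQUEST, like grep | wc -l above.
    with open(stderr_path, errors='replace') as f:
        return sum('LOCATION UPDATING REQUEST' in line for line in f)

print(count_lu_requests('osmo-msc_127.0.42.6/stderr'))  # 100 in run 7423, only 17 in run 7427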
The output files of mobile instances contain lots of LAPDm errors, which might be the cause here.
Updated by fixeria 5 months ago
In run 7427, I also checked the PCAP containing the Um/LAPDm interface traces:
fixeria@LEGION:/home/fixeria/ws/run.2023-12-12_22-50-22/netreg_mass/register_default_mass.py$ tshark -r osmo-bsc_127.0.42.7/pcap/pcap-recorder_any\(filters\=\'host\ 127.0.42.7\ and\ port\ not\ 22\'\).pcap -Y "lapdm && gsm_a.dtap.msg_mm_type == 0x08"
 330 17.670040 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
 337 17.674456 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
 429 21.435941 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
 436 21.440358 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
 500 23.318697 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
 518 23.475163 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
 628 25.442644 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
 638 25.450369 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
 753 27.075639 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
 789 27.242246 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
1038 30.850650 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
1040 30.850689 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
1074 31.006859 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
2865 47.802652 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
2904 47.828940 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
4356 60.978121 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
4407 61.152466 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
4669 64.734205 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
4724 64.899432 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
4978 68.508959 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5011 68.665255 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
5117 70.391990 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5124 70.396335 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
5234 72.270133 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5252 72.431839 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
5374 74.361881 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5387 74.406354 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
5539 81.205983 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5550 81.223076 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
5685 92.268392 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5695 92.285186 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
5795 97.916800 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
5805 97.934046 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request
6092 126.603749 127.0.0.1 → 127.0.42.7 LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request
6108 126.800394 127.0.0.1 → 127.0.42.7 LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request

fixeria@LEGION:/home/fixeria/ws/run.2023-12-12_22-50-22/netreg_mass/register_default_mass.py$ tshark -r osmo-bsc_127.0.42.7/pcap/pcap-recorder_any\(filters\=\'host\ 127.0.42.7\ and\ port\ not\ 22\'\).pcap -Y "lapdm && gsm_a.dtap.msg_mm_type == 0x08" | wc -l
35
Ideally, we should see 200 "Location Updating Request" entries (100 func=SABM + 100 func=UA).
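The SABM/UA tally can also be automated by wrapping the same tshark invocation; a rough sketch (function name is mine; assumes tshark is in PATH and the PCAP path shown above):

import subprocess

def tally_lu_frames(pcap_path: str) -> dict:
    # Run the same display filter as above and count SABM vs. UA frames.
    out = subprocess.run(
        ['tshark', '-r', pcap_path, '-Y', 'lapdm && gsm_a.dtap.msg_mm_type == 0x08'],
        capture_output=True, text=True, check=True).stdout
    counts = {'SABM': 0, 'UA': 0}
    for line in out.splitlines():
        for func in counts:
            if 'func=%s(' % func in line:
                counts[func] += 1
    return counts

A clean run with 100 MS should yield {'SABM': 100, 'UA': 100}; the capture from run 7427 contains only 18 SABM + 17 UA (35 frames in total).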
Something is definitely wrong on the Um interface.
Updated by fixeria 5 months ago
- File osmo-ms-mob-00006-stderr added
- File osmo-ms-mob-00000-stderr added
- Status changed from In Progress to Feedback
- Assignee changed from fixeria to jolly
Attaching two log files generated by the mobile application:
- osmo-ms-mob-00000-stderr: IMSI=001010000000000, Location Update successful
- osmo-ms-mob-00006-stderr: IMSI=001010000000006, Location Update did not show up
Interestingly enough, osmo-ms-mob-00000-stderr contains even more LAPDm errors than osmo-ms-mob-00006-stderr, and yet its Location Update still succeeds:
20231212225048472 DPLMN NOTICE Event EVENT_REG_SUCCESS unhandled in state A2 on PLMN. (gsm322.c:3784)
I am running out of ideas and assigning this ticket to jolly, who has recently been working on LAPDm.
It might be caused by one of the LAPDm patches recently merged to libosmocore.
Updated by fixeria 5 months ago
- Status changed from Feedback to Resolved
The problem was that a stale Docker container called interesting_torvalds was still running on the server. It survived even though the respective job had been aborted:
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7424/ (3 days 11 hr build duration)
All subsequent (failing) jobs executed their containers on the same host, so there was a race condition between two osmo-bts-virtual processes running at the same time.
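To illustrate the failure mode (a conceptual sketch, not taken from the job logs): osmo-bts-virtual and virtphy exchange GSMTAP frames over UDP multicast on the virtual Um interface, so a second, stale osmo-bts-virtual joined to the same group on the same host also receives the uplink frames and can answer them. The snippet below only demonstrates that two sockets joined to one multicast group both see every datagram; the group/port are assumed defaults, not verified against the job configuration:

import socket, struct

GROUP, PORT = '239.193.23.2', 4729  # assumed virtual-Um uplink defaults; adjust to the actual config

def open_receiver() -> socket.socket:
    # Each call returns another socket subscribed to the same multicast group/port.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', PORT))
    mreq = struct.pack('4s4s', socket.inet_aton(GROUP), socket.inet_aton('0.0.0.0'))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s

stale, fresh = open_receiver(), open_receiver()  # both sockets receive the same uplink frames, hence the race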
We had a long live debugging call with jolly, trying to understand what was happening. Then osmith joined and helped us a lot by logging in and inspecting the running containers.
After the stale container was killed, the job went back to normal:
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7431/
Updated by osmith 5 months ago
Patch to ensure this doesn't happen again: https://gerrit.osmocom.org/c/osmo-ci/+/35362