Project

General

Profile

Actions

Bug #6304

closed

osmo-gsm-tester_virtual is broken since build 7424

Added by fixeria 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
12/12/2023
Due date:
% Done:

0%

Spec Reference:

Description

https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7424/ -- was aborted by osmith because it was executing too long (for 3 days 11 hr!)
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7425/ -- broken, executed on Dec 11, took 9 min 11 sec
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7426/ -- broken, executed on Dec 11, took 4 min 46 sec

The last two builds are failing with the following verdicts, respectively:

FAIL: netreg_mass (fail: 1)
    FAIL: register_default_mass.py (141.8 sec) Exception: Completion ratio of 26.000000% lower than threshold.

FAIL: netreg_mass (fail: 1)
    FAIL: register_default_mass.py (141.5 sec) Exception: Completion ratio of 18.000000% lower than threshold.

Files

osmo-ms-mob-00000-stderr osmo-ms-mob-00000-stderr 5.72 KB fixeria, 12/13/2023 04:37 AM
osmo-ms-mob-00006-stderr osmo-ms-mob-00006-stderr 5.06 KB fixeria, 12/13/2023 04:37 AM

Related issues

Has duplicate Core testing infrastructure - Bug #6308: osmo-gsm-tester_virtual sporadically fails due to double received frames at the MS side.Closedjolly12/13/2023

Actions
Actions #1

Updated by fixeria 5 months ago

  • Status changed from New to In Progress
  • Assignee set to fixeria

fixeria wrote:

https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7424/ -- was aborted by osmith because it was executing too long (for 3 days 11 hr!)

According to the Console Output, this build used libosmocore 1.9.0.79.cb5ce.202312062026, which predates the logging fix (see #6299), so most likely virtphy just crashed and everything got stuck.

https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7425/ -- broken, executed on Dec 11, took 9 min 11 sec
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7426/ -- broken, executed on Dec 11, took 4 min 46 sec

These builds both used libosmocore 1.9.0.94.d2e8.202312102026, which also predates the logging fix (see #6299).

Meanwhile, there was another build on Dec 12th:

https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7427/ -- broken, executed on Dec 12, took 5 min 1 sec

which used libosmocore 1.9.0.95.9c603.202312112026 pointing exactly to the bugfix, but still failed:

FAIL: netreg_mass (fail: 1)
    FAIL: register_default_mass.py (141.2 sec) Exception: Completion ratio of 17.000000% lower than threshold.
Actions #2

Updated by fixeria 5 months ago

I compared the output of osmo-msc between run 7423 (last passing) and run 7427 (latest failing), and indeed the amount of LOCATION UPDATE REQUESTs differs:

fixeria@LEGION:/home/fixeria/ws/run.2023-12-06_22-50-16/netreg_mass/register_default_mass.py$ less -R osmo-msc_127.0.42.6/stderr | grep --text "LOCATION UPDATING REQUEST" | wc -l
100
fixeria@LEGION:/home/fixeria/ws/run.2023-12-12_22-50-22/netreg_mass/register_default_mass.py$ less -R osmo-msc_127.0.42.6/stderr | grep --text "LOCATION UPDATING REQUEST" | wc -l
17

The output files of mobile instances contain lots of LAPDm errors, which might be the cause here.

Actions #3

Updated by fixeria 5 months ago

In run 7427, I also checked the PCAP containing the Um/LAPDm interface traces:

fixeria@LEGION:/home/fixeria/ws/run.2023-12-12_22-50-22/netreg_mass/register_default_mass.py$ tshark -r osmo-bsc_127.0.42.7/pcap/pcap-recorder_any\(filters\=\'host\ 127.0.42.7\ and\ port\ not\ 22\'\).pcap -Y "lapdm && gsm_a.dtap.msg_mm_type == 0x08" 
  330  17.670040    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
  337  17.674456    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
  429  21.435941    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
  436  21.440358    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
  500  23.318697    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
  518  23.475163    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
  628  25.442644    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
  638  25.450369    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
  753  27.075639    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
  789  27.242246    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 1038  30.850650    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 1040  30.850689    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 1074  31.006859    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 2865  47.802652    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 2904  47.828940    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 4356  60.978121    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 4407  61.152466    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 4669  64.734205    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 4724  64.899432    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 4978  68.508959    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5011  68.665255    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 5117  70.391990    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5124  70.396335    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 5234  72.270133    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5252  72.431839    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 5374  74.361881    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5387  74.406354    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 5539  81.205983    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5550  81.223076    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 5685  92.268392    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5695  92.285186    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 5795  97.916800    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 5805  97.934046    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
 6092 126.603749    127.0.0.1 → 127.0.42.7   LAPDm 83 U P, func=SABM(DTAP) (MM) Location Updating Request 
 6108 126.800394    127.0.0.1 → 127.0.42.7   LAPDm 83 U F, func=UA(DTAP) (MM) Location Updating Request 
fixeria@LEGION:/home/fixeria/ws/run.2023-12-12_22-50-22/netreg_mass/register_default_mass.py$ tshark -r osmo-bsc_127.0.42.7/pcap/pcap-recorder_any\(filters\=\'host\ 127.0.42.7\ and\ port\ not\ 22\'\).pcap -Y "lapdm && gsm_a.dtap.msg_mm_type == 0x08" | wc -l
35

Ideally, we should see 200 "Location Updating Request" entries (100 func=SABM + 100 func=UA).
Something is definitely wrong on the Um interface.

Actions #4

Updated by fixeria 5 months ago

Attaching two log files generated by the mobile application:

  • osmo-ms-mob-00000-stderr: IMSI=001010000000000, Location Update successful
  • osmo-ms-mob-00006-stderr: IMSI=001010000000006, Location Update did not show up

Interestingly enough, the osmo-ms-mob-00000-stderr contains even more LAPDm errors than osmo-ms-mob-00006-stderr, but still it succeeds:

20231212225048472 DPLMN NOTICE Event EVENT_REG_SUCCESS unhandled in state A2 on PLMN. (gsm322.c:3784)

I am running out of ideas and assigning this ticket to jolly, who has recently been working on LAPDm.
It might be caused by one of the LAPDm patches recently merged to libosmocore.

Actions #5

Updated by osmith 5 months ago

  • Has duplicate Bug #6308: osmo-gsm-tester_virtual sporadically fails due to double received frames at the MS side. added
Actions #6

Updated by fixeria 5 months ago

  • Status changed from Feedback to Resolved

The problem was that a stale docker container called interesting_torvalds was still running on the server. It survived even though the respective job was aborted:

https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7424/ (3 days 11 hr build duration)

All subsequent containers (failing jobs) were executing on the same host, so there was a race condition between two osmo-bts-virtual processes running at the same time.

We had a long live debugging call with jolly, trying to understand what's happening. Then osmith joined and helped us a lot by logging in and inspecting the running containers.

After the stale container was killed, the job is back to normal:

https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_virtual/7431/

||                            ___                                  ||
||                          .'   '.                                ||
||                         /       \           oOoOo               ||
||                        |         |       ,==|||||               ||
||                         \       /       _|| |||||               ||
||                          '.___.'    _.-'^|| |||||               ||
||                        __/_______.-'     '==HHHHH               ||
||                   _.-'` /                   """""               ||
||                .-'     /   oOoOo                                ||
||                `-._   / ,==|||||                                ||
||                    '-/._|| |||||                                ||
||                     /  ^|| |||||                                ||
||                    /    '==HHHHH                                ||
||                   /________"""""                                ||
||                   `\       `\                                   ||
||                     \        `\   /                             ||
||                      \         `\/                              ||
||                      /                                          ||
||                     /                                           ||
||               jgs  /_____                                       ||
||                                                                 ||
Actions #7

Updated by osmith 5 months ago

Patch to ensure this doesn't happen again: https://gerrit.osmocom.org/c/osmo-ci/+/35362

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)