Project

General

Profile

Actions

Bug #4619

closed

ttcn3-bsc-test: all LCLS test cases broken since build #987

Added by fixeria almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
Start date:
06/17/2020
Due date:
% Done:

90%

Spec Reference:
Tags:

Description

Check out https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/test_results_analyzer/.

This is probably related to my recent IPA/RSL emulation refactoring changes:

Changes
1. centos-repo-install-test: new image (detail)
2. debian-repo-install-test: move scripts to osmo-ci (detail)
3. ttcn3-bts-test: enable 3 additional transceivers for BTS#0 (detail)

Here is an extract from the build artifacts of build#987:

06:26:01.572279 921 RSL_Emulation.ttcn:556 Matching on port IPA_PT succeeded:  matched                                                                                   
06:26:01.572304 921 RSL_Emulation.ttcn:556 Receive operation on port IPA_PT succeeded, message from IPA0-RSL-IPA(920): @IPA_Emulation.ASP_RSL_Unitdata : { conn_id := 1, 
streamId := IPAC_PROTO_RSL_TRX0 (0), rsl := { msg_disc := { msg_group := RSL_MDISC_IPACCESS (63), transparent := false }, msg_type := RSL_MT_IPAC_CRCX (112), ies := { { 
iei := RSL_IE_CHAN_NR (1), body := { chan_nr := { u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } } }, { iei := RSL_IE_IPAC_SPEECH_MODE (244), body := { ipa_speech_mo
de := { reserved := '00'B, mode := RSL_IPA_SPM_RECVONLY (1), codec := RSL_IPA_CODEC_FR (0) } } }, { iei := RSL_IE_IPAC_RTP_PAYLOAD (242), body := { ipa_rtp_pt := 3 } } }
 } } id 27                                                                                                                                                               
06:26:01.572325 921 RSL_Emulation.ttcn:556 Message with id 27 was extracted from the queue of IPA_PT.                                                                    
06:26:01.572351 921 RSL_Emulation.ttcn:199 No Dchan handler for 0{ u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 }                                                    
06:26:01.572469 923 MSC_ConnectionHandler.ttcn:784 dec_PDU_ML3_NW_MS(): Decoded @MobileL3_Types.PDU_ML3_NW_MS: { discriminator := '0110'B, tiOrSkip := { skipIndicator :=
 '0000'B }, msgs := { rrm := { assignmentCommand := { messageType := '00101110'B, descrOf1stChAfterTime := { timeslotNumber := '010'B, channelTypeandTDMAOffset := '00001
'B, octet3 := '43'O ("C"), octet4 := '67'O ("g") }, PowerCommand := { powerlevel := '00111'B, fPC_EP := '0'B, ePC_Mode := '0'B, spare_1 := '0'B }, frequencyList_at := om
it, cellChannelDescr := omit, descrMultislotAllocation := omit, modeOf1stChannel := { elementIdentifier := '63'O ("c"), mode := '01'O }, channelSet2 := omit, channelSet3
 := omit, channelSet4 := omit, channelSet5 := omit, channelSet6 := omit, channelSet7 := omit, channelSet8 := omit, descrOf2ndChAfterTime := omit, modeOf2ndChannel := omi
t, mobileAllocation_at := omit, startingTime := omit, frequencyList_bt := omit, descrOf1stCh_bt := omit, descrOf2ndCh_bt := omit, frequencyChannelSequence := omit, mobil
eAllocation_bt := omit, cipherModeSetting := omit, vGCS_TargetModeIndication := omit, multiRateConfiguration := omit, vGCS_Ciphering_Parameters := omit, extendedTSCSet_a
fterTime := omit, extendedTSCSet_beforeTime := omit } } } }                     
06:26:01.572471 921 RSL_Emulation.ttcn:563 setverdict(fail): none -> fail reason: "RSL for unknown Dchan", new component reason: "RSL for unknown Dchan"                 
06:26:01.572505 921 RSL_Emulation.ttcn:564 Stopping MTC. The current test case will be terminated.                                                                       
06:26:01.572526 921 RSL_Emulation.ttcn:564 Stopping test component execution.

Note that ttcn3-bsc-test-latest is not affected.


Files

lcls_ass_crcx.pcap lcls_ass_crcx.pcap 468 Bytes fixeria, 06/19/2020 01:09 AM

Related issues

Related to Cellular Network Infrastructure - Feature #4701: implement OsmoSTP notification of peers disconnecting, e.g. for OsmoBSC to detect that a specific MSC in the pool is disconnectedResolvedneels08/11/2020

Actions
Related to OsmoBSC - Bug #4899: ttcn3-bsc-test-latest BSC_Tests_LCLS failing since a few days agoRejectedneels12/08/2020

Actions
Actions #1

Updated by fixeria almost 4 years ago

This is probably related to my recent IPA/RSL emulation refactoring changes:
...
Note that ttcn3-bsc-test-latest is not affected.

Actually, this should not be the case. If that was a regression in ttcn3-bsc-test, then LCLS tests in *-latest would fail too.

Actions #2

Updated by fixeria almost 4 years ago

Interestingly enough, some test cases pass when running on my machine:

pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_only
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_break
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_clear
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass->FAIL: 6
  pass: 8
  skip: 149

All libraries, osmo-bsc and ttcn3-bts-test (running natively) were compiled from the recent master.

Actions #3

Updated by fixeria almost 4 years ago

Hmm, running the same group of test cases again gives different results:

pass BSC_Tests_LCLS.TC_lcls_gcr_only
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass BSC_Tests_LCLS.TC_lcls_connect_break
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_clear
pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass->FAIL: 7
  pass: 7
  skip: 149

so the test results have sporadic nature...

Actions #4

Updated by fixeria almost 4 years ago

Running LCLS test cases third time without restarting osmo-bsc resulted in a segmentation fault:

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376
backtrace() returned 11 addresses
/usr/local/lib/libosmocore.so.12(osmo_panic+0xce) [0x7f6da763758e]
osmo-bsc(+0x35a25) [0x55707098ca25]
osmo-bsc(+0x3ada5) [0x557070991da5]
/usr/local/lib/libosmocore.so.12(+0x13104) [0x7f6da7630104]
/usr/local/lib/libosmocore.so.12(osmo_timers_update+0xc2) [0x7f6da762a062]
/usr/local/lib/libosmocore.so.12(+0xd5d2) [0x7f6da762a5d2]
/usr/local/lib/libosmocore.so.12(osmo_select_main_ctx+0xf) [0x7f6da762ad2f]
osmo-bsc(+0x195f1) [0x5570709705f1]
/usr/lib/libc.so.6(__libc_start_main+0xf2) [0x7f6da743f002]
osmo-bsc(+0x1973e) [0x55707097073e]
signal 6 received

I'll try to reproduce this and investigate the reason. Buiding with "-O0 -ggdb" now...

Actions #5

Updated by fixeria almost 4 years ago

This is reproducible, just run BSC_Tests_LCLS.control three times. Here is a backtrace:

Program received signal SIGABRT
pwndbg> bt
#0  0x00007ffff6767355 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff6750853 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff72b8593 in osmo_panic_default (args=0x7fffffffdc30, fmt=<optimized out>) at ../../../src/libosmocore/src/panic.c:49
#3  osmo_panic (fmt=<optimized out>) at ../../../src/libosmocore/src/panic.c:84
#4  0x0000555555be6e52 in conn_get_bts (conn=0x622000057160) at ../../include/osmocom/bsc/gsm_data.h:1376
#5  0x0000555555c1edc8 in assignment_fsm_timer_cb (fi=0x612000060220) at assignment_fsm.c:758
#6  0x00007ffff72b1104 in fsm_tmr_cb (data=0x612000060220) at ../../../src/libosmocore/src/fsm.c:325
#7  0x00007ffff72ab062 in osmo_timers_update () at ../../../src/libosmocore/src/timer.c:257
#8  0x00007ffff72ab5d2 in _osmo_select_main (polling=0) at ../../../src/libosmocore/src/select.c:260
#9  0x00007ffff72abd2f in osmo_select_main_ctx (polling=<optimized out>) at ../../../src/libosmocore/src/select.c:291
#10 0x0000555555e1b81b in main (argc=3, argv=0x7fffffffe1b8) at osmo_bsc_main.c:953
#11 0x00007ffff6752002 in __libc_start_main () from /usr/lib/libc.so.6
#12 0x0000555555b61bbe in _start ()
Actions #6

Updated by fixeria almost 4 years ago

Among with that, we seem to have a memory leak. This is what I see after running one of the LCLS tests several times:

OsmoBSC# show talloc-context application 3 tree 0x60b000000780
talloc report on 'osmo-bsc' (total 1852081 bytes in 738 blocks)
    ipa                            contains 1013233 bytes in  68 blocks (ref 0) 0x60b000000780
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000222460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000206460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0001ea460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0001ce460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0001b2460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000196460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00017a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00015e460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000142460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000126460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00010a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0000ee460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0000d2460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0000b6460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00009a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00007e460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000062460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000046460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00002a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00001c460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00000e460
      struct ipa_server_link         contains     96 bytes in   2 blocks (ref 0) 0x6100000028a0
      struct ipa_server_link         contains     96 bytes in   2 blocks (ref 0) 0x6100000027a0
Actions #7

Updated by neels almost 4 years ago

fixeria wrote:

#5 0x0000555555c1edc8 in assignment_fsm_timer_cb (fi=0x612000060220) at assignment_fsm.c:758

this looks like a timeout at a point where the lchan has already been torn down, so that code path should not OSMO_ASSERT(conn->lchan) but rather error handle.

(still doesn't explain the root cause of why the situation comes about)

Actions #8

Updated by fixeria almost 4 years ago

this looks like a timeout at a point where the lchan has already been torn down

Yes, this (usually) happens right after the test case execution is finished. Looks like a race condition to me.

Actions #9

Updated by fixeria almost 4 years ago

Among with that, we seem to have a memory leak.

And not just one memory leak, here is another one:

OsmoBSC# show talloc-context application 5 tree 0x6180000040e0
talloc report on 'osmo-bsc' (total 1900481 bytes in 745 blocks)
  struct gsm_network             contains 777050 bytes in 584 blocks (ref 0) 0x6160000000e0
    struct mgcp_client             contains   3488 bytes in  71 blocks (ref 0) 0x6180000040e0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009f160
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009f090
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000abc80
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000ab870
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000aab70
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000aaaa0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4d30
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4c60
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4510
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4100
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009b6e0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009b610
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009aec0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009aab0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000095080
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000094ad0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000917a0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000916d0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007efe0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007ef10
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007e7c0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007e3b0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008ae00
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008ad30
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008a5e0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008a1d0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083f80
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083eb0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083760
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083350
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007aad0
Actions #10

Updated by fixeria almost 4 years ago

And not just one memory leak, here is another one:

This might be related to:

DRLL ERROR mgcp_client_fsm.c:525 MGCP_CONN(to-MSC)[0x61200000bc20]{ST_READY}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX...
DLCLS ERROR osmo_bsc_lcls.c:539 LCLS[0x612000010120]{NO_LONGER_LS}: transition to state NOT_POSSIBLE_LS not permitted!
DMSC INFO osmo_bsc_sigtran.c:388 Tx MSC: BSSMAP: LCLS-NOTIFICATION
DRLL ERROR mgcp_client_fsm.c:525 MGCP_CONN(to-MSC)[0x612000010d20]{ST_READY}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX...

or

DLMGCP ERROR mgcp_client.c:700 Failed to read: r=127.0.0.1:2427<->l=127.0.0.1:2727: 111='Connection refused'
DLMGCP ERROR mgcp_client.c:700 Failed to read: r=127.0.0.1:2427<->l=127.0.0.1:2727: 111='Connection refused'

I could not find the reason why it's leaked, but found another problem while looking at the code:

https://gerrit.osmocom.org/c/osmo-mgw/+/18880 libosmo-mgcp-client: fix use-after-free in mgcp_client_tx()

Actions #11

Updated by fixeria almost 4 years ago

I could not find the reason why it's leaked, but found another problem while looking at the code:

Ok, this change fixes the problem for me:

https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak: clear pending queue

Actions #12

Updated by fixeria almost 4 years ago

  • Status changed from New to In Progress

Interestingly enough, reverting "src/input/ipaccess.c: set TCP_NODELAY" [1] makes the LCLS test results much more stable:

pass BSC_Tests_LCLS.TC_lcls_gcr_only
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_break
pass BSC_Tests_LCLS.TC_lcls_connect_clear
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass->FAIL: 1
  pass: 13
  skip: 149

[1] https://git.osmocom.org/libosmo-abis/commit/?id=62725d0b58a4842ce96ac9107c565de40fe4e945

Actions #13

Updated by fixeria almost 4 years ago

https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak: clear pending queue

As was recommended by neels, I took a different approach based on timers:

https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak in case if no response is received

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376

This problem is getting even more annoying...

Actions #14

Updated by fixeria almost 4 years ago

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376

According to the backtrace, the crash originates from assignment_fsm_timer_cb(). This callback tries to update some counters:

int assignment_fsm_timer_cb(struct osmo_fsm_inst *fi)
{
        struct gsm_subscriber_connection *conn = assignment_fi_conn(fi);
        assignment_count_result(CTR_ASSIGNMENT_TIMEOUT); // New: b5ccf09fc4 2020-06-08 14:36:46
        assignment_fail(GSM0808_CAUSE_EQUIPMENT_FAILURE, "Timeout");
        return 0;
}

so trying to follow (Ctrl+] in Vim) assignment_count_result() brings us to the macros hell:

#define assignment_count_result(counter) do { \
                if (!conn->assignment.result_rate_ctr_done) { \
                        assignment_count(counter); /* <--- We go here */ \
                        conn->assignment.result_rate_ctr_done = true; \
                } else \
                        LOG_ASSIGNMENT(conn, LOGL_DEBUG, \
                                       "result rate counter already recorded, NOT counting as: %s %s\n", \
                                       bsc_ctr_description[BSC_##counter].name, \
                                       bsc_ctr_description[BSC_##counter].description); \
        } while(0)

following the rabbit hole, we end up here:

/* Assume presence of local var 'conn' as struct gsm_subscriber_connection */
#define assignment_count(counter) do { \
                struct gsm_bts *bts = conn_get_bts(conn); /* <--- Here is the party pooper! */ \
                LOG_ASSIGNMENT(conn, LOGL_DEBUG, "incrementing rate counter: %s %s\n", \
                               bsc_ctr_description[BSC_##counter].name, \
                               bsc_ctr_description[BSC_##counter].description); \
                rate_ctr_inc(&conn->network->bsc_ctrs->ctr[BSC_##counter]); \
                rate_ctr_inc(&bts->bts_ctrs->ctr[BTS_##counter]); \
        } while(0)

so I think I found culprit:

commit b5ccf09fc4042c7fb1fdaaa6263961c40b32564e
Author: Daniel Willmann <dwillmann@sysmocom.de>
Date:   Mon Jun 8 14:36:46 2020 +0200

    Count assignment rates per BTS as well

    This adds the assignment counters for the BTS as well and changes the
    assignment_count() macro to increase both the counters for the BSC as
    well as the BTS.

    Related: SYS#4877
    Change-Id: I0009e51d4caf68e762138d98e2e23d49acc3cc1a
Actions #15

Updated by fixeria almost 4 years ago

Here is a potential fix:

https://gerrit.osmocom.org/c/osmo-bsc/+/18907 assignment_fsm: fix assignment_count(): conn->lchan may be NULL

the bug will be tracked in a separate ticket (see #4620), neels will complete the patch.

Actions #16

Updated by fixeria almost 4 years ago

  • % Done changed from 0 to 20

I executed BSC_Tests_LCLS.TC_lcls_gcr_only a few times and made packet captures of successful and unsuccessful outcomes.
Comparing the both captures reveals an interesting detail:

  • in case of a successful outcome:
    • the TC sends CM Service Request with cause MO call
    • the IUT sends MGCP CRCX to the virtual MGW
    • the IUT activates a TCH channel, TC ACKnowledges
    • the IUT sends Assignment Command
    • the TC responds with Assignment Complete
    • the TC responds with MGCP 200 OK
    • the IUT sends RSL ip.access CRCX, TC ACKnowledges
    • the IUT sends RSL ip.access MDCX, TC ACKnowledges
    • the IUT sends MGCP MDCX to the virtual MGW, TC responds with MGCP 200 OK
    • the IUT sends MGCP CRCX to the virtual MGW, TC responds with MGCP 200 OK
  • in case of a unsuccessful outcome:
    • the TC sends CM Service Request with cause MO call
    • the IUT sends MGCP CRCX to the virtual MGW
    • the IUT activates a TCH channel, TC ACKnowledges
    • the IUT sends Assignment Command
    • the TC responds with MGCP 200 OK
    • the IUT sends RSL ip.access CRCX, TC does not respond :/

so the key difference is that the IUT sends RSL ip.access CRCX before getting Assignment Complete from the TC. I am not familiar with these things, so my best guess is that the IUT is not doing anything wrong, it's just because the test case does not expect RSL ip.access CRCX. Looks like a race condition caused by decreased delays on the RSL link (thanks to TCP_NODELAY).

Actions #17

Updated by fixeria almost 4 years ago

Here is what happens on the TITAN's side in case of an unsuccessful outcome:

IPA0-RSL-IPA(8)@DELL: dec_PDU_IPA(): Stream before decoding: '0008007E70010AF410F203'O                                                                                   
IPA0-RSL-IPA(8)@DELL: dec_PDU_IPA(): Decoded @IPA_Types.PDU_IPA: { lengthInd := 8, streamId := IPAC_PROTO_RSL_TRX0 (0), streamIdExt := omit, payloadData := '7E70010AF410F203'O }                                                                                                                                                                 
IPA0-RSL-IPA(8)@DELL: Matching on port IPA_PORT IPAC_PROTO_RSL_TRX0 (0) with IPAC_PROTO_CCM (254) unmatched: First message in the queue does not match the template:     
IPA0-RSL-IPA(8)@DELL: Matching on port IPA_PORT succeeded:  matched                                                                                                      
IPA0-RSL-IPA(8)@DELL: dec_RSL_Message(): Stream before decoding: '7E70010AF410F203'O                                                                                     
IPA0-RSL-IPA(8)@DELL: dec_RSL_Message(): Decoded @RSL_Types.RSL_Message: { msg_disc := { msg_group := RSL_MDISC_IPACCESS (63), transparent := false }, msg_type := RSL_MT_IPAC_CRCX (112), ies := { { iei := RSL_IE_CHAN_NR (1), body := { chan_nr := { u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } } }, { iei := RSL_IE_IPAC_SPEECH_MODE (244), body := { ipa_speech_mode := { reserved := '00'B, mode := RSL_IPA_SPM_RECVONLY (1), codec := RSL_IPA_CODEC_FR (0) } } }, { iei := RSL_IE_IPAC_RTP_PAYLOAD (242), body := { ipa_rtp_pt := 3 } } } }

IPA0-RSL-RSL(9)@DELL: No Dchan handler for trx_nr=0 and chan_nr={ u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 }

TC_lcls_gcr_only-B(11)@DELL: Rx L3 from net: { discriminator := '0110'B, tiOrSkip := { skipIndicator := '0000'B }, msgs := { rrm := { assignmentCommand := { messageType 
:= '00101110'B, descrOf1stChAfterTime := { timeslotNumber := '010'B, channelTypeandTDMAOffset := '00001'B, octet3 := '43'O ("C"), octet4 := '67'O ("g") }, PowerCommand :
= { powerlevel := '00111'B, fPC_EP := '0'B, ePC_Mode := '0'B, spare_1 := '0'B }, frequencyList_at := omit, cellChannelDescr := omit, descrMultislotAllocation := omit, mo
deOf1stChannel := { elementIdentifier := '63'O ("c"), mode := '01'O }, channelSet2 := omit, channelSet3 := omit, channelSet4 := omit, channelSet5 := omit, channelSet6 :=
 omit, channelSet7 := omit, channelSet8 := omit, descrOf2ndChAfterTime := omit, modeOf2ndChannel := omit, mobileAllocation_at := omit, startingTime := omit, frequencyLis
t_bt := omit, descrOf1stCh_bt := omit, descrOf2ndCh_bt := omit, frequencyChannelSequence := omit, mobileAllocation_bt := omit, cipherModeSetting := omit, vGCS_TargetMode
Indication := omit, multiRateConfiguration := omit, vGCS_Ciphering_Parameters := omit, extendedTSCSet_afterTime := omit, extendedTSCSet_beforeTime := omit } } } }

so it fails because RSL ip.access CRCX is somehow received before the Assignment Command. Usually, on receipt of an Assignment Command the test case calls f_rslem_register(chan_nr), so the RSL Emulation knows were to dispatch messages related to a given chan_nr. In this case the order of messages is messed up, so the RSL Emulation has no idea where to dispatch the received RSL ip.access CRCX message.

This is very odd, given that in Wireshark the order looks correct (see attached capture).

Actions #18

Updated by fixeria almost 4 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 20 to 40

This is very odd, given that in Wireshark the order looks correct (see attached capture).

Ok, I guess I know what's happening. The test case consists of several components running in parallel. Even if the messages are received in the correct order (I am pretty sure they are), there is still a risk that the RSL Emulation component (IPA0-RSL-RSL@DELL) would process RSL ip.access CRCX faster than the test case component (TC_lcls_gcr_only-B(11)@DELL) would process Assignment Command and register a new expectation in the RSL Emulation component. The TCP_NODELAY itself is not a problem, it just makes osmo-bsc send RSL ip.access CRCX faster, so we hit this race condition more often than without this option.

So this is clearly a problem of the test suite, and I don't know how to solve this problem. It's unfortunately the dark side of concurrency. Probably we can implement a queue in the RSL Emulation component, that would absorb all messages for which there is currently no handler. As soon as a new chan_nr/handler is registered, it would iterate over the queue and dispatch stalled messages.

laforge what do you think?

Actions #19

Updated by fixeria almost 4 years ago

P.S. I believe that sporadic test case failures in ttcn3-bsc-test-sccplite [1] are also caused by this problem. All of them fail with:

RSL for unknown Dchan
      BSC_Tests.ttcn:4501 BSC_Tests control part
      BSC_Tests.ttcn:2176 TC_assignment_codec_fr testcase

[1] https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-sccplite/test_results_analyzer/

Actions #20

Updated by laforge almost 4 years ago

I think we already have the capability to temporarily suspend processing RSL messages
during hand-over, see f_rslem_suspend(). Maybe this approach can also be used here?

Actions #21

Updated by fixeria almost 4 years ago

laforge wrote:

I think we already have the capability to temporarily suspend processing RSL messages
during hand-over, see f_rslem_suspend(). Maybe this approach can also be used here?

Thanks for a hint. Unfortunately, I doubt that f_rslem_suspend() would help here. If we suspend processing of RSL messages, we would not get the Assignment Command, and thus wound not send Assignment Complete to the IUT. I did a quick test and indeed it breaks the assignment :/

I'll implement a queue. Given that I am now familiar with the RSL Emulation code, I think it will be easy.

Actions #22

Updated by fixeria almost 4 years ago

  • % Done changed from 40 to 90

Hooray! This change makes all LCLS test cases pass on my machine with TCP_NODELAY enabled:

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18923 library/RSL_Emulation: implement waiting queue for DCHAN messages

Actions #23

Updated by fixeria almost 4 years ago

Just to summarize all comments unrelated to the LCLS test case failures:

Actions #24

Updated by fixeria almost 4 years ago

  • % Done changed from 90 to 100

As was discussed with laforge, I re-implemented queuing, so it needs to be enabled and disabled explicitly by calling f_rslem_dchan_queue_enable() and f_rslem_dchan_queue_dispatch():

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18947 library/RSL_Emulation: fix comment near f_rslem_suspend()
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18923 library/RSL_Emulation: implement waiting queue for DChan messages
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18948 MSC_ConnectionHandler: fix race condition in f_establish_fully()

pass BSC_Tests_LCLS.TC_lcls_gcr_only
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass BSC_Tests_LCLS.TC_lcls_connect_break
pass BSC_Tests_LCLS.TC_lcls_connect_clear
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass: 14
  skip: 149
Actions #25

Updated by laforge almost 4 years ago

Patch merged

Actions #26

Updated by fixeria almost 4 years ago

  • Priority changed from High to Low

I was hoping to see all LCLS test cases green in ttcn3-bsc-test build#1019 and ttcn3-centos-bsc-test build#24, but only some of them became green. The rest started to fail due to a different reason (not seen before):

"MSC_ConnectionHandler.ttcn:1034 : Timeout waiting for COMPLETE LAYER 3 INFORMATION" 
      BSC_Tests_LCLS.ttcn:745 BSC_Tests_LCLS control part
      BSC_Tests_LCLS.ttcn:419 TC_lcls_gcr_bway_codec_mismatch testcase

As it turns out, this a regression caused by:

https://gerrit.osmocom.org/c/osmo-bsc/+/18907 fix crashes due to OSMO_ASSERT(conn->lchan)

20200624220227798 DMSC ERROR gsm_08_08.c:488 Compl L3 without lchan
20200624220227798 DLINP ERROR ipaccess.c:393 Bad signalling message, sign_link returned error: Operation not permitted.

that should have been fixed by:

https://gerrit.osmocom.org/c/osmo-bsc/+/18993 fix segfault introduced by recent segfault fix

Unfortunately, the recent shutdown FSM implementation in osmo-bts somehow broke osmo-bts-omldummy, so currently all test cases fail:

https://jenkins.osmocom.org/jenkins/view/TTCN3-centos/job/TTCN3-centos-bsc-test/26/
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/1020/
https://lists.osmocom.org/pipermail/openbsc/2020-June/013195.html

Since they all pass on my machine with older osmo-bts-omldummy, setting priority to Low.
I'll close this ticket as soon as we see all LCLS test cases green in Jenkins.

Actions #27

Updated by fixeria almost 4 years ago

  • Status changed from Feedback to Stalled

I'll close this ticket as soon as we see all LCLS test cases green in Jenkins.

Somehow 2/3 test cases still fail. No more time to work on this, will try to investigate later.

https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/1024/testReport/(root)/BSC_Tests_LCLS/

"MSC_ConnectionHandler.ttcn:1038 : Timeout waiting for COMPLETE LAYER 3 INFORMATION" 
      BSC_Tests_LCLS.ttcn:742 BSC_Tests_LCLS control part
      BSC_Tests_LCLS.ttcn:262 TC_lcls_gcr_only testcase

P.S. I just tried to run all LCLS test cases on my machine (natively and in Docker), all pass.

Actions #28

Updated by fixeria almost 4 years ago

Huh, this may (potentially) be caused by the test cases we run before LCLS group, e.g. the following ones:

/* at bottom as they might crash OsmoBSC before OS#3182 is fixed */
execute( TC_early_conn_fail() );
execute( TC_late_conn_fail() );
Actions #29

Updated by fixeria over 3 years ago

  • Status changed from Stalled to In Progress
  • Priority changed from Low to High
Actions #30

Updated by fixeria over 3 years ago

  • Status changed from In Progress to New
  • Assignee changed from fixeria to neels
  • Priority changed from High to Normal
  • % Done changed from 100 to 80

Quick *.pcap analysis (latest vs master) shows that the breakage seems to be caused by the round-robin MSC selection.

Successful test case execution, see https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-latest/645/artifact/logs/bsc-tester/BSC_Tests_LCLS.TC_lcls_gcr_only.pcap.gz:

 1401   3.222635  172.18.2.20 → 172.18.2.203 RSL 100  IMMEDIATE ASSIGN COMMAND (CCCH) (RR) Immediate Assignment
 1402   3.225732 172.18.2.203 → 172.18.2.20  RSL  96  ESTablish INDication (DTAP) (MM) CM Service Request

 1410   3.225935  172.18.2.20 → 172.18.2.203 GSMTAP 160 MSC Tx MSC COMPL L3
 1411   3.225956  172.18.2.20 → 172.18.2.203 GSMTAP 235 MSC Initializing resources for new SIGTRAN connection to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP...

 1417   3.226055  172.18.2.20 → 172.18.2.203 GSMTAP 176 MSC Allocated new connection id: 98
 1418   3.226086  172.18.2.20 → 172.18.2.203 GSMTAP 221 MSC Opening new SIGTRAN connection (id=98) to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP

 1451   3.226594 172.18.2.200 → 172.18.2.203 GSMTAP 205 LSS7 XUA_AS(virt-msc0){AS_ACTIVE}: Received Event AS-TRANSFER.req
 1452   3.226607 172.18.2.200 → 172.18.2.203 GSMTAP 165 LINP connected read/write
 1453   3.226612 172.18.2.200 → 172.18.2.203 GSMTAP 157 LINP sending data

 1454   3.226623          187 → 185           BSSAP 160      CR (BSSMAP) Complete Layer 3 Information (DTAP) (MM) CM Service Request

Unsuccessful test case execution, see https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/lastCompletedBuild/artifact/logs/bsc-tester/BSC_Tests_LCLS.TC_lcls_gcr_only.pcap.gz:

  342   3.160835  172.18.2.20 → 172.18.2.203 RSL 100  IMMEDIATE ASSIGN COMMAND (CCCH) (RR) Immediate Assignment
  343   3.163572 172.18.2.203 → 172.18.2.20  RSL 96  ESTablish INDication (DTAP) (MM) CM Service Request

  351   3.163697  172.18.2.20 → 172.18.2.203 GSMTAP 160 MSC Tx MSC COMPL L3
  352   3.163709  172.18.2.20 → 172.18.2.203 GSMTAP 211 MSC New subscriber IMSI-001019876543210: MSC round-robin selects msc 1
  353   3.163721  172.18.2.20 → 172.18.2.203 GSMTAP 231 MSC Initializing resources for new SCCP connection to MSC: RI=SSN_PC,PC=0.0.2,SSN=BSSAP...

  358   3.163767  172.18.2.20 → 172.18.2.203 GSMTAP 177 MSC Allocated new connection id: 133
  359   3.163776  172.18.2.20 → 172.18.2.203 GSMTAP 218 MSC Opening new SCCP connection (id=133) to MSC: RI=SSN_PC,PC=0.0.2,SSN=BSSAP

  391   3.164242 172.18.2.200 → 172.18.2.203 GSMTAP 203 LSS7 XUA_AS(virt-msc1){AS_DOWN}: Received Event AS-TRANSFER.req
  392   3.164251 172.18.2.200 → 172.18.2.203 GSMTAP 208 LSS7 XUA_AS(virt-msc1){AS_DOWN}: Event AS-TRANSFER.req not permitted

So AFAIU, the BSC is trying to send a Complete Layer 3 Information to a different MSC (PC=0.23.1 vs PC=0.0.2), so the test case never gets it it.

Assigning to neels since he has been working on the related topic.

Actions #31

Updated by laforge over 3 years ago

  • Priority changed from Normal to High
Actions #32

Updated by neels over 3 years ago

Found out that the failure in TC_lcls_gcr_only is related to leftover state from a previous MSC pooling test:
'msc 1' still appears connected to osmo-bsc, even though the ttcn3 init omits connecting that MSC.

Now I need to pinpoint what exactly the right mechanism would be for osmo-bsc to realize that an MSC is no longer connected.
I realize that I haven't paid enough attention to that aspect yet.

However, the TC_lcls_gcr_bway_connect test properly picks 'msc 0' and still fails, not sure yet of the reason.

Actions #33

Updated by neels over 3 years ago

  • Status changed from New to In Progress
Actions #34

Updated by laforge over 3 years ago

On Fri, Aug 07, 2020 at 01:44:00PM +0000, neels [REDMINE] wrote:

Now I need to pinpoint what exactly the right mechanism would be for osmo-bsc to realize that an MSC is no longer connected.

Without any active calls (SCCP connections) you don't really know.

The only situations are:
  • SCCP connection timers timing out (if there are any active SCCP connections)
  • SCCP network informing the user about unavailability of a given point code
    • we don't imlpement this in Osmocom so far, but there are messages like
      M3UA DUNA/DAVA that can be used by an STP to notify M3UA ASPs like the BSC about
      availability/unavailability of a given point code.
    • libosmo-sigtan would have to parse such messages and send N-NOTICE.indication
      from the SCCP provider to the SCCP user (local SCCP user like MSC, BSC, ...),
      as well as N-INFO.indication for active SCCP connections
Actions #35

Updated by neels over 3 years ago

  • Related to Feature #4701: implement OsmoSTP notification of peers disconnecting, e.g. for OsmoBSC to detect that a specific MSC in the pool is disconnected added
Actions #36

Updated by neels over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

The LCLS test breakage is fixed by https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/19554 by setting 'no allow-attach' on MSCs that are not intended to be used by a test, during f_init().
Detecting SCCP detach of MSCs should be a separate issue from this one: #4701

Closing this issue since all LCLS tests pass again since the patch was merged.

Actions #37

Updated by neels over 3 years ago

  • Status changed from Resolved to In Progress
  • % Done changed from 100 to 90

re-opening, apparently I broke the 'latest' build with the 'master' fix

Actions #38

Updated by neels over 3 years ago

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/19567
will fast track this change to fix the 'latest' builds.

Actions #39

Updated by laforge over 3 years ago

  • Status changed from In Progress to Resolved

can be considered resolved. both nightly and latest LCLS tests pass again.

Actions #40

Updated by fixeria over 3 years ago

  • Related to Bug #4899: ttcn3-bsc-test-latest BSC_Tests_LCLS failing since a few days ago added
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)