Project

General

Profile

Actions

Bug #3680

closed

osmo-bsc: smpp.esme_ms_sms_storeforward.py test fails ocasionally due to paging response arriving after T3113 expired

Added by pespin over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
11/02/2018
Due date:
% Done:

100%

Spec Reference:

Description

Test smpp.esme_ms_sms_storeforward.py fails ocassionaly with following error:

  File "/home/jenkins/workspace/osmo-gsm-tester_run-prod/osmo-gsm-tester/suites/smpp/esme_ms_sms_storeforward.py", line 62, in <module>
    wait(ms.sms_was_received, msg)
  File "/home/jenkins/workspace/osmo-gsm-tester_run-prod/osmo-gsm-tester/src/osmo_gsm_tester/testenv.py", line 48, in <lambda>
    wait = lambda *args, **kwargs: event_module.wait(suite_run, *args, **kwargs)
  File "/home/jenkins/workspace/osmo-gsm-tester_run-prod/osmo-gsm-tester/src/osmo_gsm_tester/event_loop.py", line 111, in wait
    raise log.Error('Wait timeout', condition=condition, timeout=timeout, timestep=timestep)
osmo_gsm_tester.log.Error: Wait timeout {condition=<bound method Modem.sms_was_received of /sierra_2>, timeout=300, timestep=1}

Attached run tgz of failing test.

In the pcap file under bsc dir (GSMTAP+RSL), it can be seen that BSC starts paging the BTS, but eventually it expires (T3113) due to no Paging Response from MS. But immediately after (few ms later) the Paging Response arrives, but then it finds no subscriber (probably removed due to T3113 expiration. Then, nothing is communicated towards the MS (not sure what should be sent in that case, but I suspect the MS expects the BSC to do next step).

20181102063651613 DPAG <0005> paging.c:90 (bts=0) Going to send paging commands: imsi: 901700000015252 tmsi: 0x314b6ebe for ch. type 0 (attempt 16)
20181102063651613 DLMI <0017> input/ipaccess.c:356 TX 2: 0c 15 01 90 0e 02 0c 05 f4 31 4b 6e be 28 00
20181102063652114 DPAG <0005> paging.c:90 (bts=0) Going to send paging commands: imsi: 901700000015252 tmsi: 0x314b6ebe for ch. type 0 (attempt 17)
20181102063652114 DLMI <0017> input/ipaccess.c:356 TX 2: 0c 15 01 90 0e 02 0c 05 f4 31 4b 6e be 28 00
20181102063652190 DLMI <0017> input/ipaccess.c:251 RX 2: 0c 12 01 90 0f 00 c7
20181102063652615 DPAG <0005> paging.c:90 (bts=0) Going to send paging commands: imsi: 901700000015252 tmsi: 0x314b6ebe for ch. type 0 (attempt 18)
20181102063652615 DLMI <0017> input/ipaccess.c:356 TX 2: 0c 15 01 90 0e 02 0c 05 f4 31 4b 6e be 28 00
20181102063653097 DPAG <0005> paging.c:281 T3113 expired for request 0x611000000e60 (IMSI:901700000015252)
20181102063653097 DREF <000a> paging.c:75 BSC subscr IMSI:901700000015252 usage decreases to: 0
20181102063653190 DLMI <0017> input/ipaccess.c:251 RX 2: 0c 12 01 90 0f 00 c7
20181102063653652 DLMI <0017> input/ipaccess.c:251 RX 2: 0c 13 01 88 13 8e 5a f7 11 00
20181102063653653 DRSL <0003> abis_rsl.c:1362 (bts=0) CHAN RQD: reason: answer to paging (ra=0x8e, neci=0x01, chreq_reason=0x01)
20181102063653653 DRLL <0000> lchan_select.c:159 (bts=0) lchan_select_by_type(SDCCH)
20181102063653653 DRLL <0000> lchan_select.c:86 looking for lchan CCCH+SDCCH4: (bts=0,trx=0,ts=0,pchan=CCCH+SDCCH4,state=UNUSED) ss=0 is available
20181102063653653 DCHAN <0010> lchan_select.c:253 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{UNUSED}: (type=SDCCH) Selected
20181102063653653 DCHAN <0010> abis_rsl.c:1426 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{UNUSED}: (type=SDCCH) MS: Channel Request: reason=PAGING ra=0x8e ta=0
20181102063653653 DCHAN <0010> lchan_fsm.c:318 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{UNUSED}: Received Event LCHAN_EV_ACTIVATE
20181102063653653 DCHAN <0010> lchan_fsm.c:547 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{UNUSED}: state_chg to WAIT_TS_READY
20181102063653653 DCHAN <0010> lchan_fsm.c:573 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_TS_READY}: (type=SDCCH) Activation requested: FOR_MS_CHANNEL_REQUEST voice=no MGW-ci=none type=SDCCH tch-mode=SIGNALLING
20181102063653653 DTS <0011> lchan_fsm.c:577 timeslot(0-0-0-CCCH_SDCCH4)[0x612000009220]{UNUSED}: Received Event TS_EV_LCHAN_REQUESTED
20181102063653653 DTS <0011> timeslot_fsm.c:328 timeslot(0-0-0-CCCH_SDCCH4)[0x612000009220]{UNUSED}: state_chg to IN_USE
20181102063653653 DCHAN <0010> timeslot_fsm.c:104 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_TS_READY}: Received Event LCHAN_EV_TS_READY
20181102063653653 DCHAN <0010> lchan_fsm.c:591 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_TS_READY}: state_chg to WAIT_ACTIV_ACK
20181102063653653 DRSL <0003> abis_rsl.c:476 (bts=0,trx=0,ts=0,pchan=CCCH+SDCCH4,state=IN_USE) Tx RSL Channel Activate with act_type=INITIAL
20181102063653654 DLMI <0017> input/ipaccess.c:356 TX 2: 08 21 01 20 03 00 06 04 00 03 01 00 05 06 64 20 e3 64 72 00 04 00 0d 1f 18 00
20181102063653661 DLMI <0017> input/ipaccess.c:251 RX 2: 08 22 01 20 08 5b e5
20181102063653662 DCHAN <0010> abis_rsl.c:1129 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_ACTIV_ACK}: (type=SDCCH) Rx CHAN_ACTIV_ACK
20181102063653662 DCHAN <0010> abis_rsl.c:1141 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_ACTIV_ACK}: Received Event LCHAN_EV_RSL_CHAN_ACTIV_ACK
20181102063653662 DCHAN <0010> lchan_fsm.c:715 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_ACTIV_ACK}: (type=SDCCH) Tx RR Immediate Assignment
20181102063653662 DCHAN <0010> lchan_fsm.c:764 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_ACTIV_ACK}: state_chg to WAIT_RLL_RTP_ESTABLISH
20181102063653662 DLMI <0017> input/ipaccess.c:356 TX 2: 0c 16 01 90 2b 17 2d 06 3f 03 20 e3 64 8e 5a f7 00 00 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b 2b
20181102063653707 DHODEC <0009> handover_decision_2.c:1847 (BTS 0) No congestion check: Assignment and Handover both disabled
20181102063653708 DHODEC <0009> handover_decision_2.c:162 HO algorithm 2: next periodical congestion check in 10 seconds
20181102063653967 DLMI <0017> input/ipaccess.c:251 RX 2: 02 06 01 20 02 00 0b 00 0d 06 27 07 03 50 58 a6 05 f4 31 4b 6e be
20181102063653967 DCHAN <0010> abis_rsl.c:1614 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_RLL_RTP_ESTABLISH}: (type=SDCCH) SAPI=0 ESTABLISH INDICATION
20181102063653967 DCHAN <0010> abis_rsl.c:1647 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_RLL_RTP_ESTABLISH}: Received Event LCHAN_EV_RLL_ESTABLISH_IND
20181102063653967 DCHAN <0010> lchan_fsm.c:782 lchan(0-0-0-CCCH_SDCCH4-0)[0x612000007ba0]{WAIT_RLL_RTP_ESTABLISH}: state_chg to ESTABLISHED
20181102063653967 DMSC <0007> fsm.c:299 SUBSCR_CONN[0x612000005aa0]{INIT}: Allocated
20181102063653967 DLCLS <000f> fsm.c:299 LCLS[0x612000005920]{NO_LCLS}: Allocated
20181102063653967 DLCLS <000f> fsm.c:329 LCLS[0x612000005920]{NO_LCLS}: is child of SUBSCR_CONN[0x612000005aa0]
20181102063653967 DMSC <0007> gsm_08_08.c:374 Tx MSC COMPL L3
20181102063653967 DRR <0002> gsm_08_08.c:269 PAGING RESPONSE: MI(TMSI)=827027134
20181102063653968 DMSC <0007> gsm_08_08.c:344 Got paged but no subscriber found.
20181102063653968 DMSC <0007> gsm_08_08.c:379 Failed to find a MSC for a connection.
20181102063654168 DLINP <0015> stream.c:275 connected read

Then, a while later a new (actually probably a retransmission of the same one) Paging CMD arrives from MSC, and then BTS/MS do nothing, probably due to still waiting for some feedback from previous one, and test goes on with BSC sending Paging CMD all he time to the BTS and nothing else happening until test times out and fails.


Files

esme_ms_sms_storeforward.tgz esme_ms_sms_storeforward.tgz 391 KB pespin, 11/02/2018 11:29 AM
Actions #1

Updated by pespin over 5 years ago

As far as I understand, upon not finding the originating Paging Command for the Paging Response, the BSC should tell the BTS to close the lower layer channel MS<->BTS so MS can go back to idle modem and start listening for new Paging. Right now the BTS keeps sending pagings to the MS once the entire procedure is restarted (MSC->BSC->BTS), but previous channel is still allocated for the MS and the MS is listening to it.

So from looking at the log of osmo-bsc, we have bsc_compl_l3:

/*! MS->MSC: New MM context with L3 payload. */
int bsc_compl_l3(struct gsm_subscriber_connection *conn, struct msgb *msg, uint16_t chosen_channel)
{
    struct bsc_msc_data *msc;

    LOGP(DMSC, LOGL_INFO, "Tx MSC COMPL L3\n");

    /* find the MSC link we want to use */
    msc = bsc_find_msc(conn, msg);
    if (!msc) {
        LOGP(DMSC, LOGL_ERROR, "Failed to find a MSC for a connection.\n");
        bsc_send_ussd_no_srv(conn, msg,
                     conn_get_bts(conn)->network->bsc_data->ussd_no_msc_txt);
        return -1;
    }

    return complete_layer3(conn, msg, msc);
}

In this case, bsc_find_msc returns NULL, so subscr is NULL and ends up calling bsc_send_ussd_no_srv. That function, however seems to be not closing the channel with the MS. Function bsc_compl_l3 returns -1 but its unique caller gsm0408_rcvmsg:

/*! \brief RSL has received a DATA INDICATION with L3 from MS */
int gsm0408_rcvmsg(struct msgb *msg, uint8_t link_id)
{
...
        /* fwd via bsc_api to send COMPLETE L3 INFO to MSC */
        bsc_compl_l3(lchan->conn, msg, 0);
        /* conn shall release lchan on teardown, also if this Layer 3 Complete is rejected. */

I think we should call lchan_release() if bsc_compl_l3 returns -1.

Actions #2

Updated by pespin over 5 years ago

Looking at history of related code, in commit 3561bd48976dbee8dbd4659dad15be85a3e79ace it used to have a lchan_release:

                /* fwd via bsc_api to send COMPLETE L3 INFO to MSC */
                rc = api->compl_l3(lchan->conn, msg, 0);

                if (rc != BSC_API_CONN_POL_ACCEPT) {
-                       lchan->conn->lchan = NULL;
-                       bsc_subscr_con_free(lchan->conn);
-                       lchan_release(lchan, 1, RSL_REL_NORMAL);
+                       //osmo_fsm_inst_dispatch(lchan->conn->fi, FIXME, NULL);
                }
        }

Later on it was again updated by 31f525e7560ad13e32cfc5e0b5f1862c0efcb991:

                /* fwd via bsc_api to send COMPLETE L3 INFO to MSC */
-               rc = bsc_compl_l3(lchan->conn, msg, 0);
-
-               if (rc != BSC_API_CONN_POL_ACCEPT) {
-                       //osmo_fsm_inst_dispatch(lchan->conn->fi, FIXME, NULL);
-               }
+               bsc_compl_l3(lchan->conn, msg, 0);
+               /* conn shall release lchan on teardown, also if this Layer 3 Complete is rejected. */

Actions #3

Updated by pespin over 5 years ago

Looking at the code indeed the previous expiration removes the subscriber, that's why later on when paging Response arrives path bsc_subscr_find_by_tmsi -> extract_sub -> bsc_find_msc -> bsc_compl_l3 returns NULL and gsm0408_rcvmsg doesn't care about the error.

The question now is how to ask the BSC in that part of the code to run lchan_release, because I have the feeling I cannot longer call lchan_release there, but need to somehow provoke some change or signal in the related FSM.

Actions #4

Updated by pespin over 5 years ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 90

Submitted a TTCN3 test to reproduce the issue:
https://gerrit.osmocom.org/#/c/osmo-ttcn3-hacks/+/11604 bsc: Introduce test TC_paging_resp_unsol

And this commit should fix the issue in osmo-bsc. I'm not sure if it's the correct way to release the channel though:
remote: https://gerrit.osmocom.org/#/c/osmo-bsc/+/11605 bsc: gsm0808: Fix return code in bsc_compl_l3
remote: https://gerrit.osmocom.org/#/c/osmo-bsc/+/11606 gsm0408_rcvmsg: Release lchan if L3 fails to complete

Actions #5

Updated by pespin over 5 years ago

From this issue we also can infer that the default T3113 timer we are using is not long enough (since MS can answer after we expire it). Default T3113 timer should be calculated dynamically based on network channel configuration.

Right now, the T3113 timeout is being set in paging.c _paging_request:

    osmo_timer_setup(&req->T3113, paging_T3113_expired, req);
    osmo_timer_schedule(&req->T3113, T_def_get(bts->network->T_defs, 3113, T_S, -1), 0);
    llist_add_tail(&req->entry, &bts_entry->pending_requests);

function T_def_get can be found in osmo-bsc.git gsm_timers.c.

paging timer is defined in the global struct in net_init.c:

     { .T=3113, .default_val=10, .desc="Paging"},

which matches the 10 seconds timeout we see in pcap file from this issue.

Actions #6

Updated by pespin over 5 years ago

Submitted following commit. It can be used as a starting point for discussion:
https://gerrit.osmocom.org/#/c/osmo-bsc/+/11654 paging: Make default T3113 timeout dynamic

Actions #7

Updated by laforge over 5 years ago

  • Priority changed from Normal to High

pespin wrote:

Submitted following commit. It can be used as a starting point for discussion:
https://gerrit.osmocom.org/#/c/osmo-bsc/+/11654 paging: Make default T3113 timeout dynamic

Even though the patch was merged, the TTCN3 test still fails. Please investigate.

Actions #8

Updated by pespin over 5 years ago

That patch was not the one fixing the scenario for which the TTCN3 test was applied. That's only an improvement when calculating the timer.

Fix is available here:
https://gerrit.osmocom.org/#/c/osmo-ttcn3-hacks/+/12196/

Actions #9

Updated by laforge over 5 years ago

On Fri, Dec 07, 2018 at 11:46:57AM +0000, pespin [REDMINE] wrote:

Fix is available here:
https://gerrit.osmocom.org/#/c/osmo-ttcn3-hacks/+/12196/

fix was just merged.

Actions #10

Updated by pespin over 5 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 90 to 100

Test is green now, closing.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)