Project

General

Profile

Actions

Bug #4620

closed

ttcn3-bsc-sccplite: crash in BSC_Tests.TC_assignment_codec_amr_f (introduced by "count assignment rates per bts")

Added by neels almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
06/18/2020
Due date:
% Done:

0%

Spec Reference:

Description

since one or two days ago, ttcn3-bsc-sccplite fails, by osmo-bsc crashing during BSC_Tests.TC_assignment_codec_amr_f.
I couldn't get this to happen in a manual run, but using the unmodified docker images, the crash is reproducable.

It seems that a test hits an error or the end, and that because connections are released, the conn->lchan is freed.
After that, osmo-bsc hits an assertion on conn->lchan:

20200618153244799 DAS DEBUG assignment(conn25_0-0-1-TCH_F-0)[0x55f0a64979f0]{WAIT_RR_ASS_COMPLETE}: Timeout of T10 (fsm.c:322)
Assert failed conn->lchan ../../include/osmocom/bsc/gsm_data.h:1376

Related issues

Related to OsmoBSC - Bug #3744: include/osmocom/bsc/gsm_data.h: OSMO_ASSERT(conn->lchan); failed in conn_get_bts()Resolvedfixeria12/30/2018

Actions
Actions #1

Updated by neels almost 4 years ago

the error seems to happen at different times, depending on the timing(?)

jenkins shows:

[0;m[1;31m20200618083656230 [1;31mDRLL[0;m[1;31m <0000> mgcp_client_fsm.c:526 MGCP_CONN(to-BTS)[0x555555b32910]{ST_DLCX_RESP}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX...
[0;m20200618083656230 [1;31mDLMGCP[0;m <0023> mgcp_client.c:701 Failed to read: r=172.18.12.203:2427<->l=172.18.12.20:2727: 111='Connection refused'
[0;mAssert failed conn->lchan ../../include/osmocom/bsc/gsm_data.h:1376

while a local run shows:

20200618153242817 DRLL ERROR MGCP_CONN(to-BTS)[0x55f0a6491e80]{ST_DLCX_RESP}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX... (mgcp_client_fsm.c:526)
20200618153242817 DRLL DEBUG MGCP_CONN(to-BTS)[0x55f0a6491e80]{ST_DLCX_RESP}: Freeing instance (mgcp_client_fsm.c:496)
20200618153242817 DRLL DEBUG MGCP_CONN(to-BTS)[0x55f0a6491e80]{ST_DLCX_RESP}: Deallocated (fsm.c:573)
20200618153242817 DLMGCP ERROR Failed to read: r=172.18.12.203:2427<->l=172.18.12.20:2727: 111='Connection refused' (mgcp_client.c:701)
20200618153243361 DHODEC DEBUG (BTS 0) No congestion check: no minimum for free TCH/F nor TCH/H set (handover_decision_2.c:1846)
20200618153243361 DHODEC DEBUG (BTS 1) No congestion check: no minimum for free TCH/F nor TCH/H set (handover_decision_2.c:1846)
20200618153243361 DHODEC DEBUG (BTS 2) No congestion check: no minimum for free TCH/F nor TCH/H set (handover_decision_2.c:1846)
20200618153243362 DHODEC DEBUG (BTS 3) No congestion check: TRX 0 not usable (handover_decision_2.c:1830)
20200618153243362 DHODEC DEBUG HO algorithm 2: next periodical congestion check in 10 seconds (handover_decision_2.c:162)
20200618153244799 DAS DEBUG assignment(conn25_0-0-1-TCH_F-0)[0x55f0a64979f0]{WAIT_RR_ASS_COMPLETE}: Timeout of T10 (fsm.c:322)
Assert failed conn->lchan ../../include/osmocom/bsc/gsm_data.h:1376

i.e. one during MGCP client teardown, the other during assignment FSM teardown a bit later.

It is obvious that the OSMO_ASSERT(conn->lchan) is too drastic,
but I'd also like to understand the root cause for things going wrong in the first place.

Actions #2

Updated by neels almost 4 years ago

in #4619, fixeria wrote:

"

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376

According to the backtrace, the crash originates from assignment_fsm_timer_cb(). This callback tries to update some counters:

int assignment_fsm_timer_cb(struct osmo_fsm_inst *fi)
{
        struct gsm_subscriber_connection *conn = assignment_fi_conn(fi);
        assignment_count_result(CTR_ASSIGNMENT_TIMEOUT); // New: b5ccf09fc4 2020-06-08 14:36:46
        assignment_fail(GSM0808_CAUSE_EQUIPMENT_FAILURE, "Timeout");
        return 0;
}

so trying to follow (Ctrl+] in Vim) assignment_count_result() brings us to the macros hell:

#define assignment_count_result(counter) do { \
                if (!conn->assignment.result_rate_ctr_done) { \
                        assignment_count(counter); /* <--- We go here */ \
                        conn->assignment.result_rate_ctr_done = true; \
                } else \
                        LOG_ASSIGNMENT(conn, LOGL_DEBUG, \
                                       "result rate counter already recorded, NOT counting as: %s %s\n", \
                                       bsc_ctr_description[BSC_##counter].name, \
                                       bsc_ctr_description[BSC_##counter].description); \
        } while(0)

following the rabbit hole, we end up here:

/* Assume presence of local var 'conn' as struct gsm_subscriber_connection */
#define assignment_count(counter) do { \
                struct gsm_bts *bts = conn_get_bts(conn); /* <--- Here is the party pooper! */ \
                LOG_ASSIGNMENT(conn, LOGL_DEBUG, "incrementing rate counter: %s %s\n", \
                               bsc_ctr_description[BSC_##counter].name, \
                               bsc_ctr_description[BSC_##counter].description); \
                rate_ctr_inc(&conn->network->bsc_ctrs->ctr[BSC_##counter]); \
                rate_ctr_inc(&bts->bts_ctrs->ctr[BTS_##counter]); \
        } while(0)

so I think I found culprit:

commit b5ccf09fc4042c7fb1fdaaa6263961c40b32564e
Author: Daniel Willmann <dwillmann@sysmocom.de>
Date:   Mon Jun 8 14:36:46 2020 +0200

    Count assignment rates per BTS as well

    This adds the assignment counters for the BTS as well and changes the
    assignment_count() macro to increase both the counters for the BSC as
    well as the BTS.

    Related: SYS#4877
    Change-Id: I0009e51d4caf68e762138d98e2e23d49acc3cc1a

"

Actions #3

Updated by laforge almost 4 years ago

  • Subject changed from ttcn3-bsc-sccplite: crash in BSC_Tests.TC_assignment_codec_amr_f to ttcn3-bsc-sccplite: crash in BSC_Tests.TC_assignment_codec_amr_f (introduced by "count assignment rates per bts")
  • Assignee set to d8g11qnrlqjn
  • Priority changed from Normal to Urgent

neels wrote:

so I think I found culprit:

[...]

If Daniel recently authored a commit, it definitely helps to add him as a watcher to this issue before makign the related issue update to ensure he gets notified about it. Alternatively, you could directly assign it to him. A ticket with no watchers and no assignee will not be seen by anyone but those few insane enough to watch everything everywhere.

Actions #4

Updated by laforge almost 4 years ago

  • Assignee changed from d8g11qnrlqjn to daniel
Actions #5

Updated by laforge almost 4 years ago

Actions #6

Updated by fixeria almost 4 years ago

  • Status changed from New to Feedback

There is already a fix for this issue (the result of our collaborative investigation):

https://gerrit.osmocom.org/c/osmo-bsc/+/18907 fix crashes due to OSMO_ASSERT(conn->lchan)

I had to spend time on this because this segfault was blocking my #4619 investigation.

Actions #7

Updated by fixeria almost 4 years ago

  • Related to Bug #3744: include/osmocom/bsc/gsm_data.h: OSMO_ASSERT(conn->lchan); failed in conn_get_bts() added
Actions #8

Updated by daniel almost 4 years ago

  • Status changed from Feedback to Resolved

Fix merged, thanks!

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)