Bug #4620
closedttcn3-bsc-sccplite: crash in BSC_Tests.TC_assignment_codec_amr_f (introduced by "count assignment rates per bts")
0%
Description
since one or two days ago, ttcn3-bsc-sccplite fails, by osmo-bsc crashing during BSC_Tests.TC_assignment_codec_amr_f.
I couldn't get this to happen in a manual run, but using the unmodified docker images, the crash is reproducable.
It seems that a test hits an error or the end, and that because connections are released, the conn->lchan is freed.
After that, osmo-bsc hits an assertion on conn->lchan:
20200618153244799 DAS DEBUG assignment(conn25_0-0-1-TCH_F-0)[0x55f0a64979f0]{WAIT_RR_ASS_COMPLETE}: Timeout of T10 (fsm.c:322) Assert failed conn->lchan ../../include/osmocom/bsc/gsm_data.h:1376
Related issues
Updated by neels almost 4 years ago
the error seems to happen at different times, depending on the timing(?)
jenkins shows:
[0;m[1;31m20200618083656230 [1;31mDRLL[0;m[1;31m <0000> mgcp_client_fsm.c:526 MGCP_CONN(to-BTS)[0x555555b32910]{ST_DLCX_RESP}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX... [0;m20200618083656230 [1;31mDLMGCP[0;m <0023> mgcp_client.c:701 Failed to read: r=172.18.12.203:2427<->l=172.18.12.20:2727: 111='Connection refused' [0;mAssert failed conn->lchan ../../include/osmocom/bsc/gsm_data.h:1376
while a local run shows:
20200618153242817 DRLL ERROR MGCP_CONN(to-BTS)[0x55f0a6491e80]{ST_DLCX_RESP}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX... (mgcp_client_fsm.c:526) 20200618153242817 DRLL DEBUG MGCP_CONN(to-BTS)[0x55f0a6491e80]{ST_DLCX_RESP}: Freeing instance (mgcp_client_fsm.c:496) 20200618153242817 DRLL DEBUG MGCP_CONN(to-BTS)[0x55f0a6491e80]{ST_DLCX_RESP}: Deallocated (fsm.c:573) 20200618153242817 DLMGCP ERROR Failed to read: r=172.18.12.203:2427<->l=172.18.12.20:2727: 111='Connection refused' (mgcp_client.c:701) 20200618153243361 DHODEC DEBUG (BTS 0) No congestion check: no minimum for free TCH/F nor TCH/H set (handover_decision_2.c:1846) 20200618153243361 DHODEC DEBUG (BTS 1) No congestion check: no minimum for free TCH/F nor TCH/H set (handover_decision_2.c:1846) 20200618153243361 DHODEC DEBUG (BTS 2) No congestion check: no minimum for free TCH/F nor TCH/H set (handover_decision_2.c:1846) 20200618153243362 DHODEC DEBUG (BTS 3) No congestion check: TRX 0 not usable (handover_decision_2.c:1830) 20200618153243362 DHODEC DEBUG HO algorithm 2: next periodical congestion check in 10 seconds (handover_decision_2.c:162) 20200618153244799 DAS DEBUG assignment(conn25_0-0-1-TCH_F-0)[0x55f0a64979f0]{WAIT_RR_ASS_COMPLETE}: Timeout of T10 (fsm.c:322) Assert failed conn->lchan ../../include/osmocom/bsc/gsm_data.h:1376
i.e. one during MGCP client teardown, the other during assignment FSM teardown a bit later.
It is obvious that the OSMO_ASSERT(conn->lchan) is too drastic,
but I'd also like to understand the root cause for things going wrong in the first place.
Updated by neels almost 4 years ago
in #4619, fixeria wrote:
"
Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376
According to the backtrace, the crash originates from assignment_fsm_timer_cb(). This callback tries to update some counters:
int assignment_fsm_timer_cb(struct osmo_fsm_inst *fi)
{
struct gsm_subscriber_connection *conn = assignment_fi_conn(fi);
assignment_count_result(CTR_ASSIGNMENT_TIMEOUT); // New: b5ccf09fc4 2020-06-08 14:36:46
assignment_fail(GSM0808_CAUSE_EQUIPMENT_FAILURE, "Timeout");
return 0;
}
so trying to follow (Ctrl+] in Vim) assignment_count_result() brings us to the macros hell:
#define assignment_count_result(counter) do { \
if (!conn->assignment.result_rate_ctr_done) { \
assignment_count(counter); /* <--- We go here */ \
conn->assignment.result_rate_ctr_done = true; \
} else \
LOG_ASSIGNMENT(conn, LOGL_DEBUG, \
"result rate counter already recorded, NOT counting as: %s %s\n", \
bsc_ctr_description[BSC_##counter].name, \
bsc_ctr_description[BSC_##counter].description); \
} while(0)
following the rabbit hole, we end up here:
/* Assume presence of local var 'conn' as struct gsm_subscriber_connection */
#define assignment_count(counter) do { \
struct gsm_bts *bts = conn_get_bts(conn); /* <--- Here is the party pooper! */ \
LOG_ASSIGNMENT(conn, LOGL_DEBUG, "incrementing rate counter: %s %s\n", \
bsc_ctr_description[BSC_##counter].name, \
bsc_ctr_description[BSC_##counter].description); \
rate_ctr_inc(&conn->network->bsc_ctrs->ctr[BSC_##counter]); \
rate_ctr_inc(&bts->bts_ctrs->ctr[BTS_##counter]); \
} while(0)
so I think I found culprit:
commit b5ccf09fc4042c7fb1fdaaa6263961c40b32564e Author: Daniel Willmann <dwillmann@sysmocom.de> Date: Mon Jun 8 14:36:46 2020 +0200 Count assignment rates per BTS as well This adds the assignment counters for the BTS as well and changes the assignment_count() macro to increase both the counters for the BSC as well as the BTS. Related: SYS#4877 Change-Id: I0009e51d4caf68e762138d98e2e23d49acc3cc1a
"
Updated by laforge almost 4 years ago
- Subject changed from ttcn3-bsc-sccplite: crash in BSC_Tests.TC_assignment_codec_amr_f to ttcn3-bsc-sccplite: crash in BSC_Tests.TC_assignment_codec_amr_f (introduced by "count assignment rates per bts")
- Assignee set to d8g11qnrlqjn
- Priority changed from Normal to Urgent
neels wrote:
so I think I found culprit:
[...]
If Daniel recently authored a commit, it definitely helps to add him as a watcher to this issue before makign the related issue update to ensure he gets notified about it. Alternatively, you could directly assign it to him. A ticket with no watchers and no assignee will not be seen by anyone but those few insane enough to watch everything everywhere.
Updated by laforge almost 4 years ago
- Assignee changed from d8g11qnrlqjn to daniel
Updated by fixeria almost 4 years ago
- Status changed from New to Feedback
There is already a fix for this issue (the result of our collaborative investigation):
https://gerrit.osmocom.org/c/osmo-bsc/+/18907 fix crashes due to OSMO_ASSERT(conn->lchan)
I had to spend time on this because this segfault was blocking my #4619 investigation.
Updated by fixeria almost 4 years ago
- Related to Bug #3744: include/osmocom/bsc/gsm_data.h: OSMO_ASSERT(conn->lchan); failed in conn_get_bts() added