action #169531
Scripts CI | Failed pipeline for master - ping_client test failed size:S (closed)
Added by tinita about 1 month ago. Updated 22 days ago.
Description
Observation
https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3347063
https://openqa.opensuse.org/tests/4625959
https://openqa.opensuse.org/tests/4625960#step/setup_multimachine/82
Test died: command 'until nmcli networking connectivity check | tee /dev/stderr | grep 'full'; do sleep 10; done' timed out
I observed that those tests are running on different hosts. Most of the time it works, but shouldn't they be scheduled on the same host?
Rollback steps
- Enable pipeline schedule again in https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules
Suggestions
- Don't restart anything
- Use EXPECTED_NM_CONNECTIVITY
- This is not strictly tied to worker20, although it is usually one of the involved workers (server or client)
Updated by okurz about 1 month ago
- Tags set to reactive work, infra
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by nicksinger about 1 month ago
- Status changed from New to Rejected
I looked into this job; apparently NetworkManager had network connectivity but struggled to reach "the internet" - but only this single time. So I just assume this was a very short network outage. All jobs before and after worked again, also on two different workers: https://openqa.opensuse.org/tests/4627077 - so unfortunately I see nothing we can improve right now. If this happens again we can think about increasing the timeout waiting for NM.
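For reference, a longer but explicitly bounded wait could look roughly like this at the shell level (only a sketch; the real test drives this loop through the test module, and the 300-second limit is an illustrative assumption, not a value taken from the test):

    # Sketch: wait up to 300 seconds (assumed value) for NetworkManager to
    # report "full" connectivity, instead of looping until the job times out.
    timeout 300 bash -c '
      until nmcli networking connectivity check | tee /dev/stderr | grep -q full; do
        sleep 10
      done
    ' || echo "NetworkManager never reported full connectivity" >&2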
Updated by tinita about 1 month ago
It just happened again this morning: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3351096
Updated by livdywan about 1 month ago
Also https://openqa.opensuse.org/tests/4634869#step/setup_multimachine/82 which looks to be the second instance today (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3360637)
Updated by nicksinger about 1 month ago · Edited
- Status changed from Workable to In Progress
First occurrence (https://openqa.opensuse.org/tests/4625959 ):
- server on w22
- client on w20
Second occurrence (https://openqa.opensuse.org/tests/4628700 ):
- server on w20
- client on w23
Third occurrence (https://openqa.opensuse.org/tests/4630474 ):
- server on w21
- client on w22
Fourth occurrence (https://openqa.opensuse.org/tests/4631479 ):
- server on w26
- client on w20
Fifth occurrence (https://openqa.opensuse.org/tests/4631799 ):
- server on w20
- client on w25
Sixth occurrence (https://openqa.opensuse.org/tests/4631816 ):
- server on w23
- client on w20
Seventh occurrence (https://openqa.opensuse.org/tests/4632725 ):
- server on w20
- client on w21
Eighth occurrence (https://openqa.opensuse.org/tests/4633438 ):
- server on w20
- client on w21
Ninth occurrence (https://openqa.opensuse.org/tests/4634869 ):
- server on w20
- client on w22
Tenth occurrence (https://openqa.opensuse.org/tests/4635025 ):
- server on w20
- client on w23
The only test where w20 was not involved is https://openqa.opensuse.org/tests/4630474 and it eventually managed to "fix" itself (the status goes from "partial" to "full" later on). I won't focus on a specific worker but rather on whether "partial" might already be enough for us, because we don't really need (and also don't want to test) a working external network for this.
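If "partial" connectivity were accepted, the wait from the failing step could be relaxed roughly as follows (a sketch only; nmcli's own state names are none/portal/limited/full, so "limited" is assumed here to be the state the log shows as "partial", which also matches the later (limited|full) idea):

    # Sketch: treat "limited" (internal network reachable, no internet) as
    # good enough, in addition to "full".
    until nmcli networking connectivity check | tee /dev/stderr | grep -Eq 'limited|full'; do
      sleep 10
    done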
Updated by nicksinger about 1 month ago
I looked into the code and realized that the restart logic might be flawed and that we don't need it in the first place, so I am trying to remove it: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592
While trying to ping the correct reviewers I also found https://github.com/os-autoinst/os-autoinst-distri-opensuse/commit/d80077c8c4dd32b5140a173f2a28ab1c47f49591 which we could maybe use as a workaround too (e.g. by setting this variable to (partial|fully)) and asked Dominik about it in: https://suse.slack.com/archives/C02AJ1E568M/p1731324488409659?thread_ts=1731324309.806219&cid=C02AJ1E568M
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
Reducing prio as this doesn't seem like a general setup problem but rather a very specific test issue
Updated by nicksinger about 1 month ago
I disabled the o3 pipeline schedule for now to avoid mail spam as it seems to happen more often now.
Updated by okurz about 1 month ago
- Subject changed from Scripts CI | Failed pipeline for master - ping_client test failed to Scripts CI | Failed pipeline for master - ping_client test failed size:S
- Description updated (diff)
Updated by nicksinger about 1 month ago
- Status changed from Feedback to Resolved
Currently all runs look good. We received two alerts but for a different reason (I answered in the alert mails directly).
Updated by dzedro about 1 month ago
- Status changed from Resolved to Feedback
For some reason it's breaking setup_multimachine on 15-SP4 Desktop: https://openqa.suse.de/tests/15934939#step/setup_multimachine/206
Without the PR: https://openqa.suse.de/tests/15939547#step/setup_multimachine/90
Updated by nicksinger about 1 month ago
- Status changed from Feedback to In Progress
dzedro wrote in #note-14:
For some reason it's breaking setup_multimachine on 15-SP4 Desktop: https://openqa.suse.de/tests/15934939#step/setup_multimachine/206
Without the PR: https://openqa.suse.de/tests/15939547#step/setup_multimachine/90
Yes, we encountered https://progress.opensuse.org/issues/169843 as well - it seems like the restart covered up some mistake or missing step. NM is definitely running and configured, but… differently: https://openqa.suse.de/tests/15934939#step/setup_multimachine/84 (there is only a single process running after the restart of NM: https://openqa.suse.de/tests/15939547#step/setup_multimachine/102)
Updated by nicksinger about 1 month ago
I reverted https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592 with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20638 - will restart tests and investigate further.
Updated by openqa_review about 1 month ago
- Due date set to 2024-11-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger about 1 month ago
nicksinger wrote in #note-16:
I reverted https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592 with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20638 - will restart tests and investigate further.
I restarted all tests I was able to find. I used the following SQL query to find jobs containing (failing) modules I was aware of (a variant of the query looping over all affected modules is sketched after the list below):
select j.id, jm.name from jobs j join job_modules jm on j.id = jm.job_id where t_started >= '2024-10-15T20:00:00' and j.result = 'failed' and j.test not like '%:investigate:%' and name like '%yast2_nfs_server%' and j.clone_id is null
these consisted of:
- yast2_nfs_server - https://progress.opensuse.org/issues/169945
- rsync_server - https://progress.opensuse.org/issues/169843
- setup_multimachine - https://progress.opensuse.org/issues/169531#note-14
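For completeness, the same query can be repeated for each of these modules by swapping the module-name filter; a hedged shell sketch, assuming local psql access to an openQA database named "openqa" (both the database name and the access method are assumptions, the query itself is the one above):

    # Sketch: run the query once per affected module.
    for module in yast2_nfs_server rsync_server setup_multimachine; do
      psql openqa -c "select j.id, jm.name from jobs j
        join job_modules jm on j.id = jm.job_id
        where t_started >= '2024-10-15T20:00:00'
          and j.result = 'failed'
          and j.test not like '%:investigate:%'
          and jm.name like '%${module}%'
          and j.clone_id is null;"
    done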
I will take them into consideration for testing before un-drafting https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 which is my next approach to improve the module in question.
In the meantime I understood why the initially reported test is failing sporadically. While looking at the TW job group (containing this job) I found https://openqa.opensuse.org/tests/4647382, which is not the latest one. The latest job is https://openqa.opensuse.org/tests/4649329, and the two differ hugely in their applied settings. This happens because the first test is scheduled as part of the TW product while all newer ones are triggered by https://github.com/os-autoinst/scripts/blob/master/openqa-schedule-mm-ping-test from within https://gitlab.suse.de/openqa/scripts-ci. One of the settings missing in our CI is "EXPECTED_NM_CONNECTIVITY", which is set to "none" in the TW schedule. So if our test happens to run at a moment where the internal connection is only considered "limited" by NM (not sure why, but apparently it happens from time to time) and it was scheduled by our pipeline, it will fail. In all other cases it passes.
Adding "EXPECTED_NM_CONNECTIVITY=none" to https://github.com/os-autoinst/scripts/blob/master/openqa-schedule-mm-ping-test to skip the test in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/mm_network.pm#L243-L245 is easy enough but I want to improve the situation further (e.g. by allowing "EXPECTED_NM_CONNECTIVITY=(limited|full)" with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651) and also looking into how our pipeline could automatically clone other inherited variables.
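For illustration, passing that setting at schedule time could look roughly like the following (a sketch under assumptions: openqa-cli's schedule subcommand is used here, the DISTRI/VERSION/FLAVOR/ARCH/TEST values are placeholders rather than the real values used by openqa-schedule-mm-ping-test, and the actual change may be implemented differently in the pipeline script):

    # Sketch: schedule the ping test on o3 with the external-connectivity
    # check skipped; everything except EXPECTED_NM_CONNECTIVITY=none is a
    # placeholder value.
    openqa-cli schedule --host https://openqa.opensuse.org \
      DISTRI=opensuse VERSION=Tumbleweed FLAVOR=DVD ARCH=x86_64 \
      TEST=ping_client \
      EXPECTED_NM_CONNECTIVITY=none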
Updated by nicksinger 23 days ago
- Status changed from In Progress to Feedback
I created https://github.com/os-autoinst/scripts/pull/353 to address the initial issue of our failing pipelines, and https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 for a general improvement.
Updated by okurz 23 days ago
- Status changed from Feedback to In Progress
https://github.com/os-autoinst/scripts/pull/353 merged. You can continue with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 now
Updated by nicksinger 23 days ago · Edited
- Status changed from In Progress to Feedback
Added VRs (verification runs) to my PR and asked @dzedro in Slack how to avoid #169531#note-14
Updated by nicksinger 23 days ago
nicksinger wrote in #note-21:
Added VRs to my PR and asked @dzedro in Slack how to avoid https://progress.opensuse.org/issues/169531#note-14
Updated by nicksinger 22 days ago
- Status changed from Feedback to Resolved
With https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 merged, I have now added the proper fix to our pipeline definitions and to the test suite on o3. That these changes work can be seen at https://openqa.opensuse.org/tests/4669194