action #169531

closed

Scripts CI | Failed pipeline for master - ping_client test failed size:S

Added by tinita about 1 month ago. Updated 22 days ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: Regressions/Crashes
Target version: Ready
Start date: 2024-11-07
Due date: 2024-11-29
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3347063
https://openqa.opensuse.org/tests/4625959
https://openqa.opensuse.org/tests/4625960#step/setup_multimachine/82

Test died: command 'until nmcli networking connectivity check | tee /dev/stderr | grep 'full'; do sleep 10; done' timed out
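
A hedged sketch of how this wait loop could carry an explicit overall deadline instead of relying solely on the harness timeout; the 300-second limit and the error message are assumptions for illustration, not values from the actual test:

  # Sketch only: poll NM until it reports 'full' connectivity, but give up
  # after an assumed 300 s instead of looping until the harness kills us.
  deadline=$((SECONDS + 300))
  until nmcli networking connectivity check | tee /dev/stderr | grep -q 'full'; do
      if (( SECONDS >= deadline )); then
          echo "NetworkManager never reported 'full' connectivity" >&2
          exit 1
      fi
      sleep 10
  done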

I observed that those tests are running on different hosts. Most of the time it works, but shouldn't they be scheduled on the same host?

Rollback steps

  1. Enable pipeline schedule again in https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules

Suggestions

  • Don't restart anything
  • Use EXPECTED_NM_CONNECTIVITY
  • This is not strictly worker20, although it is usually one of the involved workers (server or client)
Actions #1

Updated by okurz about 1 month ago

  • Tags set to reactive work, infra
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by okurz about 1 month ago

  • Assignee set to nicksinger
Actions #3

Updated by nicksinger about 1 month ago

  • Status changed from New to Rejected

I looked into this job; apparently NetworkManager had local network connectivity but struggled to reach "the internet" - and only this single time. So I assume this was a very short network outage. All jobs before and after work again, also on two different workers: https://openqa.opensuse.org/tests/4627077 - so unfortunately I see nothing we can improve right now. If this happens again we can think about increasing the timeout while waiting for NM.
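
For context, `nmcli networking connectivity check` reports one of none, portal, limited or full; "limited" means the local network is up but NM's connectivity-check URI could not be reached. A generic way to see what NM concluded at such a moment (this is an illustration, not output from the failing job):

  # Ask NetworkManager to re-run its connectivity probe and print the result
  # (one of: none, portal, limited, full).
  nmcli networking connectivity check
  # Show the local link state, which was reportedly fine in the failing job.
  nmcli -f DEVICE,STATE,CONNECTION device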

Actions #4

Updated by tinita about 1 month ago

It just happened again this morning: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3351096

Actions #5

Updated by mkittler about 1 month ago

  • Status changed from Rejected to Workable
Actions #7

Updated by nicksinger about 1 month ago · Edited

  • Status changed from Workable to In Progress

First occurrence (https://openqa.opensuse.org/tests/4625959 ):

  • server on w22
  • client on w20

Second occurrence (https://openqa.opensuse.org/tests/4628700 ):

  • server on w20
  • client on w23

Third occurrence (https://openqa.opensuse.org/tests/4630474 ):

  • server on w21
  • client on w22

Fourth occurrence (https://openqa.opensuse.org/tests/4631479 ):

  • server on w26
  • client on w20

Fifth occurrence (https://openqa.opensuse.org/tests/4631799 ):

  • server on w20
  • client on w25

Sixth occurrence (https://openqa.opensuse.org/tests/4631816 ):

  • server on w23
  • client on w20

Seventh occurrence (https://openqa.opensuse.org/tests/4632725 ):

  • server on w20
  • client on w21

Eighth occurrence (https://openqa.opensuse.org/tests/4633438 ):

  • server on w20
  • client on w21

Ninth occurrence (https://openqa.opensuse.org/tests/4634869 ):

  • server on w20
  • client on w22

Tenth occurrence (https://openqa.opensuse.org/tests/4635025 ):

  • server on w20
  • client on w23

The only test where w20 was not involved is https://openqa.opensuse.org/tests/4630474, and that one eventually managed to "fix" itself (the status goes from "partial" to "full" later on). I won't focus on a specific worker but rather on whether "partial" might already be enough for us, because we don't really need (and also don't want to test) a working external network for this.
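
A minimal sketch of what accepting a non-"full" state in the existing wait loop could look like; the (limited|full) pattern mirrors the EXPECTED_NM_CONNECTIVITY value proposed later in #note-18 and is only an illustration, not the actual test code:

  # Sketch only: treat 'limited' as good enough for the multimachine setup,
  # since an external/internet connection is not what this test verifies.
  until nmcli networking connectivity check | tee /dev/stderr | grep -qE 'limited|full'; do
      sleep 10
  done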

Actions #8

Updated by nicksinger about 1 month ago

I looked into the code and realized that the restart logic might be flawed and that we don't need it in the first place, so I am trying to remove it: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592
While trying to ping the correct reviewers I also found https://github.com/os-autoinst/os-autoinst-distri-opensuse/commit/d80077c8c4dd32b5140a173f2a28ab1c47f49591, which we could maybe use as a workaround too (e.g. by setting this variable to (partial|fully)); I asked Dominik about it in: https://suse.slack.com/archives/C02AJ1E568M/p1731324488409659?thread_ts=1731324309.806219&cid=C02AJ1E568M

Actions #9

Updated by nicksinger about 1 month ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

Reducing priority as this doesn't seem like a general setup problem but rather a very specific test issue.

Actions #10

Updated by nicksinger about 1 month ago

I disabled the o3 pipeline schedule for now to avoid mail spam as it seems to happen more often now.

Actions #11

Updated by nicksinger about 1 month ago

  • Description updated (diff)
Actions #12

Updated by okurz about 1 month ago

  • Subject changed from Scripts CI | Failed pipeline for master - ping_client test failed to Scripts CI | Failed pipeline for master - ping_client test failed size:S
  • Description updated (diff)
Actions #13

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to Resolved

Currently all runs look good. We received two alerts but for a different reason (I answered in the alert mails directly).

Actions #14

Updated by dzedro about 1 month ago

  • Status changed from Resolved to Feedback
Actions #15

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to In Progress

dzedro wrote in #note-14:

For some reason it's breaking setup_multimachine on 15-SP4 Desktop https://openqa.suse.de/tests/15934939#step/setup_multimachine/206
Without the PR https://openqa.suse.de/tests/15939547#step/setup_multimachine/90

Yes, we encountered https://progress.opensuse.org/issues/169843 as well - it seems the restart covered up some mistake or missing step. NM is definitely running and configured, but… differently: https://openqa.suse.de/tests/15934939#step/setup_multimachine/84 (there is only a single process running after the restart of NM: https://openqa.suse.de/tests/15939547#step/setup_multimachine/102)
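
A hedged way to compare the NetworkManager processes in both runs; this is only a generic illustration of that kind of check, not what the test module executes:

  # List all NetworkManager processes with their PIDs and full command lines;
  # comparing this output before and after the restart shows whether extra or
  # duplicate processes disappeared.
  pgrep -af NetworkManager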

Actions #17

Updated by openqa_review about 1 month ago

  • Due date set to 2024-11-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by nicksinger about 1 month ago

nicksinger wrote in #note-16:

I reverted https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592 with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20638 - will restart tests and investigate further.

I restarted all tests I was able to find. I used the following SQL query to find jobs containing (failing) modules I was aware of:

select j.id, jm.name
from jobs j
join job_modules jm on j.id = jm.job_id
where t_started >= '2024-10-15T20:00:00'
  and j.result = 'failed'
  and j.test not like '%:investigate:%'
  and jm.name like '%yast2_nfs_server%'
  and j.clone_id is null

these consisted of:

I will take them into consideration for testing before un-drafting https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 which is my next approach to improve the module in question.
In the meantime I understood why the initially reported test is failing sporadically. While looking at the TW job group (containing this job) I found https://openqa.opensuse.org/tests/4647382, which is not the latest. The latest job is https://openqa.opensuse.org/tests/4649329, and the two differ hugely in their applied settings. This happens because the first test is scheduled as part of the TW product, while all newer ones are triggered by https://github.com/os-autoinst/scripts/blob/master/openqa-schedule-mm-ping-test from within https://gitlab.suse.de/openqa/scripts-ci. One of the settings missing in our CI is "EXPECTED_NM_CONNECTIVITY", which is set to "none" in the TW schedule. So if our test was scheduled by our pipeline and happens to run at a moment where the internal connection is only considered "limited" by NM (not sure why, but apparently this happens from time to time), it will fail. In all other cases it passes.

Adding "EXPECTED_NM_CONNECTIVITY=none" to https://github.com/os-autoinst/scripts/blob/master/openqa-schedule-mm-ping-test to skip the test in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/mm_network.pm#L243-L245 is easy enough but I want to improve the situation further (e.g. by allowing "EXPECTED_NM_CONNECTIVITY=(limited|full)" with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651) and also looking into how our pipeline could automatically clone other inherited variables.

Actions #19

Updated by nicksinger 23 days ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/scripts/pull/353 created to address the initial issue of our failing pipelines. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 for a general improvement.

Actions #20

Updated by okurz 23 days ago

  • Status changed from Feedback to In Progress
Actions #21

Updated by nicksinger 23 days ago · Edited

  • Status changed from In Progress to Feedback

Added VRs to my PR and asked @dzedro in Slack how to avoid #169531#note-14

Actions #23

Updated by livdywan 22 days ago

also https://github.com/os-autoinst/scripts/pull/354

Merged. Anything else left here?

Actions #24

Updated by nicksinger 22 days ago

  • Status changed from Feedback to Resolved

With https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 merged, I have now added the proper fix to our pipeline definitions and to the test suite on o3. That these changes work can be seen at https://openqa.opensuse.org/tests/4669194.
