action #169531
Scripts CI | Failed pipeline for master - ping_client test failed size:S (closed)
Added by tinita about 1 month ago. Updated 22 days ago.
Description
Observation
https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3347063
https://openqa.opensuse.org/tests/4625959
https://openqa.opensuse.org/tests/4625960#step/setup_multimachine/82
Test died: command 'until nmcli networking connectivity check | tee /dev/stderr | grep 'full'; do sleep 10; done' timed out
I observed that those tests are running on different hosts. Most of the time it works, but shouldn't they be scheduled on the same host?
Rollback steps
- Enable pipeline schedule again in https://gitlab.suse.de/openqa/scripts-ci/-/pipeline_schedules
Suggestions
- Don't restart anything
- Use EXPECTED_NM_CONNECTIVITY
- This is not strictly tied to worker20, although it is usually one of the involved workers (server or client)
Updated by okurz about 1 month ago
- Tags set to reactive work, infra
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by nicksinger about 1 month ago
- Status changed from New to Rejected
I looked into this job; apparently NetworkManager had network connectivity but struggled to reach "the internet" - but only this single time. So I just assume this was a very short network outage. All jobs before and after worked again, also on two different workers: https://openqa.opensuse.org/tests/4627077 - so unfortunately I see nothing we can improve right now. If this happens again we can think about increasing the timeout waiting for NM.
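For reference, a longer but explicitly bounded wait could look roughly like this at the shell level (only a sketch; the real test drives this loop through the test module, and the 300-second limit is an illustrative assumption, not a value taken from the test):

    # Sketch: wait up to 300 seconds (assumed value) for NetworkManager to
    # report "full" connectivity, instead of looping until the job times out.
    timeout 300 bash -c '
      until nmcli networking connectivity check | tee /dev/stderr | grep -q full; do
        sleep 10
      done
    ' || echo "NetworkManager never reported full connectivity" >&2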
Updated by tinita about 1 month ago
It just happened again this morning: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3351096
Updated by livdywan about 1 month ago
Also https://openqa.opensuse.org/tests/4634869#step/setup_multimachine/82 which looks to be the second instance today (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3360637)
Updated by nicksinger about 1 month ago · Edited
- Status changed from Workable to In Progress
First occurrence (https://openqa.opensuse.org/tests/4625959 ):
- server on w22
- client on w20
Second occurrence (https://openqa.opensuse.org/tests/4628700 ):
- server on w20
- client on w23
Third occurrence (https://openqa.opensuse.org/tests/4630474 ):
- server on w21
- client on w22
Fourth occurrence (https://openqa.opensuse.org/tests/4631479 ):
- server on w26
- client on w20
Fifth occurrence (https://openqa.opensuse.org/tests/4631799 ):
- server on w20
- client on w25
Sixth occurrence (https://openqa.opensuse.org/tests/4631816 ):
- server on w23
- client on w20
Seventh occurrence (https://openqa.opensuse.org/tests/4632725 ):
- server on w20
- client on w21
Eighth occurrence (https://openqa.opensuse.org/tests/4633438 ):
- server on w20
- client on w21
Ninth occurrence (https://openqa.opensuse.org/tests/4634869 ):
- server on w20
- client on w22
Tenth occurrence (https://openqa.opensuse.org/tests/4635025 ):
- server on w20
- client on w23
The only test where w20 was not involved is https://openqa.opensuse.org/tests/4630474 and it eventually managed to "fix" itself (the status goes from "partial" to "full" later on). I won't focus on a specific worker but rather on whether "partial" might already be enough for us, because we don't really need (and also don't want to test) a working external network for this.
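If "partial" connectivity were accepted, the wait from the failing step could be relaxed roughly as follows (a sketch only; nmcli's own state names are none/portal/limited/full, so "limited" is assumed here to be the state the log shows as "partial", which also matches the later (limited|full) idea):

    # Sketch: treat "limited" (internal network reachable, no internet) as
    # good enough, in addition to "full".
    until nmcli networking connectivity check | tee /dev/stderr | grep -Eq 'limited|full'; do
      sleep 10
    done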
Updated by nicksinger about 1 month ago
I looked into the code and realized that the restart logic might be flawed and that we don't need it in the first place, so I am trying to remove it: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592
While trying to ping the correct reviewers I also found https://github.com/os-autoinst/os-autoinst-distri-opensuse/commit/d80077c8c4dd32b5140a173f2a28ab1c47f49591 which we could maybe use as a workaround too (e.g. by setting this variable to (partial|fully)) and asked Dominik about it in: https://suse.slack.com/archives/C02AJ1E568M/p1731324488409659?thread_ts=1731324309.806219&cid=C02AJ1E568M
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
Reducing prio as this doesn't seem like a general setup problem but rather a very specific test issue
Updated by nicksinger about 1 month ago
I disabled the o3 pipeline schedule for now to avoid mail spam as it seems to happen more often now.
Updated by okurz about 1 month ago
- Subject changed from Scripts CI | Failed pipeline for master - ping_client test failed to Scripts CI | Failed pipeline for master - ping_client test failed size:S
- Description updated (diff)
Updated by nicksinger about 1 month ago
- Status changed from Feedback to Resolved
Currently all runs look good. We received two alerts but for a different reason (I answered in the alert mails directly).
Updated by dzedro about 1 month ago
- Status changed from Resolved to Feedback
For some reason it's breaking setup_multimachine on 15-SP4 Desktop: https://openqa.suse.de/tests/15934939#step/setup_multimachine/206
Without the PR: https://openqa.suse.de/tests/15939547#step/setup_multimachine/90
Updated by nicksinger about 1 month ago
- Status changed from Feedback to In Progress
dzedro wrote in #note-14:
For some reason it's breaking setup_multimachine on 15-SP4 Desktop: https://openqa.suse.de/tests/15934939#step/setup_multimachine/206
Without the PR: https://openqa.suse.de/tests/15939547#step/setup_multimachine/90
Yes, we encountered https://progress.opensuse.org/issues/169843 as well - it seems like the restart covered up some mistake or missing step. NM is definitely running and configured, but… differently: https://openqa.suse.de/tests/15934939#step/setup_multimachine/84 (there is only a single process running after the restart of NM: https://openqa.suse.de/tests/15939547#step/setup_multimachine/102)
Updated by nicksinger about 1 month ago
I reverted https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592 with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20638 - will restart tests and investigate further.
Updated by openqa_review about 1 month ago
- Due date set to 2024-11-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger about 1 month ago
nicksinger wrote in #note-16:
I reverted https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20592 with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20638 - will restart tests and investigate further.
I restarted all tests I was able to find. I used the following SQL query to find jobs containing (failing) modules I was aware of (a variant of the query looping over all affected modules is sketched after the list below):
select j.id, jm.name from jobs j join job_modules jm on j.id = jm.job_id where t_started >= '2024-10-15T20:00:00' and j.result = 'failed' and j.test not like '%:investigate:%' and name like '%yast2_nfs_server%' and j.clone_id is null
these consisted of:
- yast2_nfs_server - https://progress.opensuse.org/issues/169945
- rsync_server - https://progress.opensuse.org/issues/169843
- setup_multimachine - https://progress.opensuse.org/issues/169531#note-14
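For completeness, the same query can be repeated for each of these modules by swapping the module-name filter; a hedged shell sketch, assuming local psql access to an openQA database named "openqa" (both the database name and the access method are assumptions, the query itself is the one above):

    # Sketch: run the query once per affected module.
    for module in yast2_nfs_server rsync_server setup_multimachine; do
      psql openqa -c "select j.id, jm.name from jobs j
        join job_modules jm on j.id = jm.job_id
        where t_started >= '2024-10-15T20:00:00'
          and j.result = 'failed'
          and j.test not like '%:investigate:%'
          and jm.name like '%${module}%'
          and j.clone_id is null;"
    done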
I will take them into consideration for testing before un-drafting https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 which is my next approach to improve the module in question.
In the meantime I understood why the initially reported test is failing sporadically. While looking at the TW job group (containing this job) I found https://openqa.opensuse.org/tests/4647382, which is not the latest one. The latest job is https://openqa.opensuse.org/tests/4649329, and the two differ hugely in their applied settings. This happens because the first test is scheduled as part of the TW product while all newer ones are triggered by https://github.com/os-autoinst/scripts/blob/master/openqa-schedule-mm-ping-test from within https://gitlab.suse.de/openqa/scripts-ci. One of the settings missing in our CI is "EXPECTED_NM_CONNECTIVITY", which is set to "none" in the TW schedule. So if our test happens to run at a moment where the internal connection is only considered "limited" by NM (not sure why, but apparently it happens from time to time) and it was scheduled by our pipeline, it will fail. In all other cases it passes.
Adding "EXPECTED_NM_CONNECTIVITY=none" to https://github.com/os-autoinst/scripts/blob/master/openqa-schedule-mm-ping-test to skip the test in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/mm_network.pm#L243-L245 is easy enough but I want to improve the situation further (e.g. by allowing "EXPECTED_NM_CONNECTIVITY=(limited|full)" with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651) and also looking into how our pipeline could automatically clone other inherited variables.
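For illustration, passing that setting at schedule time could look roughly like the following (a sketch under assumptions: openqa-cli's schedule subcommand is used here, the DISTRI/VERSION/FLAVOR/ARCH/TEST values are placeholders rather than the real values used by openqa-schedule-mm-ping-test, and the actual change may be implemented differently in the pipeline script):

    # Sketch: schedule the ping test on o3 with the external-connectivity
    # check skipped; everything except EXPECTED_NM_CONNECTIVITY=none is a
    # placeholder value.
    openqa-cli schedule --host https://openqa.opensuse.org \
      DISTRI=opensuse VERSION=Tumbleweed FLAVOR=DVD ARCH=x86_64 \
      TEST=ping_client \
      EXPECTED_NM_CONNECTIVITY=none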
Updated by nicksinger 23 days ago
- Status changed from In Progress to Feedback
I created https://github.com/os-autoinst/scripts/pull/353 to address the initial issue of our failing pipelines, and https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 for a general improvement.
Updated by okurz 23 days ago
- Status changed from Feedback to In Progress
https://github.com/os-autoinst/scripts/pull/353 merged. You can continue with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 now
Updated by nicksinger 23 days ago · Edited
- Status changed from In Progress to Feedback
Added VRs (verification runs) to my PR and asked @dzedro in Slack how to avoid #169531#note-14
Updated by nicksinger 23 days ago
nicksinger wrote in #note-21:
Added VRs to my PR and asked @dzedro in Slack how to avoid https://progress.opensuse.org/issues/169531#note-14
Updated by nicksinger 22 days ago
- Status changed from Feedback to Resolved
With https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/20651 merged, I have now added the proper fix to our pipeline definitions and to the test suite on o3. That these changes work can be seen at https://openqa.opensuse.org/tests/4669194