action #161381
multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S
Status: Closed
Description
Observation
Same problem as in #160646
From https://suse.slack.com/archives/C02CANHLANP/p1717381703517509
(Lili Zhao) Hi, multi machine issues found today, for example: https://openqa.suse.de/tests/14504387#step/iscsi_client/8 (ping with packet size 100 failed, problems with MTU size are expected) and https://openqa.suse.de/tests/14504397#step/suseconnect_scc/25 (curl: (7) Couldn't connect to server)
possibly related https://suse.slack.com/archives/C02CANHLANP/p1717400281975529
(Anton Smorodskyi) When I see such an error https://openqa.suse.de/tests/14492957#step/prepare_instance/27 No route to host at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Transaction.pm line 54. I conclude that the worker's network is down. Is my assumption correct?
also
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1717347718902&to=1717408634010
shows a significantly higher ratio of multi-machine test failures
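For reviewers who want to check from a worker whether such failures are MTU- or routing-related, here is a minimal sketch; the target addresses and host are placeholders, not taken from the failed jobs:

    # ping across the test network with a fixed payload size; -M do forbids fragmentation
    ping -c 3 -M do -s 100 10.0.2.2
    # payloads close to the tunnel MTU show where fragmentation starts failing
    ping -c 3 -M do -s 1400 10.0.2.2
    # basic reachability check for the server the SUT could not reach (curl error 7)
    curl -v --connect-timeout 10 http://openqa.suse.de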
Acceptance criteria
- AC1: The original issue is understood and resolved
- AC2: The multi-machine test failure ratio on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is back to sane levels
Suggestions
- Just cover up the symptoms, retrigger jobs as necessary (see the sketch after this list), etc.
- Ensure that the multi-machine test failure ratio on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is back to sane levels
- Add additional ideas as they come up to #161735
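Retriggering can be done through the openQA API; a sketch along these lines should work (the job IDs are examples only):

    # restart a single failed job
    openqa-cli api --host https://openqa.suse.de -X POST jobs/14504387/restart
    # or restart several IDs in a loop
    for id in 14504387 14504397; do
        openqa-cli api --host https://openqa.suse.de -X POST jobs/$id/restart
    done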
Out of scope
Updated by okurz 7 months ago
- Related to action #160646: multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M added
Updated by mkittler 7 months ago
- Status changed from New to In Progress
The config looks good on all workers. I checked the GRE tunnel config, IP forwarding and the firewall zone of the interfaces, and ran debugging commands on some workers.
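For reference, a sketch of the kind of checks meant here; the interface names br1 and eth0 are assumptions and need to be adjusted per worker:

    # GRE tunnel ports on the Open vSwitch bridge used for multi-machine tests
    ovs-vsctl show
    # IP forwarding must be enabled for the MM network
    sysctl net.ipv4.ip_forward
    # firewall zone assignment of the bridge and the uplink interface
    firewall-cmd --get-zone-of-interface=br1
    firewall-cmd --get-zone-of-interface=eth0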
We also saw problems repeatedly in non-MM tests like https://openqa.suse.de/tests/14492957#step/prepare_instance/27 so maybe there's a bigger problem (or those are due to https://openqa.suse.de/tests/14506392#step/deploy_qesap_terraform/25 and therefore unrelated).
Updated by mkittler 7 months ago
The pipeline on https://gitlab.suse.de/openqa/scripts-ci/-/pipelines started to fail one day ago (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2677163) and started passing again one hour ago. So if there was a problem I might be slightly too late with my investigation (and e.g. a subsequent salt run has already fixed the config again).
Updated by mkittler 7 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
This was in fact fixed by @nicksinger:
What I did was to manually revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/826 on OSD and run a highstate again. We basically had the same situation as in https://progress.opensuse.org/issues/160646 again.
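Roughly what such a manual fix looks like; the commit reference is a placeholder, the MR itself is the one linked above:

    # revert the pillar change (or revert via the GitLab UI)
    git revert <commit-of-MR-826>   # placeholder, not an actual commit hash
    git push
    # then re-apply the salt states from the OSD salt master
    sudo salt --no-color '*' state.highstate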
So I guess there's nothing left to do immediately, but we'll have to make our salt setup more reliable.
Updated by okurz 7 months ago
I don't understand it yet. Could you please provide more details about what the problem was and what symptoms it caused?
Also, do you have ideas on how to improve the error reporting so that in the future it is clearer to test reviewers how to continue? If you have such ideas, it's probably best to write them down in specific tickets unless you can immediately provide suggestions in the form of code changes in pull requests.
Updated by openqa_review 7 months ago
- Due date set to 2024-06-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 7 months ago
- Copied to coordination #161735: [epic] Better error detection on GRE tunnel misconfiguration added
Updated by mkittler 7 months ago
- Status changed from Feedback to Resolved
The fail ratio currently looks good on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24. We identified that the salt mine sometimes not being correctly populated was the culprit, and #161735 should be enough to continue - although we should probably create at least one concrete ticket for how to continue (but it is probably not within the scope of this ticket to make that decision).
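For future debugging of the same symptom, a quick way to inspect and refresh the mine data from the salt master; the mine function network.interfaces is an assumption about what the GRE tunnel states consume:

    # show what every worker currently publishes to the salt mine
    sudo salt '*' mine.get '*' network.interfaces
    # force all minions to re-populate their mine data, then re-apply the states
    sudo salt '*' mine.update
    sudo salt '*' state.highstate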