action #161381
multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S
Status: Closed
Description
Observation
Same problem as in #160646
From https://suse.slack.com/archives/C02CANHLANP/p1717381703517509
(Lili Zhao) Hi, multi machine issues found today, for example: https://openqa.suse.de/tests/14504387#step/iscsi_client/8 (ping with packet size 100 failed, problems with MTU size are expected) and https://openqa.suse.de/tests/14504397#step/suseconnect_scc/25 (curl: (7) Couldn't connect to server)
possibly related https://suse.slack.com/archives/C02CANHLANP/p1717400281975529
(Anton Smorodskyi) When I see such an error https://openqa.suse.de/tests/14492957#step/prepare_instance/27 No route to host at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Transaction.pm line 54. I conclude that the worker's network is down. Is my assumption correct?
also
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1717347718902&to=1717408634010
shows a significantly higher ratio of multi-machine test failures
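For reviewers who want to check from a worker whether such failures are MTU- or routing-related, here is a minimal sketch; the target addresses and host are placeholders, not taken from the failed jobs:

    # ping across the test network with a fixed payload size; -M do forbids fragmentation
    ping -c 3 -M do -s 100 10.0.2.2
    # payloads close to the tunnel MTU show where fragmentation starts failing
    ping -c 3 -M do -s 1400 10.0.2.2
    # basic reachability check for the server the SUT could not reach (curl error 7)
    curl -v --connect-timeout 10 http://openqa.suse.de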
Acceptance criteria
- AC1: The original issue is understood and resolved
- AC2: The multi-machine test failure ratio on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is back to sane levels
Suggestions
- Just cover up the symptoms, retrigger jobs as necessary (see the sketch after this list), etc.
- Ensure that the multi-machine test failure ratio on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test is back to sane levels
- Add additional ideas as they come up to #161735
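Retriggering can be done through the openQA API; a sketch along these lines should work (the job IDs are examples only):

    # restart a single failed job
    openqa-cli api --host https://openqa.suse.de -X POST jobs/14504387/restart
    # or restart several IDs in a loop
    for id in 14504387 14504397; do
        openqa-cli api --host https://openqa.suse.de -X POST jobs/$id/restart
    done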
Out of scope
Updated by okurz 7 months ago
- Related to action #160646: multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M added
Updated by mkittler 7 months ago
- Status changed from New to In Progress
The config looks good on all workers. I checked the GRE tunnel config, IP forwarding and the firewall zone of the interfaces, and ran debugging commands on some workers.
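For reference, a sketch of the kind of checks meant here; the interface names br1 and eth0 are assumptions and need to be adjusted per worker:

    # GRE tunnel ports on the Open vSwitch bridge used for multi-machine tests
    ovs-vsctl show
    # IP forwarding must be enabled for the MM network
    sysctl net.ipv4.ip_forward
    # firewall zone assignment of the bridge and the uplink interface
    firewall-cmd --get-zone-of-interface=br1
    firewall-cmd --get-zone-of-interface=eth0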
We also saw problems repeatedly in non-MM tests like https://openqa.suse.de/tests/14492957#step/prepare_instance/27 so maybe there's a bigger problem (or those are due to https://openqa.suse.de/tests/14506392#step/deploy_qesap_terraform/25 and therefore unrelated).
Updated by mkittler 7 months ago
The pipeline on https://gitlab.suse.de/openqa/scripts-ci/-/pipelines started to fail one day ago (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2677163) and started passing again one hour ago. So if there was a problem I might be slightly too late with my investigation (and e.g. a subsequent salt run has already fixed the config again).
Updated by mkittler 7 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
This was in fact fixed by @nicksinger:
What I did was to manually revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/826 on OSD and run a highstate again. We basically had the same situation as in https://progress.opensuse.org/issues/160646 again.
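Roughly what such a manual fix looks like; the commit reference is a placeholder, the MR itself is the one linked above:

    # revert the pillar change (or revert via the GitLab UI)
    git revert <commit-of-MR-826>   # placeholder, not an actual commit hash
    git push
    # then re-apply the salt states from the OSD salt master
    sudo salt --no-color '*' state.highstate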
So I guess there's nothing left to do immediately, but we'll have to make our salt setup more reliable.
Updated by okurz 7 months ago
I don't understand it yet. Could you please provide more details about what the problem was and what symptoms it caused?
Also, do you have ideas on how to improve the error reporting so that in the future it is clearer to test reviewers how to continue? If you have such ideas, it's probably best to write them down in specific tickets unless you can immediately provide suggestions in the form of code changes in pull requests.
Updated by openqa_review 7 months ago
- Due date set to 2024-06-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 7 months ago
- Copied to coordination #161735: [epic] Better error detection on GRE tunnel misconfiguration added
Updated by mkittler 7 months ago
- Status changed from Feedback to Resolved
The fail ratio currently looks good on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24. We identified that the salt mine sometimes not being correctly populated was the culprit, and #161735 should be enough to continue - although we should probably create at least one concrete ticket for how to continue (but it is probably not within the scope of this ticket to make that decision).
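For future debugging of the same symptom, a quick way to inspect and refresh the mine data from the salt master; the mine function network.interfaces is an assumption about what the GRE tunnel states consume:

    # show what every worker currently publishes to the salt mine
    sudo salt '*' mine.get '*' network.interfaces
    # force all minions to re-populate their mine data, then re-apply the states
    sudo salt '*' mine.update
    sudo salt '*' state.highstate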