action #161381

closed

multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S

Added by okurz 7 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
2024-06-03
Due date:
2024-06-18
% Done:

0%

Estimated time:

Description

Observation

Same problem as in #160646

From https://suse.slack.com/archives/C02CANHLANP/p1717381703517509

(Lili Zhao) Hi, multi machine issues found today, for example: https://openqa.suse.de/tests/14504387#step/iscsi_client/8 (ping with packet size 100 failed, problems with MTU size are expected) and https://openqa.suse.de/tests/14504397#step/suseconnect_scc/25 (curl: (7) Couldn't connect to server)

Possibly related: https://suse.slack.com/archives/C02CANHLANP/p1717400281975529

(Anton Smorodskyi) When I see an error like https://openqa.suse.de/tests/14492957#step/prepare_instance/27 ("No route to host at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Transaction.pm line 54."), I conclude that the worker's network is down. Is my assumption correct?

Also, https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1717347718902&to=1717408634010 shows a significantly higher ratio of multi-machine test failures.

Acceptance criteria

Suggestions

Out of scope

  • Fixing the false positive salt-lint #161393
  • Ensuring that we check YAML validity of the workerconf #161396
  • Fixing and preventing the actual issue

Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure (public) - action #160646: multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M (Resolved, ybonatakis, 2024-05-21)

Copied to openQA Infrastructure (public) - coordination #161735: [epic] Better error detection on GRE tunnel misconfiguration (Blocked, okurz, 2024-06-21)

Actions #1

Updated by okurz 7 months ago

  • Assignee set to mkittler
Actions #2

Updated by okurz 7 months ago

  • Related to action #160646: multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M added
Actions #3

Updated by mkittler 7 months ago

  • Status changed from New to In Progress

The config looks good on all workers. I checked the GRE tunnel config, the forwarding and the firewall zone of the interfaces, and ran debugging commands on some workers.

We also saw problems repeatedly in non-MM tests like https://openqa.suse.de/tests/14492957#step/prepare_instance/27 so maybe there's a bigger problem (or those are due to https://openqa.suse.de/tests/14506392#step/deploy_qesap_terraform/25 and therefore unrelated).

Actions #4

Updated by mkittler 7 months ago

The pipeline on https://gitlab.suse.de/openqa/scripts-ci/-/pipelines started to fail one day ago (https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2677163). It started passing again one hour ago. So if there was a problem, I might be slightly too late with my investigation (e.g. a subsequent salt run may have already fixed the config again).

Actions #5

Updated by mkittler 7 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

This was in fact fixed by @nicksinger:

What I did was manually revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/826 on OSD and run a highstate again. We basically had the same situation as in https://progress.opensuse.org/issues/160646 again.

So I guess there's nothing left to do immediately but we'll have to make our salt setup more reliable.

Actions #6

Updated by okurz 7 months ago

I don't understand it yet. Could you please provide more details on what the problem was and which symptoms it caused?
Also, do you have ideas on how to improve the error reporting so that in the future it is clearer to test reviewers how to continue? If you have such ideas, it's probably best to write them in specific tickets unless you can immediately provide suggestions in the form of code changes in pull requests.

Actions #7

Updated by okurz 7 months ago

  • Parent task set to #111929
Actions #8

Updated by okurz 7 months ago

  • Description updated (diff)
  • Status changed from Feedback to In Progress
  • Parent task deleted (#111929)

As discussed, please check whether the mentioned ACs are already resolved or covered by existing tickets.

Actions #9

Updated by openqa_review 7 months ago

  • Due date set to 2024-06-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 7 months ago

  • Copied to coordination #161735: [epic] Better error detection on GRE tunnel misconfiguration added
Actions #11

Updated by okurz 7 months ago

  • Subject changed from multi-machine test network issues reported 2024-06-03 to multi-machine test network issues reported 2024-06-03 due to missing content in the salt mine size:S
  • Description updated (diff)
Actions #12

Updated by mkittler 6 months ago

  • Status changed from In Progress to Resolved

The fail ratio currently looks good on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=24. We identified that the salt mine sometimes not being correctly populated was the culprit, and #161735 should be enough to continue, although we should probably create at least one concrete ticket for how to continue (but it is probably not within the scope of this ticket to make that decision).
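
For illustration only, here is a minimal sketch of the kind of check that could flag an incompletely populated salt mine before any config is derived from it. It is not taken from our salt states: the function name, the worker names and the shape of the mine snapshot (a mapping from minion ID to whatever data the mine returned for it) are all assumptions made for the example.

```python
from typing import Iterable, Mapping


def find_missing_mine_entries(
    expected_workers: Iterable[str],
    mine_snapshot: Mapping[str, object],
) -> list[str]:
    """Return the workers whose mine entry is absent or empty."""
    missing = []
    for worker in expected_workers:
        entry = mine_snapshot.get(worker)
        # Treat None, "", [] and {} alike: none of them can be used
        # to derive a config entry for that worker.
        if not entry:
            missing.append(worker)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical worker names and addresses, for illustration only.
    expected = ["worker-a", "worker-b", "worker-c"]
    snapshot = {"worker-a": ["192.0.2.10"], "worker-b": []}
    print(find_missing_mine_entries(expected, snapshot))
```

A non-empty result could be used to abort or warn instead of silently generating a config with missing workers, which is the direction #161735 is about.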
