coordination #111929

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

[epic] Stable multi-machine tests covering multiple physical workers

Added by okurz over 2 years ago. Updated 4 months ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-06-03
Due date:
% Done: 77%
Estimated time: (Total: 0.00 h)

Description

Motivation

openQA supports multi-machine tests, including ones that span multiple physical workers, but we have never been able to establish exactly which requirements are necessary for a stable test environment. We should ensure that multi-machine tests covering multiple physical workers are stable.


Subtasks 45 (10 open, 35 closed)

action #111908: Multimachine failures between multiple physical workers (Resolved, okurz, 2022-06-03)
action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)
openQA Infrastructure (public) - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)
openQA Infrastructure (public) - action #135005: Reduce duplication in salt-pillars-openqa openqa/workerconf.sls with advanced YAML/jinja features (New, 2023-09-01)
action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M (Resolved, mkittler)
openQA Infrastructure (public) - action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network (New, 2023-09-18)
action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)
openQA Infrastructure (public) - action #137771: Configure o3 ppc64le multi-machine worker size:M (Resolved, mkittler, 2023-10-11)
action #138698: significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M (Resolved, mkittler, 2023-10-27)
action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M (Resolved, okurz)
openQA Infrastructure (public) - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M (Resolved, mkittler)
action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M (Resolved, mkittler, 2023-11-23)
openQA Infrastructure (public) - action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M (Resolved, nicksinger, 2023-12-05)
openQA Infrastructure (public) - action #152095: [spike solution][timeboxed:8h] Ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes size:S (Resolved, jbaier_cz, 2023-12-05)
openQA Infrastructure (public) - action #152098: [research][timeboxed:10h] Learn more about openvswitch with experimenting together size:S (Workable, 2023-12-05)
openQA Infrastructure (public) - action #152101: Allow salt to properly configure non-production multi-machine workers size:M (Resolved, mkittler, 2023-12-05)
action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)
openQA Infrastructure (public) - action #152557: unexpected routing between PRG1/NUE2+PRG2 (Resolved, okurz)
action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location (New)
action #153769: Better handle changes in GRE tunnel configuration size:M (Resolved, okurz, 2024-01-17)
action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de (Resolved, mkittler, 2024-01-30)
openQA Infrastructure (public) - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)
openQA Infrastructure (public) - action #155200: Periodically running simple ping-check multi-machine tests on ppc64le covering multiple physical hosts on OSD alerting tools team on failures size:M (Workable, 2024-01-30)
openQA Infrastructure (public) - action #155929: Try out rstp_enable=True in openqa/openvswitch.sls size:M (Resolved, dheidler)
action #157534: Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines (Resolved, okurz, 2024-03-19)
openQA Infrastructure (public) - action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration (Blocked, okurz, 2024-03-19)
openQA Infrastructure (public) - action #157738: Use rstp_enable=True on o3 as well (New)
action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)
action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M (Resolved, mkittler, 2024-03-27)
action #160628: periodic multi-machine OSD test in https://gitlab.suse.de/openqa/scripts-ci/ does not trigger any jobs size:S (Resolved, mkittler, 2024-01-30)
openQA Infrastructure (public) - action #160646: multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M (Resolved, ybonatakis, 2024-05-21)
openQA Infrastructure (public) - action #160652: Secondary TAP worker class in different zones size:S (Resolved, ybonatakis)
openQA Infrastructure (public) - action #160826: Optimize gre_tunnel_preup.sh generation jinja template size:S (Workable, 2024-05-21)
openQA Infrastructure (public) - action #161393: gitlab CI jobs do not execute salt-lint and do not fail on missing "find" executable (Resolved, 2024-06-03)
openQA Infrastructure (public) - action #161396: gitlab CI jobs do not fail on invalid YAML size:S (Resolved, ybonatakis, 2024-06-03)
openQA Infrastructure (public) - action #162044: Ensure proper indentation yamllint checks in our salt-states-openqa+salt-pillars-openqa size:S (Resolved, jbaier_cz, 2024-06-03)
action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry (Resolved, okurz, 2024-06-15)
openQA Infrastructure (public) - action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized (Resolved, okurz, 2024-06-17)
openQA Infrastructure (public) - action #162455: Secondary TAP worker class instead of "tap_poo…" on closed tickets size:S (Resolved, okurz, 2024-06-18)
openQA Infrastructure (public) - action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S (Resolved, okurz, 2024-06-19)
openQA Infrastructure (public) - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) (Resolved, livdywan)
openQA Infrastructure (public) - action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S (Resolved, okurz, 2024-06-20)
openQA Infrastructure (public) - action #162605: [FIRING:1] CPU load alert, should be "system load" (Resolved, okurz, 2024-06-20)
openQA Infrastructure (public) - action #165192: Enable all OSD PRG2 x86_64 machines for multi-machine use again (New)
