coordination #111929

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

[epic] Stable multi-machine tests covering multiple physical workers

Added by okurz about 2 years ago. Updated 8 days ago.

Status: Blocked
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-06-03
Due date:
% Done: 69%
Estimated time: (Total: 0.00 h)

Description

Motivation

openQA supports multi-machine tests, including tests that span multiple physical workers, but we have never been able to determine exactly which requirements are necessary for a stable test environment. We should ensure that multi-machine tests covering multiple physical workers are stable.


Subtasks 44 (13 open, 31 closed)

action #111908: Multimachine failures between multiple physical workers (Resolved, okurz, 2022-06-03)
action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)
openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)
openQA Infrastructure - action #135005: Reduce duplication in salt-pillars-openqa openqa/workerconf.sls with advanced YAML/jinja features (New, 2023-09-01)
action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
action #135914: Extend/add initial validation steps and "best practices" for multi-machine test setup/debugging to openQA documentation size:M (Resolved, mkittler)
openQA Infrastructure - action #135944: Implement a constantly running monitoring/debugging VM for the multi-machine network (New, 2023-09-18)
action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)
openQA Infrastructure - action #137771: Configure o3 ppc64le multi-machine worker size:M (Resolved, mkittler, 2023-10-11)
action #138698: significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M (Resolved, mkittler, 2023-10-27)
action #139136: Conduct "lessons learned" with Five Why analysis for "test fails in iscsi_client due to salt 'host'/'nodename' confusion" size:M (Resolved, okurz)
openQA Infrastructure - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M (Resolved, mkittler)
action #151310: [regression] significant increase of parallel_failed+failed since 2023-11-21 size:M (Resolved, mkittler, 2023-11-23)
openQA Infrastructure - action #152092: Handle all package downgrades in OSD infrastructure properly in salt size:M (Resolved, nicksinger, 2023-12-05)
openQA Infrastructure - action #152095: [spike solution][timeboxed:8h] Ping over GRE tunnels and TAP devices and openvswitch outside a VM with differing packet sizes size:S (Resolved, jbaier_cz, 2023-12-05)
openQA Infrastructure - action #152098: [research][timeboxed:10h] Learn more about openvswitch with experimenting together size:S (Workable, 2023-12-05)
openQA Infrastructure - action #152101: Allow salt to properly configure non-production multi-machine workers size:M (Resolved, mkittler, 2023-12-05)
action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)
openQA Infrastructure - action #152557: unexpected routing between PRG1/NUE2+PRG2 (Resolved, okurz)
action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location (New)
action #153769: Better handle changes in GRE tunnel configuration size:M (Resolved, okurz, 2024-01-17)
action #154552: [ppc64le] test fails in iscsi_client - zypper reports Error Message: Could not resolve host: openqa.suse.de (Resolved, mkittler, 2024-01-30)
openQA Infrastructure - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)
openQA Infrastructure - action #155200: Periodically running simple ping-check multi-machine tests on ppc64le covering multiple physical hosts on OSD alerting tools team on failures size:M (Workable, 2024-01-30)
openQA Infrastructure - action #155929: Try out rstp_enable=True in openqa/openvswitch.sls size:M (Resolved, dheidler)
action #157534: Multi-Machine Job fails in suseconnect_scc due to worker class misconfiguration when we introduced prg2e machines (Resolved, okurz, 2024-03-19)
openQA Infrastructure - action #157606: Prevent missing gre tunnel connections in our salt states due to misconfiguration (New, 2024-03-19)
openQA Infrastructure - action #157738: Use rstp_enable=True on o3 as well (New)
action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)
action #158146: Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M (Resolved, mkittler, 2024-03-27)
action #160628: periodic multi-machine OSD test in https://gitlab.suse.de/openqa/scripts-ci/ does not trigger any jobs size:S (Resolved, mkittler, 2024-01-30)
openQA Infrastructure - action #160646: multiple multi-machine test failures, no GRE tunnels are setup between machines anymore at all size:M (Resolved, ybonatakis, 2024-05-21)
openQA Infrastructure - action #160652: Secondary TAP worker class in different zones size:S (Resolved, ybonatakis)
openQA Infrastructure - action #160826: Optimize gre_tunnel_preup.sh generation jinja template (New, 2024-05-21)
openQA Infrastructure - action #161393: gitlab CI jobs do not execute salt-lint and do not fail on missing "find" executable (Resolved, 2024-06-03)
openQA Infrastructure - action #161396: gitlab CI jobs do not fail on invalid YAML size:S (Resolved, ybonatakis, 2024-06-03)
openQA Infrastructure - action #162044: Ensure proper indendation yamllint checks in our salt-states-openqa+salt-pillars-openqa (New, 2024-06-03)
action #162320: multi-machine test failures 2024-06-14+, auto_review:"ping with packet size 100 failed.*can be GRE tunnel setup issue":retry (Resolved, okurz, 2024-06-15)
openQA Infrastructure - action #162374: Limit number of OSD PRG2 x86_64 tap multi-machine workers until stabilized (Feedback, okurz, 2024-06-17)
openQA Infrastructure - action #162455: Secondary TAP worker class instead of "tap_poo…" on closed tickets size:S (Resolved, okurz, 2024-06-18)
openQA Infrastructure - action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S (Resolved, okurz, 2024-06-19)
openQA Infrastructure - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry (Blocked, okurz)
openQA Infrastructure - action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S (Blocked, okurz, 2024-06-20)
openQA Infrastructure - action #162605: [FIRING:1] CPU load alert, should be "system load" (Resolved, okurz, 2024-06-20)

Actions #1

Updated by okurz about 2 years ago

  • Target version changed from Ready to future
Actions #2

Updated by okurz almost 2 years ago

  • Parent task changed from #103962 to #112862

Move future ideas to the actual "Future ideas" tracker #112862

Actions #3

Updated by okurz 8 months ago

  • Subtask #134282 added
Actions #4

Updated by okurz 8 months ago

  • Subtask #135914 added
Actions #5

Updated by okurz 8 months ago

  • Subtask #136013 added
Actions #6

Updated by okurz 8 months ago

  • Subtask #135944 added
Actions #7

Updated by okurz 8 months ago

  • Subtask #137771 added
Actions #8

Updated by okurz 8 months ago

  • Subtask #151310 added
Actions #9

Updated by okurz 8 months ago

  • Subtask #138698 added
Actions #10

Updated by okurz 8 months ago

  • Subtask #139136 added
Actions #11

Updated by okurz 8 months ago

  • Subtask #152092 added
Actions #12

Updated by okurz 8 months ago

  • Subtask #152095 added
Actions #13

Updated by okurz 8 months ago

  • Subtask #152098 added
Actions #14

Updated by okurz 8 months ago

  • Subtask #152101 added
Actions #15

Updated by okurz 7 months ago

  • Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #16

Updated by okurz 7 months ago

  • Related to deleted (action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M)
Actions #17

Updated by okurz 7 months ago

  • Subtask #152389 added
Actions #18

Updated by okurz 7 months ago

  • Subtask #135035 added
Actions #19

Updated by okurz 7 months ago

  • Subtask #152557 added
Actions #20

Updated by okurz 7 months ago

  • Subtask #152737 added
Actions #21

Updated by okurz 6 months ago

  • Subtask #153769 added
Actions #22

Updated by okurz 6 months ago

  • Subtask #153880 added
Actions #23

Updated by okurz 6 months ago

  • Subtask deleted (#153880)
Actions #24

Updated by okurz 6 months ago

  • Subtask #154552 added
Actions #25

Updated by okurz 6 months ago

  • Subtask #154624 added
Actions #26

Updated by okurz 5 months ago

  • Subtask #155200 added
Actions #27

Updated by okurz 5 months ago

  • Subtask #155926 added
Actions #28

Updated by okurz 5 months ago

  • Subtask #155929 added
Actions #29

Updated by okurz 5 months ago

  • Subtask deleted (#155926)
Actions #30

Updated by okurz 4 months ago

  • Subtask #157534 added
Actions #31

Updated by okurz 4 months ago

  • Subtask #157606 added
Actions #32

Updated by okurz 4 months ago

  • Subtask #157738 added
Actions #33

Updated by okurz 4 months ago

  • Subtask #158143 added
Actions #34

Updated by okurz 4 months ago

  • Subtask #158146 added
Actions #35

Updated by okurz 3 months ago

  • Subtask #150869 added
Actions #36

Updated by okurz about 2 months ago

  • Subtask #160628 added
Actions #37

Updated by okurz about 2 months ago

  • Subtask #160652 added
Actions #38

Updated by okurz about 2 months ago

  • Subtask #160826 added
Actions #39

Updated by okurz about 2 months ago

  • Subtask #160646 added
Actions #40

Updated by okurz about 2 months ago

  • Subtask #161381 added
Actions #41

Updated by okurz about 2 months ago

  • Subtask #161396 added
Actions #42

Updated by okurz about 2 months ago

  • Subtask #135005 added
Actions #43

Updated by okurz about 2 months ago

  • Subtask deleted (#161381)
Actions #44

Updated by okurz about 2 months ago

  • Subtask #161393 added
Actions #45

Updated by okurz about 1 month ago

  • Subtask #162044 added
Actions #46

Updated by okurz about 1 month ago

  • Subtask #162320 added
Actions #47

Updated by okurz about 1 month ago

  • Subtask #162374 added
Actions #48

Updated by okurz about 1 month ago

  • Subtask #162455 added
Actions #49

Updated by okurz about 1 month ago

  • Subtask #162485 added
Actions #50

Updated by livdywan about 1 month ago

  • Subtask #162518 added
Actions #51

Updated by livdywan about 1 month ago

  • Subtask deleted (#162518)
Actions #52

Updated by okurz 30 days ago

  • Subtask #162596 added
Actions #53

Updated by okurz 30 days ago

  • Subtask #162602 added
Actions #54

Updated by okurz 30 days ago

  • Subtask #162605 added