action #135035

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Optionally restrict multimachine jobs to a single worker

Added by apappas about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2023-09-01
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found: forcing all jobs of a multi-machine test to run on the same worker.

The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.

The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.

Acceptance Criteria

  • AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
  • AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
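
The two criteria can be illustrated with a minimal sketch. This is not the openQA scheduler; the function `assign_cluster`, the flag `one_host_only`, and the host and job names are all hypothetical and only stand in for "configured accordingly" in AC1.

```python
from collections import defaultdict


def assign_cluster(cluster_jobs, free_slots, one_host_only=False):
    """Return {job: (host, slot)} or None if the cluster cannot be placed now.

    cluster_jobs:  job names belonging to one parallel cluster
    free_slots:    (host, slot_id) tuples for idle worker slots
    one_host_only: hypothetical flag standing in for the AC1 configuration
    """
    if not one_host_only:
        # AC2 (default): any free slots will do, regardless of host.
        if len(free_slots) < len(cluster_jobs):
            return None
        return dict(zip(cluster_jobs, free_slots))

    # AC1: group free slots by host and pick one host that fits the whole cluster.
    by_host = defaultdict(list)
    for host, slot in free_slots:
        by_host[host].append((host, slot))
    for host in sorted(by_host):
        slots = by_host[host]
        if len(slots) >= len(cluster_jobs):
            return dict(zip(cluster_jobs, slots))
    return None


jobs = ["server", "client"]
spread = [("worker1", 1), ("worker2", 1)]
packed = [("worker1", 1), ("worker2", 1), ("worker2", 2)]

print(assign_cluster(jobs, spread))                      # one job per host
print(assign_cluster(jobs, spread, one_host_only=True))  # None, no single host fits
print(assign_cluster(jobs, packed, one_host_only=True))  # both jobs on worker2
```

With the restriction enabled, the cluster stays in the queue until one host has enough free slots for every job at once, which is the trade-off that comes with pinning a cluster to one host.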

Suggestions


Related issues 6 (3 open, 3 closed)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Related to openQA Infrastructure - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M (Resolved, mkittler)

Related to openQA Project - coordination #157144: [epic] Groups of worker classes: Regions, locations, etc. (New, 2024-03-13)

Related to openQA Project - action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)

Copied to openQA Project - action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location (New)

Copied to openQA Project - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)

Actions #1

Updated by apappas about 1 year ago

  • Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #2

Updated by apappas about 1 year ago

  • Description updated (diff)
Actions #3

Updated by okurz about 1 year ago

  • Category set to Feature requests
  • Target version set to future

Good idea for a workaround. The workaround for the workaround is to pin to a specific machine.

Actions #4

Updated by apappas about 1 year ago

> The workaround for the workaround is to pin to a specific machine.

I do not understand.
We will pin to a specific machine as a bridge until this is implemented.

> Target version set to future

Can we get either a concrete ETA or a rejection?

Actions #5

Updated by okurz about 1 year ago

apappas wrote in #note-4:

> Target version set to future
> Can we get either a concrete ETA or a rejection?

The ETA is: Certainly not within the next days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have capacity to work on that anytime soon.

Actions #6

Updated by asmorodskyi about 1 year ago

I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. That dramatically increased the queue wait time for MM tests: with a mixed queue of MM jobs and single jobs, it is hard to hit the moment when two worker instances on the same worker are free at the same time. GRE bridges were introduced to resolve this problem. If we drop them now, we will get the old problem back, so we need to make sure we address it before switching to this mode.

Actions #8

Updated by okurz 11 months ago

  • Target version changed from future to Tools - Next
  • Parent task set to #111929
Actions #9

Updated by okurz 11 months ago

  • Description updated (diff)
Actions #10

Updated by okurz 11 months ago

  • Copied to action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location added
Actions #11

Updated by mkittler 11 months ago

  • Description updated (diff)
Actions #12

Updated by okurz 11 months ago

  • Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added
Actions #13

Updated by okurz 11 months ago

  • Description updated (diff)
Actions #14

Updated by okurz 11 months ago

  • Subject changed from [tools]Pin multimachine jobs to a single worker to Optionally restrict multimachine jobs to a single worker
Actions #15

Updated by okurz 8 months ago

  • Target version changed from Tools - Next to Ready
Actions #16

Updated by mkittler 8 months ago

  • Assignee set to mkittler
Actions #17

Updated by mkittler 8 months ago

  • Status changed from New to In Progress
Actions #18

Updated by openqa_review 8 months ago

  • Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #19

Updated by mkittler 8 months ago

  • Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/5536

The PR is ready from my side and should be good enough for all the clusters/worker classes we have in production. I'm only waiting for reviews.

Actions #20

Updated by okurz 8 months ago

  • Copied to action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
Actions #21

Updated by okurz 8 months ago

Actions #22

Updated by mkittler 8 months ago

  • Status changed from Feedback to Resolved

The PR was merged yesterday and it fulfills the ACs. That's not the end of the story (see https://github.com/os-autoinst/openQA/pull/5536#issuecomment-2022848509) but I would resolve this ticket now considering we have the follow-up tickets #158146 and #158143.

Actions #23

Updated by okurz 7 months ago

  • Due date deleted (2024-04-02)
Actions #24

Updated by okurz 7 months ago

  • Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added