action #135035

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Optionally restrict multimachine jobs to a single worker

Added by apappas 8 months ago. Updated 12 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2023-09-01
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. A workaround has been found: forcing all jobs of a multi-machine test to run on the same worker.

The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.

The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.

Acceptance Criteria

  • AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
  • AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
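
The two criteria can be illustrated with a minimal sketch. This is not openQA's scheduler code; the function `assign_cluster`, the flag `one_host_only`, and the `(host, slot)` representation of free worker slots are hypothetical names chosen only for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Slot = Tuple[str, str]  # (worker host, slot id); hypothetical representation


def assign_cluster(
    jobs: List[str],
    free_slots: List[Slot],
    one_host_only: bool = False,  # hypothetical flag, not an openQA setting name
) -> Optional[Dict[str, Slot]]:
    """Return a job -> slot mapping, or None if the cluster has to keep waiting."""
    if not one_host_only:
        # AC2: by default any free slots will do, even on different hosts.
        if len(free_slots) < len(jobs):
            return None
        return dict(zip(jobs, free_slots))

    # AC1: all jobs of the cluster must land on one and the same host.
    by_host: Dict[str, List[Slot]] = defaultdict(list)
    for host, slot_id in free_slots:
        by_host[host].append((host, slot_id))
    for slots in by_host.values():
        if len(slots) >= len(jobs):
            return dict(zip(jobs, slots))
    return None  # no single host currently has enough free slots


cluster = ["server", "client"]
free = [("hostA", "1"), ("hostB", "1"), ("hostB", "2")]
print(assign_cluster(cluster, free))
print(assign_cluster(cluster, free, one_host_only=True))
```

With the restriction enabled, a cluster has to wait whenever no single host has enough free slots at the same time, even if enough slots are free in total.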

Suggestions


Related issues 5 (4 open, 1 closed)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Related to openQA Infrastructure - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M (Blocked, mkittler)

Related to openQA Project - coordination #157144: [epic] Groups of worker classes: Regions, locations, etc. (New, 2024-03-13)

Copied to openQA Project - action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location (New)

Copied to openQA Project - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)

Actions #1

Updated by apappas 8 months ago

  • Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #2

Updated by apappas 8 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 8 months ago

  • Category set to Feature requests
  • Target version set to future

Good idea for a workaround. The workaround for the workaround is to pin to a specific machine.

Actions #4

Updated by apappas 8 months ago

The workaround for the workaround is to pin to a specific machine.

I do not understand.
We will pin to a specific machine as a bridge until this is implemented.

Target version set to future

Can we get either a concrete ETA or a rejection?

Actions #5

Updated by okurz 8 months ago

apappas wrote in #note-4:

Target version set to future
Can we get either a concrete ETA or a rejection?

The ETA is: Certainly not within the next days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have capacity to work on that anytime soon.

Actions #6

Updated by asmorodskyi 8 months ago

I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. This dramatically increased the wait time in the queue for MM tests: with a mixed queue of MM jobs and single jobs, it is hard to catch the moment when two worker instances on the same worker host are free. GRE bridges were introduced to resolve this problem. If we drop them now, we will get back to the old problem, so we need to make sure that we address the old problem before switching to this mode.

Actions #8

Updated by okurz 5 months ago

  • Target version changed from future to Tools - Next
  • Parent task set to #111929
Actions #9

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #10

Updated by okurz 4 months ago

  • Copied to action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location added
Actions #11

Updated by mkittler 4 months ago

  • Description updated (diff)
Actions #12

Updated by okurz 4 months ago

  • Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added
Actions #13

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #14

Updated by okurz 4 months ago

  • Subject changed from [tools]Pin multimachine jobs to a single worker to Optionally restrict multimachine jobs to a single worker
Actions #15

Updated by okurz about 1 month ago

  • Target version changed from Tools - Next to Ready
Actions #16

Updated by mkittler about 1 month ago

  • Assignee set to mkittler
Actions #17

Updated by mkittler about 1 month ago

  • Status changed from New to In Progress
Actions #18

Updated by openqa_review about 1 month ago

  • Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #19

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/5536

The PR is ready from my side and should be good enough for all the clusters/worker-classes we have in production. I'm only waiting for reviews.

Actions #20

Updated by okurz about 1 month ago

  • Copied to action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
Actions #21

Updated by okurz about 1 month ago

Actions #22

Updated by mkittler 30 days ago

  • Status changed from Feedback to Resolved

The PR was merged yesterday and it fulfills the ACs. That's not the end of the story (see https://github.com/os-autoinst/openQA/pull/5536#issuecomment-2022848509) but I would resolve this ticket now considering we have the follow-up tickets #158146 and #158143.

Actions #23

Updated by okurz 12 days ago

  • Due date deleted (2024-04-02)