action #135035

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Optionally restrict multimachine jobs to a single worker

Added by apappas 6 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2023-09-01
Due date:
% Done: 0%
Estimated time:
Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. As a workaround, all jobs of a multi-machine test can be forced to run on the same worker.

The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.

The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.

Acceptance Criteria

  • AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
  • AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
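
The two criteria can be illustrated with a minimal sketch. This is not openQA's scheduler code: the function name assign_cluster, the same_host_only flag, and the host names are hypothetical and only show the intended behaviour under the assumption that each host reports a number of free worker slots.

```python
from typing import Dict, List, Optional


def assign_cluster(
    jobs: List[str],
    free_slots: Dict[str, int],
    same_host_only: bool = False,
) -> Optional[Dict[str, str]]:
    """Map each job of one parallel cluster to a worker host, or return None.

    Hypothetical illustration only; names and data shapes are assumptions.
    """
    needed = len(jobs)
    if same_host_only:
        # AC1: every job must land on one host, so a single host needs
        # enough free slots for the whole cluster at the same time.
        for host in sorted(free_slots):
            if free_slots[host] >= needed:
                return {job: host for job in jobs}
        return None  # no single host fits; the cluster keeps waiting

    # AC2 (default): the cluster may span hosts, so only the total counts.
    available = [h for h in sorted(free_slots) for _ in range(free_slots[h])]
    if len(available) < needed:
        return None
    return dict(zip(jobs, available))


# One free slot on each of two (hypothetical) hosts.
slots = {"worker-a": 1, "worker-b": 1}
print(assign_cluster(["server", "client"], slots))
# -> {'server': 'worker-a', 'client': 'worker-b'}
print(assign_cluster(["server", "client"], slots, same_host_only=True))
# -> None
```

The second call returns None because no single host has two free slots, even though two slots are free in total. In restricted mode a cluster can only start when one host has enough free slots at once.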

Suggestions


Related issues 3 (2 open, 1 closed)

Related to openQA Infrastructure - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Related to openQA Infrastructure - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M (Blocked, okurz)

Copied to openQA Project - action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location (New)

Actions #1

Updated by apappas 6 months ago

  • Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added
Actions #2

Updated by apappas 6 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 6 months ago

  • Category set to Feature requests
  • Target version set to future

Good idea for a workaround. The workaround for the workaround is to pin to a specific machine.

Actions #4

Updated by apappas 6 months ago

"The workaround for the workaround is to pin to a specific machine."

I do not understand.
We will pin to a specific machine as a bridge until this is implemented.

"Target version set to future"

Can we get either a concrete ETA or a rejection?

Actions #5

Updated by okurz 6 months ago

apappas wrote in #note-4:

Target version set to future
Can we get either a concrete ETA or a rejection?

The ETA is: Certainly not within the next days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have capacity to work on that anytime soon.

Actions #6

Updated by asmorodskyi 6 months ago

I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. This dramatically increased the wait time in the queue for MM tests: with a queue that mixes MM jobs and single jobs, it is hard to hit the condition where two worker instances on the same worker are free. GRE bridges were introduced to resolve this problem. If we drop this now, we will get back to the old problem, so we need to make sure that we address the old problem before switching to this mode.

Actions #8

Updated by okurz 3 months ago

  • Target version changed from future to Tools - Next
  • Parent task set to #111929
Actions #9

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #10

Updated by okurz 2 months ago

  • Copied to action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location added
Actions #11

Updated by mkittler 2 months ago

  • Description updated (diff)
Actions #12

Updated by okurz 2 months ago

  • Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added
Actions #13

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #14

Updated by okurz about 2 months ago

  • Subject changed from [tools]Pin multimachine jobs to a single worker to Optionally restrict multimachine jobs to a single worker