action #135035: Optionally restrict multimachine jobs to a single worker - openQA Project (public) - openSUSE Project Management Tool

action #135035

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Optionally restrict multimachine jobs to a single worker

Added by apappas over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Feature requests
Target version: Ready
Start date: 2023-09-01
Due date:
% Done:
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. The workaround found so far is to force all jobs of a multi-machine test to run on the same worker.

The purpose of this ticket is to have all multi-machine runs be scheduled on the same well-configured worker.

The change doesn't need to be permanent but it does need to be applied until proper networking between multi-machine nodes can be guaranteed.

Acceptance Criteria

AC1: If configured accordingly, all jobs of a multi-machine parallel cluster must be scheduled to run on the same worker host
AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
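
The two acceptance criteria can be illustrated with a minimal scheduling sketch. This is not openQA's actual scheduler code. The function `assign_cluster`, the `free_slots` mapping, and the `one_host_only` switch are hypothetical names chosen for illustration. The sketch only shows how the optional restriction changes which assignments are allowed.

```python
from typing import Dict, List, Optional


def assign_cluster(
    jobs: List[str],
    free_slots: Dict[str, int],
    one_host_only: bool = False,
) -> Optional[Dict[str, str]]:
    """Return a job -> host mapping for a parallel cluster, or None if it must wait.

    `free_slots` maps a worker host name to its number of idle worker slots.
    `one_host_only` is a hypothetical switch standing in for the requested option.
    """
    needed = len(jobs)
    if one_host_only:
        # AC1: every job of the cluster must land on the same host.
        for host in sorted(free_slots):
            if free_slots[host] >= needed:
                return {job: host for job in jobs}
        return None

    # AC2 (default): the cluster may span several hosts.
    if sum(free_slots.values()) < needed:
        return None
    assignment: Dict[str, str] = {}
    remaining = list(jobs)
    for host in sorted(free_slots):
        for _ in range(free_slots[host]):
            if not remaining:
                return assignment
            assignment[remaining.pop(0)] = host
    return assignment
```

With one free slot on each of two hosts, a two-job cluster can start in the default mode but has to wait when restricted to a single host, because no host can take both jobs on its own.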

Suggestions

Have a look at https://github.com/Martchus/openQA/pull/new/dependency-pinning for how this could be enabled and documented.

Related issues 6 (3 open — 3 closed)

Related to openQA Infrastructure (public) - action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry (Resolved, nicksinger, 2023-08-15)

Related to openQA Infrastructure (public) - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M (Resolved, mkittler)

Related to openQA Project (public) - coordination #157144: [epic] Groups of worker classes: Regions, locations, etc. (New, 2024-03-13)

Related to openQA Project (public) - action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)

Copied to openQA Project (public) - action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location (New)

Copied to openQA Project (public) - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)

Updated by apappas over 1 year ago

Related to action #134282: [tools] network protocols failures on multimachine tests on HA/SAP size:S auto_review:"no candidate.*iscsi-target-overview-service-tab|yast2.+firewall.+services.+add.+zone":retry added

Updated by apappas over 1 year ago

Description updated (diff)

Updated by okurz over 1 year ago

Category set to Feature requests
Target version set to future

Good idea for a workaround. The workaround for the workaround is to pin to a specific machine.

Updated by apappas over 1 year ago

> The workaround for the workaround is to pin to a specific machine.

I do not understand.
We will pin to a specific machine as a bridge until this is implemented.

> Target version set to future

Can we get either a concrete ETA or a rejection?

Updated by okurz over 1 year ago

apappas wrote in #note-4:

> > Target version set to future
>
> Can we get either a concrete ETA or a rejection?

The ETA is: Certainly not within the next days or weeks. I don't see why we should reject the feature request. It's a good idea and valid for openQA. The team just doesn't have capacity to work on that anytime soon.

Updated by asmorodskyi over 1 year ago

I want to remind you that this is actually a rollback to the state we had some years ago, when MM tests were ALWAYS running on the same host. That dramatically increased the queue wait time for MM tests: with a mixed queue of MM jobs and single jobs, it is hard to hit the condition where two worker instances on the same worker host are free at the same time. GRE bridges were introduced to resolve this problem. If we drop them now, we will get back to the old problem, so we need to make sure that we address the old problem before switching to this mode.
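
The queueing concern can be made concrete with a rough back-of-the-envelope model. It is purely illustrative and not based on measured openQA data. It assumes every worker slot is free independently with the same probability `p`, which real queues will not satisfy exactly, and all function and parameter names are made up for this sketch.

```python
from math import comb


def p_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent slots are free."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def p_start_any_host(k: int, hosts: int, slots_per_host: int, p: float) -> float:
    """Cluster of k jobs may span hosts: k free slots anywhere are enough."""
    return p_at_least(k, hosts * slots_per_host, p)


def p_start_one_host(k: int, hosts: int, slots_per_host: int, p: float) -> float:
    """Cluster of k jobs must fit on one host: some host needs k free slots."""
    p_host = p_at_least(k, slots_per_host, p)
    return 1 - (1 - p_host) ** hosts


if __name__ == "__main__":
    # Assumed example numbers: 4 hosts, 4 slots each, each slot free 20% of the time.
    print(p_start_any_host(2, 4, 4, 0.2))
    print(p_start_one_host(2, 4, 4, 0.2))
```

Under this model the single-host probability can never exceed the cross-host one, since any state that lets the cluster start on one host also lets it start across hosts. The size of the gap depends on how busy the workers are.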

Updated by okurz over 1 year ago

Target version changed from future to Tools - Next
Parent task set to #111929

Updated by okurz over 1 year ago

Description updated (diff)

#10

Updated by okurz over 1 year ago

Copied to action #152737: Support for triggering parallel (multi-machine-)tests within a configured zone or location added

#11

Updated by mkittler over 1 year ago

Description updated (diff)

#12

Updated by okurz over 1 year ago

Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added

#13

Updated by okurz over 1 year ago

Description updated (diff)

#14

Updated by okurz over 1 year ago

Subject changed from [tools]Pin multimachine jobs to a single worker to Optionally restrict multimachine jobs to a single worker

#15

Updated by okurz about 1 year ago

Target version changed from Tools - Next to Ready

#16

Updated by mkittler about 1 year ago

Assignee set to mkittler

#17

Updated by mkittler about 1 year ago

Status changed from New to In Progress

#18

Updated by openqa_review about 1 year ago

Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

#19

Updated by mkittler about 1 year ago

Status changed from In Progress to Feedback

PR: https://github.com/os-autoinst/openQA/pull/5536

The PR is ready from my side and should be good enough for all the clusters/worker classes we have in production. I'm only waiting for reviews.

#20

Updated by okurz about 1 year ago

Copied to action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added

#21

Updated by okurz about 1 year ago

Related to coordination #157144: [epic] Groups of worker classes: Regions, locations, etc. added

#22

Updated by mkittler about 1 year ago

Status changed from Feedback to Resolved

The PR was merged yesterday and it fulfills the ACs. That's not the end of the story (see https://github.com/os-autoinst/openQA/pull/5536#issuecomment-2022848509), but I would resolve this ticket now, considering we have the follow-up tickets #158146 and #158143.

#23

Updated by okurz about 1 year ago

Due date deleted (~~2024-04-02~~)

#24

Updated by okurz about 1 year ago

Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added
