action #150869

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M

Added by okurz about 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

As part of #136133, aarch64-o3 was confirmed to be working from FC Basement, but it is not yet capable of running openQA multi-machine tests. As this machine may be the only one able to execute aarch32 jobs, it should be set up for multi-machine tests accordingly.

Acceptance criteria

  • AC1: Only one machine is successfully working on aarch32 o3 multi-machine jobs at a time
  • AC2: Inventory management systems are up-to-date
  • AC3: The machine is not connected to any other o3 machines by GRE tunnels

Suggestions

Further details

Currently openQA schedules MM jobs across all workers with a matching worker class connected to one openQA instance. But if those workers cannot reach each other, which we normally achieve with GRE tunnels, then those jobs fail. And we shouldn't try to connect an ARM/ggardet maintained cloud ARM machine like oss-cobbler-03 to a SUSE internal machine due to security best practices. Hence there should be only one machine at a time for a matching worker class to work on such jobs.


Related issues 4 (0 open, 4 closed)

  • Related to openQA Infrastructure (public) - action #137771: Configure o3 ppc64le multi-machine worker size:M (Resolved, mkittler, 2023-10-11)
  • Related to openQA Project (public) - action #135035: Optionally restrict multimachine jobs to a single worker (Resolved, mkittler, 2023-09-01)
  • Related to openQA Tests (public) - action #159558: network unreachable on aarch64-o3 (Resolved, mkittler, 2024-04-24)
  • Copied from openQA Infrastructure (public) - action #136133: Migrate aarch64.openqanet.opensuse.org to FC Basement size:M (Resolved, dheidler)

Actions #1

Updated by okurz about 1 year ago

  • Copied from action #136133: Migrate aarch64.openqanet.opensuse.org to FC Basement size:M added
Actions #2

Updated by okurz about 1 year ago

  • Target version changed from future to Ready
Actions #3

Updated by okurz about 1 year ago

  • Related to action #137771: Configure o3 ppc64le multi-machine worker size:M added
Actions #4

Updated by tinita about 1 year ago

  • Subject changed from Ensure multi-machine tests work on aarch64-o3 to Ensure multi-machine tests work on aarch64-o3 size:M
  • Status changed from New to Workable
Actions #5

Updated by okurz about 1 year ago

  • Priority changed from Normal to High
Actions #6

Updated by livdywan about 1 year ago

  • Subject changed from Ensure multi-machine tests work on aarch64-o3 size:M to Ensure multi-machine tests work on aarch64-o3
  • Status changed from Workable to New

We looked at it in the infra daily. It was estimated before the most recent discoveries and we should re-estimate it

Actions #7

Updated by okurz about 1 year ago

livdywan wrote in #note-6:

We looked at it in the infra daily. It was estimated before the most recent discoveries and we should re-estimate it

With what we learned recently, I suggest not connecting this machine to other o3 workers in PRG2 over GRE tunnels and enabling tap only on specific aarch32 worker classes, so as not to interfere with aarch64 multi-machine jobs, which should run solely in PRG2.

Actions #8

Updated by okurz about 1 year ago

  • Subject changed from Ensure multi-machine tests work on aarch64-o3 to Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #9

Updated by okurz about 1 year ago

  • Related to action #135035: Optionally restrict multimachine jobs to a single worker added
Actions #10

Updated by okurz about 1 year ago · Edited

  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal
  • Target version changed from Ready to Tools - Next

Asked ggardet in
https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$mJDYph_Njx_n8YQHTLkyMIG7z4SLtnOUDL3Na_M-WJs

hi, in light of #150869 I found that oss-cobbler-o3 is already happily working on arm7/aarch32 multi-machine jobs. Our plan was to bring in a SUSE machine back to that task but if I just add "tap" on that machine then clusters would try to schedule tests across SUSE-internal machine + oss-cobbler-03 but as they can't connect to each other that wouldn't work. So I see the following options: 1. Implement a feature in openQA first so that clusters can be scheduled on one machine each at a time and then add all fitting machines with "tap" or 2. Remove "tap" from oss-cobbler-03 and add to SUSE-internal machine instead only. My preference would be 1. WDYT?

Unless I receive a response asking for option 2, I am blocking on #135035.

Actions #11

Updated by okurz 10 months ago

  • Target version changed from Tools - Next to future
Actions #12

Updated by okurz 9 months ago

  • Category set to Feature requests
  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)
  • Target version changed from future to Ready

Blockers resolved.

Actions #13

Updated by mkittler 9 months ago

  • Description updated (diff)
Actions #14

Updated by mkittler 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #15

Updated by mkittler 9 months ago · Edited

I added PARALLEL_ONE_HOST_ONLY=1 to relevant machine definitions (those also defining WORKER_CLASS=qemu_aarch32). So far there is only one other aarch32-capable host (oss-cobbler-03) so this won't make things worse for scheduling jobs on existing worker slots.
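To illustrate the idea behind restricting a parallel cluster to one host, here is a minimal Python sketch. It is not openQA's scheduler code; the function name `pick_single_host` and the list of free slots are hypothetical and only show the constraint that all jobs of a cluster must land on slots of the same host.

```python
from collections import defaultdict


def pick_single_host(free_slots, cluster_size):
    """Return slots on ONE host that can take the whole cluster, or None.

    free_slots: list of (host, slot_id) tuples for idle worker slots.
    This mirrors the intent of PARALLEL_ONE_HOST_ONLY=1 only conceptually.
    """
    by_host = defaultdict(list)
    for host, slot_id in free_slots:
        by_host[host].append(slot_id)
    # Iterate hosts in a stable order so the result is deterministic.
    for host in sorted(by_host):
        slots = by_host[host]
        if len(slots) >= cluster_size:
            return [(host, s) for s in slots[:cluster_size]]
    # No single host has enough free slots: keep the cluster waiting
    # instead of spreading it over hosts that cannot reach each other.
    return None


# Hypothetical free slots; the host names are taken from this ticket.
free = [("aarch64-o3", 1), ("oss-cobbler-03", 1), ("oss-cobbler-03", 2)]
assignment = pick_single_host(free, cluster_size=2)
print(assignment)
```

With these made-up inputs the two-job cluster ends up entirely on one host, and a three-job cluster would not be assigned at all.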

There are no existing aarch32 jobs in the queue so I suppose I can move on with the tap setup itself.

EDIT: Ran the setup script and also took care of the wicked config of the eth interface (which is currently not covered by the script; the zone was left public). After a reboot everything seemed to be still in place so I scheduled test jobs:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4094320 WORKER_CLASS=aarch64-o3 {BUILD,TEST}+=-poo150869
Cloning parents of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32
Cloning children of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32
Cloning parents of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_sut@aarch32
2 jobs have been created:
 - opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32 -> https://openqa.opensuse.org/tests/4101001
 - opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_sut@aarch32 -> https://openqa.opensuse.org/tests/4101000
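For readers unfamiliar with the `{BUILD,TEST}+=-poo150869` argument above: the shell expands the braces into two separate arguments, and `+=` appends to the cloned job's existing value instead of replacing it. The Python sketch below only imitates that behaviour for illustration; `expand_braces` and `apply_overrides` are hypothetical helpers, not part of openqa-clone-job, and the original WORKER_CLASS value is an assumption.

```python
import re


def expand_braces(word):
    """Expand one leading {a,b} group like the shell does for this simple case."""
    match = re.fullmatch(r"\{([^{}]+)\}(.*)", word)
    if not match:
        return [word]
    names, rest = match.groups()
    return [name + rest for name in names.split(",")]


def apply_overrides(settings, args):
    """Apply KEY=value (replace) and KEY+=suffix (append) arguments."""
    result = dict(settings)
    for arg in args:
        if "+=" in arg:
            key, suffix = arg.split("+=", 1)
            result[key] = result.get(key, "") + suffix
        else:
            key, value = arg.split("=", 1)
            result[key] = value
    return result


expanded = expand_braces("{BUILD,TEST}+=-poo150869")
# Settings of the original job; WORKER_CLASS here is an assumed value.
original = {"BUILD": "20240418", "TEST": "wicked_basic_ref",
            "WORKER_CLASS": "qemu_aarch32,tap"}
cloned = apply_overrides(original, ["WORKER_CLASS=aarch64-o3"] + expanded)
print(expanded)
print(cloned)
```

The suffix keeps the test jobs in their own build and scenario, so they do not mix with production results.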

EDIT: The test jobs passed. I'll enable the worker class tomorrow.

Actions #16

Updated by openqa_review 9 months ago

  • Due date set to 2024-05-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by mkittler 9 months ago

I have now enabled the tap worker class on the worker so production jobs can run. This is the first time we use the PARALLEL_ONE_HOST_ONLY=1 feature in production, so I'll have a look at the following scenarios after the next build (those seem to be the only MM scenarios we schedule for aarch32):

  • https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=wicked_basic_ref&version=Tumbleweed
  • https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=yast2_nfs_v3_server&version=Tumbleweed
  • https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=yast2_nfs_v4_server&version=Tumbleweed

Actions #18

Updated by mkittler 9 months ago

  • Status changed from In Progress to Feedback
Actions #19

Updated by ggardet_arm 9 months ago

Maybe the config update broke something, since some tests are failing due to a network issue - see https://progress.opensuse.org/issues/159558

Actions #20

Updated by okurz 9 months ago

Actions #21

Updated by mkittler 9 months ago

Just for the record, my work on this ticket was not the cause of #159558.

It doesn't look like any of the jobs of the last build were scheduled on aarch64-o3. I'll wait for the next build. (I have already created test jobs and they worked so we might also just consider this ticket resolved.)

Actions #22

Updated by ggardet_arm 9 months ago

https://openqa.opensuse.org/tests/4106737 fails with:

[2024-04-25T11:16:46.532901+02:00] [info] [pid:46959] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  Open vSwitch command 'set_vlan' with arguments 'tap13 5' failed: 'tap13' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
Actions #23

Updated by mkittler 9 months ago

I've just reverted adding the tap worker class. I noticed myself that this isn't going to cut it because 64-bit tap jobs might now also run on aarch64-o3 - which of course breaks, as @ggardet_arm wrote in the previous comment.

Not sure how to allow scheduling those jobs then. We would probably need #158146. Or we also add PARALLEL_ONE_HOST_ONLY=1 to the 64-bit arm worker classes.
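The side effect described here can be sketched as follows, assuming a job is eligible for a worker slot when every class in the job's WORKER_CLASS is offered by that slot. The class sets below are hypothetical and not the actual worker configuration of aarch64-o3.

```python
def slot_can_run(job_worker_class, slot_worker_classes):
    """True if the slot offers every class the job asks for (assumed rule)."""
    required = {c.strip() for c in job_worker_class.split(",") if c.strip()}
    return required <= set(slot_worker_classes)


# Hypothetical class sets of one aarch64-o3 slot before and after adding "tap".
slot_before = {"aarch64-o3", "qemu_aarch64", "qemu_aarch32"}
slot_after = slot_before | {"tap"}

job32 = "qemu_aarch32,tap"  # the aarch32 MM jobs we want on this host
job64 = "qemu_aarch64,tap"  # 64-bit MM jobs that should stay in PRG2

print(slot_can_run(job32, slot_after))   # intended effect
print(slot_can_run(job64, slot_before))  # not eligible before the change
print(slot_can_run(job64, slot_after))   # unintended: now eligible as well
```

Under this assumption, adding a plain "tap" class cannot distinguish 32-bit from 64-bit multi-machine jobs, which is why a per-class restriction or a more specific class is needed.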

Actions #24

Updated by okurz 9 months ago

  • Due date deleted (2024-05-08)
  • Status changed from Feedback to Blocked
  • Priority changed from Normal to Low
  • Target version changed from Ready to Tools - Next
  • Parent task changed from #129280 to #111929

Added #158146 to next, blocking on that

Actions #25

Updated by mkittler 7 months ago

  • Status changed from Blocked to In Progress

The PR for #158146 has been deployed on o3 so I can try to use it on aarch64-o3. It is currently offline despite being powered on. So I'm going to recover it.

Actions #26

Updated by mkittler 7 months ago · Edited

The machine is up and running again. The tap setup still seems good, but before enabling the tap worker class I'll run some test jobs:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4311446 WORKER_CLASS=aarch64-o3 {BUILD,TEST}+=-poo150869
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_sut@aarch64
2 jobs have been created:
 - opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64 -> https://openqa.opensuse.org/tests/4315667
 - opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_sut@aarch64 -> https://openqa.opensuse.org/tests/4315666
Actions #27

Updated by mkittler 7 months ago · Edited

The test jobs passed and other production jobs that ran on aarch64-o3 also seem to generally run as expected. So I enabled the tap worker class on the aarch64-o3 slots and mentioned it in the openQA chat on Matrix.

Actions #28

Updated by okurz 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #29

Updated by mkittler 7 months ago

So far only one production job ran on the worker¹ but it failed on an unrelated typing issue (https://openqa.opensuse.org/tests/4318525#step/remote_controller/19).


¹

with mm_jobs as (
  select distinct id, result, state, t_finished,
    (select host from workers where id = assigned_worker_id) as worker_host
  from jobs
  left join job_dependencies on (id = child_job_id or id = parent_job_id)
  where dependency = 2
)
select concat('https://openqa.opensuse.org/tests/', id) as url, result, t_finished, worker_host
from mm_jobs
where state = 'done' and worker_host in ('aarch64-o3')
order by id desc
limit 50;
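To see what the query selects, here is a small sketch that runs an equivalent statement against an in-memory SQLite database. The table layout is reduced to the columns the query uses, all rows are made-up sample data, and `concat` is replaced by `||` for SQLite; none of this reflects the real o3 database contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table workers (id integer primary key, host text);
create table jobs (id integer primary key, result text, state text,
                   t_finished text, assigned_worker_id integer);
create table job_dependencies (child_job_id integer, parent_job_id integer,
                               dependency integer);
-- made-up sample data
insert into workers values (1, 'aarch64-o3'), (2, 'other-host');
insert into jobs values
  (10, 'passed', 'done', '2024-07-03', 1),
  (11, 'failed', 'done', '2024-07-03', 1),
  (12, 'passed', 'done', '2024-07-03', 2),
  (13, 'passed', 'done', '2024-07-03', 1);
-- 10 and 11 form a parallel pair (dependency = 2, as in the query above);
-- 12 and 13 use a different dependency type
insert into job_dependencies values (10, 11, 2), (12, 13, 1);
""")

query = """
with mm_jobs as (
  select distinct id, result, state, t_finished,
    (select host from workers where workers.id = assigned_worker_id) as worker_host
  from jobs
  left join job_dependencies on (id = child_job_id or id = parent_job_id)
  where dependency = 2
)
select 'https://openqa.opensuse.org/tests/' || id as url, result, t_finished, worker_host
from mm_jobs
where state = 'done' and worker_host in ('aarch64-o3')
order by id desc
limit 50
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only jobs that take part in a dependency of type 2 and finished on aarch64-o3 are listed, newest first.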
Actions #30

Updated by mkittler 7 months ago

  • Status changed from In Progress to Feedback

A couple of new production jobs were running. Some passed, some failed but none incompleted due to a broken tap setup or cross-host scheduling. So I guess this works as expected.

Actions #31

Updated by mkittler 7 months ago

  • Status changed from Feedback to Resolved