action #150869
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens (closed)
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M
Description
Motivation¶
As part of #136133, aarch64-o3 was verified to be working from the FC Basement, but it is not yet able to run openQA multi-machine tests. As this machine may be the only one able to execute aarch32 jobs, it should be set up for multi-machine tests accordingly.
Acceptance criteria¶
- AC1: Only one machine is successfully working on aarch32 o3 multi-machine jobs at a time
- AC2: Inventory management systems are up-to-date
- AC3: The machine is not connected to any other o3 machines by GRE tunnels
Suggestions¶
- Find out why other openQA workers connected to o3 already successfully run aarch32 multi-machine tests, e.g. oss-cobbler-03 in https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=wicked_basic_ref&version=Tumbleweed
- Talk to ggardet to resolve that conflict and ensure that only one machine executes aarch32 MM jobs at a time, as we can't have GRE tunnels to all machines and we don't have a feature yet to limit clusters to a single machine, see #135035. We now have PARALLEL_ONE_HOST_ONLY=1 which could be added to relevant test scenarios (but #158146 is still open)
- Execute os-autoinst-setup-multi-machine on aarch64-o3 and add "tap" to worker instances
- Find a scenario to test the MM setup, e.g. "wicked_basic_sut/ref": https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=wicked_basic_ref&version=Tumbleweed
- Do not add GRE tunnels to other o3 machines as this machine should be the only one working on aarch32 MM jobs
- Inventory management systems are probably already up to date
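The "add tap to worker instances" suggestion above could look like the following workers.ini fragment (a hypothetical sketch; the actual instance numbers and worker classes on aarch64-o3 may differ):

```ini
# /etc/openqa/workers.ini on aarch64-o3 (hypothetical sketch)
[global]
HOST = https://openqa.opensuse.org

# Add "tap" only to aarch32-capable instances so that aarch64 MM jobs
# keep running solely on the PRG2 workers.
[1]
WORKER_CLASS = qemu_aarch32,tap

[2]
WORKER_CLASS = qemu_aarch32,tap
```

After running os-autoinst-setup-multi-machine, the tap devices and the Open vSwitch bridge these instances need should already be in place.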
Further details¶
Currently openQA would schedule MM jobs on all workers with a matching worker class connected to one openQA instance. But if those workers cannot reach each other, which we normally ensure with GRE tunnels, those jobs would fail. And we shouldn't try to connect a ggardet-maintained cloud ARM machine like oss-cobbler-03 to a SUSE-internal machine, due to security best practices. Hence there should be only one machine at a time with a matching worker class working on such jobs.
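The scheduling constraint described here can be sketched in a few lines (hypothetical Python for illustration only, not openQA's actual scheduler code; the slot and host names are made up):

```python
from collections import defaultdict

def pick_slots(cluster_jobs, free_slots, one_host_only):
    """Assign each job of a parallel cluster to a free worker slot.

    free_slots is a list of (slot_id, host) tuples. With one_host_only
    (what PARALLEL_ONE_HOST_ONLY=1 requests) the whole cluster must fit
    on slots of a single host; otherwise slots of unrelated hosts may be
    combined even when they cannot reach each other via GRE tunnels.
    """
    by_host = defaultdict(list)
    for slot_id, host in free_slots:
        by_host[host].append(slot_id)
    if one_host_only:
        # Only hosts that can hold the whole cluster by themselves qualify.
        for slots in by_host.values():
            if len(slots) >= len(cluster_jobs):
                return dict(zip(cluster_jobs, slots))
        return None  # cluster stays unassigned until one host has enough slots
    flat = [s for slots in by_host.values() for s in slots]
    if len(flat) < len(cluster_jobs):
        return None
    return dict(zip(cluster_jobs, flat))
```

With only one free slot per host, a two-job cluster is then not assigned at all under the restriction, instead of being split across hosts that cannot talk to each other.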
Updated by okurz about 1 year ago
- Copied from action #136133: Migrate aarch64.openqanet.opensuse.org to FC Basement size:M added
Updated by okurz 12 months ago
- Related to action #137771: Configure o3 ppc64le multi-machine worker size:M added
Updated by okurz 11 months ago
- Priority changed from Normal to High
bumping prio to High as guillaume_gardet asked about it in https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$HcJaAt5-ztOtreuvUWMDDMuC6ixBUzF8sTt_3NMq5pk
Updated by livdywan 11 months ago
- Subject changed from Ensure multi-machine tests work on aarch64-o3 size:M to Ensure multi-machine tests work on aarch64-o3
- Status changed from Workable to New
We looked at it in the infra daily. It was estimated before the most recent discoveries and we should re-estimate it
Updated by okurz 11 months ago
livdywan wrote in #note-6:
We looked at it in the infra daily. It was estimated before the most recent discoveries and we should re-estimate it
With what we learned recently, I suggest not connecting this machine to other o3 workers in PRG2 over GRE tunnels and only enabling tap on specific aarch32 worker classes, so as not to interfere with aarch64 multi-machine jobs which should run solely in PRG2.
Updated by okurz 11 months ago
- Related to action #135035: Optionally restrict multimachine jobs to a single worker added
Updated by okurz 11 months ago · Edited
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
- Target version changed from Ready to Tools - Next
Asked ggardet in
https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$mJDYph_Njx_n8YQHTLkyMIG7z4SLtnOUDL3Na_M-WJs
hi, in light of #150869 I found that oss-cobbler-03 is already happily working on armv7/aarch32 multi-machine jobs. Our plan was to bring a SUSE machine back for that task, but if I just add "tap" on that machine then clusters would be scheduled across the SUSE-internal machine + oss-cobbler-03, and as they can't connect to each other that wouldn't work. So I see the following options: 1. Implement a feature in openQA first so that a cluster is scheduled on only one machine at a time, and then add "tap" to all fitting machines, or 2. Remove "tap" from oss-cobbler-03 and add it to the SUSE-internal machine only. My preference would be 1. WDYT?
Unless I receive a response asking for option 2, blocking on #135035.
Updated by mkittler 7 months ago · Edited
I added PARALLEL_ONE_HOST_ONLY=1 to relevant machine definitions (those also defining WORKER_CLASS=qemu_aarch32). So far there is only one other aarch32-capable host (oss-cobbler-03), so this won't make things worse for scheduling jobs on existing worker slots.
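As a sketch, a machine definition carrying that setting might contain key/value settings like these in the openQA admin UI (hypothetical values, not the literal o3 configuration):

```ini
# Settings of an aarch32 machine definition (hypothetical sketch)
QEMU=arm
WORKER_CLASS=qemu_aarch32
# Keep all jobs of a parallel cluster on a single worker host:
PARALLEL_ONE_HOST_ONLY=1
```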
There are no existing aarch32 jobs in the queue so I suppose I can move on with the tap setup itself.
EDIT: Ran the setup script and also took care of the wicked config of the eth interface (which is currently not covered by the script; the zone was left public). After a reboot everything seemed to be still in place so I scheduled test jobs:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4094320 WORKER_CLASS=aarch64-o3 {BUILD,TEST}+=-poo150869
Cloning parents of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32
Cloning children of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32
Cloning parents of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_sut@aarch32
2 jobs have been created:
- opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32 -> https://openqa.opensuse.org/tests/4101001
- opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_sut@aarch32 -> https://openqa.opensuse.org/tests/4101000
EDIT: The test jobs passed. I'll enable the worker class tomorrow.
Updated by openqa_review 7 months ago
- Due date set to 2024-05-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 7 months ago
I now enabled the tap worker class on the worker so production jobs can run. This is the first time we use the PARALLEL_ONE_HOST_ONLY=1 feature in production, so I'll have a look at https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=wicked_basic_ref&version=Tumbleweed, https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=yast2_nfs_v3_server&version=Tumbleweed and https://openqa.opensuse.org/tests/latest?arch=arm&distri=opensuse&flavor=NET&machine=aarch32&test=yast2_nfs_v4_server&version=Tumbleweed after the next build. (Those seem to be the only MM scenarios we schedule for aarch32.)
Updated by ggardet_arm 7 months ago
Maybe the config update broke something, since some tests are failing due to a network issue - see https://progress.opensuse.org/issues/159558
Updated by okurz 7 months ago
- Related to action #159558: network unreachable on aarch64-o3 added
Updated by mkittler 7 months ago
Just for the record, my work on this ticket was not the cause of #159558.
It doesn't look like any of the jobs of the last build were scheduled on aarch64-o3. I'll wait for the next build. (I have already created test jobs and they worked so we might also just consider this ticket resolved.)
Updated by ggardet_arm 7 months ago
https://openqa.opensuse.org/tests/4106737 fails with:
[2024-04-25T11:16:46.532901+02:00] [info] [pid:46959] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
Open vSwitch command 'set_vlan' with arguments 'tap13 5' failed: 'tap13' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
Updated by mkittler 7 months ago
I've just reverted adding the tap worker class. I noticed myself that this isn't going to cut it, because 64-bit tap jobs might now also run on aarch64-o3 - which of course breaks, as @ggardet_arm wrote in the previous comment.
Not sure how to allow scheduling those jobs then. We would probably need #158146. Or we add PARALLEL_ONE_HOST_ONLY=1 to the 64-bit arm worker classes as well.
Updated by mkittler 5 months ago · Edited
The machine is up and running again. The tap setup still seems good, but before enabling the tap worker class I'll run some test jobs:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4311446 WORKER_CLASS=aarch64-o3 {BUILD,TEST}+=-poo150869
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_sut@aarch64
2 jobs have been created:
- opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64 -> https://openqa.opensuse.org/tests/4315667
- opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_sut@aarch64 -> https://openqa.opensuse.org/tests/4315666
Updated by mkittler 5 months ago
So far only one production job ran on the worker¹ but it failed on an unrelated typing issue (https://openqa.opensuse.org/tests/4318525#step/remote_controller/19).
¹ via the following SQL query:
with mm_jobs as (
    select distinct id, result, state, t_finished,
           (select host from workers where id = assigned_worker_id) as worker_host
    from jobs
    left join job_dependencies on (id = child_job_id or id = parent_job_id)
    where dependency = 2
)
select concat('https://openqa.opensuse.org/tests/', id) as url, result, t_finished, worker_host
from mm_jobs
where state = 'done' and worker_host in ('aarch64-o3')
order by id desc limit 50;