action #150869
closed
openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M
Added by okurz about 1 year ago.
Updated 6 months ago.
Category:
Feature requests
Description
Motivation
As part of #136133 aarch64-o3 was ensured to be working from FC Basement but is not yet capable of running openQA multi-machine tests. As this machine may be the only one able to execute aarch32 jobs, it should be set up for multi-machine tests accordingly.
Acceptance criteria
- AC1: Only one machine is successfully working on aarch32 o3 multi-machine jobs at a time
- AC2: Inventory management systems are up-to-date
- AC3: The machine is not connected to any other o3 machines by GRE tunnels
Suggestions
Further details
Currently openQA would schedule MM jobs for all workers with a matching worker class connected to one openQA instance. But if those workers cannot reach each other, which we normally achieve with GRE tunnels, those jobs would fail. And we shouldn't try to connect an ARM/ggardet-maintained cloud ARM machine like oss-cobbler-03 to a SUSE-internal machine due to security best practices. Hence there should be only one machine at a time with a matching worker class working on such jobs.
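For context, workers that run multi-machine jobs together normally reach each other through GRE tunnels between their Open vSwitch bridges. A minimal sketch of that setup, which is exactly what this ticket avoids for this machine (bridge name, port name and remote IP are placeholders following the usual openQA MM setup):
# on each MM worker host, add a GRE port to the OVS bridge pointing at the other host
ovs-vsctl add-port br1 gre1 -- set interface gre1 type=gre options:remote_ip=192.0.2.10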
- Copied from action #136133: Migrate aarch64.openqanet.opensuse.org to FC Basement size:M added
- Target version changed from future to Ready
- Related to action #137771: Configure o3 ppc64le multi-machine worker size:M added
- Subject changed from Ensure multi-machine tests work on aarch64-o3 to Ensure multi-machine tests work on aarch64-o3 size:M
- Status changed from New to Workable
- Priority changed from Normal to High
- Subject changed from Ensure multi-machine tests work on aarch64-o3 size:M to Ensure multi-machine tests work on aarch64-o3
- Status changed from Workable to New
We looked at it in the infra daily. It was estimated before the most recent discoveries and we should re-estimate it
livdywan wrote in #note-6:
We looked at it in the infra daily. It was estimated before the most recent discoveries and we should re-estimate it
With what we learned recently I suggest not connecting this machine to other o3 workers in PRG2 over GRE tunnels and only enabling tap on specific aarch32 worker classes so it does not interfere with aarch64 multi-machine jobs, which should run solely in PRG2.
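A minimal sketch of what that could look like in a worker slot's /etc/openqa/workers.ini on aarch64-o3 (class names and slot number are illustrative, not necessarily what was finally used):
# offer "tap" only together with the aarch32 class so aarch64 MM jobs stay in PRG2
[1]
WORKER_CLASS = qemu_aarch32,tap
As the later comments show, this alone is not sufficient if the same slots also carry qemu_aarch64, because 64-bit tap jobs would then match as well.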
- Subject changed from Ensure multi-machine tests work on aarch64-o3 to Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to okurz
- Related to action #135035: Optionally restrict multimachine jobs to a single worker added
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
- Target version changed from Ready to Tools - Next
Asked ggardet in
https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$mJDYph_Njx_n8YQHTLkyMIG7z4SLtnOUDL3Na_M-WJs
hi, in light of #150869 I found that oss-cobbler-03 is already happily working on armv7/aarch32 multi-machine jobs. Our plan was to bring a SUSE machine back to that task, but if I just add "tap" on that machine then clusters would try to schedule tests across the SUSE-internal machine + oss-cobbler-03, and as they can't connect to each other that wouldn't work. So I see the following options: 1. Implement a feature in openQA first so that each cluster can be scheduled on one machine at a time and then add "tap" to all fitting machines, or 2. Remove "tap" from oss-cobbler-03 and add it only to the SUSE-internal machine instead. My preference would be 1. WDYT?
Unless I receive a response asking for option 2, blocking on #135035.
- Target version changed from Tools - Next to future
- Category set to Feature requests
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Target version changed from future to Ready
- Description updated (diff)
- Status changed from Workable to In Progress
- Assignee set to mkittler
I added PARALLEL_ONE_HOST_ONLY=1 to the relevant machine definitions (those also defining WORKER_CLASS=qemu_aarch32). So far there is only one other aarch32-capable host (oss-cobbler-03) so this won't make things worse for scheduling jobs on existing worker slots.
There are no existing aarch32 jobs in the queue so I suppose I can move on with the tap setup itself.
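For reference, the affected machine definition settings would then look roughly like this (a sketch; the actual machine names and complete settings live in the o3 web UI):
# settings of the aarch32 machine definitions (sketch)
WORKER_CLASS=qemu_aarch32
PARALLEL_ONE_HOST_ONLY=1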
EDIT: Ran the setup script and also took care of the wicked config of the eth interface (which is currently not covered by the script; the zone was left public). After a reboot everything seemed to be still in place so I scheduled test jobs:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4094320 WORKER_CLASS=aarch64-o3 {BUILD,TEST}+=-poo150869
Cloning parents of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32
Cloning children of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32
Cloning parents of opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_sut@aarch32
2 jobs have been created:
- opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_ref@aarch32 -> https://openqa.opensuse.org/tests/4101001
- opensuse-Tumbleweed-NET-arm-Build20240418-wicked_basic_sut@aarch32 -> https://openqa.opensuse.org/tests/4101000
EDIT: The test jobs passed. I'll enable the worker class tomorrow.
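Regarding the zone remark above, a minimal sketch of how the firewall zone assignments could be checked and adjusted (assuming firewalld as in the usual MM setup; interface names are placeholders, and in this case the eth interface itself was left in the public zone):
# show which zones currently have interfaces assigned
firewall-cmd --get-active-zones
# check the zone of a specific interface
firewall-cmd --get-zone-of-interface=eth0
# put the MM bridge into the trusted zone and apply the change
firewall-cmd --permanent --zone=trusted --add-interface=br1
firewall-cmd --reload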
- Due date set to 2024-05-08
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
Just for the record, my work on this ticket was not the cause of #159558.
It doesn't look like any of the jobs of the last build were scheduled on aarch64-o3. I'll wait for the next build. (I have already created test jobs and they worked so we might also just consider this ticket resolved.)
https://openqa.opensuse.org/tests/4106737 fails with:
[2024-04-25T11:16:46.532901+02:00] [info] [pid:46959] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
Open vSwitch command 'set_vlan' with arguments 'tap13 5' failed: 'tap13' is not connected to bridge 'br1' at /usr/lib/os-autoinst/backend/qemu.pm line 127.
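A quick way to check whether the tap device is actually attached to the bridge would be something along these lines (assuming Open vSwitch; device names taken from the error message above):
# list all ports currently attached to br1
ovs-vsctl list-ports br1
# or ask which bridge (if any) the tap device belongs to
ovs-vsctl port-to-br tap13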
I've just reverted adding the tap worker class. I noticed myself that this isn't going to cut it because 64-bit tap jobs might now also run on aarch64-o3 - which of course breaks, as @ggardet_arm wrote in the previous comment.
Not sure how to allow scheduling those jobs then. We would probably need #158146. Or we add PARALLEL_ONE_HOST_ONLY=1 also to the 64-bit arm worker classes.
- Due date deleted (2024-05-08)
- Status changed from Feedback to Blocked
- Priority changed from Normal to Low
- Target version changed from Ready to Tools - Next
- Parent task changed from #129280 to #111929
Added #158146 to next, blocking on that
- Status changed from Blocked to In Progress
The PR for #158146 has been deployed on o3 so I can try to use it on aarch64-o3. It is currently offline despite being powered on. So I'm going to recover it.
The machine is up and running again. The tap setup still seems good, but before enabling the tap worker class I'll run some test jobs:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/tests/4311446 WORKER_CLASS=aarch64-o3 {BUILD,TEST}+=-poo150869
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_sut@aarch64
2 jobs have been created:
- opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_ref@aarch64 -> https://openqa.opensuse.org/tests/4315667
- opensuse-Tumbleweed-DVD-aarch64-Build20240701-wicked_basic_sut@aarch64 -> https://openqa.opensuse.org/tests/4315666
The test jobs passed and other production jobs that ran on aarch64-o3 also seem to generally run as expected. So I enabled the tap worker class on the aarch64-o3 slots and mentioned it in the openQA chat on Matrix.
- Target version changed from Tools - Next to Ready
So far only one production job ran on the worker¹ but it failed on an unrelated typing issue (https://openqa.opensuse.org/tests/4318525#step/remote_controller/19).
¹ determined with the following SQL query:
with mm_jobs as (select distinct id, result, state, t_finished, (select host from workers where id = assigned_worker_id) as worker_host from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) where dependency = 2) select concat('https://openqa.opensuse.org/tests/', id) as url, result, t_finished, worker_host from mm_jobs where state = 'done' and worker_host in ('aarch64-o3') order by id desc limit 50;
- Status changed from In Progress to Feedback
A couple of new production jobs were running. Some passed, some failed but none incompleted due to a broken tap setup or cross-host scheduling. So I guess this works as expected.
- Status changed from Feedback to Resolved