action #158146
coordination #112862 (closed): [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M
Description
Motivation
Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. A workaround was found in forcing complete multi-machine test clusters to run on the same worker host. In #135035 we added a feature flag to limit jobs to a single physical host, which can be used for debugging, as a temporary workaround, or when the network design prevents multiple hosts from being interconnected by GRE tunnels. By default, however, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts that might not be properly interconnected, nothing prevents workers from picking up such clusters, causing hard-to-investigate openQA job failures. Can we propagate test variables like the "limit to one host only" feature flag into worker properties so that the openQA scheduler can see that flag before assigning jobs to workers?
Acceptance Criteria
- AC1: the openQA scheduler does not schedule across-host multimachine clusters to any host that has the feature flag from #135035 (or an equivalent mechanism, considering the proposals in #157144-2) set
- AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts
Suggestions
- Look into what was done in #135035 but for the central openQA scheduler
- Investigate whether any worker properties are already available for the openQA scheduler to read when scheduling. It at least already knows about the worker class, right? Should we translate the feature flag from #135035 into a "special worker class" acting as an exclusive class that is only implemented by one host at a time?
- Consider the proposals in #157144-2 regarding using a special worker class or directly the flag PARALLEL_ONE_HOST_ONLY=1 from #135035 (see the sketch after this list)
- Ensure that the scheduler does not schedule across-host multimachine clusters to any host that has such a special worker class or worker property
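For illustration, a minimal sketch of how the flag from #135035 sits on the worker side today; the file path, worker classes and values are only assumptions for the example, not a recommendation:

    # /etc/openqa/workers.ini (illustrative sketch)
    [global]
    # hypothetical worker classes for this host
    WORKER_CLASS = qemu_x86_64,tap
    # feature flag from #135035: keep parallel (multi-machine) clusters that
    # involve this worker on a single physical host; the idea of this ticket is
    # to make this flag (or an equivalent special worker class) visible to the
    # openQA scheduler before jobs are assigned
    PARALLEL_ONE_HOST_ONLY = 1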
Updated by okurz 6 months ago
- Copied from action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
Updated by okurz 5 months ago
- Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added
Updated by livdywan 5 months ago
- Subject changed from Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves to Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by ybonatakis 4 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by openqa_review 4 months ago
- Due date set to 2024-06-21
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 4 months ago
- Status changed from In Progress to Workable
After 5 days I have done nothing meaningful on this. I set up openQA on my laptop, where jobs run but fail to execute qemu due to permissions. I thought I didn't need that at first, but when I tried to print some output I realized that the code probably doesn't reach the function I was looking at, and I assumed the scheduling takes place later. Setting this back to Workable, as it doesn't make sense to keep it In Progress while I try to figure out how to solve the setup problem.
Updated by mkittler 4 months ago · Edited
Do you need any help with the setup? I'm wondering why you need any special setup for this at all, though. This can all be covered by unit tests, and for a real test I suggest you just spawn a few dummy tests. This ticket is completely independent of "tap devices" and our MM setup. It is a scheduling problem and nothing more.
(I actually wanted to work on this. If you're not doing anything anytime soon, don't keep the ticket assigned to yourself in "Workable" forever so others can pick it up instead.)
Updated by ybonatakis 4 months ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/5695
The PR needs cleanup, but first I am looking to see whether there is a need to write more tests.
Updated by ybonatakis 3 months ago
- Status changed from Workable to Feedback
PR is ready. I couldn't extend the tests, but I think the existing ones are enough for now.
Updated by ybonatakis 3 months ago
The PR is updated to fix a problem discovered during manual validation: the setting was assigned to the worker when it was enabled in workers.ini, but when the setting was removed (commented out) the database could not update a record that was non-null. I also fixed a test which was failing after the latest change and submitted to see whether the pipeline complains about some other test which I could run locally.
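For context, a hedged sketch of the two workers.ini states exercised during that manual validation; the file layout is an assumption, only the PARALLEL_ONE_HOST_ONLY setting itself comes from this ticket:

    # state 1: setting enabled, the worker reports it as a worker property
    [global]
    PARALLEL_ONE_HOST_ONLY = 1

    # state 2: setting commented out again; the previously stored worker
    # property now has to be cleared, which is the update that initially failed
    [global]
    #PARALLEL_ONE_HOST_ONLY = 1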
Updated by mkittler 3 months ago · Edited
Quoting the previous comment:

> The PR is updated to fix a problem discovered during manual validation: the setting was assigned to the worker when it was enabled in workers.ini, but when the setting was removed (commented out) the database could not update a record that was non-null. I also fixed a test which was failing after the latest change and submitted to see whether the pipeline complains about some other test which I could run locally.

Note that @ybonatakis has actually fixed these problems with the latest version in his PR.
I now implemented the changes required in the scheduler logic and pushed them to https://github.com/os-autoinst/openQA/pull/5695 as well.
This was more work than expected and I haven't been able to run manual fullstack tests yet. The CI checks might also fail. So it might make sense to assign this ticket to me.
Updated by ybonatakis 3 months ago
- Assignee changed from ybonatakis to mkittler
Updated by okurz 3 months ago
It's deployed: https://mailman.suse.de/mlarch/SuSE/openqa/2024/openqa.2024.07/msg00002.html
As discussed, I suggest enabling it on a single host first, e.g. one where you add the new setting plus add back "tap", then verify, and then extend to other hosts.
Updated by mkittler 3 months ago
MR: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/856
I'll also try it out on aarch64-o3 as part of #150869#note-26.
Updated by mkittler 3 months ago
I've just gone through the recent history of MM jobs on OSD via the following query:

    with mm_jobs as (
        select distinct id, result, state, t_finished,
               (select host from workers where id = assigned_worker_id) as worker_host
          from jobs
          left join job_dependencies on (id = child_job_id or id = parent_job_id)
         where dependency = 2)
    select concat('https://openqa.suse.de/tests/', id) as url, result, t_finished, worker_host
      from mm_jobs
     where state = 'done' and worker_host in ('worker35', 'worker40')
     order by id desc
     limit 50;

I couldn't find any cross-host jobs being scheduled, the tap setup on worker35 and worker40 generally seems to work, and jobs are generally scheduled on both hosts. So everything works as expected.
Updated by mkittler 3 months ago
MR to enable more tap workers again: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/858
Updated by mkittler 3 months ago · Edited
There is so far one cluster where jobs were executed across different hosts when they shouldn't have been: https://openqa.suse.de/tests/14810624#dependencies

The relevant slots are imagetester:1 and sapworker3:1.

I restarted the jobs. I'm not immediately reverting the MR because it might not have been fully effective when those jobs were scheduled. (But now both slots have PARALLEL_ONE_HOST_ONLY=1 shown on the web UI, so it should not happen again.)

EDIT: Looks like both of those jobs have the special worker class 64bit-mlx_con5, and we have only exactly two worker slots that can execute those jobs (imagetester:1 and sapworker3:1). Because those slots are on different worker hosts, the jobs cannot be scheduled at all anymore with my new configuration. They probably got scheduled before because the worker slots had not been restarted yet and thus PARALLEL_ONE_HOST_ONLY=1 was not effective yet. So this is not a bug. And now the restarted jobs have been stuck in the scheduled state for 10 minutes even though the two worker slots are idling, so the PARALLEL_ONE_HOST_ONLY=1 setting really works.

Of course it is questionable that I added the PARALLEL_ONE_HOST_ONLY=1 setting globally. It would probably make sense to add it only on worker slots with the tap worker class; a sketch of that follows below. (Those two parallel jobs are not using the tap class, they are somewhat special. I suppose parallel dependencies are also used outside the scope of the tap setup.)
Updated by mkittler 3 months ago · Edited
I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/859 for the problem mentioned in my previous comment.
Otherwise the output of the query across all hosts still looks good:

    with mm_jobs as (
        select distinct id, result, state, t_finished,
               (select host from workers where id = assigned_worker_id) as worker_host
          from jobs
          left join job_dependencies on (id = child_job_id or id = parent_job_id)
         where dependency = 2)
    select concat('https://openqa.suse.de/tests/', id) as url, result, t_finished, worker_host
      from mm_jobs
     where state = 'done'
     order by id desc
     limit 100;

Tests are mostly passing/softfailing, and the ones that did fail were still correctly scheduled and likely failed for reasons other than the scheduling or tap setup.
Updated by mkittler 3 months ago
Many more production jobs were running. Some passed, some failed but none incompleted due to a broken tap setup or cross-host scheduling - at least within the set of jobs I reviewed. (There are too many jobs to have a close look at all the failures.) So I guess this works as expected.
We could enable even more tap workers now, but I kept a few tap_secondary workers around (especially workers that are at Marienberg anyway). Should I replace all occurrences of tap_secondary again? (We can also do this outside the scope of this ticket when needed; it should only take a few seconds. I just wanted to avoid enabling workers that may generally cause problems with the tap setup for now.)