action #158146

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M

Added by okurz 4 months ago. Updated 12 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-03-27
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. A workaround was found in forcing complete multi-machine clusters to run on the same worker host. In #135035 we added a feature flag to limit jobs to a single physical host, which can be used for debugging, as a temporary workaround, or when the network design prevents multiple hosts from being interconnected by GRE tunnels. By default, however, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts that might not be properly interconnected, there is no measure preventing workers from picking up such clusters, causing hard-to-investigate openQA job failures which we should try to prevent. Can we propagate test variables like the "limit to one host only" feature flag in worker properties so that the openQA scheduler can see that flag before assigning jobs to workers?

Acceptance Criteria

  • AC1: the openQA scheduler does not schedule across-host multimachine clusters to any host that has the feature flag from #135035 (or an equivalent flag, considering the proposals in #157144-2) set
  • AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts

Suggestions

  • Look into what was done in #135035, but for the central openQA scheduler
  • Investigate whether any worker properties are already available for the openQA scheduler to read when scheduling. At least it already knows about the worker class, right? Should we translate the feature flag from #135035 into a "special worker class" that acts as an exclusive class only provided by one host at a time?
  • Consider the proposals in #157144-2 regarding using a special worker class or directly using the flag PARALLEL_ONE_HOST_ONLY=1 from #135035 (see the config sketch after this list)
  • Ensure that the scheduler does not schedule across-host multimachine clusters to any host that has such a special worker class or worker property
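
For illustration, a minimal workers.ini sketch of the two variants discussed above. This is only an assumption of how a host could mark itself, not the final implementation; the class name "exclusive_host" is made up for the example:

    # Variant A: propagate the feature flag from #135035 as a worker property
    [global]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1

    # Variant B: model the exclusion as a special worker class that only this
    # single host provides (class name "exclusive_host" is hypothetical)
    # [global]
    # WORKER_CLASS = qemu_x86_64,tap,exclusive_host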

Related issues: 2 (1 open, 1 closed)

Related to openQA Project - action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)

Copied from openQA Project - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)

Actions #1

Updated by okurz 4 months ago

  • Copied from action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
Actions #2

Updated by okurz 4 months ago

  • Subject changed from Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available to Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves
Actions #3

Updated by okurz 3 months ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz about 2 months ago

  • Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added
Actions #5

Updated by okurz about 2 months ago

  • Priority changed from Low to Normal
  • Target version changed from Tools - Next to Ready

#160646 makes it necessary to give this more priority

Actions #6

Updated by livdywan about 2 months ago

  • Subject changed from Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves to Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by ybonatakis about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #8

Updated by openqa_review about 1 month ago

  • Due date set to 2024-06-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by ybonatakis about 1 month ago

  • Status changed from In Progress to Workable

After 5 days I have done nothing meaningful on this. I set up openQA on my laptop, where jobs run but fail to execute QEMU due to permissions. I thought I didn't need that at first, but when I tried to print some output I realized that it probably doesn't reach the function, and I assumed that the scheduling takes place later. Setting this back to Workable as it doesn't make sense to keep it In Progress while I try to figure out how to solve the setup problem.

Actions #10

Updated by mkittler about 1 month ago · Edited

Do you need any help with the setup? Although I'm wondering why you need any special setup for this at all. This can all be tested in unit tests. And for a real test I suggest you just spawn a few dummy tests. This ticket is completely independent of "tap devices" and our MM setup. It is a scheduling problem and nothing more.

(I actually wanted to work on this. If you're not doing anything anytime soon, don't keep the ticket assigned to yourself in "Workable" forever so others can pick it up instead.)

Actions #11

Updated by ybonatakis about 1 month ago

  • Status changed from Workable to In Progress
Actions #12

Updated by ybonatakis about 1 month ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5695

The PR needs cleanup, but I am first checking whether there is a need to write more tests.

Actions #13

Updated by okurz 28 days ago

  • Status changed from Feedback to Workable
Actions #14

Updated by livdywan 27 days ago

  • Due date deleted (2024-06-21)
Actions #15

Updated by ybonatakis 23 days ago

  • Status changed from Workable to Feedback

The PR is ready. I couldn't extend the tests but I think the existing ones are enough for now.

Actions #16

Updated by ybonatakis 20 days ago

The PR is updated to fix a problem which was discovered during manual validation. The setting was assigned to the worker when it was enabled in workers.ini, but when the setting was removed (commented out) the database couldn't update a record which was non-null. I also fixed a test which was failing after the latest change and submitted to see whether the pipeline will complain about some other test which I could run locally.

Actions #17

Updated by mkittler 16 days ago · Edited

Quoting #16:

> The PR is updated to fix a problem which was discovered during manual validation. The setting was assigned to the worker when it was enabled in workers.ini, but when the setting was removed (commented out) the database couldn't update a record which was non-null. I also fixed a test which was failing after the latest change and submitted to see whether the pipeline will complain about some other test which I could run locally.

Note that @ybonatakis has actually fixed these problems with the latest version in his PR.

I now implemented the changes required in the scheduler logic and pushed them to https://github.com/os-autoinst/openQA/pull/5695 as well.

This was more work than expected and I haven't been able to run manual fullstack tests yet. The CI checks might also fail. So it might make sense to assign this ticket to me.

Actions #18

Updated by ybonatakis 15 days ago

  • Assignee changed from ybonatakis to mkittler
Actions #19

Updated by mkittler 14 days ago

The PR has been merged. Once it is deployed I'll add PARALLEL_ONE_HOST_ONLY=1 to our config and enable MM tests on all workers again.

Actions #20

Updated by okurz 14 days ago

It's deployed: https://mailman.suse.de/mlarch/SuSE/openqa/2024/openqa.2024.07/msg00002.html

As discussed I suggest enabling it on a single host first, e.g. one where you add the new setting plus add back "tap", verify, and then extend to other hosts.
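
A sketch of what that first single-host rollout could look like in that host's workers.ini (values are illustrative only, not the actual production configuration, which is managed via salt pillars):

    # workers.ini on the first host that gets "tap" back
    [global]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1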

Actions #21

Updated by mkittler 14 days ago

Actions #22

Updated by mkittler 13 days ago

I've just gone through the recent history of MM jobs on OSD via the following query:

    with mm_jobs as (
        select distinct id, result, state, t_finished,
               (select host from workers where id = assigned_worker_id) as worker_host
          from jobs
          left join job_dependencies on (id = child_job_id or id = parent_job_id)
         where dependency = 2)
    select concat('https://openqa.suse.de/tests/', id) as url, result, t_finished, worker_host
      from mm_jobs
     where state = 'done' and worker_host in ('worker35', 'worker40')
     order by id desc
     limit 50;

I couldn't find any cross-host jobs being scheduled and the tap setup on worker35 and worker40 generally seems to work and jobs are generally scheduled on both hosts. So everything works as expected.

Actions #23

Updated by mkittler 13 days ago

Actions #24

Updated by mkittler 13 days ago · Edited

There is so far one cluster where jobs were executed across different hosts when they shouldn't have: https://openqa.suse.de/tests/14810624#dependencies

The relevant slots are imagetester:1 and sapworker3:1.

I restarted the jobs. I'm not immediately reverting the MR because it might not have been fully effective when those jobs were scheduled. (But now both slots have PARALLEL_ONE_HOST_ONLY=1 shown on the web UI so it should not happen again.)

EDIT: Looks like both of those jobs have the special worker class 64bit-mlx_con5 - and we have only exactly two worker slots that can execute those jobs (imagetester:1 and sapworker3:1). Because those slots are on different worker hosts the jobs cannot be scheduled at all anymore with my new configuration. They probably got scheduled before because the worker slots hadn't been restarted yet and thus PARALLEL_ONE_HOST_ONLY=1 was not effective yet. So this is not a bug. And now those restarted jobs have been stuck in the scheduled state for 10 minutes even though the two worker slots are idling. So the PARALLEL_ONE_HOST_ONLY=1 setting really works.

Of course it is questionable that I added the PARALLEL_ONE_HOST_ONLY=1 setting globally. It would probably make sense to add it only on worker slots with the tap worker class. (Those two parallel jobs are not using the tap class; they are somewhat special. I suppose parallel dependencies are also used outside the scope of the tap setup.)
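
A sketch of how the setting could be limited to the tap slots instead of [global] (slot numbers and worker classes are illustrative only):

    [global]
    WORKER_CLASS = qemu_x86_64

    # only the tap-capable slots exclude themselves from across-host clusters
    [1]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1

    [2]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1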

Actions #25

Updated by mkittler 13 days ago · Edited

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/859 for the problem mentioned in my previous comment.

Otherwise the output of the following query still looks good:

    with mm_jobs as (
        select distinct id, result, state, t_finished,
               (select host from workers where id = assigned_worker_id) as worker_host
          from jobs
          left join job_dependencies on (id = child_job_id or id = parent_job_id)
         where dependency = 2)
    select concat('https://openqa.suse.de/tests/', id) as url, result, t_finished, worker_host
      from mm_jobs
     where state = 'done'
     order by id desc
     limit 100;

Tests are mostly passing/softfailing and the ones that did fail were still correctly scheduled and likely failed due to other reasons than the scheduling or tap setup.

Actions #26

Updated by mkittler 12 days ago

Many more production jobs were running. Some passed, some failed but none incompleted due to a broken tap setup or cross-host scheduling - at least within the set of jobs I reviewed. (There are too many jobs to have a close look at all the failures.) So I guess this works as expected.

We could enable even more tap workers now but I kept a few tap_secondary workers around (especially workers that are at Marienberg anyway). Should I replace all occurrences of tap_secondary again? (We can also do this outside of the scope of this ticket when needed; it should only take a few seconds. I just wanted to avoid enabling workers that may generally cause problems with the tap setup for now.)

Actions #27

Updated by mkittler 12 days ago

  • Status changed from Feedback to Resolved