action #158146

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M

Added by okurz 4 months ago. Updated 12 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-03-27
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Multi-machine jobs have been failing since 2023-08-14 because of a misconfiguration of the MTU/GRE tunnels. A workaround was found in forcing complete multi-machine clusters to run on the same worker host. In #135035 we added a feature flag to limit jobs to a single physical host, which can be used for debugging, as a temporary workaround, or when the network design prevents multiple hosts from being interconnected by GRE tunnels. By default, however, when multi-machine jobs are scheduled with worker classes fulfilled by multiple hosts that might not be properly interconnected, there is no measure preventing workers from picking up such clusters, causing hard-to-investigate openQA job failures which we should try to prevent. Can we propagate test variables like the "limit to one host only" feature flag in worker properties so that the openQA scheduler can see that flag before assigning jobs to workers?

Acceptance Criteria

  • AC1: the openQA scheduler does not schedule across-host multimachine clusters to any host that has the feature flag from #135035 (or an equivalent flag, considering the proposals in #157144-2) set
  • AC2: By default, jobs of a multi-machine parallel cluster can still be scheduled across multiple different hosts

Suggestions

  • Look into what was done in #135035, but for the central openQA scheduler
  • Investigate whether any worker properties are already available for the openQA scheduler to read when scheduling. At least it already knows about the worker class, right? Should we translate the feature flag from #135035 into a "special worker class" that acts as an exclusive class only provided by one host at a time?
  • Consider the proposals in #157144-2 regarding using a special worker class or directly using the flag PARALLEL_ONE_HOST_ONLY=1 from #135035 (see the config sketch after this list)
  • Ensure that the scheduler does not schedule across-host multimachine clusters to any host that has such a special worker class or worker property
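
For illustration, a minimal workers.ini sketch of the two variants discussed above. This is only an assumption of how a host could mark itself, not the final implementation; the class name "exclusive_host" is made up for the example:

    # Variant A: propagate the feature flag from #135035 as a worker property
    [global]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1

    # Variant B: model the exclusion as a special worker class that only this
    # single host provides (class name "exclusive_host" is hypothetical)
    # [global]
    # WORKER_CLASS = qemu_x86_64,tap,exclusive_host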

Related issues: 2 (1 open, 1 closed)

Related to openQA Project - action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration (Resolved, okurz, 2022-06-03)

Copied from openQA Project - action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available (New)

Actions #1

Updated by okurz 4 months ago

  • Copied from action #158143: Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available added
Actions #2

Updated by okurz 4 months ago

  • Subject changed from Make workers unassign/reject/incomplete jobs when across-host multimachine setup is requested but not available to Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves
Actions #3

Updated by okurz 3 months ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz about 2 months ago

  • Related to action #112001: [timeboxed:20h][spike solution] Pin multi-machine cluster jobs to same openQA worker host based on configuration added
Actions #5

Updated by okurz about 2 months ago

  • Priority changed from Low to Normal
  • Target version changed from Tools - Next to Ready

#160646 makes it necessary to give this more priority

Actions #6

Updated by livdywan about 2 months ago

  • Subject changed from Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves to Prevent scheduling across-host multimachine clusters to hosts that are marked to exclude themselves size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by ybonatakis about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #8

Updated by openqa_review about 1 month ago

  • Due date set to 2024-06-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by ybonatakis about 1 month ago

  • Status changed from In Progress to Workable

After 5 days I have done nothing meaningful on this. I set up openQA on my laptop, where jobs run but fail to execute QEMU due to permissions. I thought I didn't need that at first, but when I tried to print some output I realized that it probably doesn't reach the function, and I assumed that the scheduling takes place later. Setting this back to Workable as it doesn't make sense to keep it In Progress while I try to figure out how to solve the setup problem.

Actions #10

Updated by mkittler about 1 month ago · Edited

Do you need any help with the setup? Although I'm wondering why you need any special setup for this at all. This can all be tested in unit tests. And for a real test I suggest you just spawn a few dummy tests. This ticket is completely independent of "tap devices" and our MM setup. It is a scheduling problem and nothing more.

(I actually wanted to work on this. If you're not doing anything anytime soon, don't keep the ticket assigned to yourself in "Workable" forever so others can pick it up instead.)

Actions #11

Updated by ybonatakis about 1 month ago

  • Status changed from Workable to In Progress
Actions #12

Updated by ybonatakis about 1 month ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5695

The PR needs cleanup, but I am first checking whether there is a need to write more tests.

Actions #13

Updated by okurz 28 days ago

  • Status changed from Feedback to Workable
Actions #14

Updated by livdywan 27 days ago

  • Due date deleted (2024-06-21)
Actions #15

Updated by ybonatakis 23 days ago

  • Status changed from Workable to Feedback

The PR is ready. I couldn't extend the tests but I think the existing ones are enough for now.

Actions #16

Updated by ybonatakis 20 days ago

The PR is updated to fix a problem which was discovered during manual validation. The setting was assigned to the worker when it was enabled in workers.ini, but when the setting was removed (commented out) the database couldn't update a record which was non-null. I also fixed a test which was failing after the latest change and submitted to see whether the pipeline will complain about some other test which I could run locally.

Actions #17

Updated by mkittler 16 days ago · Edited

Quoting #16:

> The PR is updated to fix a problem which was discovered during manual validation. The setting was assigned to the worker when it was enabled in workers.ini, but when the setting was removed (commented out) the database couldn't update a record which was non-null. I also fixed a test which was failing after the latest change and submitted to see whether the pipeline will complain about some other test which I could run locally.

Note that @ybonatakis has actually fixed these problems with the latest version in his PR.

I now implemented the changes required in the scheduler logic and pushed them to https://github.com/os-autoinst/openQA/pull/5695 as well.

This was more work than expected and I haven't been able to run manual fullstack tests yet. The CI checks might also fail. So it might make sense to assign this ticket to me.

Actions #18

Updated by ybonatakis 15 days ago

  • Assignee changed from ybonatakis to mkittler
Actions #19

Updated by mkittler 14 days ago

The PR has been merged. Once it is deployed I'll add PARALLEL_ONE_HOST_ONLY=1 to our config and enable MM tests on all workers again.

Actions #20

Updated by okurz 14 days ago

It's deployed: https://mailman.suse.de/mlarch/SuSE/openqa/2024/openqa.2024.07/msg00002.html

As discussed I suggest enabling it on a single host first, e.g. one where you add the new setting plus add back "tap", verify, and then extend to other hosts.
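
A sketch of what that first single-host rollout could look like in that host's workers.ini (values are illustrative only, not the actual production configuration, which is managed via salt pillars):

    # workers.ini on the first host that gets "tap" back
    [global]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1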

Actions #21

Updated by mkittler 14 days ago

Actions #22

Updated by mkittler 13 days ago

I've just gone through the recent history of MM jobs on OSD via the following query:

    with mm_jobs as (
        select distinct id, result, state, t_finished,
               (select host from workers where id = assigned_worker_id) as worker_host
          from jobs
          left join job_dependencies on (id = child_job_id or id = parent_job_id)
         where dependency = 2)
    select concat('https://openqa.suse.de/tests/', id) as url, result, t_finished, worker_host
      from mm_jobs
     where state = 'done' and worker_host in ('worker35', 'worker40')
     order by id desc
     limit 50;

I couldn't find any cross-host jobs being scheduled and the tap setup on worker35 and worker40 generally seems to work and jobs are generally scheduled on both hosts. So everything works as expected.

Actions #23

Updated by mkittler 13 days ago

Actions #24

Updated by mkittler 13 days ago · Edited

There is so far one cluster where jobs were executed across different hosts when they shouldn't have: https://openqa.suse.de/tests/14810624#dependencies

The relevant slots are imagetester:1 and sapworker3:1.

I restarted the jobs. I'm not immediately reverting the MR because it might not have been fully effective when those jobs were scheduled. (But now both slots have PARALLEL_ONE_HOST_ONLY=1 shown on the web UI so it should not happen again.)

EDIT: Looks like both of those jobs have the special worker class 64bit-mlx_con5 - and we have only exactly two worker slots that can execute those jobs (imagetester:1 and sapworker3:1). Because those slots are on different worker hosts the jobs cannot be scheduled at all anymore with my new configuration. They probably got scheduled before because the worker slots hadn't been restarted yet and thus PARALLEL_ONE_HOST_ONLY=1 was not effective yet. So this is not a bug. And now those restarted jobs have been stuck in the scheduled state for 10 minutes even though the two worker slots are idling. So the PARALLEL_ONE_HOST_ONLY=1 setting really works.

Of course it is questionable that I added the PARALLEL_ONE_HOST_ONLY=1 setting globally. It would probably make sense to add it only on worker slots with the tap worker class. (Those two parallel jobs are not using the tap class; they are somewhat special. I suppose parallel dependencies are also used outside the scope of the tap setup.)
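
A sketch of how the setting could be limited to the tap slots instead of [global] (slot numbers and worker classes are illustrative only):

    [global]
    WORKER_CLASS = qemu_x86_64

    # only the tap-capable slots exclude themselves from across-host clusters
    [1]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1

    [2]
    WORKER_CLASS = qemu_x86_64,tap
    PARALLEL_ONE_HOST_ONLY = 1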

Actions #25

Updated by mkittler 13 days ago · Edited

I created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/859 for the problem mentioned in my previous comment.

Otherwise the output of the following query still looks good:

    with mm_jobs as (
        select distinct id, result, state, t_finished,
               (select host from workers where id = assigned_worker_id) as worker_host
          from jobs
          left join job_dependencies on (id = child_job_id or id = parent_job_id)
         where dependency = 2)
    select concat('https://openqa.suse.de/tests/', id) as url, result, t_finished, worker_host
      from mm_jobs
     where state = 'done'
     order by id desc
     limit 100;

Tests are mostly passing/softfailing and the ones that did fail were still correctly scheduled and likely failed due to other reasons than the scheduling or tap setup.

Actions #26

Updated by mkittler 12 days ago

Many more production jobs were running. Some passed, some failed but none incompleted due to a broken tap setup or cross-host scheduling - at least within the set of jobs I reviewed. (There are too many jobs to have a close look at all the failures.) So I guess this works as expected.

We could enable even more tap workers now but I kept a few tap_secondary workers around (especially workers that are at Marienberg anyway). Should I replace all occurrences of tap_secondary again? (We can also do this outside of the scope of this ticket when needed; it should only take a few seconds. I just wanted to avoid enabling workers that may generally cause problems with the tap setup for now.)

Actions #27

Updated by mkittler 12 days ago

  • Status changed from Feedback to Resolved