action #45749

closed

[tools][scheduler] Multi-machine jobs with higher priority do not get worker to run.

Added by xlai about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2019-01-07
Due date:
% Done: 0%
Estimated time:

Description

During the 15-SP1 Beta1 test, the multi-machine jobs (2 SUTs) in the virtualization job groups could not get workers to start until some low-priority single-machine jobs had finished. This delays the acceptance test in particular (it could not finish within 24 hours).

https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=125.1&groupid=115
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=125.1&groupid=213

I did not open a ticket when I first found this, because my understanding was:
although these jobs have a higher priority within our group, IPMI jobs in other job groups may have an even higher priority. Those jobs got the machines first, and since they did not all finish at the same time, our multi-machine jobs still could not start, while other lower-priority single-machine jobs did.

@okurz commented in https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/148/ that the openQA tool should be enhanced to handle this.

Please help evaluate this. Much appreciated.

Actions #1

Updated by coolo about 5 years ago

  • Status changed from New to Rejected

Priority 30 (your group) means (roughly) that 30 chances need to pass before a job is taken without another peer. We won't leave IPMI machines idle because of a prio 30 job being around. If you want your jobs to be take-it-all, you would need to set the prio much lower.
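
To illustrate the "chances" idea above, here is a toy model in Python. It is not openQA code: the names `ClusterJob`, `priority_offset`, `schedule_tick` and `ticks_until_reserve` are hypothetical, and it assumes that a multi-machine job's effective priority drops by one each time it is passed over, and that workers are only held back for it once that value reaches 0.

```python
from dataclasses import dataclass


@dataclass
class ClusterJob:
    """Toy multi-machine job; lower priority value = more important."""
    name: str
    priority: int
    workers_needed: int
    priority_offset: int = 0  # hypothetical: grows each time the job is passed over

    @property
    def effective_priority(self) -> int:
        return self.priority - self.priority_offset


def schedule_tick(job: ClusterJob, free_workers: int) -> str:
    """One scheduling attempt for a multi-machine job (assumed behaviour)."""
    if free_workers >= job.workers_needed:
        return "start"
    if job.effective_priority <= 0:
        # The job has waited long enough: keep the free workers for it.
        return "reserve"
    # Not important enough yet: give the workers to other jobs and age the job.
    job.priority_offset += 1
    return "skip"


def ticks_until_reserve(priority: int) -> int:
    """Count how often a 2-worker job is skipped while only 1 worker is free."""
    job = ClusterJob("mm-job", priority, workers_needed=2)
    ticks = 0
    while schedule_tick(job, free_workers=1) == "skip":
        ticks += 1
    return ticks


if __name__ == "__main__":
    for prio in (0, 20, 30, 50):
        print(f"prio {prio}: skipped {ticks_until_reserve(prio)} times before reserving")
```

Under these assumptions a prio 30 job is passed over 30 times before a worker is kept free for it, while a prio 0 job reserves workers immediately.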

Actions #2

Updated by coolo about 5 years ago

  • Status changed from Rejected to New

On second thought: do you have more info about the other jobs' priorities? One thing we could improve is how much impact the priority difference has. I don't think we care atm.

Actions #3

Updated by xlai about 5 years ago

coolo wrote:

On second thought: do you have more info about the other jobs' priorities? One thing we could improve is how much impact the priority difference has. I don't think we care atm.

I am not sure. I checked the job history of the workers with class virt-mm-64bit-ipmi. It seems most jobs launched before them were in the virtualization group, some with the same priority 30 and some with priority 50 (single-machine jobs, which have the lowest priority in our group). Please note that the virtualization-milestone job group also has some multi-machine jobs with priority 40, which were only scheduled after nearly all the single-machine prio 50 jobs were done.

Actions #4

Updated by coolo about 5 years ago

So prio 40 versus prio 50 isn't a big enough difference for me to justify stalling workers on the first try. But I think we can optimize it to be a little quicker if the prio difference is large.

Actions #5

Updated by coolo about 5 years ago

  • Status changed from New to Resolved

https://github.com/os-autoinst/openQA/pull/1953 is all I can do. As said: if you want jobs to rule 'em all - set their prio to 0.

Actions #6

Updated by xlai about 5 years ago

After changing the priority to 20, the guest migration jobs always got SUTs to run on in the recent Beta2 candidates. So I am closing the original MR https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/148/.

Also, big thanks for the quick fix.
