action #169513

open

Improve/investigate behavior when repairing half-scheduled parallel clusters when `PARALLEL_ONE_HOST_ONLY=1` is used

Added by mkittler 14 days ago. Updated 13 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2024-11-07
Due date:
% Done: 0%
Estimated time:

Description

Observation

When working on #169342 we noticed that the scheduler might need to repair half-scheduled parallel clusters when a job assignment is refused by a worker. This can happen for various reasons, e.g. when the load limit on the worker is exceeded (see #169342#note-10).

Note that what we have observed is an unlikely situation because normally those workers are not considered by the scheduler¹. So the impact of this not working as well as it could is probably not very high. However, it is likely enough that we were able to observe it again only a few days later, see #169342#note-26.

We also found 3 other likely relevant job clusters. I checked their scheduler logs and at first it looked as if a refused job was not always the cause: e.g. the cluster of https://openqa.suse.de/tests/15881165#dependencies was scheduled across hosts although none of the jobs are clones and the scheduler logs show no refused jobs. The error message `Failed to send data to websocket server, reason: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.` appears in some scheduler log files but not in the relevant time frame, so it is unrelated. However, these cases were about refused jobs after all: the refusal is only logged on the worker side and was thus not in the scheduler logs.

In case of https://openqa.suse.de/tests/15878799#dependencies the message `Discarding job 15878799 (with priority 57) due to incomplete parallel cluster, reducing priority by 1` was logged, so we should probably also look into this case.
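To find further clusters affected by this case, the scheduler logs could be searched for the message quoted above. The following is only a minimal sketch: the pattern is derived from that single quoted message, while the `[debug]` prefix in the sample lines and the helper name `find_discarded_jobs` are assumptions made for illustration.

```python
import re

# Pattern built from the one message quoted in this ticket. The rest of the
# log-line format (timestamp, log level) is not assumed, so the pattern is
# searched for anywhere in the line.
DISCARD_RE = re.compile(
    r"Discarding job (?P<job_id>\d+) \(with priority (?P<priority>-?\d+)\) "
    r"due to incomplete parallel cluster, reducing priority by (?P<step>\d+)"
)


def find_discarded_jobs(lines):
    """Return (job_id, priority, step) for every line containing the message."""
    hits = []
    for line in lines:
        match = DISCARD_RE.search(line)
        if match:
            hits.append(
                (int(match["job_id"]), int(match["priority"]), int(match["step"]))
            )
    return hits


# Hypothetical sample lines; only the message text itself is from the ticket.
sample = [
    "[debug] Discarding job 15878799 (with priority 57) due to incomplete "
    "parallel cluster, reducing priority by 1",
    "[debug] some unrelated line",
]
print(find_discarded_jobs(sample))  # [(15878799, 57, 1)]
```

The resulting job IDs can then be checked against the dependency view of each job to see whether its cluster ended up spread across hosts.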


¹ A job might be refused if the worker only enters the "broken" state after the scheduler has already checked the worker status (and has possibly already assigned other jobs in that parallel cluster).

Acceptance criteria

  • AC1: We know how the scheduler behaves if it has to repair a half-scheduled parallel cluster when PARALLEL_ONE_HOST_ONLY=1 is used. This is stated in the documentation of the PARALLEL_ONE_HOST_ONLY=1 feature.
  • AC2: The scheduler does not assign parallel jobs across multiple hosts when PARALLEL_ONE_HOST_ONLY=1 is used. This is also true if a cluster was half-scheduled for whatever reason.
  • AC3 (optional): The scheduler restarts a half-scheduled parallel cluster using PARALLEL_ONE_HOST_ONLY=1 when repairing it is not possible because no matching worker slots are available.

Remarks

If we don't implement AC3, the job would probably simply stay scheduled until a worker slot on the required host (where the rest of the parallel cluster is already running) finally becomes available. The parallel jobs that are already running might then have to wait quite a while, so the system is prone to timeouts.
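The behavior described by AC2, AC3 and the remark above can be summarized as a small decision function. This is a sketch of the intended outcome only, not the openQA scheduler's actual code; all names (`Job`, `Decision`, `decide`, `restart_allowed`) and the host names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Decision(Enum):
    REPAIR = "repair"    # assign the remaining jobs on the same host (AC2)
    WAIT = "wait"        # keep the remaining jobs scheduled (without AC3)
    RESTART = "restart"  # restart the whole cluster (AC3)


@dataclass
class Job:
    job_id: int
    host: Optional[str] = None  # None means the job is not assigned yet


def decide(cluster: List[Job], free_slots: Dict[str, int],
           restart_allowed: bool) -> Decision:
    """Decide how to handle a half-scheduled cluster pinned to one host."""
    hosts = {job.host for job in cluster if job.host is not None}
    if len(hosts) != 1:
        raise ValueError("expected a half-scheduled cluster on exactly one host")
    host = next(iter(hosts))
    pending = [job for job in cluster if job.host is None]
    # Only slots on the host already used by the cluster count; free slots
    # on other hosts must be ignored to avoid a cross-host assignment.
    if free_slots.get(host, 0) >= len(pending):
        return Decision.REPAIR
    return Decision.RESTART if restart_allowed else Decision.WAIT


cluster = [Job(1, host="worker-a"), Job(2)]
print(decide(cluster, {"worker-a": 1}, restart_allowed=False))  # Decision.REPAIR
print(decide(cluster, {"worker-b": 4}, restart_allowed=False))  # Decision.WAIT
print(decide(cluster, {"worker-b": 4}, restart_allowed=True))   # Decision.RESTART
```

The `WAIT` branch corresponds to the situation without AC3, where the already running jobs may time out while waiting for a slot on their host.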


Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #169342: Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature size:M (Resolved, mkittler, 2024-11-05)

#1

Updated by mkittler 14 days ago

  • Related to action #169342: Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature size:M added

#2

Updated by okurz 14 days ago

  • Target version set to future

#3

Updated by mkittler 14 days ago

  • Description updated

#4

Updated by mkittler 13 days ago

  • Description updated

#5

Updated by mkittler 13 days ago

  • Description updated