action #169513
open
Improve/investigate behavior when repairing half-scheduled parallel clusters when `PARALLEL_ONE_HOST_ONLY=1` is used
0%
Description
Observation
When working on #169342 we noticed that the scheduler might need to repair half-scheduled parallel clusters when a job assignment is refused by a worker. This can happen for various reasons, e.g. when the load limit on the worker is exceeded (see #169342#note-10).
Note that what we have observed is an unlikely situation because normally those workers are not considered by the scheduler¹. So the impact of this not working as well as it could is probably not very high. However, it is likely enough that we were able to observe it again only a few days later, see #169342#note-26.
We also found 3 other likely relevant job clusters. I checked their scheduler logs and at first it looked like the problem is not always due to a job being refused, e.g. the cluster of https://openqa.suse.de/tests/15881165#dependencies was scheduled across hosts but none of the jobs are clones and the scheduler logs show none of the jobs being refused. I saw the error message
`Failed to send data to websocket server, reason: Connection refused at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.`
in some scheduler log files but not in the relevant time frame, so it must be something different. Those cases were about refused jobs after all: the refusal is only logged on the worker side and was thus not in the scheduler logs.
In case of https://openqa.suse.de/tests/15878799#dependencies the message
`Discarding job 15878799 (with priority 57) due to incomplete parallel cluster, reducing priority by 1`
was logged, so we should probably also look into this case.
¹ A job might be refused if the worker only enters the "broken" state after the scheduler has already checked the worker status (and has possibly already assigned other jobs in that parallel cluster).
Acceptance criteria
- AC1: We know how the scheduler behaves if it has to repair a half-scheduled parallel cluster when `PARALLEL_ONE_HOST_ONLY=1` is used. This is stated in the documentation of the `PARALLEL_ONE_HOST_ONLY=1` feature.
- AC2: The scheduler does not assign parallel jobs across multiple hosts when `PARALLEL_ONE_HOST_ONLY=1` is used. This is also true if a cluster was half-scheduled for whatever reason.
- AC3 (optional): The scheduler restarts a half-scheduled parallel cluster using `PARALLEL_ONE_HOST_ONLY=1` when repairing it is not possible because no matching worker slots are available.
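To make the intent of AC2 and AC3 concrete, here is a minimal sketch of the decision they describe. It is not the openQA scheduler code; `Slot`, `plan_repair`, the host names and the job IDs are all hypothetical, and it ignores worker classes and every other matching criterion. It assumes the jobs already assigned in the cluster all sit on one host (`pinned_host`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Hypothetical worker slot: an ID, the host it runs on, and whether it is free."""
    worker_id: int
    host: str
    free: bool


def plan_repair(pinned_host: str, pending_job_ids: list[int], slots: list[Slot]):
    """Decide how to handle a half-scheduled cluster under a one-host-only rule.

    Returns ("repair", {job_id: worker_id}) if every pending job fits on a free
    slot of pinned_host (AC2), otherwise ("restart", {}) as in the optional AC3.
    """
    # Only slots on the host that already runs the rest of the cluster qualify.
    free_on_host = [s.worker_id for s in slots if s.free and s.host == pinned_host]
    if len(free_on_host) >= len(pending_job_ids):
        # zip stops at the shorter list, so each pending job gets exactly one slot.
        return "repair", dict(zip(pending_job_ids, free_on_host))
    # Not enough free slots on the pinned host: never fall back to another host.
    return "restart", {}


slots = [Slot(1, "hostA", True), Slot(2, "hostB", True), Slot(3, "hostA", False)]
print(plan_repair("hostA", [10], slots))
print(plan_repair("hostA", [10, 11], slots))
```

In the second call a free slot exists on `hostB`, but using it would spread the cluster across hosts, so the sketch opts for a restart. Without AC3 the alternative would be to leave the pending jobs scheduled and wait.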
Remarks
If we don't implement AC3, the behavior we are left with would probably be that the job simply stays scheduled until a worker slot on the required host (where the rest of the parallel cluster is already running) finally becomes available. This means other parallel jobs which are already running might need to wait quite a while, so the system is prone to timeouts.