coordination #102864
opencoordination #102861: [saga][epic] Improved openQA for multi-user environments
[epic] Inform openQA webUI users about potential worker class mismatch or long delays
Start date: 2021-09-13
Due date:
% Done: 0%
Estimated time:
Description
Motivation
In #98562 the idea came up to cancel jobs with an "invalid" worker class, but that is time-dependent. Then in #100973 we implemented automatic cancellation of all jobs after a (longer) timeout so that jobs don't hang around forever. Now we can take the next step and improve the feedback to users about potential worker class mismatches or expected long delays in job execution.
Acceptance criteria
- AC1: Given a scheduled job When worker class does not match any worker entry Then inform user about that fact and that the job is likely misconfigured
- AC2: Given a scheduled job When worker class does match a worker entry And there are currently no online workers for this worker class And the last online time is below a configurable threshold, e.g. 10 minutes, Then inform user about that fact and that the job will likely be executed later
- AC3: Given a scheduled job When worker class does match a worker entry And there are currently no online workers for this worker class And the last online time is above a configurable threshold, e.g. 10 minutes, Then inform user about that fact and that there is likely an infrastructure problem and admins should be contacted
- AC4: Given a scheduled job When worker class does match a worker entry And there are currently no free workers for this worker class And the ratio of "scheduled for this worker class / available worker instances for this worker class" is high Then inform user that longer delays are to be expected
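The four criteria above can be read as one decision function. The following is only a rough sketch of that logic in Python, not openQA code: the `Worker` record, its field names, the `scheduling_hint` function, the returned hint strings and the `high_ratio` default of 5.0 are all assumptions made for illustration. The 10 minute threshold is the example value from AC2/AC3, and "available worker instances" is interpreted here as online workers of the class.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Worker:
    # Hypothetical worker record; the field names are illustrative only
    # and do not reflect the openQA database schema.
    worker_classes: frozenset
    online: bool
    busy: bool
    last_seen: datetime


def scheduling_hint(
    worker_class: str,
    workers: Iterable[Worker],
    scheduled_jobs: int,
    now: datetime,
    offline_threshold: timedelta = timedelta(minutes=10),  # example value from AC2/AC3
    high_ratio: float = 5.0,  # assumed cut-off for "ratio is high" in AC4
) -> Optional[str]:
    """Return a hint for a scheduled job, or None if nothing needs reporting."""
    matching = [w for w in workers if worker_class in w.worker_classes]
    if not matching:
        # AC1: no worker entry knows this class, the job is likely misconfigured.
        return "misconfigured"

    online = [w for w in matching if w.online]
    if not online:
        # Time since any matching worker was last seen online.
        offline_for = now - max(w.last_seen for w in matching)
        if offline_for < offline_threshold:
            # AC2: workers were seen recently, the job will likely run later.
            return "executed_later"
        # AC3: workers are gone for too long, likely an infrastructure problem.
        return "infrastructure_problem"

    free = [w for w in online if not w.busy]
    if not free and scheduled_jobs / len(online) >= high_ratio:
        # AC4: all matching workers are busy and the queue is long.
        return "long_delay"
    return None
```

In this reading, the hint would be computed per scheduled job when its details page is rendered, with `scheduled_jobs` being the number of scheduled jobs for the same worker class.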
Updated by okurz about 3 years ago
- Copied from action #98562: Cancel jobs with invalid WORKER_CLASS after a timeout added
Updated by livdywan about 3 years ago
I think yesterday I hit this case again:
- Configured a worker
- Checked the web UI /workers page
- Waited 10 minutes while my job was not picked up
- No errors in logs anywhere
- AC3 worker class does match a worker entry / there are currently no online workers for this worker class / the last online time is above a configurable threshold, e.g. 10 minutes