action #157666
openQA Project (public) - coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
OSD unresponsive and then not starting any more jobs on 2024-03-21
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-12
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
No alerts are firing yet. The problem was reported in https://suse.slack.com/archives/C02CANHLANP/p1711027186175969: people couldn't access the webUI and jobs were delayed in starting. It was also reported in https://suse.slack.com/archives/C02CANHLANP/p1711019658421359
Suggestions
- Possibly too many online worker instances again (see https://openqa.suse.de/admin/workers), as in #135122
- Disable, disconnect, or power off some workers to reduce load on OSD -> okurz did that with worker3[6-9] by completely powering off the machines, reducing the number of online openQA worker instances from 1006 to 878, see https://openqa.suse.de/admin/workers . After that the openQA scheduler picked up new jobs again almost immediately, so apparently this was a helpful mitigation.
- Monitor over the next days/weeks whether we hit unresponsiveness and the scheduler refusing to assign jobs again. If we still see such issues, we must investigate further what else could be the problem. Otherwise we would need to block on #157669 before enabling more worker instances again.
Out of scope
- Any significant change in implementation
Rollback steps
- Bring worker3[6-9] back online using IPMI commands
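The rollback step could be scripted roughly as follows. This is only a sketch: the BMC hostname pattern (`<worker>-ipmi.example.org`) and the user name `ADMIN` are placeholders and not taken from this ticket, and the helper names are hypothetical. It only prints the `ipmitool` commands so they can be reviewed before anyone runs them; the `-E` option makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
import shlex


def expand_workers(prefix: str, first: int, last: int) -> list[str]:
    """Expand a range such as worker3[6-9] into worker36 .. worker39."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def power_on_command(bmc_host: str, user: str) -> list[str]:
    """Build (but do not run) an ipmitool call that powers a machine on."""
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over LAN
        "-H", bmc_host,    # BMC address; placeholder, adjust to the real one
        "-U", user,        # BMC user; placeholder
        "-E",              # read the password from IPMI_PASSWORD
        "chassis", "power", "on",
    ]


if __name__ == "__main__":
    for worker in expand_workers("worker", 36, 39):
        # Hypothetical BMC naming scheme, assumed for illustration only.
        bmc = f"{worker}-ipmi.example.org"
        print(shlex.join(power_on_command(bmc, "ADMIN")))
```

Printing instead of executing keeps the sketch free of side effects; the printed lines can be pasted into a shell once the real BMC addresses and credentials are filled in.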