action #157666
closed
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
OSD unresponsive and then not starting any more jobs on 2024-03-21
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-12
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
No alerts are firing yet. The problem was reported in https://suse.slack.com/archives/C02CANHLANP/p1711027186175969: people couldn't access the webUI and jobs were delayed in starting. It was also reported in https://suse.slack.com/archives/C02CANHLANP/p1711019658421359
Suggestions
- Possibly too many online worker instances again (see https://openqa.suse.de/admin/workers), as in #135122
- Disable, disconnect or power off some workers to reduce load on OSD -> okurz did that with worker3[6-9] by powering off the machines completely, reducing the number of online openQA worker instances from 1006 to 878, see https://openqa.suse.de/admin/workers . After that the openQA scheduler picked up new jobs again almost immediately, so apparently this was a helpful mitigation.
- Monitor over the next days/weeks whether we hit unresponsiveness and the scheduler refusing to assign jobs. If we still see such issues, we must investigate further what else could be the problem; otherwise we would need to block on #157669 before enabling more worker instances again.
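The monitoring suggestion could be backed by a small check on the number of online worker instances. The sketch below is only an illustration: the record layout (a `status` field where `dead` means offline) and the limit of 1000 instances are assumptions, not the confirmed openQA API or a verified threshold.

```python
from typing import Iterable, Mapping

# Assumption: trouble starts at roughly 1k connected instances;
# the real limit may differ.
ASSUMED_INSTANCE_LIMIT = 1000


def count_online(workers: Iterable[Mapping[str, str]]) -> int:
    """Count worker records whose (assumed) status field is not 'dead'."""
    return sum(1 for w in workers if w.get("status") != "dead")


def headroom(online: int, limit: int = ASSUMED_INSTANCE_LIMIT) -> int:
    """Instances left before the assumed limit; negative means over it."""
    return limit - online


if __name__ == "__main__":
    # Numbers from this ticket: 1006 instances before, 878 after the power-off.
    for online in (1006, 878):
        print(online, headroom(online))
```

With the numbers from this ticket, 1006 instances sit above the assumed limit and 878 leave some headroom, which matches the observed recovery of the scheduler.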
Out of scope
- Any significant change in implementation
Rollback steps
- Bring worker3[6-9] back online using IPMI commands
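The rollback step could be scripted roughly as follows. This sketch only builds and prints the `ipmitool` command lines for worker36 to worker39 and does not execute them. The BMC host name pattern (`<worker>-ipmi.example.org`) and the user name `ADMIN` are hypothetical placeholders; `-E` makes ipmitool read the password from the `IPMI_PASSWORD` environment variable.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand a single-digit range like 'worker3[6-9]' into host names."""
    match = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(start), int(end) + 1)]


def power_on_command(bmc_host: str, user: str) -> List[str]:
    """Build an ipmitool call; -E takes the password from IPMI_PASSWORD."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-E", "chassis", "power", "on"]


if __name__ == "__main__":
    for host in expand_hosts("worker3[6-9]"):
        # Hypothetical BMC naming scheme and user; adjust to the real setup.
        print(" ".join(power_on_command(f"{host}-ipmi.example.org", "ADMIN")))
```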
Updated by okurz 7 months ago
- Copied from action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z added
Updated by okurz 7 months ago
- Related to coordination #157669: websockets+scheduler improvements to support more online worker instances added
Updated by okurz 7 months ago
- Related to action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) added
Updated by tinita about 1 month ago
- Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added