action #162602
closed
openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
Added by okurz 6 months ago.
Updated 5 months ago.
Category:
Regressions/Crashes
Description
Observation
With #162374 w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the openQA job queue size w40 is executing openQA jobs near-continuously. Now one alert fired about too high CPU load and another about a partition filling up. Similar to #162596
Suggestions
- Maybe the high CPU load was caused by the lack of space - which is tracked in #162596
- Check whether tests are passing successfully on worker40; if there are no typing or similar issues, consider bumping the alert threshold.
- Lower the load limit
- Check the number of worker slots and e.g. reduce according to the load - maybe we didn't notice the capacity was already too high before
- Take #162596 into account
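The suggestion to reduce worker slots according to the load can be sketched as a rule of thumb. This is a hypothetical helper, not part of the OSD tooling; the `per_job_load` factor and the example numbers are assumptions for illustration only.

```python
# Hypothetical rule-of-thumb sketch (not the OSD salt/openQA tooling):
# estimate how many openQA worker slots a machine can sustain from its CPU
# count and a target load-average ceiling, assuming each running job
# contributes roughly one runnable qemu process plus some overhead.

def recommended_slots(cpu_count: int, load_limit: float,
                      per_job_load: float = 1.2) -> int:
    """Return a slot count that keeps the expected load under the alert limit."""
    if per_job_load <= 0:
        raise ValueError("per_job_load must be positive")
    slots = int(load_limit / per_job_load)
    # Never exceed the physical CPU count and always allow at least one slot.
    return max(1, min(slots, cpu_count))

print(recommended_slots(cpu_count=48, load_limit=25.0))  # -> 20 with the defaults
```

With the assumed per-job load factor, a limit of 25 suggests roughly 20 slots on a 48-CPU machine; the real decision would of course be driven by the observed load on worker40.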
Rollback actions
- Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) added
- Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added
- Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to okurz
- Description updated (diff)
- Status changed from In Progress to Resolved
The load limit is effective and the rollback action is done. No alert is firing right now; we will be notified if the alert triggers again.
- Status changed from Resolved to In Progress
Apparently that was not enough; I need to silence the alerts and decide what to do about them.
- Due date set to 2024-07-07
Setting due date based on mean cycle time of SUSE QE Tools
- Priority changed from Urgent to High
Silenced alert again for now.
I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found short-lived load spikes coinciding with significant but non-critical memory usage, low CPU usage, and high I/O usage with maxed-out I/O times. As observed over the past days, during those periods there are particularly demanding openQA jobs with bigger HDDSIZE requests and correspondingly high I/O demands. I assume that in such cases we are not actually hitting typing issues, but we still stall test execution and trigger the observed alerts.
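The triage reasoning above can be made explicit: a high load average combined with low CPU utilisation but high iowait points at I/O saturation rather than CPU starvation. The function below is a hypothetical sketch with made-up thresholds, not part of the openQA or Grafana stack.

```python
# Sketch of the load-spike triage described above (hypothetical helper,
# illustrative thresholds): all inputs are normalised 0..1 figures, with
# load_per_cpu being the load average divided by the CPU count.

def classify_load_spike(load_per_cpu: float, cpu_util: float,
                        iowait: float) -> str:
    """Classify a load spike as cpu-bound, io-bound, or unclear."""
    if load_per_cpu < 1.0:
        return "no spike"
    if cpu_util > 0.8:
        return "cpu-bound"
    if iowait > 0.5:
        # Matches the worker40 observation: load up, CPU mostly idle, I/O maxed out.
        return "io-bound"
    return "unclear"

print(classify_load_spike(load_per_cpu=1.6, cpu_util=0.2, iowait=0.9))  # -> io-bound
```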
So I proposed https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848 to reduce the worker load limit from 30 to 25.
Next idea: Limit "qemu_x86_64-large-mem" jobs to a small number of worker instances, or introduce a new worker class and ask the SAP-HA squad to schedule their tests against it specifically.
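Confining a class to a few slots could look like the following sketch, assuming the standard per-instance sections of openQA's `/etc/openqa/workers.ini`; the slot numbers and the accompanying classes are illustrative, only "qemu_x86_64-large-mem" is taken from the note above.

```ini
; /etc/openqa/workers.ini (sketch): keep the default classes on most slots and
; offer the large-mem class on only two instances, so high-I/O jobs cannot
; saturate the whole machine. Slot numbers and extra classes are illustrative.
[global]
WORKER_CLASS = qemu_x86_64,tap

[1]
WORKER_CLASS = qemu_x86_64,tap,qemu_x86_64-large-mem

[2]
WORKER_CLASS = qemu_x86_64,tap,qemu_x86_64-large-mem
```

Per-instance sections override the global `WORKER_CLASS`, so only slots 1 and 2 would pick up jobs scheduled against the large-mem class.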
- Status changed from In Progress to Feedback
- Related to action #162719: Ensure w40 has more space for worker pool directories size:S added
- Due date deleted (2024-07-07)
- Status changed from Feedback to Blocked
- Priority changed from High to Normal
The proposal for "size" worker classes stands and I would like to give it more time to decide whether that is the right approach. There is also the impact of space depletion, which coincided with the too-high load, so blocking on #162719
- Status changed from Blocked to Resolved