action #132998
[alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M
Start date: 2023-07-19
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
See https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?orgId=1&viewPanel=12054&from=1689743130960&to=1689746327640 and the corresponding alert email.
The graph shows that the system exhausted all available memory.
Acceptance criteria
- AC1: Measures have been applied to prevent memory exhaustion
- AC2: It is safe to schedule jobs with excessively high memory requirements
Acceptance Tests
- AT1-1: A job with QEMURAM=999999999 aborts cleanly without alerts being raised
- AT1-2: A worker without the mitigation kills processes due to memory exhaustion
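AT1-1 could be backed by a pre-flight check along the following lines. This is only a sketch and not taken from openQA or os-autoinst: the function names `read_mem_available_mib` and `check_qemuram` are hypothetical, and it assumes that QEMURAM is given in MiB and that the host exposes `MemAvailable` in `/proc/meminfo`.

```python
from __future__ import annotations


def read_mem_available_mib(meminfo_text: str) -> int:
    """Parse MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


def check_qemuram(requested_mib: int, available_mib: int) -> tuple[bool, str]:
    """Hypothetical helper: decide whether a job may start.

    Returns (ok, reason) so the caller can abort cleanly with a message
    instead of letting the host run out of memory.
    """
    if requested_mib <= 0:
        return False, f"QEMURAM={requested_mib} is not a positive size"
    if requested_mib > available_mib:
        return False, (
            f"QEMURAM={requested_mib} MiB exceeds available {available_mib} MiB"
        )
    return True, "ok"


if __name__ == "__main__":
    # Assumes a Linux host; QEMURAM value taken from AT1-1.
    with open("/proc/meminfo") as fh:
        available = read_mem_available_mib(fh.read())
    print(check_qemuram(999999999, available))
```

Returning a reason string instead of raising keeps the abort path explicit, which is what "aborts cleanly" in AT1-1 asks for.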
Suggestions
- Look into the logs and the openQA jobs that ran on that host to find out what exhausted the memory, likely too many openQA jobs that were too big
- Ask people to not do that!
- As necessary, adapt the number of worker instances or use different worker classes like "big mem"
- As necessary, adapt job scenarios to not overcommit memory
- If it is not openQA jobs, then look into what else it is
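To judge whether the number of worker instances needs adapting, a rough calculation like the one below may help. It is a sketch under stated assumptions, not an openQA feature: the function names, the 2048 MiB reserve for the host itself, and the example sizes are all hypothetical, and all values are taken to be in MiB.

```python
def max_worker_instances(total_mib: int, per_job_mib: int,
                         reserve_mib: int = 2048) -> int:
    """How many jobs of a given size fit without over-committing.

    reserve_mib is an assumed allowance for the OS and worker services.
    """
    if per_job_mib <= 0:
        raise ValueError("per_job_mib must be positive")
    usable = total_mib - reserve_mib
    return max(usable // per_job_mib, 0)


def overcommit_ratio(job_mibs: list[int], total_mib: int) -> float:
    """Ratio of requested job memory to host memory; above 1.0 is over-commit."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    return sum(job_mibs) / total_mib


if __name__ == "__main__":
    # Hypothetical host with 64 GiB and jobs requesting 4 GiB each.
    print(max_worker_instances(65536, 4096))
    # Ten hypothetical 8 GiB jobs on the same host.
    print(overcommit_ratio([8192] * 10, 65536))
```

A ratio above 1.0 for the jobs that ran at the time of the alert would support the "too many too big jobs" hypothesis; a ratio well below 1.0 would point to something other than openQA jobs.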
Out of scope
- Preventing the over-commit in the openQA worker itself, see #133511 for this
Updated by okurz over 1 year ago
- Copied to action #133511: [spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions size:S added
Updated by okurz over 1 year ago
- Subject changed from [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) to [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M
- Description updated (diff)
- Status changed from New to Workable