action #133511
closed
[spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions size:S
Added by okurz over 1 year ago.
Updated about 1 year ago.
Description
Motivation
See what happened in #132998. Assuming the memory exhaustion is caused by over-committing memory in job settings, i.e. the QEMU memory assigned to test machines, we can prevent memory over-commits with cgroup settings in the openQA worker systemd units that depend on the available memory and divide it fairly among all instances.
Suggestions
- Research https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html on how to limit memory with cgroup settings in systemd unit files
- Figure out how to find a good limit value depending on the number of openQA worker instances, memory reserved for other services, etc.
- Draft a pull request, doc change or similar with an example implementation on one of our workers to demonstrate the concept
Related issues
2 (2 open — 0 closed)
- Copied from action #132998: [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M added
- Assignee set to jbaier_cz
- Status changed from Workable to In Progress
I am struggling a bit with how to approach this. My assumption is that we want to prevent worker memory exhaustion to ensure stable worker hosts (and thus reliable openqa-worker instances and the tests running inside them).
I can easily prevent system crashes by ensuring there is a memory limit. All our workers are part of openqa-worker.slice, so we can apply MemoryMax to this slice, which will prevent the openqa-workers from exhausting all the memory. Here systemd is very convenient and allows us to set a percentage, so a simple MemoryMax=90% should invoke the OOM killer inside the slice if it becomes too hungry.
After this, we should solve the competition inside the slice. I will need to run some experiments with MemoryMin.
- Due date set to 2023-10-20
Setting due date based on mean cycle time of SUSE QE Tools
PR for the global workers memory limit: https://github.com/os-autoinst/openQA/pull/5323
We can additionally apply a limit to each worker individually (if desired), but I am not sure there is a good default value. The safe setting "max memory / number of workers" can probably be computed and set in a drop-in via salt on osd, but it can lead to unexpected behavior because it does not take QEMURAM into account. A second option would be to look at the current worker settings and set the limit according to the configured QEMURAM, but that is hard to keep in sync.
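To make the "max memory / number of workers" idea concrete, here is a minimal sketch of how such a value could be computed and turned into drop-in content. All function names are hypothetical, the reserved amount is an arbitrary example, and the code assumes QEMURAM values are given in MiB; how salt would actually deploy the result is not covered.

```python
def parse_mem_total_kib(meminfo_text):
    """Extract the MemTotal value (in KiB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def per_worker_limit_mib(total_mib, instances, reserved_mib=0):
    """Evenly split the memory left after the reservation among all instances."""
    if instances < 1:
        raise ValueError("need at least one worker instance")
    usable_mib = total_mib - reserved_mib
    if usable_mib <= 0:
        raise ValueError("reservation leaves no memory for workers")
    return usable_mib // instances


def qemuram_over_limit(limit_mib, qemuram_mib_values):
    """Return the QEMURAM settings (assumed MiB) that would not fit the limit."""
    return [value for value in qemuram_mib_values if value > limit_mib]


def render_dropin(limit_mib):
    """Render drop-in content; M is parsed by systemd with base 1024."""
    return f"[Service]\nMemoryMax={limit_mib}M\n"


if __name__ == "__main__":
    # Example numbers only: 256 GiB host, 30 instances, 16 GiB reserved.
    limit = per_worker_limit_mib(262144, 30, reserved_mib=16384)
    print(render_dropin(limit), end="")
    print("QEMURAM values over the limit:", qemuram_over_limit(limit, [2048, 8192, 16384]))
```

The last helper illustrates the caveat above: an evenly split limit can be smaller than the QEMURAM a job asks for, so such jobs would hit the per-worker limit even though the host as a whole still has free memory.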
- Status changed from In Progress to Feedback
Yes, I can easily demonstrate the effect of the directive by running a memory-hungry task inside the slice.
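A memory-hungry task for such a demonstration could look roughly like the following sketch. The function name and chunk size are made up for illustration; this is not necessarily the task that was actually used.

```python
import time

MIB = 1024 * 1024


def hog_memory(total_mib, chunk_mib=64, pause_s=0.0):
    """Allocate about total_mib MiB in chunks and return the chunks to keep them alive."""
    chunks = []
    allocated = 0
    while allocated < total_mib:
        step = min(chunk_mib, total_mib - allocated)
        # Fill with a non-zero byte so the pages are really written
        # instead of being lazily mapped zero pages.
        chunks.append(b"\x01" * (step * MIB))
        allocated += step
        if pause_s:
            time.sleep(pause_s)  # slow down so the growth can be observed
    return chunks
```

A small script calling `hog_memory` with a target above the limit can be started in the slice with `systemd-run --slice=openqa-worker.slice`. Once the usage of the slice cannot be kept under MemoryMax, the OOM killer should act inside the slice rather than on the rest of the system.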
- Due date deleted (2023-10-20)
- Status changed from Feedback to Resolved
- Copied to action #150986: [timeboxed:10h] Prevent cpu over-allocation in openQA worker service definitions added