action #133511: [spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions size:S - openQA Infrastructure - openSUSE Project Management Tool

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: jbaier_cz
Target version: openQA Project - Ready
Start date: 2023-07-19

Description

Motivation

See what happened in #132998. Assuming the memory exhaustion is caused by over-committing in job settings, i.e. the QEMU memory assigned to machines, we can prevent memory over-commits with cgroup settings in the openQA worker systemd units that depend on the available memory and divide it fairly among all instances.

Suggestions

Research https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html to learn how to limit memory with cgroup settings in systemd unit files
Figure out how to find a good limit value depending on the number of openQA worker instances, the memory reserved for other services, etc.
Draft a pull request, doc change or similar with an example implementation on one of our workers to demonstrate the concept

Related issues 2 (2 open — 0 closed)

Updated by okurz over 1 year ago

Copied from action #132998: [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M added

Updated by jbaier_cz about 1 year ago

Assignee set to jbaier_cz

I will look at this

Updated by jbaier_cz about 1 year ago

Status changed from Workable to In Progress

I am struggling a bit with how to approach this. My assumption is that we want to prevent worker memory exhaustion to ensure stable worker hosts (and thus reliable openqa-worker instances and the tests inside them).

I can easily prevent system crashes by ensuring there is a memory limit. All our workers are part of openqa-worker.slice, so we can apply MemoryMax to this slice, which will prevent the openqa-workers from exhausting all the memory. Here systemd is very convenient and allows us to set a percentage, so a simple MemoryMax=90% should invoke the OOM killer inside the slice if it becomes too hungry.

After this, we should solve the competition inside the slice. I will need to run some experiments with MemoryMin.
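
A minimal sketch of the slice-wide limit described above, written as a small Python helper that renders a systemd drop-in. The drop-in file name and target directory are assumptions for illustration; this is not necessarily what the eventual pull request contains.

```python
from pathlib import Path


def render_slice_dropin(memory_max: str = "90%") -> str:
    """Return the text of a drop-in that caps the memory of a slice.

    MemoryMax= accepts a percentage, taken relative to the installed
    physical memory of the host.
    """
    return "\n".join(["[Slice]", f"MemoryMax={memory_max}", ""])


def dropin_path(slice_name: str = "openqa-worker.slice",
                file_name: str = "90-memory-limit.conf") -> Path:
    """Hypothetical location of the drop-in (the file name is an assumption)."""
    return Path("/etc/systemd/system") / f"{slice_name}.d" / file_name


if __name__ == "__main__":
    # Only print the result; writing the file needs root and a
    # subsequent "systemctl daemon-reload".
    print(f"# {dropin_path()}")
    print(render_slice_dropin(), end="")
```

Once such a drop-in is in place and systemd has been reloaded, the effective value should be visible via `systemctl show openqa-worker.slice -p MemoryMax`.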

Updated by openqa_review about 1 year ago

Due date set to 2023-10-20

Setting due date based on mean cycle time of SUSE QE Tools

Updated by jbaier_cz about 1 year ago

PR for the global workers memory limit: https://github.com/os-autoinst/openQA/pull/5323

We can additionally apply a limit to each worker individually (if desired), but I am not sure there is a good default value. The safe setting "max memory / number of workers" can probably be computed and set in a drop-in via salt on osd, but it may behave unexpectedly because it does not take QEMURAM into account. A second option would be to look at the current worker settings and set the limit according to the configured QEMURAM, but that is hard to keep in sync.
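
The "max memory / number of workers" idea and its QEMURAM caveat can be sketched as follows. This is only an illustration: the reserved amount, the per-VM overhead and the example host size are assumptions, not values taken from osd, and QEMURAM is assumed to be given in MiB.

```python
def per_worker_limit_mib(total_mib: int, reserved_mib: int, workers: int) -> int:
    """Fair share of memory per worker instance, in MiB.

    reserved_mib is memory kept back for the OS and other services
    (a hypothetical knob, not an existing openQA setting).
    """
    if workers < 1:
        raise ValueError("need at least one worker instance")
    usable = total_mib - reserved_mib
    if usable <= 0:
        raise ValueError("reserved memory exceeds total memory")
    return usable // workers


def limit_too_small(limit_mib: int, qemuram_mib: int, overhead_mib: int = 512) -> bool:
    """True if a job with this QEMURAM would not fit under the limit.

    overhead_mib is an assumed allowance for QEMU itself and the worker
    process on top of the guest RAM.
    """
    return qemuram_mib + overhead_mib > limit_mib


if __name__ == "__main__":
    # Example host: 256 GiB RAM, 8 GiB reserved, 30 worker instances.
    limit = per_worker_limit_mib(256 * 1024, 8 * 1024, 30)
    print(f"per-worker limit: {limit} MiB")
    for qemuram in (4096, 8192):
        print(f"QEMURAM={qemuram}: too small = {limit_too_small(limit, qemuram)}")
```

The second function shows the concern raised above: a fair-share limit computed without looking at QEMURAM can end up below what a configured job actually needs.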

Updated by jbaier_cz about 1 year ago

Status changed from In Progress to Feedback

Updated by okurz about 1 year ago

PR https://github.com/os-autoinst/openQA/pull/5323 merged even though that was not a requirement for this spike solution. Can you demonstrate the functionality?

Updated by jbaier_cz about 1 year ago

Yes, I can easily demonstrate the effect of the directive by running a memory-hungry task inside the slice.
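
A memory-hungry task of this kind could look like the sketch below. It is a hypothetical stand-in, not the script used for the actual demonstration; the script name `hog.py` and the default sizes are assumptions.

```python
import sys


def hog(limit_mib: int, chunk_mib: int = 64) -> int:
    """Allocate memory in chunks up to limit_mib and return the MiB allocated.

    Chunks are filled with a non-zero byte so that the pages are really
    written to and count against the cgroup, not just reserved.
    """
    chunks = []
    allocated = 0
    while allocated + chunk_mib <= limit_mib:
        chunks.append(b"\xff" * (chunk_mib * 1024 * 1024))
        allocated += chunk_mib
    return allocated


if __name__ == "__main__":
    # Target size in MiB; pick a value above the slice limit to trigger
    # the OOM killer. Defaults to a harmless 64 MiB.
    target = int(sys.argv[1]) if len(sys.argv) > 1 else 64
    print(f"allocated {hog(target)} MiB")
```

One way to place it in the slice would be `systemd-run --slice=openqa-worker.slice --wait python3 hog.py <MiB>`; if the target exceeds the slice's MemoryMax, the process should be killed before it prints anything.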

Updated by okurz about 1 year ago

Due date deleted (~~2023-10-20~~)
Status changed from Feedback to Resolved

jbaier demonstrated how the feature works, and we can also see it in effect in https://monitor.qa.suse.de/d/WDworker31/worker-dashboard-worker31?orgId=1&viewPanel=12054&from=1697445822192&to=1697447326787, where there is a big dip in available memory after which the system recovers automatically. Given that the solution is already in our code base and effective, we can conclude; we do not even need a follow-up for the actual implementation, which is already covered.

Updated by livdywan about 1 year ago

Copied to action #150986: [timeboxed:10h] Prevent cpu over-allocation in openQA worker service definitions added
