action #133511

closed

[spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions size:S

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2023-07-19
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See what happened in #132998. Assuming the memory exhaustion is caused by over-committing in job settings, i.e. the QEMU memory assigned to machines, we can prevent memory over-commits with cgroup settings in the openQA worker systemd units, based on the available memory and a fair division among all instances.

Suggestions

  • Research in https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html how to limit memory with cgroup settings in systemd unit files
  • Figure out how to find a good limit value depending on the number of openQA worker instances, reserved memory for other services, etc.
  • Draft a pull request, doc change or similar with an example implementation on one of our workers to demonstrate the concept

Related issues 2 (2 open, 0 closed)

Copied from openQA Infrastructure (public) - action #132998: [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M | Workable | 2023-07-19
Copied to openQA Infrastructure (public) - action #150986: [timeboxed:10h] Prevent cpu over-allocation in openQA worker service definitions | New | 2023-07-19

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #132998: [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M added
Actions #2

Updated by jbaier_cz about 1 year ago

  • Assignee set to jbaier_cz

I will look at this

Actions #3

Updated by jbaier_cz about 1 year ago

  • Status changed from Workable to In Progress

I am struggling a little with how to approach this. My assumption is that we want to prevent worker memory exhaustion to ensure a stable worker host (and thus reliable openqa-worker instances and the tests running inside them).

I can easily prevent system crashes by ensuring there is a memory limit. All our workers are part of openqa-worker.slice, so we can apply MemoryMax to this slice, which will prevent the openqa-workers from exhausting all the memory. Here systemd is very convenient and allows us to set a percentage, so a simple MemoryMax=90% should invoke the OOM killer inside the slice if it becomes too hungry.

After this, we should solve the competition inside the slice. I will need to run some experiments with MemoryMin.
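
For illustration only, here is a minimal Python sketch of the slice-wide limit described above. It renders a drop-in body for a slice with a percentage MemoryMax and estimates the byte value that percentage corresponds to on a given host. The function names and the drop-in path in the comment are assumptions, and systemd's own rounding of the percentage may differ slightly from this estimate.

```python
def parse_mem_total_kib(meminfo_text: str) -> int:
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def slice_dropin(percent: int) -> str:
    """Render a drop-in body limiting a slice to a share of physical memory.

    Assumed (hypothetical) location of the file:
    /etc/systemd/system/openqa-worker.slice.d/memory.conf
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be within 1..100")
    return f"[Slice]\nMemoryMax={percent}%\n"


def estimated_limit_bytes(mem_total_kib: int, percent: int) -> int:
    """Estimate the absolute limit that a percentage MemoryMax stands for."""
    return mem_total_kib * 1024 * percent // 100


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        total_kib = parse_mem_total_kib(f.read())
    print(slice_dropin(90), end="")
    print("estimated limit in bytes:", estimated_limit_bytes(total_kib, 90))
```

After placing such a drop-in, `systemctl daemon-reload` would be needed for systemd to pick it up.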

Actions #4

Updated by openqa_review about 1 year ago

  • Due date set to 2023-10-20

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by jbaier_cz about 1 year ago

PR for the global workers memory limit: https://github.com/os-autoinst/openQA/pull/5323

We can further apply a limit to each worker individually (if desired), but I am not sure there is a good default value. The safe setting "max memory / number of workers" can probably be computed and set in a drop-in via salt on osd, but it can lead to unexpected behavior as it does not take QEMURAM into account. A second option would be to look at the current worker settings and set the limit according to the configured QEMURAM, but that is hard to keep in sync.
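
A rough sketch of the "max memory / number of workers" arithmetic mentioned above, and of why it can clash with QEMURAM. All numbers, the function names and the per-instance overhead value are hypothetical, and QEMURAM is assumed to be given in MiB.

```python
def per_worker_limit_mib(total_mib: int, reserved_mib: int, instances: int) -> int:
    """Fair share per worker instance after reserving memory for other services."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    usable = total_mib - reserved_mib
    if usable <= 0:
        raise ValueError("reserved memory exceeds total memory")
    return usable // instances


def fits(limit_mib: int, qemuram_mib: int, overhead_mib: int = 512) -> bool:
    """Check whether a job's QEMURAM plus an assumed overhead fits the limit."""
    return qemuram_mib + overhead_mib <= limit_mib


def service_dropin(limit_mib: int) -> str:
    """Render a drop-in body for a per-instance limit (systemd's M suffix is base 1024)."""
    return f"[Service]\nMemoryMax={limit_mib}M\n"


if __name__ == "__main__":
    # Hypothetical host: 256 GiB RAM, 16 GiB reserved, 30 worker instances.
    limit = per_worker_limit_mib(262144, 16384, 30)
    print(service_dropin(limit), end="")
    for qemuram in (4096, 8192):
        print(f"QEMURAM={qemuram} fits: {fits(limit, qemuram)}")
```

With these made-up numbers a job whose QEMURAM equals the fair share no longer fits once any overhead is added, which is the kind of unexpected behavior a purely division-based limit could cause.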

Actions #6

Updated by jbaier_cz about 1 year ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by okurz about 1 year ago

PR https://github.com/os-autoinst/openQA/pull/5323 merged even though that was not a requirement for this spike solution. Can you demonstrate the functionality?

Actions #8

Updated by jbaier_cz about 1 year ago

Yes, I can easily demonstrate the effects of the directive by running a memory-hungry task inside the slice.
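
One possible shape of such a memory-hungry task, as a sketch only; the actual task used for the demonstration is not recorded here. The script allocates memory in chunks and fills each chunk with non-zero bytes so the pages really become resident. It is deliberately bounded by a maximum chunk count and does nothing unless both arguments are given. The function name and the command-line convention are assumptions.

```python
import sys


def allocate(chunk_bytes: int, max_chunks: int) -> list:
    """Allocate up to max_chunks buffers of chunk_bytes each and keep them alive."""
    chunks = []
    for _ in range(max_chunks):
        # Filling with non-zero bytes forces the kernel to back the pages.
        chunks.append(bytearray(b"\x01") * chunk_bytes)
    return chunks


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical): hog.py <chunk size in MiB> <max number of chunks>
    chunk_mib, max_chunks = int(sys.argv[1]), int(sys.argv[2])
    held = []
    for i in range(max_chunks):
        held.extend(allocate(chunk_mib * 1024 * 1024, 1))
        print(f"allocated {(i + 1) * chunk_mib} MiB", flush=True)
```

Assuming the script is saved as hog.py, something like `systemd-run --slice=openqa-worker.slice --wait python3 hog.py 100 1000` should place it in the slice, where it would be expected to be killed once the slice exceeds its MemoryMax.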

Actions #9

Updated by okurz about 1 year ago

  • Due date deleted (2023-10-20)
  • Status changed from Feedback to Resolved

jbaier demonstrated how the feature works, and we can see it in effect in https://monitor.qa.suse.de/d/WDworker31/worker-dashboard-worker31?orgId=1&viewPanel=12054&from=1697445822192&to=1697447326787 as well, where there is a big dip in available memory after which the system recovers automatically. Given that the solution is already in our code base and effective, we can conclude here and do not even need a follow-up for the actual implementation, which is already covered.

Actions #10

Updated by livdywan about 1 year ago

  • Copied to action #150986: [timeboxed:10h] Prevent cpu over-allocation in openQA worker service definitions added