action #133511: [spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #133511

closed

[spike solution][timeboxed:10h] Prevent memory over-commits in openQA worker service definitions size:S

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: jbaier_cz
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-07-19
Due date:
% Done:
Estimated time:
Description

Motivation

See what happened in #132998. Assuming the memory exhaustion is caused by over-committing in job settings, i.e. the QEMU memory assigned to machines, we can prevent memory over-commits with some fancy cgroup settings in the openQA worker systemd units, depending on the available memory and a fair division among all instances.

Suggestions

Research https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html on how to limit memory with cgroup settings in systemd unit files
Figure out how to find a good limit value depending on the number of openQA worker instances, the memory reserved for other services, etc.
Draft a pull request, doc change or similar with an example implementation on one of our workers to demonstrate the concept

Related issues 2 (2 open — 0 closed)

Copied from openQA Infrastructure (public) - action #132998: [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M (Workable, 2023-07-19)

Copied to openQA Infrastructure (public) - action #150986: [timeboxed:10h] Prevent cpu over-allocation in openQA worker service definitions (New, 2023-07-19)

History

Updated by okurz over 1 year ago

Copied from action #132998: [alert] [FIRING:1] openqaworker-arm-3: Memory usage alert openQA (openqaworker-arm-3 memory_usage_alert_openqaworker-arm-3 worker) size:M added

Updated by jbaier_cz over 1 year ago

Assignee set to jbaier_cz

I will look at this

Updated by jbaier_cz about 1 year ago

Status changed from Workable to In Progress

I am struggling a little with how to approach this. My assumption is that we want to prevent worker memory exhaustion to ensure stable workers (and thus reliable openqa-worker instances and the tests running inside them).

I can easily prevent system crashes by ensuring there is a memory limit. All our workers are part of openqa-worker.slice, so we can apply MemoryMax to this slice, which will prevent the openqa-workers from exhausting all the memory. Here systemd is very convenient and allows us to set a percentage, so a simple MemoryMax=90% should invoke the OOM killer inside the slice if it becomes too hungry.

After this, we should solve the competition inside the slice. I will need to run some experiments with MemoryMin.
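
To make the slice-wide limit concrete, here is a minimal sketch that renders a drop-in for openqa-worker.slice with the MemoryMax=90% value mentioned above. The function name and the target path in the comment are assumptions for illustration only; they are not taken from this ticket or from the PR.

```python
def render_slice_dropin(memory_max_percent: int = 90) -> str:
    """Return the text of a systemd drop-in that caps a slice's memory."""
    if not 0 < memory_max_percent <= 100:
        raise ValueError("memory_max_percent must be in (0, 100]")
    # MemoryMax= accepts a percentage relative to installed physical memory.
    return f"[Slice]\nMemoryMax={memory_max_percent}%\n"


# Hypothetical target path, not taken from the ticket or the PR:
#   /etc/systemd/system/openqa-worker.slice.d/memory.conf
print(render_slice_dropin(90), end="")
```

After placing such a file, `systemctl daemon-reload` is needed, and `systemctl show openqa-worker.slice -p MemoryMax` should report the effective value.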

Updated by openqa_review about 1 year ago

Due date set to 2023-10-20

Setting due date based on mean cycle time of SUSE QE Tools

Updated by jbaier_cz about 1 year ago

PR for the global workers memory limit: https://github.com/os-autoinst/openQA/pull/5323

We can further apply a limit to each worker individually (if desired), but I am not sure there is a good default value. The safe setting "max memory / number of workers" can probably be computed and set in a drop-in via salt on osd, but it can behave unexpectedly because it does not take QEMURAM into account. A second option would be to look at the current worker settings and set the limit according to the configured QEMURAM, but that is hard to keep in sync.
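
The "max memory / number of workers" idea and its blind spot regarding QEMURAM can be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the host numbers, the per-instance QEMURAM values (assumed to be in MiB) and the extra overhead per QEMU process are made up, and this is not what is deployed via salt.

```python
def per_worker_limit_mib(total_mib: int, reserved_mib: int, instances: int) -> int:
    """Fair share of usable memory per worker instance, in MiB."""
    if instances < 1:
        raise ValueError("need at least one worker instance")
    usable = total_mib - reserved_mib
    if usable <= 0:
        raise ValueError("reserved memory leaves nothing for the workers")
    return usable // instances


def undersized_instances(limit_mib, qemuram_by_instance, overhead_mib=0):
    """Instances whose QEMURAM plus overhead would not fit under the limit."""
    return sorted(
        instance
        for instance, qemuram in qemuram_by_instance.items()
        if qemuram + overhead_mib > limit_mib
    )


# Made-up host: 256 GiB RAM, 16 GiB reserved, 30 worker instances.
limit = per_worker_limit_mib(262144, 16384, 30)
print(f"MemoryMax={limit}M")
# Made-up QEMURAM values per instance, 512 MiB assumed overhead.
print(undersized_instances(limit, {1: 2048, 2: 8192, 3: 16384}, overhead_mib=512))
```

The second function illustrates why a purely even split can behave unexpectedly: any job whose QEMURAM exceeds the fair share would hit the per-worker limit even while the host as a whole still has free memory.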

Updated by jbaier_cz about 1 year ago

Status changed from In Progress to Feedback

Updated by okurz about 1 year ago

PR https://github.com/os-autoinst/openQA/pull/5323 merged even though that was not a requirement for this spike solution. Can you demonstrate the functionality?

Updated by jbaier_cz about 1 year ago

Yes, I can easily demonstrate the effects of the directive by running a memory-hungry task inside the slice.
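
One possible shape for such a memory-hungry task is sketched below. This is an assumption about how a demonstration could look, not a record of what was actually run; the function name and its parameters are hypothetical.

```python
import time

MIB = 1024 * 1024


def hog(total_mib: int, chunk_mib: int = 64, hold_seconds: float = 0.0) -> int:
    """Allocate up to total_mib MiB in chunks and return the MiB allocated."""
    chunks = []
    allocated = 0
    while allocated < total_mib:
        step = min(chunk_mib, total_mib - allocated)
        # Non-zero fill so the pages are really written, not just reserved.
        chunks.append(b"\x01" * (step * MIB))
        allocated += step
    # Keep the memory referenced for a while so the usage is observable.
    time.sleep(hold_seconds)
    return allocated
```

Saved as a script (for example a hypothetical hog.py that calls `hog()` with a total above the slice limit), it could be started inside the slice with `systemd-run --slice=openqa-worker.slice --pty python3 hog.py`. Once the slice exceeds its MemoryMax, the OOM killer should act inside the slice rather than on the rest of the system.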

Updated by okurz about 1 year ago

Due date deleted (~~2023-10-20~~)
Status changed from Feedback to Resolved

jbaier demonstrated how the feature works, and we can also see it in effect in https://monitor.qa.suse.de/d/WDworker31/worker-dashboard-worker31?orgId=1&viewPanel=12054&from=1697445822192&to=1697447326787 where there is a big dip in available memory, after which the system automatically recovers. Given that the solution is already in our code base and effective, we can conclude this spike and do not even need a follow-up for the actual implementation, which is already covered.

Updated by livdywan about 1 year ago

Copied to action #150986: [timeboxed:10h] Prevent cpu over-allocation in openQA worker service definitions added
