coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines - openQA Infrastructure - openSUSE Project Management Tool

coordination #78226

open

[epic] Prevent or better handle OOM conditions on worker machines

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status: Workable
Priority: Normal
Assignee:
Category:
Target version: QA - future
Start date:
Due date:
% Done:
Estimated time:
Description

Motivation

The SQLite databases on our worker machines have been corrupted multiple times. We suspect OOM conditions on the worker machines as one of the possible causes.

Acceptance criteria

AC1: OOM conditions on worker machines do not go unnoticed, and they either do not cause worker cache SQLite database corruption or are handled automatically, so that openQA jobs produce correct results

Suggestions

Understand the history from #78058#note-6: worker8 is the one that is overcommitted on memory. It is no surprise that SQLite does not work when the machine runs out of memory
As we want reliable (database) operations, we need to make sure that test developers cannot impact the complete system in a harmful way
We should check the total QEMURAM assigned per worker at any given time
Detecting OOM and shutting down the worker would teach the test developers quickly :) This should likely be the first step
We could have reacted if we had seen an alert again. But if telegraf fails to get memory, it is no wonder there was no alert. Should we have an alert that triggers before we reach 100% memory usage?
Crosscheck alerts: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now It looks like we do not even have a sensible RAM usage alert anymore (did we ever have one?)
As the cache service runs in its own systemd service, maybe we can block the worker cgroup from getting all of the memory, i.e. split 90% of the available RAM across X slots and assign each worker slot that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
- "Not pick freely" means: crash that one job (not all services) early if QEMURAM is above WORKER_AVAILABLE_RAM. The job should end up incomplete with a clear reason pointing to the test maintainer
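The budgeting and early-alert ideas above could be sketched roughly as follows. This is only an illustration, not existing openQA code: all function names are hypothetical, QEMURAM is assumed to be given in MiB, WORKER_AVAILABLE_RAM is the hypothetical per-slot limit named in the suggestion, and the 90% share and the alert threshold are placeholder values. Memory figures are taken from text in the format of /proc/meminfo (values in KiB).

```python
from typing import Dict, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into a mapping of field name to its number."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def slot_budget_mib(mem_total_kib: int, slots: int, share_percent: int = 90) -> int:
    """Split share_percent of the total RAM evenly across worker slots (result in MiB)."""
    if slots < 1:
        raise ValueError("slots must be >= 1")
    return mem_total_kib * share_percent // 100 // 1024 // slots


def check_qemuram(qemuram_mib: int, budget_mib: int,
                  overcommit_percent: int = 100) -> Optional[str]:
    """Return a reason for failing the job early, or None if QEMURAM fits the budget.

    overcommit_percent above 100 allows some overcommit (assumption: a simple
    multiplier on the per-slot budget).
    """
    limit = budget_mib * overcommit_percent // 100
    if qemuram_mib > limit:
        return (f"QEMURAM={qemuram_mib} MiB exceeds WORKER_AVAILABLE_RAM={limit} MiB, "
                "please contact the test maintainer")
    return None


def memory_used_percent(meminfo: Dict[str, int]) -> float:
    """Share of RAM in use, based on the MemAvailable and MemTotal fields."""
    return 100.0 * (1 - meminfo["MemAvailable"] / meminfo["MemTotal"])


def should_alert(meminfo: Dict[str, int], threshold_percent: float = 90.0) -> bool:
    """True once memory usage reaches the threshold, i.e. before hitting 100%."""
    return memory_used_percent(meminfo) >= threshold_percent
```

On a worker host, `parse_meminfo(open("/proc/meminfo").read())` would supply the input. The budget check only decides whether a single job should be rejected; actually enforcing a limit on the worker cgroup would be a separate step on the systemd side.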

Related issues 1 (0 open — 1 closed)

Updated by okurz almost 4 years ago

Copied from action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added

Updated by okurz almost 4 years ago

Tracker changed from action to coordination
Subject changed from Prevent or better handle OOM conditions on worker machines to [epic] Prevent or better handle OOM conditions on worker machines
