coordination #78226

[epic] Prevent or better handle OOM conditions on worker machines

Added by okurz 5 months ago. Updated 5 months ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

The SQLite databases on our worker machines have become corrupted multiple times. We suspect OOM conditions on the worker machines as one of the possible causes.

Acceptance criteria

  • AC1: OOM conditions on worker machines do not go unnoticed, do not cause worker cache SQLite database corruption, or are handled automatically so that openQA jobs produce correct results

Suggestions

  • Understand the history from #78058#note-6: worker8 is the one that is overcommitted on memory; no surprise that SQLite does not work if the machine runs out of memory
  • As we want reliable (database) operations, we need to make sure test developers cannot impact the complete system in a harmful way
  • We should check the total QEMURAM assigned at any given time per worker
  • Detecting OOM and shutting down the worker would teach the test developers quickly :) This should likely be the first step
  • We could have reacted if we had seen an alert again. But if telegraf fails to collect memory metrics, it is no wonder there was no alert. Should we have an alert that triggers before we reach 100% memory usage?
  • Crosscheck alerts: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now - it looks like we do not even have a sensible RAM usage alert anymore (did we ever have one?)
  • As the cache service runs in its own systemd service, maybe we can prevent the worker cgroup from consuming all of the memory, i.e. split 90% of the available RAM across X slots and assign each worker slot that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
    • "Not pick freely" means: crash only that one job (not all services) early if QEMURAM is above WORKER_AVAILABLE_RAM. The job should incomplete with a clear reason directed at the test maintainer
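The last suggestion could be sketched roughly as follows, assuming the budget is a fixed fraction of host RAM split evenly across worker slots. This is a minimal illustration, not openQA code; the function names (per_slot_budget_mb, check_qemuram) and the 90% fraction are assumptions taken from the bullet above, and an equivalent hard limit could instead be enforced via a systemd MemoryMax= setting on each worker slot unit.

```python
# Hypothetical sketch: derive a per-slot memory budget from total host RAM
# and reject jobs whose requested QEMURAM exceeds it, so that only the
# offending job incompletes instead of the whole worker going OOM.
# Names and numbers here are illustrative, not part of the openQA API.

def per_slot_budget_mb(total_ram_mb: int, slots: int,
                       usable_fraction: float = 0.9) -> int:
    """Split a fraction (default 90%) of host RAM evenly across worker slots."""
    if slots < 1:
        raise ValueError("need at least one worker slot")
    return int(total_ram_mb * usable_fraction) // slots


def check_qemuram(qemuram_mb: int, budget_mb: int) -> tuple[bool, str]:
    """Return (ok, reason); a failing check would incomplete only that job."""
    if qemuram_mb > budget_mb:
        return False, (
            f"QEMURAM={qemuram_mb} MB exceeds the per-slot budget of "
            f"{budget_mb} MB; lower QEMURAM or contact the worker admins"
        )
    return True, "ok"


if __name__ == "__main__":
    # Example: a 256 GB machine with 16 worker slots.
    budget = per_slot_budget_mb(total_ram_mb=256_000, slots=16)
    print(budget)  # 14400 MB per slot
    ok, reason = check_qemuram(qemuram_mb=16_384, budget_mb=budget)
    print(ok, reason)
```

Checking the requested QEMURAM against such a budget when the worker accepts a job would give the test maintainer an immediate, attributable failure reason rather than a corrupted cache database hours later.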

Related issues

Copied from openQA Infrastructure - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 (Resolved, 2020-11-16 to 2020-11-18)

History

#1 Updated by okurz 5 months ago

  • Copied from action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8 added

#2 Updated by okurz 5 months ago

  • Tracker changed from action to coordination
  • Subject changed from Prevent or better handle OOM conditions on worker machines to [epic] Prevent or better handle OOM conditions on worker machines
