[epic] Prevent or better handle OOM conditions on worker machines
The SQLite databases on our worker machines have been corrupted multiple times. We suspect OOM conditions on the worker machines as one possible cause.
- AC1: OOM conditions on worker machines do not go unnoticed, and either do not cause worker cache SQLite database corruption or are handled automatically so that openQA jobs produce correct results
- Understand the history from #78058#note-6 : worker8 is the one that is overcommitted on memory - no surprise that SQLite fails when the machine runs out of memory
- As we want reliable (database) operations we need to make sure that test developers cannot harm the complete system
- We should check the QEMURAM assigned at any given time per worker
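As a rough illustration of that check, the following sketch sums the `QEMURAM` setting (in MB) over the jobs currently running on one worker host and compares it against the host RAM. The function names, the default of 1024 MB for jobs without `QEMURAM`, and the headroom value are assumptions, not existing openQA code:

```python
def total_assigned_ram_mb(job_settings, default_mb=1024):
    """Sum QEMURAM (MB) over the jobs running on one worker host.

    job_settings: list of per-job settings dicts; jobs without an explicit
    QEMURAM are counted with an assumed default (hypothetical choice).
    """
    return sum(int(s.get("QEMURAM", default_mb)) for s in job_settings)


def is_overcommitted(job_settings, host_ram_mb, headroom_mb=2048):
    """True when assigned RAM leaves less than `headroom_mb` for the
    worker processes, the cache service and SQLite itself."""
    return total_assigned_ram_mb(job_settings) > host_ram_mb - headroom_mb
```

Such a check could run periodically on each worker host or be evaluated by the scheduler before assigning another job.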
- Detecting OOM and shutting down the worker would teach the test developers quickly :) This should likely be the first step
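For detecting OOM kills without parsing dmesg, the kernel exposes a cumulative `oom_kill` counter in `/proc/vmstat` (available since Linux 4.13). A minimal sketch of reading it, which a watchdog could poll and act on when the counter increases:

```python
def parse_oom_kills(vmstat_text):
    """Return the kernel's cumulative oom_kill counter from /proc/vmstat
    content, or 0 when the field is absent (older kernels)."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return 0


def read_oom_kills(path="/proc/vmstat"):
    """Read the counter from the live system."""
    with open(path) as f:
        return parse_oom_kills(f.read())
```

A systemd timer comparing the current value against the last seen one could then stop the worker service and raise an alert.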
- We could have reacted if we had seen an alert again. But if telegraf fails to report memory, no wonder there was no alert. Should we have an alert that triggers before we reach 100% memory usage?
- Crosscheck alerts: https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now - it looks like we do not even have a sensible RAM usage alert anymore (did we ever have one?)
- As the cache service runs in its own systemd service, maybe we can prevent the worker cgroup from consuming all of the memory, i.e. split 90% of the available RAM across X slots and assign each worker that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
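A sketch of how such a per-slot limit could look with systemd's cgroup memory controls (cgroup v2): `MemoryHigh=` throttles a unit that exceeds the limit and `MemoryMax=` hard-caps it. The unit name follows openQA's worker instances; the concrete values are placeholders that would need to be derived from host RAM and slot count:

```ini
# /etc/systemd/system/openqa-worker@.service.d/memory-limit.conf
# Example only: cap each worker slot so the cache service and SQLite
# always keep some memory for themselves.
[Service]
MemoryHigh=6G
MemoryMax=8G
MemorySwapMax=0
```

With this in place the kernel would OOM-kill processes inside one slot's cgroup instead of arbitrary system processes, so a single oversized job cannot take down the cache service.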
- "not pick freely means" crash that one job (not all services) early if QEMURAM is above WORKER_AVAILABLE_RAM. Job should incomplete with a clear reason pointing to the test maintainer