action #78058 (closed)

[Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 2020-11-16
Due date: 2020-11-18
% Done: 0%
Estimated time: -

Description

Observation

Alert email received from Grafana.


Related issues 5 (1 open, 4 closed)

Related to openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry - Resolved, mkittler, 2020-05-18

Related to openQA Infrastructure - action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry - Resolved, okurz

Related to openQA Infrastructure - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry - Resolved, okurz, 2020-10-14 - 2020-10-16

Related to openQA Project - action #73321: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" - Rejected, okurz, 2020-10-14

Copied to openQA Infrastructure - coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines - Workable

Actions #1

Updated by okurz over 3 years ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added
Actions #2

Updated by okurz over 3 years ago

  • Related to action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added
Actions #3

Updated by okurz over 3 years ago

  • Related to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added
Actions #4

Updated by okurz over 3 years ago

  • Related to action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" added
Actions #5

Updated by okurz over 3 years ago

  • Tags set to alert, cache, openqaworker8, osd, corrupted, grafana
  • Due date set to 2020-11-18
  • Status changed from New to Feedback

On openqaworker8 did:

sudo systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service && sudo rm -rf /var/lib/openqa/cache/ && sudo systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service

and monitored the system journal.
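
While the cache service rebuilds its SQLite database and re-downloads assets, the relevant units can be followed in the journal. This invocation is only an illustration of the "monitored the system journal" step, reusing the unit names from the stop/start command above:

sudo journalctl -f -u openqa-worker-cacheservice -u openqa-worker-cacheservice-minion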

Then called

export host=openqa.suse.de; ./openqa-monitor-incompletes |  ./openqa-label-known-issues

to label and retry incompletes.
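
For context, openqa-monitor-incompletes and openqa-label-known-issues are the helper scripts from the os-autoinst "scripts" repository; a minimal sketch of running them from a fresh checkout (the repository URL is an assumption, not stated in this ticket):

git clone https://github.com/os-autoinst/scripts.git && cd scripts
export host=openqa.suse.de
./openqa-monitor-incompletes | ./openqa-label-known-issues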

Paused the alert "Incomplete jobs of last 24h alert" for now. Should revisit next day.

Something seems to be wrong with openqaworker8, see related tickets. Maybe conduct a storage hardware test?

Actions #6

Updated by okurz over 3 years ago

  • Status changed from Feedback to In Progress
<@coolo> worker8 is the one that is overcommitted on memory - no surprise sqlite does not work if running out of memory
<@coolo> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&from=1605590335474&to=1605590707245 - that's
<@coolo> I think if you want reliable (database) operations you need to make sure the test developers aren't screwing with you - and we'd need to check the assigned QEMURAM at a given time per worker
<@coolo> of course detecting OOM and shutting down the worker would teach the test developers quickly 🙂
<@coolo> https://paste.opensuse.org/view/raw/83854722 is from journal about that time
<@okurz> oh geez. I thought I could trust people after I explained it to them the last time. I would have reacted if I had seen an alert again. But if telegraf fails to get memory, no wonder there was no alert
<@coolo> so what do you suggest? look for some kind of OOM detection daemon? crash the kernel and reboot and rely on us finding it in the logs?
<@okurz> have an alert that triggers *before* we reach 100% mem usage?
<@okurz> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now looks like we do not even have a sensible RAM usage alert anymore (did we ever have?)
<@coolo> the cache service is in its own systemd service, right? I'm sure there are ways to block the worker cgroup from getting all of the memory
<@coolo> i.e. split 90% of RAM available to X slots and assign each worker that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
<@okurz> not pick freely means crash early if QEMURAM is above WORKER_AVAILABLE_RAM?
<@coolo> right, but only that one job and not all of the worker's service
<@okurz> of course, I mean to incomplete said job with a clear reason pointing to the test maintainer
<@coolo> but that will only cure the problem on worker8 - if the sqlite problem plagues other hosts, you don't want to put all your fun on this issue
<@okurz> no, it would be yet another epic with a lot of subtasks to cover different aspects from different angles
<@okurz> what can I do *right now* to prevent further chaos? I guess I will reduce the number of instances further to have some margin
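
One concrete way to realize the cgroup-based limit discussed in the chat above is a systemd drop-in for the worker slot template unit, so a single memory-hungry job can no longer starve the cache service. The directives are standard systemd memory controls, but the concrete limits below are assumptions for illustration, not settings from this ticket:

# e.g. /etc/systemd/system/openqa-worker@.service.d/memory-limit.conf
[Service]
MemoryHigh=8G
MemoryMax=10G

# then reload and restart the workers
sudo systemctl daemon-reload && sudo systemctl restart openqa-worker.target
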
Actions #7

Updated by okurz over 3 years ago

  • Priority changed from Immediate to High

Reduced number of worker instances on openqaworker8 from 20 to 12 in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/274 and did systemctl mask --now openqa-worker@{13..20} on openqaworker8.
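
To double-check which worker slots remain active after masking, standard systemctl queries are sufficient; shown here only as an illustration of the verification step:

systemctl list-units 'openqa-worker@*' --all
systemctl is-enabled openqa-worker@13   # reports "masked" for the disabled slots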

TODO: Create an openQA feature epic and specific tasks for everything we can do to prevent this in the future. Also crosscheck whether our memory alert still works and implement a better one that triggers earlier.

Actions #8

Updated by okurz over 3 years ago

  • Copied to coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines added
Actions #9

Updated by okurz over 3 years ago

  • Status changed from In Progress to Resolved

created #78226
