action #78058

[Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8

Added by okurz 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-11-16
Due date: 2020-11-18
% Done: 0%
Estimated time:

Description

Observation

Alert email received from Grafana.


Related issues

Related to openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry (Resolved, 2020-05-18)

Related to openQA Infrastructure - action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry (Resolved)

Related to openQA Infrastructure - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry (Resolved, 2020-10-14 - 2020-10-16)

Related to openQA Project - action #73321: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" (Rejected, 2020-10-14)

Copied to openQA Infrastructure - coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines (Workable)

History

#1 Updated by okurz 8 months ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added

#2 Updated by okurz 8 months ago

  • Related to action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added

#3 Updated by okurz 8 months ago

  • Related to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added

#4 Updated by okurz 8 months ago

  • Related to action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" added

#5 Updated by okurz 8 months ago

  • Tags set to alert, cache, openqaworker8, osd, corrupted, grafana
  • Due date set to 2020-11-18
  • Status changed from New to Feedback

On openqaworker8 did:

sudo systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service && sudo rm -rf /var/lib/openqa/cache/ && sudo systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service

and monitored the system journal.

Then called

export host=openqa.suse.de; ./openqa-monitor-incompletes | ./openqa-label-known-issues

to label and retry incompletes.

Paused the alert "Incomplete jobs of last 24h alert" for now. Should revisit next day.

Something seems to be wrong with openqaworker8, see related tickets. Maybe conduct a storage hardware test?

#6 Updated by okurz 8 months ago

  • Status changed from Feedback to In Progress
<@coolo> worker8 is the one that is overcommitted on memory - no surprise sqlite does not work if running out of memory
<@coolo> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&from=1605590335474&to=1605590707245 - that's
<@coolo> I think if you want reliable (database) operations you need to make sure the test developers aren't screwing with you - and we'd need to check the assigned QEMURAM at a given time per worker
<@coolo> of course detecting OOM and shutting down the worker would teach the test developers quickly 🙂
<@coolo> https://paste.opensuse.org/view/raw/83854722 is from journal about that time
<@okurz> oh geez. I thought I could trust people after I explained it to them the last time. I would have reacted if I had seen an alert again. But if telegraf fails to get memory no wonder there was no alert
<@coolo> so what do you suggest? look for some kind of OOM detection daemon? crash the kernel and reboot and rely on us finding it in the logs?
<@okurz> have an alert that triggers *before* we reach 100% mem usage?
<@okurz> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now looks like we do not even have a sensible RAM usage alert anymore (did we ever have?)
<@coolo> the cache service is in its own systemd service, right? I'm sure there are ways to block the worker cgroup from getting all of the memory
<@coolo> i.e. split 90% of RAM available to X slots and assign each worker that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
<@okurz> not pick freely means crash early if QEMURAM is above WORKER_AVAILABLE_RAM?
<@coolo> right, but only that one job and not all of the worker's service
<@okurz> of course, I mean to incomplete said job with a clear reason pointing to the test maintainer
<@coolo> but that will only cure the problem on worker8 - if the sqlite problem plagues other hosts, you don't want to put all your fun on this issue
<@okurz> no, it would be yet another epic with a lot of subtasks to cover different aspects from different angles
<@okurz> what can I do *right now* to prevent further chaos? I guess I will reduce the number of instances further to have some margin
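The cgroup limiting coolo suggests above could be done with a systemd drop-in for the worker unit template. A minimal sketch, assuming 12 slots on a machine with roughly 128 GiB RAM; the 8G/9G values are illustrative, not tuned recommendations:

```shell
# Generate a systemd drop-in that caps each openqa-worker slot's cgroup
# memory. MemoryHigh throttles before the hard MemoryMax limit kills the
# slot, so one greedy job cannot starve the cache service or other slots.
cat > memory-limit.conf <<'EOF'
[Service]
MemoryHigh=8G
MemoryMax=9G
EOF
```

Deploying would mean placing the file under /etc/systemd/system/openqa-worker@.service.d/, running systemctl daemon-reload, and restarting the worker instances.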

#7 Updated by okurz 8 months ago

  • Priority changed from Immediate to High

Reduced number of worker instances on openqaworker8 from 20 to 12 in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/274 and did systemctl mask --now openqa-worker@{13..20} on openqaworker8.

TODO: Create an openQA feature epic and specific tasks for all the things we can do to prevent this in the future. Plus crosscheck whether our memory alert still works and implement a better one that triggers faster.
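The "crash early if QEMURAM is above the worker's budget" idea from the chat could be sketched as a pre-start check; both parameter names below are illustrative, and WORKER_AVAILABLE_RAM is an assumed config knob, not an existing openQA setting:

```shell
# Hypothetical pre-start check: incomplete a job early, with a clear reason
# for the test maintainer, if it requests more RAM than its slot may use.
check_qemuram() {
    qemuram="$1"    # RAM requested by the test via QEMURAM (MiB)
    available="$2"  # RAM budget of this worker slot (MiB)
    if [ "$qemuram" -gt "$available" ]; then
        echo "incomplete: QEMURAM=${qemuram}M exceeds worker limit of ${available}M" >&2
        return 1
    fi
}
```

This only fails the one offending job, as coolo notes, rather than letting the OOM killer take down the whole worker's services.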

#8 Updated by okurz 8 months ago

  • Copied to coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines added

#9 Updated by okurz 8 months ago

  • Status changed from In Progress to Resolved

created #78226
