action #78058

[Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8

Added by okurz 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-11-16
Due date: 2020-11-18
% Done: 0%
Estimated time:

Description

Observation

Alert email received from Grafana.


Related issues

Related to openQA Project - action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry (Resolved, 2020-05-18)

Related to openQA Infrastructure - action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry (Resolved)

Related to openQA Infrastructure - action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry (Resolved, 2020-10-14 - 2020-10-16)

Related to openQA Project - action #73321: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" (Rejected, 2020-10-14)

Copied to openQA Infrastructure - coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines (Workable)

History

#1 Updated by okurz 8 months ago

  • Related to action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry added

#2 Updated by okurz 8 months ago

  • Related to action #75220: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added

#3 Updated by okurz 8 months ago

  • Related to action #73342: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry added

#4 Updated by okurz 8 months ago

  • Related to action #73321: all jobs run on openqaworker8 incomplete:"Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*" added

#5 Updated by okurz 8 months ago

  • Tags set to alert, cache, openqaworker8, osd, corrupted, grafana
  • Due date set to 2020-11-18
  • Status changed from New to Feedback

On openqaworker8 did:

sudo systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service && sudo rm -rf /var/lib/openqa/cache/ && sudo systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service

and monitored the system journal.

Then called

export host=openqa.suse.de; ./openqa-monitor-incompletes | ./openqa-label-known-issues

to label and retry incompletes.

Paused the alert "Incomplete jobs of last 24h alert" for now. Should revisit next day.

Something seems to be wrong with openqaworker8, see related tickets. Maybe conduct a storage hardware test?

#6 Updated by okurz 8 months ago

  • Status changed from Feedback to In Progress
<@coolo> worker8 is the one that is overcommitted on memory - no surprise sqlite does not work if running out of memory
<@coolo> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&from=1605590335474&to=1605590707245 - that's
<@coolo> I think if you want reliable (database) operations you need to make sure the test developers aren't screwing with you - and we'd need to check the assigned QEMURAM at a given time per worker
<@coolo> of course detecting OOM and shutting down the worker would teach the test developers quickly 🙂
<@coolo> https://paste.opensuse.org/view/raw/83854722 is from journal about that time
<@okurz> oh geez. I thought I could trust people after I explained it to them the last time. I would have reacted if I had seen an alert again. But if telegraf fails to get memory no wonder there was no alert
<@coolo> so what do you suggest? look for some kind of OOM detection daemon? crash the kernel and reboot and rely on us finding it in the logs?
<@okurz> have an alert that triggers *before* we reach 100% mem usage?
<@okurz> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now looks like we do not even have a sensible RAM usage alert anymore (did we ever have?)
<@coolo> the cache service is in its own systemd service, right? I'm sure there are ways to block the worker cgroup from getting all of the memory
<@coolo> i.e. split 90% of RAM available to X slots and assign each worker that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
<@okurz> not pick freely means crash early if QEMURAM is above WORKER_AVAILABLE_RAM?
<@coolo> right, but only that one job and not all of the worker's service
<@okurz> of course, I mean to incomplete said job with a clear reason pointing to the test maintainer
<@coolo> but that will only cure the problem on worker8 - if the sqlite problem plagues other hosts, you don't want to put all your fun on this issue
<@okurz> no, it would be yet another epic with a lot of subtasks to cover different aspects from different angles
<@okurz> what can I do *right now* to prevent further chaos? I guess I will reduce the number of instances further to have some margin
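The cgroup limiting coolo suggests above could be done with a systemd drop-in for the worker unit template. A minimal sketch, assuming 12 slots on a machine with roughly 128 GiB RAM; the 8G/9G values are illustrative, not tuned recommendations:

```shell
# Generate a systemd drop-in that caps each openqa-worker slot's cgroup
# memory. MemoryHigh throttles before the hard MemoryMax limit kills the
# slot, so one greedy job cannot starve the cache service or other slots.
cat > memory-limit.conf <<'EOF'
[Service]
MemoryHigh=8G
MemoryMax=9G
EOF
```

Deploying would mean placing the file under /etc/systemd/system/openqa-worker@.service.d/, running systemctl daemon-reload, and restarting the worker instances.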

#7 Updated by okurz 8 months ago

  • Priority changed from Immediate to High

Reduced number of worker instances on openqaworker8 from 20 to 12 in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/274 and did systemctl mask --now openqa-worker@{13..20} on openqaworker8.

TODO: Create an openQA feature epic and specific tasks for all the things we can do to prevent this in the future. Plus crosscheck whether our memory alert still works and implement a better one that triggers faster.
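The "crash early if QEMURAM is above the worker's budget" idea from the chat could be sketched as a pre-start check; both parameter names below are illustrative, and WORKER_AVAILABLE_RAM is an assumed config knob, not an existing openQA setting:

```shell
# Hypothetical pre-start check: incomplete a job early, with a clear reason
# for the test maintainer, if it requests more RAM than its slot may use.
check_qemuram() {
    qemuram="$1"    # RAM requested by the test via QEMURAM (MiB)
    available="$2"  # RAM budget of this worker slot (MiB)
    if [ "$qemuram" -gt "$available" ]; then
        echo "incomplete: QEMURAM=${qemuram}M exceeds worker limit of ${available}M" >&2
        return 1
    fi
}
```

This only fails the one offending job, as coolo notes, rather than letting the OOM killer take down the whole worker's services.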

#8 Updated by okurz 8 months ago

  • Copied to coordination #78226: [epic] Prevent or better handle OOM conditions on worker machines added

#9 Updated by okurz 8 months ago

  • Status changed from In Progress to Resolved

created #78226
