<p><strong>openSUSE Project Management Tool</strong></p>
<p><strong>openQA Infrastructure - action #78058: [Alerting] Incomplete jobs of last 24h alert - again many incompletes due to corrupted cache, on openqaworker8</strong><br>
<a href="https://progress.opensuse.org/issues/78058">https://progress.opensuse.org/issues/78058</a></p>
<p><em>Update by okurz (okurz@suse.com), 2020-11-16T20:53:56Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351277">journal #351277</a></em></p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-6 priority-high2 closed child" href="/issues/67000">action #67000</a>: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry</i> added</li></ul>
<p><em>Update by okurz (okurz@suse.com), 2020-11-16T20:54:03Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351283">journal #351283</a></em></p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-6 priority-high2 closed" href="/issues/75220">action #75220</a>: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry</i> added</li></ul>
<p><em>Update by okurz (okurz@suse.com), 2020-11-16T20:54:13Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351289">journal #351289</a></em></p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed behind-schedule" href="/issues/73342">action #73342</a>: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job .*failed: .*(database disk image is malformed|not a database)":retry</i> added</li></ul>
<p><em>Update by okurz (okurz@suse.com), 2020-11-16T20:54:23Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351295">journal #351295</a></em></p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-6 priority-4 priority-default closed" href="/issues/73321">action #73321</a>: all jobs run on openqaworker8 incomplete: "Cache service status error from API: Minion job #46203 failed: Couldn't add download: DBD::SQLite::st execute failed: database disk image is malformed*"</i> added</li></ul>
<p><em>Update by okurz (okurz@suse.com), 2020-11-16T21:22:10Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351301">journal #351301</a></em></p>
<ul><li><strong>Tags</strong> set to <i>alert, cache, openqaworker8, osd, corrupted, grafana</i></li><li><strong>Due date</strong> set to <i>2020-11-18</i></li><li><strong>Status</strong> changed from <i>New</i> to <i>Feedback</i></li></ul><p>On openqaworker8 I ran:</p>
<pre><code>sudo systemctl stop openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service && sudo rm -rf /var/lib/openqa/cache/ && sudo systemctl start openqa-worker.target openqa-worker-cacheservice openqa-worker-cacheservice-minion.service
</code></pre>
<p>and monitored the system journal.</p>
<p>Then called</p>
<pre><code>export host=openqa.suse.de; ./openqa-monitor-incompletes | ./openqa-label-known-issues
</code></pre>
<p>to label and retry incompletes.</p>
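<p>The auto_review annotations on the related tickets above drive this labeling: each incomplete job's reason string is matched against the known-issue regexes. A minimal sketch of that matching step, assuming a made-up helper name <code>label_for_reason</code> (the regex is the auto_review pattern from poo#67000; the real scripts in os-autoinst/scripts do considerably more):</p>

```shell
# Sketch of the auto_review matching idea: map an incomplete job's reason
# string to a known-issue ticket via regex.
# label_for_reason is an illustrative helper, not part of the real tooling;
# the regex below is the auto_review pattern from poo#67000.
label_for_reason() {
  local reason=$1
  if printf '%s' "$reason" | grep -Eq 'Cache service status error.*(database disk image is malformed|Specified job ID is invalid)'; then
    echo "poo#67000"
  else
    echo "unknown"
  fi
}
```

<p>A matching job would then be labeled with the ticket and retriggered; anything mapping to "unknown" needs manual review.</p>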
<p>Paused the alert "Incomplete jobs of last 24h alert" for now. Should revisit next day.</p>
<p>Something seems to be wrong with openqaworker8, see related tickets. Maybe conduct a storage hardware test?</p>
<p><em>Update by okurz (okurz@suse.com), 2020-11-17T08:18:07Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351376">journal #351376</a></em></p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul><pre><code><@coolo> worker8 is the one that is overcommitted on memory - no surprise sqlite does not work if running out of memory
<@coolo> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&from=1605590335474&to=1605590707245 - that's
<@coolo> I think if you want reliable (database) operations you need to make sure the test developers aren't screwing with you - and we'd need to check the assigned QEMURAM at a given time per worker
<@coolo> of course detecting OOM and shutting down the worker would teach the test developers quickly 🙂
<@coolo> https://paste.opensuse.org/view/raw/83854722 is from journal about that time
<@okurz> oh geez. I thought I could trust people after I explained it to them the last time. I would have reacted if I had seen an alert again. But if telegraf fails to get memory data, no wonder there was no alert
<@coolo> so what do you suggest? look for some kind of OOM detection daemon? crash the kernel and reboot and rely on us finding it in the logs?
<@okurz> have an alert that triggers *before* we reach 100% mem usage?
<@okurz> https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?tab=alert&editPanel=12054&orgId=1&from=now-24h&to=now looks like we do not even have a sensible RAM usage alert anymore (did we ever have?)
<@coolo> the cache service is in its own systemd service, right? I'm sure there are ways to block the worker cgroup from getting all of the memory
<@coolo> i.e. split 90% of RAM available to X slots and assign each worker that much memory. Possibly allow some overcommit, but don't let test developers pick QEMURAM freely
<@okurz> not pick freely means crash early if QEMURAM is above WORKER_AVAILABLE_RAM?
<@coolo> right, but only that one job and not all of the worker's service
<@okurz> of course, I mean to incomplete said job with a clear reason pointing to the test maintainer
<@coolo> but that will only cure the problem on worker8 - if the sqlite problem plagues other hosts, you don't want to put all your fun on this issue
<@okurz> no, it would be yet another epic with a lot of subtasks to cover different aspects from different angles
<@okurz> what can I do *right now* to prevent further chaos? I guess I will reduce the number of instances even further to have some margin
</code></pre>
<p><em>Update by okurz (okurz@suse.com), 2020-11-17T08:37:56Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=351382">journal #351382</a></em></p>
<ul><li><strong>Priority</strong> changed from <i>Immediate</i> to <i>High</i></li></ul><p>Reduced number of worker instances on openqaworker8 from 20 to 12 in <a href="https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/274" class="external">https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/274</a> and did <code>systemctl mask --now openqa-worker@{13..20}</code> on openqaworker8.</p>
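<p>Beyond reducing the instance count, the cgroup idea from the chat could take the shape of a systemd drop-in that caps each worker slot's memory, so a single greedy job gets OOM-killed instead of starving the cache service. This is a hypothetical sketch with arbitrary example limits, not a tested recommendation:</p>

```ini
# /etc/systemd/system/openqa-worker@.service.d/memory-limit.conf
# Hypothetical drop-in; example values only.
# Activate with: sudo systemctl daemon-reload && sudo systemctl restart openqa-worker.target
[Service]
# Soft limit: the kernel starts reclaiming memory from the slot above this
MemoryHigh=14G
# Hard cap: the slot is OOM-killed rather than taking down the whole host
MemoryMax=16G
```

<p>The limits would need to be derived from the host's RAM divided across the configured slots, as suggested in the chat.</p>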
<p>TODO: Create an openQA feature epic and specific tasks for all the things we can do to prevent this in the future. Also cross-check whether our memory alert still works and implement a better one that triggers faster.</p>
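<p>The early-fail idea from the chat (incomplete only the offending job, with a clear reason for the test maintainer, when its QEMURAM exceeds the worker's budget) could be sketched as follows. <code>check_qemuram</code> is a made-up helper and WORKER_AVAILABLE_RAM a per-slot budget as coined in the chat, not existing openQA settings:</p>

```shell
# Illustrative pre-flight check: fail the single job whose QEMURAM request
# exceeds the per-slot RAM budget, instead of letting it OOM the host.
# check_qemuram is a hypothetical helper; both values are in MiB.
check_qemuram() {
  local qemuram=$1 available=$2
  if [ "$qemuram" -gt "$available" ]; then
    echo "incomplete: QEMURAM=${qemuram} MiB exceeds WORKER_AVAILABLE_RAM=${available} MiB"
    return 1
  fi
  echo "ok"
}
```

<p>The non-zero return would map to incompleting the job with the printed reason, leaving the worker's other slots and the cache service untouched.</p>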
<p><em>Update by okurz (okurz@suse.com), 2020-11-19T08:11:44Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=352004">journal #352004</a></em></p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-6 status-12 priority-4 priority-default" href="/issues/78226">coordination #78226</a>: [epic] Prevent or better handle OOM conditions on worker machines</i> added</li></ul>
<p><em>Update by okurz (okurz@suse.com), 2020-11-19T08:15:49Z, <a href="https://progress.opensuse.org/issues/78058?journal_id=352012">journal #352012</a></em></p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>created <a class="issue tracker-6 status-12 priority-4 priority-default" title="coordination: [epic] Prevent or better handle OOM conditions on worker machines (Workable)" href="https://progress.opensuse.org/issues/78226">#78226</a></p>