action #112898
Updated by livdywan over 2 years ago
### Observation

We've been getting minion workers alerts throughout the day, e.g. for 10 minutes or for 40 minutes. The alerts usually calm down after a while but fire again later.

`journalctl -fu openqa-gru.service` isn't showing anything that looks relevant, although I noticed a lot of `grep was killed, possibly timed out` messages. `/var/log/openqa_gru` mostly contains `[debug] Process ... is performing job "..." with task "..."` type messages.

I paused the alert for now because we're way past alert fatigue.

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1655883483843&to=1655945318387

### Acceptance criteria

* **AC1:** Alert does not trigger anymore for at least one night

### Suggestions

- Research what's causing minion workers to disappear frequently
- Check the minion dashboard, e.g. when the worker was last started
- The stats are based on Active/Inactive workers - maybe we need "registered workers"; this would be an upstream feature

### Rollback steps

- Unpause the alert in Grafana after confirming the monitoring is fine
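
### Sketch: counting the `grep was killed` messages

To check whether the `grep was killed, possibly timed out` messages from the observation line up with the alert windows, it might help to count them per hour. The following is only a rough sketch, not part of openQA: the function name `count_marker_per_hour` is made up for illustration, and it assumes the default `journalctl` short output format, where each line starts with a `Mon DD HH:MM:SS` timestamp.

```python
from collections import Counter

# Message text taken from the observation in this ticket.
MARKER = "grep was killed, possibly timed out"


def count_marker_per_hour(lines, marker=MARKER):
    """Count lines containing `marker`, grouped by hour.

    Assumption: each line starts with a syslog-style timestamp such as
    "Jun 22 10:15:01", so the first 9 characters ("Jun 22 10") identify
    the hour. Other journalctl output formats would need a different key.
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            counts[line[:9]] += 1
    return counts
```

One possible way to use it would be to save the output of `journalctl -u openqa-gru.service --since today` to a file and pass the opened file object to `count_marker_per_hour`, then compare the busiest hours against the alert history in Grafana.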