action #112898

Updated by livdywan almost 2 years ago

### Observation 

 We've been getting minion worker alerts throughout the day, e.g. for 10 minutes or for 40 minutes. The alerts usually calm down after a while but fire again later. 

 `journalctl -fu openqa-gru.service` isn't showing anything that looks relevant, although I noticed a lot of `grep was killed, possibly timed out` messages. 
 `/var/log/openqa_gru` mostly contains `[debug] Process ... is performing job "..." with task "..."` type messages. 
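
 To get a rough feel for how often each kind of message shows up, the log lines could be tallied with a small script. This is only a sketch: the function name `tally_messages` and the pattern names are made up for illustration, and the regular expressions are assumptions based on the two message shapes quoted above, not on the actual openQA log format.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the messages quoted in this ticket.
PATTERNS = {
    "grep_killed": re.compile(r"grep was killed, possibly timed out"),
    "performing_job": re.compile(
        r'Process \S+ is performing job "[^"]*" with task "[^"]*"'
    ),
}


def tally_messages(lines):
    """Count log lines per known message type; anything else is 'other'."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
        else:
            # No known pattern matched this line.
            counts["other"] += 1
    return counts
```

 It could be fed from `sys.stdin`, e.g. by piping `journalctl -u openqa-gru.service` output into a script that calls `tally_messages(sys.stdin)`. A high `other` count would hint at messages worth reading by hand.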

 I paused the alert for now because we're way past alert fatigue. 

 https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1655883483843&to=1655945318387 

 ### Acceptance criteria 
 * **AC1:** The alert does not trigger again for at least one night 

 ### Suggestions 
 - Research what's causing minion workers to disappear frequently 
 - Check the minion dashboard, e.g. to see when the worker was last started 
 - The stats are based on Active/Inactive workers - maybe we need "registered workers"; this would be an upstream feature 

 ### Rollback steps 
 - Unpause the alert in Grafana after confirming the monitoring is fine
