action #96380

Updated by livdywan over 2 years ago

## Observation 

 Received two monitoring alerts, which are also visible on 
 https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1627142086249&to=1627190672223 as two short dips in the graph. 

 ## Acceptance criteria 
 * **AC1:** The minion is not actually down for more than 1 minute 
 * **AC2:** Grafana does not show fewer than 1 total minion workers when active == 1 

 ## Problem 
 It is likely better not to count the number of minions as a floating-point number. We should investigate why a minion worker was unavailable for nearly 8 minutes. We should also design the alert to tolerate a minion worker being offline for a limited time when that does not further impact operations. In addition, the panel should never show something like "0.94" total minion workers (active + inactive) when the active count at that time is actually 1. 
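
 One possible explanation for a fractional value such as "0.94", not confirmed in this ticket, is that the panel averages worker-count samples over a time window. The following Python sketch uses made-up samples and hypothetical function names to illustrate how a mean yields a float while an aggregate such as the maximum stays an integer:

```python
# Hypothetical samples of the total worker count (active + inactive) within
# one aggregation window; a single reading of 0 is an assumed brief gap.
samples = [1] * 15 + [0]


def mean_count(values):
    """Average over the window; can produce a non-integer worker count."""
    return sum(values) / len(values)


def integer_count(values):
    """Maximum over the window; always one of the observed integer counts."""
    return max(values)


print(round(mean_count(samples), 2))  # fractional worker count
print(integer_count(samples))         # integer worker count
```

 Whether the maximum, the last value, or another aggregate is the right choice depends on how the panel is meant to be read; the sketch only shows why a mean cannot satisfy AC2.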

 ## Suggestions 
 * Read the description for the grafana panel from #96089 
 * Verify that the minions were actually gone, rather than the InfluxDB data being wrong 
 * Cross-check with our "Mojo minion framework experts" how a minion can disappear for about 8 minutes 
 * Make the alert resilient
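
 As a rough illustration of what "resilient" could mean here, the following Python sketch only raises an alert when the worker count stays below 1 for longer than a grace period. The function name, the sample format, and the 60-second default (taken from AC1) are assumptions for illustration, not the actual alert configuration:

```python
def should_alert(samples, grace_seconds=60):
    """Return True if the worker count stays below 1 for more than grace_seconds.

    samples: iterable of (timestamp_in_seconds, total_workers) pairs,
    ordered by timestamp. This format is a hypothetical stand-in for
    whatever the monitoring backend actually returns.
    """
    outage_start = None
    for timestamp, total_workers in samples:
        if total_workers < 1:
            # Remember when the current outage began.
            if outage_start is None:
                outage_start = timestamp
            if timestamp - outage_start > grace_seconds:
                return True
        else:
            # A worker is back; reset the outage tracking.
            outage_start = None
    return False


# A short blip is tolerated, a longer outage triggers the alert.
short_blip = [(0, 1), (30, 0), (60, 1)]
long_outage = [(0, 1), (30, 0), (60, 0), (90, 0), (120, 0)]
print(should_alert(short_blip))
print(should_alert(long_outage))
```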
