action #96380

action #95983: alert about "minion workers", alert triggered two times and turned green again

"minion workers" alert shows <1 total minion workers if active == 1 size:M

Added by okurz about 1 year ago. Updated 12 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date: 2021-09-02
% Done: 0%
Estimated time:

Description

Observation

Received two monitoring alerts, which are also visible on
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1627142086249&to=1627190672223 as two small dips.

Acceptance criteria

  • AC1: minion is not actually down for more than 1 minute
  • AC2: grafana does not show <1 total minion workers if active == 1

Problem

It would likely be better not to count the number of minion workers as a floating-point number. We should investigate why a minion worker was unavailable for nearly 8 minutes, and we should also design the alert to tolerate a minion worker being offline for a limited time when this does not impact operations further. We should also avoid showing something like "0.94" total minion workers (active+inactive) when active is actually 1 at that time.

Suggestions

  • Read the description for the grafana panel from #96089
  • Verify that the minions are actually gone, rather than the influxdb data being wrong
  • Cross-check with our "Mojo minion framework experts" how a minion can disappear for about 8 minutes
  • Make the alert resilient to a minion worker being offline for a limited time
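
One way to read "resilient" together with AC1 is that the alert should fire only when no minion worker has been seen for longer than a grace period. The following is a minimal sketch of that idea in Python, not the actual grafana alert rule. The function names, the (timestamp, total) sample format and the 60-second default are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_seconds(samples: Iterable[Tuple[int, int]]) -> int:
    """Return the longest continuous span (in seconds) with no workers.

    `samples` are hypothetical (unix_timestamp, total_workers) pairs,
    sorted by time. The span is measured from the first to the last
    consecutive sample that reports fewer than 1 worker.
    """
    longest = 0
    outage_start: Optional[int] = None
    for timestamp, total in samples:
        if total < 1:
            if outage_start is None:
                outage_start = timestamp  # outage begins here
            longest = max(longest, timestamp - outage_start)
        else:
            outage_start = None  # a worker is back, outage ends
    return longest


def should_alert(samples: Iterable[Tuple[int, int]], grace_seconds: int = 60) -> bool:
    """Alert only if workers were missing for longer than the grace period."""
    return longest_outage_seconds(samples) > grace_seconds


# A single missing sample is tolerated ...
brief_dip = [(0, 1), (30, 0), (60, 1)]
# ... while 90 seconds without any worker exceeds the 60 second grace period.
long_outage = [(0, 1), (30, 0), (60, 0), (90, 0), (120, 0), (150, 1)]

print(should_alert(brief_dip))
print(should_alert(long_outage))
```

With this approach a single missing data point does not trigger the alert, while a longer outage still does. The grace period would need to be chosen to match what counts as "not impacting operations".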

Related issues

Related to QA - action #112898: Minion workers alert triggering on and off size:M | Blocked | 2022-09-23

History

#1 Updated by cdywan about 1 year ago

  • Subject changed from alert about "minion workers", grafana should not show <1 total minion workers if active == 1 to "minion workers" alert shows <1 total minion workers if active == 1 size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from High to Normal

#2 Updated by mkittler 12 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

The problem is the query SELECT mean(active)+mean(inactive) FROM "openqa_minion_workers" WHERE $timeFilter GROUP BY time($__interval) fill(previous), where both mean(active) and mean(inactive) are floating-point numbers averaged over the query interval. This calculation indeed does not seem very reliable. I guess this could be worked around by simplifying the alert rule.
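
To illustrate how the query above can yield a fractional total, here is a small Python sketch. The sample values are invented for illustration: it assumes 16 data points fall into one query interval and that exactly one of them reports no worker at all. Taking the last value per interval is shown only as one possible simplification, not necessarily the change that was merged.

```python
from statistics import mean

# Hypothetical (active, inactive) samples within one query interval.
# One sample in the middle reports no worker; all others report 1 active.
samples = [(1, 0)] * 8 + [(0, 0)] + [(1, 0)] * 7

active = [a for a, _ in samples]
inactive = [i for _, i in samples]

# What mean(active)+mean(inactive) computes for this interval: 15/16.
total_of_means = mean(active) + mean(inactive)

# Taking the most recent sample instead always gives a whole number.
last_total = active[-1] + inactive[-1]

print(round(total_of_means, 2))  # fractional total, below 1
print(last_total)                # integer total from the last sample
```

Under these assumptions the averaged total drops below 1 even though the most recent sample shows one active worker, which matches the "0.94" kind of value described in the ticket.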

#4 Updated by openqa_review 12 months ago

  • Due date set to 2021-09-02

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by mkittler 12 months ago

  • Status changed from In Progress to Resolved

The last SR should have done it. The alert has not been active or pending since it was merged.

#7 Updated by okurz about 2 months ago

  • Related to action #112898: Minion workers alert triggering on and off size:M added
