action #112898

closed

Minion workers alert triggering on and off size:M

Added by livdywan over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date:
Due date: 2022-10-14
% Done: 0%
Estimated time:

Description

Observation

We've been getting minion workers alerts throughout the day, each lasting e.g. 10 or 40 minutes. The alerts usually clear after a while but fire again later.

journalctl -fu openqa-gru.service isn't showing anything that looks relevant, although I noticed a lot of "grep was killed, possibly timed out" messages.
/var/log/openqa_gru mostly contains messages of the form: [debug] Process ... is performing job "..." with task "..."
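
To judge whether the "grep was killed, possibly timed out" messages line up with the alert windows, it may help to count them per hour. The following is a minimal sketch, not something that was run for this ticket. It assumes journalctl's default short output format ("Mon DD HH:MM:SS host unit[pid]: message"); the sample lines, host name and PIDs are made up for illustration, and `count_by_hour` is a hypothetical helper.

```python
from collections import Counter

# Message text as quoted in the ticket; the exact wording in the journal is an assumption.
NEEDLE = "grep was killed, possibly timed out"

# Hypothetical sample lines; the real journal output was not attached to the ticket.
SAMPLE = [
    "Jun 22 09:14:03 openqa openqa-gru[1234]: grep was killed, possibly timed out",
    "Jun 22 09:47:51 openqa openqa-gru[1234]: grep was killed, possibly timed out",
    'Jun 22 10:02:10 openqa openqa-gru[1234]: [debug] Process 4321 is performing job "1" with task "x"',
    "Jun 22 10:30:00 openqa openqa-gru[1234]: grep was killed, possibly timed out",
]


def count_by_hour(lines, needle=NEEDLE):
    """Count lines containing `needle`, grouped by 'Mon DD HH:00'."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        parts = line.split()
        if len(parts) < 3:
            continue  # not in the assumed "Mon DD HH:MM:SS ..." format
        month, day, clock = parts[0], parts[1], parts[2]
        counts[f"{month} {day} {clock[:2]}:00"] += 1
    return counts


if __name__ == "__main__":
    for hour, n in sorted(count_by_hour(SAMPLE).items()):
        print(hour, n)
```

With real data, the lines could be read from standard input instead of `SAMPLE`, for example by piping the output of journalctl -u openqa-gru.service into the script, and the hourly counts compared against the alert history in Grafana.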

I paused the alert for now because we're way past alert fatigue.

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1655883483843&to=1655945318387

Acceptance criteria

  • AC1: The alert does not trigger for at least one full night

Suggestions

  • Research what's causing minion workers to disappear frequently
  • Check the minion dashboard, e.g. for when the worker was last started
  • The stats are based on Active/Inactive workers - maybe we need "registered workers"; this would be an upstream feature
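
The last suggestion can be illustrated with a small sketch. It only shows how an alert on the sum of active and inactive workers could differ from one on a count of registered workers; it is not confirmed that this difference explains the alerts here. All field names (`active_workers`, `inactive_workers`, `registered_workers`) and the threshold are assumptions for illustration, not the actual stats keys or the Grafana alert rule.

```python
def total_from_activity(stats):
    """Total workers as the sum of active and inactive workers."""
    return stats.get("active_workers", 0) + stats.get("inactive_workers", 0)


def total_from_registered(stats):
    """Total workers from a hypothetical 'registered_workers' field.

    Falls back to the activity-based sum when the field is missing.
    """
    return stats.get("registered_workers", total_from_activity(stats))


def should_alert(total, minimum=1):
    """Alert when fewer than `minimum` workers are counted (assumed threshold)."""
    return total < minimum


# Hypothetical snapshot: a worker is registered but momentarily reported
# as neither active nor inactive.
snapshot = {"active_workers": 0, "inactive_workers": 0, "registered_workers": 1}

alert_on_activity = should_alert(total_from_activity(snapshot))
alert_on_registered = should_alert(total_from_registered(snapshot))
```

In this made-up snapshot the activity-based check would alert while the registered-based check would not, which is the kind of discrepancy a "registered workers" stat might avoid.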

Rollback steps

  • Unpause the alert in Grafana after confirming the monitoring is fine

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #96380: "minion workers" alert shows <1 total minion workers if active == 1 size:M (Resolved, mkittler, 2021-09-02)

Related to openQA Project - action #102221: t/25-cache-service.t fails exceeding 90s timeout consistently size:M (Resolved, kraih, 2021-11-10, 2021-11-27)

Related to openQA Project - action #96561: Speed up `t/25-cache-service.t` by avoiding forking to run Minion jobs (Resolved, kraih, 2021-08-04)
