action #112898
closed
Minion workers alert triggering on and off size:M
Added by livdywan over 2 years ago.
Updated about 2 years ago.
Description
Observation
We've been getting minion workers alerts throughout the day e.g. for 10 minutes or for 40 minutes. The alerts usually calm down after a while but alert again later.
`journalctl -fu openqa-gru.service` isn't showing anything that looks relevant, although I noticed a lot of "grep was killed, possibly timed out" messages. /var/log/openqa_gru mostly contains `[debug] Process ... is performing job "..." with task "..."` type messages.
I paused the alert for now because we're way past alert fatigue.
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1655883483843&to=1655945318387
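For reference, a minimal sketch of the log checks described above; the grep patterns and the time window are assumptions added for illustration, not part of the original investigation:

```
# Follow the GRU service journal and filter out the routine debug noise
journalctl -u openqa-gru.service --since "2 hours ago" | grep -v 'is performing job'

# Count the "grep was killed, possibly timed out" messages seen in the journal
journalctl -u openqa-gru.service --since "2 hours ago" | grep -c 'was killed, possibly timed out'

# Skim the GRU log for anything besides the usual [debug] lines
grep -v 'is performing job' /var/log/openqa_gru | tail -n 50
```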
Acceptance criteria
- AC1: The alert does not trigger anymore for at least one night
Suggestions
- Research what's causing minion workers to disappear frequently
- Check the minion dashboard e.g. when the worker was last started
- The stats are based on active/inactive workers; maybe we need "registered workers" instead, which would be an upstream feature (see the sketch after this list for inspecting the current numbers)
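A hedged sketch of how the underlying numbers could be inspected; openQA provides the /admin/influxdb/minion stats route, but the host name and the exact field names shown here are assumptions:

```
# Query the webUI's InfluxDB-format stats endpoint that telegraf scrapes (host is an example)
curl -s https://openqa.suse.de/admin/influxdb/minion
# Expected output is InfluxDB line protocol, roughly:
#   openqa_minion_workers,url=... active=1i,inactive=0i
# A "registered" field would only appear once the upstream feature mentioned above exists.
# The Minion dashboard itself (worker start times etc.) is typically reachable under /minion on the same host.
```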
Rollback steps
- Unpause the alert in Grafana after confirming the monitoring is fine
- Description updated (diff)
- Subject changed from Minion workers alert triggering on and off to Minion workers alert triggering on and off size:M
- Description updated (diff)
- Status changed from New to Workable
- Related to action #96380: "minion workers" alert shows <1 total minion workers if active == 1 size:M added
- Due date set to 2022-07-07
- Status changed from Workable to Feedback
- Assignee set to okurz
- Due date changed from 2022-07-07 to 2022-07-22
- Tags set to reactive work
- Due date changed from 2022-07-22 to 2022-07-29
- Due date changed from 2022-07-29 to 2022-09-23
- Status changed from Feedback to Blocked
PR merged. Waiting for the new package release.
- Status changed from Blocked to Feedback
Can we confirm that this works now with the new package?
- Due date changed from 2022-09-23 to 2022-10-07
https://github.com/os-autoinst/openQA/pull/4723 progressed a bit. All tests pass, but we are missing coverage in lib/OpenQA/CacheService/Controller/Influxdb.pm, which I assume is no longer gathered from the existing t/25-cache-service.t so as not to slow the tests down too much by collecting coverage in all subprocesses.
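As a hedged sketch, coverage for code only reached in forked subprocesses (such as the cache service controller) can usually be collected by propagating Devel::Cover via PERL5OPT; the options below are assumptions for illustration, not necessarily what the openQA Makefile does, and doing this is exactly what slows the test down:

```
# Run the cache service test with Devel::Cover loaded in every spawned perl process
PERL5OPT="-MDevel::Cover=-db,cover_db,-ignore,^t/" prove -l t/25-cache-service.t

# Summarize and check whether lib/OpenQA/CacheService/Controller/Influxdb.pm shows up
cover cover_db
```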
- Related to action #102221: t/25-cache-service.t fails exceeding 90s timeout consistently size:M added
- Related to action #96561: Speed up `t/25-cache-service.t` by avoiding forking to run Minion jobs added
- Due date changed from 2022-10-07 to 2022-10-14
- Status changed from Feedback to Resolved