Minion workers alert triggering on and off size:M
We've been getting minion worker alerts throughout the day, each lasting e.g. 10 or 40 minutes. The alert usually recovers after a while but then triggers again later.
journalctl -fu openqa-gru.service isn't showing anything that looks relevant, although I noticed a lot of
grep was killed, possibly timed out messages.
/var/log/openqa_gru mostly contains
[debug] Process ... is performing job "..." with task "..." type messages.
I paused the alert for now because we're way past alert fatigue.
- AC1: The alert no longer triggers for at least one full night
- Research what's causing minion workers to disappear frequently
- Check the minion dashboard e.g. when the worker was last started
- The stats are based on Active/Inactive workers - maybe we need "registered workers"; this would be an upstream feature
- Unpause the alert in Grafana after confirming the monitoring is fine
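One possible reading of the "registered workers" suggestion above, as a minimal Python sketch: treat a worker as registered if it is either active or inactive, and alert only when that sum drops below a threshold. The keys `active_workers` and `inactive_workers` follow the Minion stats naming; the helper names and the `minimum` threshold are hypothetical and not taken from openQA or the monitoring setup.

```python
from typing import Mapping


def registered_workers(stats: Mapping[str, int]) -> int:
    """Count workers that are registered, whether busy (active) or idle (inactive)."""
    return stats.get("active_workers", 0) + stats.get("inactive_workers", 0)


def should_alert(stats: Mapping[str, int], minimum: int = 1) -> bool:
    """Alert only if fewer than `minimum` workers are registered at all.

    `minimum` is an assumed threshold for illustration.
    """
    return registered_workers(stats) < minimum


if __name__ == "__main__":
    # All workers busy: no idle worker, but workers are still registered.
    print(should_alert({"active_workers": 4, "inactive_workers": 0}))
    # No worker registered at all.
    print(should_alert({"active_workers": 0, "inactive_workers": 0}))
```

With such a check, a phase in which every worker is busy would not count as "workers disappeared".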
#4 Updated by okurz about 2 months ago
- Due date set to 2022-07-07
- Status changed from Workable to Feedback
- Assignee set to okurz
It seems the problem has actually vanished since 2022-06-22 21:30 for an unknown reason, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=17&tab=alert&orgId=1&from=1655889123384&to=1655943090423. I unpaused the alert and will monitor it. Still, we can make the alert more resilient: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/706
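As an illustration of what "more resilient" could mean in general (not necessarily what the linked merge request changes), here is a small Python sketch of a debounced alert: it fires only after the worker count has stayed below the threshold for several consecutive samples, so short dips do not trigger it. The function name and the parameters `minimum` and `required_bad` are hypothetical.

```python
from typing import Iterable, List


def debounced_alert(samples: Iterable[int], minimum: int = 1,
                    required_bad: int = 3) -> List[bool]:
    """Return, for each sample, whether the alert would be firing.

    The alert fires only once `required_bad` consecutive samples are
    below `minimum`; a single good sample resets the streak.
    """
    firing = []
    bad_streak = 0
    for count in samples:
        bad_streak = bad_streak + 1 if count < minimum else 0
        firing.append(bad_streak >= required_bad)
    return firing


if __name__ == "__main__":
    # Two short dips are ignored; only the third consecutive bad sample fires.
    print(debounced_alert([1, 0, 0, 1, 0, 0, 0, 1]))
```

The trade-off is a delay of `required_bad` sampling intervals before a genuine outage is reported.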
#6 Updated by okurz about 1 month ago
- Due date changed from 2022-07-07 to 2022-07-22
I just merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/706 because no one else wanted to take a look. Also, https://build.opensuse.org/package/show/devel:openQA:Leap:15.4/perl-Minion is at version 10.25 now, so we can go ahead with https://github.com/os-autoinst/openQA/pull/4723. That PR is currently blocked by failing tests. Let's see whether mergify can correctly rebase the PR and whether the tests pass then.
#8 Updated by okurz about 1 month ago
- Due date changed from 2022-07-22 to 2022-07-29
Let's wait some more to give the author time to review and merge.
- Due date changed from 2022-07-29 to 2022-09-23
- Status changed from Feedback to Blocked
I think we need to be more forgiving when waiting. Also setting the ticket to "Blocked" on https://github.com/Grinnz/Minion-Backend-SQLite/pull/21 now.