action #112898

closed

Minion workers alert triggering on and off size:M

Added by livdywan over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Start date:
Due date: 2022-10-14
% Done: 0%
Estimated time:

Description

Observation

We've been getting minion workers alerts throughout the day, lasting e.g. 10 minutes or 40 minutes. The alerts usually calm down after a while but then fire again later.

journalctl -fu openqa-gru.service isn't showing anything that looks relevant, although I noticed a lot of "grep was killed, possibly timed out" messages.
/var/log/openqa_gru mostly contains messages of the form [debug] Process ... is performing job "..." with task "...".

I paused the alert for now because we're way past alert fatigue.

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=17&orgId=1&from=1655883483843&to=1655945318387

Acceptance criteria

  • AC1: The alert does not trigger for at least one full night

Suggestions

  • Research what's causing minion workers to disappear frequently
  • Check the minion dashboard e.g. when the worker was last started
  • The stats are based on Active/Inactive workers - maybe we need "registered workers"; this would be an upstream feature
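The last suggestion can be sketched as follows. This is only an illustration of the idea, not the actual monitoring code: the key names "workers", "active_workers" and "inactive_workers" are assumptions about what the stats might expose, and the helper names are hypothetical. The sketch prefers a count of registered workers when one is reported and only falls back to summing active and inactive workers otherwise.

```python
from typing import Mapping


def total_workers(stats: Mapping[str, int]) -> int:
    """Return the worker count an alert could be based on.

    Assumed keys (hypothetical for this sketch):
      - "workers": number of registered workers, if reported at all
      - "active_workers" / "inactive_workers": busy / idle workers right now
    """
    registered = stats.get("workers")
    if registered is not None:
        # A registered count does not depend on what each worker is doing.
        return registered
    # Fallback: derive the total from the active/inactive snapshot.
    return stats.get("active_workers", 0) + stats.get("inactive_workers", 0)


def should_alert(stats: Mapping[str, int], minimum: int = 1) -> bool:
    """Alert when fewer than `minimum` workers are counted."""
    return total_workers(stats) < minimum
```

With this shape, a snapshot such as {"workers": 1, "active_workers": 0, "inactive_workers": 0} would not raise the alert, because the registered count takes precedence over the active/inactive sum.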

Rollback steps

  • Unpause the alert in grafana after confirming the monitoring is fine

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #96380: "minion workers" alert shows <1 total minion workers if active == 1 size:M (Resolved, mkittler, 2021-09-02)
Related to openQA Project (public) - action #102221: t/25-cache-service.t fails exceeding 90s timeout consistently size:M (Resolved, kraih, 2021-11-10, 2021-11-27)
Related to openQA Project (public) - action #96561: Speed up `t/25-cache-service.t` by avoiding forking to run Minion jobs (Resolved, kraih, 2021-08-04)

Actions #1

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #2

Updated by livdywan over 2 years ago

  • Subject changed from Minion workers alert triggering on and off to Minion workers alert triggering on and off size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz over 2 years ago

  • Related to action #96380: "minion workers" alert shows <1 total minion workers if active == 1 size:M added
Actions #4

Updated by okurz over 2 years ago

  • Due date set to 2022-07-07
  • Status changed from Workable to Feedback
  • Assignee set to okurz

It seems the problem has actually vanished since 2022-06-22 21:30 for an unknown reason, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?editPanel=17&tab=alert&orgId=1&from=1655889123384&to=1655943090423. I unpaused the alert and will monitor it. Still, we can make the alert more resilient: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/706
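This comment does not say what the merge request changes. As a rough sketch of one common way to make a flapping alert more resilient, the alert can be made to fire only after the worker count has stayed below the threshold for several consecutive samples. The function name and the `hold` parameter are hypothetical and not taken from the merge request.

```python
from typing import Iterable, List


def debounced_alert(samples: Iterable[int], minimum: int = 1, hold: int = 3) -> List[bool]:
    """For each sample, report whether the alert would be firing.

    The alert fires only after `hold` consecutive samples below `minimum`
    and clears as soon as one sample is back at or above `minimum`.
    """
    firing = []
    below = 0  # length of the current run of samples below the threshold
    for count in samples:
        below = below + 1 if count < minimum else 0
        firing.append(below >= hold)
    return firing


# Short dips to 0 workers are ignored; only the sustained dip fires.
print(debounced_alert([1, 0, 1, 0, 0, 0, 1]))
# [False, False, False, False, False, True, False]
```

The trade-off is that a real outage is reported `hold` samples later than with an immediate alert.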

Actions #5

Updated by kraih over 2 years ago

PR for trying to make monitoring the number of workers more reliable: https://github.com/os-autoinst/openQA/pull/4723

Actions #6

Updated by okurz over 2 years ago

  • Due date changed from 2022-07-07 to 2022-07-22

I just merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/706 because no one else wanted to take a look. Also https://build.opensuse.org/package/show/devel:openQA:Leap:15.4/perl-Minion is 10.25 now so we can go ahead with https://github.com/os-autoinst/openQA/pull/4723. Currently blocked by failing tests. Let's see if mergify can correctly rebase the PR and if tests pass then.

Actions #7

Updated by okurz over 2 years ago

  • Tags set to reactive work
Actions #9

Updated by livdywan over 2 years ago

  • Due date changed from 2022-07-22 to 2022-07-29

okurz wrote:

https://github.com/os-autoinst/openQA/pull/4723 needs https://github.com/Grinnz/Minion-Backend-SQLite/pull/21 first

Let's wait some more to give the author time to review and merge.

Actions #10

Updated by okurz over 2 years ago

  • Due date changed from 2022-07-29 to 2022-09-23
  • Status changed from Feedback to Blocked

I think we need to be more forgiving when waiting. Also setting the ticket to "Blocked" on https://github.com/Grinnz/Minion-Backend-SQLite/pull/21 now.

Actions #11

Updated by okurz over 2 years ago

PR merged. Waiting for the new package release.

Actions #12

Updated by livdywan over 2 years ago

  • Status changed from Blocked to Feedback

Can we confirm that this works now with the new package?

Actions #13

Updated by okurz about 2 years ago

  • Due date changed from 2022-09-23 to 2022-10-07

First we need to see the new release being used in the referenced PR. https://build.opensuse.org/package/show/openSUSE:Factory/perl-Minion-Backend-SQLite now has the updated version. I triggered a rebase of the PR, let's await the results.

Actions #14

Updated by okurz about 2 years ago

https://github.com/os-autoinst/openQA/pull/4723 progressed a bit. All tests pass, but we are missing coverage in lib/OpenQA/CacheService/Controller/Influxdb.pm, which I assume is no longer gathered from the existing t/25-cache-service.t so that collecting coverage in all subprocesses does not slow the tests down too much.

Actions #15

Updated by okurz about 2 years ago

  • Related to action #102221: t/25-cache-service.t fails exceeding 90s timeout consistently size:M added
Actions #16

Updated by okurz about 2 years ago

  • Related to action #96561: Speed up `t/25-cache-service.t` by avoiding forking to run Minion jobs added
Actions #17

Updated by okurz about 2 years ago

  • Due date changed from 2022-10-07 to 2022-10-14

https://github.com/os-autoinst/openQA/pull/4723 merged after a session together with tinita and mkittler. Awaiting feedback from production

Actions #18

Updated by okurz about 2 years ago

  • Status changed from Feedback to Resolved

No problems observed from production. https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=17&orgId=1&from=now-7d&to=now looks clean. Alerts are enabled and good.
