action #87898: Add grafana alert for "broken workers" as reported by openQA - openQA Project (public) - openSUSE Project Management Tool

action #87898

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Add grafana alert for "broken workers" as reported by openQA

Added by okurz over 4 years ago. Updated about 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

mkittler

Category:

Organisational

Target version:

Ready

Start date:

2021-01-18

Due date:

% Done:

Estimated time:

Description

Motivation

See #78390 and, optionally, https://chat.suse.de/channel/testing?msg=udQguXCPNRcAABnBg. openQA itself can report workers as "broken".
https://openqa.suse.de/admin/workers shows this list. However, we should also have an alert for unexpected "broken" workers.

Acceptance criteria

AC1: Broken workers listed on https://openqa.suse.de/admin/workers raise an alert from monitor.qa.suse.de

Updated by okurz over 4 years ago

Status changed from Workable to In Progress
Assignee set to okurz

Updated by okurz over 4 years ago

Status changed from In Progress to Workable
Assignee deleted (~~okurz~~)

I started with this but could not find according entries in influxdb. I no longer remember how to test this properly. As we have too many tickets "In Progress", I am setting this back to "Workable".

Updated by okurz over 4 years ago

Parent task changed from #78390 to #80142

Updated by mkittler over 4 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/442

okurz wrote:
I started with this but could not find according entries in influxdb.

No entries showed up because of permission errors. Even with --debug the errors were not visible at all, so I could only figure it out by guessing. (So grant select on table workers to telegraf; fixed the problem.)
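For illustration, here is a minimal sketch of the kind of count query that needs read access to the workers table. It is not the query from the merge request: SQLite stands in for PostgreSQL so the example runs anywhere, the table layout is made up, and treating a non-NULL "error" column as "broken" is an assumption. The GRANT statement is only carried as a string because SQLite has no permission system.

```python
import sqlite3

# PostgreSQL statement quoted in the comment above; kept for reference only,
# because SQLite (used here as a stand-in) has no GRANT.
GRANT_STATEMENT = "grant select on table workers to telegraf;"

# Hypothetical query: assumes a broken worker is a row with a non-NULL "error" column.
COUNT_BROKEN = "SELECT count(*) FROM workers WHERE error IS NOT NULL"


def count_broken_workers(conn: sqlite3.Connection) -> int:
    """Return the number of rows the hypothetical query treats as broken."""
    return conn.execute(COUNT_BROKEN).fetchone()[0]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory workers table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workers (id INTEGER PRIMARY KEY, host TEXT, error TEXT)"
    )
    conn.executemany(
        "INSERT INTO workers (host, error) VALUES (?, ?)",
        [("worker-a", None), ("worker-b", "example error"), ("worker-c", None)],
    )
    return conn


if __name__ == "__main__":
    print(count_broken_workers(demo_connection()))
```

Without SELECT permission for the collecting role, a query like this returns nothing usable, which would match the empty result seen in influxdb.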

Updated by okurz over 4 years ago

mkittler wrote:
So grant select on table workers to telegraf; fixed the problem.

OK, but please include that in salt as well. See https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L166 and the following lines. Please also add an alert to the panel.

Updated by openqa_review over 4 years ago

Due date set to 2021-02-20

Setting due date based on mean cycle time of SUSE QE Tools

Updated by mkittler over 4 years ago

SRs for further improvements: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/444 https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/443

Updated by mkittler over 4 years ago

SR for alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/447

Updated by okurz over 4 years ago

All three MRs are merged and in effect. Today I found that the osd deployment's "1m after" and "10m after" alert checks failed, but not the "1h after" one. Can you please look into that and ensure that a deployment does not trigger the "broken" alert?

Updated by mkittler over 4 years ago

MR to fix that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/451 (commit message contains more details)

Updated by mkittler over 4 years ago

Status changed from In Progress to Feedback

Let's wait until the next deployment to see whether it worked.

Updated by livdywan over 4 years ago

Due date changed from 2021-02-20 to 2021-02-26

No broken workers in the web UI or alerts on osd-admins@suse.de that I can see. Bumping the due date so we can check again later this week. Alternatively, consider breaking a worker on purpose?

Updated by mkittler over 4 years ago

Status changed from Feedback to Resolved

The alert hasn't fired during the deployment today although we had a few broken workers for a few minutes (< 15 minutes).
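The behaviour described above (briefly broken workers during a deployment do not fire the alert) can be sketched as a "condition must hold for a grace period" check. This is a generic illustration, not the rule from the merge request and not Grafana's implementation; the 15-minute grace period is an assumption inferred from the "< 15 minutes" remark, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Assumed grace period, inferred from the "< 15 minutes" remark above.
GRACE_PERIOD = timedelta(minutes=15)


def alert_fires(
    samples: Iterable[Tuple[datetime, int]],
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True if the broken-worker count stays above zero for at least `grace`.

    `samples` are (timestamp, broken_worker_count) pairs in chronological order.
    """
    broken_since: Optional[datetime] = None
    for timestamp, broken in samples:
        if broken > 0:
            # Remember when the current uninterrupted broken streak started.
            if broken_since is None:
                broken_since = timestamp
            if timestamp - broken_since >= grace:
                return True
        else:
            # Any healthy sample resets the streak.
            broken_since = None
    return False
```

With one sample per minute, a streak of broken workers lasting under 15 minutes never reaches the threshold, while an uninterrupted streak of 15 minutes or more does.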

Updated by okurz about 4 years ago

Due date deleted (~~2021-02-26~~)
