action #87898
closed
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
Add grafana alert for "broken workers" as reported by openQA
Added by okurz almost 4 years ago.
Updated over 3 years ago.
- Status changed from Workable to In Progress
- Assignee set to okurz
- Status changed from In Progress to Workable
- Assignee deleted (okurz)
I started with this but could not find the corresponding entries in influxdb. I have forgotten how to test this properly. But as we have too many tickets "In Progress", I will set this back to "Workable".
- Parent task changed from #78390 to #80142
- Status changed from Workable to In Progress
- Assignee set to mkittler
SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/442
> I started with this but could not find according entries in influxdb.
No entries were showing up due to permission errors. Even with --debug this was not visible at all and I could only figure it out by guessing. (So `grant select on table workers to telegraf;` fixed the problem.)
- Due date set to 2021-02-20
Setting due date based on mean cycle time of SUSE QE Tools
All three MRs are merged and effective. Today I found that the osd deployment failed the "1m after" and "10m after" deployment alert checks but not the "1h after" one. Can you please look into that and ensure that a deployment does not trigger the "broken" alert?
- Status changed from In Progress to Feedback
Let's wait until the next deployment to see whether it worked.
- Due date changed from 2021-02-20 to 2021-02-26
No broken workers in the web UI or alerts on osd-admins@suse.de that I can see. Bumping the due date so we can check again later this week. Alternatively, consider breaking a worker on purpose?
- Status changed from Feedback to Resolved
The alert hasn't fired during the deployment today although we had a few broken workers for a few minutes (< 15 minutes).
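For illustration, the behaviour described above (workers broken for only a few minutes do not trigger the alert) can be sketched as a "sustained for at least N minutes" check. This is only a minimal Python sketch, not the actual Grafana alert rule: the function name `should_alert`, the `(timestamp, broken_count)` sample format and the 15-minute grace period are assumptions derived from the "< 15 minutes" remark, not taken from the merged configuration.

```python
from datetime import datetime, timedelta


def should_alert(samples, grace=timedelta(minutes=15)):
    """Return True if the broken-worker count has been above zero
    continuously for at least `grace`, measured up to the latest sample.

    `samples` is a hypothetical list of (timestamp, broken_count) tuples
    sorted by timestamp, e.g. as read from a time-series database.
    """
    broken_since = None
    for timestamp, broken_count in samples:
        if broken_count > 0:
            # Remember when the current streak of broken workers began.
            if broken_since is None:
                broken_since = timestamp
        else:
            # Any healthy sample resets the streak.
            broken_since = None
    if broken_since is None:
        return False
    return samples[-1][0] - broken_since >= grace


# Example: workers are broken for 5 minutes during a deployment, then recover.
start = datetime(2021, 2, 24, 12, 0)
deployment = [(start + timedelta(minutes=m), 3 if m < 5 else 0) for m in range(0, 20)]
print(should_alert(deployment))
```

With a grace period like this, a short outage during a deployment stays silent, while workers that remain broken past the grace period would still raise the alert.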
- Due date deleted (2021-02-26)