coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
Add grafana alert for "broken workers" as reported by openQA
See #78390 and, if you like, https://chat.suse.de/channel/testing?msg=udQguXCPNRcAABnBg as well. Workers can be in a "broken" state, which openQA itself reports.
https://openqa.suse.de/admin/workers shows this list. However, we should also have an alert for unexpected "broken" workers.
- AC1: broken workers within https://openqa.suse.de/admin/workers raise an alert from monitor.qa.suse.de
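To illustrate what such an alert would be counting, here is a minimal sketch in Python. It assumes a worker listing that has already been parsed into a dict with a "workers" list whose entries carry a "status" field. The payload shape, the sample data and the function names are assumptions for illustration only, not taken from this ticket, and this is not how the telegraf-based monitoring discussed below collects the data.

```python
from collections import Counter


def count_workers_by_status(payload):
    """Count workers per status in an already-parsed worker listing.

    Assumption: payload looks like {"workers": [{"status": "..."}, ...]}.
    Entries without a status are counted as "unknown".
    """
    workers = payload.get("workers", [])
    return Counter(w.get("status", "unknown") for w in workers)


def broken_workers(payload):
    """Return only the entries whose status is "broken"."""
    return [w for w in payload.get("workers", []) if w.get("status") == "broken"]


# Hypothetical sample data, made up for this example.
sample = {
    "workers": [
        {"id": 1, "host": "worker-a", "instance": 1, "status": "idle"},
        {"id": 2, "host": "worker-a", "instance": 2, "status": "broken"},
        {"id": 3, "host": "worker-b", "instance": 1, "status": "running"},
    ]
}

counts = count_workers_by_status(sample)
print(counts["broken"])
print([w["id"] for w in broken_workers(sample)])
```

The per-status count is the kind of number a panel could plot, and the filtered list is what one would want to look at once the alert fires.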
- Status changed from Workable to In Progress
- Assignee set to mkittler
I started with this but could not find the corresponding entries in InfluxDB.
No entries were showing up due to permission errors. Even with
--debug this was not visible at all, and I could only figure it out by guessing. (So
grant select on table workers to telegraf; fixed the problem.)
OK, but please include that in salt as well. See https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L166 and the following lines. Also, please add an alert to the panel.
All three MRs are merged and effective. Today I found that the osd deployment alert checks "1m after" and "10m after" deployment failed, but not the "1h after" one. Can you please look into that and ensure that a deployment does not trigger the "broken" alert?
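One generic way to keep a short spike, such as workers restarting during a deployment, from firing an alert is to require the condition to hold for a minimum duration. The sketch below shows that idea in Python. It is only an illustration under assumptions: the function name, the sample format and the thresholds are hypothetical, and it does not describe what the MR referenced in the next comment actually changes.

```python
def should_alert(samples, threshold, hold_seconds):
    """Decide whether a sustained breach is present at the latest sample.

    samples: list of (timestamp_seconds, broken_count), sorted by time.
    Returns True only if broken_count has been above threshold without
    interruption for at least hold_seconds up to the latest sample.
    """
    if not samples:
        return False
    breach_start = None
    for ts, count in samples:
        if count > threshold:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = ts
        else:
            # Any sample at or below the threshold resets the breach.
            breach_start = None
    latest_ts = samples[-1][0]
    return breach_start is not None and latest_ts - breach_start >= hold_seconds


# Hypothetical data: a brief spike that recovers, and a sustained breach.
spike = [(0, 0), (60, 5), (120, 0), (180, 0)]
sustained = [(0, 3), (300, 3), (900, 3)]

print(should_alert(spike, threshold=0, hold_seconds=600))
print(should_alert(sustained, threshold=0, hold_seconds=600))
```

The trade-off is that a longer hold time suppresses more transient noise but also delays the notification for a real problem by the same amount.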
MR to fix that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/451 (commit message contains more details)