action #78061
closed
[Alerting] openQA minion workers alert - alert turned "OK" again after 20 minutes and we don't know what was wrong
Start date: 2020-11-16
Due date:
% Done: 0%
Estimated time:
Description
Observation
[Alerting] openQA minion workers alert
From: Grafana osd-admins@suse.de
To: osd-admins@suse.de
Sender: osd-admins
List-Id:
Date: 16/11/2020 21.25
/[Alerting] openQA minion workers alert/
Minion workers down. Check systemd services on the openQA host
Metric name: Sum
Value: 0.940
However, checking on "the openQA host", which I guess means openqa.suse.de, shows no failed systemd services. What does "minion workers down" have to do with systemd services? Should the alert refer only to the service "openqa-gru.service" on osd?
Acceptance criteria
- AC1: The grafana panel description and/or alert has a better explanation of what is going on and what should be checked
Suggestions
- Understand the data source for https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=query&editPanel=17&orgId=1&refresh=30s
- Extend the grafana panel and/or alert description with a clearer explanation and specific instructions on what to do, e.g. which log of which service we should look into when we cannot see anything wrong at the time of checking because something "resolved itself" already
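
As an illustration of the kind of instruction the alert description could carry, here is a minimal sketch that builds a journalctl call for the time window around an alert, so the logs can still be inspected after the alert has gone back to "OK". It assumes the relevant unit is "openqa-gru.service", which is only the guess raised in the observation above, and that the alert mail timestamp is in the host's local time. The helper name journal_command and the 30-minute margins are hypothetical.

```python
import shlex
from datetime import datetime, timedelta


def journal_command(unit, alert_time,
                    before=timedelta(minutes=30),
                    after=timedelta(minutes=30)):
    """Build a journalctl command covering the window around an alert.

    The unit name and the margins are assumptions for illustration;
    adjust them once the alert's actual data source is understood.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return [
        "journalctl", "--no-pager",
        "--unit", unit,
        "--since", (alert_time - before).strftime(fmt),
        "--until", (alert_time + after).strftime(fmt),
    ]


# Alert mail from the observation: 16/11/2020 21.25 (assumed local time)
cmd = journal_command("openqa-gru.service", datetime(2020, 11, 16, 21, 25))
# Print a copy-pasteable command line instead of executing it
print(shlex.join(cmd))
```

The printed command can be run on the host that the alert refers to. Building the command as a list keeps the timestamps, which contain spaces, correctly quoted.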