action #94399

Updated by mkittler almost 3 years ago

## Observation 

 On 2021-06-22, none of the arm workers (arm-1, arm-2, arm-3) could be reached via `ssh` or `ping`. 
 But https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 showed that all of them were `Online`. 

 ## Acceptance criteria 
 * ~~**AC1:** We receive an alert e-mail when arm workers are down~~ 
 * **AC2:** https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 should show the correct state 
 * **AC3:** We receive alert notices for errors in telegraf on osd 

 ## Suggestions 
 1. We should look into feeding data into influxdb whenever the telegraf service, especially on OSD, shows errors, or into monitoring its log for errors.
 2. Then one could add a dashboard/graph with an alert within Grafana using the data from `1.`.
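
A minimal sketch of suggestion `1.`, assuming the telegraf log is available as plain text lines (for example from the journal) and that error lines carry telegraf's `E!` level marker. The function names and the measurement name `telegraf_log_errors` are hypothetical, and the host value is assumed to contain no spaces, commas or equals signs (those would need escaping in the InfluxDB line protocol).

```python
def count_error_lines(log_lines):
    """Count lines that carry telegraf's error level marker "E!"."""
    return sum(1 for line in log_lines if " E! " in line)


def to_line_protocol(host, error_count, measurement="telegraf_log_errors"):
    """Format the count as one InfluxDB line-protocol point without a timestamp.

    The measurement name is a hypothetical choice for this sketch.
    """
    return f"{measurement},host={host} errors={error_count}i"


if __name__ == "__main__":
    # Made-up sample lines in telegraf's log format, for illustration only.
    sample = [
        "2021-06-22T10:00:00Z I! Starting Telegraf",
        "2021-06-22T10:00:10Z E! [inputs.ping] Error in plugin: host arm-1: timeout",
    ]
    print(to_line_protocol("osd", count_error_lines(sample)))
```

The printed point could then be collected periodically, for instance by a telegraf exec input configured with `data_format = "influx"`, so that a Grafana alert can fire whenever `errors` is above zero. Whether that fits the existing OSD setup is an open question.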
