action #94399
open
No alert when arm workers are offline, alert if telegraf throws errors size:M
0%
Description
Observation
On 2021-06-22, none of the arm workers (arm-1, arm-2, arm-3) could be reached via ssh or ping.
But https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 showed that all of them were Online.
Acceptance criteria
- AC1: We receive an alert e-mail when arm workers are down
- AC2: https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 should show the correct state
- AC3: We receive alert notices for errors in telegraf on osd
Suggestions
- We should look into feeding something into influxdb when the telegraf service, especially on OSD, shows errors, or into log error monitoring
- Then one could add a dashboard/graph with an alert in Grafana using the data from the first suggestion.
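One possible way to get telegraf log errors into influxdb is sketched below. It counts error lines per plugin and renders them as InfluxDB line protocol, which could then be collected, for example, by telegraf's `inputs.exec` plugin with `data_format = "influx"`. The measurement name `telegraf_log_errors`, the function names, and the regular expression are assumptions made for illustration. The regular expression is derived only from the single log line quoted in this ticket, so other error formats may need adjustments.

```python
import re
from collections import Counter

# Matches telegraf error lines such as:
#   ... E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
# Assumption: errors are marked with "E!" followed by the plugin name in brackets.
ERROR_RE = re.compile(r"E! \[(?P<plugin>[^\]]+)\] (?P<message>.*)$")


def count_errors_by_plugin(lines):
    """Count telegraf error log lines per plugin name."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("plugin")] += 1
    return counts


def escape_tag(value):
    """Escape commas, spaces and equals signs as line protocol requires for tag values."""
    for char in (",", " ", "="):
        value = value.replace(char, "\\" + char)
    return value


def to_line_protocol(counts, host, measurement="telegraf_log_errors"):
    """Render one line protocol point per plugin.

    The measurement name is hypothetical. No timestamp is written,
    so the receiver assigns one on arrival.
    """
    return [
        f"{measurement},host={escape_tag(host)},plugin={escape_tag(plugin)} count={count}i"
        for plugin, count in sorted(counts.items())
    ]
```

Fed with recent journal output of the telegraf unit, the resulting `count` field could serve as the data source for a Grafana graph with an alert threshold above zero.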
Files
Updated by Xiaojing_liu over 3 years ago
- Related to action #94438: OSD deployment fails at 2021-06-21 because ' openqaworker (arm-3 and arm-2) Minion did not return' added
Updated by okurz over 3 years ago
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by okurz over 3 years ago
The original problem was that the telegraf service on osd failed to get data for "ping" because one host was unresolvable. We worked around that by monitoring each host individually rather than pinging all hosts in one list. But the errors in the telegraf log did not affect the state of the telegraf service and did not show up in grafana. We should look into feeding something into influxdb for that, or into log error monitoring. The specific error in the telegraf log on osd:
-- Logs begin at Sun 2021-06-20 03:30:00 CEST, end at Wed 2021-06-23 11:40:04 CEST. --
Jun 23 07:53:40 openqa telegraf[2034]: 2021-06-23T05:53:40Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
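To illustrate why checking each host on its own isolates such a failure, here is a minimal sketch that resolves hosts one by one and reports only the ones that fail. This is not how telegraf itself is configured. The function name `find_unresolvable` and the injectable `resolve` parameter are assumptions for illustration; the host names are taken from this ticket.

```python
import socket


def find_unresolvable(hosts, resolve=socket.getaddrinfo):
    """Return the hosts whose name lookup fails, checking each host on its own.

    A failing lookup for one host does not prevent the others from being checked.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host, None)
        except socket.gaierror:
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Host names as mentioned in this ticket; results depend on the local DNS setup.
    print(find_unresolvable(["openqaworker-arm-1", "backup-vm"]))
```

The `resolve` parameter only exists so that the lookup can be replaced in tests without network access.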
Updated by okurz over 3 years ago
- Subject changed from No alert when arm workers are offline to No alert when arm workers are offline, alert if telegraf throws errors
- Description updated (diff)
- Priority changed from Urgent to Normal
Urgency reduced with #94456 fixed.
@Xiaojing_liu we want to prevent any alert emails about openqaworker-arm-[123] being down, as we have automatic mitigation when they are detected to be down. But one idea we have is that we should catch errors in telegraf, hence adding AC3.
Updated by livdywan over 3 years ago
We're estimating M, and Marius offered to narrow down the suggested steps a bit
Updated by livdywan over 3 years ago
- Subject changed from No alert when arm workers are offline, alert if telegraf throws errors to No alert when arm workers are offline, alert if telegraf throws errors size:M
Updated by okurz over 3 years ago
https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)
Updated by Xiaojing_liu over 3 years ago
okurz wrote:
https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)
I have confirmed that it looks correct for me. When I created this ticket, I uploaded a screenshot to show what I saw. :)