action #94399
open
No alert when arm workers are offline, alert if telegraf throws errors size:M
Added by Xiaojing_liu almost 4 years ago.
Updated about 2 years ago.
Description
Observation
On 2021-06-22, none of the arm workers (arm-1, arm-2, arm-3) could be reached via ssh or ping.
However, https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 showed all of them as Online.
Acceptance criteria
Suggestions
- We should look into feeding something into influxdb when the telegraf service, especially on OSD, shows errors, or into log error monitoring
- Then one could add a dashboard/graph with an alert in Grafana using the data from the first suggestion.
- Related to action #94438: OSD deployment fails at 2021-06-21 because ' openqaworker (arm-3 and arm-2) Minion did not return' added
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
The original problem was that the telegraf service on osd was failing to get data for "ping" because one host was unresolvable. We worked around that by monitoring each host individually rather than pinging all hosts in one list. But the errors in the telegraf log did not affect the telegraf service and did not show up in grafana. We should look into feeding something into influxdb for that, or into log error monitoring. The specific error in the telegraf log on osd:
-- Logs begin at Sun 2021-06-20 03:30:00 CEST, end at Wed 2021-06-23 11:40:04 CEST. --
Jun 23 07:53:40 openqa telegraf[2034]: 2021-06-23T05:53:40Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
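A minimal sketch of how such log errors could be turned into data for influxdb, assuming the errors keep the `E! [plugin]` format shown above. The function names and the measurement name `telegraf_log_errors` are hypothetical and not part of any existing setup; this only illustrates the idea and is not a tested solution for OSD.

```python
import re
from collections import Counter

# Matches telegraf error lines as they appear in the journal, e.g.
#   ... telegraf[2034]: 2021-06-23T05:53:40Z E! [inputs.ping] Error in plugin: ...
# Assumption: error lines always carry "E!" followed by the plugin name in brackets.
ERROR_RE = re.compile(r"telegraf\[\d+\]: \S+ E! \[(?P<plugin>[^\]]+)\]")


def count_plugin_errors(lines):
    """Count telegraf 'E!' log lines per plugin name."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("plugin")] += 1
    return counts


def to_line_protocol(counts, measurement="telegraf_log_errors"):
    """Render the counts as InfluxDB line protocol, one line per plugin.

    The measurement name is a hypothetical placeholder.
    """
    return [
        f"{measurement},plugin={plugin} errors={n}i"
        for plugin, n in sorted(counts.items())
    ]
```

Fed with recent journal output of the telegraf unit, the resulting lines could be collected periodically, for example by a telegraf exec input reading influx-formatted output, so that a Grafana alert can fire when the error count is above zero. Whether that is preferable to generic log error monitoring is still open.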
- Subject changed from No alert when arm workers are offline to No alert when arm workers are offline, alert if telegraf throws errors
- Description updated (diff)
- Priority changed from Urgent to Normal
Urgency reduced with #94456 fixed.
@Xiaojing_liu we want to prevent any alert emails about openqaworker-arm-[123] being down, as we have automatic mitigation if they are detected to be down. But one idea we have is that we should catch errors in telegraf, hence adding AC3.
We're estimating M, and Marius offered to narrow down the suggested steps a bit
- Description updated (diff)
- Subject changed from No alert when arm workers are offline, alert if telegraf throws errors to No alert when arm workers are offline, alert if telegraf throws errors size:M
okurz wrote:
https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)
I have confirmed that it looks correct for me. When I created this ticket, I uploaded a screenshot to show what I saw. :)
- Target version changed from Ready to future