action #94399

No alert when arm workers are offline, alert if telegraf throws errors size:M

Added by Xiaojing_liu 3 months ago. Updated 3 months ago.

Status: Workable
Priority: Normal
Assignee: -
Target version:
Start date: 2021-06-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

On 2021-06-22, none of the arm workers (arm-1, arm-2, arm-3) could be reached via ssh or ping.
However, https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 showed all of them as Online.

Acceptance criteria

Suggestions

  1. We should look into feeding something into influxdb when the telegraf service, especially on OSD, shows errors, or into log error monitoring
  2. Then one could add a dashboard/graph with an alert within Grafana using the data from 1.
Screenshot_20210622_102648.png (322 KB) Xiaojing_liu, 2021-06-22 03:30

Related issues

Related to openQA Infrastructure - action #94438: OSD deployment fails at 2021-06-21 because ' openqaworker (arm-3 and arm-2) Minion did not return' (Resolved, 2021-06-22)

History

#1 Updated by Xiaojing_liu 3 months ago

  • Related to action #94438: OSD deployment fails at 2021-06-21 because ' openqaworker (arm-3 and arm-2) Minion did not return' added

#2 Updated by okurz 3 months ago

  • Status changed from New to Workable
  • Priority changed from Normal to Urgent
  • Target version set to Ready

#3 Updated by okurz 3 months ago

The original problem was that the telegraf service on osd was failing to get data for "ping" because one host was unresolvable. We worked around that by monitoring each host individually rather than pinging all hosts in one list. But the errors in the telegraf log did not affect the telegraf service and did not show up in grafana. We should look into feeding something into influxdb for that, or into log error monitoring. The specific error in the telegraf log on osd:

-- Logs begin at Sun 2021-06-20 03:30:00 CEST, end at Wed 2021-06-23 11:40:04 CEST. --
Jun 23 07:53:40 openqa telegraf[2034]: 2021-06-23T05:53:40Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
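
One possible way to turn such log lines into data for influxdb, as a minimal sketch only: the Python snippet below counts telegraf error lines per plugin and prints them in InfluxDB line protocol. The measurement name `telegraf_log_errors`, the tag names, and both function names are hypothetical and not part of any existing setup. The assumption is that something like this could read journal output and be run periodically, e.g. by telegraf's exec input with the "influx" data format, but that wiring is not shown here.

```python
import re
from collections import Counter

# Matches telegraf error lines such as:
#   ... telegraf[2034]: 2021-06-23T05:53:40Z E! [inputs.ping] Error in plugin: ...
ERROR_LINE = re.compile(r"\bE! \[(?P<plugin>[\w.]+)\]")


def count_plugin_errors(log_lines):
    """Count telegraf error lines per plugin name (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("plugin")] += 1
    return counts


def to_line_protocol(counts, host):
    """Render the counts as InfluxDB line protocol points.

    The measurement name "telegraf_log_errors" is an assumption
    made for this sketch.
    """
    return [
        f"telegraf_log_errors,host={host},plugin={plugin} count={count}i"
        for plugin, count in sorted(counts.items())
    ]


# Sample input taken from the journal excerpt above.
sample = [
    "-- Logs begin at Sun 2021-06-20 03:30:00 CEST, end at Wed 2021-06-23 11:40:04 CEST. --",
    "Jun 23 07:53:40 openqa telegraf[2034]: 2021-06-23T05:53:40Z E! "
    "[inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host",
]

for point in to_line_protocol(count_plugin_errors(sample), host="openqa"):
    print(point)
```

A Grafana alert could then trigger on a non-zero `count` for any plugin, which would cover the "no such host" case without relying on the telegraf service itself failing.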

#4 Updated by okurz 3 months ago

  • Subject changed from No alert when arm workers are offline to No alert when arm workers are offline, alert if telegraf throws errors
  • Description updated (diff)
  • Priority changed from Urgent to Normal

Urgency reduced with #94456 fixed.

@Xiaojing_liu we want to prevent any alert emails about openqaworker-arm-[123] being down as we have automatic mitigation if they are detected to be down. But an idea we have is that we should catch errors in telegraf, hence adding AC3

#5 Updated by cdywan 3 months ago

We're estimating M, and Marius offered to narrow down the suggested steps a bit

#6 Updated by mkittler 3 months ago

  • Description updated (diff)

#7 Updated by cdywan 3 months ago

  • Subject changed from No alert when arm workers are offline, alert if telegraf throws errors to No alert when arm workers are offline, alert if telegraf throws errors size:M

#8 Updated by okurz 3 months ago

https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)

#9 Updated by Xiaojing_liu 3 months ago

okurz wrote:

https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)

I have confirmed that it looks correct for me. When I created this ticket, I uploaded a screenshot to show what I saw. :)

#10 Updated by okurz 3 months ago

  • Target version changed from Ready to future
