action #94399

open

No alert when arm workers are offline, alert if telegraf throws errors size:M

Added by Xiaojing_liu over 2 years ago. Updated about 1 year ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
2021-06-22
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

On 2021-06-22, none of the arm workers (arm-1, arm-2, arm-3) could be reached via ssh or ping.
However, https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 showed all of them as Online.

Acceptance criteria

Suggestions

  1. We should look into feeding something into influxdb when the telegraf service, especially on OSD, logs errors, or into monitoring the log for errors
  2. Then one could add a dashboard/graph with an alert within Grafana using the data from 1.
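
As a rough illustration of suggestion 1, telegraf error lines could be counted per plugin and rendered as InfluxDB line protocol. This is only a minimal sketch: the measurement name `telegraf_log_errors` and both function names are assumptions made up for this example, and the input is assumed to be journal lines in the format quoted later in this ticket.

```python
import re
from collections import Counter

# Matches telegraf error lines such as:
#   ... E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
ERROR_RE = re.compile(r"\bE! \[(?P<plugin>[^\]]+)\]")


def count_errors(lines):
    """Count telegraf error lines ("E!") per plugin name."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("plugin")] += 1
    return counts


def to_line_protocol(counts, measurement="telegraf_log_errors"):
    """Render the counts as InfluxDB line protocol, one line per plugin.

    The measurement name is hypothetical, not an existing OSD metric.
    """
    return [
        f"{measurement},plugin={plugin} count={count}i"
        for plugin, count in sorted(counts.items())
    ]
```

The lines could come from something like `journalctl -u telegraf`, and a Grafana alert for suggestion 2 could then trigger whenever `count` is above zero; how this would be wired into the OSD setup is left open here.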

Files

Screenshot_20210622_102648.png (322 KB) Xiaojing_liu, 2021-06-22 03:30

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #94438: OSD deployment fails at 2021-06-21 because ' openqaworker (arm-3 and arm-2) Minion did not return' (Resolved, okurz, 2021-06-22)

Actions #1

Updated by Xiaojing_liu over 2 years ago

  • Related to action #94438: OSD deployment fails at 2021-06-21 because ' openqaworker (arm-3 and arm-2) Minion did not return' added
Actions #2

Updated by okurz over 2 years ago

  • Status changed from New to Workable
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by okurz over 2 years ago

The original problem was that the telegraf service on OSD failed to get data for "ping" because one host was unresolvable. We worked around that by monitoring each host individually rather than pinging all hosts in one list. But the errors in the telegraf log did not affect the state of the telegraf service and did not show up in Grafana. We should look into feeding something into influxdb for that, or into log error monitoring. The specific error in the telegraf log on OSD:

-- Logs begin at Sun 2021-06-20 03:30:00 CEST, end at Wed 2021-06-23 11:40:04 CEST. --
Jun 23 07:53:40 openqa telegraf[2034]: 2021-06-23T05:53:40Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
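
The per-host workaround mentioned above might look roughly like the following sketch, which renders one `[[inputs.ping]]` stanza per host so that one unresolvable name does not affect the others. The helper name is hypothetical and the actual OSD configuration is not shown in this ticket, so treat this as an assumption about its shape, not as the deployed setup.

```python
def render_ping_inputs(hosts):
    """Render one telegraf [[inputs.ping]] stanza per host.

    Hypothetical helper: each host gets its own plugin instance
    instead of all hosts sharing a single "urls" list.
    """
    stanzas = []
    for host in hosts:
        stanzas.append(f'[[inputs.ping]]\n  urls = ["{host}"]\n')
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Host names taken from this ticket.
    print(render_ping_inputs(["openqaworker-arm-1", "backup-vm"]))
```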
Actions #4

Updated by okurz over 2 years ago

  • Subject changed from No alert when arm workers are offline to No alert when arm workers are offline, alert if telegraf throws errors
  • Description updated (diff)
  • Priority changed from Urgent to Normal

Urgency reduced with #94456 fixed.

@Xiaojing_liu we want to prevent any alert emails about openqaworker-arm-[123] being down, as we have automatic mitigation when they are detected to be down. But one idea we have is that we should catch errors in telegraf, hence adding AC3.

Actions #5

Updated by livdywan over 2 years ago

We're estimating M, and Marius offered to narrow down the suggested steps a bit

Actions #6

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #7

Updated by livdywan over 2 years ago

  • Subject changed from No alert when arm workers are offline, alert if telegraf throws errors to No alert when arm workers are offline, alert if telegraf throws errors size:M
Actions #8

Updated by okurz over 2 years ago

https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)

Actions #9

Updated by Xiaojing_liu over 2 years ago

okurz wrote:

https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 shows the correct status that openqaworker-arm-1/2/3 are "offline". Xiaojing_liu can you confirm that it looks the same for you? Maybe you were just "too fast" when looking there the first time? :)

I have confirmed it looks correct for me. When I created this ticket, I uploaded a screenshot to show what I saw. :)

Actions #10

Updated by okurz over 2 years ago

  • Target version changed from Ready to future
Actions #11

Updated by okurz about 1 year ago

  • Tags set to infra