action #162380

open

2024-06-15 osd not accessible - causing false alerts about "[FIRING:1] DatasourceNoData Salt .*openqaworker-arm-1 online (long-time) alert"

Added by okurz 13 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2024-06-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

I received alerts about "[FIRING:1] DatasourceNoData Salt (000000001 A Nk0h5mB4z [openqa] openqaworker-arm-1 online (long-time) alert)" starting a few minutes after #162332 happened. The cause is not openqaworker-arm-1 itself: OSD was down and not delivering data. In other cases we already handle "NoData" as acceptable. Possibly we forgot to do the same for the automatic recovery dashboard and need to apply it here as well.

Acceptance criteria

  • AC1: Outages of any other host do not trigger "no data" alerts for openqaworker-arm-1
  • AC2: Long-term unhandled outages of openqaworker-arm-1 still trigger alerts

Suggestions

  • Look into the git history of https://gitlab.suse.de/openqa/salt-states-openqa for when we changed "no data" to "OK", e.g. see f7180fe. Then identify the corresponding alert definitions related to the automatic recovery and adjust them accordingly.
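
The search for alert definitions that still fire on missing data could be sketched roughly as below. This is only an illustration, not what salt-states-openqa actually does: it assumes the alert rules are available as a parsed Grafana alert provisioning structure (groups containing rules, each with an optional noDataState field), and the function name, the sample data and the second rule title are hypothetical.

```python
from typing import Any


def rules_not_ok_on_no_data(provisioning: dict[str, Any], title_filter: str = "") -> list[str]:
    """Return titles of alert rules whose noDataState is not "OK".

    `provisioning` is assumed to be an already parsed Grafana alert
    provisioning document (top-level "groups", each with "rules").
    `title_filter` optionally restricts the check to matching titles.
    """
    titles = []
    for group in provisioning.get("groups", []):
        for rule in group.get("rules", []):
            title = rule.get("title", "")
            if title_filter and title_filter not in title:
                continue
            # Assumption: a rule without an explicit noDataState is treated
            # as "NoData", i.e. it would fire when the datasource is silent.
            if rule.get("noDataState", "NoData") != "OK":
                titles.append(title)
    return titles


# Hypothetical sample data, only mimicking the shape of a provisioning file.
sample = {
    "apiVersion": 1,
    "groups": [
        {
            "name": "example",
            "rules": [
                {"title": "openqaworker-arm-1 online (long-time) alert", "noDataState": "NoData"},
                {"title": "some other alert", "noDataState": "OK"},
            ],
        }
    ],
}

for title in rules_not_ok_on_no_data(sample, title_filter="openqaworker-arm-1"):
    print(f'still alerts on "no data": {title}')
```

Whether setting "no data" to "OK" for these rules is compatible with AC2 depends on how the long-term outage of openqaworker-arm-1 itself is detected, so the output of such a check is a list of candidates to review rather than a list of rules to change blindly.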

Related issues (1 open, 0 closed)

Copied from openQA Infrastructure - action #162332: 2024-06-15 osd not accessible size:M (Workable, okurz)
