action #162380

closed

2024-06-15 osd not accessible - causing false alerts for other hosts size:S

Added by okurz 6 months ago. Updated 15 days ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-06-17
Due date:
% Done: 0%
Estimated time:

Description

Observation

I received alerts about "[FIRING:1] DatasourceNoData Salt (000000001 A Nk0h5mB4z [openqa] openqaworker-arm-1 online (long-time) alert)" starting a few minutes after #162332 happened. The cause is not openqaworker-arm-1 itself but that OSD was down and not delivering data. In other cases we already treat "NoData" as acceptable. Possibly we forgot to do the same for the automatic recovery dashboard and need to apply it here as well.

Acceptance criteria

  • AC1: Outages of any other host do not trigger "no data" alerts for openqaworker-arm-1
  • AC2: Long-term unhandled outages of openqaworker-arm-1 still trigger alerts

Suggestions

  • Look into the git history of https://gitlab.suse.de/openqa/salt-states-openqa for when we changed "no data" to "OK", e.g. see f7180fe. Then identify the corresponding alert definitions related to the automatic recovery and adjust them accordingly
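
To help identify the alert definitions that still fire on missing data, a minimal sketch along the following lines could be used. It assumes the rules have already been loaded into plain dictionaries with a "title" and a "noDataState" field, as in Grafana's alert rule provisioning format. The function name, the sample data and the fallback to "NoData" for a missing field are assumptions made for illustration and are not taken from salt-states-openqa.

```python
from __future__ import annotations

from typing import Iterable


def rules_alerting_on_no_data(rules: Iterable[dict], keyword: str) -> list[str]:
    """Return titles of rules matching `keyword` whose noDataState is not "OK"."""
    return [
        rule["title"]
        for rule in rules
        # Assumption of this sketch: a missing noDataState is treated as "NoData",
        # i.e. the rule would raise a DatasourceNoData alert.
        if keyword in rule["title"] and rule.get("noDataState", "NoData") != "OK"
    ]


# Hypothetical sample data, not the real alert definitions.
sample_rules = [
    {"title": "openqaworker-arm-1 online (long-time) alert", "noDataState": "NoData"},
    {"title": "some other alert", "noDataState": "OK"},
]

print(rules_alerting_on_no_data(sample_rules, "online (long-time)"))
```

Any rule reported this way would be a candidate for setting "no data" to "OK", provided AC2 is still covered by the alert condition itself.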

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #162332: 2024-06-15 osd not accessible size:M (Resolved, okurz)
