action #151588

closed

[potential-regression] Our salt node up check in osd-deployment never fails size:M

Added by okurz about 1 year ago. Updated 10 months ago.

Status: Rejected
Priority: Low
Assignee:
Category: -
Start date: 2023-11-28
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #150983-5 we realized that our "check all salt nodes are up" step never fails, as visible in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2017020#L38 . Didn't we have that failing in the past? We likely also don't want that step to fail just because some random salt-controlled host is down. But then what ensures that we notice when systems are no longer connected over salt?

Acceptance criteria

  • AC1: We are alerted from somewhere if a host is unresponsive, for example if it does not respond over salt
  • AC2: We are also alerted if a host is reachable over ping but does not respond over salt, e.g. because it is stuck in boot
  • AC3: We have a workflow for handling salt nodes that are down, so that OSD deployment is unblocked
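
To make the difference between AC1 and AC2 concrete, here is a minimal Python sketch of how ping and salt results could be combined into one state per host. The function name, the state labels and the input shapes are hypothetical and only serve as illustration; how the ping and salt results are actually collected is not covered.

```python
from typing import Dict, Set


def classify_hosts(ping_ok: Dict[str, bool], salt_ok: Set[str]) -> Dict[str, str]:
    """Map each host to 'ok', 'salt-unresponsive' or 'unreachable'.

    ping_ok: host name -> whether the host answered a ping (assumed input)
    salt_ok: names of hosts that answered over salt (assumed input)
    """
    states = {}
    for host, pingable in ping_ok.items():
        if host in salt_ok:
            states[host] = "ok"
        elif pingable:
            # AC2: reachable over ping but silent over salt, e.g. stuck in boot
            states[host] = "salt-unresponsive"
        else:
            # AC1: host does not respond at all
            states[host] = "unreachable"
    return states
```

An alert would then fire for every host whose state is not "ok", with the state label telling the two cases apart.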

Suggestions

  • Review the git log in https://gitlab.suse.de/openqa/osd-deployment or the ticket history about "check all salt nodes are up" to find out whether it was an intentional design change that the check for salt nodes never fails
    • If we really do not have any better alerting in grafana and we definitely need to do that in gitlab CI, then either make it fatal OR document clearly that its logs are to be checked manually
  • Consider using https://build.opensuse.org/package/show/openSUSE:Factory/retry instead of the manual for loop
  • Ensure that somewhere we have an alert or check that actually fails if not all salt nodes are reachable, e.g. crosscheck the "host up" check and verify its operation by artificially triggering an alert
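
The "retry, then make it fatal" idea from the suggestions above could look roughly like the following Python sketch. This is not the actual osd-deployment check: `ping_all` is a hypothetical callable that returns one result per salt node (for example something wrapping `salt '*' test.ping`; how that is invoked and parsed is left open here), and the retry count and delay are arbitrary assumptions.

```python
import time
from typing import Callable, Dict, List


def unresponsive_nodes(ping_results: Dict[str, object]) -> List[str]:
    """Return the sorted names of nodes whose result is not exactly True."""
    return sorted(name for name, result in ping_results.items() if result is not True)


def check_nodes_up(
    ping_all: Callable[[], Dict[str, object]],  # hypothetical, see lead-in
    retries: int = 3,
    delay_s: float = 0.0,
) -> List[str]:
    """Call ping_all up to `retries` times; return nodes still down after the last try."""
    down: List[str] = []
    for attempt in range(retries):
        down = unresponsive_nodes(ping_all())
        if not down:
            return []
        if attempt < retries - 1:
            time.sleep(delay_s)
    return down


def exit_code_for(down: List[str]) -> int:
    """Non-zero exit code if any node is down, so that a CI step would fail."""
    return 1 if down else 0
```

A CI step would pass the result of `exit_code_for(check_nodes_up(...))` to `sys.exit`, so the job fails instead of only logging the unresponsive nodes.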

Out of scope

  • We shouldn't care about the one salt bug which caused "no response" as opposed to "not connected". It is unlikely to reappear and surprise us

Related issues 3 (0 open, 3 closed)

  • Related to openQA Infrastructure (public) - action #159270: openqaworker-arm-1 is Unreachable size:S (Resolved, ybonatakis, 2024-04-19)
  • Copied from openQA Infrastructure (public) - action #150983: CPU Load and usage alert for openQA workers size:S (Resolved, okurz)
  • Copied to openQA Infrastructure (public) - action #154627: [potential-regression] Ensure that our "host up" alert alerts on not host-up conditions size:M (Resolved, nicksinger)