action #151588

closed

[potential-regression] Our salt node up check in osd-deployment never fails size:M

Added by okurz 5 months ago. Updated 3 months ago.

Status: Rejected
Priority: Low
Assignee:
Category: -
Target version:
Start date: 2023-11-28
Due date:
% Done: 0%
Estimated time:
Description

Motivation

In #150983-5 we realized that our "check all salt nodes are up" step never fails, as visible in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2017020#L38. Didn't that step fail in the past? We likely also don't want the deployment to fail just because some arbitrary salt-controlled host is down. But then what ensures that we notice when systems are no longer connected over salt?

Acceptance criteria

  • AC1: We are alerted by some mechanism if a host is not responsive, for example when it does not respond over salt
  • AC2: We are also alerted if a host is reachable over ping but does not respond over salt, e.g. because it is stuck in boot
  • AC3: We have a workflow for handling salt nodes which are down, so that OSD deployment can be unblocked

Suggestions

  • Review the git log in https://gitlab.suse.de/openqa/osd-deployment and the ticket history about "check all salt nodes are up" to find out whether it was an intentional design decision that the check for salt nodes never fails
    • If we really do not have any better alerting in grafana and we definitely need to do the check in gitlab CI, then either make it fatal OR document clearly that its logs need to be checked manually
  • Consider using https://build.opensuse.org/package/show/openSUSE:Factory/retry instead of the manual for loop
  • Ensure that somewhere we have an alert or check that actually fails if not all salt nodes are reachable, e.g. crosscheck the "host up" check and verify that it works by artificially triggering an alert
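
To illustrate the "make it fatal" and "retry instead of the manual for loop" suggestions, here is a minimal sketch in Python of a check that retries a few times and then reports failure if any node is still not responding. This is not the code from osd-deployment. The function names (`unresponsive_nodes`, `check_nodes`, `salt_ping`, `main`), the retry count and the delay are hypothetical, and the `salt` command line as well as the shape of its JSON output (one object mapping minion name to `true` or an error string) are assumptions that would need to be verified against the real job.

```python
import json
import subprocess
import sys
import time
from typing import Callable, Dict, List


def unresponsive_nodes(results: Dict[str, object]) -> List[str]:
    """Return the sorted names of all nodes whose ping result is not exactly True."""
    return sorted(name for name, ok in results.items() if ok is not True)


def check_nodes(ping: Callable[[], Dict[str, object]],
                attempts: int = 3, delay: float = 10.0) -> List[str]:
    """Call ping() up to `attempts` times and return the nodes still down after the last try."""
    down: List[str] = []
    for attempt in range(1, attempts + 1):
        down = unresponsive_nodes(ping())
        if not down:
            return []
        if attempt < attempts:
            time.sleep(delay)  # give slow or rebooting nodes time to come back
    return down


def salt_ping() -> Dict[str, object]:
    """Ask all minions for a ping (ASSUMED command line and output format)."""
    proc = subprocess.run(
        ["salt", "--out=json", "--static", "*", "test.ping"],
        capture_output=True, text=True, check=False,
    )
    # Unparsable output raises an exception here, which also fails the job.
    return json.loads(proc.stdout)


def main(ping: Callable[[], Dict[str, object]] = salt_ping,
         attempts: int = 3, delay: float = 10.0) -> int:
    """Return 0 if all nodes answered, 1 otherwise, so that a CI step can fail."""
    down = check_nodes(ping, attempts=attempts, delay=delay)
    if down:
        print("salt nodes not responding: " + ", ".join(down), file=sys.stderr)
        return 1
    print("all salt nodes are up")
    return 0
```

A CI step could end with `sys.exit(main())` to make the check fatal. If the step should stay non-fatal, the same output could be kept as a log to be checked manually, as mentioned in the suggestion above.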

Out of scope

  • We shouldn't care about the one salt bug which caused "no response" as opposed to "not connected". It is unlikely to reappear and surprise us

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure - action #150983: CPU Load and usage alert for openQA workers size:S (Resolved, okurz)

Copied to openQA Infrastructure - action #154627: [potential-regression] Ensure that our "host up" alert alerts on not host-up conditions size:M (Resolved, nicksinger)

Actions #1

Updated by okurz 5 months ago

  • Copied from action #150983: CPU Load and usage alert for openQA workers size:S added
Actions #2

Updated by okurz 5 months ago

  • Parent task deleted (#116713)
Actions #3

Updated by okurz 4 months ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by okurz 4 months ago

  • Subject changed from Our salt node up check in osd-deployment never fails to [potential-regression] Our salt node up check in osd-deployment never fails size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz 3 months ago

  • Priority changed from Normal to Low
Actions #6

Updated by okurz 3 months ago

  • Copied to action #154627: [potential-regression] Ensure that our "host up" alert alerts on not host-up conditions size:M added
Actions #7

Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from Workable to Blocked
  • Assignee set to okurz
Actions #8

Updated by okurz 3 months ago

  • Status changed from Blocked to Rejected

#154627 is resolved, which means we are again/still notified if hosts are (completely) down. Good enough for now. I guess we don't need to do more here.
