action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #151588

closed

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Rejected
Priority: Low
Assignee: okurz
Category:
Target version: openQA Project (public) - Ready
Start date: 2023-11-28
Due date:
% Done:
Estimated time:
Tags: alert, monitoring, infra, reactive work

Description

Motivation

In #150983-5 we realized that our "check all salt nodes are up" step never fails, as visible in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2017020#L38 . Didn't we have that step failing in the past? We likely also don't want it to fail whenever any random salt-controlled host is down. But then what ensures that we notice when systems are no longer connected over salt?

Acceptance criteria

AC1: We are alerted from somewhere if a host is not responsive, for example if it does not respond over salt
AC2: We are also alerted if a host is reachable over ping but does not respond over salt, e.g. because it is stuck in boot
AC3: We have a workflow for handling salt nodes which are down, so that OSD deployment can be unblocked

Suggestions

- Review the git log in https://gitlab.suse.de/openqa/osd-deployment or the ticket history about "check all salt nodes are up" to find out whether it was an intentional design change that the check for salt nodes never fails
  - If we really do not have any better alerting in grafana and we definitely need to do that in gitlab CI, then either make the check fatal OR document clearly that its logs are to be checked manually
- Consider using https://build.opensuse.org/package/show/openSUSE:Factory/retry instead of the manual for loop
- Ensure that somewhere we have an alert or check that actually fails if not all salt nodes are reachable, e.g. crosscheck the "host up" check and verify its operation by artificially triggering an alert
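
The "make it fatal" and "retry instead of the manual for loop" suggestions above can be sketched as follows. This is only an illustration, not the actual osd-deployment CI step, which is not shown in this ticket. The function names, the `probe` callable, and the idea of passing in an explicit list of expected nodes are all assumptions; in practice `probe` would wrap whatever salt call the pipeline already uses to list responding minions.

```python
import time
from typing import Callable, Iterable, Set


def unreachable_nodes(
    expected: Iterable[str],
    probe: Callable[[], Set[str]],
    attempts: int = 3,
    delay: float = 0.0,
) -> Set[str]:
    """Return the expected nodes that did not answer in any attempt.

    `probe` is a hypothetical callable returning the set of node names
    that responded in one round (for example parsed from salt output).
    """
    missing = set(expected)
    for attempt in range(attempts):
        # Drop every node that answered in this round.
        missing -= set(probe())
        if not missing:
            break
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return missing


def check_all_nodes_up(
    expected: Iterable[str],
    probe: Callable[[], Set[str]],
    attempts: int = 3,
    delay: float = 0.0,
) -> int:
    """Return a process exit code: 0 if all nodes answered, 1 otherwise."""
    missing = unreachable_nodes(expected, probe, attempts, delay)
    if missing:
        print("salt nodes not responding: " + ", ".join(sorted(missing)))
        return 1
    return 0
```

A CI step could pass the return value to `sys.exit()`, so that a node which stays unreachable after all retries fails the job instead of only appearing in the log. Whether that is desirable for every salt-controlled host is exactly the open question raised in the motivation.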

Out of scope

We shouldn't care about the one salt bug which caused "no response" as opposed to "not connected". It is unlikely to reappear and surprise us.

Related issues 3 (0 open — 3 closed)

Updated by okurz about 1 year ago

Copied from action #150983: CPU Load and usage alert for openQA workers size:S added

Updated by okurz about 1 year ago

Parent task deleted (~~#116713~~)

Updated by okurz about 1 year ago

Target version changed from Tools - Next to Ready

Updated by okurz about 1 year ago

Subject changed from Our salt node up check in osd-deployment never fails to [potential-regression] Our salt node up check in osd-deployment never fails size:M
Description updated (diff)
Status changed from New to Workable