action #59079

closed

handle machines that are down gracefully

Added by okurz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2019-11-05
Due date:
% Done: 0%
Estimated time:

Description

Currently openqaworker-arm-3 is down; not even the management interface is reachable. While we want to handle this in https://infra.nue.suse.com/NoAuth/Login.html?next=bae08f33248347ebf5b8a7d8f6d5b915 , in the meantime we still want to apply the high state to the other machines.

Actions #1

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

It seems that with "salt-run manage.down removekeys=True" we can remove the keys of offline hosts automatically. I did that now and openqaworker-arm-3 got removed. We should re-add it when the worker is up again.

As an alternative, nicksinger suggested "--hide-timeout". The exit code is then 0 (success) even if there are offline minions.
I guess we should use a process that combines both: use "--hide-timeout" on automatic calls where we do not want to stop on down workers but want to keep other checks strict, and remove the key whenever we expect the machine to be down longer than a day. The PKI should reflect our "currently running setup". If workers are down permanently (or get removed, like openqaw{1..4}), they should also be deleted from the PKI.
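
A rough sketch of that combined process, only to illustrate the decision rule. The function names and the "state.highstate" target call are my assumptions and are not taken from our actual automation; the one-day threshold and the two salt invocations are the ones mentioned above. The functions only build the command lines and do not run anything.

```python
from datetime import timedelta

# Threshold from the comment above: remove the key when the machine is
# expected to be down longer than a day.
KEY_REMOVAL_THRESHOLD = timedelta(days=1)


def highstate_command(target="*"):
    """Build the automatic call that does not stop on offline minions.

    Hypothetical helper; "state.highstate" is assumed as the function
    behind "apply the high state".
    """
    return ["salt", "--hide-timeout", target, "state.highstate"]


def key_cleanup_command(expected_downtime):
    """Return the key-removal call for long outages, or None for short ones.

    Note that manage.down with removekeys=True removes the keys of all
    minions that are currently down, not only one specific machine.
    """
    if expected_downtime > KEY_REMOVAL_THRESHOLD:
        return ["salt-run", "manage.down", "removekeys=True"]
    return None
```

For example, key_cleanup_command(timedelta(hours=2)) yields no cleanup command, so only the "--hide-timeout" call would be used, while an expected downtime of several days yields the "salt-run manage.down removekeys=True" call. The resulting lists could be passed to subprocess.run on the salt master.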

Noted down in
https://confluence.suse.com/pages/diffpagesbyversion.action?pageId=192544831&selectedPageVersions=43&selectedPageVersions=42

Actions #2

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

Whenever we encounter a specific problem, we can follow this procedure.
