action #59079

handle machines that are down gracefully

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Start date: 05/11/2019
Priority: Normal
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration:
Description

Currently openqaworker-arm-3 is down; not even the management interface is reachable. We want to handle this in https://infra.nue.suse.com/NoAuth/Login.html?next=bae08f33248347ebf5b8a7d8f6d5b915 , but in the meantime we still want to apply the high state to the other machines.

History

#1 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback
  • Target version set to Current Sprint

It seems that with "salt-run manage.down removekeys=True" we can remove the keys of offline hosts automatically. I did that now and openqaworker-arm-3 got removed. We should re-add it when the worker is up again.
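
For reference, the key removal and the later re-adding could look roughly like the sketch below. It assumes it is run on the salt master. `RUN` and the function names are hypothetical helpers, not part of Salt: by default the commands are only printed. The `salt-key -a` step further assumes the minion has resubmitted its key after coming back up.

```shell
# Dry-run by default: commands are printed instead of executed.
# RUN is a hypothetical switch, not part of Salt; set RUN= (empty) to really run them.
RUN="${RUN-echo}"

# List minions that currently do not respond.
list_down_minions() {
    $RUN salt-run manage.down
}

# Remove the keys of all minions that are currently down.
remove_down_keys() {
    $RUN salt-run manage.down removekeys=True
}

# Accept the key of a worker again once it is back (-y skips the confirmation prompt).
readd_minion() {
    $RUN salt-key -y -a "$1"
}
```

Calling `readd_minion openqaworker-arm-3` with the default `RUN` only prints the `salt-key` call, so the commands can be reviewed before they are executed.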

As an alternative nicksinger suggested "--hide-timeout". The exit code is then 0 (success) even if there are offline minions.
I guess we should use a process that combines both: use "--hide-timeout" on automatic calls where we do not want to stop on down workers but want to keep other checks strict, and remove the key whenever we expect the machine to be down for longer than a day. The PKI should reflect our "currently running setup". If workers are down permanently (or get removed, like openqaw{1..4}), they should also be deleted from the PKI.
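
A minimal sketch of that combined process, assuming the high state is applied with `salt '*' state.highstate` (the exact call used in our automation may differ). `RUN`, the function names and the 24-hour threshold expressed in hours are illustrative assumptions, not existing tooling.

```shell
# Dry-run by default: commands are printed instead of executed.
# RUN is a hypothetical switch, not part of Salt; set RUN= (empty) to really run them.
RUN="${RUN-echo}"

# Automatic calls: hide minions that time out so down workers do not stop the run.
apply_highstate_automatic() {
    $RUN salt --hide-timeout '*' state.highstate
}

# Policy check: succeed if the expected downtime (in hours) is longer than a day,
# which is when the key should be removed from the PKI.
should_remove_key() {
    [ "$1" -gt 24 ]
}

# Remove the key of a single worker that is expected to stay down.
remove_key() {
    $RUN salt-key -y -d "$1"
}
```

For example, `should_remove_key 48 && remove_key openqaworker-arm-3` would print the `salt-key` deletion call, while a worker expected back within a few hours keeps its key and is simply hidden from the automatic run.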

Noted down in
https://confluence.suse.com/pages/diffpagesbyversion.action?pageId=192544831&selectedPageVersions=43&selectedPageVersions=42

#2 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

Whenever we encounter a specific problem, we can follow this procedure.
