action #71590

[osd][alert] Implement proper monitoring of needed resources of workers

Added by nicksinger 7 months ago. Updated 7 months ago.

Status: New
Priority: Low
Assignee: -
Target version:
Start date: 2020-09-21
Due date:
% Done: 0%
Estimated time:

Description

We just had a case where all powerVM jobs were failing to boot (result: failed) because the VIOS of one of our powerVM hosts was down.
So while we monitor the worker host itself (grenache), there is no monitoring at all for its SUT machines (e.g. redcurrant). Ideas for what we could add to our monitoring:

  • PowerPC:
    1. availability (ssh) of powerhmc1.suse.de and powerhmc2.suse.de
    2. availability of the VIOS for a powerVM host (maybe by using the HMC API and polling the VIOS state?)
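
As a starting point for the two ideas above, here is a minimal sketch in Python. It is not an existing check in our monitoring. `tcp_port_open` only tests that the SSH port accepts a TCP connection; it does not log in. `lpars_not_running` is a hypothetical helper that assumes the partition states were already fetched as comma-separated `name,state` lines (for example from the HMC command line over ssh); that format and the state string "Running" are assumptions and should be verified against the actual HMC output.

```python
import socket


def tcp_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def lpars_not_running(listing, expected_state="Running"):
    """Return partition names whose state differs from expected_state.

    `listing` is assumed to contain one 'name,state' pair per line.
    """
    down = []
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, state = line.partition(",")
        if state.strip() != expected_state:
            down.append(name.strip())
    return down


if __name__ == "__main__":
    # Idea 1: availability (ssh port) of both HMCs.
    for hmc in ("powerhmc1.suse.de", "powerhmc2.suse.de"):
        status = "up" if tcp_port_open(hmc) else "DOWN"
        print(f"{hmc}: ssh port {status}")
```

The result of either function could be turned into an alert by whatever collector we already use for the worker hosts; how to wire that up is left open here.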

History

#1 Updated by okurz 7 months ago

  • Priority changed from Normal to Low
  • Target version set to future

I would like that. But considering how slowly we currently progress within the infrastructure project, and that special hardware hypervisor hosts and such are out of scope of SUSE QA Tools (see https://progress.opensuse.org/projects/qa/wiki#Out-of-scope ), I consider this low+future.
