action #71590

open

[osd][alert] Implement proper monitoring of needed resources of workers

Added by nicksinger about 4 years ago. Updated about 4 years ago.

Status: New
Priority: Low
Assignee: -
Category: -
Target version: QA (public, currently private due to #173521) - future
Start date: 2020-09-21
Due date:
% Done: 0%
Estimated time:
Description

We just had a case where all powerVM jobs were failing to boot (result: failed) because the VIOS of one of our powerVM hosts was down.
So while we monitor the worker host itself (grenache), there is no monitoring at all for its SUT machines (e.g. redcurrant). Ideas for what we could add to our monitoring:

  • PowerPC:
    1. availability (ssh) of powerhmc1.suse.de and powerhmc2.suse.de
    2. availability of the VIOS for a powerVM host (maybe by using the HMC API and polling the VIOS state?)
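
A rough sketch of how the two checks above could look, not a finished monitoring plugin. The ssh check only tests whether TCP port 22 accepts a connection. The VIOS part assumes that partition states can be obtained from the HMC as plain `name,state` lines (for example via the HMC command line over ssh) and that a healthy partition reports `Running`; both the output format and the helper names `parse_lpar_states` and `down_lpars` are assumptions for illustration and would need to be verified against the real HMC.

```python
import socket

# HMC hosts named in this ticket.
HMC_HOSTS = ["powerhmc1.suse.de", "powerhmc2.suse.de"]


def is_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def parse_lpar_states(output):
    """Parse 'name,state' lines into a dict (assumed output format)."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, state = line.partition(",")
        states[name] = state
    return states


def down_lpars(states, ok_state="Running"):
    """Return the sorted names of partitions not in the assumed healthy state."""
    return sorted(name for name, state in states.items() if state != ok_state)


if __name__ == "__main__":
    for host in HMC_HOSTS:
        status = "ssh reachable" if is_port_open(host) else "UNREACHABLE"
        print(f"{host}: {status}")
```

The result of `down_lpars` could then be fed into whatever alerting we already use for the worker hosts, so that a stopped VIOS raises an alert before jobs start failing.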