action #71590: [osd][alert] Implement proper monitoring of needed resources of workers - openQA Infrastructure (public) - openSUSE Project Management Tool

open

Added by nicksinger over 4 years ago. Updated over 4 years ago.

Status:

New

Priority:

Low

Assignee:

Category:

Target version:

QA (public) - future

Start date:

2020-09-21

Due date:

% Done:

Estimated time:

Description

We just had a case where all powerVM jobs were failing to boot (result: failed) because the VIOS of one of our powerVM hosts was down.
So while we monitor the worker host itself (grenache), there is no monitoring at all for its SUT machines (e.g. redcurrant). Ideas for what we could add to our monitoring:

PowerPC:
1. availability (ssh) of powerhmc1.suse.de and powerhmc2.suse.de
2. availability of the VIOS for a powerVM host (maybe by using the HMC api and polling the VIOS state?)
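
A minimal sketch of idea 1, assuming that "availability (ssh)" can be approximated by a successful TCP connection to port 22 on each HMC. The function names `is_port_open` and `check_hosts` are hypothetical and not part of any existing monitoring setup. The sketch does not cover idea 2, since polling the VIOS state depends on the HMC API details.

```python
import socket

# Hosts taken from the list above; port 22 is the standard SSH port.
HMC_HOSTS = ["powerhmc1.suse.de", "powerhmc2.suse.de"]


def is_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def check_hosts(hosts, port: int = 22, timeout: float = 5.0) -> dict:
    """Map each host name to its reachability result (hypothetical helper)."""
    return {host: is_port_open(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    for host, ok in check_hosts(HMC_HOSTS).items():
        print(f"{host}: {'reachable' if ok else 'UNREACHABLE'}")
```

A check like this only shows that the SSH port accepts connections; it does not verify that a login works or that the HMC is otherwise healthy, so it would be a first signal rather than a complete availability check.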

History

Updated by okurz over 4 years ago