action #71590
open[osd][alert] Implement proper monitoring of needed resources of workers
Start date: 2020-09-21
Due date:
% Done: 0%
Estimated time:
Description
We just had a case where all powerVM jobs were failing to boot (result: failed) because the VIOS of one of our powerVM hosts was down.
So while we monitor the worker host itself (grenache), there is no monitoring at all for its SUT machines (e.g. redcurrant). Ideas for what we could add to our monitoring:
- PowerPC:
  - availability (ssh) of powerhmc1.suse.de and powerhmc2.suse.de
  - availability of the VIOS for a powerVM host (maybe by using the HMC API and polling the VIOS state?)
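
A minimal sketch of the first idea (ssh availability of the HMCs), assuming that a successful TCP connection to port 22 is a good enough proxy for "ssh is available". The function names `is_reachable` and `check_hosts` are hypothetical, as is the 5 second timeout; only the two host names come from this ticket. Polling the VIOS state through the HMC is not covered here.

```python
import socket

# Host names taken from the ticket; port 22 is the standard SSH port.
HMC_HOSTS = ["powerhmc1.suse.de", "powerhmc2.suse.de"]


def is_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def check_hosts(hosts, port: int = 22, timeout: float = 5.0) -> dict:
    """Map each host name to its reachability result."""
    return {host: is_reachable(host, port, timeout) for host in hosts}


if __name__ == "__main__":
    for host, ok in check_hosts(HMC_HOSTS).items():
        print(f"{host}: {'up' if ok else 'DOWN'}")
```

A check like this only shows that the SSH port answers, not that a login works, so it would catch a host that is down but not a broken account or key.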