action #64580
open
Detect and recover from I/O blocked worker machines, e.g. openqaworker-arm-{1,2,3}.suse.de
Start date:
2020-03-18
Due date:
% Done:
0%
Estimated time:
Description
Motivation
In #41882 we identified arm machines becoming completely unresponsive; we now detect these situations and recover from them automatically. However, there are also cases where a system is I/O blocked: the machine still responds to ping but is not "usable". In this situation the machine can still have openQA jobs assigned, which are then stuck for many hours. The machine is also not detected as broken in grafana and is hence never recovered automatically. We should detect a situation like this and recover automatically.
Acceptance criteria
- AC1: Machines in an I/O blocked state for multiple minutes/hours are detected and recovered, e.g. with a reboot, similar to or the same as "worker completely down"
Suggestions
- Check if there are already measurements available in grafana that could be used to trigger alerts, which then trigger the reboot actions in the same way as https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1
- If not, find an additional measurement/alert for this purpose
- Ensure the alerts and notification configurations are covered in salt
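As one possible candidate for such an additional measurement, the Linux kernel's pressure stall information (PSI) in /proc/pressure/io reports the share of time tasks were stalled on I/O. The following is only a minimal sketch, assuming the workers run a kernel with PSI enabled (not confirmed in this ticket); the function names, the 90% threshold and the 5-minute window are hypothetical choices for illustration, not an existing check.

```python
def parse_psi(text):
    """Parse the content of /proc/pressure/io.

    Returns e.g. {"some": {"avg10": 0.0, ...}, "full": {"avg10": 0.0, ...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_io_blocked(text, threshold=90.0, window="avg300"):
    """Return True if the "full" I/O stall percentage exceeds the threshold.

    "full" is the share of time in which all non-idle tasks were stalled
    on I/O. Threshold and window are hypothetical and would need tuning.
    """
    return parse_psi(text).get("full", {}).get(window, 0.0) > threshold


# Made-up sample values in the PSI file format, for illustration only
BLOCKED_SAMPLE = (
    "some avg10=99.00 avg60=98.50 avg300=97.20 total=123456789\n"
    "full avg10=98.00 avg60=97.00 avg300=95.10 total=120000000\n"
)
HEALTHY_SAMPLE = (
    "some avg10=0.50 avg60=0.30 avg300=0.10 total=4242\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=1000\n"
)
```

On a worker the input would come from reading /proc/pressure/io. The resulting value could be exported as a measurement so that a grafana alert can trigger the same reboot action as for "worker completely down".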
Updated by okurz almost 5 years ago
- Copied from action #41882: all arm worker die after some time added
Updated by okurz over 4 years ago
- Tags changed from caching, openQA, sporadic, arm, ipmi, worker to sporadic, arm, worker