action #64580

Detect and recover from I/O blocked worker machines, e.g. openqaworker-arm-{1,2,3}.suse.de

Added by okurz over 1 year ago. Updated 12 months ago.

Status: Workable
Priority: Normal
Assignee: -
Target version:
Start date: 2020-03-18
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #41882 we identified arm machines becoming completely unresponsive; these situations are now detected and recovered automatically. But there are also cases where a system is I/O blocked: the machine still responds to ping but is not "usable". In this situation the machine can still have openQA jobs assigned, which are then stuck for many hours. The machine is also not detected as broken in grafana and hence is never recovered automatically. We should detect situations like this and recover automatically.

Acceptance criteria

  • AC1: Machines in an I/O blocked state for multiple minutes/hours are detected and recovered, e.g. with a reboot, similar to or the same as "worker completely down"
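
One possible way to approach the detection part of AC1 is sketched below. This is only an illustration under assumptions, not something the ticket prescribes: it assumes the workers run a Linux kernel with pressure stall information (PSI, kernel 4.20 or newer) so that /proc/pressure/io exists, and the names parse_psi, is_io_blocked and read_io_pressure as well as the 90% threshold over the 300-second window are hypothetical. How the result would be exported to monitoring and how the recovery would be triggered is left open.

```python
from pathlib import Path

# Assumption: PSI is enabled on the worker (Linux >= 4.20).
PSI_IO_PATH = Path("/proc/pressure/io")


def parse_psi(text):
    """Parse PSI text into a dict like {"some": {"avg10": 0.0, ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_io_blocked(text, threshold=90.0, window="avg300"):
    """Return True if the "full" I/O stall percentage for the given window
    is at or above the threshold (hypothetical default: 90% over 300 s)."""
    psi = parse_psi(text)
    return psi.get("full", {}).get(window, 0.0) >= threshold


def read_io_pressure(path=PSI_IO_PATH):
    """Read the current PSI I/O data from the local machine."""
    return path.read_text()
```

On a worker, is_io_blocked(read_io_pressure()) could be evaluated periodically and the result reported as a metric. Since a badly I/O blocked machine may not be able to run such a check itself, an alert on missing data would probably also be needed.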

Suggestions


Related issues

Copied from openQA Infrastructure - action #41882: all arm worker die after some time (Resolved, 2018-10-02)

History

#1 Updated by okurz over 1 year ago

  • Copied from action #41882: all arm worker die after some time added

#2 Updated by okurz over 1 year ago

  • Status changed from New to Workable

#3 Updated by okurz about 1 year ago

  • Target version set to Ready

#4 Updated by okurz 12 months ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to sporadic, arm, worker

#5 Updated by okurz 12 months ago

  • Target version changed from Ready to future
