action #64580

open

Detect and recover from I/O blocked worker machines, e.g. openqaworker-arm-{1,2,3}.suse.de

Added by okurz almost 5 years ago. Updated over 4 years ago.

Status: Workable
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 2020-03-18
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #41882 we identified arm machines becoming completely unresponsive; we now detect these situations automatically and recover from them. However, there are also cases where a system is I/O blocked: the machine still responds to ping but is not "usable". In this state the machine can still have openQA jobs assigned, which then remain stuck for many hours. Because the machine is not detected as broken in grafana, it is never recovered automatically. We should detect this situation and recover from it automatically.

Acceptance criteria

  • AC1: Machines in an I/O blocked state for multiple minutes/hours are detected and recovered, e.g. with a reboot, similar to or the same as for "worker completely down"

Suggestions
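
One possible starting point for the detection part of AC1, as a minimal sketch rather than the existing monitoring setup: on Linux, tasks waiting on I/O usually sit in uninterruptible sleep (state "D" in /proc/<pid>/stat), so a machine that shows many such tasks across several consecutive samples could be treated as I/O blocked. The function names, the threshold of 5 tasks and the idea of a sample window are assumptions for illustration only; the recovery step (e.g. the reboot) is not shown.

```python
import os
from typing import Iterable, Iterator, Optional, Sequence


def parse_state(stat_line: str) -> Optional[str]:
    """Return the one-letter task state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split at the last ')' and take the next field.
    _, sep, rest = stat_line.rpartition(")")
    fields = rest.split()
    return fields[0] if sep and fields else None


def count_blocked(stat_lines: Iterable[str]) -> int:
    """Count tasks in uninterruptible sleep (state 'D')."""
    return sum(1 for line in stat_lines if parse_state(line) == "D")


def read_stat_lines(proc_root: str = "/proc") -> Iterator[str]:
    """Yield the stat line of every process currently listed under /proc."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                yield handle.read()
        except OSError:
            continue  # the process exited between listdir() and open()


def is_io_blocked(samples: Sequence[int], threshold: int = 5) -> bool:
    """True if every sample in the window saw at least `threshold` blocked tasks.

    `threshold` is a hypothetical value and would need tuning per machine.
    """
    return bool(samples) and all(count >= threshold for count in samples)
```

A periodic job could call count_blocked(read_stat_lines()) once per minute, keep the last few results, and raise an alert or trigger the same recovery used for "worker completely down" once is_io_blocked() returns True. Requiring every sample in the window to exceed the threshold is meant to avoid reacting to short I/O bursts.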


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #41882: all arm worker die after some time (Resolved, okurz, 2018-10-02)

Actions #1

Updated by okurz almost 5 years ago

  • Copied from action #41882: all arm worker die after some time added
Actions #2

Updated by okurz almost 5 years ago

  • Status changed from New to Workable
Actions #3

Updated by okurz over 4 years ago

  • Target version set to Ready
Actions #4

Updated by okurz over 4 years ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to sporadic, arm, worker
Actions #5

Updated by okurz over 4 years ago

  • Target version changed from Ready to future