action #78010

closed

unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN)

Added by okurz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2020-11-16
Due date: 2021-04-21
% Done: 0%
Estimated time:

Description

Observation

alert by email:
From: Monitoring User nagios@suse.de resent from: okurz@suse.com
To: okurz@suse.com
Date: 16/11/2020 10.01
Spam Status: Spamassassin
Notification: PROBLEM
Host: openqaworker3.suse.de
State: DOWN
Date/Time: Mon Nov 16 09:01:00 UTC 2020
Info: CRITICAL - 10.160.0.243: Host unreachable @ 10.160.0.44. rta nan, lost 100%

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=1&host=openqaworker3.suse.de

Acceptance criteria

  • AC1: openqaworker3 is "reboot-safe", e.g. at least 10 reboots in a row end up in a successfully booted system
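One way AC1 could be checked is sketched below. This is only an illustration under stated assumptions, not the procedure used in this ticket: the helper names (`host_answers_ping`, `wait_until`, `count_successful_reboots`, `is_reboot_safe`) are hypothetical, the `ping` flags assume Linux iputils, and an answered ping is only a rough proxy for "successfully booted" (a host stuck in an emergency shell might need a stricter check, such as a successful SSH login).

```python
import subprocess
import time
from typing import Callable


def host_answers_ping(host: str) -> bool:
    """Return True if `host` answers one ICMP echo request (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until(condition: Callable[[], bool], timeout_s: float, poll_s: float = 10.0) -> bool:
    """Poll `condition` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def count_successful_reboots(reboot_once: Callable[[], bool], required: int = 10) -> int:
    """Call `reboot_once` up to `required` times and stop at the first failure.

    `reboot_once` is a caller-supplied (hypothetical) function that triggers one
    reboot and returns True only if the system came back up afterwards.
    Returns the number of consecutive successful reboots observed.
    """
    for done in range(required):
        if not reboot_once():
            return done
    return required


def is_reboot_safe(reboot_once: Callable[[], bool], required: int = 10) -> bool:
    """AC1 as a predicate: `required` reboots in a row all ended in a booted system."""
    return count_successful_reboots(reboot_once, required) == required
```

A `reboot_once` implementation could trigger the reboot by whatever means is available, use `wait_until` to see the host stop answering, and then use `wait_until` again with `host_answers_ping` to see it return. Stopping at the first failure keeps a broken machine in its failed state for inspection instead of rebooting it again.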

Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure - action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed (Resolved, okurz, 2020-06-14 to 2020-07-07)

Related to openQA Infrastructure - action #71098: openqaworker3 down but no alert was raised (Resolved, okurz, 2020-09-08)

Related to openQA Infrastructure - action #88385: openqaworker3 host up alert is flaky (Rejected, okurz, 2021-02-01)

Related to openQA Infrastructure - action #88191: openqaworker2 boot ends in emergency shell (Resolved, mkittler, 2021-01-25)
