action #64514

closed

openqaworker7 is down and IPMI SOL very unstable

Added by okurz almost 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 2020-03-16
Due date: 2020-03-19
% Done: 0%
Estimated time:
Description

Observation

From [#opensuse-factory](irc://chat.freenode.net/opensuse-factory):

[16/03/2020 14:15:40] <Martchus> Seems like openqaworker7 is offline. I can not ssh to it (from openqa.opensuse.org). It has been seen about 20 hours ago by the web UI.
[16/03/2020 14:16:11] <Martchus> Currently there's job https://openqa.opensuse.org/tests/1203580 which is stuck in assigned because of that.
[16/03/2020 14:17:43] <Martchus> That's not related to one of my latest changes. The worker is really dead and doesn't send any status updates. So the re-scheduling code has no chance to run anyways.
[16/03/2020 14:19:47] <Martchus> It looks like other slots on openqaworker7 had more luck (e.g. instance 2 has been seen 2 hours ago).
[16/03/2020 14:21:09] <Martchus> I now "restarted" the job which was assigned to openqaworker7:14. (Restarting assigned jobs leads to them being re-scheduled so this is easy to workaround.)
[16/03/2020 14:22:03] <okurz> Martchus: I connected with IPMI SOL to worker7 and exactly in the instance of connecting SOL it passed PXE boot menu. Could be it's stuck in a reboot cycle
[16/03/2020 14:26:42] <Martchus> okurz: Have you tried it one more time with IPMI? Are you currently handling it? I don't want to interfere with what you currently might be doing.
[16/03/2020 14:38:54] <okurz> Martchus: yeah, I am currently trying to handle it but seems the machine has a more severe problem. it does not bootup and constantly dumps messages on tty that the journal can not be written due to r/o fs
[16/03/2020 14:40:01] <Martchus> okurz: Then I guess we need to open an infra ticket.
[16/03/2020 14:40:14] <okurz> Martchus: as long as we have SoL it's our job :)
[16/03/2020 14:40:32] <okurz> Martchus: but someone seems to interfere with the machine as well right now
[16/03/2020 14:41:06] <Martchus> It isn't me. I've just tried ssh access a few times.
[16/03/2020 14:42:37] <Martchus> okurz: So it only looks like a software/setup problem? I'd assume if we have access to the machine but it crashes all the time due to hardware issues it is something for infra, right?
[16/03/2020 14:44:53] <okurz> Martchus: maybe. I will look at sensor information and power reset it first before creating a ticket though

The IPMI sensor output does not show anything out of the ordinary. The IPMI SOL connection is not very stable, even from another machine within the wired SUSE network, and often ends with:

Error sending SOL data: FAIL
SOL session closed by BMC
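Because the SOL session drops this often, it can help to wrap the ipmitool calls in a small retry loop. The following is only a sketch under stated assumptions: the BMC hostname and user name in the usage note are hypothetical, the password is expected in the IPMI_PASSWORD environment variable (ipmitool's -E option), and it has not been verified here that ipmitool exits non-zero when the BMC closes the session.

```python
import subprocess
import time


def ipmi_argv(host, user, *subcommand):
    """Build an ipmitool command line for the lanplus interface.

    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable, so it does not appear in the process list.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *subcommand]


def run_with_retries(argv, attempts=5, delay=2.0,
                     run=subprocess.call, sleep=time.sleep):
    """Run argv until it exits with 0 or the attempts are used up.

    Returns the exit code of the last run. Assumption: a dropped SOL
    session shows up as a non-zero exit code.
    """
    code = None
    for attempt in range(1, attempts + 1):
        code = run(argv)
        if code == 0:
            break
        if attempt < attempts:
            sleep(delay)  # give the BMC a moment before reconnecting
    return code
```

With a hypothetical BMC address, `run_with_retries(ipmi_argv("worker7-bmc.example", "admin", "sol", "activate"))` would reattach the console, and the same helper works for the `sensor` and `chassis power reset` subcommands mentioned in the chat.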

The root filesystem seems to have problems: it is mounted read-only, and mount -o rw,remount / fails with:

[ 1481.307993] BTRFS info (device sda1): disk space caching is enabled
[ 1481.315214] BTRFS error (device sda1): Remounting read-write after error is not allowed
mount: /: mount point not mounted or bad option.
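The read-only state described above can also be detected from a script, for example as part of a worker health check. The sketch below parses text in the /proc/mounts format (device, mount point, filesystem type, options, ...) and reports whether a mount point carries the ro option. The function names and the sample entry are illustrative assumptions, not part of openQA.

```python
def mount_options(mounts_text, mountpoint="/"):
    """Return the option list of the last entry for mountpoint, or None.

    The last matching entry wins because later mounts shadow earlier ones.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            options = fields[3].split(",")
    return options


def is_read_only(mounts_text, mountpoint="/"):
    """True if mountpoint is listed and mounted with the 'ro' option."""
    options = mount_options(mounts_text, mountpoint)
    return options is not None and "ro" in options


# Illustrative sample resembling the situation in this ticket.
SAMPLE_MOUNTS = (
    "/dev/sda1 / btrfs ro,relatime,space_cache 0 0\n"
    "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
)
```

On a live Linux system the input would come from `open("/proc/mounts").read()`. Note that such a check only reports the state; as the kernel message shows, btrfs refuses to remount read-write after an error, so recovery needs a reboot or a filesystem repair.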

Related issues 1 (0 open, 1 closed)

Copied to openQA Project (public) - action #69784: Workers not considered offline after ungraceful disconnect; stale job detection has no effect in that case (Resolved, mkittler)
