action #64514

closed

openqaworker7 is down and IPMI SOL very unstable

Added by okurz about 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
2020-03-16
Due date:
2020-03-19
% Done:

0%

Estimated time:

Description

Observation

from [#opensuse-factory](irc://chat.freenode.net/opensuse-factory) :

[16/03/2020 14:15:40] <Martchus> Seems like openqaworker7 is offline. I cannot ssh to it (from openqa.opensuse.org). It was last seen by the web UI about 20 hours ago.
[16/03/2020 14:16:11] <Martchus> Currently there's job https://openqa.opensuse.org/tests/1203580 which is stuck in assigned because of that.
[16/03/2020 14:17:43] <Martchus> That's not related to one of my latest changes. The worker is really dead and doesn't send any status updates. So the re-scheduling code has no chance to run anyway.
[16/03/2020 14:19:47] <Martchus> It looks like other slots on openqaworker7 had more luck (e.g. instance 2 has been seen 2 hours ago).
[16/03/2020 14:21:09] <Martchus> I now "restarted" the job which was assigned to openqaworker7:14. (Restarting assigned jobs leads to them being re-scheduled so this is easy to work around.)
[16/03/2020 14:22:03] <okurz> Martchus: I connected with IPMI SOL to worker7 and exactly at the moment of connecting via SOL it passed the PXE boot menu. Could be it's stuck in a reboot cycle
[16/03/2020 14:26:42] <Martchus> okurz: Have you tried it one more time with IPMI? Are you currently handling it? I don't want to interfere with what you currently might be doing.
[16/03/2020 14:38:54] <okurz> Martchus: yeah, I am currently trying to handle it but it seems the machine has a more severe problem. It does not boot up and constantly dumps messages on the tty that the journal cannot be written due to a r/o fs
[16/03/2020 14:40:01] <Martchus> okurz: Then I guess we need to open an infra ticket.
[16/03/2020 14:40:14] <okurz> Martchus: as long as we have SoL it's our job :)
[16/03/2020 14:40:32] <okurz> Martchus: but someone seems to interfere with the machine as well right now
[16/03/2020 14:41:06] <Martchus> It isn't me. I've just tried ssh access a few times.
[16/03/2020 14:42:37] <Martchus> okurz: So it only looks like a software/setup problem? I'd assume if we have access to the machine but it crashes all the time due to hardware issues it is something for infra, right?
[16/03/2020 14:44:53] <okurz> Martchus: maybe. I will look at sensor information and power reset it first before creating a ticket though
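The restart mentioned at 14:21 above goes through the openQA job API. A minimal sketch, assuming a current openqa-cli with o3 API credentials configured (the job id is the one from the chat; this is only an illustration, not necessarily how it was done at the time):

openqa-cli api --o3 -X POST jobs/1203580/restart   # re-schedule the job stuck in "assigned" on openqaworker7:14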

ipmi sensor does not show anything out of the ordinary. The IPMI connection is not very stable either, even from another machine within the wired SUSE network, often ending with

Error sending SOL data: FAIL
SOL session closed by BMC

The root filesystem seems to have problems as it is mounted read-only. Also mount -o rw,remount / fails with

[ 1481.307993] BTRFS info (device sda1): disk space caching is enabled
[ 1481.315214] BTRFS error (device sda1): Remounting read-write after error is not allowed
mount: /: mount point not mounted or bad option.
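For reference, a minimal sketch of the ipmitool calls behind the sensor check and the flaky SOL session described above; BMC host and credentials are placeholders, not taken from this ticket:

ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sensor list    # sensor readings, nothing out of the ordinary
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol activate   # serial-over-LAN console, the session that kept dropping
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> sol deactivate # close a stale SOL session before reconnecting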

Related issues: 1 (0 open, 1 closed)

Copied to openQA Project - action #69784: Workers not considered offline after ungraceful disconnect; stale job detection has no effect in that case (Resolved, mkittler)

Actions #1

Updated by okurz about 4 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Trying to investigate further, e.g. btrfs scrub start /, …
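A minimal sketch of that scrub attempt and how to check its outcome, using standard btrfs-progs commands on the mounted root filesystem:

btrfs scrub start /    # verify checksums of all data and metadata in the background
btrfs scrub status /   # poll until finished and look at the error counters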

Actions #2

Updated by okurz about 4 years ago

scrub reported no problem, and btrfs check --force /dev/sda1 did not report a problem either. I tried a lot, restarting multiple times, but did not manage to properly bring up the network. Manually I could set the link of eth0 up, but wicked refuses to request a DHCP address on the interface. Debugging over IPMI SOL is hard, and even more so when the connection drops. Something seems to be really fishy here about the network access. Called mc reset cold and will try again.
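Roughly, the commands this corresponds to; eth0 is the interface named above, and the remote ipmitool invocation is an assumption about how the BMC reset was issued:

btrfs check --force /dev/sda1                                             # read-only check despite the device being in use
ip link set eth0 up                                                       # bring the link up manually
wicked ifup eth0                                                          # let wicked configure the interface via DHCP - this kept failing
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> mc reset cold   # cold-reset the BMC before retrying SOL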

I guess the important message is the following:

[   27.527428] BTRFS error (device sda1): parent transid verify failed on 30976278528 wanted 2403657 found 2431659
[   27.538248] BTRFS error (device sda1): parent transid verify failed on 30976278528 wanted 2403657 found 2431659
[   27.549020] BTRFS error (device sda1): parent transid verify failed on 30976278528 wanted 2403657 found 2431659
[   27.559970] BTRFS warning (device sda1): Skipping commit of aborted transaction.
[   27.568057] BTRFS: error (device sda1) in cleanup_transaction:1881: errno=-5 IO failure
[   27.568063] BTRFS: error (device sda1) in btrfs_drop_snapshot:9361: errno=-5 IO failure
[   27.576752] BTRFS info (device sda1): forced readonly
[   27.577081] BTRFS info (device sda1): delayed_refs has NO entry

Trying

mount -t btrfs -o ro,nospace_cache /dev/sda1 /mnt

as suggested in https://askubuntu.com/a/300664, because the tool "btrfs-zero-log" does not seem to exist. Now trying https://askubuntu.com/a/171291
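Note: the standalone btrfs-zero-log tool was folded into the main btrfs binary long ago, so the equivalent last-resort steps would look roughly like the following sketch (clearing the log discards the most recent transactions; /mnt/recovery is a placeholder for scratch space on another disk):

btrfs rescue zero-log /dev/sda1          # modern replacement for btrfs-zero-log
btrfs restore /dev/sda1 /mnt/recovery    # if mounting still fails, copy files off the broken filesystem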

EDIT: 2020-03-18: No luck in repairing btrfs. What I did not yet achieve is to resurrect openqaworker7. I don't even know if the workers in o3 can boot from PXE offering an installation source. Does anyone else want to try?

Actions #3

Updated by okurz about 4 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
Actions #4

Updated by okurz about 4 years ago

  • Due date set to 2020-03-19
  • Status changed from Workable to Feedback
  • Assignee set to okurz

Debugging session together with mkittler:

We had the idea that we could try an older snapshot again. So from grub we loaded a very old snapshot from 2020-02-13, from when w7 was still an "osd worker". From the booted system we could list all snapshots and loaded one of the more recent snapshots, but not the last one.

openqaworker7:~ # snapper ls
446  | single |       | Thu Mar 12 01:08:13 2020 | root |  40.09 MiB | number  | Snapshot Update of #445 |              
447  | single |       | Fri Mar 13 01:03:10 2020 | root |  93.55 MiB | number  | Snapshot Update of #446 |              
448  | single |       | Sat Mar 14 01:13:36 2020 | root |  12.51 MiB |         | Snapshot Update of #447 |              
449+ | single |       | Sun Mar 15 01:42:31 2020 | root |  15.27 MiB |         | Snapshot Update of #448 |              
450  | pre    |       | Wed Mar 18 15:21:04 2020 | root |  16.00 KiB | number  | zypp(zypper)            | important=no 
openqaworker7:~ # snapper rollback 447
Creating read-only snapshot of current system. (Snapshot 451.)
Creating read-write snapshot of snapshot 447. (Snapshot 452.)
Setting default subvolume to snapshot 452.

From there we opened a transactional-update shell, called zypper dup and rebooted. Now everything works again. Seems this was a good way to recover. openQA jobs were picked up again. Waiting until the next nightly update to call this resolved.
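For the record, a condensed sketch of the recovery sequence described above; the snapshot number is the one from the snapper output, and the exact reboot points are an assumption:

snapper rollback 447          # make a read-write copy of the known-good snapshot the new default
reboot                        # boot into the rolled-back snapshot
transactional-update shell    # open a shell in a new snapshot of the rolled-back system
zypper dup                    # bring the old snapshot back to the current package level
exit
reboot                        # activate the updated snapshot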

@mkittler thanks a lot for joining and helping me.

Actions #5

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

The machine correctly updated the system and then rebooted:

Mar 19 00:04:55 openqaworker7 systemd[1]: Starting Update the system...
Mar 19 00:04:55 openqaworker7 transactional-update[28477]: Checking for newer version.
…
Mar 19 00:05:16 openqaworker7 transactional-update[28477]: Calling zypper up
…
Mar 19 00:06:11 openqaworker7 transactional-update[28477]: transactional-update finished - informed rebootmgr
Mar 19 00:06:11 openqaworker7 systemd[1]: Started Update the system.
…
Mar 19 03:30:00 openqaworker7 rebootmgrd[4347]: rebootmgr: reboot triggered now!
Mar 19 03:30:00 openqaworker7 xinetd[4340]: Exiting...
Mar 19 03:30:00 openqaworker7 worker[4518]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4518]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20083/x_gWTtpY40z4lTx3/broadcast
Mar 19 03:30:00 openqaworker7 systemd-logind[1418]: System is rebooting.
Mar 19 03:30:00 openqaworker7 worker[4512]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4510]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4513]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4509]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4520]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4514]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4514]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20053/1XIQ_SRwmWMoRxqL/broadcast
Mar 19 03:30:00 openqaworker7 worker[4508]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4508]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20043/lD0q9X66FOfJ0ZpO/broadcast
Mar 19 03:30:00 openqaworker7 worker[4521]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4519]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4511]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4515]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4516]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 worker[4517]: [info] Quit due to signal TERM
Mar 19 03:30:00 openqaworker7 smartd[1347]: smartd received signal 15: Terminated
Mar 19 03:30:00 openqaworker7 systemd[1]: Stopped target Multi-User System.
…
Mar 19 03:31:56 openqaworker7 systemd[1]: Shutting down.
Mar 19 03:31:57 openqaworker7 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Mar 19 03:31:57 openqaworker7 haveged[789]: haveged: Stopping due to signal 15
Mar 19 03:31:57 openqaworker7 haveged[789]: haveged starting up
Mar 19 03:31:57 openqaworker7 iscsid[4317]: iscsid: iscsid shutting down.
Mar 19 03:31:57 openqaworker7 systemd-journald[766]: Journal stopped
-- Reboot --
Mar 19 03:35:48 localhost kernel: microcode: microcode updated early to revision 0xb000038, date = 2019-06-18

and the system is up and running again, executing openQA tests, no failed systemd services.

Actions #6

Updated by dzedro over 3 years ago

  • Status changed from Resolved to Feedback

This machine has been stuck for 6 months and does not take any jobs, despite appearing to be online/active?

Actions #7

Updated by okurz over 3 years ago

  • Copied to action #69784: Workers not considered offline after ungraceful disconnect; stale job detection has no effect in that case added
Actions #8

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

Yes, that's not how it should be, but that is a different problem. I reported #69784 for the wrong reporting, because openqaworker7 has been an o3 worker for months and is happily working on o3 jobs, e.g. see https://openqa.opensuse.org/admin/workers/193 :)
