action #78010

unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN)

Added by okurz 2 months ago. Updated about 17 hours ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2020-11-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

alert by email:
From: Monitoring User nagios@suse.de resent from: okurz@suse.com
To: okurz@suse.com
Date: 16/11/2020 10.01
Spam Status: Spamassassin
Notification: PROBLEM
Host: openqaworker3.suse.de
State: DOWN
Date/Time: Mon Nov 16 09:01:00 UTC 2020
Info: CRITICAL - 10.160.0.243: Host unreachable @ 10.160.0.44. rta nan, lost 100%

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=1&host=openqaworker3.suse.de

Acceptance criteria

  • AC1: openqaworker3 is "reboot-safe", e.g. at least 10 reboots in a row end up in a successfully booted system

Related issues

Related to openQA Infrastructure - action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed (Resolved, 2020-06-14 to 2020-07-07)

Related to openQA Infrastructure - action #71098: openqaworker3 down but no alert was raised (Resolved, 2020-09-08 to 2020-11-30)

History

#1 Updated by okurz 2 months ago

  • Due date set to 2020-11-18
  • Status changed from Workable to Feedback

I checked just now with ping openqaworker3 and could reach the system, and ssh openqaworker3 uptime showed that it has been up for 0:41, i.e. since about 12:07 UTC.

Everything seems to be in order.

#2 Updated by nicksinger 2 months ago

Guess that was me. A simple "chassis power cycle" did the trick.
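
Presumably that was done via IPMI, i.e. something along these lines (the IPMI host name and credentials below are placeholders, not the real ones):

# power cycle the machine via its BMC (host name/credentials are placeholders)
ipmitool -I lanplus -H openqaworker3-ipmi.suse.de -U ADMIN -P '...' chassis power cycle
# re-attach the serial console afterwards
ipmitool -I lanplus -H openqaworker3-ipmi.suse.de -U ADMIN -P '...' sol activate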

#3 Updated by okurz 2 months ago

Interesting. But do you know what caused the problem then?

#4 Updated by okurz 2 months ago

  • Assignee changed from okurz to nicksinger

nicksinger Interesting. But do you know what caused the problem then?

#5 Updated by nicksinger about 2 months ago

  • Assignee changed from nicksinger to okurz

no. SOL was stuck once again and I just triggered a reboot

#6 Updated by okurz about 2 months ago

  • Status changed from Feedback to Resolved
  • Assignee changed from okurz to nicksinger

hm, ok. I checked again, everything seems to be in order. I don't know what else we can do. Ok, thanks for fixing it.

#7 Updated by okurz about 2 months ago

  • Related to action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed added

#8 Updated by okurz about 2 months ago

  • Related to action #71098: openqaworker3 down but no alert was raised added

#9 Updated by okurz about 2 months ago

  • Subject changed from [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN to unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN)
  • Description updated (diff)
  • Due date deleted (2020-11-18)
  • Status changed from Resolved to Workable
  • Priority changed from Urgent to High

ok. I know what we can do. This failed again after the weekly automatic reboot that is triggered when there are kernel updates. openqaworker3 was stuck in emergency mode, and this time even a power reset ended up in a similar situation. It might work to simply reboot once more, but as we have had these problems repeatedly and still do, we should test at least 5-10 reboots in a row.
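
For AC1, a loop along these lines could drive the test (a minimal sketch assuming passwordless root ssh to the worker from another host; host name and counts are just examples):

#!/bin/sh
# Reboot the worker repeatedly and check that it comes back with a healthy
# NVMe setup (mount present, openqa_nvme_format.service not failed).
host=openqaworker3.suse.de
for i in $(seq 1 10); do
    echo "### reboot $i"
    ssh "root@$host" reboot || true
    sleep 120
    # wait up to ~10 minutes for ssh to come back
    for t in $(seq 1 60); do
        ssh -o ConnectTimeout=5 "root@$host" true && break
        sleep 10
    done
    ssh "root@$host" 'mountpoint -q /var/lib/openqa && ! systemctl is-failed --quiet openqa_nvme_format.service' \
        || { echo "### reboot $i FAILED"; exit 1; }
done
echo "### all reboots succeeded"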

In sol activate I found:

[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for /var/lib/openqa.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for /var/lib/openqa/share.
[DEPEND] Dependency failed for Remote File Systems.
[DEPEND] Dependency failed for openQA Worker #7.
[DEPEND] Dependency failed for openQA Worker #12.
[DEPEND] Dependency failed for openQA Worker #8.
[DEPEND] Dependency failed for openQA Worker #5.
[DEPEND] Dependency failed for openQA Worker #2.
[DEPEND] Dependency failed for openQA Worker #3.
[DEPEND] Dependency failed for openQA Worker #9.
[DEPEND] Dependency failed for openQA Worker #1.
[DEPEND] Dependency failed for openQA Worker #13.
[DEPEND] Dependency failed for openQA Worker #11.
[DEPEND] Dependency failed for openQA Worker #4.
[DEPEND] Dependency failed for openQA Worker #6.
[DEPEND] Dependency failed for openQA Worker #10.

Seems I did not get that right in #68050 after all.

#10 Updated by coolo about 2 months ago

But the ssh service failed as well.

#11 Updated by nicksinger about 2 months ago

I've rejected the salt-key on OSD for now to prevent automatic startup of the workers. What I found while booting is that the system hangs for quite some time, with the last message printed being: [ 1034.526144] kexec_file: kernel signature verification failed (-129). - which seems to fit with okurz's suggestion that this is caused by kernel updates.
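
For reference, the key handling on the salt master and the check for the kexec message boil down to something like this (the exact salt-key invocation used may have differed):

# on OSD (salt master): list minion keys and drop the worker's key so that no
# highstate restarts its services
salt-key -L
salt-key -d openqaworker3.suse.de

# on the worker, once reachable: look for the kexec message in the kernel log
dmesg | grep -i kexec_file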

#12 Updated by okurz about 2 months ago

nicksinger wrote:

[…] seems to fit with okurz suggestion that this is caused by kernel updates.

but what I meant is only that some package upgrades, for example kernel updates, trigger a reboot, which is all according to plan. Have you seen #78010#note-9 regarding the systemd services? I suspect that this is, again or still, simply a problem of systemd service dependencies and fully within our control.
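
A few commands that could help to confirm or rule out the dependency theory on the worker itself (a sketch, unit names taken from the boot log above):

# inspect why the mount chain failed and how the units are ordered
systemctl status openqa_nvme_format.service
journalctl -b -u openqa_nvme_format.service
systemctl show -p Before,After,Requires,Wants openqa_nvme_format.service
systemctl list-dependencies --after var-lib-openqa.mount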

#13 Updated by okurz about 2 months ago

please see #77011#note-18 for changes that fvogt has applied. We should crosscheck the changes he did, commit to salt and ensure this is applicable for all machines and then apply the same to all, o3 and osd.

#14 Updated by nicksinger about 2 months ago

So I diffed with what fvogt did:

diff --git a/openqa/nvme_store/openqa_nvme_create.service b/openqa/nvme_store/openqa_nvme_create.service
new file mode 100644
index 0000000..9e9b57b
--- /dev/null
+++ b/openqa/nvme_store/openqa_nvme_create.service
@@ -0,0 +1,20 @@
+[Unit]
+Description=Create array on NVMe if necessary
+# Let's hope this is close enough to "all nvmes present"
+Requires=dev-nvme0n1.device
+After=dev-nvme0n1.device
+DefaultDependencies=no
+
+[Service]
+Type=oneshot
+
+# It's not really possible to wait for that to happen, so do it here
+ExecStart=/bin/sh -c "if ! mdadm --detail --scan | grep -qi openqa; then mdadm --assemble --scan || true; fi"
+# Create striped storage for openQA from all NVMe devices when / resides on
+# another device or from a potential third NVMe partition when there is only a
+# single NVMe device for the complete storage
+# For some reason mdadm --create doesn't send an udev event, so do it manually.
+ExecStart=/bin/sh -c -e 'if lsblk -n | grep -q "raid"; then exit 0; fi; if lsblk -n | grep -v nvme | grep -q "/$"; then mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1; else mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3; fi; udevadm trigger -c add /dev/md/openqa'
+
+[Install]
+WantedBy=multi-user.target
diff --git a/openqa/nvme_store/openqa_nvme_format.service b/openqa/nvme_store/openqa_nvme_format.service
index f6f1c16..8e9d97c 100644
--- a/openqa/nvme_store/openqa_nvme_format.service
+++ b/openqa/nvme_store/openqa_nvme_format.service
@@ -1,19 +1,12 @@
 [Unit]
-Description=Setup NVMe before mounting it
+Description=Create Ext2 FS on /dev/md/openqa
 Before=var-lib-openqa.mount
+Requires=dev-md-openqa.device
+After=dev-md-openqa.device
 DefaultDependencies=no

 [Service]
 Type=oneshot
-
-# Create striped storage for openQA from all NVMe devices when / resides on
-# another device or from a potential third NVMe partition when there is only a
-# single NVMe device for the complete storage
-ExecStart=/bin/sh -c 'lsblk -n | grep -q "raid" || lsblk -n | grep -v nvme | grep "/$" && (mdadm --stop /dev/md/openqa >/dev/null 2>&1; mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1) || mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3'
-# Ensure device is correctly initialized but also spend a little time before
-# trying to create a filesystem to prevent a "busy" error
-ExecStart=/bin/sh -c 'grep nvme /proc/mdstat'
-ExecStart=/bin/sh -c 'mdadm --detail --scan | grep openqa'
 ExecStart=/sbin/mkfs.ext2 -F /dev/md/openqa

 [Install]

I'm not convinced yet that this will resolve our problem. I'd rather look into why our array devices are busy in the first place.
I suspect that something assembles our array before we format it, so I assume we have a dependency problem.

From a first, quick look I think it could be related to mdmonitor.service. From https://linux.die.net/man/8/mdadm:

[…] all arrays listed in the configuration file will be monitored. Further, if --scan is given, then any other md devices that appear in /proc/mdstat will also be monitored. 

Which would explain why the device is busy. I will experiment with some dependencies for our service so that it runs before mdmonitor.
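
A sketch of what such an ordering experiment could look like, assuming mdmonitor.service really is what keeps the array busy (this only orders the two units, it does not stop an already running monitor):

# add a drop-in so openqa_nvme_format.service is ordered before mdmonitor.service
mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/before-mdmonitor.conf <<'EOF'
[Unit]
Before=mdmonitor.service
EOF
systemctl daemon-reload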

#15 Updated by nicksinger about 1 month ago

  • Status changed from Workable to Blocked

Hm, I had the bright idea of removing the mdraid module from the initramfs. This now causes the machine to hang in the dracut recovery shell (so it fails even earlier than before). I wanted to rebuild the initramfs but failed to boot any recovery media. I've opened [RT-ADM #182375] "Machine openqaworker3.suse.de boots wrong PXE image" to address the fact that PXE on that machine currently seems to boot some infra machine image. And of course I forgot to add osd-admins@suse.de to CC, so I will keep you updated here…

#16 Updated by nicksinger about 1 month ago

  • Status changed from Blocked to In Progress

The machine can boot from PXE again. Apparently we had specific PXE images for each openqaworker back in the day, interesting :)
I was able to boot a TW installer from which I can hopefully recover the initramfs.
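
For the initramfs recovery itself, something along these lines should work from the installer's rescue shell (the device name is an assumption based on this machine's root RAID):

# mount the installed root and rebuild the initrd for all installed kernels
mount /dev/md4 /mnt
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt dracut --force --regenerate-all
for d in dev proc sys; do umount /mnt/$d; done
umount /mnt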

#17 Updated by nicksinger about 1 month ago

Well, it seems this custom image was (blindly? Or was I pressing Enter where no console showed up?) reinstalling the worker. At least that is how it looks to me:

0:openqaworker3:~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0         7:0    0  89.1M  1 loop  /parts/mp_0000
loop1         7:1    0  12.7M  1 loop  /parts/mp_0001
loop2         7:2    0  58.3M  1 loop  /mounts/mp_0000
loop3         7:3    0  72.6M  1 loop  /mounts/mp_0001
loop4         7:4    0   4.1M  1 loop  /mounts/mp_0002
loop5         7:5    0   1.8M  1 loop  /mounts/mp_0003
sda           8:0    0 931.5G  0 disk
├─sda1        8:1    0   9.8G  0 part
│ └─md4       9:4    0   9.8G  0 raid1 /mnt
├─sda2        8:2    0 995.6M  0 part
└─sda3        8:3    0 920.8G  0 part
  └─md0       9:0    0 920.8G  0 raid1
sdb           8:16   0 931.5G  0 disk
├─sdb1        8:17   0   9.8G  0 part
│ └─md4       9:4    0   9.8G  0 raid1 /mnt
├─sdb2        8:18   0 995.6M  0 part
└─sdb3        8:19   0 920.8G  0 part
  └─md0       9:0    0 920.8G  0 raid1
nvme0n1     259:0    0 372.6G  0 disk
└─nvme0n1p1 259:1    0 372.6G  0 part
nvme1n1     259:2    0 372.6G  0 disk
0:openqaworker3:~ # ls -lah /mnt/
total 132K
drwxr-xr-x 27 root root 4.0K Dec 11 12:15 .
drwxr-xr-x 23 root root  820 Dec 14 13:34 ..
drwxr-xr-x  2 root root 4.0K Apr 13  2018 bin
drwxr-xr-x  3 root root 4.0K Dec 11 12:15 boot
-rw-r--r--  1 root root  893 Apr 13  2018 bootincluded_archives.filelist
-rw-r--r--  1 root root  816 May 24  2013 build-custom
drwxr-xr-x  3 root root 4.0K Apr 13  2018 config
drwxr-xr-x  3 root root 4.0K Apr 13  2018 dev
drwxr-xr-x 90 root root 4.0K Dec 14 11:34 etc
drwxr-xr-x  2 root root 4.0K Jun 27  2017 home
drwxr-xr-x  2 root root 4.0K Dec 11 12:14 kiwi-hooks
drwxr-xr-x  2 root root 4.0K Apr 13  2018 kvm
drwxr-xr-x  2 root root 4.0K Apr 13  2018 kvm_lock_sync
drwxr-xr-x 10 root root 4.0K Apr 13  2018 lib
drwxr-xr-x  7 root root 4.0K Apr 13  2018 lib64
drwx------  2 root root  16K Apr 13  2018 lost+found
drwxr-xr-x  2 root root 4.0K Jun 27  2017 mnt
drwxr-xr-x  2 root root 4.0K Jun 27  2017 opt
drwxr-xr-x  2 root root 4.0K Apr 13  2018 proc
drwx------  4 root root 4.0K Apr 13  2018 root
drwxr-xr-x 16 root root 4.0K Apr 13  2018 run
drwxr-xr-x  2 root root  12K Dec 11 12:14 sbin
drwxr-xr-x  2 root root 4.0K Jun 27  2017 selinux
drwxr-xr-x  5 root root 4.0K Apr 13  2018 srv
drwxr-xr-x  3 root root 4.0K May 24  2013 studio
dr-xr-xr-x  2 root root 4.0K Jun 27  2017 sys
drwxrwxrwt  9 root root 4.0K Dec 14 11:34 tmp
drwxr-xr-x 13 root root 4.0K Apr 13  2018 usr
drwxr-xr-x 11 root root 4.0K Dec 11 12:15 var

Guess it's time for a reinstall…

#18 Updated by nicksinger about 1 month ago

  • Status changed from In Progress to Feedback

The re-installation went quite smoothly. After the initial install was done I re-added the machine to salt and applied a highstate, rebooted, adjusted /etc/salt/grains to include the worker and nvme_store roles, ran another highstate and another reboot, and with that got the NVMes set up as a RAID for /var/lib/openqa and all MM interfaces/bridges up and running.
A first glimpse at https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 as well as https://stats.openqa-monitor.qa.suse.de/alerting/list looks good (no alerts). A test on public cloud also looks good so far: https://openqa.suse.de/tests/5173202#
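
Roughly the sequence, as a sketch (the grain values below are an assumption based on our salt roles, not copied from the machine):

# on OSD (salt master): accept the freshly installed minion and apply the highstate
salt-key -a openqaworker3.suse.de
salt 'openqaworker3.suse.de' state.highstate

# on the worker: add the roles, then restart the minion
cat >> /etc/salt/grains <<'EOF'
roles:
  - worker
  - nvme_store
EOF
systemctl restart salt-minion

# on OSD again: apply the highstate once more, then reboot the worker
salt 'openqaworker3.suse.de' state.highstate
salt 'openqaworker3.suse.de' system.reboot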

#19 Updated by nicksinger about 17 hours ago

  • Status changed from Feedback to Resolved
