action #78010
unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN)
Description
Observation
alert by email:
From: Monitoring User nagios@suse.de resent from: okurz@suse.com
To: okurz@suse.com
Date: 16/11/2020 10.01
Spam Status: Spamassassin
Notification: PROBLEM
Host: openqaworker3.suse.de
State: DOWN
Date/Time: Mon Nov 16 09:01:00 UTC 2020
Info: CRITICAL - 10.160.0.243: Host unreachable @ 10.160.0.44. rta nan, lost 100%
See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=1&host=openqaworker3.suse.de
Acceptance criteria
- AC1: openqaworker3 is "reboot-safe", e.g. at least 10 reboots in a row end up in a successfully booted system
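A minimal sketch of how AC1 could be verified from a control host; the host name, the timeouts and the /var/lib/openqa check are assumptions, not an agreed procedure:
#!/bin/sh
# Sketch only: reboot the worker 10 times and check that it comes back with
# ssh reachable and /var/lib/openqa mounted. Timeouts are guesses.
host=openqaworker3.suse.de
for i in $(seq 1 10); do
    echo "reboot cycle $i"
    ssh "root@$host" reboot || true   # the connection may drop mid-command
    sleep 120                         # give the machine time to actually go down
    ok=0
    for try in $(seq 1 60); do        # wait up to ~10 minutes
        if ssh -o ConnectTimeout=5 "root@$host" 'findmnt /var/lib/openqa' >/dev/null 2>&1; then
            ok=1; break
        fi
        sleep 10
    done
    [ "$ok" = 1 ] || { echo "cycle $i FAILED"; exit 1; }
done
echo "all 10 reboot cycles succeeded"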
Related issues
History
#2
Updated by nicksinger 2 months ago
Guess that was me. A simple "chassis power cycle" did the trick.
#4
Updated by okurz 2 months ago
- Assignee changed from okurz to nicksinger
nicksinger: Interesting. But do you know what caused the problem then?
#5
Updated by nicksinger about 2 months ago
- Assignee changed from nicksinger to okurz
No. SOL was stuck once again and I just triggered a reboot.
#6
Updated by okurz about 2 months ago
- Status changed from Feedback to Resolved
- Assignee changed from okurz to nicksinger
hm, ok. I checked again, everything seems to be in order. I don't know what else we can do. Ok, thanks for fixing it.
#7
Updated by okurz about 2 months ago
- Related to action #68050: openqaworker3 fails to come up on reboot, openqa_nvme_format.service failed added
#8
Updated by okurz about 2 months ago
- Related to action #71098: openqaworker3 down but no alert was raised added
#9
Updated by okurz about 2 months ago
- Subject changed from [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN to unreliable reboots on openqaworker3, likely due to openqa_nvme_format (was: [alert] PROBLEM Host Alert: openqaworker3.suse.de is DOWN)
- Description updated (diff)
- Due date deleted (2020-11-18)
- Status changed from Resolved to Workable
- Priority changed from Urgent to High
ok. I know what we can do. This failed again after the weekly automatic reboot that is triggered whenever there are kernel updates. openqaworker3 was stuck in emergency mode, and this time a power reset also ended up in a similar situation. It could be that just rebooting once more would work, but as we have had problems repeatedly and still have them, we should test at least 5-10 reboots in a row.
In sol activate I found:
[FAILED] Failed to start Setup NVMe before mounting it.
See 'systemctl status openqa_nvme_format.service' for details.
[DEPEND] Dependency failed for /var/lib/openqa.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for /var/lib/openqa/share.
[DEPEND] Dependency failed for Remote File Systems.
[DEPEND] Dependency failed for openQA Worker #7.
[DEPEND] Dependency failed for openQA Worker #12.
[DEPEND] Dependency failed for openQA Worker #8.
[DEPEND] Dependency failed for openQA Worker #5.
[DEPEND] Dependency failed for openQA Worker #2.
[DEPEND] Dependency failed for openQA Worker #3.
[DEPEND] Dependency failed for openQA Worker #9.
[DEPEND] Dependency failed for openQA Worker #1.
[DEPEND] Dependency failed for openQA Worker #13.
[DEPEND] Dependency failed for openQA Worker #11.
[DEPEND] Dependency failed for openQA Worker #4.
[DEPEND] Dependency failed for openQA Worker #6.
[DEPEND] Dependency failed for openQA Worker #10.
Seems I did not manage to fix that properly in #68050.
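For reference, a few generic commands that should show from the emergency shell (or via sol activate) why openqa_nvme_format failed; the unit and device names are just the ones visible above:
systemctl status openqa_nvme_format.service
journalctl -b -u openqa_nvme_format.service   # full output of the failed run
cat /proc/mdstat                              # is an array already (partially) assembled?
mdadm --detail --scan                         # does an "openqa" array exist?
lsblk -o NAME,TYPE,MOUNTPOINT /dev/nvme?n1    # state of the NVMe devices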
#10
Updated by coolo about 2 months ago
but the ssh service also failed
#11
Updated by nicksinger about 2 months ago
I've rejected the salt key on OSD for now to prevent automatic startup of the workers. What I found while booting is that the system hangs for quite some time with this being the last message printed:
[ 1034.526144] kexec_file: kernel signature verification failed (-129)
which seems to fit with okurz's suggestion that this is caused by kernel updates.
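A couple of generic commands to see where that kexec_file message actually comes from and whether it is what delays the boot (just a sketch, nothing verified on this machine yet):
journalctl -k -b | grep -i kexec      # kernel log context around the message
systemd-analyze blame | head -n 20    # which units actually take the time
systemd-analyze critical-chain        # where the boot is stuck waiting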
#12
Updated by okurz about 2 months ago
nicksinger wrote:
[…] seems to fit with okurz's suggestion that this is caused by kernel updates.
But what I meant is only that some package upgrades, for example kernel updates, trigger a reboot, which is all according to plan. Have you seen #78010#note-9 regarding the systemd services? I suspect that this is simply again (or still) a problem of systemd service dependencies and fully within our control.
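If it is indeed an ordering problem, a sketch of how that could be checked (unit names taken from the log in #note-9):
systemctl cat openqa_nvme_format.service          # effective Requires=/After=
systemctl list-dependencies var-lib-openqa.mount
systemctl list-dependencies --before openqa_nvme_format.service
systemd-analyze critical-chain var-lib-openqa.mount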
#13
Updated by okurz about 2 months ago
Please see #77011#note-18 for the changes that fvogt has applied. We should cross-check the changes he made, commit them to salt, ensure they are applicable to all machines, and then apply the same everywhere, on o3 and osd.
#14
Updated by nicksinger about 2 months ago
So I diffed against what fvogt did:
diff --git a/openqa/nvme_store/openqa_nvme_create.service b/openqa/nvme_store/openqa_nvme_create.service
new file mode 100644
index 0000000..9e9b57b
--- /dev/null
+++ b/openqa/nvme_store/openqa_nvme_create.service
@@ -0,0 +1,20 @@
+[Unit]
+Description=Create array on NVMe if necessary
+# Let's hope this is close enough to "all nvmes present"
+Requires=dev-nvme0n1.device
+After=dev-nvme0n1.device
+DefaultDependencies=no
+
+[Service]
+Type=oneshot
+
+# It's not really possible to wait for that to happen, so do it here
+ExecStart=/bin/sh -c "if ! mdadm --detail --scan | grep -qi openqa; then mdadm --assemble --scan || true; fi"
+# Create striped storage for openQA from all NVMe devices when / resides on
+# another device or from a potential third NVMe partition when there is only a
+# single NVMe device for the complete storage
+# For some reason mdadm --create doesn't send an udev event, so do it manually.
+ExecStart=/bin/sh -c -e 'if lsblk -n | grep -q "raid"; then exit 0; fi; if lsblk -n | grep -v nvme | grep -q "/$"; then mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1; else mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3; fi; udevadm trigger -c add /dev/md/openqa'
+
+[Install]
+WantedBy=multi-user.target
diff --git a/openqa/nvme_store/openqa_nvme_format.service b/openqa/nvme_store/openqa_nvme_format.service
index f6f1c16..8e9d97c 100644
--- a/openqa/nvme_store/openqa_nvme_format.service
+++ b/openqa/nvme_store/openqa_nvme_format.service
@@ -1,19 +1,12 @@
[Unit]
-Description=Setup NVMe before mounting it
+Description=Create Ext2 FS on /dev/md/openqa
Before=var-lib-openqa.mount
+Requires=dev-md-openqa.device
+After=dev-md-openqa.device
DefaultDependencies=no
[Service]
Type=oneshot
-
-# Create striped storage for openQA from all NVMe devices when / resides on
-# another device or from a potential third NVMe partition when there is only a
-# single NVMe device for the complete storage
-ExecStart=/bin/sh -c 'lsblk -n | grep -q "raid" || lsblk -n | grep -v nvme | grep "/$" && (mdadm --stop /dev/md/openqa >/dev/null 2>&1; mdadm --create /dev/md/openqa --level=0 --force --raid-devices=$(ls /dev/nvme?n1 | wc -l) --run /dev/nvme?n1) || mdadm --create /dev/md/openqa --level=0 --force --raid-devices=1 --run /dev/nvme0n1p3'
-# Ensure device is correctly initialized but also spend a little time before
-# trying to create a filesystem to prevent a "busy" error
-ExecStart=/bin/sh -c 'grep nvme /proc/mdstat'
-ExecStart=/bin/sh -c 'mdadm --detail --scan | grep openqa'
ExecStart=/sbin/mkfs.ext2 -F /dev/md/openqa
[Install]
I'm not convinced yet that this will resolve our problem. I'd rather look into why our array devices are busy in the first place.
I suspect that something assembles our array before we format it and therefore assume we have a dependency problem.
From a first, quick look I think it could be related to mdmonitor.service. From https://linux.die.net/man/8/mdadm:
[…] all arrays listed in the configuration file will be monitored. Further, if --scan is given, then any other md devices that appear in /proc/mdstat will also be monitored.
Which would explain why the device is busy. I will experiment with some dependencies for our service to run before mdmonitor.service.
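As a first experiment, just a sketch of what I have in mind: order our unit before mdmonitor.service via a drop-in (assuming mdmonitor really is what keeps the array busy):
# create an ordering drop-in for the existing unit; nothing else about the
# service changes
mkdir -p /etc/systemd/system/openqa_nvme_format.service.d
cat > /etc/systemd/system/openqa_nvme_format.service.d/order.conf <<'EOF'
[Unit]
Before=mdmonitor.service
EOF
systemctl daemon-reload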
#15
Updated by nicksinger about 1 month ago
- Status changed from Workable to Blocked
Hm, I had the bright idea of removing the mdraid module from the initramfs. This now causes the machine to hang in the dracut recovery shell (so even earlier than before). I wanted to rebuild the initramfs but failed to boot any recovery media. I've now opened [RT-ADM #182375] AutoReply: Machine openqaworker3.suse.de boots wrong PXE image to address the fact that PXE on that machine seems to boot some infra machine image. And of course I forgot to add osd-admins@suse.de to CC, so I will keep you updated here…
#16
Updated by nicksinger about 1 month ago
- Status changed from Blocked to In Progress
The machine can boot from PXE again. Apparently we had specific PXE images for each openqaworker back in the day, interesting :)
I was able to boot a TW installer in which I can hopefully recover the initramfs again.
#17
Updated by nicksinger about 1 month ago
Well, it seems like this custom image was (blindly? Or was I pressing Enter where no console showed up?) reinstalling the worker. At least this is how it looks to me:
0:openqaworker3:~ # lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0         7:0    0  89.1M  1 loop  /parts/mp_0000
loop1         7:1    0  12.7M  1 loop  /parts/mp_0001
loop2         7:2    0  58.3M  1 loop  /mounts/mp_0000
loop3         7:3    0  72.6M  1 loop  /mounts/mp_0001
loop4         7:4    0   4.1M  1 loop  /mounts/mp_0002
loop5         7:5    0   1.8M  1 loop  /mounts/mp_0003
sda           8:0    0 931.5G  0 disk
├─sda1        8:1    0   9.8G  0 part
│ └─md4       9:4    0   9.8G  0 raid1 /mnt
├─sda2        8:2    0 995.6M  0 part
└─sda3        8:3    0 920.8G  0 part
  └─md0       9:0    0 920.8G  0 raid1
sdb           8:16   0 931.5G  0 disk
├─sdb1        8:17   0   9.8G  0 part
│ └─md4       9:4    0   9.8G  0 raid1 /mnt
├─sdb2        8:18   0 995.6M  0 part
└─sdb3        8:19   0 920.8G  0 part
  └─md0       9:0    0 920.8G  0 raid1
nvme0n1     259:0    0 372.6G  0 disk
└─nvme0n1p1 259:1    0 372.6G  0 part
nvme1n1     259:2    0 372.6G  0 disk
0:openqaworker3:~ # ls -lah /mnt/
total 132K
drwxr-xr-x 27 root root 4.0K Dec 11 12:15 .
drwxr-xr-x 23 root root  820 Dec 14 13:34 ..
drwxr-xr-x  2 root root 4.0K Apr 13 2018 bin
drwxr-xr-x  3 root root 4.0K Dec 11 12:15 boot
-rw-r--r--  1 root root  893 Apr 13 2018 bootincluded_archives.filelist
-rw-r--r--  1 root root  816 May 24 2013 build-custom
drwxr-xr-x  3 root root 4.0K Apr 13 2018 config
drwxr-xr-x  3 root root 4.0K Apr 13 2018 dev
drwxr-xr-x 90 root root 4.0K Dec 14 11:34 etc
drwxr-xr-x  2 root root 4.0K Jun 27 2017 home
drwxr-xr-x  2 root root 4.0K Dec 11 12:14 kiwi-hooks
drwxr-xr-x  2 root root 4.0K Apr 13 2018 kvm
drwxr-xr-x  2 root root 4.0K Apr 13 2018 kvm_lock_sync
drwxr-xr-x 10 root root 4.0K Apr 13 2018 lib
drwxr-xr-x  7 root root 4.0K Apr 13 2018 lib64
drwx------  2 root root  16K Apr 13 2018 lost+found
drwxr-xr-x  2 root root 4.0K Jun 27 2017 mnt
drwxr-xr-x  2 root root 4.0K Jun 27 2017 opt
drwxr-xr-x  2 root root 4.0K Apr 13 2018 proc
drwx------  4 root root 4.0K Apr 13 2018 root
drwxr-xr-x 16 root root 4.0K Apr 13 2018 run
drwxr-xr-x  2 root root  12K Dec 11 12:14 sbin
drwxr-xr-x  2 root root 4.0K Jun 27 2017 selinux
drwxr-xr-x  5 root root 4.0K Apr 13 2018 srv
drwxr-xr-x  3 root root 4.0K May 24 2013 studio
dr-xr-xr-x  2 root root 4.0K Jun 27 2017 sys
drwxrwxrwt  9 root root 4.0K Dec 14 11:34 tmp
drwxr-xr-x 13 root root 4.0K Apr 13 2018 usr
drwxr-xr-x 11 root root 4.0K Dec 11 12:15 var
Guess it's time for a reinstall…
#18
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Feedback
The re-installation went quite smoothly. After the initial install was done I re-added the machine into salt and applied a highstate, then rebooted, adjusted /etc/salt/grains to include the worker and nvme_store roles, ran another highstate and another reboot, and with that got the NVMes as RAID for /var/lib/openqa and all MM interfaces/bridges up and running.
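Roughly that sequence, as a sketch for reference; the minion id and the salt target pattern are assumptions:
# on OSD
salt-key -a openqaworker3.suse.de        # re-accept the minion key
salt 'openqaworker3*' state.apply        # first highstate, then reboot the worker
# on the worker: add the "worker" and "nvme_store" roles to /etc/salt/grains
salt 'openqaworker3*' state.apply        # second highstate, reboot again
# verify after the final reboot
cat /proc/mdstat
findmnt /var/lib/openqa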
A first glimpse at https://stats.openqa-monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 as well as https://stats.openqa-monitor.qa.suse.de/alerting/list looks good (no alerts). Also a test on public cloud looks good so far: https://openqa.suse.de/tests/5173202#
#19
Updated by nicksinger about 17 hours ago
- Status changed from Feedback to Resolved