action #91803 (closed)
openqaworker8 failed to reboot: [Alerting] openqaworker8: host up alert on 2021-04-25
Description
Observation
We got an alerting email on 2021-04-25: "openqaworker8: host up alert". Connecting to that machine with ipmitool -I lanplus -C 3 -H openqaworker8-ipmi.suse.de sol activate showed:
[SOL Session operational. Use ~? for help]
Login incorrect
Give root password for maintenance
(or press Control-D to continue):
After pressing Control-D, the machine started to boot.
Acceptance criteria
- AC1: The machine reboots successfully
Updated by okurz over 3 years ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by livdywan over 3 years ago
- Status changed from New to Feedback
- Assignee set to osukup
Checking this with @osukup: IPMI is responsive and ends up at a login prompt:
ipmitool -I lanplus -C 3 -H openqaworker8-ipmi.suse.de -U ADMIN -P ADMIN sol activate
Welcome to openSUSE Leap 15.2 - Kernel 5.3.18-lp152.72-default (ttyS1).
openqaworker8 login:
Going in via SSH:
cdywan@openqaworker8:~> systemctl list-units --failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
According to https://openqa.suse.de/admin/workers, openqaworker8 is healthy and took a job 1h ago.
Updated by mkittler over 3 years ago
I bet this was only caused by the RAID/NVMe setup as the corresponding systemd unit failed at that time, see journalctl --since '10 days ago' -u openqa_nvme_format.service:
Apr 25 03:34:17 openqaworker8 systemd[1]: Starting Setup NVMe before mounting it...
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Current mount points (printed for debugging purposes):
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: devtmpfs on /dev type devtmpfs (rw,nosuid,size=131888876k,nr_inodes=32972219,mode=755)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=22,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=1203)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mqueue on /dev/mqueue type mqueue (rw,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: debugfs on /sys/kernel/debug type debugfs (rw,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/libvirt/images type btrfs (rw,relatime,ssd,space_cache,subvolid=269,subvol=/@/var/lib/libvirt/images)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/opt)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/machines type btrfs (rw,relatime,ssd,space_cache,subvolid=270,subvol=/@/var/lib/machines)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/mysql type btrfs (rw,relatime,ssd,space_cache,subvolid=273,subvol=/@/var/lib/mysql)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/srv)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/crash type btrfs (rw,relatime,ssd,space_cache,subvolid=268,subvol=/@/var/crash)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/tmp)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/cache type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/var/cache)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/home)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/spool type btrfs (rw,relatime,ssd,space_cache,subvolid=278,subvol=/@/var/spool)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/.snapshots)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/usr/local)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/log type btrfs (rw,relatime,ssd,space_cache,subvolid=276,subvol=/@/var/log)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/opt type btrfs (rw,relatime,ssd,space_cache,subvolid=277,subvol=/@/var/opt)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/named type btrfs (rw,relatime,ssd,space_cache,subvolid=274,subvol=/@/var/lib/named)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/mariadb type btrfs (rw,relatime,ssd,space_cache,subvolid=272,subvol=/@/var/lib/mariadb)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/boot/grub2/i386-pc)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/mailman type btrfs (rw,relatime,ssd,space_cache,subvolid=271,subvol=/@/var/lib/mailman)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/pgsql type btrfs (rw,relatime,ssd,space_cache,subvolid=275,subvol=/@/var/lib/pgsql)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/boot/grub2/x86_64-efi)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=279,subvol=/@/var/tmp)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0002,dmask=0002,allow_utime=0020,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Present block devices (printed for debugging purposes):
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: nvme0n1 259:0 0 745.2G 0 disk
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: ├─nvme0n1p1 259:1 0 156M 0 part /boot/efi
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: ├─nvme0n1p2 259:2 0 100G 0 part /
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: └─nvme0n1p3 259:3 0 645.1G 0 part
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: └─md127 9:127 0 645G 0 raid0
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Stopping current RAID "/dev/md/openqa"
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: stopped /dev/md/openqa
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: /dev/nvme0n1p3 appears to contain an ext2fs file system
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: size=676393984K mtime=Tue Jul 9 14:34:57 2019
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: level=raid0 devices=1 ctime=Sun Apr 11 03:34:28 2021
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: unexpected failure opening /dev/md127
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Unable to create RAID, mdadm returned with non-zero code
Apr 25 03:34:17 openqaworker8 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 03:34:17 openqaworker8 systemd[1]: Failed to start Setup NVMe before mounting it.
Apr 25 03:34:18 openqaworker8 systemd[1]: openqa_nvme_format.service: Unit entered failed state.
Apr 25 03:34:18 openqaworker8 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
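Condensed, the failing unit apparently performs a stop-then-recreate sequence. The sketch below only prints the commands instead of executing them; the exact flags are an assumption based on the log above, the real script is openqa-establish-nvme-setup from the salt-states-openqa repository.

```shell
#!/bin/sh
# Replay of the sequence visible in the journal, printed instead of executed.
# Flag set is an assumption inferred from the log; the real script is
# openqa-establish-nvme-setup and may differ in details.
raid_recreate_plan() {
    raid=$1; shift
    echo "mdadm --stop $raid"
    echo "mdadm --create $raid --level=0 --raid-devices=$# --run $*"
}

raid_recreate_plan /dev/md/openqa /dev/nvme0n1p3
```

The second command is where the run above failed with "unexpected failure opening /dev/md127" after warning about stale ext2/RAID metadata on the member.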
Note that it would be helpful to collect logs at the time of the ticket creation.
Updated by okurz over 3 years ago
Note that it would be helpful to collect logs at the time of the ticket creation.
I assume that all logs are still on the system from the date of ticket creation. Or do you see something missing?
Updated by osukup over 3 years ago
openqaworker8 is now working, but something looks suspicious to me:
osukup@openqaworker8:~> sudo mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Sun May 2 03:34:17 2021
Raid Level : raid0
Array Size : 676261888 (644.93 GiB 692.49 GB)
Raid Devices : 1
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Sun May 2 03:34:17 2021
State : clean
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Layout : -unknown-
Chunk Size : 512K
Consistency Policy : none
Name : openqaworker8:openqa (local to host openqaworker8)
UUID : 6592bb0e:922fb707:47d72d7f:e4349f69
Events : 0
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme0n1p3
osukup@openqaworker8:~> sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 745,2G 0 disk
├─nvme0n1p1 259:1 0 156M 0 part /boot/efi
├─nvme0n1p2 259:2 0 100G 0 part /
└─nvme0n1p3 259:3 0 645,1G 0 part
└─md127 9:127 0 645G 0 raid0 /var/lib/openqa
RAID0 with only one device (I know, RAID0 is only a performance optimization, but still)
Updated by okurz over 3 years ago
osukup wrote:
RAID0 with only one device (I know, RAID0 is only a performance optimization, but still)
Good observation but no worries, this is by intention. Within https://gitlab.suse.de/openqa/salt-states-openqa we came up with recipes for using [0..N] NVMe devices for openQA use all with the same setup. In some cases for some machines we have 0 NVMe devices, in some 1, in some other 2 devices. This ensures a simple, consistent setup for all cases and also automatic fallback in case any of the NVMe devices fail (as happened in the past) as the worker machines can still run with a reduced set of devices just in degraded fashion.
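The [0..N]-devices idea can be sketched as follows; plan_raid is a hypothetical helper for illustration, not the actual salt recipe, and the fallback wording is an assumption:

```shell
#!/bin/sh
# Hypothetical planner for the "same setup for 0..N NVMe devices" approach;
# it prints the action it would take for a given list of member devices.
plan_raid() {
    if [ "$#" -eq 0 ]; then
        echo "no spare NVMe devices, keeping /var/lib/openqa on the root disk"
    else
        echo "mdadm --create /dev/md/openqa --level=0 --raid-devices=$# --run $*"
    fi
}

plan_raid                             # worker without dedicated NVMes
plan_raid /dev/nvme0n1p3              # openqaworker8: a single partition
plan_raid /dev/nvme0n1 /dev/nvme1n1   # worker with two dedicated NVMes
```

The same code path handles every worker; losing one NVMe device just shrinks the member list on the next boot, which is the degraded-but-running fallback described above.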
Updated by osukup over 3 years ago
Maybe we can add --assume-clean to the mdadm command?
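Applied to the create call seen in the journal, the suggestion amounts to one extra flag; the sketch below just prints the resulting command, and where exactly the salt state injects the flag is an assumption:

```shell
#!/bin/sh
# --assume-clean tells mdadm the members can be treated as clean, so it
# skips initialising them; the hope here is that it also stops mdadm from
# tripping over leftover metadata on the partition.
mdadm_with_assume_clean() {
    raid=$1; shift
    echo "mdadm --create $raid --level=0 --assume-clean --raid-devices=$# --run $*"
}

mdadm_with_assume_clean /dev/md/openqa /dev/nvme0n1p3
```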
Updated by okurz over 3 years ago
A very good idea. Can you add that in https://gitlab.suse.de/openqa/salt-states-openqa? To try it out, e.g. on openqaworker8, I suggest you run systemctl disable --now salt-minion telegraf && systemctl mask salt-minion telegraf, adapt the files manually (I think in /etc/systemd/system/…) and reboot
Updated by osukup over 3 years ago
okurz wrote:
A very good idea. Can you add that in https://gitlab.suse.de/openqa/salt-states-openqa? To try it out, e.g. on openqaworker8, I suggest you run systemctl disable --now salt-minion telegraf && systemctl mask salt-minion telegraf, adapt the files manually (I think in /etc/systemd/system/…) and reboot
Done, booted without any visible issues.
Updated by osukup over 3 years ago
Updated by okurz over 3 years ago
@mkittler @osukup MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/486 is merged and in deployment right now. Do you think the mentioned change fixes the issue that caused the reboot alert, or is this a side-fix?
Updated by mkittler over 3 years ago
I assume that all logs are still on the system from the date of ticket creation. Or do you see something missing?
No, but in general it can take quite a while until a ticket is picked up and then the logs might be gone.
Do you think the mentioned change does fix the issue that caused the reboot alert or is this side-fix?
Maybe it fixed it. I suppose the boot was really prevented by the RAID/NVMe setup. Whether adding --assume-clean helps is of course a different story. Considering we've got the log message "mdadm: /dev/nvme0n1p3 appears to contain an ext2fs file system" it might help. The final error message "mdadm: unexpected failure opening /dev/md127" is unfortunately quite unspecific.
Updated by osukup over 3 years ago
mkittler wrote:
Maybe it fixed it. I suppose the boot was really prevented by the RAID/NVMe setup. Whether adding --assume-clean helps is of course a different story. Considering we've got the log message "mdadm: /dev/nvme0n1p3 appears to contain an ext2fs file system" it might help. The final error message "mdadm: unexpected failure opening /dev/md127" is unfortunately quite unspecific.
Agreed, I see this more as a workaround -- the question is why mdadm saw an ext2 fs on the device ...
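For background: the "appears to contain an ext2fs file system" warning comes from mdadm scanning the member for known superblock magics; ext2 keeps its magic 0xEF53 at byte offset 1080 (superblock at offset 1024, s_magic at offset 56 within it). The snippet below reproduces that detection on a scratch file; wiping such stale signatures from the member beforehand (e.g. with wipefs -a, which was not done in this ticket) would be one conceivable way to silence the warning:

```shell
#!/bin/sh
# Plant the ext2 magic in a scratch file and read it back, mimicking what
# mdadm's signature scan sees on /dev/nvme0n1p3.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=2 2>/dev/null
# 0xEF53 is stored little-endian, i.e. bytes 0x53 0xEF at offset 1080
printf '\123\357' | dd of="$img" bs=1 seek=1080 conv=notrunc 2>/dev/null
magic=$(od -An -tx1 -j1080 -N2 "$img" | tr -d ' \n')
echo "signature bytes at offset 1080: $magic"
rm -f "$img"
```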
Updated by okurz over 3 years ago
osukup wrote:
Agreed, I see this more as a workaround -- the question is why mdadm saw an ext2 fs on the device ...
maybe it's trying to be "too smart" :D