Project

General

Profile

Actions

action #91803

closed

openqaworker8 failed to reboot: [Alerting] openqaworker8: host up alert on 2021-04-25

Added by Xiaojing_liu almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2021-04-27
Due date:
% Done:

0%

Estimated time:

Description

Observation

We got an alerting email on 2021-04-25 openqaworker8: host up alert
Connected to that machine using ipmitool -I lanplus -C 3 -H openqaworker8-ipmi.suse.de sol activate, the content showed that:

[SOL Session operational.  Use ~? for help]

Login incorrect

Give root password for maintenance
(or press Control-D to continue):

Then pressed Control-D, the machine started to boot.

Acceptance criteria

  • AC1: The machine reboot successfully
Actions #1

Updated by okurz almost 3 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by livdywan almost 3 years ago

  • Status changed from New to Feedback
  • Assignee set to osukup

So checking this with @osukup ipmi is responsive and ends up at a login prompt:

ipmitool -I lanplus -C 3 -H openqaworker8-ipmi.suse.de -U ADMIN -P ADMIN sol activate

Welcome to openSUSE Leap 15.2 - Kernel 5.3.18-lp152.72-default (ttyS1).


openqaworker8 login:

Going in via SSH:

cdywan@openqaworker8:~> systemctl list-units --failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

According to https://openqa.suse.de/admin/workers openqaworker8 is healthy and took a job 1h ago

Actions #3

Updated by mkittler almost 3 years ago

I bet this was only caused by the RAID/NVMe setup as the corresponding systemd unit failed at that time, see journalctl --since '10 days ago' -u openqa_nvme_format.service:

Apr 25 03:34:17 openqaworker8 systemd[1]: Starting Setup NVMe before mounting it...
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Current mount points (printed for debugging purposes):
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: devtmpfs on /dev type devtmpfs (rw,nosuid,size=131888876k,nr_inodes=32972219,mode=755)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=22,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=1203)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mqueue on /dev/mqueue type mqueue (rw,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: debugfs on /sys/kernel/debug type debugfs (rw,relatime)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/libvirt/images type btrfs (rw,relatime,ssd,space_cache,subvolid=269,subvol=/@/var/lib/libvirt/images)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /opt type btrfs (rw,relatime,ssd,space_cache,subvolid=263,subvol=/@/opt)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/machines type btrfs (rw,relatime,ssd,space_cache,subvolid=270,subvol=/@/var/lib/machines)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/mysql type btrfs (rw,relatime,ssd,space_cache,subvolid=273,subvol=/@/var/lib/mysql)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /srv type btrfs (rw,relatime,ssd,space_cache,subvolid=264,subvol=/@/srv)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/crash type btrfs (rw,relatime,ssd,space_cache,subvolid=268,subvol=/@/var/crash)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=265,subvol=/@/tmp)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/cache type btrfs (rw,relatime,ssd,space_cache,subvolid=267,subvol=/@/var/cache)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /home type btrfs (rw,relatime,ssd,space_cache,subvolid=262,subvol=/@/home)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/spool type btrfs (rw,relatime,ssd,space_cache,subvolid=278,subvol=/@/var/spool)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /.snapshots type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@/.snapshots)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /usr/local type btrfs (rw,relatime,ssd,space_cache,subvolid=266,subvol=/@/usr/local)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/log type btrfs (rw,relatime,ssd,space_cache,subvolid=276,subvol=/@/var/log)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/opt type btrfs (rw,relatime,ssd,space_cache,subvolid=277,subvol=/@/var/opt)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/named type btrfs (rw,relatime,ssd,space_cache,subvolid=274,subvol=/@/var/lib/named)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/mariadb type btrfs (rw,relatime,ssd,space_cache,subvolid=272,subvol=/@/var/lib/mariadb)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (rw,relatime,ssd,space_cache,subvolid=260,subvol=/@/boot/grub2/i386-pc)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/mailman type btrfs (rw,relatime,ssd,space_cache,subvolid=271,subvol=/@/var/lib/mailman)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/lib/pgsql type btrfs (rw,relatime,ssd,space_cache,subvolid=275,subvol=/@/var/lib/pgsql)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (rw,relatime,ssd,space_cache,subvolid=261,subvol=/@/boot/grub2/x86_64-efi)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p2 on /var/tmp type btrfs (rw,relatime,ssd,space_cache,subvolid=279,subvol=/@/var/tmp)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: /dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0002,dmask=0002,allow_utime=0020,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Present block devices (printed for debugging purposes):
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: nvme0n1     259:0    0 745.2G  0 disk
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: ├─nvme0n1p1 259:1    0   156M  0 part  /boot/efi
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: ├─nvme0n1p2 259:2    0   100G  0 part  /
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: └─nvme0n1p3 259:3    0 645.1G  0 part
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]:   └─md127     9:127  0   645G  0 raid0
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Stopping current RAID "/dev/md/openqa"
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: stopped /dev/md/openqa
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Creating RAID0 "/dev/md/openqa" on: /dev/nvme0n1p3
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: /dev/nvme0n1p3 appears to contain an ext2fs file system
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]:        size=676393984K  mtime=Tue Jul  9 14:34:57 2019
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: /dev/nvme0n1p3 appears to be part of a raid array:
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]:        level=raid0 devices=1 ctime=Sun Apr 11 03:34:28 2021
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: mdadm: unexpected failure opening /dev/md127
Apr 25 03:34:17 openqaworker8 openqa-establish-nvme-setup[1297]: Unable to create RAID, mdadm returned with non-zero code
Apr 25 03:34:17 openqaworker8 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 03:34:17 openqaworker8 systemd[1]: Failed to start Setup NVMe before mounting it.
Apr 25 03:34:18 openqaworker8 systemd[1]: openqa_nvme_format.service: Unit entered failed state.
Apr 25 03:34:18 openqaworker8 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.

Note that it would be helpful to collect logs at the time of the ticket creation.

Actions #4

Updated by okurz almost 3 years ago

Note that it would be helpful to collect logs at the time of the ticket creation.

I assume that all logs are still on the system from the date of ticket creation. Or do you see something missing?

Actions #5

Updated by osukup almost 3 years ago

openqaworker8 is now working, but what is suspicious for me:

osukup@openqaworker8:~> sudo mdadm --detail /dev/m
mapper/           md/               mem               mpt2ctl           mqueue/
mcelog            md127             memory_bandwidth  mpt3ctl
osukup@openqaworker8:~> sudo mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Sun May  2 03:34:17 2021
        Raid Level : raid0
        Array Size : 676261888 (644.93 GiB 692.49 GB)
      Raid Devices : 1
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Sun May  2 03:34:17 2021
             State : clean
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : openqaworker8:openqa  (local to host openqaworker8)
              UUID : 6592bb0e:922fb707:47d72d7f:e4349f69
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        3        0      active sync   /dev/nvme0n1p3
osukup@openqaworker8:~> sudo lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1     259:0    0 745,2G  0 disk
├─nvme0n1p1 259:1    0   156M  0 part  /boot/efi
├─nvme0n1p2 259:2    0   100G  0 part  /
└─nvme0n1p3 259:3    0 645,1G  0 part
  └─md127     9:127  0   645G  0 raid0 /var/lib/openqa

Raid 0 with only one device ( I know, RAID0 is only performance optimization but still)

Actions #6

Updated by okurz almost 3 years ago

osukup wrote:

Raid 0 with only one device ( I know, RAID0 is only performance optimization but still)

Good observation but no worries, this is by intention. Within https://gitlab.suse.de/openqa/salt-states-openqa we came up with recipes for using [0..N] NVMe devices for openQA use all with the same setup. In some cases for some machines we have 0 NVMe devices, in some 1, in some other 2 devices. This ensures a simple, consistent setup for all cases and also automatic fallback in case any of the NVMe devices fail (as happened in the past) as the worker machines can still run with a reduced set of devices just in degraded fashion.

Actions #7

Updated by osukup almost 3 years ago

maybe we can add --assume-clean to mdadm command ?

Actions #8

Updated by okurz almost 3 years ago

a very good idea. Can you add that in https://gitlab.suse.de/openqa/salt-states-openqa ? To try it out e.g. on openqaworker8 I suggest you do systemctl disable --now salt-minion telegraf && systemctl mask salt-minion telegraf, adapt the files manually (I think in /etc/systemd/system/…) and reboot

Actions #9

Updated by osukup almost 3 years ago

okurz wrote:

a very good idea. Can you add that in https://gitlab.suse.de/openqa/salt-states-openqa ? To try it out e.g. on openqaworker8 I suggest you do systemctl disable --now salt-minion telegraf && systemctl mask salt-minion telegraf, adapt the files manually (I think in /etc/systemd/system/…) and reboot

done , booted without any visible issues

Actions #11

Updated by okurz almost 3 years ago

@mkittler @osukup MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/486 merged and in deployment right now. Do you think the mentioned change does fix the issue that caused the reboot alert or is this side-fix?

Actions #12

Updated by mkittler almost 3 years ago

I assume that all logs are still on the system from the date of ticket creation. Or do you see something missing?

No, but in general it can take quite a while until a ticket is picked up and then the logs might be gone.

Do you think the mentioned change does fix the issue that caused the reboot alert or is this side-fix?

Maybe it fixed it. I suppose the boot was really prevented by the RAID/NVMe setup. Whether adding --assume-clean helps is of course a different story. Considering we've got the log message mdadm: /dev/nvme0n1p3 appears to contain an ext2fs file system it might help. The final error message mdadm: unexpected failure opening /dev/md127 is unfortunately quite unspecific.

Actions #13

Updated by osukup almost 3 years ago

mkittler wrote:

Maybe it fixed it. I suppose the boot was really prevented by the RAID/NVMe setup. Whether adding --assume-clean helps is of course a different story. Considering we've got the log message mdadm: /dev/nvme0n1p3 appears to contain an ext2fs file system it might help. The final error message mdadm: unexpected failure opening /dev/md127 is unfortunately quite unspecific.

agree, I see this more like workaround -- the question is why mdadm saw ext2 fs on the device ...

Actions #14

Updated by okurz almost 3 years ago

osukup wrote:

agree, I see this more like workaround -- the question is why mdadm saw ext2 fs on the device ...

maybe it's trying to be "too smart" :D

Actions #15

Updated by osukup almost 3 years ago

  • Status changed from Feedback to Resolved

it looks working

Actions

Also available in: Atom PDF