action #88900
Updated by okurz almost 4 years ago
## Observation

openqaworker13 was unreachable on 2021-02-22.

## Acceptance criteria

* **AC1:** machine is reboot-reliable, e.g. ensured by multiple reboots without failures after boot

## Problem

The machine cannot be reached: it does not respond to `ping`.

## Workaround

Connecting to the machine's serial console with `ipmitool sol activate` shows:

```
[FAILED] Failed to start save kernel crash dump.
See 'systemctl status kdump-save.service' for details.
Starting Reload Configuration from the Real Root...
[ OK ] Started Reload Configuration from the Real Root.
[ OK ] Reached target Initrd File Systems.
[ OK ] Reached target Initrd Default Target.
Starting dracut pre-pivot and cleanup hook...
[ OK ] Started dracut pre-pivot and cleanup hook.
Starting Cleaning Up and Shutting Down Daemons...
[ OK ] Stopped target Timers.
[ OK ] Stopped dracut pre-pivot and cleanup hook.
[ OK ] Stopped target Initrd Default Target.
[ OK ] Stopped target Basic System.
[ OK ] Stopped target Slices.
[ OK ] Stopped target Initrd Root Device.
[ OK ] Stopped target System Initialization.
[ OK ] Stopped Apply Kernel Variables.
[ OK ] Stopped Create Volatile Files and Directories.
[ OK ] Stopped target Local File Systems.
Unmounting /kdump/mnt1/var/crash...
Unmounting /kdump/mnt0...
Stopping udev Kernel Device Manager...
[ OK ] Stopped target Sockets.
[ OK ] Stopped Load Kernel Modules.
[ OK ] Stopped target Paths.
[ OK ] Stopped Dispatch Password Requests to Console Directory Watch.
[ OK ] Stopped target Swap.
[ OK ] Stopped target Remote File Systems.
[ OK ] Stopped target Remote File Systems (Pre).
[ OK ] Stopped dracut initqueue hook.
[ OK ] Stopped udev Coldplug all Devices.
[ OK ] Stopped udev Kernel Device Manager.
[ OK ] Unmounted /kdump/mnt1/var/crash.
[ OK ] Started Cleaning Up and Shutting Down Daemons.
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Stopped Create list of required sta…vice nodes for the current kernel.
[ OK ] Stopped dracut pre-udev hook.
[ OK ] Stopped dracut cmdline hook.
[ OK ] Stopped dracut ask for additional cmdline parameters.
[ OK ] Closed udev Control Socket.
[ OK ] Closed udev Kernel Socket.
Starting Cleanup udevd DB...
[ OK ] Started Cleanup udevd DB.
[ OK ] Reached target Switch Root.
Starting Switch Root...
[FAILED] Failed to start Switch Root.
See 'systemctl status initrd-switch-root.service' for details.
Starting Setup Virtual Console...
[ OK ] Unmounted /kdump/mnt0.
[ OK ] Stopped File System Check on /dev/d…3d875-0e24-49a9-8dc8-ddf8cf83f04e.
[ OK ] Started Setup Virtual Console.
[ OK ] Started Emergency Shell.
[ OK ] Reached target Emergency Mode.

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

Give root password for maintenance
(or press Control-D to continue):
[716615.675131] BTRFS info (device sda2): disk space caching is enabled
[716615.675134] BTRFS info (device sda2): has skinny extents
[716614.923484] dracut-cmdline[499]: mv: cannot stat '/etc/multipath.conf.kdump': No such file or directory
Unable to ioctl(KDSETLED) -- are you not on the console? (Inappropriate ioctl for device)
Extracting dmesg
-------------------------------------------------------------------------------
The dmesg log is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/dmesg.txt.
makedumpfile Completed.
-------------------------------------------------------------------------------
Saving dump using makedumpfile
-------------------------------------------------------------------------------
Copying data : [100.0 %] / eta: 0s
The dumpfile is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/vmcore.
makedumpfile Completed.
-------------------------------------------------------------------------------
```

The machine was then power cycled with `ipmitool power cycle`. The workers on that host work well for now.
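To make the two failures in the console output easier to spot, a captured serial-console log could be scanned for `[FAILED]` entries. The following is only a minimal sketch: the function name `failed_units` and the `sample` text are illustrative, and the regular expression assumes the message layout seen in the log above (`[FAILED] Failed to start <description>. See 'systemctl status <unit>' for details.`).

```python
import re

# Matches systemd console status messages of the form seen in this ticket:
#   [FAILED] Failed to start <description>. See 'systemctl status <unit>' for details.
# \s+ is used between the parts so the pattern tolerates a line break before "See".
FAILED_RE = re.compile(
    r"\[FAILED\]\s+Failed to start (.+?)\.\s+See 'systemctl status ([^']+)'"
)


def failed_units(console_text: str) -> list[tuple[str, str]]:
    """Return (description, unit name) pairs for every failed start, in log order."""
    return FAILED_RE.findall(console_text)


# Excerpt of the console output quoted in this ticket.
sample = (
    "[FAILED] Failed to start save kernel crash dump.\n"
    "See 'systemctl status kdump-save.service' for details.\n"
    "[ OK ] Reached target Switch Root.\n"
    "[FAILED] Failed to start Switch Root.\n"
    "See 'systemctl status initrd-switch-root.service' for details.\n"
)

if __name__ == "__main__":
    for description, unit in failed_units(sample):
        print(f"{unit}: {description}")
```

On the excerpt this reports `kdump-save.service` and `initrd-switch-root.service`, the two units that would be worth inspecting with `systemctl status` or `journalctl` from the emergency shell.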
## Suggestions

* Remove from salt
* Identify problem
* Ensure stability with multiple reboots
* Add back to salt after stable

## Workaround

Power cycle over IPMI
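AC1 and the "Ensure stability with multiple reboots" suggestion could be checked with a small loop that reboots the host several times and waits for it to answer `ping` again. This is a sketch under assumptions, not an existing openQA tool: `reboot_cycles`, `wait_for` and `ping_once` are hypothetical names, the `ping -c 1 -W 2` flags assume Linux iputils, and the `reboot` callable (for example an SSH `reboot` or an IPMI power cycle) has to be supplied by the caller.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str) -> bool:
    """True if one ICMP echo request to host is answered (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for(predicate: Callable[[], bool], attempts: int, interval: float,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate up to `attempts` times, sleeping `interval` seconds between tries."""
    for _ in range(attempts):
        if predicate():
            return True
        sleep(interval)
    return False


def reboot_cycles(host: str, cycles: int, reboot: Callable[[str], object],
                  is_up: Callable[[str], bool] = ping_once,
                  grace: float = 60.0, attempts: int = 60, interval: float = 10.0,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Reboot host `cycles` times and return how many cycles completed successfully.

    Stops at the first cycle after which the host does not come back.
    """
    for done in range(cycles):
        reboot(host)
        # Give the host time to actually go down, so a ping answered just
        # before shutdown is not mistaken for a successful boot.
        sleep(grace)
        if not wait_for(lambda: is_up(host), attempts, interval, sleep):
            return done
    return cycles
```

A call such as `reboot_cycles("openqaworker13", 5, reboot=my_reboot)` returning `5` would indicate five consecutive successful boots; `my_reboot` stands for whatever reboot mechanism is used. Reachability by `ping` is only a proxy here, so confirming that the worker services are running after each boot would still be needed.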