action #88900

Updated by okurz about 3 years ago

## Observation 
 openqaworker13 was unreachable on 2021-02-22. 

 ## Acceptance criteria 
 * **AC1:** machine is reboot-reliable, e.g. ensured by multiple reboots without failures after boot 

 ## Problem 
 The machine does not respond to `ping`. 

 ## Workaround 
 Connecting to the machine with `ipmitool sol activate` shows: 

 ``` 
 [FAILED] Failed to start save kernel crash dump. 
 See 'systemctl status kdump-save.service' for details. 
          Starting Reload Configuration from the Real Root... 
 [    OK    ] Started Reload Configuration from the Real Root. 
 [    OK    ] Reached target Initrd File Systems. 
 [    OK    ] Reached target Initrd Default Target. 
          Starting dracut pre-pivot and cleanup hook... 
 [    OK    ] Started dracut pre-pivot and cleanup hook. 
          Starting Cleaning Up and Shutting Down Daemons... 
 [    OK    ] Stopped target Timers. 
 [    OK    ] Stopped dracut pre-pivot and cleanup hook. 
 [    OK    ] Stopped target Initrd Default Target. 
 [    OK    ] Stopped target Basic System. 
 [    OK    ] Stopped target Slices. 
 [    OK    ] Stopped target Initrd Root Device. 
 [    OK    ] Stopped target System Initialization. 
 [    OK    ] Stopped Apply Kernel Variables. 
 [    OK    ] Stopped Create Volatile Files and Directories. 
 [    OK    ] Stopped target Local File Systems. 
          Unmounting /kdump/mnt1/var/crash... 
          Unmounting /kdump/mnt0... 
          Stopping udev Kernel Device Manager... 
 [    OK    ] Stopped target Sockets. 
 [    OK    ] Stopped Load Kernel Modules. 
 [    OK    ] Stopped target Paths. 
 [    OK    ] Stopped Dispatch Password Requests to Console Directory Watch. 
 [    OK    ] Stopped target Swap. 
 [    OK    ] Stopped target Remote File Systems. 
 [    OK    ] Stopped target Remote File Systems (Pre). 
 [    OK    ] Stopped dracut initqueue hook. 
 [    OK    ] Stopped udev Coldplug all Devices. 
 [    OK    ] Stopped udev Kernel Device Manager. 
 [    OK    ] Unmounted /kdump/mnt1/var/crash. 
 [    OK    ] Started Cleaning Up and Shutting Down Daemons. 
 [    OK    ] Stopped Create Static Device Nodes in /dev. 
 [    OK    ] Stopped Create list of required sta…vice nodes for the current kernel. 
 [    OK    ] Stopped dracut pre-udev hook. 
 [    OK    ] Stopped dracut cmdline hook. 
 [    OK    ] Stopped dracut ask for additional cmdline parameters. 
 [    OK    ] Closed udev Control Socket. 
 [    OK    ] Closed udev Kernel Socket. 
          Starting Cleanup udevd DB... 
 [    OK    ] Started Cleanup udevd DB. 
 [    OK    ] Reached target Switch Root. 
          Starting Switch Root... 
 [FAILED] Failed to start Switch Root. 
 See 'systemctl status initrd-switch-root.service' for details. 
          Starting Setup Virtual Console... 
 [    OK    ] Unmounted /kdump/mnt0. 
 [    OK    ] Stopped File System Check on /dev/d…3d875-0e24-49a9-8dc8-ddf8cf83f04e. 
 [    OK    ] Started Setup Virtual Console. 
 [    OK    ] Started Emergency Shell. 
 [    OK    ] Reached target Emergency Mode. 

 Generating "/run/initramfs/rdsosreport.txt" 


 Entering emergency mode. Exit the shell to continue. 
 Type "journalctl" to view system logs. 
 You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot 
 after mounting them and attach it to a bug report. 


 Give root password for maintenance 
 (or press Control-D to continue): 
 [716615.675131] BTRFS info (device sda2): disk space caching is enabled 
 [716615.675134] BTRFS info (device sda2): has skinny extents 
 [716614.923484] dracut-cmdline[499]: mv: cannot stat '/etc/multipath.conf.kdump': No such file or directory 
 Unable to ioctl(KDSETLED) -- are you not on the console? (Inappropriate ioctl for device) 
 Extracting dmesg 
 ------------------------------------------------------------------------------- 

 The dmesg log is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/dmesg.txt. 
 makedumpfile Completed. 
 ------------------------------------------------------------------------------- 
 Saving dump using makedumpfile 
 ------------------------------------------------------------------------------- 
 Copying data                                        : [100.0 %] /             eta: 0s 

 The dumpfile is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/vmcore. 

 makedumpfile Completed. 
 ------------------------------------------------------------------------------- 
 ``` 
 Then ran `ipmitool power cycle`. The workers on that host work well for now. 


 ## Suggestions 
 * Remove from salt 
 * Identify problem 
 * Ensure stability with multiple reboots 
 * Add back to salt after stable 
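
The "multiple reboots" check could be scripted roughly as follows. This is only a sketch: the helpers `do_reboot` and `is_up`, the variables `SETTLE`, `POLL` and `TRIES`, and all default delays are assumptions made for illustration, not something taken from this ticket or from the existing infrastructure.

```shell
#!/bin/sh
# Sketch only. Helper names, counts and delays below are assumptions.

# Hypothetical helpers: adapt to however the host is rebooted and checked.
do_reboot() { ssh "$1" sudo reboot; }
is_up() { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }

# wait_for_up HOST TRIES: poll until HOST answers or TRIES attempts are used up.
wait_for_up() {
    n=0
    while [ "$n" -lt "$2" ]; do
        if is_up "$1"; then
            return 0
        fi
        sleep "${POLL:-10}"
        n=$((n + 1))
    done
    return 1
}

# reboot_loop HOST RUNS: reboot HOST RUNS times and stop at the first failed boot.
reboot_loop() {
    run=1
    while [ "$run" -le "$2" ]; do
        echo "run $run: rebooting $1"
        do_reboot "$1" || true   # the ssh session usually drops during reboot
        sleep "${SETTLE:-120}"   # give the host time to actually go down
        if ! wait_for_up "$1" "${TRIES:-60}"; then
            echo "run $run: $1 did not come back" >&2
            return 1
        fi
        run=$((run + 1))
    done
    echo "$1 survived $2 reboots"
}
```

For example, `reboot_loop openqaworker13 10` would attempt ten reboots (the count is arbitrary) and return non-zero on the first boot that does not answer `ping` in time. A stuck boot like the one above would then still need the serial console to diagnose.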

 ## Workaround 

Power cycle over IPMI.
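
A minimal sketch of the IPMI commands involved, assuming remote access over the `lanplus` interface. The BMC address and user name are placeholders, and the wrapper function `ipmi` is hypothetical; none of these come from this ticket.

```shell
# Sketch only: BMC address and user are placeholders, not values from this ticket.
BMC_HOST=${BMC_HOST:-bmc.example.org}
BMC_USER=${BMC_USER:-admin}

# ipmi ARGS...: run ipmitool against the BMC over the lanplus interface.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
ipmi() {
    ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E "$@"
}

# Typical sequence (commented out so sourcing this file has no side effects):
# ipmi sol activate     # attach to the serial console to see where boot is stuck
# ipmi power status     # check the current chassis power state
# ipmi power cycle      # power cycle the machine
```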
