action #88900

Updated by okurz 11 months ago

## Observation
openqaworker13 was unreachable on 2021-02-22.

## Acceptance criteria
* **AC1:** machine is reboot-reliable, e.g. ensured by multiple reboots without failures after boot

## Problem
Cannot reach this machine: it does not respond to `ping`.

## Workaround
Connecting to this machine with `ipmitool sol activate` shows:

```
[FAILED] Failed to start save kernel crash dump.
See 'systemctl status kdump-save.service' for details.
Starting Reload Configuration from the Real Root...
[ OK ] Started Reload Configuration from the Real Root.
[ OK ] Reached target Initrd File Systems.
[ OK ] Reached target Initrd Default Target.
Starting dracut pre-pivot and cleanup hook...
[ OK ] Started dracut pre-pivot and cleanup hook.
Starting Cleaning Up and Shutting Down Daemons...
[ OK ] Stopped target Timers.
[ OK ] Stopped dracut pre-pivot and cleanup hook.
[ OK ] Stopped target Initrd Default Target.
[ OK ] Stopped target Basic System.
[ OK ] Stopped target Slices.
[ OK ] Stopped target Initrd Root Device.
[ OK ] Stopped target System Initialization.
[ OK ] Stopped Apply Kernel Variables.
[ OK ] Stopped Create Volatile Files and Directories.
[ OK ] Stopped target Local File Systems.
Unmounting /kdump/mnt1/var/crash...
Unmounting /kdump/mnt0...
Stopping udev Kernel Device Manager...
[ OK ] Stopped target Sockets.
[ OK ] Stopped Load Kernel Modules.
[ OK ] Stopped target Paths.
[ OK ] Stopped Dispatch Password Requests to Console Directory Watch.
[ OK ] Stopped target Swap.
[ OK ] Stopped target Remote File Systems.
[ OK ] Stopped target Remote File Systems (Pre).
[ OK ] Stopped dracut initqueue hook.
[ OK ] Stopped udev Coldplug all Devices.
[ OK ] Stopped udev Kernel Device Manager.
[ OK ] Unmounted /kdump/mnt1/var/crash.
[ OK ] Started Cleaning Up and Shutting Down Daemons.
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Stopped Create list of required sta…vice nodes for the current kernel.
[ OK ] Stopped dracut pre-udev hook.
[ OK ] Stopped dracut cmdline hook.
[ OK ] Stopped dracut ask for additional cmdline parameters.
[ OK ] Closed udev Control Socket.
[ OK ] Closed udev Kernel Socket.
Starting Cleanup udevd DB...
[ OK ] Started Cleanup udevd DB.
[ OK ] Reached target Switch Root.
Starting Switch Root...
[FAILED] Failed to start Switch Root.
See 'systemctl status initrd-switch-root.service' for details.
Starting Setup Virtual Console...
[ OK ] Unmounted /kdump/mnt0.
[ OK ] Stopped File System Check on /dev/d…3d875-0e24-49a9-8dc8-ddf8cf83f04e.
[ OK ] Started Setup Virtual Console.
[ OK ] Started Emergency Shell.
[ OK ] Reached target Emergency Mode.

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

Give root password for maintenance
(or press Control-D to continue):
[716615.675131] BTRFS info (device sda2): disk space caching is enabled
[716615.675134] BTRFS info (device sda2): has skinny extents
[716614.923484] dracut-cmdline[499]: mv: cannot stat '/etc/multipath.conf.kdump': No such file or directory
Unable to ioctl(KDSETLED) -- are you not on the console? (Inappropriate ioctl for device)
Extracting dmesg
-------------------------------------------------------------------------------

The dmesg log is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/dmesg.txt.
makedumpfile Completed.
-------------------------------------------------------------------------------
Saving dump using makedumpfile
-------------------------------------------------------------------------------
Copying data : [100.0 %] / eta: 0s

The dumpfile is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/vmcore.

makedumpfile Completed.
-------------------------------------------------------------------------------
```
Then ran `ipmitool power cycle`. The workers on that host work well for now.
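
To find the failing units in a long console capture like the one above, the `[FAILED]` lines can be filtered out. This is a minimal sketch, assuming the serial-over-LAN output was saved to a file; the function name `failed_units` and the file name `console.log` are hypothetical.

```shell
# Print every systemd "[FAILED]" status line from a saved console capture.
# Reads the files given as arguments, or stdin when none are given.
# -F makes grep treat "[FAILED]" as a literal string, not a bracket expression.
failed_units() {
    grep -F '[FAILED]' "$@"
}
```

Called as `failed_units console.log` on the output above, this would list the `save kernel crash dump` and `Switch Root` failures.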

## Suggestions

* Remove from salt
* Identify problem
* Ensure stability with multiple reboots
* Add back to salt after stable
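
The "multiple reboots" step could be scripted roughly as follows. This is only a sketch under assumptions not stated in the ticket: passwordless root SSH to the host, `systemctl is-system-running` as the health check, and arbitrary timing values. The function names `wait_for_ssh` and `check_reboots` are hypothetical.

```shell
# Poll until the host accepts SSH again.
# $1 = host, $2 = number of attempts (default 60), 10 s apart.
wait_for_ssh() {
    host=$1 tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        ssh -o ConnectTimeout=5 "root@$host" true 2>/dev/null && return 0
        tries=$((tries - 1))
        sleep 10
    done
    return 1
}

# Reboot the host repeatedly and stop at the first failed boot.
# $1 = host, $2 = number of reboots (default 5).
check_reboots() {
    host=$1 runs=${2:-5} i=1
    while [ "$i" -le "$runs" ]; do
        echo "reboot $i/$runs on $host"
        ssh "root@$host" reboot || true   # the connection usually drops here
        sleep 60                          # assumed time for the host to go down
        wait_for_ssh "$host" || { echo "$host did not come back" >&2; return 1; }
        # Non-zero for any state other than "running", including "degraded".
        ssh "root@$host" systemctl is-system-running || return 1
        i=$((i + 1))
    done
}
```

For example, `check_reboots openqaworker13 5` would attempt five reboots. Note that `is-system-running` also fails while the system is still starting or when any unit has failed, so this check may be stricter than needed. If the host hangs in the emergency shell again, it will not come back over SSH and a power cycle over IPMI is still required.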

## Workaround

Power cycle the machine over IPMI, e.g. with `ipmitool power cycle`.
