action #123025
o3 worker openqaworker4 is down; boots to emergency shell only
0%
Description
Observation
When checking IPMI access (for #120270) I noticed that openqaworker4 booted into the emergency shell. A reboot didn't help. At least IPMI access works (via jumpy@qe-jumpy.suse.de, as documented in pillars).
I got the following over SOL:
$ [(582a0875e...)] ssh -4 jumpy@qe-jumpy.suse.de -- ipmitool -I lanplus -C 3 -H openqaworker4-ipmi.qe-ipmi-ur -U … -P … sol activate
tcgetattr: Inappropriate ioctl for device
[SOL Session operational. Use ~? for help]
g: dracut-initqueue: starting timeout scripts
[ 154.763096] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 154.872232] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 154.926722] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 155.040172] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 155.042089] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 155.096575] dracut-initqueue[686]: fi"
[ 155.270418] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 155.327020] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 155.385155] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 155.441888] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 155.443048] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 155.499042] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 155.556320] dracut-initqueue[686]: fi"
…
[ 203.066174] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 203.066203] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 203.066230] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 203.066257] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 203.066284] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
Starting Dracut Emergency Shell...
[ 210.752284] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 210.752327] dracut-initqueue[686]: fi"
[ 210.752349] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 210.752369] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 210.752400] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4
[ 222.367453] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 222.367484] dracut-initqueue[686]: fi"
…
[ 222.713160] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 222.713181] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 222.713201] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 222.713219] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 222.713260] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 222.713284] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 222.713302] dracut-initqueue[686]: fi"
[ 222.713320] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 222.713348] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 223.045600] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdi
Login incorrect
Give root password for maintenance
(or press Control-D to continue): …
sh-4.4#
Acceptance criteria
- AC1: openqaworker4 is back online
Related issues
History
#1
Updated by mkittler 2 months ago
Somehow there's no /dev/disk/by-uuid within that shell (only by-id, by-partuuid and by-path):
sh-4.4# ls -l /dev/disk
total 0
drwxr-xr-x 2 root root 760 Jan 12 11:33 by-id
drwxr-xr-x 2 root root 120 Jan 12 11:33 by-partuuid
drwxr-xr-x 2 root root 280 Jan 12 11:33 by-path
sh-4.4# ls -l /dev/disk/by-partuuid
total 0
lrwxrwxrwx 1 root root 10 Jan 12 11:33 000364d7-01 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 000364d7-02 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Jan 12 11:33 00092cc4-01 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 00092cc4-02 -> ../../sda2
sh-4.4# ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-35000cca8a8c6fc7e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c6fc7e-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c6fc7e-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-35000cca8a8c90368 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c90368-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c90368-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 wwn-0x5000cca8a8c90368 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c90368-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c90368-part2 -> ../../sdb2
Maybe that is another symptom of the issue preventing the boot.
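Should this come up again, a hedged diagnostic sketch (not the procedure actually used in this ticket) could narrow down whether udev or MD assembly is at fault; the by-uuid symlinks are created by udev rules, while the md-uuid links require the arrays to be assembled:

```shell
# Sketch only, to be run inside the dracut emergency shell.
cat /proc/mdstat                                      # were the MD arrays assembled at all?
mdadm --assemble --scan                               # try to assemble them manually
udevadm trigger --subsystem-match=block --action=add  # replay block-device uevents
udevadm settle                                        # wait until udev has processed them
ls -l /dev/disk                                       # by-uuid should now exist if the udev rules are intact
```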
Supposedly it would help to regenerate the initramfs but I'm not sure how to do that (on openSUSE).
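For the record, a minimal sketch of regenerating the initramfs on openSUSE from a rescue system; device names and mount points are assumptions (the root here sits on MD RAID, so /dev/md0 is hypothetical):

```shell
# Boot a rescue ISO, then (sketch, adjust device names to the real layout):
mdadm --assemble --scan                       # assemble the installed system's RAID arrays
mount /dev/md0 /mnt                           # mount the root filesystem (md0 is an assumption)
for fs in dev proc sys run; do mount --bind "/$fs" "/mnt/$fs"; done
chroot /mnt dracut --force --regenerate-all   # rebuild the initramfs for all installed kernels
umount -R /mnt                                # clean up before rebooting into the fixed system
```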
#2
Updated by mkittler 2 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
nicksinger recovered the machine by recreating the initramfs (using a rescue-system ISO mounted via jviewer). We should add a few details in the Wiki about how it was done.
#3
Updated by openqa_review 2 months ago
- Due date set to 2023-01-27
Setting due date based on mean cycle time of SUSE QE Tools
#4
Updated by nicksinger 2 months ago
I was able to use the integrated Java tool to mount an ISO on that machine. First you have to access the web UI of the BMC of openqaworker4 by forwarding its http(s) ports via jumpy and download the required jviewer.jnlp (usually by clicking the display preview in the web UI). Next you have to figure out which ports are required for this tool to work. I used nmap on jumpy to figure this out:
jumpy@qe-jumpy:~> nmap openqaworker4-ipmi.qe-ipmi-ur -p-
Starting Nmap 7.70 ( https://nmap.org ) at 2023-01-17 12:23 UTC
Nmap scan report for openqaworker4-ipmi.qe-ipmi-ur (192.168.133.4)
Host is up (0.0056s latency).
Not shown: 65525 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
199/tcp  open  smux
427/tcp  open  svrloc
443/tcp  open  https
623/tcp  open  oob-ws-http
5120/tcp open  barracuda-bbs
5122/tcp open  unknown
5123/tcp open  unknown
7578/tcp open  unknown
Afterwards all required ports can be forwarded to localhost via ssh. I used the following command:
sudo ssh -i /home/nicksinger/.ssh/id_rsa.SUSE -4 jumpy@qe-jumpy.suse.de -L 443:openqaworker4-ipmi.qe-ipmi-ur:443 -L 623:openqaworker4-ipmi.qe-ipmi-ur:623 -L 5120:openqaworker4-ipmi.qe-ipmi-ur:5120 -L 5122:openqaworker4-ipmi.qe-ipmi-ur:5122 -L 5123:openqaworker4-ipmi.qe-ipmi-ur:5123 -L 7578:openqaworker4-ipmi.qe-ipmi-ur:7578
Since the Java tool chainloads additional files over https, you need to forward port 443 too, which requires root privileges for the ssh client to bind to that port. You also need to make sure the port is not occupied by a locally running web server. With these forwards in place you can mount an ISO and boot from it for further recovery steps.
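Because a locally occupied port silently breaks the forward, a small helper can check the ports beforehand. This is a hypothetical convenience script, not part of the ticket; it is bash-specific because it uses the /dev/tcp pseudo-device:

```shell
#!/bin/bash
# Check whether the local ports needed for the jviewer forwards are free.
port_free() {
  # Connecting to a port nobody listens on fails, so a failed connect
  # means the port is free for "ssh -L" to bind (as root for ports < 1024).
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for p in 443 623 5120 5122 5123 7578; do
  if port_free "$p"; then
    echo "port $p is free"
  else
    echo "port $p is in use locally - stop the service listening there first"
  fi
done
```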
#5
Updated by okurz 2 months ago
- Related to action #123004: Downgrade kernel on o3+osd x86_64 machines as workaround for boo#1206616 added
#6
Updated by nicksinger 2 months ago
- Status changed from In Progress to Resolved
In the end we managed to fix the system by unpinning the kernel version from https://progress.opensuse.org/issues/123004#note-1 and regenerating the initramfs. We don't understand why the pinned version caused the initramfs to fail; we saw some reports of missing symbols, so maybe a second package would have needed to be downgraded as well.
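For reference, removing such a kernel pin on openSUSE usually comes down to dropping the package lock and updating. A hedged sketch; the exact lock from poo#123004 may differ and `kernel-default` is an assumption:

```shell
zypper locks                      # show the current package locks
zypper removelock kernel-default  # drop the version pin (lock name is an assumption)
zypper refresh && zypper update   # pull the current kernel back in; the kernel
                                  # update regenerates the initramfs via dracut
```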
We concluded that the older kernel is not required on this host and therefore removed the workaround to get a stable system again.
I added my notes about accessing/recovering this machine to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Access-the-BMC-of-machines-in-the-new-security-zone and https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Using-the-build-in-java-tools-of-BMCs-to-access-machines-in-the-security-zone and cross-checked that the machine is still online. An (accidental) reboot of the machine worked too, so I consider this done here.