action #123025
o3 worker openqaworker4 is down; boots to emergency shell only
Status: Closed
Description
Observation
When checking IPMI access (for #120270) I noticed that openqaworker4 booted into the emergency shell. A reboot didn't help. At least IPMI access works (via jumpy@qe-jumpy.suse.de, as documented in pillars).
I've got the following over SOL:
$ [(582a0875e...)] ssh -4 jumpy@qe-jumpy.suse.de -- ipmitool -I lanplus -C 3 -H openqaworker4-ipmi.qe-ipmi-ur -U … -P … sol activate
tcgetattr: Inappropriate ioctl for device
[SOL Session operational. Use ~? for help]
g: dracut-initqueue: starting timeout scripts
[ 154.763096] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 154.872232] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 154.926722] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 155.040172] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 155.042089] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 155.096575] dracut-initqueue[686]: fi"
[ 155.270418] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 155.327020] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 155.385155] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 155.441888] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 155.443048] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 155.499042] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 155.556320] dracut-initqueue[686]: fi"
…
[ 203.066174] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 203.066203] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 203.066230] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 203.066257] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 203.066284] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
Starting Dracut Emergency Shell...
[ 210.752284] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 210.752327] dracut-initqueue[686]: fi"
[ 210.752349] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 210.752369] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 210.752400] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4[ 222.367453] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 222.367484] dracut-initqueue[686]: fi"
…
[ 222.713160] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 222.713181] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 222.713201] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[ 222.713219] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[ 222.713260] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 222.713284] dracut-initqueue[686]: [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[ 222.713302] dracut-initqueue[686]: fi"
[ 222.713320] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[ 222.713348] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 223.045600] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdi
Login incorrect
Give root password for maintenance
(or press Control-D to continue): …
sh-4.4#
Acceptance criteria
- AC1: openqaworker4 is back online
Updated by mkittler almost 2 years ago
Somehow there's no /dev/disk/by-uuid within that shell (only by-id, by-partuuid and by-path):
sh-4.4# ls -l /dev/disk
ls -l /dev/disk
total 0
drwxr-xr-x 2 root root 760 Jan 12 11:33 by-id
drwxr-xr-x 2 root root 120 Jan 12 11:33 by-partuuid
drwxr-xr-x 2 root root 280 Jan 12 11:33 by-path
sh-4.4#
sh-4.4# ls -l /dev/disk/by-partuuid
ls -l /dev/disk/by-partuuid
total 0
lrwxrwxrwx 1 root root 10 Jan 12 11:33 000364d7-01 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 000364d7-02 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Jan 12 11:33 00092cc4-01 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 00092cc4-02 -> ../../sda2
sh-4.4#
sh-4.4# ls -l /dev/disk/by-id
ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root 9 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-35000cca8a8c6fc7e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c6fc7e-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c6fc7e-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-35000cca8a8c90368 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c90368-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c90368-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jan 12 11:33 wwn-0x5000cca8a8c90368 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c90368-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c90368-part2 -> ../../sdb2
Maybe that is another symptom of the issue preventing the boot.
Supposedly it would help to regenerate the initramfs, but I'm not sure how to do that (on openSUSE).
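For reference, regenerating the initramfs from a rescue system could roughly look like the following on openSUSE. This is only a sketch; /dev/md0 as the root array and /mnt as the mount point are assumptions for illustration, not verified on this machine:
# from the rescue system, assemble the RAID arrays the initrd was waiting for
mdadm --assemble --scan
# mount the root filesystem (assumption: /dev/md0 is the root array)
mount /dev/md0 /mnt
# bind-mount the pseudo-filesystems needed inside the chroot
for fs in proc sys dev; do mount --bind /$fs /mnt/$fs; done
# regenerate the initramfs images for all installed kernels
chroot /mnt dracut --regenerate-all --force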
Updated by mkittler almost 2 years ago
- Status changed from New to In Progress
- Assignee set to nicksinger
@nicksinger recovered the machine by recreating the initramfs (using a rescue system ISO mounted via jviewer). We should add a few details to the Wiki about how it was done.
Updated by openqa_review almost 2 years ago
- Due date set to 2023-01-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger almost 2 years ago
I was able to use the integrated Java tool to mount an ISO on that machine. First you have to access the web UI of openqaworker4's BMC by forwarding the http(s) port via jumpy, then download the required jviewer.jnlp (usually by clicking the display preview in the web UI). Next you have to figure out which ports the tool needs. I used nmap on jumpy to do that:
jumpy@qe-jumpy:~> nmap openqaworker4-ipmi.qe-ipmi-ur -p-
Starting Nmap 7.70 ( https://nmap.org ) at 2023-01-17 12:23 UTC
Nmap scan report for openqaworker4-ipmi.qe-ipmi-ur (192.168.133.4)
Host is up (0.0056s latency).
Not shown: 65525 closed ports
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
199/tcp open smux
427/tcp open svrloc
443/tcp open https
623/tcp open oob-ws-http
5120/tcp open barracuda-bbs
5122/tcp open unknown
5123/tcp open unknown
7578/tcp open unknown
Afterwards all required ports can be forwarded to localhost via ssh. I used the following command:
sudo ssh -i /home/nicksinger/.ssh/id_rsa.SUSE -4 jumpy@qe-jumpy.suse.de -L 443:openqaworker4-ipmi.qe-ipmi-ur:443 -L 623:openqaworker4-ipmi.qe-ipmi-ur:623 -L 5120:openqaworker4-ipmi.qe-ipmi-ur:5120 -L 5122:openqaworker4-ipmi.qe-ipmi-ur:5122 -L 5123:openqaworker4-ipmi.qe-ipmi-ur:5123 -L 7578:openqaworker4-ipmi.qe-ipmi-ur:7578
Since the Java tool chainloads additional files over https, you need to forward port 443 too, which requires root privileges for the ssh client to bind to that port. You also need to make sure the port is not occupied by a locally running webserver. With these forwards in place you can mount an ISO and boot from it for further recovery steps.
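To check up front whether something already listens on local port 443 before setting up the forward, one option is an ss query like this (empty output means the forward can bind the port):
# list any local listener on TCP port 443
sudo ss -tlnp 'sport = :443'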
Updated by okurz almost 2 years ago
- Related to action #123004: Downgrade kernel on o3+osd x86_64 machines as workaround for boo#1206616 size:M added
Updated by nicksinger almost 2 years ago
- Status changed from In Progress to Resolved
In the end we managed to fix the system by unpinning the kernel version from https://progress.opensuse.org/issues/123004#note-1 and regenerating the initramfs. We didn't understand why the pinned version caused the initramfs to fail; we saw some reports of missing symbols, so maybe a second package would have needed to be downgraded as well?
We concluded that the older kernel is not required for this host and therefore removed the workaround to have a stable system again.
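Assuming the pin from #123004 was a zypper package lock (the exact mechanism is documented in that ticket), removing it could look roughly like this; kernel-default as the locked package is an assumption:
zypper locks                            # list existing package locks
sudo zypper removelock kernel-default   # drop the lock so the kernel can update again
sudo zypper up kernel-default           # move back to the current kernel
sudo dracut --regenerate-all --force    # rebuild the initramfs afterwards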
I added my notes about accessing/recovering this machine to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Access-the-BMC-of-machines-in-the-new-security-zone and https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Using-the-build-in-java-tools-of-BMCs-to-access-machines-in-the-security-zone and cross-checked that the machine is still online. An (accidental) reboot of the machine worked too, so I consider this done here.
Updated by mkittler almost 2 years ago
Thanks for updating the Wiki; the instructions are good.