action #123025

closed

o3 worker openqaworker4 is down; boots to emergency shell only

Added by mkittler almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-01-12
Due date: 2023-01-27
% Done: 0%
Estimated time:
Tags:

Description

Observation

While checking IPMI access (for #120270) I noticed that openqaworker4 boots into the emergency shell. A reboot didn't help. At least IPMI access works (via jumpy@qe-jumpy.suse.de, as documented in pillars).

I've got the following over SOL:

$ [(582a0875e...)] ssh -4 jumpy@qe-jumpy.suse.de -- ipmitool -I lanplus -C 3 -H openqaworker4-ipmi.qe-ipmi-ur -U … -P … sol activate
tcgetattr: Inappropriate ioctl for device
[SOL Session operational.  Use ~? for help]
g: dracut-initqueue: starting timeout scripts
[  154.763096] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  154.872232] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[  154.926722] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[  155.040172] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[  155.042089] dracut-initqueue[686]:     [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[  155.096575] dracut-initqueue[686]: fi"
[  155.270418] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[  155.327020] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  155.385155] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[  155.441888] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[  155.443048] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[  155.499042] dracut-initqueue[686]:     [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[  155.556320] dracut-initqueue[686]: fi"
…
[  203.066174] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[  203.066203] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  203.066230] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[  203.066257] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[  203.066284] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
         Starting Dracut Emergency Shell...
[  210.752284] dracut-initqueue[686]:     [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[  210.752327] dracut-initqueue[686]: fi"
[  210.752349] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[  210.752369] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  210.752400] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4[  222.367453] dracut-initqueue[686]:     [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[  222.367484] dracut-initqueue[686]: fi"
…
[  222.713160] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[  222.713181] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  222.713201] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62.sh: "[ -e "/dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62" ]"
[  222.713219] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed.sh: "[ -e "/dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed" ]"
[  222.713260] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2fb16b75b5-fff1-4bf3-9c31-dad4c807a49a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[  222.713284] dracut-initqueue[686]:     [ -e "/dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a" ]
[  222.713302] dracut-initqueue[686]: fi"
[  222.713320] dracut-initqueue[686]: Warning: dracut-initqueue: starting timeout scripts
[  222.713348] dracut-initqueue[686]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  223.045600] dracut-initqueue[686]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdi

Login incorrect

Give root password for maintenance
(or press Control-D to continue): …


sh-4.4#

Acceptance criteria

  • AC1: openqaworker4 is back online

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #123004: Downgrade kernel on o3+osd x86_64 machines as workaround for boo#1206616 size:M (Resolved, okurz, 2023-01-12)

Actions #1

Updated by mkittler almost 2 years ago

Somehow there's no /dev/disk/by-uuid within that shell (only by-id, by-partuuid and by-path):

sh-4.4# ls -l /dev/disk      
ls -l /dev/disk
total 0
drwxr-xr-x 2 root root 760 Jan 12 11:33 by-id
drwxr-xr-x 2 root root 120 Jan 12 11:33 by-partuuid
drwxr-xr-x 2 root root 280 Jan 12 11:33 by-path
sh-4.4# 
sh-4.4# ls -l /dev/disk/by-partuuid
ls -l /dev/disk/by-partuuid
total 0
lrwxrwxrwx 1 root root 10 Jan 12 11:33 000364d7-01 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 000364d7-02 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Jan 12 11:33 00092cc4-01 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 00092cc4-02 -> ../../sda2
sh-4.4# 
sh-4.4# ls -l /dev/disk/by-id      
ls -l /dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 ata-HGST_HTE721010A9E630_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-0ATA_HGST_HTE721010A9_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-1ATA_HGST_HTE721010A9E630_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-35000cca8a8c6fc7e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c6fc7e-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c6fc7e-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-35000cca8a8c90368 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c90368-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-35000cca8a8c90368-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0HBEHK-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 scsi-SATA_HGST_HTE721010A9_JR10034M0MUNRK-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c6fc7e-part2 -> ../../sda2
lrwxrwxrwx 1 root root  9 Jan 12 11:33 wwn-0x5000cca8a8c90368 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c90368-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 12 11:33 wwn-0x5000cca8a8c90368-part2 -> ../../sdb2

Maybe that is another symptom of the issue preventing the boot.
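
The dracut hooks in the log above simply test whether certain device links exist. A minimal sketch of the same check, runnable by hand in the emergency shell, could look like this (the function name missing_devices is made up for illustration; the three paths are the ones from the warnings in this ticket):

```shell
#!/bin/sh
# Print every path from the argument list that does not exist,
# mirroring the '[ -e "..." ]' test used by the dracut initqueue hooks.
missing_devices() {
    for dev in "$@"; do
        [ -e "$dev" ] || printf '%s\n' "$dev"
    done
}

# Device links taken from the dracut warnings in this ticket.
missing_devices \
    /dev/disk/by-id/md-uuid-b22a1df4:ee1e951a:46cd5205:209e6a62 \
    /dev/disk/by-id/md-uuid-c807d4f1:469952f3:a2a7bf78:b5ae26ed \
    /dev/disk/by-uuid/b16b75b5-fff1-4bf3-9c31-dad4c807a49a
```

Every path that is printed is one the initqueue is still waiting for. The md-uuid names suggest that the MD RAID arrays were never assembled, which would also explain the missing by-uuid directory, but that is an inference from the names and not something confirmed here.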

Supposedly it would help to regenerate the initramfs but I'm not sure how to do that (on openSUSE).

Actions #2

Updated by mkittler almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

@nicksinger recovered the machine by recreating the initramfs (using a rescue system ISO mounted via jviewer). We should add a few details about how it was done in the Wiki.
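
The exact commands used are not recorded in this ticket. As a rough sketch only, regenerating the initramfs from a rescue system typically means mounting the installed root filesystem, chrooting into it and running dracut. ROOT_DEV and TARGET below are hypothetical placeholders, not values taken from this machine, and the script only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Sketch only: device name and mount point are assumptions, adjust before use.
ROOT_DEV="${ROOT_DEV:-/dev/md0}"   # assumption: device holding the root filesystem
TARGET="${TARGET:-/mnt}"           # assumption: mount point inside the rescue system
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands, 0 = execute them

# Either print a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

regenerate_initramfs() {
    run mdadm --assemble --scan                  # assemble the MD RAID arrays
    run mount "$ROOT_DEV" "$TARGET"              # mount the installed system
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$TARGET/$fs"    # make kernel interfaces visible in the chroot
    done
    # Rebuild the initramfs for all installed kernels.
    run chroot "$TARGET" dracut --force --regenerate-all
}

regenerate_initramfs
```

If /boot is on a separate partition it would have to be mounted below $TARGET before running dracut, otherwise the new initramfs ends up in the wrong place.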

Actions #3

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-01-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by nicksinger almost 2 years ago

I was able to use the integrated Java tool to mount an ISO on that machine. First you have to access the webui of the BMC (openqaworker4) by forwarding the http(s) port via jumpy and download the required jviewer.jnlp (usually by clicking the display preview in the webui). Next you have to figure out which ports are required for this tool to work. I used nmap on jumpy to figure this out:

jumpy@qe-jumpy:~> nmap openqaworker4-ipmi.qe-ipmi-ur -p-
Starting Nmap 7.70 ( https://nmap.org ) at 2023-01-17 12:23 UTC
Nmap scan report for openqaworker4-ipmi.qe-ipmi-ur (192.168.133.4)
Host is up (0.0056s latency).
Not shown: 65525 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
199/tcp  open  smux
427/tcp  open  svrloc
443/tcp  open  https
623/tcp  open  oob-ws-http
5120/tcp open  barracuda-bbs
5122/tcp open  unknown
5123/tcp open  unknown
7578/tcp open  unknown
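
To avoid copying the port numbers by hand, the nmap table can be filtered with awk. This is just a sketch; the helper name open_tcp_ports is made up, and the sample input is an excerpt of the scan above:

```shell
#!/bin/sh
# Read normal nmap output on stdin and print the open TCP port numbers, one per line.
open_tcp_ports() {
    awk '$2 == "open" && $1 ~ /^[0-9]+\/tcp$/ { sub(/\/tcp$/, "", $1); print $1 }'
}

# Demo with an excerpt of the scan output from this ticket.
open_tcp_ports <<'EOF'
PORT     STATE SERVICE
443/tcp  open  https
623/tcp  open  oob-ws-http
5120/tcp open  barracuda-bbs
EOF
```

On jumpy this could be used as `nmap openqaworker4-ipmi.qe-ipmi-ur -p- | open_tcp_ports`.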

Afterwards all required ports can be forwarded to localhost via ssh. I used the following command:

sudo ssh -i /home/nicksinger/.ssh/id_rsa.SUSE -4 jumpy@qe-jumpy.suse.de -L 443:openqaworker4-ipmi.qe-ipmi-ur:443 -L 623:openqaworker4-ipmi.qe-ipmi-ur:623 -L 5120:openqaworker4-ipmi.qe-ipmi-ur:5120 -L 5122:openqaworker4-ipmi.qe-ipmi-ur:5122 -L 5123:openqaworker4-ipmi.qe-ipmi-ur:5123 -L 7578:openqaworker4-ipmi.qe-ipmi-ur:7578

Since the Java tool chain-loads additional files over HTTPS, you need to forward port 443 too, which requires root privileges for the ssh client to bind to that port locally. You also need to make sure the port is not occupied by a locally running web server. With these forwards you can successfully mount an ISO and boot from it for further recovery steps.
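
The long list of -L options can also be generated from the port list. A small sketch (forward_args is a made-up helper name; the script only prints the resulting ssh command and does not run it):

```shell
#!/bin/sh
# Print one "-L port:host:port" option per port, all on one line.
forward_args() {
    host=$1
    shift
    for port in "$@"; do
        printf '%s %s:%s:%s ' -L "$port" "$host" "$port"
    done
}

BMC=openqaworker4-ipmi.qe-ipmi-ur
# The unquoted command substitution is split into separate words on purpose.
echo sudo ssh -4 jumpy@qe-jumpy.suse.de $(forward_args "$BMC" 443 623 5120 5122 5123 7578)
```

The local ports are kept identical to the remote ones because the Java viewer expects them unchanged, which is also why binding to 443 locally (and therefore sudo) cannot be avoided here.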

Actions #5

Updated by okurz almost 2 years ago

  • Related to action #123004: Downgrade kernel on o3+osd x86_64 machines as workaround for boo#1206616 size:M added

Actions #6

Updated by nicksinger almost 2 years ago

  • Status changed from In Progress to Resolved

In the end we managed to fix the system by unpinning the kernel version from https://progress.opensuse.org/issues/123004#note-1 and regenerating the initramfs. We didn't understand why the pinned version caused the initramfs to fail. We saw some reports of missing symbols, so maybe a second package needs to be downgraded as well?
We figured that having the older kernel is not required for this host and therefore removed the workaround to have a stable system again.
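
How the kernel was pinned is described in the linked note and not repeated here. Assuming it was done with a zypper package lock (an assumption, not confirmed in this ticket), removing the workaround could be sketched as follows. The script only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Either print a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Assumption: the pin is a zypper lock on the package given as $1.
unpin_kernel() {
    run zypper locks                           # show the currently active locks
    run zypper removelock "$1"                 # drop the lock on the kernel package
    run zypper --non-interactive update "$1"   # move to the current kernel again
    run dracut --force --regenerate-all        # rebuild the initramfs for all kernels
}

unpin_kernel kernel-default
```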

I added my notes about accessing/recovering this machine to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Access-the-BMC-of-machines-in-the-new-security-zone and https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Using-the-build-in-java-tools-of-BMCs-to-access-machines-in-the-security-zone and cross-checked that the machine is still online. An (accidental) reboot of the machine worked too, so I consider this done.

Actions #7

Updated by mkittler almost 2 years ago

Thanks for updating the Wiki; the instructions are good.
