action #138350

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

coordination #131519: [epic] Additional redundancy for OSD virtualization testing

worker31 and likely more OSD machines get stuck on boot in grub command line

Added by okurz 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-06-28
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

Working on #131546 we realized that worker31 does not reboot into a valid system and gets stuck in a grub command line. This might be due to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030 . This can be critical: if more systems reboot and get stuck in the same state we will run into a serious problem, at the latest when the next maintenance reboot window, e.g. next Sunday, is reached. So well before that we should ensure that systems do not reboot into an unbootable state.

Rollback steps

  • DONE Add worker3{0,2} back to salt
  • Re-enable rebootmgr
Actions #2

Updated by okurz 7 months ago

  • Subject changed from Additional redundancy for OSD virtualization testing - Xen worker host size:M to worker31 and likely more OSD machines get stuck on boot in grub command line
Actions #3

Updated by dheidler 7 months ago

  • Status changed from New to In Progress
  • Assignee set to dheidler
Actions #4

Updated by livdywan 7 months ago

Switching off automatic reboots as a precaution: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1034

Actions #5

Updated by livdywan 7 months ago

Removing worker31 from salt:

sudo salt-key -y -d worker31.oqa.prg2.suse.org

Note: I accidentally removed worker32 as well; it has since been added back.

worker30 now looks unresponsive as well and shows the same symptom, hence it was also removed from salt.

Actions #6

Updated by dheidler 7 months ago

I rebooted worker30 to check whether the problem occurs on all workers.
It does.

The problem is that /boot/grub2/grub.cfg is empty.
This is also the case on other machines, e.g. worker32.

Actions #7

Updated by dheidler 7 months ago

How to boot into the installed system from the grub shell (some kernel arguments are missing, e.g. some virtualisation options, so treat this as a rescue system):

cat (hd0,gpt2)/etc/fstab

# remember UUID of /

search --no-floppy --fs-uuid --set=root 4de90074-9348-4235-b53a-d1d38f4cc0bc
linux /boot/vmlinuz-5.14.21-150500.55.31-default root=UUID=4de90074-9348-4235-b53a-d1d38f4cc0bc console=tty0 console=ttyS1,115200
initrd /boot/initrd-5.14.21-150500.55.31-default
boot
Actions #8

Updated by dheidler 7 months ago

Afterwards execute this and reboot:

grub2-mkconfig -o /boot/grub2/grub.cfg
Actions #9

Updated by dheidler 7 months ago

This problem appears on most salt machines, even ones that are not worker hosts in the salt context (e.g. piworker):

openqa:/home/dheidler # salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg'
worker37.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker40.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker39.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker33.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker29.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker38.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
s390zl12.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:14 /boot/grub2/grub.cfg
worker32.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker34.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker35.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
s390zl13.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:14 /boot/grub2/grub.cfg
worker-arm1.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker-arm2.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker36.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
backup-qam.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker1.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker1.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker3.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker2.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
petrol.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qamaster.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker16.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker17.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
ada.qe.suse.de:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
imagetester.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 10:15 /boot/grub2/grub.cfg
openqa.suse.de:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg4.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg5.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaw5-xen.qa.suse.de:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg6.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
diesel.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
backup-vm.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg7.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker18.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
jenkins.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker14.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
monitor.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 11:55 /boot/grub2/grub.cfg
tumblesle.qe.nue2.suse.org:
    -rw------- 1 root root 8055 Oct 23 14:28 /boot/grub2/grub.cfg
unreal6.qe.nue2.suse.org:
    -rw------- 1 root root 16489 Oct 23 15:39 /boot/grub2/grub.cfg
openqa-piworker.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
schort-server.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
baremetal-support.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
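
To pick out only the affected minions from a listing like the one above, a small filter could help. This is a minimal sketch: the function name list_empty_cfg is hypothetical, and it assumes salt's default output format as shown here (a minion name ending in ":" followed by an indented `ls -l` line).

```shell
# list_empty_cfg (hypothetical helper): read the output of
#   salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg'
# on stdin and print the minions whose file has size 0.
list_empty_cfg() {
    awk '
        # Minion header line: starts in column 1 and ends with ":"
        /^[^ ].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # ls -l line: field 5 is the file size; NF guard skips unrelated lines
        NF >= 9 && $5 == 0 { print host }
    '
}

# Possible usage on the salt master (assumption, not run as part of this sketch):
#   salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg' | list_empty_cfg
```

Minions that report an error instead of an `ls -l` line are not printed, so they would need to be checked separately.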
Actions #10

Updated by dheidler 7 months ago

Executed this on osd (salt master):

salt "*" cmd.run 'grub2-mkconfig -o /boot/grub2/grub.cfg'

Output can be found in openqa:/home/dheidler/grub2-mkconfig.log.

Now all salt machines have a non-empty /boot/grub2/grub.cfg again.

worker30 and worker31 have manually been fixed as described above.

Actions #11

Updated by dheidler 7 months ago

Added worker30 and worker31 back to salt:

salt-key -a worker30.oqa.prg2.suse.org
salt-key -a worker31.oqa.prg2.suse.org
Actions #12

Updated by livdywan 7 months ago

  • Description updated (diff)
Actions #13

Updated by okurz 7 months ago

  • Description updated (diff)

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1035 to revert the reboot disablement, which was ineffective anyway. I assume that I caused the original problem with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030, right?

Actions #14

Updated by nicksinger 7 months ago

okurz wrote in #note-13:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1035 to revert the reboot disablement, which was ineffective anyway. I assume that I caused the original problem with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030, right?

Kind of, yes. The syntax error in /etc/default/grub introduced by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030/diffs#6d941bec6b3bc75b45972bf3fa461af5cdc33111_18_19 caused our grub2-mkconfig call to emit only an error message and nothing on stdout, which in turn resulted in us overwriting /boot/grub2/grub.cfg with the empty output. I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1036, which should avoid that problem by not touching the original file if generation fails.
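
The idea of generating first and replacing the target only on success could be sketched as follows. This is not the content of merge request 1036: the function name safe_regen and its arguments are hypothetical, and the sketch assumes the generator writes the config to stdout, as grub2-mkconfig does when called without -o.

```shell
# safe_regen GENERATOR_CMD TARGET (hypothetical helper)
# Run GENERATOR_CMD, capture its stdout in a temporary file and replace
# TARGET only if the command succeeded and produced non-empty output.
safe_regen() {
    generator=$1
    target=$2
    # Temp file next to the target so the final mv is a rename on the same filesystem
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if sh -c "$generator" > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$target"
    else
        # Generation failed or output was empty: keep the existing file untouched
        rm -f "$tmp"
        echo "generation failed, keeping existing $target" >&2
        return 1
    fi
}

# Possible usage as root (assumption, not run as part of this sketch):
#   safe_regen 'grub2-mkconfig' /boot/grub2/grub.cfg
```

Because mktemp creates the file with mode 0600, the replaced file keeps the restrictive permissions seen in the listing in #note-9.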

Actions #15

Updated by okurz 7 months ago

I found multiple debug leftovers from nicksinger:

sudo salt \* cmd.run 'grep crashkernel=211 /etc/default/grub'
s390zl12.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
backup-qam.qe.nue2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker29.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker40.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
qesapworker-prg5.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
openqaworker14.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker38.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker36.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker31.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
ada.qe.suse.de:
openqaworker18.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
…

Removed with:

sudo salt \* cmd.run 'sed -i "/crashkernel=211/d" /etc/default/grub'
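
Since grub2-mkconfig sources /etc/default/grub as a shell script, a syntax check after bulk edits like this one might catch the kind of error described in #note-14 before the next config generation. A minimal sketch; the function name check_grub_defaults is hypothetical, and `sh -n` only parses the file without executing it, so it detects syntax errors but not wrong values.

```shell
# check_grub_defaults FILE (hypothetical helper)
# Succeed only if FILE parses as shell syntax; nothing in FILE is executed.
check_grub_defaults() {
    sh -n "$1" 2>/dev/null
}

# Possible usage across all minions (assumption, not run as part of this sketch):
#   sudo salt \* cmd.run 'sh -n /etc/default/grub'
```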
Actions #16

Updated by openqa_review 7 months ago

  • Due date set to 2023-11-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by dheidler 7 months ago

  • Status changed from In Progress to Resolved

Unless we want to post-estimate this issue or conduct a more detailed RCA, this should be resolved.

Actions #18

Updated by okurz 7 months ago

  • Due date deleted (2023-11-07)