action #138350

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #129280: [epic] Move from SUSE NUE1 (Maxtorhof) to new NBG Datacenters

coordination #131519: [epic] Additional redundancy for OSD virtualization testing

worker31 and likely more OSD machines get stuck on boot in grub command line

Added by okurz 12 months ago. Updated 12 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
2023-06-28
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

While working on #131546 we realized that worker31 does not reboot into a valid system and gets stuck in a grub command line. This might be due to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030 . This can be critical: if more systems reboot and get stuck in the same state we will run into a serious problem, at the latest when the next maintenance reboot window, e.g. next Sunday, is reached. So well before that we should ensure that systems do not reboot into an unbootable state.

Rollback steps

  • DONE Add worker3{0,2} back to salt
  • Re-enable rebootmgr
Actions #2

Updated by okurz 12 months ago

  • Subject changed from Additional redundancy for OSD virtualization testing - Xen worker host size:M to worker31 and likely more OSD machines get stuck on boot in grub command line
Actions #3

Updated by dheidler 12 months ago

  • Status changed from New to In Progress
  • Assignee set to dheidler
Actions #4

Updated by livdywan 12 months ago

Switching off automatic reboots as a precaution: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1034

Actions #5

Updated by livdywan 12 months ago

Removing worker31 from salt:

sudo salt-key -y -d worker31.oqa.prg2.suse.org

Note: I accidentally removed worker32 as well and will fix that shortly. Update: added back.

worker30 is looking unresponsive now, suffering the same symptom, hence also unsalted.

Actions #6

Updated by dheidler 12 months ago

I rebooted worker30 to check whether the problem occurs on all workers.
It does.

The problem is that /boot/grub2/grub.cfg is empty.
This is also the case e.g. on worker32.

Actions #7

Updated by dheidler 12 months ago

How to boot into the installed system from the grub shell (some kernel args are missing, e.g. some virtualisation options, so consider this a rescue system).

cat (hd0,gpt2)/etc/fstab

# remember UUID of /

search --no-floppy --fs-uuid --set=root 4de90074-9348-4235-b53a-d1d38f4cc0bc
linux /boot/vmlinuz-5.14.21-150500.55.31-default root=UUID=4de90074-9348-4235-b53a-d1d38f4cc0bc console=tty0 console=ttyS1,115200
initrd /boot/initrd-5.14.21-150500.55.31-default
boot
Actions #8

Updated by dheidler 12 months ago

Afterwards execute this and reboot:

grub2-mkconfig -o /boot/grub2/grub.cfg
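
Before rebooting it may be worth checking that the regenerated file is actually usable. A minimal sketch, assuming that a valid grub.cfg is non-empty and contains at least one menuentry line; the function name grub_cfg_looks_valid is hypothetical and not part of any tool mentioned in this ticket:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given file is non-empty and
# contains at least one "menuentry" line. This is a heuristic sanity check,
# not a full validation of the grub configuration.
grub_cfg_looks_valid() {
    cfg=${1:-/boot/grub2/grub.cfg}
    [ -s "$cfg" ] && grep -q '^[[:space:]]*menuentry ' "$cfg"
}
```

Possible usage on a rescued host: grub_cfg_looks_valid && echo "ok to reboot"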
Actions #9

Updated by dheidler 12 months ago

This problem appears on most salt machines, even ones that are not worker hosts in the salt context (e.g. piworker):

openqa:/home/dheidler # salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg'
worker37.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker40.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker39.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker33.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker29.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker38.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
s390zl12.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:14 /boot/grub2/grub.cfg
worker32.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker34.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker35.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
s390zl13.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:14 /boot/grub2/grub.cfg
worker-arm1.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker-arm2.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker36.oqa.prg2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
backup-qam.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker1.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker1.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker3.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker2.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
petrol.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qamaster.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker16.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker17.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
ada.qe.suse.de:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
imagetester.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 10:15 /boot/grub2/grub.cfg
openqa.suse.de:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg4.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg5.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaw5-xen.qa.suse.de:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg6.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
diesel.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
backup-vm.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg7.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker18.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
jenkins.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker14.qa.suse.cz:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
monitor.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 11:55 /boot/grub2/grub.cfg
tumblesle.qe.nue2.suse.org:
    -rw------- 1 root root 8055 Oct 23 14:28 /boot/grub2/grub.cfg
unreal6.qe.nue2.suse.org:
    -rw------- 1 root root 16489 Oct 23 15:39 /boot/grub2/grub.cfg
openqa-piworker.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
schort-server.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
baremetal-support.qe.nue2.suse.org:
    -rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
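
To pick out the affected hosts from a listing like the one above, the output can be filtered for zero-byte files. This is only a sketch: it assumes the two-line-per-minion output format shown above (minion name ending in a colon, followed by one indented ls -l line), and the function name list_empty_grub_cfg is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads `salt ... cmd.run 'ls -l <file>'` output on
# stdin and prints the names of minions whose file has size 0.
list_empty_grub_cfg() {
    awk '
        # Unindented line ending in ":" is a minion header; remember the name.
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # First result line after a header: field 5 of `ls -l` is the size.
        host != "" { if ($5 == "0") print host; host = "" }
    '
}
```

Possible usage on the salt master: salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg' | list_empty_grub_cfg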
Actions #10

Updated by dheidler 12 months ago

Executed this on osd (salt master):

salt "*" cmd.run 'grub2-mkconfig -o /boot/grub2/grub.cfg'

Output can be found in openqa:/home/dheidler/grub2-mkconfig.log.

Now all salt machines have a non-empty /boot/grub2/grub.cfg again.

worker30 and worker31 have manually been fixed as described above.

Actions #11

Updated by dheidler 12 months ago

Added worker30 and worker31 back to salt:

salt-key -a worker30.oqa.prg2.suse.org
salt-key -a worker31.oqa.prg2.suse.org
Actions #12

Updated by livdywan 12 months ago

  • Description updated (diff)
Actions #13

Updated by okurz 12 months ago

  • Description updated (diff)

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1035 to revert the anyway ineffective reboot disablement. I will just assume that I caused the original problem with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030, right?

Actions #14

Updated by nicksinger 12 months ago

okurz wrote in #note-13:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1035 to revert the anyway ineffective reboot disablement. I will just assume that I caused the original problem with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030, right?

Kind of, yes. The syntax error in /etc/default/grub introduced by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030/diffs#6d941bec6b3bc75b45972bf3fa461af5cdc33111_18_19 caused our grub2-mkconfig call to produce only an error message and an empty stdout, and we then overwrote /boot/grub2/grub.cfg with that empty output. I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1036, which should hopefully avoid that problem by not touching the original file if generating the config fails.
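
The general idea of not touching the original file when generation fails can be sketched as follows. This is not the content of the merge request, only an illustration under the assumption that the generator writes the config to stdout; the function name replace_if_generated is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper, not the actual change from the merge request:
# run a generator command with stdout redirected to a temporary file next
# to the target, and replace the target only if the command succeeded AND
# produced non-empty output. Otherwise the target is left untouched.
replace_if_generated() {
    target=$1; shift
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp" && [ -s "$tmp" ]; then
        # Rename within the same directory so the swap is a single step.
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Possible usage: replace_if_generated /boot/grub2/grub.cfg grub2-mkconfig. With a broken /etc/default/grub the generator would fail or print nothing, and the previous grub.cfg would stay in place.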

Actions #15

Updated by okurz 12 months ago

I found multiple debug leftovers from nicksinger:

sudo salt \* cmd.run 'grep crashkernel=211 /etc/default/grub'
s390zl12.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
backup-qam.qe.nue2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker29.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker40.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
qesapworker-prg5.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
openqaworker14.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker38.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker36.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker31.oqa.prg2.suse.org:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
ada.qe.suse.de:
openqaworker18.qa.suse.cz:
    GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
…

removed with

sudo salt \* cmd.run 'sed -i "/crashkernel=211/d" /etc/default/grub'
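
After bulk edits to /etc/default/grub like this one, a syntax check could catch the kind of error described in #note-14 before the next config generation. A minimal sketch, relying on the fact that grub2-mkconfig sources /etc/default/grub as a shell fragment; the function name grub_defaults_syntax_ok is hypothetical, and the check only detects parse errors, not wrong values:

```shell
#!/bin/sh
# Hypothetical helper: parse the file as shell without executing it.
# A parse error here (e.g. an unterminated quote) would also break
# grub2-mkconfig, which sources this file.
grub_defaults_syntax_ok() {
    sh -n "${1:-/etc/default/grub}" 2>/dev/null
}
```

Possible usage across all minions, assuming the function body is inlined: sudo salt \* cmd.run 'sh -n /etc/default/grub'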
Actions #16

Updated by openqa_review 12 months ago

  • Due date set to 2023-11-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by dheidler 12 months ago

  • Status changed from In Progress to Resolved

Unless we want to post-estimate this issue or conduct a more detailed RCA, this should be resolved.

Actions #18

Updated by okurz 12 months ago

  • Due date deleted (2023-11-07)