action #138350
worker31 and likely more OSD machines get stuck on boot in grub command line
Description
Motivation
While working on #131546 we realized that worker31 does not reboot into a valid system and instead gets stuck at the grub command line. This might be caused by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030. It can become critical: if more systems reboot and get stuck in the same state we will run into a serious problem, at the latest when the next maintenance reboot window (e.g. next Sunday) is reached. Well before then we should ensure that systems do not reboot into an unbootable state.
Rollback steps
- DONE Add worker3{0,2} back to salt
- Re-enable rebootmgr
Updated by okurz about 1 year ago
- Subject changed from Additional redundancy for OSD virtualization testing - Xen worker host size:M to worker31 and likely more OSD machines get stuck on boot in grub command line
Updated by dheidler about 1 year ago
- Status changed from New to In Progress
- Assignee set to dheidler
Updated by livdywan about 1 year ago
Switching off automatic reboots as a precaution: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1034
Updated by livdywan about 1 year ago
Removing worker31 from salt:
sudo salt-key -y -d worker31.oqa.prg2.suse.org
Note: I accidentally removed worker32 as well; it has since been added back.
worker30 is looking unresponsive now, suffering the same symptom, hence also unsalted.
Updated by dheidler about 1 year ago
I rebooted worker30 to check whether the problem occurs on all workers.
It does.
The problem is that /boot/grub2/grub.cfg is empty.
This is also the case e.g. on worker32.
Updated by dheidler about 1 year ago
How to boot into the installed system from the grub shell. Some kernel arguments are missing (e.g. some virtualisation options), so treat the result as a rescue system:
cat (hd0,gpt2)/etc/fstab
# remember UUID of /
search --no-floppy --fs-uuid --set=root 4de90074-9348-4235-b53a-d1d38f4cc0bc
linux /boot/vmlinuz-5.14.21-150500.55.31-default root=UUID=4de90074-9348-4235-b53a-d1d38f4cc0bc console=tty0 console=ttyS1,115200
initrd /boot/initrd-5.14.21-150500.55.31-default
boot
Updated by dheidler about 1 year ago
Afterwards execute this and reboot:
grub2-mkconfig -o /boot/grub2/grub.cfg
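Before rebooting it may be worth confirming that the regenerated file is actually usable. A minimal sketch, assuming that a working grub.cfg is non-empty and contains at least one menuentry line; the function name grub_cfg_looks_valid is made up for illustration and the check is only a heuristic:

```shell
# Heuristic sanity check (hypothetical helper, not from the ticket):
# succeed only if the file is non-empty and has at least one "menuentry" line.
grub_cfg_looks_valid() {
    cfg=${1:-/boot/grub2/grub.cfg}
    [ -s "$cfg" ] && grep -q '^[[:space:]]*menuentry' "$cfg"
}
```

For example, `grub_cfg_looks_valid && echo "ok to reboot"` prints the message only if the default path passes both checks.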
Updated by dheidler about 1 year ago
This problem affects most salt-managed machines, including ones that are not worker hosts in the salt context (e.g. piworker):
openqa:/home/dheidler # salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg'
worker37.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker40.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker39.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker33.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker29.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker38.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
s390zl12.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:14 /boot/grub2/grub.cfg
worker32.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker34.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker35.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
s390zl13.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:14 /boot/grub2/grub.cfg
worker-arm1.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker-arm2.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
worker36.oqa.prg2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
backup-qam.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker1.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker1.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker3.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
sapworker2.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
petrol.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qamaster.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker16.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker17.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
ada.qe.suse.de:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
imagetester.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 10:15 /boot/grub2/grub.cfg
openqa.suse.de:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg4.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg5.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaw5-xen.qa.suse.de:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg6.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
diesel.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
backup-vm.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
qesapworker-prg7.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker18.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
jenkins.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
openqaworker14.qa.suse.cz:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
monitor.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 11:55 /boot/grub2/grub.cfg
tumblesle.qe.nue2.suse.org:
-rw------- 1 root root 8055 Oct 23 14:28 /boot/grub2/grub.cfg
unreal6.qe.nue2.suse.org:
-rw------- 1 root root 16489 Oct 23 15:39 /boot/grub2/grub.cfg
openqa-piworker.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
schort-server.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
baremetal-support.qe.nue2.suse.org:
-rw------- 1 root root 0 Oct 23 12:15 /boot/grub2/grub.cfg
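To reduce output like the above to just the affected machines, the listing can be filtered on the size column. A rough sketch, assuming the salt output keeps the shape shown here (a minion name ending in a colon, followed by one `ls -l` line); the function name list_empty_grub_cfg_hosts is hypothetical:

```shell
# Reads `salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg'` style output on stdin
# and prints the minions whose listed file has size 0.
list_empty_grub_cfg_hosts() {
    awk '
        # Minion header line, e.g. "worker37.oqa.prg2.suse.org:"
        /:$/ { host = $1; sub(/:$/, "", host); next }
        # Regular-file line from ls -l; the 5th field is the size in bytes
        $1 ~ /^-/ && NF >= 5 && $5 == 0 { print host }
    '
}
```

Usage would be along the lines of `salt "*" cmd.run 'ls -l /boot/grub2/grub.cfg' | list_empty_grub_cfg_hosts`.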
Updated by dheidler about 1 year ago
Executed this on osd (salt master):
salt "*" cmd.run 'grub2-mkconfig -o /boot/grub2/grub.cfg'
Output can be found in openqa:/home/dheidler/grub2-mkconfig.log.
Now all salt machines have a non-empty /boot/grub2/grub.cfg again.
worker30 and worker31 were fixed manually as described above.
Updated by dheidler about 1 year ago
Added worker30 and worker31 back to salt:
salt-key -a worker30.oqa.prg2.suse.org
salt-key -a worker31.oqa.prg2.suse.org
Updated by okurz about 1 year ago
- Description updated (diff)
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1035 to revert the anyway ineffective reboot disablement. I will just assume that I caused the original problem with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030, right?
Updated by nicksinger about 1 year ago
okurz wrote in #note-13:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1035 to revert the anyway ineffective reboot disablement. I will just assume that I caused the original problem with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030, right?
Kind of, yes. The syntax error in /etc/default/grub introduced by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1030/diffs#6d941bec6b3bc75b45972bf3fa461af5cdc33111_18_19 caused our grub2-mkconfig to just generate an error message and an empty stdout, which in turn resulted in us overwriting /boot/grub2/grub.cfg with the empty output. I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1036, which should hopefully avoid that problem by not touching the original file if generation fails.
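The general idea of not touching the original file on failure can be sketched as follows. This is not necessarily what the merge request does; it is a generic write-to-temp-then-rename pattern, and the helper name write_if_generated is made up for illustration:

```shell
# Hypothetical helper: run a generator command and replace TARGET with its
# stdout only if the command succeeded AND produced non-empty output.
# Usage: write_if_generated TARGET COMMAND [ARGS...]
write_if_generated() {
    target=$1
    shift
    # Temp file next to the target so the final mv stays on one filesystem
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp" && [ -s "$tmp" ]; then
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"
        echo "generation failed, leaving $target untouched" >&2
        return 1
    fi
}
```

Something like `write_if_generated /boot/grub2/grub.cfg grub2-mkconfig` would then keep the old configuration whenever grub2-mkconfig fails or prints nothing to stdout.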
Updated by okurz about 1 year ago
I found multiple debug leftovers from nicksinger:
sudo salt \* cmd.run 'grep crashkernel=211 /etc/default/grub'
s390zl12.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
backup-qam.qe.nue2.suse.org:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker29.oqa.prg2.suse.org:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker40.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
qesapworker-prg5.qa.suse.cz:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
openqaworker14.qa.suse.cz:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker38.oqa.prg2.suse.org:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker36.oqa.prg2.suse.org:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
worker31.oqa.prg2.suse.org:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
ada.qe.suse.de:
openqaworker18.qa.suse.cz:
GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=211M"
…
Removed with:
sudo salt \* cmd.run 'sed -i "/crashkernel=211/d" /etc/default/grub'
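Since /etc/default/grub is edited in place here and grub2-mkconfig sources that file as shell code, a parse check after such edits could catch the kind of syntax error that started this ticket. A small sketch; the function name grub_defaults_syntax_ok is hypothetical, and `sh -n` only detects parse errors, not wrong values:

```shell
# Returns 0 if the file parses as shell code.
# `sh -n` reads the script and checks syntax without executing anything.
grub_defaults_syntax_ok() {
    sh -n "${1:-/etc/default/grub}" 2>/dev/null
}
```

This could be run across all machines before regenerating, e.g. with `sudo salt \* cmd.run 'sh -n /etc/default/grub'`.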
Updated by openqa_review about 1 year ago
- Due date set to 2023-11-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler about 1 year ago
- Status changed from In Progress to Resolved
Unless we want to post-estimate this issue or conduct a more detailed RCA, this should be resolved.