action #160514

closed

qamaster is down, i.e. also no monitoring from monitor.qe.nue2.suse.org

Added by okurz about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-05-19
Due date:
% Done: 0%
Estimated time:
Description

Observation

Trying to access https://monitor.qa.suse.de/d/ showed that monitor.qe.nue2.suse.org is down and that qamaster is not accessible either. The IPMI interface is still up, though.

Actions #1

Updated by okurz about 2 months ago

Added IPMI credentials with https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/813

Running ipmitool -Ilanplus -H qamaster-sp.qe.nue2.suse.org … power reset followed by sol activate shows the initial firmware initialization with "B2" in the lower right corner and then a black screen, with no change for 5 minutes. I then tried ipmitool -Ilanplus -H qamaster-sp.qe.nue2.suse.org … power off && sleep 180 && ipmitool -Ilanplus -H qamaster-sp.qe.nue2.suse.org … power on.

Next I used "IPMIView" with the internal "Java iKVM Viewer". That showed that grub tries and fails to load with error: ../../grub-core/kern/dl.c:380:symbol 'grub_verify_string' not found. Using IPMIView I selected boot from PXE and rebooted, then selected Tumbleweed ttyS1 and tried rescue. I didn't see anything on the serial terminal but could log in with ssh_nt root@qamaster.qe.nue2.suse.org once ping responded.
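The power-cycle sequence above could be wrapped roughly as follows. This is only a sketch: the BMC hostname comes from this ticket, while IPMI_USER, DRY_RUN and the function names are hypothetical conventions, and the credentials elided above are assumed to be supplied through ipmitool's -U option and the IPMI_PASSWORD environment variable (read via -E). By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. The BMC hostname is taken from the ticket; everything else
# (IPMI_USER, DRY_RUN, the function names) is a hypothetical convention.
BMC_HOST="qamaster-sp.qe.nue2.suse.org"

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One ipmitool call against the BMC. -E makes ipmitool read the password
# from the IPMI_PASSWORD environment variable instead of the command line.
ipmi() {
    run ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -E "$@"
}

# Hard power cycle as in the comment above: off, wait, on.
power_cycle() {
    wait_seconds="${1:-180}"
    ipmi power off && run sleep "$wait_seconds" && ipmi power on
}
```

With IPMI_USER set, calling power_cycle 180 prints the three commands; setting DRY_RUN=0 would execute them. ipmitool also offers chassis bootdev pxe to request a PXE boot, which might replace the IPMIView step, but that was not what was done here.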

Wrong approach leading to read-only mount…

I looked up the storage volumes, found /dev/sdc2, then mounted it and chroot'd into it:

mkdir -p /mnt/sdc2
mount /dev/sdc2 /mnt/sdc2
for i in proc sys dev dev/pts run ; do mount -o bind /$i /mnt/sdc2/$i; done
chroot /mnt/sdc2
mount -a

In there I first reproduced the problem without needing to reboot the physical machine:

qemu-system-x86_64 -snapshot -nographic /dev/sdc

This shows the original problem quickly. I then ran

grub2-install /dev/sdc

and then the qemu command verified that the problem was fixed. I triggered a reboot with echo b >/proc/sysrq-trigger. After the reboot the system booted up fine again and the VMs are up again as well.
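The before/after check with qemu could be scripted along these lines. This is a rough sketch and has not been run against this machine: the function names and the 60 second default are hypothetical, timeout is the coreutils tool, and the match assumes the GRUB message arrives on qemu's stdout as one contiguous string, which terminal escape sequences in the output could break.

```shell
#!/bin/sh
# Return 0 if the boot output on stdin contains the GRUB error from this ticket.
has_grub_symbol_error() {
    grep -q "symbol 'grub_verify_string' not found"
}

# Boot the disk in a throwaway VM and report whether the error shows up.
# -snapshot keeps all writes off the real disk; timeout stops qemu after
# the given number of seconds (hypothetical default: 60).
check_disk_boot() {
    disk="$1"
    seconds="${2:-60}"
    if timeout "$seconds" qemu-system-x86_64 -snapshot -nographic "$disk" \
        </dev/null 2>&1 | has_grub_symbol_error; then
        echo "broken: GRUB symbol error seen"
        return 1
    fi
    echo "no GRUB symbol error within ${seconds}s"
}
```

Running check_disk_boot /dev/sdc before and after grub2-install /dev/sdc would then give a quick pass/fail signal. A clean result only means the error string was not seen in the time window, not that the system booted fully.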

Actions #2

Updated by okurz about 2 months ago

  • Status changed from In Progress to Resolved