action #166136
[closed] s390 LPAR s390ZL12 down and unable to boot - potentially corrupted filesystem
Added by mgriessmeier 3 months ago. Updated 2 months ago.
Description
Motivation
Since Sep 1st, ~03:30 AM, there has been an issue with s390 LPAR ZL12,
initially reported in https://suse.slack.com/archives/C02CANHLANP/p1725240063815389
also tracked in https://sd.suse.com/servicedesk/customer/portal/1/SD-166885
https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&from=1725109294762&to=1725257278550 shows the timeframe of the event
Hardware messages in the zHMC could point to a potentially corrupted filesystem:
Central processor (CP) 0 in partition ZL12, entered disabled wait state.
The disabled wait program status word (PSW) is 00020001800000000000000000012370.
Serial output:
uncompression error
-- System halted
Mitigations
- Unsilence host_up alert
Updated by mgriessmeier 3 months ago
livdywan wrote in #note-4:
I'm guessing this is related?
[FIRING:1] s390zl12 (s390zl12: host up alert host_up Generic host_up_alert_s390zl12 generic)
Yes, most likely. I re-installed the host on a different disk so we'd still be able to investigate the broken one and potentially recover some data (which we actually don't need, but just in case).
I didn't finish the set-up though, so I would keep the workers disabled for the time being.
Updated by openqa_review 3 months ago
- Due date set to 2024-09-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 3 months ago
- Assignee changed from mgriessmeier to nicksinger
Updated by nicksinger 3 months ago
- Description updated (diff)
So apparently update-bootloader on s390x calls grub2-zipl-setup, which in turn takes care of setting up zipl, a bootloader running just before the grub2 we know. This zipl setup failed because of:
# Error: Could not add internal loader file '/lib/s390-tools/stage3.bin': No such file or directory
# Using config file '/boot/zipl/config' (from command line)
# Building bootmap in '/boot/zipl'
# Building menu 'menu'
# Adding #1: IPL section 'grub2' (default)
# /sbin/zipl: Failed
# /usr/sbin/grub2-install: error: `grub2-zipl-setup' failed.
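For completeness, the same chain can be exercised manually; a rough sketch, assuming the usual SLE s390x tooling (update-bootloader, grub2-zipl-setup and zipl as referenced in the error above) and that it will only succeed once stage3.bin is present again:
update-bootloader --refresh      # re-runs grub2-install, which in turn calls grub2-zipl-setup
# or poke the lower layers directly:
grub2-zipl-setup                 # regenerates the zipl setup under /boot/zipl and invokes zipl
zipl -V -c /boot/zipl/config     # verbose zipl run against the generated config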
The stage3.bin file is managed by the package s390-tools-genprotimg-data, which was installed on zl12 according to the package database, yet the file was missing from the filesystem. We found no apparent reason why this happened, only hints of recent updates (https://build.opensuse.org/request/show/1196040). Forcing a reinstall with zypper in -f s390-tools-genprotimg-data, followed by zypper in -f grub2-s390x-emu (which regenerates everything in its postinstall hooks, including zipl, dracut and grub2), made the file appear again and the zipl config was successfully generated again. This made the system boot again.
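Condensed, the recovery looked roughly like this (the rpm -V and ls calls are verification steps added here for illustration, not part of the original fix):
rpm -V s390-tools-genprotimg-data         # reports the payload file(s) as "missing"
zypper in -f s390-tools-genprotimg-data   # restores /lib/s390-tools/stage3.bin
zypper in -f grub2-s390x-emu              # postinstall hooks regenerate zipl, dracut/initrd and grub2
ls -l /lib/s390-tools/stage3.bin          # confirm the internal loader file is back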
Updated by nicksinger 3 months ago
We found pretty much nothing about what "uncompression error -- System halted" means or where it comes from. Matthias found the "PSW restart" function in the zHMC, which apparently sends some kind of "dump signal" to the LPAR and causes the kernel to print something about its state:
uncompression error
-- System halted
Linux version 5.14.21-150500.55.73-default (geeko@buildhost) #1 SMP Tue Aug 6 15:51:33 UTC 2024 (a0ede6a)
Kernel command line: root=/dev/disk/by-path/ccw-0.0.6000-part2 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 initgrub quiet splash=silent plymouth.enable=0 BOOT_IMAGE=0
Kernel fault: interruption code 0001 ilc:1
Kernel random base: 8453c8000
PSW : 0000000000000000 0000000000000002 (0000000000000002)
R:0 T:0 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 RI:0 EA:0
GPRS: 000000000000001b 0000000000012370 0000000000000000 0000000000011014
00000000000123b2 00000000000148b0 00000000007fc000 0000000000026c38
00000000014b2000 0000000000026c38 00000000007d4829 000000000001b79e
0000000000000380 0000000000019d08 00000000000123b2 000000000000be68
Call Trace:
(sp:000000000000be68 <00000000000123b2>| error+0x42/0x62)
sp:000000000000bea8 <0000000000019970>| decompress_kernel+0x298/0x2a4
sp:000000000000bf00 <0000000000012632>| startup_kernel+0x25a/0x518
sp:000000000000bf60 <00000000000100d0>| startup_normal+0xb0/0xb0
Last Breaking-Event-Address:
<00000000000123ca>| error+0x5a/0x62
We saw that the system used .73-default in that early zipl stage, which was also apparent in /boot/zipl/config because the default zipl entry "grub2" points to image = /boot/zipl/image, which in turn is a symlink: /boot/zipl/image -> image-5.14.21-150500.55.73-default. This was the only difference we spotted compared to zl13, which used .68-default. In several (failed) attempts to bring zl12 to use a .68 image, we discovered that one can use the "Load parameter" in the zHMC to specify a zipl menu selection (as defined in /boot/zipl/config). We used load parameter "4" to boot "grub2-previous", which used a .68 kernel in the early zipl stage. This made the system boot, and we were able to try to reproduce what might have happened during an (automated?) system update.
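For future debugging, the relevant pieces can be checked directly on the host; a small sketch with the paths from above (the grep pattern is only illustrative, the generated config layout may differ):
ls -l /boot/zipl/image                               # which kernel the default "grub2" entry boots in the early zipl stage
grep -E '^[[:space:]]*[0-9]+ *=' /boot/zipl/config   # menu numbers usable as zHMC "Load parameter" (here: 4 = grub2-previous)
cat /boot/zipl/config                                # full view of the sections, e.g. [grub2] and [grub2-previous]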
While trying to generate our own zipl configuration, we quickly discovered the errors mentioned in https://progress.opensuse.org/issues/166136?issue_count=234&issue_position=4&next_issue_id=165860&prev_issue_id=166169#note-9 and fixed them. We unfortunately don't understand how the file went missing in the first place. Either we really had some hardware corruption (highly unlikely, several logs and layers showed no issues), or some migration path in the upstream package might be bad, which we couldn't find. Currently zl13 is fine despite being slightly different from zl12 (e.g. the kernel used in zipl), so we have no reproducer.
As zl12+zl13 are currently used as generic hosts in our salt infra (https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/top.sls?ref_type=heads#L2), we might need to extend our states with some explicit s390 packages (e.g. s390-tools and s390-tools-genprotimg-data).
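A minimal sketch of what that could look like; the target glob below and doing it as a one-off via the pkg execution module (rather than a pkg.installed entry in the states repo) are assumptions for illustration:
# from the salt master; for persistence this would additionally need a matching pkg.installed entry in the states
salt 's390zl1*' pkg.install pkgs='["s390-tools", "s390-tools-genprotimg-data"]'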
Updated by livdywan 2 months ago
- Related to action #163778: [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S added
Updated by nicksinger 2 months ago
- Status changed from In Progress to Resolved
We had two keys for zl12 in our salt database (caused by the new installation). I removed both with salt-key -d s390zl12.oqa.prg2.suse.org, restarted salt-minion on zl12 and accepted its key again with salt-key -a s390zl12.oqa.prg2.suse.org. I currently have no idea how we could improve here and avoid this situation in the future. As we don't know whether this was just a one-off incident, and we don't (currently) face the issue on zl13, I will resolve the ticket now.
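For reference, the key cleanup boils down to the following sequence (hostname from above; the test.ping call is an extra verification step, not mentioned in the ticket):
salt-key -d s390zl12.oqa.prg2.suse.org   # on the master: delete both (stale + new) keys for the reinstalled host
systemctl restart salt-minion            # on zl12: the minion re-submits its key to the master
salt-key -a s390zl12.oqa.prg2.suse.org   # on the master: accept the freshly submitted key
salt 's390zl12*' test.ping               # verify the minion responds again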
Updated by nicksinger 2 months ago
- Status changed from Workable to Resolved
mkittler wrote in #note-14:
The alert was firing again today at 14:29 CEST. If you think this was just because the alert was unsilenced too soon you can resolve the ticket again. (The alert was resolved again at 14:54 CEST.)
Was this about the partition usage? Because that's the only thing I can find on the worker dashboard. Anyhow, I don't think this is related, and if we see it again soon we will have to work on that host on a different level (improve the cleanup).