action #166136

closed

s390 LPAR s390ZL12 down and unable to boot - potential corrupted filesystem

Added by mgriessmeier 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-09-02
Due date: 2024-09-17
% Done: 0%
Estimated time:
Tags:

Description

Motivation

Since Sep 1st, ~03:30 AM, there has been an issue with s390 LPAR ZL12

initially reported in https://suse.slack.com/archives/C02CANHLANP/p1725240063815389
also tracked in https://sd.suse.com/servicedesk/customer/portal/1/SD-166885

https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&from=1725109294762&to=1725257278550 shows the timeframe of the event

Hardware messages in the zHMC could point to a potentially corrupted filesystem:

Central processor (CP) 0 in partition ZL12, entered disabled wait state. 
The disabled wait program status word (PSW) is 00020001800000000000000000012370. 

Serial output:

uncompression error
 -- System halted

Mitigations

  • Unsilence host_up alert

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #163778: [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S (Resolved, nicksinger, 2024-07-11 to 2024-08-06)

Actions #2

Updated by tinita 4 months ago

  • Target version set to Ready
Actions #3

Updated by mgriessmeier 4 months ago

  • Status changed from New to In Progress
Actions #4

Updated by livdywan 4 months ago · Edited

I'm guessing this is related?

[FIRING:1] s390zl12 (s390zl12: host up alert host_up Generic host_up_alert_s390zl12 generic)
Actions #5

Updated by mgriessmeier 4 months ago

livdywan wrote in #note-4:

I'm guessing this is related?

[FIRING:1] s390zl12 (s390zl12: host up alert host_up Generic host_up_alert_s390zl12 generic)

yes, most likely - I re-installed the host on a different disk so we'd still be able to investigate the broken one and potentially recover some data (which we don't actually need, but just in case).
I didn't finish the set-up though, so I would keep the workers disabled for the time being.

Actions #6

Updated by openqa_review 4 months ago

  • Due date set to 2024-09-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by livdywan 4 months ago

  • Description updated (diff)
Actions #8

Updated by nicksinger 4 months ago

  • Assignee changed from mgriessmeier to nicksinger
Actions #9

Updated by nicksinger 4 months ago

  • Description updated (diff)

So apparently update-bootloader on s390x calls grub2-zipl-setup, which in turn takes care of setting up zipl - a bootloader that runs just before the grub2 we know. This zipl setup failed because of:

# Error: Could not add internal loader file '/lib/s390-tools/stage3.bin': No such file or directory
# Using config file '/boot/zipl/config' (from command line)
# Building bootmap in '/boot/zipl'
# Building menu 'menu'
# Adding #1: IPL section 'grub2' (default)
# /sbin/zipl: Failed
# /usr/sbin/grub2-install: error: `grub2-zipl-setup' failed.

The stage3.bin file is managed by the package s390-tools-genprotimg-data, which was installed on zl12, but the file itself was missing from the filesystem. We found no apparent reason why this happened, only hints of recent updates (https://build.opensuse.org/request/show/1196040). Forcing a reinstall with zypper in -f s390-tools-genprotimg-data followed by zypper in -f grub2-s390x-emu (which generates everything in its post-install hooks, including zipl, dracut and grub2) made the file appear again and the zipl config was generated successfully. This made the system boot again.
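For reference, a condensed sketch of the recovery path described above; the zypper commands are taken from this comment, while the final ls is just an illustrative sanity check:

# Re-extract the missing loader file, then let grub2-s390x-emu's post-install
# hooks regenerate zipl, dracut and grub2 (run on the affected host)
zypper in -f s390-tools-genprotimg-data
zypper in -f grub2-s390x-emu

# Sanity check: the internal loader file zipl complained about should exist again
ls -l /lib/s390-tools/stage3.bin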

Actions #10

Updated by nicksinger 4 months ago

We found pretty much nothing about what "uncompression error -- System halted" means and where it comes from. Matthias found the "PSW restart" function in the zHMC, which apparently sends some kind of "dump signal" to the LPAR and causes the kernel to print some information about its state:

uncompression error
 -- System halted
Linux version 5.14.21-150500.55.73-default (geeko@buildhost) #1 SMP Tue Aug 6 15:51:33 UTC 2024 (a0ede6a)
Kernel command line: root=/dev/disk/by-path/ccw-0.0.6000-part2    nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1  initgrub quiet splash=silent plymouth.enable=0  BOOT_IMAGE=0
Kernel fault: interruption code 0001 ilc:1
Kernel random base: 8453c8000
PSW : 0000000000000000 0000000000000002 (0000000000000002)
      R:0 T:0 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 RI:0 EA:0
GPRS: 000000000000001b 0000000000012370 0000000000000000 0000000000011014
      00000000000123b2 00000000000148b0 00000000007fc000 0000000000026c38
      00000000014b2000 0000000000026c38 00000000007d4829 000000000001b79e
      0000000000000380 0000000000019d08 00000000000123b2 000000000000be68
Call Trace:
(sp:000000000000be68  <00000000000123b2>| error+0x42/0x62)
 sp:000000000000bea8  <0000000000019970>| decompress_kernel+0x298/0x2a4
 sp:000000000000bf00  <0000000000012632>| startup_kernel+0x25a/0x518
 sp:000000000000bf60  <00000000000100d0>| startup_normal+0xb0/0xb0
Last Breaking-Event-Address:
  <00000000000123ca>| error+0x5a/0x62

We saw that the system used .73-default in that early zipl stage. This was also apparent in /boot/zipl/config: the default zipl entry "grub2" points to image = /boot/zipl/image, which is a symlink to image-5.14.21-150500.55.73-default. This was the only difference we spotted compared to zl13, which used .68-default. After several failed attempts to make zl12 use a .68 image, we discovered that one can use the "Load parameter" in the zHMC to specify a zipl menu selection (as defined in /boot/zipl/config). We used load parameter "4" to boot "grub2-previous", which used a .68 kernel in the early zipl stage. This made the system boot, and we were able to try to reproduce what might have happened during an (automated?) system update.
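For illustration, a minimal way to inspect this on the host; the commands are generic, and the concrete menu numbering on zl12 (load parameter 4 = "grub2-previous") is as described above:

# Which kernel image does the default zipl entry actually boot?
# (On zl12 this symlink pointed at the .73 kernel.)
ls -l /boot/zipl/image

# Show the zipl menu; the numbered entries in the ":menu" section are what the
# zHMC "Load parameter" selects, e.g. load parameter 4 -> "grub2-previous"
cat /boot/zipl/config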

While trying to generate our own zipl configuration, we quickly discovered the errors mentioned in https://progress.opensuse.org/issues/166136?issue_count=234&issue_position=4&next_issue_id=165860&prev_issue_id=166169#note-9 and fixed them. We unfortunately don't understand how the file went missing in the first place. Either we really had some hardware corruption (highly unlikely; several logs and layers showed no issues) or some migration path in the upstream package is broken in a way we could not find. Currently zl13 is fine despite being slightly different from zl12 (e.g. the kernel used in zipl), so we have no reproducer.

As zl12+zl13 are currently used as generic hosts in our salt infra (https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/top.sls?ref_type=heads#L2), we might need to extend our states with some explicit s390 packages (e.g. s390-tools and s390-tools-genprotimg-data), as sketched below.
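A possible one-off check and fix from the salt master, assuming the minion IDs follow the s390zl12.oqa.prg2.suse.org naming; the persistent variant would be a package entry in salt-states-openqa:

# Check whether the s390 packages are installed on the zl hosts, then install them
salt 's390zl*.oqa.prg2.suse.org' pkg.info_installed s390-tools s390-tools-genprotimg-data
salt 's390zl*.oqa.prg2.suse.org' pkg.install pkgs='["s390-tools", "s390-tools-genprotimg-data"]'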

Actions #11

Updated by livdywan 4 months ago

  • Tags set to infra
Actions #12

Updated by livdywan 3 months ago

  • Related to action #163778: [alert] host_up & Average Ping time (ms) alert for s390zl12&s390zl13 size:S added
Actions #13

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

We had two keys for zl12 in our salt database (caused by the new installation). I removed both with salt-key -d s390zl12.oqa.prg2.suse.org, restarted salt-minion on zl12 and accepted its key again with salt-key -a s390zl12.oqa.prg2.suse.org (condensed below). I currently have no idea how we could improve here and avoid this situation in the future. As we don't know if this was just a one-off incident, and we don't (currently) face the issue on zl13, I will resolve this now.
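The key cleanup, condensed; the salt-key commands are as stated above, while the exact way salt-minion was restarted on zl12 is an assumption:

# On the salt master: drop the duplicate keys for zl12
salt-key -d s390zl12.oqa.prg2.suse.org

# On zl12: restart the minion so it re-submits its key
systemctl restart salt-minion

# Back on the master: accept the freshly submitted key
salt-key -a s390zl12.oqa.prg2.suse.org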

Actions #14

Updated by mkittler 3 months ago

  • Status changed from Resolved to Workable

The alert was firing again today at 14:29 CEST. If you think this was just because the alert was unsilenced too soon you can resolve the ticket again. (The alert was resolved again at 14:54 CEST.)

Actions #15

Updated by nicksinger 3 months ago

  • Status changed from Workable to Resolved

mkittler wrote in #note-14:

The alert was firing again today at 14:29 CEST. If you think this was just because the alert was unsilenced too soon you can resolve the ticket again. (The alert was resolved again at 14:54 CEST.)

Was this about the partition usage? That's the only thing I can find on the worker dashboard. Anyhow, I don't think this is related, and if we see it again soon, we will have to work on that host at a different level (improve the cleanup).
