action #114565

closed

openQA Project - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

recover qa-power8-4+qa-power8-5 size:M

Added by okurz over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2022-12-19
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Tags: -

Description

Observation

After the upgrade to Leap 15.4 it seems qa-power8-4 wasn't rebooting properly. okurz could connect over SSH but was asked for a password where normally SSH keys should work. mkittler had varying success with "power reset" and "power cycle". Over SoL mkittler saw petitboot.

Acceptance criteria

  • AC1: Both qa-power8-4 and qa-power8-5 are used for production openQA jobs again
  • AC2: Stable over reboot
  • AC3: Alerts unpaused

Further information

Suggestions

Rollback steps

  • After https://bugzilla.opensuse.org/show_bug.cgi?id=1202138 is resolved, remove the kernel-default and util-linux zypper package locks on qa-power8-4, qa-power8-5 and power8.openqanet.opensuse.org
  • Upgrade the kernel and OS on qa-power8-4, qa-power8-5 and power8.openqanet.opensuse.org (see the sketch below)
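
A minimal sketch of these rollback steps on the command line (assuming root SSH access; host and package names are from the bullets above, the exact zypper invocation is an assumption, not from this ticket):

for host in qa-power8-4 qa-power8-5 power8.openqanet.opensuse.org; do
    # drop the workaround package locks, then upgrade kernel+OS and reboot
    ssh root@$host 'zypper removelock kernel-default util-linux && zypper -n dup && reboot'
done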

Files

power8-5.log.txt (2.07 MB) power8-5.log.txt mkittler, 2022-07-22 14:09
ipmi-qa-power8-5-boot-loop.txt (1.08 MB) ipmi-qa-power8-5-boot-loop.txt mkittler, 2022-10-24 14:07

Subtasks 1 (0 open, 1 closed)

action #122158: [alert] qa-power8-4-kvm host up alert - machine not up, nothing obvious on SoL but IPMI works size:M (Resolved, dheidler, 2022-12-19)

Related issues 10 (1 open, 9 closed)

Related to openQA Infrastructure - action #115208: failed-systemd-services: logrotate-openqa alerting on and off size:M (Resolved, livdywan)
Related to openQA Infrastructure - action #116437: Recover qa-power8-5 size:M (Resolved, mkittler)
Related to openQA Infrastructure - action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers (New, 2022-09-12)
Related to openQA Infrastructure - action #116743: [alert] QA-Power8-5-kvm: host up alert (Resolved, nicksinger, 2022-09-19 to 2022-10-04)
Related to openQA Infrastructure - action #117229: [tools] openqa failing on worker QA-Power8-5-kvm (Resolved, mkittler, 2022-09-26 to 2022-10-13)
Related to openQA Infrastructure - coordination #117268: [epic] Handle reduced PowerPC ressources (Resolved, okurz, 2022-07-21)
Related to openQA Infrastructure - action #118024: Ensure all PPC workers are upgraded after kernel regression resolved size:M (Resolved, mkittler, 2022-10-11)
Related to openQA Infrastructure - action #119290: [alert] Packet loss between worker hosts and other hosts alert (Rejected, okurz, 2022-10-24)
Related to openQA Infrastructure - action #116078: Recover o3 worker kerosene formerly known as power8, restore IPMI access size:M (Resolved, okurz, 2022-08-31)
Copied from openQA Infrastructure - action #114526: recover openqaworker14 (Resolved, mkittler)
Actions #1

Updated by okurz over 2 years ago

Actions #2

Updated by mkittler over 2 years ago

power8-4 boots again after power cycle + reset and adding nmi_watchdog=0¹ to kernel params. (Not sure yet whether the kernel param is necessary.)

¹ https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt

Actions #3

Updated by mkittler over 2 years ago

I've tried booting with nmi_watchdog=0 on power8-5 as well, but I'm still getting [ 197.877239][ C62] watchdog: BUG: soft lockup - CPU#62 stuck for 25s! [swapper/62:0]. I suppose I'm still in the state where the system cannot boot. I've attached the logs from my previous attempt.

EDIT: I wanted to do a further experiment and therefore reset power8-5 again. But meanwhile, in the other terminal tab it actually showed some systemd logging. Of course the reset was effective so I'm now back to where I was before.
EDIT: Booted again, this time without the kernel parameter, and it seemed like the boot was stuck as before. So I reset and booted again with the kernel parameter. Now power8-5 also started. I suppose nmi_watchdog=0 might actually help. Now the question is how to make the kernel parameter persistent in this petitboot setup.

Actions #4

Updated by mkittler over 2 years ago

I suppose this change would configure the kernel parameter persistently (as Petitboot just seems to parse the GRUB config): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/716
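
For reference, a minimal sketch of the manual equivalent of that change on a worker (file path and variable are the standard openSUSE GRUB defaults and an assumption here; the authoritative change is the linked merge request):

# append nmi_watchdog=0 to the default kernel command line and regenerate the GRUB config,
# which petitboot then parses to build its boot entries
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 nmi_watchdog=0"/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg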

Actions #5

Updated by openqa_review over 2 years ago

  • Due date set to 2022-08-06

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler over 2 years ago

QA-Power8-4-kvm.qa.suse.de didn't survive the reboot. However, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/716 has also not been merged yet.

EDIT: QA-Power8-4 is up again after a power reset. It wasn't even required to use nmi_watchdog=0. Maybe it is just a coincidence after all that adding this kernel parameter helps? However, otherwise I wouldn't know what else to change to keep these workers from sometimes getting stuck on boot. (Actually, I'm not sure in what state QA-Power8-4 was stuck. It was powered on but showed no activity over SOL and SSH wasn't working.)

Actions #7

Updated by mkittler over 2 years ago

qa-power8-4-kvm.qa isn't reachable again. Before resetting the power, I'll try some further tests to check where exactly it hangs.
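
For reference, such checks could look like this with the same ipmitool access used elsewhere in this ticket (the password is replaced by a placeholder):

ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" chassis power status   # is the machine even powered on?
ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" sel list               # check the BMC system event log
ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" sol activate           # watch the console for where it hangs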

EDIT: Likely not relevant, but those are recent "system" logs from the BMC web UI:

105 Jul 26 03:01:07 AMIA0423F32B471 ntpdate[25206]: 129.70.132.34 rate limit response from server. 
100 Jul 26 08:01:11 AMIA0423F32B471 ntpdate[28197]: step time server 131.234.220.231 offset 0.036390 sec
101 Jul 26 09:01:11 AMIA0423F32B471 ntpdate[28807]: step time server 178.63.9.212 offset 0.031896 sec 
102 Jul 26 09:06:01 AMIA0423F32B471 lighttpd[1585]: [1585 WARNING][webifc_StorDevice.c:1157]Invalid FRU Header Version for ID : 0 
84  Jul 26 09:06:45 AMIA0423F32B471 lighttpd[1585]: [1585 CRITICAL][libipmi_AMIOEM.c:3143]Error While Getting service configuration 403
85  Jul 26 09:07:12 AMIA0423F32B471 lighttpd[1585]: [1585 CRITICAL][webifc_XportDevice.c:438] g_corefeatures.ipv6_compliance_support==1
86  Jul 26 09:08:48 AMIA0423F32B471 lighttpd[1585]: [1585 CRITICAL][webifc_syslog.c:356]Audit file /conf/audit.log not exists.. Trying /var/log/audit.log 
73  Jul 26 09:08:02 AMIA0423F32B471 /USR/SBIN/CRON[28921]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
74  Jul 26 09:08:02 AMIA0423F32B471 /USR/SBIN/CRON[28922]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf)
75  Jul 26 09:09:02 AMIA0423F32B471 /USR/SBIN/CRON[28930]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
76  Jul 26 09:09:02 AMIA0423F32B471 /USR/SBIN/CRON[28931]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf)
77  Jul 26 09:10:02 AMIA0423F32B471 /USR/SBIN/CRON[28939]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
78  Jul 26 09:10:02 AMIA0423F32B471 /USR/SBIN/CRON[28941]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf)
79  Jul 26 09:11:02 AMIA0423F32B471 /USR/SBIN/CRON[28949]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
80  Jul 26 09:11:02 AMIA0423F32B471 /USR/SBIN/CRON[28950]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf) 

The only IP shown in the BMC web interface is the BMC's own IP (it responds to pings and SSH connections, but I cannot authenticate).

Actions #8

Updated by okurz over 2 years ago

  • Description updated (diff)

I removed salt keys for now with

sudo salt-key -y -d QA-Power8-*

as OSD deployment was blocked on this.

Actions #9

Updated by mkittler over 2 years ago

Looks like my recovery from yesterday did not last very long: https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?orgId=1&from=1658755017279&to=1658761222206

Normally the system shouldn't have been rebooting at all, so maybe it is not a boot problem but rather the running system crashing or getting stuck. That boot was without nmi_watchdog=0, so I'm rebooting with that parameter again.

Actions #11

Updated by okurz over 2 years ago

Both merged. You need to manually deploy, or add the machines back to salt and apply a high state.
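
A minimal sketch of what that could look like on OSD, assuming the minion keys are re-submitted after the earlier salt-key -y -d (target glob taken from that command):

sudo salt-key -y -a 'QA-Power8-*'          # accept the worker keys again
sudo salt 'QA-Power8-*' state.highstate    # apply the merged salt states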

Actions #12

Updated by mkittler over 2 years ago

I've added the machines back and applied the changes. It seems to have worked. Let's see whether they continue to run without crashing or getting stuck with this new kernel parameter. I'll also reboot one of them to see whether the new kernel parameter is now persistent.

EDIT: QA-Power8-5-kvm rebooted just fine (unattended) and the kernel parameter is present.

Actions #13

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

Both workers are running jobs and it looks good so far. I'll set the ticket to Feedback and check again in the next few days.

Actions #14

Updated by okurz over 2 years ago

I suggest also simulating a power outage, e.g. just call ipmi.*power cycle and see how it behaves.
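
A sketch of that simulation via ipmitool (placeholder password; host name as used elsewhere in this ticket):

ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" power cycle
# and, if the machine does not come back on its own (as observed before in this ticket):
ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" power reset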

Actions #15

Updated by mkittler over 2 years ago

Ok, I've just done that with power8-4. Let's see whether it comes up again. (Previously a power cycle needed to be followed by a power reset to get the machine going again, but I'll give it a bit of time before trying that.)

By the way, the system journal on power8-4 from yesterday just stops at some point. I couldn't spot any interesting log messages and grepping for lockup gives no results. So it reminds me of our ARM workers.

Actions #16

Updated by mkittler over 2 years ago

It was necessary to invoke power reset as well but then the system booted relatively fast. I suppose that problem with power cycle is nothing new for these two workers, though.

I resumed the alerts to know if one of the machines would become unresponsive again.

Actions #17

Updated by okurz over 2 years ago

mkittler wrote:

It was necessary to invoke power reset as well but then the system booted relatively fast. I suppose that problem with power cycle is nothing new for these two workers, though.

I resumed the alerts to know if one of the machines would become unresponsive again.

that just seems to have happened: https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert&orgId=1&from=1658848718115&to=1658854799197

I paused the alert again

Actions #18

Updated by mkittler over 2 years ago

power8-5 seems to be stable now, but power8-4 crashed again around 18:00. I'll reboot it via IPMI. Maybe this time the journal has more info. If not, I can add the host to the automatic recovery we already have for ARM workers (but it needs to be taken into account that we need power reset here and not power cycle).

Actions #19

Updated by mkittler over 2 years ago

power8-4 is back but it once got stuck on boot even before reaching the OS. I was also often thrown out of the SOL session¹. Again the journal doesn't show anything interesting before the crash².

--

¹ When I tried to reconnect immediately, I got "Info: SOL payload already active on another session", but after waiting a while or just deactivating the SOL session I could connect again.
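
For reference, the "just deactivating the SOL session" path could look like this (placeholder password):

ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" sol deactivate   # free the stale SOL payload
ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P "$IPMI_PASSWORD" sol activate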

²

Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [pid:24538] Download of SLES-12-SP5-ppc64le-containers.qcow2 processed:
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [#96752] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [#96752] Downloading "SLES-12-SP5-ppc64le-containers.qcow2" from "http://openqa.suse.de/tests/9216080/asset/hdd/SLES-12-SP5-ppc64le-containers.qcow2"
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [#96752] Content of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-ppc64le-containers.qcow2" has not changed, updating last use
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [debug] [pid:24538] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-ppc64le-containers.qcow2" to "/var/lib/openqa/pool/3/SLES-12-SP5-ppc64le-containers.qcow2"
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [pid:24538] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #96753 sent to Cache Service
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [debug] [pid:24538] Updating status so job 9216080 is not considered dead.
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [debug] [pid:24538] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216080/status
Jul 26 18:23:25 QA-Power8-4-kvm worker[14096]: [debug] [pid:14096] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216040/status
Jul 26 18:23:25 QA-Power8-4-kvm worker[14096]: [debug] [pid:14096] Upload concluded (at bci_prepare)
Jul 26 18:23:25 QA-Power8-4-kvm openqa-worker-cacheservice-minion[24775]: [24775] [i] [#96753] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jul 26 18:23:25 QA-Power8-4-kvm openqa-worker-cacheservice-minion[24775]: [24775] [i] [#96753] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jul 26 18:23:26 QA-Power8-4-kvm worker[24463]: [debug] [pid:24463] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216065/status
Jul 26 18:23:26 QA-Power8-4-kvm worker[24463]: [debug] [pid:24463] Upload concluded (at boot_to_desktop)
Jul 26 18:23:26 QA-Power8-4-kvm worker[22067]: [debug] [pid:22067] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216045/status
Jul 26 18:23:27 QA-Power8-4-kvm worker[22067]: [debug] [pid:22067] Upload concluded (at image_docker)
-- Boot 48a662843a3a4ab582b9969f25bdec1f --
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Jul 27 14:55:47 QA-Power8-4-kvm kernel: dt-cpu-ftrs: setup for ISA 2070
Jul 27 14:55:47 QA-Power8-4-kvm kernel: dt-cpu-ftrs: not enabling: subcore (unknown and unsupported by kernel)
Jul 27 14:55:47 QA-Power8-4-kvm kernel: dt-cpu-ftrs: final cpu/mmu features = 0x000000fb8f5db187 0x3c006001
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: Page sizes from device-tree:
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Enabling pkeys with max key count 32
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Disabling hardware transactional memory (HTM)
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Activating Kernel Userspace Access Prevention
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Activating Kernel Userspace Execution Prevention
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Using 1TB segments
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: Initializing hash mmu with SLB
martchus@QA-Power8-4-kvm:~> sudo journalctl --since '24 hours ago' | grep -i lockup
martchus@QA-Power8-4-kvm:~>
Actions #20

Updated by mkittler over 2 years ago

power8-4 now crashed again while I was still browsing the journal. Since I still had SOL open we now have a little more output:

[   28.237735][ T2075] device tap4 entered promiscuous mode
[   28.238711][ T2075] device tap71 entered promiscuous mode
[   28.244128][ T2075] device tap135 entered promiscuous mode
[  OK  ] Finished ppc64 set SMT off.
         Starting Authorization Manager...
[   29.908745][ T1719] EXT4-fs error: 42 callbacks suppressed
[   29.908748][ T1719] EXT4-fs error (device sdb1): ext4_mb_generate_buddy:1140: group 2589, block bitmap and bg descriptor incSOL session closed by BMC
[martchus@linux-9lzf ~]$ ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P vera8ahreiph6A sol activate
Info: SOL payload already active on another session
[martchus@linux-9lzf ~]$ ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P vera8ahreiph6A sol activate
Info: SOL payload already active on another session
[martchus@linux-9lzf ~]$ ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P vera8ahreiph6A sol activate
         Mounting /var/lib/openqa/share...
[  196.219518][ T4037] FS-Cache: Loaded
[  196.273137][ T4037] RPC: Registered named UNIX socket transport module.
[  196.273200][ T4037] RPC: Registered udp transport module.
[  196.273210][ T4037] RPC: Registered tcp transport module.
[  196.273303][ T4037] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  196.333186][ T4037] FS-Cache: Netfs 'nfs' registered for caching
[  196.338961][ T4043] Key type dns_resolver registered
[  196.543578][ T4043] NFS: Registering the id_resolver key type
[  196.543606][ T4043] Key type id_resolver registered
[  196.543645][ T4043] Key type id_legacy registered
[  OK  ] Listening on RPCbind Server Activation Socket.
[  OK  ] Reached target RPC Port Mapper.
         Starting Notify NFS peers of a restart...
         Starting NFS status monitor for NFSv2/3 locking....
[  OK  ] Started Notify NFS peers of a restart.
         Starting RPC Bind...
[  OK  ] Started RPC Bind.
[  OK  ] Started NFS status monitor for NFSv2/3 locking..
[  197.340453][ T4071] systemd-fstab-generator[4071]: x-systemd.device-timeout ignored for openqa.suse.de:/var/lib/openqa/share
[  197.342888][ T4081] systemd-sysv-generator[4081]: SysV service '/etc/init.d/boot.local' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[  OK  ] Mounted /var/lib/openqa/share.
[  OK  ] Started Timeline of Snapper Snapshots.
         Starting DBus interface for snapper...
[  OK  ] Started DBus interface for snapper.
[ ***  ] A start job is running for /etc/ini…ompatibility (4min 19s / no limit)
         Starting Check if mainboard battery is Ok...
[  OK  ] Finished Check if mainboard battery is Ok.
[**    ] A start job is running for /etc/ini…Compatibility (5min 2s / no limit)
[  315.571309][   C48] EXT4-fs (sdb1): error count since last fsck: 2430
[  315.571380][   C48] EXT4-fs (sdb1): initial error at time 1605764783: ext4_mb_generate_buddy:759
[  OK  ] Started /etc/init.d/boot.local Compatibility.
         Starting Hold until boot process finishes up...
         Starting Terminate Plymouth Boot Screen...

Welcome to openSUSE Leap 15.4 - Kernel 5.14.21-150400.24.11-default (hvc0).

br1: 10.0.2.2 fe80::e0bf:2eff:fe2f:b749
eth0:  
eth1:  
eth2:  
eth3: 10.162.6.201 2620:113:80c0:80a0:10:162:29:60f


QA-Power8-4-kvm login: [  365.807470][ T3923] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  438.050890][   T94] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[  438.051046][   T94] CPU: 16 PID: 94 Comm: ksof

Here you can also see the unstable SOL. So we've got stack corruption within the kernel. Maybe a memory error? Not sure how I'd run a memtest on that power machine.

Actions #21

Updated by mkittler over 2 years ago

Haven't found a memtest tool for PowerPC and the Linux kernel's memtest feature is apparently disabled on PowerPC. Not sure whether the firmware provides something. It was also mentioned that we could contact IBM support about it.

Btw, the machine has 256 GB RAM (but only 8 CPU cores). Maybe just one RAM module is broken and we could live without it.

That is, if it is a memory problem at all. So far I'm not sure; it only looks like one, especially because power8-5 (which runs the same software and workload) seems stable.

Actions #22

Updated by okurz over 2 years ago

I doubt that the memory suddenly broke just after we upgraded to Leap 15.4. I suggest rolling back/downgrading to 15.3 to cross-check stability.

Actions #23

Updated by mkittler over 2 years ago

I'll try booting into an earlier snapshot. However, if 15.3 works, that raises the question why power8-5 remains stable under 15.4 and why the machine sometimes also crashes even before the OS boots (at least to me it looks like it is sometimes crashing that early).

Moved power8-4 out of salt again, invoked snapper rollback 2181 and rebooted. (I suppose it isn't possible to boot into a specific snapshot via petitboot.) It now already runs some tests under Leap 15.3. Let's see whether it remains stable or crashes.

Actions #24

Updated by mkittler over 2 years ago

So far power8-4 looks good on Leap 15.3.

I've noticed that snapper was not configured on power8-5 (in contrast to power8-4), so I configured it. (We already use btrfs and have enough disk space.)
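
Presumably something along these lines (the config name "root" is the snapper default and an assumption, not from this ticket):

snapper -c root create-config /   # create the default snapper config for the btrfs root
snapper list-configs              # verify it shows up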

Actions #25

Updated by mkittler over 2 years ago

Considering power8-4 is still running, downgrading to Leap 15.3 indeed seems to help. Before, under Leap 15.4, it never stayed online for this long. However, it could still just be luck, so I'll keep it running over the weekend to be sure.

Maybe it is possible to install the kernel version from Leap 15.3 under Leap 15.4. That configuration would be my next test, and if it is stable it could also be a possible workaround.

Actions #26

Updated by mkittler over 2 years ago

Looks like downgrading helps. I suppose I'll boot into Leap 15.4 again and see whether I can install the older kernel version from Leap 15.3.

EDIT: Easier said than done. There's good documentation but I still don't know where I'd get the old kernel version from (for Leap 15.4, I suppose just using the Leap 15.3 package isn't a good idea).

Actions #27

Updated by MDoucha over 2 years ago

I have to ask for QA-Power8-4-kvm to be left on Leap 15.3 for a while. SLE-15SP2 has some strange bug which breaks framebuffer console on PPC64LE QEMU 6.2.0: bsc#1201796. QA-Power8-4-kvm is currently the only PPC64LE worker where the SLE-15SP2 updates can be tested. SLE-15SP3 appears to be fixed but the actual fix is currently unknown.

Actions #28

Updated by okurz over 2 years ago

mkittler wrote:

[…] I still don't know where I'd get the old kernel version from (for Leap 15.4, I suppose just using the Leap 15.3 package isn't a good idea).

That shouldn't be a problem, especially not for the kernel which is self-contained anyway.
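
A hedged sketch of how the self-contained Leap 15.3 kernel could be installed on the 15.4 system (repository URL/alias are assumptions, not from this ticket; relies on zypper's multiversion kernel support so both kernels stay installed):

zypper ar -f https://download.opensuse.org/distribution/leap/15.3/repo/oss/ leap-15.3-oss
zypper in --oldpackage --from leap-15.3-oss kernel-default   # pull the older kernel from the 15.3 repo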

Actions #29

Updated by livdywan over 2 years ago

  • Subject changed from recover qa-power8-4+qa-power8-5 to recover qa-power8-4+qa-power8-5 size:M

Suggested to report a product issue after discussing it in the estimation call

Actions #30

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #31

Updated by okurz over 2 years ago

  • Due date deleted (2022-08-06)
  • Status changed from Feedback to Blocked

Well-written bug report. I suggest for now to block on both the bug report you filed, https://bugzilla.opensuse.org/show_bug.cgi?id=1202138, as well as https://bugzilla.suse.com/show_bug.cgi?id=1201796

Actions #32

Updated by okurz over 2 years ago

  • Due date set to 2022-11-01
  • Priority changed from High to Normal

So currently both qa-power8-4 and qa-power8-5 are up and running and usable for openQA tests, though with differing OS versions. That can be good for debugging and investigation where required.
Let's wait at the latest until one month before the EOL of Leap 15.3 and check again then.

Actions #33

Updated by mkittler over 2 years ago

Looks like now the file system on power8-4 is corrupted:

martchus@QA-Power8-4-kvm:~> sudo journalctl -fu openqa-worker-cacheservice-minion.service 
…
[#107890] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Aug 10 17:33:47 QA-Power8-4-kvm openqa-worker-cacheservice-minion[24456]: rsync: readlink_stat("/var/lib/openqa/cache/openqa.suse.de/tests/sle/products/opensuse/needles/.git/index") failed: Structure needs cleaning (117)
…
martchus@QA-Power8-4-kvm:~> sudo rm -r /var/lib/openqa/cache/openqa.suse.de/tests/sle/products/opensuse/needles/.git/index
rm: das Entfernen von '/var/lib/openqa/cache/openqa.suse.de/tests/sle/products/opensuse/needles/.git/index' ist nicht möglich: Die Struktur muss bereinigt werden

But that's just the throw-away ext2 filesystem created by our NVMe/RAID script so a reboot should repair it.


The ext4 filesystem is actually also not in great shape as I've got many messages about inconsistencies when rebooting:

[  *** ] A start job is running for /etc/ini…Compatibility (1min 7s / no limit)
[   79.648237] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2585, block bitmap and bg descriptor inconsistent: 26423 vs 26428 free clusters
[   79.648407] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2587, block bitmap and bg descriptor inconsistent: 30674 vs 30677 free clusters
[   79.648497] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2589, block bitmap and bg descriptor inconsistent: 31152 vs 31154 free clusters
[   79.648557] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2590, block bitmap and bg descriptor inconsistent: 30689 vs 30693 free clusters
[   79.709621] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2568, block bitmap and bg descriptor inconsistent: 26721 vs 26795 free clusters
[   79.709879] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2575, block bitmap and bg descriptor inconsistent: 27609 vs 27621 free clusters
[   79.756691] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2577, block bitmap and bg descriptor inconsistent: 29779 vs 29828 free clusters
[   79.756883] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2581, block bitmap and bg descriptor inconsistent: 30628 vs 30547 free clusters
[   79.757007] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2582, block bitmap and bg descriptor inconsistent: 13767 vs 13771 free clusters
[  *** ] A start job is running for /etc/ini…ompatibility (1min 31s / no limit)
[  102.907136] EXT4-fs error: 2 callbacks suppressed
[  102.907137] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 6465, block bitmap and bg descriptor inconsistent: 3359 vs 3412 free clusters
[   ***] A start job is running for /etc/ini…ompatibility (1min 38s / no limit)
[  109.661883] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 0, block bitmap and bg descriptor inconsistent: 422 vs 423 free clusters
[  109.662023] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 4478 vs 4479 free clusters
[  109.662045] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2, block bitmap and bg descriptor inconsistent: 1534 vs 1535 free clusters
[  *** ] A start job is running for /etc/ini…ompatibility (1min 38s / no limit)
[  110.178385] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 6460, block bitmap and bg descriptor inconsistent: 1534 vs 1550 free clusters
[  110.178450] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 6461, block bitmap and bg descriptor inconsistent: 2539 vs 2555 free clusters
[  *** ] A start job is running for /etc/ini…ompatibility (1min 42s / no limit)
[  113.698846] EXT4-fs error (device sda1): ext4_lookup:1728: inode #47653495: comm rsync: deleted inode referenced: 47659659
[    **] A start job is running for /etc/ini…Compatibility (2min 4s / no limit)
[  *** ] A start job is running for /etc/ini…ompatibility (2min 10s / no limit)
[   ***] A start job is running for /etc/ini…ompatibility (2min 10s / no limit)
[ ***  ] A start job is running for /etc/ini…ompatibility (2min 23s / no limit)
[  155.134104] EXT4-fs error (device sda1): ext4_lookup:1728: inode #47653495: comm rsync: deleted inode referenced: 47659659
Actions #34

Updated by livdywan over 2 years ago

  • Related to action #115208: failed-systemd-services: logrotate-openqa alerting on and off size:M added
Actions #35

Updated by mkittler over 2 years ago

The reboot actually didn't help because we don't have our NVMe/RAID script installed:

martchus@QA-Power8-4-kvm:~> sudo systemctl status openqa_nvme_format.service
Unit openqa_nvme_format.service could not be found.

I've just stopped everything:

systemctl stop openqa-worker-auto-restart@* openqa-worker-cacheservice* var-lib-openqa.mount

Looks like we're still using ext2 (even though it is not being re-created on every boot):

martchus@QA-Power8-4-kvm:~> sudo file -sL /dev/sda1
/dev/sda1: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=f5de0a79-bfa8-41c9-ba86-c0d2df739acc (errors) (large files)

Considering the previous crashes it is no surprise to end up with a broken ext2 file system.

The filesystem fix is still running, but maybe it would be best to reformat with ext4.
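
Presumably the running fix, and the possible reformat, amount to something like this (device name from the file -sL output above, run with the mounts stopped as described; a reformat would of course discard the cache contents):

e2fsck -f -y /dev/sda1     # repair the current ext2 filesystem
# mkfs.ext4 /dev/sda1      # the reformat to ext4 mentioned above, as an alternative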

Actions #36

Updated by mkittler over 2 years ago

After the filesystem fix I see no more problems in the journal and tests on several worker slots are already past the test syncing. So I suppose the immediate problem has been fixed.

For further improvements, e.g. moving to ext4 or ensuring the ext2 filesystem is re-created on boot, we should create a separate ticket (once we have decided what we want to do).

Actions #37

Updated by livdywan over 2 years ago

  • Status changed from Blocked to In Progress

I assume this isn't Blocked since you're actively working on it.

Actions #38

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Blocked

It is still blocked by the tickets mentioned in #114565#note-31. My most recent work was just file system cleanup after the crashes. I've created #115226 for further improvements (regardless of the bigger problem here).

Actions #39

Updated by mkittler about 2 years ago

Now power8-5 just went offline. I'll try a power cycle and investigate.

Actions #40

Updated by mkittler about 2 years ago

The log of power8-5 just ended:

Sep 02 04:38:38 QA-Power8-5-kvm worker[53749]: [debug] [pid:53749] Uploading artefact ioperm02-2.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59603]: [debug] [pid:59603] Uploading artefact nfs42_ipv6_02-8.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59120]: [debug] [pid:59120] Uploading artefact preadv03_64-4.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59704]: [debug] [pid:59704] Uploading artefact boot_ltp-31.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59625]: [debug] [pid:59625] Uploading artefact test_connect-2.txt
-- Boot 5378bf885e9e49ecbadd3cfc8aeccb90 --
Sep 02 11:30:30 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1

I also couldn't spot anything strange in the hours before.

Actions #41

Updated by livdywan about 2 years ago

Actions #42

Updated by mkittler about 2 years ago

  • Related to action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers added
Actions #43

Updated by michals about 2 years ago

1) you shouldn't need nmi_watchdog=0

2) can you use the stable kernel archives to determine the upstream kernel version that broke your machines?

https://build.opensuse.org/project/show/home:tiwai:kernel:5.8

There is a lot of difference between 5.3 and 5.14, and with these OpenPOWER machines being somewhat uncommon hardware you might be the first to see the problem.
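
A hedged sketch of how such a test kernel could be pulled in from that OBS project (the download URL follows the usual OBS repository layout and is an assumption; the resulting 5.8.15 kernel matches what is reported further below):

zypper ar -f https://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.8/standard/ tiwai-kernel-5.8
zypper in --from tiwai-kernel-5.8 kernel-default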

Actions #44

Updated by mkittler about 2 years ago

I can try that 5.8 kernel tomorrow.

Actions #45

Updated by okurz about 2 years ago

  • Status changed from Blocked to In Progress

As discussed in our weekly meeting, mkittler is working on this based on feedback and suggestions from https://bugzilla.opensuse.org/show_bug.cgi?id=1202138 and https://bugzilla.suse.com/show_bug.cgi?id=1201796

Actions #46

Updated by mkittler about 2 years ago

  • Status changed from In Progress to Feedback

As of now qa-power8-5-kvm.qa.suse.de runs on Linux 5.8:

martchus@QA-Power8-5-kvm:~> uname -a
Linux QA-Power8-5-kvm 5.8.15-1.gc680e93-default #1 SMP Thu Oct 15 08:10:08 UTC 2020 (c680e93) ppc64le ppc64le ppc64le GNU/Linux

Let's see whether the issue is still reproducible on that version. Note that when trying to install the kernel package the machine was offline again. After recovering it, the machine broke again during the installation of the kernel package. So qa-power8-5-kvm.qa.suse.de can now definitely be considered unstable on Linux 5.14/Leap 15.4 as well (while in the beginning only qa-power8-4-kvm.qa.suse.de seemed affected). At least it was stable enough to install the kernel package on the 2nd attempt. The new kernel also showed up in the bootloader and is supposedly even the first/default option now.

Actions #48

Updated by okurz about 2 years ago

  • Related to action #116743: [alert] QA-Power8-5-kvm: host up alert added
Actions #49

Updated by okurz about 2 years ago

We decided to give it some days to check whether the system stays stable. If it turns out to be stable, report that in the bug reports and go forward with investigating, e.g. trying the next major kernel version 5.9 or something.

Actions #50

Updated by jbaier_cz about 2 years ago

  • Related to action #117229: [tools] openqa failing on worker QA-Power8-5-kvm added
Actions #51

Updated by okurz about 2 years ago

Actions #52

Updated by mkittler about 2 years ago

qa-power8-5 still hasn't crashed again. I reported it back in https://bugzilla.opensuse.org/show_bug.cgi?id=1202138.

Actions #53

Updated by mkittler about 2 years ago

Now on 5.19.12-lp153.2.g95fa5b8-default, see https://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c21.

Actions #54

Updated by jbaier_cz about 2 years ago

Actions #55

Updated by mkittler about 2 years ago

Yes, I'm currently going through alerts and have just seen it. I'll reboot the machine and investigate.

Actions #56

Updated by mkittler about 2 years ago

I'm not sure whether the machine went offline due to a crash, because it is now stuck in petitboot, which cannot detect the installed system anymore. Maybe the GRUB config was modified in a way petitboot cannot handle anymore (maybe due to installing too many kernel versions?).


It at least recognizes some disks:

/ # ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx    1 root     root            10 Oct  4 09:25 17277885-b991-401b-8f91-a7ddf99d25f9 -> ../../sda1
lrwxrwxrwx    1 root     root            10 Oct  4 09:25 89ca2dff-86af-478b-8d4c-2a45ca689fd5 -> ../../sda2
lrwxrwxrwx    1 root     root            10 Oct  4 09:25 d1a0b849-b41b-4bd3-b0ab-5b027d0a0a63 -> ../../sdb1

I've removed the worker from salt again and re-triggered the deployment.

Actions #57

Updated by mkittler about 2 years ago

17277885-b991-401b-8f91-a7ddf99d25f9 is just swap:

/ # fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks  Id System
/dev/sda1               1         262     2103296  82 Linux swap
Partition 1 does not end on cylinder boundary
/dev/sda2             262      121602   974657536  83 Linux

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
81 heads, 63 sectors/track, 382818 cylinders
Units = cylinders of 5103 * 512 = 2612736 bytes

   Device Boot      Start         End      Blocks  Id System
/dev/sdb1

Mounting 89ca2dff-86af-478b-8d4c-2a45ca689fd5 worked. It contains a root BTRFS filesystem including /boot with all the usual files including the GRUB2 config.
Mounting d1a0b849-b41b-4bd3-b0ab-5b027d0a0a63 worked as well. It contains the /var/lib/openqa tree.

/ # mount
rootfs on / type rootfs (rw,size=133707968k,nr_inodes=2089187)
devtmpfs on /dev type devtmpfs (rw,relatime,size=133707968k,nr_inodes=2089187,mode=755)
proc on /proc type proc (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,relatime,mode=777)
sysfs on /sys type sysfs (rw,relatime)
/dev/disk/by-uuid/89ca2dff-86af-478b-8d4c-2a45ca689fd5 on /media/89ca2dff-86af-478b-8d4c-2a45ca689fd5 type btrfs (rw,relatime,space_cache,subvolid=257,subvol=/@)
/dev/sdb1 on /media/d1a0b849-b41b-4bd3-b0ab-5b027d0a0a63 type ext4 (rw,relatime,data=ordered)

So I suppose I can use petitboot as a rescue system to restore the installation, e.g. by fixing the GRUB config so it is recognized by petitboot again. I just don't know how.

Actions #58

Updated by mkittler about 2 years ago

Seems to be the same issue as #68053#note-25:

[   24.417444] md: linear personality registered for level -1
[   24.426260] md: raid0 personality registered for level 0
[   24.618888] md: raid1 personality registered for level 1
[   24.649531] md: raid10 personality registered for level 10
[   24.705608] md: raid6 personality registered for level 6
[   24.716246] md: raid5 personality registered for level 5
[   24.746418] md: raid4 personality registered for level 4
[   24.781340] md: multipath personality registered for level -4
[   24.794915] md: faulty personality registered for level -5
[   24.826726] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com
[   24.866702] device-mapper: multipath: version 1.10.0 loaded
[   24.897227] device-mapper: multipath round-robin: version 1.0.0 loaded
[   24.920287] powernv-cpufreq: cpufreq pstate min -54 nominal -16 max 0
[   24.944719] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0/usb1/1-3/1-3.4/1-3.4:1.0/0003:046B:FF10.0001/input/input0
[   25.062562] hid-generic 0003:046B:FF10.0001: input: USB HID v1.10 Keyboard [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0003:09:00.0-3.4/input0
[   25.113618] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0/usb1/1-3/1-3.4/1-3.4:1.1/0003:046B:FF10.0002/input/input1
[   25.143084] hid-generic 0003:046B:FF10.0002: input: USB HID v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0003:09:00.0-3.4/input1
[   25.162768] usbcore: registered new interface driver usbhid
[   25.200740] usbhid: USB HID core driver
[   25.246936] ipip: IPv4 over IPv4 tunneling driver
[   25.260200] NET: Registered protocol family 17
[   25.334151] Key type dns_resolver registered
[   25.369404] registered taskstats version 1
[   25.370191] Btrfs loaded
[   25.440185] console [netcon0] enabled
[   25.440220] netconsole: network logging started
[   25.449722] Freeing unused kernel memory: 6400K (c000000001070000 - c0000000016b0000)
[   25.494020] udevd[2977]: starting version 3.0
[   25.533005] tg3 0003:03:00.3 enP3p3s0f3: renamed from eth3
[   25.633656] tg3 0003:03:00.1 enP3p3s0f1: renamed from eth1
[   25.702684] tg3 0003:03:00.2 enP3p3s0f2: renamed from eth2
[   25.744393] BTRFS: device fsid 89ca2dff-86af-478b-8d4c-2a45ca689fd5 devid 1 transid 3700113 /dev/sda2
[   25.852558] tg3 0003:03:00.0 enP3p3s0f0: renamed from eth0
[   26.120507] BTRFS info (device dm-2): disk space caching is enabled
[   26.120511] BTRFS: has skinny extents
[   27.376679] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
[   27.378796] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[   27.385450] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[   27.391198] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[   27.396653] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[   27.401262] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[   27.405870] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[   27.406217] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[   27.406227] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[   27.406237] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[   27.406244] BTRFS error (device dm-2): bdev /dev/mapper/sda2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[   27.406541] BTRFS: error (device dm-2) in btrfs_commit_transaction:2124: errno=-5 IO failure (Error while writing out transaction)
[   27.406543] BTRFS warning (device dm-2): Skipping commit of aborted transaction.
[   27.406545] BTRFS: Transaction aborted (error -5)
[   27.406556] ------------[ cut here ]------------
[   27.406557] WARNING: at fs/btrfs/transaction.c:1746
[   27.406558] Modules linked in:
[   27.406562] CPU: 42 PID: 3199 Comm: pb-discover Not tainted 4.4.3-openpower2 #0
[   27.406563] task: c000003ff1320000 ti: c000003ff13c0000 task.ti: c000003ff13c0000
[   27.406565] NIP: c0000000002eae80 LR: c0000000002eae7c CTR: c0000000003b7d24
[   27.406567] REGS: c000003ff13c33c0 TRAP: 0700   Not tainted  (4.4.3-openpower2)
[   27.406568] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28042824  XER: 20000000
[   27.406574] CFAR: c000000000b7a578 SOFTE: 1 
[   27.406574] GPR00: c0000000002eae7c c000003ff13c3640 c000000001753d00 0000000000000025 
[   27.406574] GPR04: 0000000000000001 000000000000033d 0000000000000035 c0000000016b3d00 
[   27.406574] GPR08: 0000000000000007 0000000000000001 0000000000000007 00000000000135b8 
[   27.406574] GPR12: 0000000000002200 c00000000fe87e00 c000003fb55e3100 c000003fb55e3ea0 
[   27.406574] GPR16: c000003fb55e39c0 c000003fb55e39a8 0000000000001000 0000000000387591 
[   27.406574] GPR20: 0000000000000000 c000003fb55e35b0 c000003fb55e32f0 c000003fb55e36a8 
[   27.406574] GPR24: c000003fa1f80800 c000003fa051000c 0000000000000000 fffffffffffffffb 
[   27.406574] GPR28: fffffffffffffffb c000003fa0500000 c000003fa0510000 c000003fa1f80800 
[   27.406594] NIP [c0000000002eae80] cleanup_transaction+0xac/0x2c0
[   27.406597] LR [c0000000002eae7c] cleanup_transaction+0xa8/0x2c0
[   27.406598] Call Trace:
[   27.406600] [c000003ff13c3640] [c0000000002eae7c] cleanup_transaction+0xa8/0x2c0 (unreliable)
[   27.406603] [c000003ff13c3720] [c0000000002ec4cc] btrfs_commit_transaction+0xa94/0xaa0
[   27.406605] [c000003ff13c37f0] [c0000000002e5ef4] btrfs_commit_super+0xa0/0xac
[   27.406607] [c000003ff13c3820] [c0000000002e9530] open_ctree+0x1a18/0x1d70
[   27.406609] [c000003ff13c3950] [c0000000002bf248] btrfs_mount+0x67c/0x884
[   27.406613] [c000003ff13c3a60] [c000000000115a74] mount_fs+0x2c/0xac
[   27.406615] [c000003ff13c3ae0] [c0000000001303c4] vfs_kern_mount+0x64/0x138
[   27.406617] [c000003ff13c3b30] [c0000000002bed90] btrfs_mount+0x1c4/0x884
[   27.406619] [c000003ff13c3c40] [c000000000115a74] mount_fs+0x2c/0xac
[   27.406621] [c000003ff13c3cc0] [c0000000001303c4] vfs_kern_mount+0x64/0x138
[   27.406623] [c000003ff13c3d10] [c000000000134340] do_mount+0xbec/0xd04
[   27.406626] [c000003ff13c3dd0] [c0000000001346c0] SyS_mount+0x90/0xc8
[   27.406628] [c000003ff13c3e30] [c000000000009198] system_call+0x38/0xd0
[   27.406629] Instruction dump:
[   27.406631] 7d4048a8 7d474378 7ce049ad 40c2fff4 7c0004ac 7949f7e3 40e2001c 3c62ff7d 
[   27.406635] 7f84e378 3863480d 4888f6a9 60000000 <0fe00000> 3ca2ff48 7fa3eb78 7fe4fb78 
[   27.406639] ---[ end trace 037d04dfcb7feb3d ]---
[   27.406641] BTRFS: error (device dm-2) in cleanup_transaction:1746: errno=-5 IO failure
[   27.406643] BTRFS info (device dm-2): delayed_refs has NO entry
[   27.406677] BTRFS error (device dm-2): cleaner transaction attach returned -30
[   27.532546] BTRFS: open_ctree failed
[   27.697782] device-mapper: snapshots: Snapshot is marked invalid.
[   27.698528] EXT4-fs (dm-2): unable to read superblock
[   31.742456] tg3 0003:03:00.3 enP3p3s0f3: Link is up at 1000 Mbps, full duplex
[   31.742467] tg3 0003:03:00.3 enP3p3s0f3: Flow control is off for TX and off for RX
[   31.742475] tg3 0003:03:00.3 enP3p3s0f3: EEE is enabled
[ 1968.183521] UDF-fs: warning (device sda1): udf_fill_super: No partition found (2)
[ 2453.622061] UDF-fs: warning (device sda2): udf_fill_super: No partition found (2)
[ 2453.627345] BTRFS info (device sda2): disk space caching is enabled
[ 2453.627350] BTRFS: has skinny extents
[ 2455.468203] BTRFS: checking UUID tree
[ 2899.504240] EXT4-fs (sdb1): couldn't mount as ext3 due to feature incompatibilities
[ 2899.504710] EXT4-fs (sdb1): couldn't mount as ext2 due to feature incompatibilities
[ 2902.432651] EXT4-fs (sdb1): recovery complete
[ 2902.448361] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
Actions #59

Updated by mkittler about 2 years ago

And applying the workaround from #68053#note-25 worked as well. So now I'm able to boot the system again and can investigate why it crashed.

Actions #60

Updated by mkittler about 2 years ago

Looks like a crash; it didn't even run for 24 hours:

Sep 29 13:30:47 QA-Power8-5-kvm kernel: Linux version 5.19.12-lp153.2.g95fa5b8-default (geeko@buildhost) (gcc (SUSE Linux) 11.3.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Wed Sep 28 10:51:18 UTC 2022 (95fa5b8)
…
Sep 30 05:04:24 QA-Power8-5-kvm worker[79305]: [info] [pid:79305] +++ worker notes +++
Sep 30 05:04:24 QA-Power8-5-kvm worker[79305]: [info] [pid:79305] End time: 2022-09-30 03:04:24
Sep 30 05:04:24 QA-Power8-5-kvm worker[79305]: [info] [pid:79305] Result: timeout
Sep 30 05:04:24 QA-Power8-5-kvm worker[94930]: [info] [pid:94930] Uploading autoinst-log.txt
Sep 30 05:04:24 QA-Power8-5-kvm worker[94930]: [debug] [pid:94930] Uploading artefact autoinst-log.txt
Sep 30 05:04:24 QA-Power8-5-kvm worker[94930]: [info] [pid:94930] Uploading worker-log.txt
Sep 30 05:04:24 QA-Power8-5-kvm worker[94930]: [debug] [pid:94930] Uploading artefact worker-log.txt
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [warn] [pid:79305] Job 9633652 stopped because it exceeded MAX_JOB_TIME
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [warn] [pid:79305] Unable to upload results of the job because no command server URL or worker ID have been set.
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [debug] [pid:79305] Upload concluded (no current module)
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [debug] [pid:79305] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9633652/set_done?reason=timeout%3A+setup+exceeded+MAX_SETUP_TIME&result=timeout_exceeded&worker_id=1061
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [debug] [pid:79305] Job 9633652 from openqa.suse.de finished - reason: timeout
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [debug] [pid:79305] Cleaning up for next job
Sep 30 05:04:25 QA-Power8-5-kvm worker[79305]: [debug] [pid:79305] Informing openqa.suse.de that we are going offline
Sep 30 05:04:25 QA-Power8-5-kvm systemd[1]: openqa-worker-auto-restart@3.service: Deactivated successfully.
Sep 30 05:04:25 QA-Power8-5-kvm systemd[1]: openqa-worker-auto-restart@3.service: Scheduled restart job, restart counter is at 14.
Sep 30 05:04:25 QA-Power8-5-kvm systemd[1]: Stopped openQA Worker #3.
Sep 30 05:04:25 QA-Power8-5-kvm systemd[1]: Starting openQA Worker #3...
Sep 30 05:04:25 QA-Power8-5-kvm systemd[1]: Started openQA Worker #3.
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]: [info] [pid:94932] worker 3:
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - config file:           /etc/openqa/workers.ini
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - worker hostname:       QA-Power8-5-kvm
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - isotovideo version:    31
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - websocket API version: 1
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - web UI hosts:          openqa.suse.de
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - class:                 qemu_ppc64le,qemu_ppc64le_no_tmpfs,tap,qemu_ppc64le-large-mem,QA-Power8-5-kvm
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - no cleanup:            no
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]:  - pool directory:        /var/lib/openqa/pool/3
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]: [info] [pid:94932] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]: [info] [pid:94932] Project dir for host openqa.suse.de is /var/lib/openqa/share
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]: [info] [pid:94932] Registering with openQA openqa.suse.de
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]: [info] [pid:94932] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1061
Sep 30 05:04:26 QA-Power8-5-kvm worker[94932]: [info] [pid:94932] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1061
-- Boot 088ba0d4350c47748442be71a06fb9a1 --
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Enabling pkeys with max key count 32
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Activating Kernel Userspace Access Prevention
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Activating Kernel Userspace Execution Prevention
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Using 1TB segments
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: Initializing hash mmu with SLB
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Linux version 5.19.12-lp153.2.g95fa5b8-default (geeko@buildhost) (gcc (SUSE Linux) 11.3.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP Wed Sep 28 10:51:18 UTC 2022 (95fa5b8)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Found initrd at 0xc000000004560000:0xc00000000629ab23
Okt 04 14:08:05 QA-Power8-5-kvm kernel: OPAL: Found non-mapped LPC bus on chip 0
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Using PowerNV machine description
Okt 04 14:08:05 QA-Power8-5-kvm kernel: printk: bootconsole [udbg0] enabled
Okt 04 14:08:05 QA-Power8-5-kvm kernel: CPU maps initialized for 8 threads per core
Okt 04 14:08:05 QA-Power8-5-kvm kernel:  (thread shift is 3)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Allocated 5376 bytes for 112 pacas
Okt 04 14:08:05 QA-Power8-5-kvm kernel: -----------------------------------------------------
Okt 04 14:08:05 QA-Power8-5-kvm kernel: phys_mem_size     = 0x4000000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: dcache_bsize      = 0x80
Okt 04 14:08:05 QA-Power8-5-kvm kernel: icache_bsize      = 0x80
Okt 04 14:08:05 QA-Power8-5-kvm kernel: cpu_features      = 0x000000fb8f5db187
Okt 04 14:08:05 QA-Power8-5-kvm kernel:   possible        = 0x000ffbfbcf5fb187
Okt 04 14:08:05 QA-Power8-5-kvm kernel:   always          = 0x0000000380008181
Okt 04 14:08:05 QA-Power8-5-kvm kernel: cpu_user_features = 0xdc0065c2 0xef000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: mmu_features      = 0x7c006e01
Okt 04 14:08:05 QA-Power8-5-kvm kernel: firmware_features = 0x0000000110000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: vmalloc start     = 0xc008000000000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: IO start          = 0xc00a000000000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: vmemmap start     = 0xc00c000000000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: ppc64_pft_size    = 0x0
Okt 04 14:08:05 QA-Power8-5-kvm kernel: hash-mmu: htab_hash_mask    = 0x1fffff
Okt 04 14:08:05 QA-Power8-5-kvm kernel: -----------------------------------------------------
Okt 04 14:08:05 QA-Power8-5-kvm kernel: numa:   NODE_DATA [mem 0x1fffbc7780-0x1fffbceeff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel: numa:   NODE_DATA [mem 0x3fff220880-0x3fff227fff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel: kvm_cma_reserve: reserving 13107 MiB for global area
Okt 04 14:08:05 QA-Power8-5-kvm kernel: cma: Reserved 13120 MiB at 0x0000000100000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: rfi-flush: fallback displacement flush available
Okt 04 14:08:05 QA-Power8-5-kvm kernel: rfi-flush: patched 13 locations (fallback displacement flush)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: count-cache-flush: flush disabled.
Okt 04 14:08:05 QA-Power8-5-kvm kernel: link-stack-flush: flush disabled.
Okt 04 14:08:05 QA-Power8-5-kvm kernel: entry-flush: patched 61 locations (no flush)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: uaccess-flush: patched 1 locations (no flush)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: stf-barrier: hwsync barrier available
Okt 04 14:08:05 QA-Power8-5-kvm kernel: stf-barrier: patched 61 entry locations (hwsync barrier)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: stf-barrier: patched 13 exit locations (hwsync barrier)
Okt 04 14:08:05 QA-Power8-5-kvm kernel: OPAL nvram setup, 589824 bytes
Okt 04 14:08:05 QA-Power8-5-kvm kernel: barrier-nospec: using ORI speculation barrier
Okt 04 14:08:05 QA-Power8-5-kvm kernel: barrier-nospec: patched 295 locations
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Top of RAM: 0x4000000000, Total RAM: 0x4000000000
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Memory hole size: 0MB
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Zone ranges:
Okt 04 14:08:05 QA-Power8-5-kvm kernel:   Normal   [mem 0x0000000000000000-0x0000003fffffffff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel:   Device   empty
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Movable zone start for each node
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Early memory node ranges
Okt 04 14:08:05 QA-Power8-5-kvm kernel:   node   0: [mem 0x0000000000000000-0x0000001fffffffff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel:   node   8: [mem 0x0000002000000000-0x0000003fffffffff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Initmem setup node 0 [mem 0x0000000000000000-0x0000001fffffffff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel: Initmem setup node 8 [mem 0x0000002000000000-0x0000003fffffffff]
Okt 04 14:08:05 QA-Power8-5-kvm kernel: percpu: Embedded 10 pages/cpu s598312 r0 d57048 u1048576

Since I've now already booted into 5.19.12 again, I'll give it another shot. I've also included a few kernel log messages occurring soon after the boot. Maybe they're helpful.

Actions #61

Updated by mkittler about 2 years ago

Had to reboot once more. Note that one always gets the following error shortly after boot:

Okt 05 12:13:27 QA-Power8-5-kvm udevadm[1365]: systemd-udev-settle.service is deprecated. Please fix wickedd.service not to pull it in.
Okt 05 12:13:28 QA-Power8-5-kvm kernel: Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
Okt 05 12:13:28 QA-Power8-5-kvm kernel: BUG: Kernel NULL pointer dereference on read at 0x00000000
Okt 05 12:13:28 QA-Power8-5-kvm kernel: Faulting instruction address: 0xc0000000000d605c
Okt 05 12:13:28 QA-Power8-5-kvm kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Okt 05 12:13:28 QA-Power8-5-kvm kernel: LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
Okt 05 12:13:28 QA-Power8-5-kvm kernel: Modules linked in: powernv_flash(+) mtd ibmpowernv rtc_opal(+) i2c_opal ipmi_msghandler fuse ip_tables x_tables hid_generic usbhid sr_mod cdrom uas usb_storage sd_mod lpfc ast i2c_algo_bit drm_vram_helper nvmet_fc drm_kms_helper syscopyarea nvmet sysfillrect sysimgblt fb_sys_fops xhci_pci drm_ttm_helper xhci_pci_renesas ahci ttm configfs xhci_hcd agpgart libahci xts nvme_fc ecb nvme_fabrics vmx_crypto drm nvme_core libata usbcore t10_pi crc64_rocksoft_generic crc64_rocksoft drm_panel_orientation_quirks usb_common scsi_transport_fc crc64 btrfs blake2b_generic libcrc32c crc32c_vpmsum xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod scsi_common kvm_hv kvm
Okt 05 12:13:28 QA-Power8-5-kvm kernel: rtc-opal opal-rtc: setting system clock to 2022-10-05T10:13:27 UTC (1664964807)
Okt 05 12:13:28 QA-Power8-5-kvm kernel: CPU: 12 PID: 1421 Comm: systemd-udevd Not tainted 5.19.12-lp153.2.g95fa5b8-default #1 openSUSE Tumbleweed (unreleased) 057af8bed99eed06febc6f765823cd64519772c1
Okt 05 12:13:28 QA-Power8-5-kvm kernel: NIP:  c0000000000d605c LR: c000000000a689c0 CTR: c000000000a68940
Okt 05 12:13:28 QA-Power8-5-kvm kernel: REGS: c000000025b3aa30 TRAP: 0300   Not tainted  (5.19.12-lp153.2.g95fa5b8-default)
Okt 05 12:13:28 QA-Power8-5-kvm kernel: MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44228220  XER: 20000000
Okt 05 12:13:28 QA-Power8-5-kvm kernel: CFAR: c000000000a689bc DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0 
                                        GPR00: c000000000a689c0 c000000025b3acd0 c000000002955e00 c00000001fc83400 
                                        GPR04: c00000001fc83400 0000000000000020 0000000000000000 c000000041ea9000 
                                        GPR08: 0000000000000000 c000000002141f80 0000000000000000 0000000000008800 
                                        GPR12: c000000000a68940 c000001fffff2880 00000100373ef380 0000000102bf0da8 
                                        GPR16: 0000000000000000 0000000000000000 c00000201ac7c1c8 0000000000000000 
                                        GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                                        GPR24: c00000201ac7c1c8 c00000000288e9d8 c0000000028915a8 c00000000288e998 
                                        GPR28: c0000000342c4010 0000000000000004 0000000000000003 c00000001fc83408 
Okt 05 12:13:28 QA-Power8-5-kvm kernel: NIP [c0000000000d605c] powernv_get_random_long+0x2c/0xd0
Okt 05 12:13:28 QA-Power8-5-kvm kernel: LR [c000000000a689c0] powernv_rng_read+0x80/0xd0
Okt 05 12:13:28 QA-Power8-5-kvm kernel: Call Trace:
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3acd0] [c000000025b3ad10] 0xc000000025b3ad10 (unreliable)
Okt 05 12:13:28 QA-Power8-5-kvm kernel: ipmi device interface
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3ad10] [c000000000a67e58] add_early_randomness+0x88/0x150
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3ad50] [c000000000a68420] hwrng_register+0x310/0x3c0
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3adb0] [c000000000a68538] devm_hwrng_register+0x68/0xf0
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3adf0] [c000000000a688f0] powernv_rng_probe+0x30/0x80
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3ae60] [c000000000a95fe8] platform_probe+0x98/0x150
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3aee0] [c000000000a9139c] really_probe+0x23c/0x590
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3af60] [c000000000a9187c] __driver_probe_device+0x18c/0x250
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3afe0] [c000000000a9199c] driver_probe_device+0x5c/0x140
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3b020] [c000000000a9244c] __device_attach_driver+0x10c/0x1d0
Okt 05 12:13:28 QA-Power8-5-kvm kernel: [c000000025b3b0a0] [c000000000a8da6c] bus_for_each_drv+0xac/0x130
Okt 05 12:13:28 QA-Power8-5-kvm kernel: ipmi-powernv ibm,opal:ipmi: Unable to map irq from device tree
Okt 05 12:13:28 QA-Power8-5-kvm kernel: at24 1-0051: supply vcc not found, using dummy regulator

Otherwise there's nothing useful in the logs about the last crash:

Okt 05 10:19:38 QA-Power8-5-kvm worker[86178]: [debug] [pid:86178] +++ worker notes +++
Okt 05 10:19:38 QA-Power8-5-kvm worker[62549]: [debug] [pid:62549] Running job 9659523 from openqa.suse.de: 09659523-sle-15-SP3-Server-DVD-Incidents-Install-ppc64le-Build:24603:aaa_base-qam-incidentinstall@ppc64le.
Okt 05 10:19:38 QA-Power8-5-kvm worker[62549]: [debug] [pid:62549] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9659523/status
Okt 05 10:19:38 QA-Power8-5-kvm worker[62549]: [debug] [pid:62549] Upload concluded (no current module)
Okt 05 10:19:43 QA-Power8-5-kvm worker[63093]: [debug] [pid:63093] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9659572/status
Okt 05 10:19:43 QA-Power8-5-kvm worker[63093]: [debug] [pid:63093] Upload concluded (at boot_to_desktop)
Okt 05 10:19:47 QA-Power8-5-kvm salt-minion[4419]: [ERROR   ] The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate
Okt 05 10:19:48 QA-Power8-5-kvm worker[62549]: [debug] [pid:62549] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9659523/status
Okt 05 10:19:48 QA-Power8-5-kvm worker[62549]: [debug] [pid:62549] Upload concluded (at boot_to_desktop)
Okt 05 10:19:53 QA-Power8-5-kvm worker[63093]: [debug] [pid:63093] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9659572/status
Okt 05 10:19:53 QA-Power8-5-kvm worker[63093]: [debug] [pid:63093] Upload concluded (at boot_to_desktop)
-- Boot f0ea7f73b0e94995b4a84f753ddcef86 --
Okt 05 12:13:10 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Okt 05 12:13:10 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Actions #62

Updated by mkittler about 2 years ago

Trying the latest vanilla kernel now:

[martchus@linux-9lzf ~]$ ssh qa-power8-5-kvm.qa.suse.de 
Last login: Wed Oct  5 12:14:38 2022 from 10.163.28.162
Have a lot of fun...
martchus@QA-Power8-5-kvm:~> uname -a
Linux QA-Power8-5-kvm 6.0.0-lp153.2.g47c5c19-vanilla #1 SMP Mon Oct 3 05:22:49 UTC 2022 (47c5c19) ppc64le ppc64le ppc64le GNU/Linux
Actions #63

Updated by nicksinger about 2 years ago

mkittler wrote:

Trying the latest vanilla kernel now:

[martchus@linux-9lzf ~]$ ssh qa-power8-5-kvm.qa.suse.de 
Last login: Wed Oct  5 12:14:38 2022 from 10.163.28.162
Have a lot of fun...
martchus@QA-Power8-5-kvm:~> uname -a
Linux QA-Power8-5-kvm 6.0.0-lp153.2.g47c5c19-vanilla #1 SMP Mon Oct 3 05:22:49 UTC 2022 (47c5c19) ppc64le ppc64le ppc64le GNU/Linux

Doesn't seem to help; we received another "host down" alert yesterday at 21:02 CEST. I paused the alert for now.

Actions #64

Updated by mkittler about 2 years ago

Yes, so even the latest vanilla kernel doesn't allow running the machine without crashes. (The log again just ends so it likely crashed again.)

Since we likely have to do further testing on qa-power8-5 we will probably have to recover it several more times. So I've created an SR for automatic recovery:

Actions #65

Updated by mkittler about 2 years ago

The automatic recovery seems to work. The reset was triggered because qa-power8-5-kvm wasn't in salt (so no ping data was recorded and the worker was therefore considered offline), and it worked. I've added it back to salt, ping data is available again and the alert is ok. I'll keep the host up alert paused, like we do for ARM workers with automatic recovery.
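
For reference, the recovery essentially boils down to "if the host doesn't answer pings, power-reset it over IPMI". A minimal sketch of that idea, with placeholder hostnames and credentials (this is not the actual pipeline configuration):

# Hypothetical recovery check; hostnames, user and password are placeholders
host=qa-power8-5-kvm.qa.suse.de
bmc=qa-power8-5.mgmt.example
if ! ping -c 5 -W 2 "$host" > /dev/null; then
    # host considered down -> trigger a power reset via the BMC
    ipmitool -I lanplus -H "$bmc" -U ADMIN -P '<password>' chassis power reset
fi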

Actions #66

Updated by okurz about 2 years ago

  • Related to action #118024: Ensure all PPC workers are upgraded after kernel regression resolved size:M added
Actions #67

Updated by mkittler about 2 years ago

Still waiting on feedback on the bugzilla ticket, see https://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c24 for an update.

Actions #68

Updated by mkittler about 2 years ago

Now many Minion jobs of the cache service on qa-power8-5 failed because the SQLite db file was locked, at least according to the error message. The Minion dashboard itself was actually responsive and showed all the data. I stopped the cache services and cleaned up the database. I wanted to save the broken database for further investigation but it got wiped. (Why do we remove everything under /var/lib/openqa/cache on a fresh start again? It would make more sense to delete only subdirectories.)
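
For the record, the cleanup amounted to roughly the following; the service and file names are from memory, so treat this as a sketch rather than the exact commands used:

# Stop the worker cache services before touching the database
systemctl stop openqa-worker-cacheservice-minion openqa-worker-cacheservice
# Remove the (locked) SQLite database; it is recreated on the next start
rm -f /var/lib/openqa/cache/cache.sqlite*
systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion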

Note that the worker has not crashed (last boot was Oct 11 09:44:41), so I'm not sure why this happened, and since the problematic db is gone it is hard to tell now. However, the locked db didn't cause many failures. Most of the failures were just jobs that failed with "worker went away", which is likely due to the frequent crashes of the machine; many of those jobs have simply piled up by now.

Actions #69

Updated by okurz about 2 years ago

  • Status changed from Feedback to In Progress

Discussed in tools team unblock meeting. Things to try in this order:

  1. Run stress tests on the system to try to reproduce, e.g. stress-ng
  2. Try running without any qemu tests to see whether crashes still happen
  3. Look at monitoring data leading up to a crash to see if anything looks suspicious there
  4. Research how to bisect kernel builds "the openSUSE way"
Actions #70

Updated by mkittler about 2 years ago

I've tried 1., and running stress-ng --all 4 --timeout 5m freezes power8-5-kvm after a few seconds. I first thought it might just not be responsive anymore but I've waited long enough to say it is really stuck. Additionally, when running the same command on power8-4-kvm, which is still on the last good kernel, the system does not get stuck.

So it looks like we've already found our reproducer. Maybe I can narrow it down further (to only specific stress tests).
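
One way to narrow it down would be to run each stressor on its own with a short timeout and note (on another host) which one was last started before the machine became unreachable. A sketch of that approach (stressor list abbreviated, not the exact set I'd use):

# Run each stressor individually for one minute and log which one is running
for s in bigheap brk context fork matrix mmap vm; do
    echo "$(date -Is) running stressor: $s" | tee -a stressor.log
    stress-ng --"$s" 4 --timeout 60s || echo "stressor $s exited with $?" | tee -a stressor.log
done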

Actions #71

Updated by okurz about 2 years ago

Wow, I guess we are lucky then :)

Actions #72

Updated by mkittler about 2 years ago

I tried it again. With --all it crashed again very soon. It invoked the following tests:

stress-ng: info:  [45464] dispatching hogs: 4 access, 4 af-alg, 4 affinity, 4 aio, 4 aiol, 4 alarm, 4 atomic, 4 bad-altstack, 4 bigheap, 4 branch, 4 brk, 4 bsearch, 4 cache, 4 cap, 4 chattr, 4 chdir, 4 chmod, 4 chown, 4 clock, 4 clone, 4 close, 4 context, 4 copy-file, 4 cpu, 4 crypt, 4 daemon, 4 dccp, 4 dentry, 4 dev, 4 dev-shm, 4 dir, 4 dirdeep, 4 dirmany, 4 dnotify, 4 dup, 4 dynlib, 4 enosys, 4 env, 4 epoll, 4 eventfd, 4 exec, 4 exit-group, 4 fallocate, 4 fault, 4 fcntl, 4 fiemap, 4 fifo, 4 file-ioctl, 4 filename, 4 flock, 4 fork, 4 fp-error, 4 fpunch, 4 fstat, 4 full, 4 funccall, 4 funcret, 4 futex, 4 get, 4 getdent, 4 getrandom, 4 handle, 4 hdd, 4 heapsort, 4 hrtimers, 4 hsearch, 4 icache, 4 inode-flags, 4 inotify, 4 io, 4 iomix, 4 ioport, 4 ioprio, 4 io-uring, 4 itimer, 4 judy, 4 kcmp, 4 key, 4 kill, 4 klog, 4 l1cache, 4 lease, 4 link, 4 list, 4 loadavg, 4 locka, 4 lockbus, 4 lockf, 4 lockofd, 4 longjmp, 4 lsearch, 4 madvise, 4 malloc, 4 matrix, 4 matrix-3d, 4 mcontend, 4 membarrier, 4 memcpy, 4 memfd, 4 memrate, 4 memthrash, 4 mergesort, 4 mincore, 4 misaligned, 4 mknod, 4 mlock, 4 mmap, 4 mmapaddr, 4 mmapfixed, 4 mmapfork, 4 mmaphuge, 4 mmapmany, 4 mq, 4 mremap, 4 msg, 4 msync, 4 munmap, 4 nanosleep, 4 netdev, 4 nice, 4 nop, 4 null, 4 numa, 4 opcode, 4 open, 4 pci, 4 personality, 4 pidfd, 4 ping-sock, 4 pipe, 4 pipeherd, 4 pkey, 4 poll, 4 prctl, 4 prefetch, 4 procfs, 4 pthread, 4 ptrace, 4 pty, 4 qsort, 4 radixsort, 4 readahead, 4 reboot, 4 remap, 4 rename, 4 resources, 4 revio, 4 rlimit, 4 rmap, 4 rseq, 4 rtc, 4 schedpolicy, 4 sctp, 4 seal, 4 secretmem, 4 seek, 4 sem, 4 sem-sysv, 4 sendfile, 4 session, 4 set, 4 shellsort, 4 shm, 4 shm-sysv, 4 sigabrt, 4 sigchld, 4 sigfd, 4 sigfpe, 4 sigio, 4 signal, 4 signest, 4 sigpending, 4 sigpipe, 4 sigq, 4 sigrt, 4 sigsegv, 4 sigsuspend, 4 sigtrap, 4 skiplist, 4 sleep, 4 sock, 4 sockabuse, 4 sockdiag, 4 sockfd, 4 sockpair, 4 sockmany, 4 spawn, 4 splice, 4 stack, 4 stackmmap, 4 str, 4 stream, 4 switch, 4 symlink, 4 sync-file, 4 sysbadaddr, 4 sysinfo, 4 sysfs, 4 tee, 4 timer, 4 timerfd, 4 tlb-shootdown, 4 tmpfs, 4 tree, 4 tsc, 4 tsearch, 4 udp, 4 udp-flood, 4 unshare, 4 urandom, 4 userfaultfd, 4 utime, 4 vecmath, 4 verity, 4 vfork, 4 vforkmany, 4 vm, 4 vm-addr, 4 vm-rw, 4 vm-segv, 4 vm-splice, 4 wait, 4 wcs, 4 x86syscall, 4 xattr, 4 yield, 4 zero, 4 zlib, 4 zombie

I've attached the output from SOL. Some of these messages indeed look bad:

[189185.890877][T76146] BUG: Unable to handle kernel data access at 0xfffffffffffa
[189185.891023][T76146] Faulting instruction address: 0xc00000000058f09c
[189185.891075][T76146] Oops: Kernel access of bad area, sig: 11 [#2]

Note that the OOM killer logs before might just be a symptom of running the bigheap stress test (which I've also executed separately before).


I'm still wondering whether it is possible to crash the worker with just a specific stress test. All the individual tests I've tried so far were not enough.

Actions #73

Updated by mkittler about 2 years ago

TLDR: One can easily hang up the machine (on Linux 6.0.1 vanilla at this point) by invoking stress-ng --bad-altstack 0. So it is at least possible to reproduce some issue very easily. (Not sure whether it is the same problem causing the hang-ups under production workloads, though.)


Looks like --branch 0 triggers it. Got similar output via IPMI:

[  943.905969][T88027] Pid 88027(stress-ng) over core_pi[535125204643,3] OPAL: Trying a CPU re-init with flags: 0x1
[535637573400,3] OPAL: CPU 0x409 not in OPAL !
[535706993019,3] OPAL: Trying a CPU re-init with flags: 0x2
[536219455903,3] OPAL: CPU 0x409 not in OPAL !

When invoking just --branch 0 on its own it doesn't cause the issue, though. So I then tried --bad-altstack 0 and that triggered the problem again:

[  629.212226][T102690] Pid 102690(stress-ng) over core_pipe_limit
[  629.212228][T102690] Skipping core dump
[  629.212276][T102656] Pid 102656(stress-ng) over core_pipe_limit
[  629.212278][T102656] Skipping core dump
[  629.212327][T102715] Pid 102715(stress-ng) over core_pipe_limit
[  629.212328][T102715] Skipping core dump
[  629.212373][T102540] Pid 102540(stress-ng) over core_pipe_limit
[  629.212375][T102540] Skipping core dump
[  629.212409][T102658] Pid 102658(stress-ng) over core_pipe_limit
[  629.212411][T102658] Skipping core dump
[  629.212419][T102700] Pid 102700(stress-ng) over core_pipe_limit
[  629.212421][T102700] Skipping core dump
[  629.212475][T102694] Pid 102694(stress-ng) over core_pipe_limit
[  629.212475][T102648] Pid 102648(stress-ng) over core_pipe_limit
[  629.212478][T102694] Skipping core dump
[  629.212478][T102648] Skipping core dump
[  629.212480][T102677] Pid 102677(stress-ng) over core_pipe_limit
[  629.212482][T102693] Pid 102693(stress-ng) over core_pipe_limit
[  629.212482][T102677] Skipping core dump
[  629.212484][T102693] Skipping core dump
[  629.212562][T102727] Pid 102727(stress-ng) over core_pipe_limit
[  629.212564][T102727] Skipping core dump
[  629.212591][T102634] Pid 102634(stress-ng) over core_pipe_limit
[  629.212594][T102634] Skipping core dump
[  629.212602][T102528] Pid 102528(stress-ng) over core_pipe_[378987676271,3] OPAL: Trying a CPU re-init with flags: 0x1
[379074063695,3] OPAL: Trying a CPU re-init with flags: 0x2

There are really many of these `over core_pipe_limit` messages.


Just tried it again with just one stress test:

martchus@QA-Power8-5-kvm:~> stress-ng --bad-altstack 0
stress-ng: info:  [4443] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [4443] dispatching hogs: 112 bad-altstack


^Cstress-ng: info:  [4443] successful run completed in 48.32s

I aborted the stress test manually but messages kept flooding SOL until the machine hung up again:

QA-Power8-5-kvm login: [  150.769340][ T4453] stress-ng[4453]: bad frame in handle_rt_signal64: 000000008314cd9a nip 000000011a591778 lr 000000011a591758
[  150.769454][ T4456] stress-ng[4456]: bad frame in handle_rt_signal64: 000000000c3768fc nip 000000011a5916c0 lr 000000011a5916b8
[  150.769475][ T4452] stress-ng[4452]: bad frame in handle_rt_signal64: 000000008314cd9a nip 000000011a591778 lr 000000011a591758
[  150.769535][ T4455] stress-ng[4455]: bad frame in handle_rt_signal64: 0000000039803f95 nip 000000011a5916c0 lr 000000011a591970
[  150.769677][ T4462] stress-ng[4462]: bad frame in handle_rt_signal64: 00000000b91e3fd1 nip 000000011a591674 lr 000000011a59165c
[  150.769684][ T4458] stress-ng[4458]: bad frame in handle_rt_signal64: 000000000c3768fc nip 000000011a5916c0 lr 000000011a5916b8
[  150.769762][ T4462] stress-ng[4462]: bad frame in handle_rt_signal64: 00000000b91e3fd1 nip 000000011a591674 lr 000000011a59165c
[  150.769779][ T4460] stress-ng[4460]: bad frame in handle_rt_signal64: 000000008314cd9a nip 000000011a591778 lr 000000011a591758
[  150.769897][ T4466] stress-ng[4466]: bad frame in handle_rt_signal64: 000000008314cd9a nip 000000011a591778 lr 000000011a591758
[  150.769953][ T4470] stress-ng[4470]: bad frame in handle_rt_signal64: 0000000039803f95 nip 000000011a5916c0 lr 000000011a591970
[  150.770561][ T4492] Pid 4492(stress-ng) over core_pipe_limit
[  150.770575][ T4492] Skipping core dump
[  150.770750][ T4503] Pid 4503(stress-ng) over core_pipe_limit
[  150.770768][ T4507] Pid 4507(stress-ng) over core_pipe_limit
…
[  180.839455][T90677] Skipping core dump
[  180.839472][T90650] Pid 90650(stress-ng) over core_pipe_limit
[  180.839474][T90650] Skipping core dump
[  180.839623][T9067[137805864934,3] OPAL: Trying a CPU re-init with flags: 0x1
[137894526220,3] OPAL: Trying a CPU re-init with flags: 0x2
Actions #74

Updated by mkittler about 2 years ago

The stress test mentioned in the previous comment can also easily cause a hang-up on qa-power8-4-kvm, which still runs the kernel version that doesn't show any problems under the production workload. So this stress test is likely exposing a different problem and thus not very helpful for triggering the problem we're actually interested in reproducing.

Actions #75

Updated by mkittler about 2 years ago

I've just run the same test for a while on openqaworker12 (x86_64) and so far it is still ok (although [133279.991444][T26789] Skipping core dump [133279.991551][T26796] Pid 26796(stress-ng) over core_pipe_limit is also printed continuously over SOL). So this test isn't generally causing this behavior. There must still be something wrong with PowerPC specifically. (Likely there is more than just one issue with PowerPC.)

EDIT: I was a bit too fast. Now it has just crashed as well. However, it was then actually able to switch to kdump:

[133280.179172][T27611] Pid 27611(stress-ng) over core_pipe_limit
[133280.179176][T27611] Skipping core dump
[133280.179210][T27615] Pid 27615(stress-ng) over core_pipe_limit
[133280.179210][T27614] Pid 27614(stress-ng) over core_pipe_limit
[133280.179213][T27613] Pid 27613(stress-ng) over core_pipe_limit
[133280.179214][T27615] Skipping core dump
[133280.179215][T27614] Skipping core dump
[133280.179218][T27613] Skipping core dump
[133280.179333][T27608] Pid 27608(stress-ng) over core_pipe_limit
[133280.179433][T27608] Skipping core dump
[133280.179999][T27616] Pid 27616(stress-ng) over core_pipe_limit
[    0.000000][    T0] Linux version 5.14.21-150400.24.21-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-150100.7.37) #1 SMP PREEMPT_DYNAMIC Wed Sep 7 06:51:18 UTC 2022 (974d0aa)
[    0.000000][    T0] Command line: elfcorehdr=0x400d000000  console=tty0 console=ttyS1,115200 nospec kvm.nested=1 kvm_intel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 elevator=deadline sysrq=yes reset_devices acpi_no_memhotplug cgroup_disable=memory nokaslr numa=off irqpoll nr_cpus=1 root=kdump rootflags=bind rd.udev.children-max=8 disable_cpu_apicid=0   panic=1
[    0.000000][    T0] random: get_random_u32 called from bsp_init_amd+0x231/0x260 with crng_init=0
[    0.000000][    T0] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000][    T0] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000][    T0] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
…
Saving dump using makedumpfile
-------------------------------------------------------------------------------
Copying data                                      : [100.0 %] |           eta: 0s

The dumpfile is saved to /kdump/mnt1/var/crash/2022-10-17-16:42/vmcore.

makedumpfile Completed.
-------------------------------------------------------------------------------

I'm not really sure what this all means now. The x86_64 machine definitely lasted longer and was able to execute kdump. However, maybe running this test for too long is simply not a good idea anyway (and thus the test results are not very meaningful).
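
If we want to dig into that dump, the crash utility should be able to open it, assuming a kernel debuginfo package matching the crashed kernel is installed (the paths below are illustrative, not verified on the machine):

# Install debug symbols for the crashed kernel (version must match the vmcore)
zypper in kernel-default-debuginfo
# Open the vmcore together with the debug vmlinux
crash /usr/lib/debug/boot/vmlinux-5.14.21-150400.24.21-default.debug /var/crash/2022-10-17-16:42/vmcore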

Actions #76

Updated by okurz about 2 years ago

  • Due date deleted (2022-11-01)
  • Status changed from In Progress to Workable
  • Assignee deleted (mkittler)

Discussed in the weekly SUSE QE Tools meeting 2022-10-21: mkittler could not identify which stress test in particular would trigger the crashes. To give mkittler a break from this specific task we decided that somebody else should take over.

Tasks to do:

  1. Try to identify the specific stress test component that would trigger a crash on one of the "bad" kernel versions but not the "last good" kernel.
  2. Try running without any qemu tests to see whether crashes still happen
  3. Look at monitoring data leading up to a crash to see if anything looks suspicious there
  4. Research how to bisect kernel builds "the openSUSE way"
Actions #77

Updated by okurz about 2 years ago

  • Priority changed from Normal to Urgent

The repeatedly crashing qa-power8-5 is causing other problems, e.g. the osd-deployment failing. We should remedy that, hence bumping the priority.

Actions #78

Updated by okurz about 2 years ago

  • Related to action #119290: [alert] Packet loss between worker hosts and other hosts alert added
Actions #79

Updated by mkittler about 2 years ago

The worker is now crashing very early so it is basically not available at all (see attached boot log).

I don't think we can fix this quickly so I've just taken out the worker from salt to unblock the OSD deployment.

I have disabled the automatic recovery action for now to exclude that the power reset it may invoke is interfering with the boot/recovery. I suppose the problem is the root filesystem:

[    3.583734][    T1] List of all partitions:
[    3.583743][    T1] No filesystem could mount root, tried: 
[    3.583744][    T1] 
[    3.583759][    T1] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    3.693635][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.0.2-lp153.5.gdba78aa-default #1 openSUSE Tumbleweed (unreleased) 025ef815668c84eb3eb1a224f125bd5f27320030
[    3.884371][    T1] Call Trace:
[    3.884405][    T1] [c000002006b6baa0] [c0000000008f75b0] dump_stack_lvl+0x74/0xa8 (unreliable)
[    3.929494][    T1] [c000002006b6bae0] [c000000000154b28] panic+0x16c/0x424
[    3.998995][    T1] [c000002006b6bb80] [c000000002005a20] mount_block_root+0x258/0x288
[    4.074541][    T1] [c000002006b6bc50] [c000000002005e24] prepare_namespace+0x1cc/0x224
[    4.179226][    T1] [c000002006b6bcd0] [c00000000200521c] kernel_init_freeable+0x39c/0x3f4
[    4.182705][    T1] [c000002006b6bdb0] [c000000000012630] kernel_init+0x30/0x1b0
[    4.265179][    T1] [c000002006b6be10] [c00000000000ce54] ret_from_kernel_thread+0x5c/0x64
[    6.72[81946958366,5] OPAL: Reboot request...
Actions #81

Updated by okurz about 2 years ago

  • Priority changed from Urgent to High
Actions #82

Updated by mkittler about 2 years ago

The root filesystem can be mounted just fine via petitboot:

/ # mkdir /mnt/sda2
/ # mount /dev/sda2 /mnt/sda2
/ # ls -l /mnt/sda2
total 0
drwxr-xr-x    1 root     root          1812 Sep 11 01:02 bin
drwxr-xr-x    1 root     root          2870 Oct 20 05:33 boot
drwxr-xr-x    1 root     root             0 Feb 28  2018 dev
drwxr-xr-x    1 root     root          5862 Oct 20 01:00 etc
drwxr-xr-x    1 root     root          1062 Sep 23 07:04 home
drwxr-xr-x    1 root     root           106 Jul 21 18:32 lib
drwxr-xr-x    1 root     root          3668 Sep 29 01:01 lib64
drwxr-xr-x    1 root     root             8 Mar 15  2022 mnt
drwxr-xr-x    1 root     root            34 Mar 15  2022 opt
drwxr-xr-x    1 root     root             0 Feb 28  2018 proc
drwx------    1 root     root           276 Oct 11 16:05 root
drwxr-xr-x    1 root     root             0 Feb 28  2018 run
drwxr-xr-x    1 root     root          4520 Oct 12 05:20 sbin
drwxr-xr-x    1 root     root             0 Mar 15  2022 selinux
drwxr-xr-x    1 root     root            18 Aug 30 09:25 srv
drwxr-xr-x    1 root     root             0 Feb 28  2018 sys
drwxrwxrwt    1 root     root          1396 Oct 20 11:41 tmp
drwxr-xr-x    1 root     root           120 Mar 15  2022 usr
drwxr-xr-x    1 root     root           132 Oct 19 10:37 var
/ # mount
rootfs on / type rootfs (rw,size=133707968k,nr_inodes=2089187)
devtmpfs on /dev type devtmpfs (rw,relatime,size=133707968k,nr_inodes=2089187,mode=755)
proc on /proc type proc (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,relatime,mode=777)
sysfs on /sys type sysfs (rw,relatime)
/dev/sda2 on /var/petitboot/mnt/dev/sda2 type btrfs (ro,relatime,space_cache,subvolid=257,subvol=/@)
/dev/sdb1 on /var/petitboot/mnt/dev/sdb1 type ext4 (ro,relatime,block_validity,delalloc,nojournal_checksum,norecovery,barrier,user_xattr,acl)
/dev/sda2 on /mnt/sda2 type btrfs (rw,relatime,space_cache,subvolid=257,subvol=/@)

The kernel panic also looks like there's a bigger underlying problem than "just" a corrupted filesystem:

[    2.965897][    T1] List of all partitions:
[    2.965914][    T1] No filesystem could mount root, tried: 
[    2.965916][    T1] 
[    2.965943][    T1] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    2.965973][    T1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.2-lp153.5.gdba78aa-default #1 openSUSE Tumbleweed (unreleased) 025ef815668c84eb3eb1a224f125bd5f27320030
[    2.966021][    T1] Call Trace:
[    2.966036][    T1] [c000002006afbaa0] [c0000000008f75b0] dump_stack_lvl+0x74/0xa8 (unreliable)
[    2.966076][    T1] [c000002006afbae0] [c000000000154b28] panic+0x16c/0x424
[    2.966103][    T1] [c000002006afbb80] [c000000002005a20] mount_block_root+0x258/0x288
[    2.966133][    T1] [c000002006afbc50] [c000000002005e24] prepare_namespace+0x1cc/0x224
[    2.966163][    T1] [c000002006afbcd0] [c00000000200521c] kernel_init_freeable+0x39c/0x3f4
[    2.966190][    T1] [c000002006afbdb0] [c000000000012630] kernel_init+0x30/0x1b0
[    2.966218][    T1] [c000002006afbe10] [c00000000000ce54] ret_from_kernel_thread+0x5c/0x64

Unfortunately the kernels from the kernel repo have overwritten all older/normal kernel versions (at least initrd/vmlinux):

/ # ls -l /mnt/sda2/boot
total 418688
-rw-r--r--    1 root     root       5177541 Oct 17 09:57 System.map-6.0.2-lp153.3.gaf756fb-default
-rw-r--r--    1 root     root       5177446 Oct 17 09:52 System.map-6.0.2-lp153.3.gaf756fb-vanilla
-rw-r--r--    1 root     root       5177541 Oct 18 10:13 System.map-6.0.2-lp153.4.g304ec74-default
-rw-r--r--    1 root     root       5177446 Oct 18 09:11 System.map-6.0.2-lp153.4.g304ec74-vanilla
-rw-r--r--    1 root     root       5177541 Oct 19 10:43 System.map-6.0.2-lp153.5.gdba78aa-default
-rw-r--r--    1 root     root          1725 Jun  1 19:51 boot.readme
-rw-r--r--    1 root     root        210042 Oct 17 08:32 config-6.0.2-lp153.3.gaf756fb-default
-rw-r--r--    1 root     root        209637 Oct 17 08:31 config-6.0.2-lp153.3.gaf756fb-vanilla
-rw-r--r--    1 root     root        210042 Oct 18 08:20 config-6.0.2-lp153.4.g304ec74-default
-rw-r--r--    1 root     root        209637 Oct 18 08:19 config-6.0.2-lp153.4.g304ec74-vanilla
-rw-r--r--    1 root     root        210042 Oct 19 10:14 config-6.0.2-lp153.5.gdba78aa-default
drwxr-xr-x    1 root     root            90 Oct 19 12:33 grub2
lrwxrwxrwx    1 root     root            37 Oct 19 05:19 initrd -> initrd-6.0.2-lp153.4.g304ec74-vanilla
-rw-------    1 root     root      30697312 Oct 18 05:19 initrd-6.0.2-lp153.3.gaf756fb-default
-rw-------    1 root     root      30687683 Oct 18 05:20 initrd-6.0.2-lp153.3.gaf756fb-vanilla
-rw-------    1 root     root      30699722 Oct 19 05:18 initrd-6.0.2-lp153.4.g304ec74-default
-rw-------    1 root     root      30686684 Oct 19 05:19 initrd-6.0.2-lp153.4.g304ec74-vanilla
-rw-------    1 root     root      27034896 Oct 19 12:34 initrd-6.0.2-lp153.4.g304ec74-vanilla-kdump
-rw-r--r--    1 root     root        297771 Oct 17 10:04 symvers-6.0.2-lp153.3.gaf756fb-default.gz
-rw-r--r--    1 root     root        297757 Oct 17 10:00 symvers-6.0.2-lp153.3.gaf756fb-vanilla.gz
-rw-r--r--    1 root     root        297771 Oct 18 10:26 symvers-6.0.2-lp153.4.g304ec74-default.gz
-rw-r--r--    1 root     root        297757 Oct 18 09:16 symvers-6.0.2-lp153.4.g304ec74-vanilla.gz
-rw-r--r--    1 root     root        297771 Oct 19 10:48 symvers-6.0.2-lp153.5.gdba78aa-default.gz
-rw-r--r--    1 root     root           377 Oct 17 10:04 sysctl.conf-6.0.2-lp153.3.gaf756fb-default
-rw-r--r--    1 root     root           377 Oct 17 10:00 sysctl.conf-6.0.2-lp153.3.gaf756fb-vanilla
-rw-r--r--    1 root     root           377 Oct 18 10:26 sysctl.conf-6.0.2-lp153.4.g304ec74-default
-rw-r--r--    1 root     root           377 Oct 18 09:16 sysctl.conf-6.0.2-lp153.4.g304ec74-vanilla
-rw-r--r--    1 root     root           377 Oct 19 10:48 sysctl.conf-6.0.2-lp153.5.gdba78aa-default
lrwxrwxrwx    1 root     root            38 Oct 19 05:19 vmlinux -> vmlinux-6.0.2-lp153.4.g304ec74-vanilla
-rw-r--r--    1 root     root      49753487 Oct 17 10:48 vmlinux-6.0.2-lp153.3.gaf756fb-default
-rw-r--r--    1 root     root      49753423 Oct 17 10:40 vmlinux-6.0.2-lp153.3.gaf756fb-vanilla
-rw-r--r--    1 root     root      49753487 Oct 18 10:49 vmlinux-6.0.2-lp153.4.g304ec74-default
-rw-r--r--    1 root     root      49753423 Oct 18 09:45 vmlinux-6.0.2-lp153.4.g304ec74-vanilla
-rw-r--r--    1 root     root      49753487 Oct 19 11:19 vmlinux-6.0.2-lp153.5.gdba78aa-default

I tried to invoke kexec (like we did on power8-3, see #119008#note-6) but nothing happened:

/ # cd /tmp
/tmp # wget http://download.opensuse.org/ports/ppc/distribution/leap/15.4/repo/o
ss/boot/ppc64le/linux  
Connecting to download.opensuse.org (195.135.221.134:80)
linux                100% |*******************************| 46550k  0:00:00 ETA
/tmp # wget http://download.opensuse.org/ports/ppc/distribution/leap/15.4/repo/o
ss/boot/ppc64le/initrd
Connecting to download.opensuse.org (195.135.221.134:80)
initrd               100% |*******************************|   119M  0:00:00 ETA
/tmp # kexec -l linux --initrd initrd
/tmp # echo $?
0

I also tried with the old 5.3.18 kernel from qa-power8-4-kvm but nothing happened either. There are also no messages in dmesg about kexec. When rebooting and only executing wget/kdump without prior tinkering (like mounting a btrfs filesystem) nothing happens either.

Actions #83

Updated by okurz about 2 years ago

The kernel panic also looks like there's a bigger underlying problem than "just" a corrupted filesystem:

[    2.965897][    T1] List of all partitions:
[    2.965914][    T1] No filesystem could mount root, tried: 
[    2.965916][    T1] 
[    2.965943][    T1] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

this is a pretty standard error when the parameter root=... does not point to a valid device. So maybe it's as simple as finding out whether the UUID is still correct or trying different parameters like root=/dev/sda2 or root=/dev/sdb2.
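
From the petitboot shell this can be cross-checked quickly, e.g. along these lines (a sketch; it assumes busybox provides blkid there and that the root filesystem is still mounted under /mnt/sda2 as shown above):

# Show the UUID the kernel would need for root=UUID=...
blkid /dev/sda2
# Compare with what the installed grub config actually passes on the kernel command line
grep -o 'root=[^ ]*' /mnt/sda2/boot/grub2/grub.cfg | sort -u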

/tmp # kexec -l linux --initrd initrd

I think this should say "kexec -l linux --initrd initrd --reuse-cmdline" or something like that. Otherwise you are likely missing command line output and providing no root volume to mount.

Actions #84

Updated by okurz about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

We are trying again together right now. Basically, as a follow-up to what mkittler was doing, the command kexec -e was missing to actually execute the loaded kernel. What I did in addition is provide a complete command line so that we can see output and the Linux kernel should find the right root device: kexec -l linux --initrd initrd --command-line="console=tty0 console=hvc0 root=UUID=…" with the UUID read from the grub config of the production system. The system is currently still in the process of executing the initial kernel code.
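
For completeness, the whole sequence we are effectively running from the petitboot shell looks roughly like this (the UUID is intentionally left as a placeholder):

# Load kernel + initrd with an explicit command line, then actually execute it
kexec -l linux --initrd initrd --command-line="console=tty0 console=hvc0 root=UUID=<uuid-from-grub.cfg>"
kexec -e   # without this the loaded kernel is never started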

Actions #85

Updated by okurz about 2 years ago

  • Description updated (diff)
  • Due date set to 2023-02-10
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal

Well, after quite some time we booted into the installation system. That's not what we wanted :D So then we adjusted the parameters to boot into the production system. Then we removed all more recent kernel versions, installed the Leap 15.3 kernel version and util-linux and applied locks on the packages:

zypper in --oldpackage http://download.opensuse.org/update/leap/15.3/sle/ppc64le/kernel-default-5.3.18-150300.59.93.1.ppc64le.rpm http://download.opensuse.org/update/leap/15.3/sle/ppc64le/util-linux-2.36.2-150300.4.23.1.ppc64le.rpm
for i in kernel-default util-linux; do zypper al --comment "poo#119008, kernel regression boo#1202138" $i; done
reboot

Now we have a usable system back. We can keep this system in this state for now until there is any update in https://bugzilla.opensuse.org/show_bug.cgi?id=1202138
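
Once the bug is fixed, the rollback should be roughly the inverse of the above; a sketch, with the exact package versions to be picked at that time:

# Remove the package locks and move back to a current kernel + util-linux
zypper rl kernel-default util-linux
zypper up kernel-default util-linux
reboot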

I updated the rollback steps, set a reasonable due date for when to check for progress on the bug, added the machine qa-power8-5 back to salt and unpaused the related alerts.

Actions #86

Updated by okurz about 2 years ago

https://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c31 gives an answer:

So far the problematic workers were POWER8 which use the old KVM code which gets very little testing upstream.
Further, the problematic workers use 'openpower' firmware. That's some specific firmware different from what most machines use, and the little testing that is done by upstream on POWER8 likely happens on different firmware that also supports PowerVM.
Finally the virtualization team does not provide any support for KVM on Power at all. Any support we have is best-effort. I don't have any POWER8 hardware capable of running KVM available.

I answered the questions in the bug and followed up with questions of my own:

Let me try to use this channel. Maybe we can find an answer here to a question that was never properly answered elsewhere: What do you consider the best way to run virtual machine based tests on PowerPC and how should that be implemented? So far we always preferred qemu based tests because we can use the same approach on x86_64, aarch64, ppc64le (so far) and even s390x. This way we can also scale best because machines can be created on the fly based on test parameters, e.g. RAM and storage size as needed for testing. For PowerVM we have a testing "backend" but it has very limited capabilities compared to bare-metal tests, e.g. it can only interact with pre-configured LPARs and has no support for saving/loading virtual machine images. Implementing that would require very specific PowerVM knowledge and any solution would only be valid for PowerPC. So, what is your take on it?

So let me ask everybody a question based on the discussion we had in the estimations call: Should I a) add a ticket to our backlog to find a way to run efficient tests on current PowerPC like we are used to with qemu, or b) tell everybody loudly that we are unable to do that, will maintain KVM on PowerPC with downgraded kernel versions only for some months and then switch off that part of the infrastructure without replacement unless someone outside the team provides it?

Actions #87

Updated by pcervinka about 2 years ago

We just need to finish this one https://progress.opensuse.org/issues/71794

Actions #88

Updated by MDoucha about 2 years ago

okurz wrote:

Let me try to use this channel. Maybe we can find an answer here to a question that was never properly answered elsewhere: What do you consider the best way to run virtual machine based tests on PowerPC and how to implement that?

I'll just add that IBM explicitly asked us several years ago to drop the unsupported QEMU/KVM tests and migrate everything to SPVM. But the migration is blocked by #71794

Actions #89

Updated by okurz about 2 years ago

pcervinka wrote:

We just need to finish this one https://progress.opensuse.org/issues/71794

Glad to see you continuing the work there on #71794 :D

MDoucha wrote:

I'll just add that IBM has explicitly asked us several years ago to drop the unsupported QEMU/KVM tests and migrate everything to SPVM. But the migration is blocked by #71794

I know. This is why I am continuing to support kvm on power. There is also the unavailability of Power9 for o3 and the worse scalability.

Actions #90

Updated by okurz about 2 years ago

There was a host up alert for qa-power8-4. I triggered a reboot and monitored the machine; it came up fine. I did not find anything in /var/crash.
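
In case somebody else needs to do the same: the reboot and monitoring were done over IPMI, roughly like this (BMC hostname and credentials are placeholders):

# Trigger the reboot via the BMC and watch the console over Serial-over-LAN
ipmitool -I lanplus -H qa-power8-4.mgmt.example -U ADMIN -P '<password>' chassis power reset
ipmitool -I lanplus -H qa-power8-4.mgmt.example -U ADMIN -P '<password>' sol activate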

The journal says

Nov 16 19:35:18 QA-Power8-4-kvm worker[36184]: [debug] [pid:36184] Upload concluded (at bci_test_docker)
Nov 16 19:35:19 QA-Power8-4-kvm worker[27816]: [debug] [pid:27816] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9975641/status
-- Reboot --
Nov 16 21:33:00 QA-Power8-4-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Nov 16 21:33:00 QA-Power8-4-kvm kernel: dt-cpu-ftrs: setup for ISA 2070

So again nothing in the logs about the crash. But why did it crash at all when the kernel is 5.3.18-150300.59.98-default?

Actions #91

Updated by okurz almost 2 years ago

  • Tags set to infra
Actions #92

Updated by mkittler almost 2 years ago

Note that "QA-Power8-4-kvm: host up alert" was firing on Sunday but turned back ok on Monday (supposedly without manual recovery).

Actions #93

Updated by dheidler almost 2 years ago

It seems that nick did the recovery.

Actions #94

Updated by okurz almost 2 years ago

  • Status changed from Blocked to Resolved
  • Both qa-power8-4 and qa-power8-5 are used for production openQA jobs again
  • Stable over reboot
  • Alerts unpaused
Actions #96

Updated by okurz about 1 year ago

  • Related to action #116078: Recover o3 worker kerosene formerly known as power8, restore IPMI access size:M added