action #114565

openQA Project - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

recover qa-power8-4+qa-power8-5 size:M

Added by okurz 2 months ago. Updated 5 days ago.

Status: Feedback
Priority: Normal
Assignee:
Target version:
Start date: 2022-07-22
Due date: 2022-11-01
% Done: 0%
Estimated time:

Description

Observation

After the upgrade to Leap 15.4 it seems qa-power8-4 did not reboot properly. okurz could connect over SSH but was asked for a password where normally SSH keys should work. mkittler had varying success with "power reset" and "power cycle". Over SOL mkittler saw Petitboot.

Acceptance criteria

  • AC1: Both qa-power8-4 and qa-power8-5 are used for production openQA jobs again
  • AC2: Stable over reboot
  • AC3: Alerts unpaused

Further information

Suggestions

power8-5.log.txt (2.07 MB) mkittler, 2022-07-22 14:09

Related issues

Related to openQA Infrastructure - action #115208: failed-systemd-services: logrotate-openqa alerting on and off size:M (Resolved)

Related to openQA Infrastructure - action #116437: Recover qa-power8-5 size:M (Resolved)

Related to openQA Infrastructure - action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers (New, 2022-09-12)

Related to openQA Infrastructure - action #116743: [alert] QA-Power8-5-kvm: host up alert (Resolved, 2022-09-19, 2022-10-04)

Related to openQA Tests - action #117229: [tools] openqa failing on worker QA-Power8-5-kvm:7 (New, 2022-09-26)

Copied from openQA Infrastructure - action #114526: recover openqaworker14 (Resolved)

History

#1 Updated by okurz 2 months ago

#2 Updated by mkittler 2 months ago

power8-4 boots again after power cycle + reset and adding nmi_watchdog=0¹ to kernel params. (Not sure yet whether the kernel param is necessary.)

¹ https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt

#3 Updated by mkittler 2 months ago

I've tried booting with nmi_watchdog=0 on power8-5 as well but I'm still getting [ 197.877239][ C62] watchdog: BUG: soft lockup - CPU#62 stuck for 25s! [swapper/62:0]. I suppose I'm still in the state where the system cannot boot. I've attached the logs from my previous attempt.

EDIT: I wanted to do a further experiment and therefore reset power8-5 again. But meanwhile, in the other terminal tab it actually showed some systemd logging. Of course the reset was effective so I'm now back to where I was before.
EDIT: Booted again, this time again without the kernel parameter and it seemed like the boot was stuck like before. So I reset again and booted again with the kernel parameter. Now, power8-5 also started. I suppose nmi_watchdog=0 might actually help. Now the question is how to make the kernel parameter persistent in this petitboot setup.

#4 Updated by mkittler 2 months ago

I suppose this change would configure the kernel parameter persistently (as Petitboot just seems to parse the GRUB config): https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/716

#5 Updated by openqa_review 2 months ago

  • Due date set to 2022-08-06

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by mkittler 2 months ago

QA-Power8-4-kvm.qa.suse.de didn't survive the reboot. However, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/716 has also not been merged yet.

EDIT: QA-Power8-4 is up again after a power reset. It wasn't even necessary to use nmi_watchdog=0. Maybe it is just a coincidence after all that adding this kernel parameter helps? Otherwise, however, I wouldn't know what else to try against these workers sometimes getting stuck on boot. (Actually, I'm not sure in what state QA-Power8-4 was stuck. It was powered on but showed no activity over SOL and SSH wasn't working.)

#7 Updated by mkittler 2 months ago

qa-power8-4-kvm.qa is unreachable again. Before resetting the power, I'll run some further tests to check where exactly it hangs.

EDIT: Likely not relevant, but those are recent "system" logs from the BMC web UI:

105 Jul 26 03:01:07 AMIA0423F32B471 ntpdate[25206]: 129.70.132.34 rate limit response from server. 
100 Jul 26 08:01:11 AMIA0423F32B471 ntpdate[28197]: step time server 131.234.220.231 offset 0.036390 sec
101 Jul 26 09:01:11 AMIA0423F32B471 ntpdate[28807]: step time server 178.63.9.212 offset 0.031896 sec 
102 Jul 26 09:06:01 AMIA0423F32B471 lighttpd[1585]: [1585 WARNING][webifc_StorDevice.c:1157]Invalid FRU Header Version for ID : 0 
84  Jul 26 09:06:45 AMIA0423F32B471 lighttpd[1585]: [1585 CRITICAL][libipmi_AMIOEM.c:3143]Error While Getting service configuration 403
85  Jul 26 09:07:12 AMIA0423F32B471 lighttpd[1585]: [1585 CRITICAL][webifc_XportDevice.c:438] g_corefeatures.ipv6_compliance_support==1
86  Jul 26 09:08:48 AMIA0423F32B471 lighttpd[1585]: [1585 CRITICAL][webifc_syslog.c:356]Audit file /conf/audit.log not exists.. Trying /var/log/audit.log 
73  Jul 26 09:08:02 AMIA0423F32B471 /USR/SBIN/CRON[28921]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
74  Jul 26 09:08:02 AMIA0423F32B471 /USR/SBIN/CRON[28922]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf)
75  Jul 26 09:09:02 AMIA0423F32B471 /USR/SBIN/CRON[28930]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
76  Jul 26 09:09:02 AMIA0423F32B471 /USR/SBIN/CRON[28931]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf)
77  Jul 26 09:10:02 AMIA0423F32B471 /USR/SBIN/CRON[28939]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
78  Jul 26 09:10:02 AMIA0423F32B471 /USR/SBIN/CRON[28941]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf)
79  Jul 26 09:11:02 AMIA0423F32B471 /USR/SBIN/CRON[28949]: (sysadmin) CMD ( if [ -d /var/cache/samba ]; then echo "2" > /var/cache/samba/unexpected.tdb; fi)
80  Jul 26 09:11:02 AMIA0423F32B471 /USR/SBIN/CRON[28950]: (sysadmin) CMD ( [ 0 -eq `ps -ef | grep "/usr/sbin/logrotate" | grep -vc "grep"` ] && /usr/sbin/logrotate /etc/logrotate.conf) 

The IP shown in the BMC web interface is only the BMC web interface IP itself (and responds to pings and SSH connections but I cannot authenticate).

#8 Updated by okurz 2 months ago

  • Description updated (diff)

I removed salt keys for now with

sudo salt-key -y -d QA-Power8-*

as OSD deployment was blocked on this.

#9 Updated by mkittler 2 months ago

Looks like my recovery from yesterday did not last very long: https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?orgId=1&from=1658755017279&to=1658761222206

Normally the system shouldn't have been rebooting so maybe it is not a boot problem but the running system just crashed or got stuck. That boot was without nmi_watchdog=0 so I'm rebooting with that parameter again.

#11 Updated by okurz 2 months ago

Both merged. You need to deploy manually, or add the machines back to salt and apply a high state.

#12 Updated by mkittler 2 months ago

I've added the machines back and applied the changes. It seems to have worked. Let's see whether they continue to run without crashing or getting stuck under this new kernel parameter. I'll also reboot one of them to see whether the new kernel parameter is now persistent.

EDIT: QA-Power8-5-kvm rebooted just fine (unattended) and the kernel parameter is present.
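
One way to confirm that a parameter such as nmi_watchdog=0 survived a reboot is to look for it in /proc/cmdline. The following is a minimal sketch, not the check that was actually used here; has_kernel_param is a hypothetical helper name.

```shell
#!/bin/sh
# Return 0 if the given kernel parameter appears as a whole word in a
# kernel command line string, 1 otherwise.
# has_kernel_param is a hypothetical helper, not part of any existing tool.
has_kernel_param() {
    param=$1
    cmdline=$2
    # Pad with spaces so that e.g. "nmi_watchdog=01" does not match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on the worker itself:
#   has_kernel_param nmi_watchdog=0 "$(cat /proc/cmdline)" && echo "parameter present"
```

Matching on whole words avoids false positives from parameters that merely share a prefix.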

#13 Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback

Both workers are running jobs and it looks good so far. I'll set the ticket on feedback and check again in the next days.

#14 Updated by okurz 2 months ago

I suggest also simulating a power outage, e.g. just call ipmi.*power cycle and see how it behaves.

#15 Updated by mkittler 2 months ago

OK, I've just done that with power8-4. Let's see whether it comes up again. (Previously a power cycle needed to be followed by a power reset to get the machine going again but I'll give it a little time before trying that.)

By the way, the system journal on power8-4 from yesterday just stops at some point. I couldn't spot any interesting log messages and grepping for lockup gives no results. So it reminds me of our ARM workers.

#16 Updated by mkittler 2 months ago

It was necessary to invoke power reset as well but then the system booted relatively fast. I suppose that problem with power cycle is nothing new for these two workers, though.
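
The "power cycle followed by power reset" sequence could be scripted roughly as below. This is only a sketch under assumptions: the helper names, the 60 second pause and the use of the IPMI_USER and IPMI_PASSWORD environment variables are illustrative and not taken from the existing recovery tooling.

```shell
#!/bin/sh
# Sketch: power cycle a host via IPMI, wait, then follow up with a reset.
# Assumptions: IPMI_USER and IPMI_PASSWORD are exported by the caller
# (-E makes ipmitool read the password from IPMI_PASSWORD).
# Set DRY_RUN=1 to print the commands instead of running them.
ipmi_power() {
    host=$1
    action=$2
    set -- ipmitool -I lanplus -H "$host" -U "${IPMI_USER:?set IPMI_USER}" -E chassis power "$action"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_host() {
    # $1: BMC host name, $2: pause in seconds (60 is an arbitrary default)
    ipmi_power "$1" cycle
    sleep "${2:-60}"
    ipmi_power "$1" reset
}
```

Running it with DRY_RUN=1 first shows the exact ipmitool invocations without touching the machine.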

I resumed the alerts to know if one of the machines would become unresponsive again.

#17 Updated by okurz 2 months ago

mkittler wrote:

It was necessary to invoke power reset as well but then the system booted relatively fast. I suppose that problem with power cycle is nothing new for these two workers, though.

I resumed the alerts to know if one of the machines would become unresponsive again.

That just seems to have happened: https://stats.openqa-monitor.qa.suse.de/d/WDQA-Power8-4-kvm/worker-dashboard-qa-power8-4-kvm?editPanel=65105&tab=alert&orgId=1&from=1658848718115&to=1658854799197

I paused the alert again.

#18 Updated by mkittler 2 months ago

power8-5 seems to be stable now but power8-4 crashed again around 18:00. I'll reboot it via IPMI. Maybe this time the journal has more info. If not, I can add the host to the automatic recovery we already have for ARM (taking into account that we need power reset here and not power cycle).

#19 Updated by mkittler 2 months ago

power8-4 is back but once got stuck on boot even before reaching the OS. I was also often thrown out of the SOL session¹. Again the journal doesn't show anything interesting before the crash².

--

¹ When I tried to reconnect immediately, I got Info: SOL payload already active on another session but after waiting a while or just deactivating the SOL session I could connect again.

²

Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [pid:24538] Download of SLES-12-SP5-ppc64le-containers.qcow2 processed:
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [#96752] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [#96752] Downloading "SLES-12-SP5-ppc64le-containers.qcow2" from "http://openqa.suse.de/tests/9216080/asset/hdd/SLES-12-SP5-ppc64le-containers.qcow2"
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [#96752] Content of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-ppc64le-containers.qcow2" has not changed, updating last use
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [debug] [pid:24538] Linked asset "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-ppc64le-containers.qcow2" to "/var/lib/openqa/pool/3/SLES-12-SP5-ppc64le-containers.qcow2"
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [info] [pid:24538] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #96753 sent to Cache Service
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [debug] [pid:24538] Updating status so job 9216080 is not considered dead.
Jul 26 18:23:24 QA-Power8-4-kvm worker[24538]: [debug] [pid:24538] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216080/status
Jul 26 18:23:25 QA-Power8-4-kvm worker[14096]: [debug] [pid:14096] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216040/status
Jul 26 18:23:25 QA-Power8-4-kvm worker[14096]: [debug] [pid:14096] Upload concluded (at bci_prepare)
Jul 26 18:23:25 QA-Power8-4-kvm openqa-worker-cacheservice-minion[24775]: [24775] [i] [#96753] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Jul 26 18:23:25 QA-Power8-4-kvm openqa-worker-cacheservice-minion[24775]: [24775] [i] [#96753] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Jul 26 18:23:26 QA-Power8-4-kvm worker[24463]: [debug] [pid:24463] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216065/status
Jul 26 18:23:26 QA-Power8-4-kvm worker[24463]: [debug] [pid:24463] Upload concluded (at boot_to_desktop)
Jul 26 18:23:26 QA-Power8-4-kvm worker[22067]: [debug] [pid:22067] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9216045/status
Jul 26 18:23:27 QA-Power8-4-kvm worker[22067]: [debug] [pid:22067] Upload concluded (at image_docker)
-- Boot 48a662843a3a4ab582b9969f25bdec1f --
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Jul 27 14:55:47 QA-Power8-4-kvm kernel: dt-cpu-ftrs: setup for ISA 2070
Jul 27 14:55:47 QA-Power8-4-kvm kernel: dt-cpu-ftrs: not enabling: subcore (unknown and unsupported by kernel)
Jul 27 14:55:47 QA-Power8-4-kvm kernel: dt-cpu-ftrs: final cpu/mmu features = 0x000000fb8f5db187 0x3c006001
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: Page sizes from device-tree:
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=20: shift=20, sllp=0x0130, avpnm=0x00000000, tlbiel=0, penc=2
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Enabling pkeys with max key count 32
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Disabling hardware transactional memory (HTM)
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Activating Kernel Userspace Access Prevention
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Activating Kernel Userspace Execution Prevention
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24
Jul 27 14:55:47 QA-Power8-4-kvm kernel: Using 1TB segments
Jul 27 14:55:47 QA-Power8-4-kvm kernel: hash-mmu: Initializing hash mmu with SLB
martchus@QA-Power8-4-kvm:~> sudo journalctl --since '24 hours ago' | grep -i lockup
martchus@QA-Power8-4-kvm:~>
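
Grepping only for "lockup" can miss other hints. A slightly broader filter over the end of the previous boot's journal might look like the sketch below; the pattern list is an assumption and certainly not exhaustive, and suspicious_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: filter log text on stdin for messages that often precede or
# accompany a hang. The pattern list is an assumption, not exhaustive.
suspicious_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -i -E 'lockup|kernel panic|EXT4-fs error|corrupt' || true
}

# Possible use on the worker (previous boot, last 200 lines):
#   sudo journalctl -b -1 -n 200 --no-pager | suspicious_lines
```

An empty result, as in the session above, suggests the machine stopped before anything could be written to the journal.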

#20 Updated by mkittler 2 months ago

power8-4 now crashed again while I was still browsing the journal. Since I still had SOL open we now have a little more output:

[   28.237735][ T2075] device tap4 entered promiscuous mode
[   28.238711][ T2075] device tap71 entered promiscuous mode
[   28.244128][ T2075] device tap135 entered promiscuous mode
[  OK  ] Finished ppc64 set SMT off.
         Starting Authorization Manager...
[   29.908745][ T1719] EXT4-fs error: 42 callbacks suppressed
[   29.908748][ T1719] EXT4-fs error (device sdb1): ext4_mb_generate_buddy:1140: group 2589, block bitmap and bg descriptor incSOL session closed by BMC
[martchus@linux-9lzf ~]$ ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P vera8ahreiph6A sol activate
Info: SOL payload already active on another session
[martchus@linux-9lzf ~]$ ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P vera8ahreiph6A sol activate
Info: SOL payload already active on another session
[martchus@linux-9lzf ~]$ ipmitool -I lanplus -C 3 -H qa-power8-4.qa.suse.de -U ADMIN -P vera8ahreiph6A sol activate
         Mounting /var/lib/openqa/share...
[  196.219518][ T4037] FS-Cache: Loaded
[  196.273137][ T4037] RPC: Registered named UNIX socket transport module.
[  196.273200][ T4037] RPC: Registered udp transport module.
[  196.273210][ T4037] RPC: Registered tcp transport module.
[  196.273303][ T4037] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  196.333186][ T4037] FS-Cache: Netfs 'nfs' registered for caching
[  196.338961][ T4043] Key type dns_resolver registered
[  196.543578][ T4043] NFS: Registering the id_resolver key type
[  196.543606][ T4043] Key type id_resolver registered
[  196.543645][ T4043] Key type id_legacy registered
[  OK  ] Listening on RPCbind Server Activation Socket.
[  OK  ] Reached target RPC Port Mapper.
         Starting Notify NFS peers of a restart...
         Starting NFS status monitor for NFSv2/3 locking....
[  OK  ] Started Notify NFS peers of a restart.
         Starting RPC Bind...
[  OK  ] Started RPC Bind.
[  OK  ] Started NFS status monitor for NFSv2/3 locking..
[  197.340453][ T4071] systemd-fstab-generator[4071]: x-systemd.device-timeout ignored for openqa.suse.de:/var/lib/openqa/share
[  197.342888][ T4081] systemd-sysv-generator[4081]: SysV service '/etc/init.d/boot.local' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[  OK  ] Mounted /var/lib/openqa/share.
[  OK  ] Started Timeline of Snapper Snapshots.
         Starting DBus interface for snapper...
[  OK  ] Started DBus interface for snapper.
[ ***  ] A start job is running for /etc/ini…ompatibility (4min 19s / no limit)
         Starting Check if mainboard battery is Ok...
[  OK  ] Finished Check if mainboard battery is Ok.
[**    ] A start job is running for /etc/ini…Compatibility (5min 2s / no limit)
[  315.571309][   C48] EXT4-fs (sdb1): error count since last fsck: 2430
[  315.571380][   C48] EXT4-fs (sdb1): initial error at time 1605764783: ext4_mb_generate_buddy:759
[  OK  ] Started /etc/init.d/boot.local Compatibility.
         Starting Hold until boot process finishes up...
         Starting Terminate Plymouth Boot Screen...

Welcome to openSUSE Leap 15.4 - Kernel 5.14.21-150400.24.11-default (hvc0).

br1: 10.0.2.2 fe80::e0bf:2eff:fe2f:b749
eth0:  
eth1:  
eth2:  
eth3: 10.162.6.201 2620:113:80c0:80a0:10:162:29:60f


QA-Power8-4-kvm login: [  365.807470][ T3923] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  438.050890][   T94] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[  438.051046][   T94] CPU: 16 PID: 94 Comm: ksof

Here you can also see the unstable SOL. So we've got stack corruption within the kernel. Maybe a memory error? Not sure how I'd run a memtest on that power machine.
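
For the "SOL payload already active on another session" message, explicitly deactivating the stale session before reconnecting can be sketched as follows. The helper names and the environment variables are assumptions for illustration; reading the password from IPMI_PASSWORD via -E also keeps it out of the shell history.

```shell
#!/bin/sh
# Sketch: drop a stale SOL session before opening a new one.
# Assumptions: IPMI_USER and IPMI_PASSWORD are exported by the caller
# (-E makes ipmitool read the password from IPMI_PASSWORD).
# Set DRY_RUN=1 to print the commands instead of running them.
sol_cmd() {
    host=$1
    action=$2
    set -- ipmitool -I lanplus -C 3 -H "$host" -U "${IPMI_USER:?set IPMI_USER}" -E sol "$action"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

sol_reconnect() {
    # A failing deactivate (no session active) is not treated as an error.
    sol_cmd "$1" deactivate || true
    sol_cmd "$1" activate
}
```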

#21 Updated by mkittler 2 months ago

Haven't found a memtest tool for PowerPC and the Linux kernel's memtest feature is apparently disabled on PowerPC. Not sure whether the firmware provides something. It was also mentioned that we could contact IBM support about it.

Btw, the machine has 256 GB RAM (but only 8 CPU cores). Maybe just one RAM module is broken and we could live without it.

That is, if it is a memory problem at all. So far I'm not sure; it only looks like one, especially because power8-5 (which runs the same software and workload) seems stable.

#22 Updated by okurz 2 months ago

I doubt that the memory suddenly broke just after we upgraded to Leap 15.4. I suggest rolling back/downgrading to 15.3 to cross-check stability.

#23 Updated by mkittler 2 months ago

I'll try booting into an earlier snapshot. However, if 15.3 works, that raises the question of why power8-5 remains stable under 15.4 and why power8-4 sometimes crashes even before the OS boots (at least to me it looks like it is sometimes crashing early).

Moved power8-4 out of salt again, invoked snapper rollback 2181 and rebooted. (I suppose it isn't possible to boot into a specific snapshot via petitboot.) It now already runs some tests under Leap 15.3. Let's see whether it remains stable or crashes.

#24 Updated by mkittler 2 months ago

So far power8-4 looks good on Leap 15.3.

I've noticed that snapper was not configured on power8-5 (in contrast to power8-4) so I configured it. (We already use btrfs and have enough disk space.)

#25 Updated by mkittler about 2 months ago

Considering power8-4 is still running, downgrading to Leap 15.3 indeed seems to help. Before, under Leap 15.4, it never stayed online for so long. However, it could still be just luck. So I'll keep it running over the weekend to be sure.

Maybe it is possible to install the kernel version from Leap 15.3 under Leap 15.4. That configuration would be my next test and if that is stable it could also be a possible workaround.

#26 Updated by mkittler about 2 months ago

Looks like downgrading helps. I suppose I'll boot into Leap 15.4 again and see whether I can install the older kernel version from Leap 15.3.

EDIT: Easier said than done. There's good documentation but I still don't know where I'd get the old kernel version from (for Leap 15.4, I suppose just using the Leap 15.3 package isn't a good idea).

#27 Updated by MDoucha about 2 months ago

I have to ask for QA-Power8-4-kvm to be left on Leap 15.3 for a while. SLE-15SP2 has some strange bug which breaks framebuffer console on PPC64LE QEMU 6.2.0: bsc#1201796. QA-Power8-4-kvm is currently the only PPC64LE worker where the SLE-15SP2 updates can be tested. SLE-15SP3 appears to be fixed but the actual fix is currently unknown.

#28 Updated by okurz about 2 months ago

mkittler wrote:

[…] I still don't know where I'd get the old kernel version from (for Leap 15.4, I suppose just using the Leap 15.3 package isn't a good idea).

That shouldn't be a problem, especially not for the kernel which is self-contained anyway.

#29 Updated by cdywan about 2 months ago

  • Subject changed from recover qa-power8-4+qa-power8-5 to recover qa-power8-4+qa-power8-5 size:M

Suggested to report a product issue after discussing it in the estimation call

#30 Updated by mkittler about 2 months ago

  • Description updated (diff)

#31 Updated by okurz about 2 months ago

  • Due date deleted (2022-08-06)
  • Status changed from Feedback to Blocked

Well-written bug report. I suggest blocking for now on both the bug report you filed, https://bugzilla.opensuse.org/show_bug.cgi?id=1202138, and https://bugzilla.suse.com/show_bug.cgi?id=1201796

#32 Updated by okurz about 2 months ago

  • Due date set to 2022-11-01
  • Priority changed from High to Normal

So currently both qa-power8-4 and qa-power8-5 are up and running and usable for openQA tests though with differing OS versions. That can be good for debugging and investigation where required.
Let's wait at the latest until one month before the EOL of Leap 15.3 and check again then.

#33 Updated by mkittler about 2 months ago

Looks like now the file system on power8-4 is corrupted:

martchus@QA-Power8-4-kvm:~> sudo journalctl -fu openqa-worker-cacheservice-minion.service 
…
[#107890] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Aug 10 17:33:47 QA-Power8-4-kvm openqa-worker-cacheservice-minion[24456]: rsync: readlink_stat("/var/lib/openqa/cache/openqa.suse.de/tests/sle/products/opensuse/needles/.git/index") failed: Structure needs cleaning (117)
…
martchus@QA-Power8-4-kvm:~> sudo rm -r /var/lib/openqa/cache/openqa.suse.de/tests/sle/products/opensuse/needles/.git/index
rm: cannot remove '/var/lib/openqa/cache/openqa.suse.de/tests/sle/products/opensuse/needles/.git/index': Structure needs cleaning

But that's just the throw-away ext2 filesystem created by our NVMe/RAID script so a reboot should repair it.


The ext4 filesystem is actually also not in great shape as I've got many messages about inconsistencies when rebooting:

[  *** ] A start job is running for /etc/ini…Compatibility (1min 7s / no limit)
[   79.648237] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2585, block bitmap and bg descriptor inconsistent: 26423 vs 26428 free clusters
[   79.648407] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2587, block bitmap and bg descriptor inconsistent: 30674 vs 30677 free clusters
[   79.648497] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2589, block bitmap and bg descriptor inconsistent: 31152 vs 31154 free clusters
[   79.648557] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2590, block bitmap and bg descriptor inconsistent: 30689 vs 30693 free clusters
[   79.709621] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2568, block bitmap and bg descriptor inconsistent: 26721 vs 26795 free clusters
[   79.709879] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2575, block bitmap and bg descriptor inconsistent: 27609 vs 27621 free clusters
[   79.756691] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2577, block bitmap and bg descriptor inconsistent: 29779 vs 29828 free clusters
[   79.756883] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2581, block bitmap and bg descriptor inconsistent: 30628 vs 30547 free clusters
[   79.757007] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2582, block bitmap and bg descriptor inconsistent: 13767 vs 13771 free clusters
[  *** ] A start job is running for /etc/ini…ompatibility (1min 31s / no limit)
[  102.907136] EXT4-fs error: 2 callbacks suppressed
[  102.907137] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 6465, block bitmap and bg descriptor inconsistent: 3359 vs 3412 free clusters
[   ***] A start job is running for /etc/ini…ompatibility (1min 38s / no limit)
[  109.661883] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 0, block bitmap and bg descriptor inconsistent: 422 vs 423 free clusters
[  109.662023] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 4478 vs 4479 free clusters
[  109.662045] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 2, block bitmap and bg descriptor inconsistent: 1534 vs 1535 free clusters
[  *** ] A start job is running for /etc/ini…ompatibility (1min 38s / no limit)
[  110.178385] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 6460, block bitmap and bg descriptor inconsistent: 1534 vs 1550 free clusters
[  110.178450] EXT4-fs error (device sda1): ext4_mb_generate_buddy:747: group 6461, block bitmap and bg descriptor inconsistent: 2539 vs 2555 free clusters
[  *** ] A start job is running for /etc/ini…ompatibility (1min 42s / no limit)
[  113.698846] EXT4-fs error (device sda1): ext4_lookup:1728: inode #47653495: comm rsync: deleted inode referenced: 47659659
[    **] A start job is running for /etc/ini…Compatibility (2min 4s / no limit)
[  *** ] A start job is running for /etc/ini…ompatibility (2min 10s / no limit)
[   ***] A start job is running for /etc/ini…ompatibility (2min 10s / no limit)
[ ***  ] A start job is running for /etc/ini…ompatibility (2min 23s / no limit)
[  155.134104] EXT4-fs error (device sda1): ext4_lookup:1728: inode #47653495: comm rsync: deleted inode referenced: 47659659
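
To get a quick overview of which device is affected and how badly, messages like the ones above can be counted per device. This is a minimal sketch; ext4_errors_per_device is a hypothetical helper name, and because the kernel suppresses repeated callbacks the counts are only a lower bound.

```shell
#!/bin/sh
# Sketch: count "EXT4-fs error (device X)" messages per device in kernel
# log text read from stdin. Prints "<device> <count>", sorted by device.
ext4_errors_per_device() {
    sed -n 's/.*EXT4-fs error (device \([^)]*\)).*/\1/p' | sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Possible use on the worker:
#   sudo dmesg | ext4_errors_per_device
```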

#34 Updated by cdywan about 2 months ago

  • Related to action #115208: failed-systemd-services: logrotate-openqa alerting on and off size:M added

#35 Updated by mkittler about 2 months ago

The reboot actually didn't help because we don't have our NVMe/RAID script installed:

martchus@QA-Power8-4-kvm:~> sudo systemctl status openqa_nvme_format.service
Unit openqa_nvme_format.service could not be found.

I've just stopped everything:

systemctl stop openqa-worker-auto-restart@* openqa-worker-cacheservice* var-lib-openqa.mount

Looks like we're still using ext2 (even though it is not re-created on every boot):

martchus@QA-Power8-4-kvm:~> sudo file -sL /dev/sda1
/dev/sda1: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=f5de0a79-bfa8-41c9-ba86-c0d2df739acc (errors) (large files)

Considering the previous crashes it is no surprise to end up with a broken ext2 file system.

The filesystem fix is still running but maybe it would be best to reformat with ext4.

#36 Updated by mkittler about 2 months ago

After the filesystem fix I see no more problems in the journal and tests on several worker slots are already past the test syncing. So I suppose the immediate problem has been fixed.

For further improvements, e.g. moving to ext4 or ensuring the ext2 filesystem is re-created on boot, we should create a separate ticket (once we have decided what we want to do).

#37 Updated by cdywan about 2 months ago

  • Status changed from Blocked to In Progress

I assume this isn't Blocked since you're actively working on it.

#38 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Blocked

It is still blocked by the tickets mentioned in #114565#note-31. My most recent work was just file system cleanup after the crashes. I've created #115226 for further improvements (regardless of the bigger problem here).

#39 Updated by mkittler 24 days ago

Now power8-5 just went offline. I'll try a power cycle and investigate.

#40 Updated by mkittler 24 days ago

The log of power8-5 just ended:

Sep 02 04:38:38 QA-Power8-5-kvm worker[53749]: [debug] [pid:53749] Uploading artefact ioperm02-2.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59603]: [debug] [pid:59603] Uploading artefact nfs42_ipv6_02-8.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59120]: [debug] [pid:59120] Uploading artefact preadv03_64-4.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59704]: [debug] [pid:59704] Uploading artefact boot_ltp-31.txt
Sep 02 04:38:38 QA-Power8-5-kvm worker[59625]: [debug] [pid:59625] Uploading artefact test_connect-2.txt
-- Boot 5378bf885e9e49ecbadd3cfc8aeccb90 --
Sep 02 11:30:30 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: Page sizes from device-tree:
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Sep 02 11:30:30 QA-Power8-5-kvm kernel: hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1

I also couldn't spot anything strange in the hours before.

#41 Updated by cdywan 14 days ago

#42 Updated by mkittler 14 days ago

  • Related to action #116473: Add OSD PowerPC workers to automatic recovery we already have for ARM workers added

#43 Updated by michals 11 days ago

1) you shouldn't need nmi_watchdog=0

2) can you use the stable kernel archives to determine the upstream kernel version that broke your machines?

https://build.opensuse.org/project/show/home:tiwai:kernel:5.8

There is a lot of difference between 5.3 and 5.14, and with these OpenPOWER machines being somewhat uncommon hardware you might be the first to see the problem.

#44 Updated by mkittler 11 days ago

I can try that 5.8 kernel tomorrow.

#45 Updated by okurz 10 days ago

  • Status changed from Blocked to In Progress

As discussed in our weekly meeting, mkittler is working on this based on feedback and suggestions from https://bugzilla.opensuse.org/show_bug.cgi?id=1202138 and https://bugzilla.suse.com/show_bug.cgi?id=1201796

#46 Updated by mkittler 7 days ago

  • Status changed from In Progress to Feedback

As of now qa-power8-5-kvm.qa.suse.de runs on Linux 5.8:

martchus@QA-Power8-5-kvm:~> uname -a
Linux QA-Power8-5-kvm 5.8.15-1.gc680e93-default #1 SMP Thu Oct 15 08:10:08 UTC 2020 (c680e93) ppc64le ppc64le ppc64le GNU/Linux

Let's see whether the issue is still reproducible on that version. Note that when I tried to install the kernel package the machine was offline again. After recovery, the machine broke during the installation of the kernel package. So qa-power8-5-kvm.qa.suse.de can now definitely be considered unstable on Linux 5.14/Leap 15.4 as well (while in the beginning only qa-power8-4-kvm.qa.suse.de seemed affected). At least it was stable enough to install the kernel package on the second attempt. The new kernel also shows up in the bootloader and is supposedly even the first/default option now.
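
Since the bootloader default is only "supposedly" the new kernel, it may be worth checking after each reboot which kernel series is actually running. A minimal sketch, with kernel_major_minor as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: extract "major.minor" from a kernel release string as printed
# by `uname -r`. Prints nothing if the string does not start with a version.
kernel_major_minor() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1.\2/p'
}

# Possible use on the worker:
#   [ "$(kernel_major_minor "$(uname -r)")" = 5.8 ] || echo "unexpected kernel" >&2
```

Comparing only major.minor keeps the check valid across patch-level updates within the same series.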

#48 Updated by okurz 5 days ago

  • Related to action #116743: [alert] QA-Power8-5-kvm: host up alert added

#49 Updated by okurz 5 days ago

We decided to give it some days to check if the system stays stable. If it turns out to be stable, report that in bug reports and go forward with investigating, e.g. trying next major kernel version 5.9 or something.

#50 Updated by jbaier_cz about 3 hours ago

  • Related to action #117229: [tools] openqa failing on worker QA-Power8-5-kvm:7 added
