action #169939

closed

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

Upgrade Power8 o3 workers to openSUSE Leap 15.6 size:M

Added by gpathak 4 months ago. Updated 5 days ago.

Status: Resolved
Priority: Low
Assignee:
Category: Organisational
Start date: 2024-11-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

  • Need to upgrade the workers before the EOL of Leap 15.5 and have a consistent environment
  • Upgrading the power8 workers kerosene.qe.nue2.suse.org and qa-power8-3.openqanet.opensuse.org encountered some issues; see #157972

Acceptance criteria

  • AC1: all power8 o3 worker machines run a cleanly upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.)

Suggestions

Rollback Actions

  • Remove the zypper package lock on perl-Mojo-IOLoop-ReadWriteProcess (see the sketch below)
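
A minimal sketch of how such a lock can be listed and removed (assuming it was added with zypper al):

  # list current package locks with their numbers
  zypper ll
  # remove the lock, by name or by the number shown in the listing
  zypper rl perl-Mojo-IOLoop-ReadWriteProcess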

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from, there is the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)

Files

clipboard-202501101213-mq9d6.png (38.4 KB) clipboard-202501101213-mq9d6.png gpathak, 2025-01-10 06:43
clipboard-202501101222-maqgi.png (74.5 KB) clipboard-202501101222-maqgi.png gpathak, 2025-01-10 06:52
clipboard-202501101823-slqbk.png (67.8 KB) clipboard-202501101823-slqbk.png gpathak, 2025-01-10 12:53
clipboard-202501101826-poyf3.png (40.9 KB) clipboard-202501101826-poyf3.png gpathak, 2025-01-10 12:56
clipboard-202501101827-kt565.png (86.1 KB) clipboard-202501101827-kt565.png gpathak, 2025-01-10 12:57
clipboard-202501131748-nqil7.png (60.4 KB) clipboard-202501131748-nqil7.png gpathak, 2025-01-13 12:18
clipboard-202501141602-ndo23.png (72.5 KB) clipboard-202501141602-ndo23.png gpathak, 2025-01-14 10:32
qa-power8-crash (29.6 KB) qa-power8-crash gpathak, 2025-01-16 05:07
crash-qa-power8 (66 KB) crash-qa-power8 Kernel Crash Log gpathak, 2025-01-17 05:38
Crash-Log.7z (2.07 MB) Crash-Log.7z gpathak, 2025-01-23 05:17
Crash-Log.tar.gz (4.37 MB) Crash-Log.tar.gz gpathak, 2025-01-23 09:39
qa-power8-softlockup-2.log (112 KB) qa-power8-softlockup-2.log gpathak, 2025-01-24 11:29
qa-power8-3-kernel-error.log (94.2 KB) qa-power8-3-kernel-error.log gpathak, 2025-02-15 08:43

Related issues 4 (2 open, 2 closed)

Related to openQA Infrastructure (public) - action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S (Resolved, ybonatakis)

Copied from openQA Project (public) - action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S (Resolved, gpathak)

Copied to openQA Infrastructure (public) - action #177973: Dying disk on qa-power8-3: Needs replacement? size:S (Workable, start date 2024-11-14)

Copied to openQA Infrastructure (public) - action #178063: Investigate the effect of removing kernel package lock on kerosene-8 (New, start date 2024-11-14)

Actions #1

Updated by gpathak 4 months ago

  • Copied from action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Actions #2

Updated by gpathak 4 months ago

  • Subject changed from Upgrade Power8 o3 workers to openSUSE Leap 15.6 size:S to Upgrade Power8 o3 workers to openSUSE Leap 15.6
Actions #3

Updated by gpathak 4 months ago

  • File deleted (pb-discover.log)
Actions #4

Updated by gpathak 4 months ago

  • File deleted (clipboard-202411051428-dg60d.png)
Actions #5

Updated by gpathak 4 months ago

  • File deleted (clipboard-202411121229-elbgt.png)
Actions #6

Updated by okurz 4 months ago

  • Target version changed from Ready to Tools - Next
Actions #7

Updated by okurz 3 months ago

  • Related to action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S added
Actions #8

Updated by okurz 3 months ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Description updated (diff)
  • Category changed from Organisational to Organisational
  • Priority changed from Normal to High
  • Target version changed from Tools - Next to Ready
Actions #9

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #10

Updated by livdywan 2 months ago

  • Subject changed from Upgrade Power8 o3 workers to openSUSE Leap 15.6 to Upgrade Power8 o3 workers to openSUSE Leap 15.6 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by gpathak 2 months ago

  • Assignee set to gpathak
Actions #12

Updated by gpathak 2 months ago

  • Status changed from Workable to In Progress
Actions #13

Updated by gpathak 2 months ago

After performing the upgrade on qa-power8-3.openqanet.opensuse.org, Petitboot takes a lot of time to mount the actual filesystem (/dev/sda3).

I think @nicksinger did something magical previously as part of #169576 to fix this issue.

Actions #14

Updated by gpathak 2 months ago

It took approximately 5 minutes for Petitboot to detect and mount /dev/sda3.

Actions #15

Updated by gpathak about 2 months ago

After removing kernel 6.4.0-150600.23.30, the machine is now able to boot properly without going into maintenance mode.
Triggered some jobs on o3 to check whether the tests pass.

Actions #16

Updated by gpathak about 2 months ago

Trying to update the kernel to 6.12.8 from https://download.opensuse.org/repositories/Kernel:/stable/PPC/ causes a package conflict with the filesystem package.

Upgrading filesystem to a newer version isn't possible because the installed version is already the highest one, installed from Factory.

The URL of the repo points to OBS Factory.
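
As an aside (not recorded in the ticket): a quick way to confirm which repository and vendor the installed filesystem package comes from:

  # show version, repository and vendor of the installed package
  zypper info filesystem
  rpm -q --queryformat '%{VENDOR}\n' filesystem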

Actions #17

Updated by openqa_review about 2 months ago

  • Due date set to 2025-01-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by gpathak about 2 months ago

I was able to upgrade the qa-power8 machine to 15.6 with kernel 6.13.0-rc6-3.g25a2d29-default installed from https://download.opensuse.org/repositories/Kernel:/HEAD/PPC/
Had to replace suse-module-tools with the version from the non-standard repository https://download.opensuse.org/repositories/home:/mwilck:/suse-module-tools/PowerPC/
Installed the filesystem-84.87-16.1.ppc64le package from https://build.opensuse.org/package/show/openSUSE:Factory/filesystem, which changed the vendor to obs://build.opensuse.org/Base:System
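
The vendor change was presumably accepted at install time; a sketch of how zypper permits that (the rpm file name below is illustrative):

  # install a specific rpm even though its vendor differs from the installed package's
  zypper in --allow-vendor-change filesystem-84.87-16.1.ppc64le.rpm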

Tried cloning jobs to run on qa-power8, but the jobs are failing:
Fatal error in command /usr/bin/chattr +C /var/lib/openqa/pool/4/raid: Can't locate object method "quirkiness" via package "Mojo::IOLoop::ReadWriteProcess" at /usr/lib/os-autoinst/osutils.pm line 63
https://openqa.opensuse.org/tests/4764546

The other error was: main: Can't locate object method "_fork_collect_status" via package "Mojo::IOLoop::ReadWriteProcess" at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess/Session.pm line 79
https://openqa.opensuse.org/tests/4765533
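
For reference, re-running such a job is typically done with openqa-clone-job; a sketch using the job ID linked above, with the WORKER_CLASS override value being an assumption:

  # clone job 4764546 within the same instance, pinned to a power8 worker class
  openqa-clone-job --within-instance https://openqa.opensuse.org 4764546 \
      WORKER_CLASS=qa-power8-3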

Actions #19

Updated by gpathak about 2 months ago

I verified that the system gets an IP address and doesn't crash during boot with Linux kernel 6.4 (without performing the reboot stability check), but I have to manually edit the Petitboot option to modify the kernel and initrd paths: I have to delete the leading /@/.snapshots/1/snapshot from both paths, as shown in the screenshot, to boot the machine.

Not sure how to save it permanently.

Actions #20

Updated by gpathak about 2 months ago

gpathak wrote in #note-19:

Not sure how to save it permanently.

Turned out that /etc/default/grub needs some extra configuration that wasn't present on qa-power8.
Checked petrol and diesel for their grub configuration in /etc/default/grub, copied the common configuration to qa-power8 as well, and the boot issue mentioned in my previous comment is resolved.

Right now running the reboot-stability-check script: https://github.com/os-autoinst/scripts/blob/master/reboot-stability-check
Linux kernel 6.4 is currently active and running on qa-power8
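
A minimal sketch of that comparison workflow, assuming direct SSH access from qa-power8 to the other hosts:

  # fetch the known-good bootloader defaults from petrol and compare
  ssh petrol cat /etc/default/grub > /tmp/grub.petrol
  diff /tmp/grub.petrol /etc/default/grub
  # after copying the missing settings into /etc/default/grub, regenerate the config
  update-bootloader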

Actions #21

Updated by gpathak about 2 months ago

Reboot stability check executed 30 times and the machine booted successfully.

run: 25, qa-power8-3.openqanet.opensuse.org: ping .. ok, ssh .. ok, uptime/reboot:  21:45:15  up   0:05,  0 users,  load average: 0.88, 0.93, 0.45
run: 26, qa-power8-3.openqanet.opensuse.org: ping .. ok, ssh .. ok, uptime/reboot:  21:58:54  up   0:05,  0 users,  load average: 0.88, 0.92, 0.45
run: 27, qa-power8-3.openqanet.opensuse.org: ping .. ok, ssh .. ok, uptime/reboot:  22:12:29  up   0:05,  0 users,  load average: 0.96, 0.94, 0.45
run: 28, qa-power8-3.openqanet.opensuse.org: ping .. ok, ssh .. ok, uptime/reboot:  22:26:06  up   0:05,  0 users,  load average: 0.89, 0.93, 0.45
run: 29, qa-power8-3.openqanet.opensuse.org: ping .. ping: sendmsg: No route to host
ok, ssh .. ok, uptime/reboot:  22:39:37  up   0:05,  0 users,  load average: 1.11, 1.18, 0.58
run: 30, qa-power8-3.openqanet.opensuse.org: ping .. ping: sendmsg: No route to host
ok, ssh .. ok, uptime/reboot:  22:53:14  up   0:05,  0 users,  load average: 1.02, 1.12, 0.56

but tests aren't running properly and are failing...

Actions #22

Updated by gpathak about 2 months ago · Edited

The jobs on qa-power8-3 were failing with error:
[2025-01-10T19:04:01.488855+01:00] [debug] [pid:29227] Fatal error in command /usr/bin/chattr +C /var/lib/openqa/pool/4/raid: Can't locate object method "quirkiness" via package "Mojo::IOLoop::ReadWriteProcess" at /usr/lib/os-autoinst/osutils.pm line 63.

Turned out that there were two ReadWriteProcess.pm files: one located in /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop and the other in /usr/lib/perl5/site_perl/5.26.1/Mojo/IOLoop. The latter did not have any quirkiness defined, such as:

has [qw(blocking_stop serialize quirkiness total_sleeptime_during_kill)] => 0;

This definition was present in the vendor_perl version.

To resolve the issue, I deleted the ReadWriteProcess.pm file and the ReadWriteProcess directory from site_perl and restarted the job; isotovideo then started successfully.

I will do some more tests.
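
A small sketch of how such a duplicate can be tracked down, i.e. which copy Perl actually loads and where all copies live:

  # print the path of the copy that actually gets loaded
  perl -MMojo::IOLoop::ReadWriteProcess \
       -e 'print $INC{"Mojo/IOLoop/ReadWriteProcess.pm"}, "\n"'
  # list every installed copy under the system perl directories
  find /usr/lib/perl5 -path '*Mojo/IOLoop/ReadWriteProcess.pm'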

Actions #23

Updated by gpathak about 2 months ago

Kernel 6.4 crashed after running for nearly a day.
Possibly related to #162296

Actions #24

Updated by gpathak about 2 months ago

  • Description updated (diff)
Actions #25

Updated by gpathak about 2 months ago · Edited

Executed some more jobs with Kernel 6.4:

Seems the machine is stable now; will keep it under observation to see if it crashes again. If it doesn't crash, I will move to the next machine and upgrade kerosene.

Actions #26

Updated by gpathak about 2 months ago · Edited

Disabled worker slots 11..30 on kerosene to redirect some of the scheduled jobs to run on qa-power8.
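
Disabling worker slots usually means stopping the per-slot systemd units; a sketch (the exact unit template, openqa-worker@ vs. openqa-worker-auto-restart@, depends on the host's setup):

  # stop and disable worker slots 11 through 30
  systemctl disable --now openqa-worker@{11..30}.service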

Actions #27

Updated by gpathak about 2 months ago

Upgraded kerosene to Leap 15.6 as well; it is under observation.

Actions #28

Updated by gpathak about 2 months ago

qa-power8-3 crashed again

Actions #29

Updated by gpathak about 2 months ago

gpathak wrote in #note-28:

qa-power8-3 crashed again

Installed the old firewalld packages:

zypper in --oldpackage \
  https://download.opensuse.org/update/leap/15.3/sle/noarch/firewalld-0.9.3-150300.3.3.1.noarch.rpm \
  https://download.opensuse.org/update/leap/15.3/sle/noarch/python3-firewall-0.9.3-150300.3.3.1.noarch.rpm \
  https://download.opensuse.org/update/leap/15.3/sle/noarch/firewalld-lang-0.9.3-150300.3.3.1.noarch.rpm

and added a package lock: zypper al -m "boo#1227616" *firewall*

Kerosene is also crashing, but it seems unrelated to the firewall (I am not sure at this point because I am unable to get any crash log yet).
The issue on kerosene seems to be related to a corrupted filesystem on /dev/sdb1:

[  125.598476][ T4515] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282262: comm perl: deleted inode referenced: 22315870
[  125.861617][ T4515] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282262: comm perl: deleted inode referenced: 22315865
[  125.877535][ T4515] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282262: comm perl: deleted inode referenced: 22315864
[  140.606753][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  140.609626][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  140.622667][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.286896][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.295465][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.392507][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.395382][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.399007][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.403024][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  141.407136][ T4035] EXT4-fs error (device sdb1) in ext4_free_inode:362: Corrupt filesystem
[  174.313161][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315988
[  174.327871][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315987
[  174.339871][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315986
[  174.343795][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22554957
[  174.349235][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315988
[  174.351765][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315987
[  174.355750][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315986
[  174.359731][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22554957
[  174.368650][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315988
[  174.371705][ T4455] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm perl: deleted inode referenced: 22315987
[  181.099461][ T4045] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm worker: deleted inode referenced: 22315988
[  181.145671][ T4045] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm worker: deleted inode referenced: 22315987
[  181.149643][ T4045] EXT4-fs error (device sdb1): ext4_lookup:1858: inode #22282268: comm worker: deleted inode referenced: 22315986
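
Not recorded in the ticket, but the standard first response to such errors would be an offline filesystem check; a sketch, assuming /dev/sdb1 can be unmounted:

  # unmount and force-check the ext4 filesystem; -y auto-confirms repairs
  umount /dev/sdb1
  e2fsck -f -y /dev/sdb1
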
Actions #30

Updated by gpathak about 2 months ago · Edited

19-01-2025: Reverted kerosene to Leap 15.5 to enable the execution of scheduled/blocked PPC64 tests.
19-01-2025: Applied the solution from #162296 on qa-power8-3 (ref https://progress.opensuse.org/issues/162296#note-35). The system remains stable, with all 8 slots successfully running tests for over 24 hours.

Will be upgrading kerosene-8 as well.
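
The revert to 15.5 above was presumably done via the btrfs snapshots mentioned in the ticket description; a generic snapper sketch (the snapshot number is a placeholder):

  # list snapshots, roll the root filesystem back to a pre-upgrade one, reboot
  snapper list
  snapper rollback 42   # 42 stands in for the actual pre-upgrade snapshot number
  reboot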

Actions #31

Updated by gpathak about 2 months ago

Kerosene-8 is also upgraded to 15.6, running kernel 6.4.0-150600.23.33-default.

Actions #32

Updated by gpathak about 2 months ago · Edited

  • Status changed from In Progress to Feedback

@okurz
Both power8 machines in the o3 infra are upgraded to Leap 15.6 with kernel 6.4.
Moving it to feedback.

Actions #33

Updated by okurz about 2 months ago

  • Status changed from Feedback to Workable
  1. What about https://progress.opensuse.org/issues/169939#Rollback-Actions
  2. Can you please comment on the bugzilla ticket about the kernel regression which prevented kernel upgrades for us since Leap 15.4
Actions #34

Updated by gpathak about 2 months ago

okurz wrote in #note-33:

  1. What about https://progress.opensuse.org/issues/169939#Rollback-Actions
  2. Can you please comment on the bugzilla ticket about the kernel regression which prevented kernel upgrades for us since Leap 15.4

The machines still become unreachable with kernel 6.4.0-150600.23.33-default, not immediately after boot-up as mentioned in https://bugzilla.suse.com/show_bug.cgi?id=1227616#c2, but after approximately 24 hours.
I can't find a way to determine why the machines go offline.

Actions #35

Updated by gpathak about 2 months ago

This happened on kerosene: https://bugzilla.suse.com/show_bug.cgi?id=1227616#c15
It is not happening consistently every time.

Actions #36

Updated by gpathak about 2 months ago

  • Status changed from Workable to In Progress
Actions #37

Updated by gpathak about 2 months ago

Even after installing the stable kernel 6.13.0-1.g1513344-default from https://download.opensuse.org/repositories/Kernel:/stable/PPC/ on qa-power8, the machine crashed.
The solution applied on kerosene-8, replacing firewalld with nftables as in #162296, also didn't work; the machine crashed.

Actions #38

Updated by gpathak about 2 months ago

Crash Log

Actions #39

Updated by gpathak about 2 months ago · Edited

Observed another soft lockup on qa-power8 causing performance issues, making the machine slow to respond.
Filed separate bugzilla tickets for the two different issues observed.
Blocking on

Actions #40

Updated by gpathak about 2 months ago

  • Due date deleted (2025-01-25)
Actions #41

Updated by okurz about 1 month ago

@gpathak in your bug reports please add details about steps to reproduce and what statistics about failures you have. Also please reference the last good version and the original bug report explaining why we kept the kernel at the Leap 15.3 version

Actions #42

Updated by gpathak about 1 month ago · Edited

okurz wrote in #note-41:

@gpathak in your bug reports please add details about steps to reproduce and what statistics about failures you have. Also please reference the last good version and the original bug report explaining why we kept the kernel at the Leap 15.3 version

There are no clear steps to reproduce the issue at this time. The kernel on the Power8 machine occasionally crashes, causing the system to go offline while running an openQA test (though the specific test is unknown). This does not directly impact openQA results, as the job is simply marked as canceled due to the worker becoming unreachable.

The last known stable kernel version was 5.3.18.

Actions #43

Updated by livdywan 28 days ago

  • Status changed from Blocked to In Progress

gpathak wrote in #note-39:

Discussed briefly in the daily. Gaurav's going to follow up with the suggestions in the ticket.

Actions #44

Updated by openqa_review 27 days ago

  • Due date set to 2025-02-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #45

Updated by gpathak 27 days ago · Edited

Right now I am focusing on one power8 machine, currently utilising qa-power8 for the task.
Going back to square one, starting all over and recording what I did/am doing here in this ticket.

12-Feb-2025:

Actions #46

Updated by gpathak 26 days ago

gpathak wrote in #note-45:

Right now I am focusing on one power8 machine, currently utilising qa-power8 for the task.
Going back to square one, starting all over and recording what I did/am doing here in this ticket.

12-Feb-2025:

It's been more than 24 hours and a significant number of tests have run on qa-power8-3; so far the machine looks stable.
Let's wait and watch the behaviour until the end of this week.

Actions #47

Updated by gpathak 25 days ago

  • Status changed from In Progress to Feedback

Discussed this in the infra daily.
Putting this into feedback since we want to keep the qa-power8 machine under observation for more time to gain confidence, because from past experience the power8 machines always run into some issue after an upgrade.

Actions #48

Updated by gpathak 24 days ago · Edited

gpathak wrote in #note-47:

Discussed this in the infra daily.
Putting this into feedback since we want to keep the qa-power8 machine under observation for more time to gain confidence, because from past experience the power8 machines always run into some issue after an upgrade.

The openqa-continuous-update.service upgraded the kernel from 6.4.0-150600.23.33 to 6.4.0-150600.23.38, and it seems the kernel crashed again after that.
Downgraded the kernel to 6.4.0-150600.23.33 and added zypper package locks: zypper al -m "boo#1236380" *kernel-default* and zypper al -m "boo#1236380" kernel-default

Actions #49

Updated by okurz 22 days ago

  • Priority changed from Normal to Low
Actions #50

Updated by gpathak 20 days ago · Edited

  • Status changed from Feedback to Workable

gpathak wrote in #note-48:

gpathak wrote in #note-47:

Discussed this in the infra daily.
Putting this into feedback since we want to keep the qa-power8 machine under observation for more time to gain confidence, because from past experience the power8 machines always run into some issue after an upgrade.

The openqa-continuous-update.service upgraded the kernel from 6.4.0-150600.23.33 to 6.4.0-150600.23.38, and it seems the kernel crashed again after that.
Downgraded the kernel to 6.4.0-150600.23.33 and added zypper package locks: zypper al -m "boo#1236380" *kernel-default* and zypper al -m "boo#1236380" kernel-default

I think we have executed enough tests on qa-power8 and the machine seems stable so far.
I will go ahead and upgrade kerosene-8 as well.

Actions #51

Updated by gpathak 20 days ago

  • Status changed from Workable to Feedback

gpathak wrote in #note-45:

12-Feb-2025:

Upgraded kerosene-8 to Leap 15.6, running kernel 6.4.0-150600.23.33-default.
Moving this ticket to feedback again.

Actions #52

Updated by gpathak 20 days ago

  • Due date changed from 2025-02-26 to 2025-03-05

Bumping the due date; we want to be conservative considering the power8 issues encountered.

Actions #53

Updated by gpathak 19 days ago · Edited

Spent time setting up kdump on qa-power8 since it went into the same state.
The kdump setup is not yet complete; will continue later.

Actions #54

Updated by gpathak 18 days ago

gpathak wrote in #note-53:

Spent time setting up kdump on qa-power8 since it went into the same state.
The kdump setup is not yet complete; will continue later.

Set up kdump on kerosene as well as qa-power8; hopefully it will now generate a kernel dump if the kernel crashes again in the future.

  • zypper in -y kexec-tools kdump-*
  • Added the below contents to /etc/sysconfig/kdump:
KDUMP_KERNELVER="/boot/vmlinux-6.4.0-150600.23.33-default"
KDUMP_CPUS=2
KDUMP_COMMANDLINE_APPEND=""
KDUMP_AUTO_RESIZE="false"
KDUMP_FADUMP="false"
KDUMP_FADUMP_SHELL="false"
KDUMP_IMMEDIATE_REBOOT="true"
KDUMP_SAVEDIR="/var/crash"
KDUMP_KEEP_OLD_DUMPS=2
KDUMP_FREE_DISK_SIZE=64
KDUMP_VERBOSE=0
KDUMP_DUMPLEVEL=31
KDUMP_DUMPFORMAT="compressed"
KDUMP_CONTINUE_ON_ERROR="true"
KDUMP_NETCONFIG="auto"
KDUMP_NET_TIMEOUT=30
  • Made sure the symlinks for vmlinux and initrd in the /boot/ directory are not dangling
  • Added GRUB_CMDLINE_LINUX_DEFAULT+=" crashkernel=1G" to /etc/default/grub and then ran update-bootloader
  • Started the kexec-load and kdump services: systemctl enable --now kexec-load.service; systemctl enable --now kdump
  • Rebooted the machine once
  • Verified kdump by triggering a kernel panic via the procfs interface to the Magic SysRq keys: echo c > /proc/sysrq-trigger
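
A few follow-up checks to confirm the setup took effect (a sketch; paths follow the config above):

  # the crashkernel reservation should show up on the kernel command line
  grep -o 'crashkernel=[^ ]*' /proc/cmdline
  # both services should be active
  systemctl is-active kexec-load.service kdump.service
  # after the test panic and reboot, dumps land under KDUMP_SAVEDIR
  ls -l /var/crash/
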
Actions #55

Updated by livdywan 12 days ago

Discussed the options briefly:

  • The best workaround idea currently is changing away from btrfs, e.g. to ext4, but that could obviously lead to corruption
  • Automatic reboots like we do on some arm machines (which crash less readily)
  • Leave it for now. You made an amazing effort to find a working version. It is what it is. We voted in favor of this
Actions #56

Updated by gpathak 12 days ago

  • Copied to action #177973: Dying disk on qa-power8-3: Needs replacement? size:S added
Actions #57

Updated by gpathak 12 days ago · Edited

livdywan wrote in #note-55:

Discussed the options briefly:

  • The best workaround idea currently is changing away from btrfs, e.g. to ext4, but that could obviously lead to corruption
  • Automatic reboots like we do on some arm machines (which crash less readily)
  • Leave it for now. You made an amazing effort to find a working version. It is what it is. We voted in favor of this

I just did a fresh Leap 15.6 installation on qa-power8-3. In my opinion kernel instability can be ruled out, because kerosene-8 has been running flawlessly for more than 5 days.
I discovered some error messages in the boot logs (#177973) that can be a sign of a dying storage medium, and I think that can be the cause of the unstable behaviour of the qa-power8-3 machine: in most of the kernel crash logs of qa-power8 the stack trace points to kernel filesystem APIs for btrfs, and in this case the rootfs is of type btrfs.

Actions #58

Updated by gpathak 11 days ago

  • Copied to action #178063: Investigate the effect of removing kernel package lock on kerosene-8 added
Actions #59

Updated by gpathak 11 days ago · Edited

  • Status changed from Feedback to Resolved

Resolving this ticket as discussed in the daily.
Created a separate ticket, #177973, for investigating the issue, as it is happening only on one machine.
kerosene-8 was updated successfully. I added a kernel package lock on kerosene-8 to be on the safe side, since we only have one reliable power8 machine; a follow-up ticket, #178063, was created to investigate and attempt removing the package lock.

Actions #60

Updated by okurz 8 days ago

  • Due date deleted (2025-03-05)
Actions #61

Updated by gpathak 5 days ago

gpathak wrote in #note-57:

livdywan wrote in #note-55:

Discussed the options briefly:

  • The best workaround idea currently is changing away from btrfs, e.g. to ext4, but that could obviously lead to corruption
  • Automatic reboots like we do on some arm machines (which crash less readily)
  • Leave it for now. You made an amazing effort to find a working version. It is what it is. We voted in favor of this

I just did a fresh Leap 15.6 installation on qa-power8-3. In my opinion kernel instability can be ruled out, because kerosene-8 has been running flawlessly for more than 5 days.
I discovered some error messages in the boot logs (#177973) that can be a sign of a dying storage medium, and I think that can be the cause of the unstable behaviour of the qa-power8-3 machine: in most of the kernel crash logs of qa-power8 the stack trace points to kernel filesystem APIs for btrfs, and in this case the rootfs is of type btrfs.

Since the crashes were more frequent, I rolled back qa-power8-3 to Leap 15.5. If the machine crashes again because of filesystem or inode issues, I think that can be handled in #177973
