action #162296

open

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S

Added by okurz 5 months ago. Updated 4 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-06-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

w31 and w32 upgraded themselves to Leap 15.6 and then crashed multiple times, each time 10-20 minutes after booting into kernel 6.4.

Acceptance criteria

  • AC1: OSD openQA workers run stably with Leap 15.6 (package locks on reported issues allowed)
  • AC2: ssh osd 'sudo salt \* cmd.run "zypper ll | grep \"\(162296\|1227616\)\""' is empty
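The AC2 check can also be run against a saved lock listing. The following is a minimal sketch, assuming the output of `zypper ll` has been captured as text and that the relevant locks mention the ticket or bug number somewhere on their line. The function name `locks_for_ticket` and the sample lock lines are hypothetical and only serve as illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a package-lock listing (e.g. saved `zypper ll`
# output) on stdin and print only the lines that mention this ticket
# (162296) or the bugzilla issue (1227616).
locks_for_ticket() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -E '(162296|1227616)' || true
}

# Hypothetical sample input; real `zypper ll` output will look different.
sample='1 | kernel-default | package | (any) | see 162296
2 | some-other-package | package | (any) | unrelated lock'

remaining=$(printf '%s\n' "$sample" | locks_for_ticket)
if [ -z "$remaining" ]; then
    echo "AC2 met: no locks reference this ticket"
else
    echo "AC2 not met, remaining locks:"
    printf '%s\n' "$remaining"
fi
```

An empty result corresponds to the "is empty" condition in AC2; any printed line is a lock that still has to be removed.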

Suggestions

  • Temporarily upgrade selected machines to Leap 15.6 while keeping the old kernel, or vice versa, installing only kernel 6.4 without the Leap 15.6 upgrade, and try to get the system running stably
  • Optional: Look into the crash files on w31 in /root/crash-2024-06-14/

Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M (Resolved, okurz, 2023-11-04)

Related to openQA Project - action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S (Resolved, gpathak)

Related to openQA Infrastructure - action #163469: Upgrade a single o3 worker to openSUSE Leap 15.6 (Resolved, gpathak, 2024-07-08)

Copied from openQA Infrastructure - action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M (Resolved, nicksinger, 2024-06-14)

Actions #1

Updated by okurz 5 months ago

  • Copied from action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Actions #2

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #3

Updated by livdywan 5 months ago

  • Subject changed from openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 to openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 5 months ago

  • Priority changed from High to Normal
Actions #5

Updated by dheidler 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #6

Updated by openqa_review 5 months ago

  • Due date set to 2024-07-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 5 months ago

What originally happened is that all PRG2 x86_64 workers upgraded themselves to Leap 15.6 automatically but inconsistently. I called snapper rollback on each of them, rebooted, and then ensured that openQA jobs were executed properly afterwards.

Actions #8

Updated by okurz 5 months ago

Unfortunately the dmesg files in /root/crash-*/crash/ are all empty. So I guess the next step should be to select any worker, upgrade it and check. I suggest using w36, which is currently offline.
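A quick way to confirm whether any crash artifacts carry content is to list only the non-empty files. This is a minimal sketch, not the method actually used here; the helper name `nonempty_files` is hypothetical, and the demo runs against a throwaway directory rather than the real crash directories.

```shell
#!/bin/sh
# Hypothetical helper: print every non-empty regular file below directory $1.
# `-size +0` matches files occupying more than zero blocks, i.e. non-empty.
nonempty_files() {
    find "$1" -type f -size +0 2>/dev/null
}

# Self-contained demo on a temporary directory with made-up file names.
demo=$(mktemp -d)
: > "$demo/dmesg.empty"                    # zero-byte file, not listed
echo "example line" > "$demo/dmesg.filled" # non-empty file, listed
nonempty_files "$demo"
rm -rf "$demo"

# On a worker, the same helper could be pointed at the crash directories:
#   for d in /root/crash-*/crash/; do nonempty_files "$d"; done
```

No output for a crash directory would match the observation that the collected dmesg files are empty.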

Actions #9

Updated by okurz 5 months ago

  • Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
Actions #10

Updated by dheidler 4 months ago

Actions #11

Updated by dheidler 4 months ago

  • Status changed from In Progress to Blocked

As we would have to use a 15.6 with both firewalld and kernel-default from 15.5,
I don't see much value in moving to 15.6 for now.

Let's block this ticket on the bugzilla issue.

Actions #12

Updated by okurz 4 months ago

  • Due date deleted (2024-07-23)
Actions #13

Updated by livdywan 4 months ago

Actions #14

Updated by livdywan 3 months ago

livdywan wrote in #note-13:

Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616

No response so far

Still no update (no pun intended)

Actions #15

Updated by okurz 2 months ago

  • Related to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added
Actions #16

Updated by okurz 2 months ago

  • Related to action #163469: Upgrade a single o3 worker to openSUSE Leap 15.6 added
Actions #17

Updated by okurz 2 months ago

  • Related to action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M added
Actions #18

Updated by okurz 2 months ago

  • Related to deleted (action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M)
Actions #19

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #20

Updated by livdywan about 1 month ago

Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616

I pinged Denis in Slack as there's been no response for a while

Actions #21

Updated by okurz 4 days ago

livdywan wrote in #note-20:

Opened https://bugzilla.suse.com/show_bug.cgi?id=1227616

I pinged Denis in Slack as there's been no response for a while

As you didn't link a Slack conversation I have to ask you: Do you remember if there was a response?
