action #162296

open

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S

Added by okurz about 1 month ago. Updated 1 day ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-06-14
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Observed on w31+w32: both upgraded themselves to Leap 15.6 and then crashed multiple times, each time about 10-20 minutes after booting into kernel 6.4.

Acceptance criteria

  • AC1: OSD openQA workers run stably with Leap 15.6 (package locks on reported issues allowed)

Suggestions

  • Temporarily upgrade selected machines to Leap 15.6 while keeping the old kernel, or vice versa (only kernel 6.4), and try to get the system to run stably
  • Optional: Look into the crash files on w31 in /root/crash-2024-06-14/

Related issues 2 (2 open, 0 closed)

Related to openQA Infrastructure - action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M (Blocked, okurz, 2023-11-04)

Copied from openQA Project - action #162293: SMART errors on bootup of w31+w32, possibly more (New, 2024-06-14)

Actions #1

Updated by okurz about 1 month ago

  • Copied from action #162293: SMART errors on bootup of w31+w32, possibly more added
Actions #2

Updated by okurz about 1 month ago

  • Description updated (diff)
Actions #3

Updated by livdywan 27 days ago

  • Subject changed from openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 to openQA workers crash with Linux 6.4 after upgrade openSUSE Leap 15.6 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 27 days ago

  • Priority changed from High to Normal
Actions #5

Updated by dheidler 9 days ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #6

Updated by openqa_review 8 days ago

  • Due date set to 2024-07-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 8 days ago

Originally, all PRG2 x86_64 workers upgraded themselves automatically but inconsistently to Leap 15.6. I called snapper rollback on each of them, rebooted, and then ensured that openQA jobs were properly executed afterwards.

Actions #8

Updated by okurz 8 days ago

Unfortunately the dmesg files in /root/crash-*/crash/ are all empty. So I guess the next step should be to select any worker, upgrade it and check. I suggest using w36, which is currently offline.
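
For repeating that check on another worker, here is a minimal Python sketch that lists dmesg files under the crash directories together with their sizes. Only the /root/crash-*/crash/ layout comes from this ticket; the assumption that the relevant files have names starting with "dmesg", and the helper names find_dmesg_files and non_empty, are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def find_dmesg_files(root: Path) -> Dict[Path, int]:
    """Map each file below <root>/crash-*/crash/ whose name starts with
    'dmesg' (an assumed naming scheme) to its size in bytes."""
    sizes: Dict[Path, int] = {}
    # '**' also matches zero directories, so files directly in crash/ are found.
    for path in sorted(root.glob("crash-*/crash/**/dmesg*")):
        if path.is_file():
            sizes[path] = path.stat().st_size
    return sizes


def non_empty(sizes: Dict[Path, int]) -> List[Path]:
    """Return only the files that actually contain data."""
    return [path for path, size in sizes.items() if size > 0]
```

On a worker this could be called as non_empty(find_dmesg_files(Path("/root"))); an empty result would match the observation above.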

Actions #9

Updated by okurz 8 days ago

  • Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
Actions #10

Updated by dheidler 7 days ago

Actions #11

Updated by dheidler 1 day ago

  • Status changed from In Progress to Blocked

As we would have to run 15.6 with both firewalld and kernel-default from 15.5,
I don't see much value in moving to 15.6 for now.

Let's block this ticket on the Bugzilla issue.
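
If the package-lock route allowed by AC1 were taken anyway, it could look roughly like the following Python sketch. It only builds the zypper addlock commands for the two packages named above and runs nothing. It assumes the 15.5 versions of those packages are already installed, since a lock only holds the installed version in place and does not downgrade anything. The helper names lock_commands and render are hypothetical.

```python
import shlex
from typing import List

# Packages named in the comment above as having to stay at their 15.5 versions.
PACKAGES_TO_HOLD = ["firewalld", "kernel-default"]


def lock_commands(packages: List[str]) -> List[List[str]]:
    """Build one 'zypper addlock' invocation per package without executing it."""
    return [["zypper", "addlock", package] for package in packages]


def render(commands: List[List[str]]) -> str:
    """Render the commands as shell lines for review before running them by hand."""
    return "\n".join(shlex.join(command) for command in commands)


if __name__ == "__main__":
    print(render(lock_commands(PACKAGES_TO_HOLD)))
```

Printing the commands instead of executing them keeps the decision to lock with whoever reviews the output.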

Actions #12

Updated by okurz 1 day ago

  • Due date deleted (2024-07-23)