action #157975

open

openQA Project (public) - coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

Upgrade osd workers to openSUSE Leap 15.6 size:S

Added by okurz 9 months ago. Updated 9 minutes ago.

Status: In Progress
Priority: High
Assignee:
Category: Organisational
Start date:
Due date: 2024-12-26 (Due in 7 days)
% Done: 0%
Estimated time:
Tags:

Description

Motivation

We need to upgrade the workers before the EOL of Leap 15.5 and to have a consistent environment.

Acceptance criteria

  • AC1: all osd worker machines run a clean upgraded openSUSE Leap 15.6 (no failed systemd services, no left over .rpm-new files, etc.)
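
AC1 could be spot-checked on a single worker with a sketch like the following. It assumes that the "left over .rpm-new files" meant here are the *.rpmnew, *.rpmsave and *.rpmorig files that rpm leaves next to changed config files. The function names are made up for illustration and are not part of any existing tooling.

```shell
#!/bin/bash
# Hypothetical helpers for checking AC1 on one worker; names are illustrative only.

# List units that systemd currently reports as failed (no output means none).
failed_units() {
    systemctl list-units --state=failed --no-legend --plain
}

# List config leftovers written by rpm below a directory (default: /etc).
# Assumption: ".rpm-new" in AC1 refers to rpm's *.rpmnew/*.rpmsave/*.rpmorig files.
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) | LC_ALL=C sort
}
```

If both functions print nothing on a host, that would suggest AC1 holds there, at least for these two checks.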

Acceptance tests

  • AT1-1: sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping is empty
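
To see which hosts still keep AT1-1 from passing, the test.ping output can be reduced to just the minion names. This is only a sketch: the function name is hypothetical, and it relies on salt printing each minion ID on its own line followed by a colon. When no minion matches, salt may print a notice such as "No minions matched the target" instead of nothing at all, so counting minion lines is a bit more robust than checking for literally empty output.

```shell
#!/bin/bash
# Hypothetical helper: reduce `salt ... test.ping` output (read from stdin)
# to the minion names, one per line. Result lines such as "True" are dropped.
pending_minions() {
    sed -n 's/^\([^[:space:]]\{1,\}\):$/\1/p'
}
```

A possible use would be `sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping | pending_minions | wc -l`, which should reach 0 once all workers are upgraded.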

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the workers are only executing a few or no openQA test jobs
  • Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
  • Apply the workaround for #162296, i.e. zypper al -m "boo#1227616" *firewall*
  • Start with non-ppc64le due to #169939
  • After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with snapper rollback
  • Consider also ppc64le but see #169939

Rollback steps

  • hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
  • ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from, there is the IPMI Serial-over-LAN, and a reinstall is possible and not hard. There is no important data on the host (it's only an openQA worker) and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)

Related issues 8 (2 open, 6 closed)

Related to openQA Tests (public) - action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command (Blocked, okurz, 2024-06-13)
Related to openQA Infrastructure (public) - action #162260: auto-update.service fails on various workers due to a package conflict (Resolved, okurz, 2024-06-14)
Related to openQA Project (public) - action #169939: Upgrade Power8 o3 workers to openSUSE Leap 15.6 (New, 2024-11-14)
Related to openQA Infrastructure (public) - action #174319: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3520298#L74 fails with "File './x86_64/glibc-2.38-150600.14.17.2.x86_64.rpm' not found on medium 'http://download.opensuse.org/update/leap/15.5/sle/'" size:S (Resolved, ybonatakis, 2024-12-12)
Copied from openQA Project (public) - action #130588: Upgrade osd workers to openSUSE Leap 15.5 (Resolved, okurz)
Copied to openQA Project (public) - action #162284: Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work size:M (Resolved, mkittler, 2024-06-14)
Copied to openQA Infrastructure (public) - action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M (Resolved, nicksinger, 2024-06-14)
Copied to openQA Project (public) - action #163472: Upgrade a single osd worker to openSUSE Leap 15.6 (Resolved, okurz, 2024-07-08)

Actions #1

Updated by okurz 9 months ago

  • Copied from action #130588: Upgrade osd workers to openSUSE Leap 15.5 added
Actions #2

Updated by okurz 9 months ago

  • Subject changed from Upgrade osd workers to openSUSE Leap 15.5 to Upgrade osd workers to openSUSE Leap 15.6
  • Description updated (diff)
  • Assignee deleted (okurz)
  • Target version changed from Ready to future
Actions #3

Updated by okurz 8 months ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz 6 months ago

Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239

Actions #5

Updated by okurz 6 months ago

  • Related to action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command added
Actions #6

Updated by okurz 6 months ago

  • Assignee set to okurz
  • Target version changed from Tools - Next to Ready
Actions #7

Updated by okurz 6 months ago

  • Assignee deleted (okurz)
  • Target version changed from Ready to Tools - Next
Actions #8

Updated by okurz 6 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from Tools - Next to Ready

okurz wrote in #note-4:

Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239

I caused this problem with #132137-6 because then the installer uses openSUSE-current also in the deployed repositories but that's then incompatible with update repos which are either hardcoded to the version like 15.5 or use $releasever.

So I called

sudo salt \* cmd.run 'sed -i -e "s@openSUSE-current@leap/\$releasever@" /etc/zypp/repos.d/*'

to correct. Now we should push through with the upgrade to Leap 15.6 to be consistent
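
For reference, here is what that substitution does to a single repo file, as a small sketch. The wrapper name is made up; the sed expression is the one from the command above, and the sample URL is the one mentioned in #note-4.

```shell
#!/bin/bash
# Hypothetical wrapper around the sed call above, applied to the files given as arguments.
# "$releasever" is single-quoted on purpose: it must end up literally in the
# repo file so that zypper expands it later.
use_releasever() {
    sed -i -e 's@openSUSE-current@leap/$releasever@' "$@"
}
```

A line like `baseurl=http://download.opensuse.org/distribution/openSUSE-current/repo/oss` then becomes `baseurl=http://download.opensuse.org/distribution/leap/$releasever/repo/oss`, which matches how the update repos are addressed.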

Actions #9

Updated by okurz 6 months ago

  • Copied to action #162284: Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work size:M added
Actions #10

Updated by okurz 6 months ago · Edited

  • Description updated (diff)

I started with w31, where I triggered a reboot. After the reboot it took 20m(!) for the machine to be fully reachable over the network. This caused issues because os-autoinst-openvswitch would time out, leading to incomplete jobs like https://openqa.suse.de/tests/14611797 . I triggered another reboot and then ran

for i in WORKER="worker31"; do host=openqa.suse.de failed_since=2024-06-14 comment="label:poo157975" ./openqa-advanced-retrigger-jobs; done

I disabled the openQA worker instances on w31 for now:

sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs)

and aborted reboot requests and masked rebootmgr for now:

ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'rebootmgrctl cancel && systemctl mask --now rebootmgr'"

and added the corresponding rollback steps.
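
The command substitution used above to disable the worker instances is a bit dense, so here is the same pipeline as a standalone sketch. The function name is hypothetical; it reads `systemctl list-units` output on stdin instead of calling systemctl itself so that it can be tried on sample text.

```shell
#!/bin/bash
# Hypothetical helper mirroring the pipeline above: keep the lines of the
# openqa-worker-auto-restart units, cut everything from the first "." on
# (which drops ".service" and the status columns), and let xargs join the
# remaining names on one line with the surrounding whitespace removed.
worker_unit_names() {
    grep openqa-worker-auto-restart | cut -d . -f 1 | xargs
}
```

The result is a space-separated list such as `openqa-worker-auto-restart@1 openqa-worker-auto-restart@2`, which systemctl accepts because unit names without a suffix are treated as services.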

We found that the system is not able to bring up the network completely and also crashes with a kernel panic, so I am reverting to an older kernel version. systemctl status wicked* showed that wicked-nanny is running into dbus-related timeouts.

Booting into older kernel:

# grep 'submenu\|menuentry\>' /boot/grub2/grub.cfg
 menuentry "Help on bootable snapshot #$snapshot_num" {
menuentry 'openSUSE Leap 15.6'  --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
submenu 'Advanced options for openSUSE Leap 15.6' --hotkey=1 $menuentry_id_option 'gnulinux-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default' --hotkey=2 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default (recovery mode)' --hotkey=3 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default'  --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default (recovery mode)' --hotkey=1 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {
worker31:~ # grub2-reboot '1>3'
Actions #11

Updated by okurz 6 months ago

  • Copied to action #162293: SMART errors on bootup of worker31, worker32 and worker34 size:M added
Actions #12

Updated by okurz 6 months ago

That was not successful: there was no output after the initial kernel loading. I booted into 6.4 again and then called snapper rollback $id with the $id from 2024-06-12, before the upgrade to 15.6 happened. After a reboot the system came up just fine again but struggles with SMART errors. For this I extracted a worker31-specific ticket into #162293. I crosschecked with w32 and it behaves the same. So I assume it's a generic problem which affects at least all our happyware servers w29-w40 the same way, possibly the o3 workers w21-w28 as well. I am now doing the corresponding snapper rollback on all of them and patching the repo config as in #157975-8

Actions #13

Updated by okurz 6 months ago

Downgraded w29, w30, w31, w32, w33, w34, w35, w40, warm1, warm2. w36-39 are offline as they are currently not used. And applied

sed -i -e "s@openSUSE-current@leap/\$releasever@" -e "s@15\.5/@\$releasever/@g" /etc/zypp/repos.d/* ; zypper ref && zypper -n dup --dry-run

on all. I also checked systemctl status and alerts. It seems we are back to status "green" again for now. Also, all systems currently under salt control are reachable. I recommend we try the upgrade again, first with another machine, e.g. a NUE2- or PRG1-based one. openqaworker14 seems to be a good candidate as it runs only worker classes for which we have redundancy and only qemu. We can also consider w36-w40, related to #139103

Actions #14

Updated by okurz 6 months ago

  • Related to action #162260: auto-update.service fails on various workers due to a package conflict added
Actions #15

Updated by okurz 6 months ago

  • Status changed from In Progress to Blocked
  • Target version changed from Ready to Tools - Next

Reproduced the problem on o3 worker w21 in #157972. I guess we can go back to the original plan and do o3 first, so blocking on #157972

Actions #16

Updated by okurz 6 months ago

  • Target version changed from Tools - Next to future
Actions #17

Updated by okurz 5 months ago

  • Copied to action #163472: Upgrade a single osd worker to openSUSE Leap 15.6 added
Actions #18

Updated by okurz 2 months ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Description updated (diff)
  • Category changed from Organisational to Organisational
  • Target version changed from future to Ready
Actions #19

Updated by okurz about 1 month ago

We encountered problems with ppc64le workers in #157972 but there is still some time to resolve that first. Blocking on #157972

Actions #20

Updated by okurz 22 days ago

  • Related to action #169939: Upgrade Power8 o3 workers to openSUSE Leap 15.6 added
Actions #21

Updated by okurz 22 days ago

  • Description updated (diff)
  • Status changed from Blocked to New
  • Assignee deleted (okurz)

#157972 is resolved, with the exception of #169939 for ppc.

Actions #22

Updated by mkittler 16 days ago

  • Description updated (diff)
  • Status changed from New to Workable
Actions #23

Updated by mkittler 16 days ago

  • Subject changed from Upgrade osd workers to openSUSE Leap 15.6 to Upgrade osd workers to openSUSE Leap 15.6 size:S
Actions #24

Updated by dheidler 10 days ago

Is the described issue maybe the same as https://bugzilla.suse.com/show_bug.cgi?id=1227616 ? (Note that the bug was reported against 15SP6 but is actually for 15.6.)

Actions #25

Updated by dheidler 10 days ago

Or phrased otherwise: Should we block on https://progress.opensuse.org/issues/162296 and/or escalate this matter as the bug blocks us from updating?

Actions #26

Updated by okurz 10 days ago

dheidler wrote in #note-25:

Or phrased otherwise: Should we block on https://progress.opensuse.org/issues/162296 and/or escalate this matter as the bug blocks us from updating?

From the bug report it seems clear that nobody else reproduced the issue, so if we don't investigate further in detail an improvement is unlikely. That is something to be done in #162296 .
However, here we can still upgrade with the known workarounds.

Actions #27

Updated by ybonatakis 8 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #28

Updated by ybonatakis 8 days ago

iob@openqa:~> sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping
openqaworker17.qa.suse.cz:
True
openqaworker18.qa.suse.cz:
True
worker29.oqa.prg2.suse.org:
True
openqaworker16.qa.suse.cz:
True
worker31.oqa.prg2.suse.org:
True
worker30.oqa.prg2.suse.org:
True
worker35.oqa.prg2.suse.org:
True
qesapworker-prg7.qa.suse.cz:
True
qesapworker-prg4.qa.suse.cz:
True
worker33.oqa.prg2.suse.org:
True
worker40.oqa.prg2.suse.org:
True
worker-arm2.oqa.prg2.suse.org:
True
worker34.oqa.prg2.suse.org:
True
openqaworker14.qa.suse.cz:
True
qesapworker-prg5.qa.suse.cz:
True
qesapworker-prg6.qa.suse.cz:
True
worker-arm1.oqa.prg2.suse.org:
True
worker32.oqa.prg2.suse.org:
True
diesel.qe.nue2.suse.org:
True
petrol.qe.nue2.suse.org:
True
mania.qe.nue2.suse.org:
True
sapworker1.qe.nue2.suse.org:
True
grenache-1.oqa.prg2.suse.org:
True

Actions #29

Updated by openqa_review 7 days ago

  • Due date set to 2024-12-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #30

Updated by ybonatakis 7 days ago

starting with openqaworker17

openqaworker17:/home/iob # cat /etc/os-release 
NAME="openSUSE Leap"
VERSION="15.6"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.6"
PRETTY_NAME="openSUSE Leap 15.6"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.6"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Leap"

The only issue I encountered was a conflict with python3-bind:

error: unpacking of archive failed on file /usr/lib/python3.6/site-packages/isc-2.0-py3.6.egg-info: cpio: File from package already exists as a directory in system
error: python3-bind-9.16.20-150400.3.6.noarch: install failed
error: python3-bind-9.16.50-150500.8.21.1.noarch: erase skipped

I backed up the file and removed it. The upgrade was smooth after that.
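
The comment above does not say exactly how the backup was done, so the following is only one way it could look. The function name and the backup location are hypothetical. The idea is to move the conflicting path aside rather than delete it outright, so it can be restored if the upgrade still fails.

```shell
#!/bin/bash
# Hypothetical helper: move a path that blocks a package upgrade out of the
# way, keeping it under a backup directory instead of deleting it.
move_aside() {
    local path=$1 backup_dir=$2
    mkdir -p "$backup_dir" && mv -- "$path" "$backup_dir/"
}
```

For the conflict above this could be called as `move_aside /usr/lib/python3.6/site-packages/isc-2.0-py3.6.egg-info /root/poo157975-backup`, where the backup directory is just an example name.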

Actions #31

Updated by ybonatakis 7 days ago

This morning I found auto-update.service failing. After a restart it became inactive. I am not sure whether that was caused by the locked firewall packages. Befor

openqaworker17:/home/iob # zypper ll

# | Name                        | Type    | Repository | Comment
--+-----------------------------+---------+------------+--------------------------------------------------------------------
1 | *firewall*                  | package | (any)      | boo#1227616
2 | openSUSE-SLE-15.6-2024-3393 | patch   | (any)      | AUTO-UPDATE CONFLICTING PATCH: patch would conflict with *firewall*
3 | openSUSE-SLE-15.6-2024-3953 | patch   | (any)      | AUTO-UPDATE CONFLICTING PATCH: patch would conflict with *firewall*
4 | x3270                       | package | (any)      | 

openqaworker17:/home/iob # systemctl status auto-update.service
○ auto-update.service - Automatically patch system packages.
     Loaded: loaded (/etc/systemd/system/auto-update.service; static)
     Active: inactive (dead) since Thu 2024-12-12 08:06:19 UTC; 2min 4s ago
   Duration: 6.185s
TriggeredBy: ● auto-update.timer
    Process: 5011 ExecStart=/usr/local/bin/auto-update (code=exited, status=0/SUCCESS)
   Main PID: 5011 (code=exited, status=0/SUCCESS)
        CPU: 6.164s

Dec 12 08:06:19 openqaworker17 auto-update[5134]:   firewall-applet firewall-config firewalld-bash-completion firewalld-prometheus-config firewalld-rpcbind-helper firewalld-test firewalld-zsh-completion firewall-macros keylime-firewalld patch:openSUSE-SLE-15.6-2024-3393 patch:openSUSE-SLE-15.6-2024-3953 plasma5-firewall plasma5-firewall-lang SuSEfirewall2 su>
Dec 12 08:06:19 openqaworker17 auto-update[5134]:  Installed:
Dec 12 08:06:19 openqaworker17 auto-update[5134]:   firewalld firewalld-lang python3-firewall x3270 yast2-firewall
Dec 12 08:06:19 openqaworker17 auto-update[5134]: Nothing to do.
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + [[ 0 == 4 ]]
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + [[ 0 == 102 ]]
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + return 0
Dec 12 08:06:19 openqaworker17 auto-update[5011]: + needs-restarting --reboothint
Dec 12 08:06:19 openqaworker17 systemd[1]: auto-update.service: Deactivated successfully.
Dec 12 08:06:19 openqaworker17 systemd[1]: auto-update.service: Consumed 6.164s CPU time.
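
To get an overview of which patches are being held back because of the firewall lock, the `zypper ll` table can be filtered by its comment column. This is a sketch with a hypothetical function name. It assumes the table layout shown above, with "|" as column separator and the comment in the fifth column, and it presumes that locks carrying the "AUTO-UPDATE CONFLICTING PATCH" comment are the ones added automatically.

```shell
#!/bin/bash
# Hypothetical helper: read `zypper ll` output on stdin and print the names of
# locks whose comment marks them as added for a conflicting patch.
conflicting_patch_locks() {
    awk -F'|' '$1 ~ /^[0-9]+ *$/ && $5 ~ /AUTO-UPDATE CONFLICTING PATCH/ {
        gsub(/^ +| +$/, "", $2)   # trim the padding around the lock name
        print $2
    }'
}
```

With the output above, `zypper ll | conflicting_patch_locks` would list the two openSUSE-SLE-15.6-2024-* patches and skip the manually added *firewall* and x3270 locks.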
Actions #32

Updated by ybonatakis 6 days ago

  • Status changed from In Progress to Workable
Actions #33

Updated by ybonatakis 6 days ago

  • Related to action #174319: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3520298#L74 fails with "File './x86_64/glibc-2.38-150600.14.17.2.x86_64.rpm' not found on medium 'http://download.opensuse.org/update/leap/15.5/sle/'" size:S added
Actions #34

Updated by ybonatakis 3 days ago

openqaworker18 updated. I didn't see any problems.

Actions #35

Updated by okurz 1 day ago

  • Priority changed from Normal to High
Actions #36

Updated by livdywan about 24 hours ago

So openqaworker17 + openqaworker18 are up to date. Further workers are planned to be updated (in parallel) next.

Actions #37

Updated by ybonatakis about 22 hours ago

  • Priority changed from High to Normal

openqaworker16 updated

Actions #38

Updated by ybonatakis about 20 hours ago

openqaworker14 upgraded

Actions #39

Updated by okurz about 20 hours ago

  • Status changed from Workable to In Progress
  • Priority changed from Normal to High

Please keep the prio "High".

Actions #40

Updated by ybonatakis 10 minutes ago

  • Priority changed from High to Normal

worker29 upgraded

Actions #41

Updated by ybonatakis 9 minutes ago

  • Priority changed from Normal to High