action #157975

open

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

Upgrade osd workers to openSUSE Leap 15.6

Added by okurz 4 months ago. Updated 16 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Organisational
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.5 and have a consistent environment

Acceptance criteria

  • AC1: all osd worker machines run a clean upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.)

Acceptance tests

  • AT1-1: sudo salt -C 'G@roles:worker and not G@osrelease:15.6' test.ping is empty
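
A minimal sketch of how the "clean" conditions of AC1 could be checked locally on a single worker. It assumes rpm's usual leftover suffixes (.rpmnew, .rpmsave, .rpmorig); the function names find_rpm_leftovers and failed_units are made up for illustration and are not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: list config files that rpm left behind during an upgrade.
# Assumption: leftovers use rpm's usual suffixes .rpmnew, .rpmsave and .rpmorig.
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Hypothetical helper: print failed systemd units, one per line, without header.
failed_units() {
    systemctl list-units --state=failed --plain --no-legend
}
```

On a worker one could call find_rpm_leftovers /etc and failed_units and treat any output as a violation of AC1.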

Suggestions

Rollback steps

  • hostname=worker31.oqa.prg2.suse.org ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
  • ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'systemctl unmask rebootmgr && systemctl enable --now rebootmgr && rebootmgrctl reboot'"

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from, there is the IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker) and there are also other machines that can run jobs while one host might be down for a little bit longer. And okurz can hold your hand :)

Related issues 6 (4 open, 2 closed)

Related to openQA Tests - action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command (Blocked, okurz, 2024-06-13)

Related to openQA Infrastructure - action #162260: auto-update.service fails on various workers due to a package conflict (Resolved, okurz, 2024-06-14)

Copied from openQA Project - action #130588: Upgrade osd workers to openSUSE Leap 15.5 (Resolved, okurz)

Copied to openQA Project - action #162284: Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work (New, 2024-06-14)

Copied to openQA Project - action #162293: SMART errors on bootup of w31+w32, possibly more (New, 2024-06-14)

Copied to openQA Project - action #163472: Upgrade a single osd worker to openSUSE Leap 15.6 (Blocked, okurz, 2024-07-08)

Actions #1

Updated by okurz 4 months ago

  • Copied from action #130588: Upgrade osd workers to openSUSE Leap 15.5 added
Actions #2

Updated by okurz 4 months ago

  • Subject changed from Upgrade osd workers to openSUSE Leap 15.5 to Upgrade osd workers to openSUSE Leap 15.6
  • Description updated (diff)
  • Assignee deleted (okurz)
  • Target version changed from Ready to future
Actions #3

Updated by okurz 3 months ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz about 1 month ago

Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239

Actions #5

Updated by okurz about 1 month ago

  • Related to action #162239: [s390x] test fails in bootloader_start due to slow response from z/VM hypervisor and/or changed response on "cp i cms" command added
Actions #6

Updated by okurz about 1 month ago

  • Assignee set to okurz
  • Target version changed from Tools - Next to Ready
Actions #7

Updated by okurz about 1 month ago

  • Assignee deleted (okurz)
  • Target version changed from Ready to Tools - Next
Actions #8

Updated by okurz about 1 month ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from Tools - Next to Ready

okurz wrote in #note-4:

Multiple workers upgraded themselves automatically to 15.6 as the main repo URL points to http://download.opensuse.org/distribution/openSUSE-current/repo/oss . Maybe that was me doing that? This might have caused #162239

I caused this problem with #132137-6: since then the installer uses openSUSE-current also in the deployed repositories, but that is incompatible with the update repos, which are either hardcoded to a version like 15.5 or use $releasever.

So I called

sudo salt \* cmd.run 'sed -i -e "s@openSUSE-current@leap/\$releasever@" /etc/zypp/repos.d/*'

to correct this. Now we should push through with the upgrade to Leap 15.6 to be consistent.
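
The effect of that substitution can be tried on a throwaway file before running it on real hosts. A minimal sketch, assuming GNU sed; fix_repo_urls is a hypothetical name and the repo section name in the sample file is made up, only the URL is the one quoted in #note-4.

```shell
#!/bin/sh
# Hypothetical helper: apply the substitution from this comment to every file
# in the given directory (on the workers that would be /etc/zypp/repos.d).
# Requires GNU sed for "-i" without a suffix argument.
fix_repo_urls() {
    sed -i -e "s@openSUSE-current@leap/\$releasever@" "$1"/*
}

# Demo on a temporary directory instead of the real repo configuration
demo_dir=$(mktemp -d)
printf '%s\n' '[repo-oss]' \
    'baseurl=http://download.opensuse.org/distribution/openSUSE-current/repo/oss' \
    > "$demo_dir/repo-oss.repo"
fix_repo_urls "$demo_dir"
cat "$demo_dir/repo-oss.repo"
```

The baseurl line then reads .../distribution/leap/$releasever/repo/oss, with $releasever left literal so that zypper expands it at refresh time.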

Actions #9

Updated by okurz about 1 month ago

  • Copied to action #162284: Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work added
Actions #10

Updated by okurz about 1 month ago · Edited

  • Description updated (diff)

I started with w31, where I triggered a reboot. After the reboot it took 20m(!) for the machine to become fully reachable over the network. That caused issues because os-autoinst-openvswitch would time out, leading to incomplete jobs like https://openqa.suse.de/tests/14611797 . I triggered another reboot and then ran

WORKER="worker31" host=openqa.suse.de failed_since=2024-06-14 comment="label:poo157975" ./openqa-advanced-retrigger-jobs

I disabled the openQA worker instances on w31 for now:

sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d . -f 1 | xargs)

and aborted reboot requests and masked rebootmgr for now:

ssh osd "sudo salt -C 'G@roles:worker' cmd.run 'rebootmgrctl cancel && systemctl mask --now rebootmgr'"

and added the corresponding rollback steps.
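
What the command substitution in the systemctl disable call above expands to can be checked without touching any services. A small sketch with made-up sample output of systemctl list-units; worker_units is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read "systemctl list-units" output on stdin and print
# the matching unit names (without the ".service" suffix) on a single line.
worker_units() {
    grep openqa-worker-auto-restart | cut -d . -f 1 | xargs
}

# Made-up sample input, not taken from a real worker
sample='  openqa-worker-auto-restart@1.service loaded active running openQA Worker #1
  openqa-worker-auto-restart@2.service loaded active running openQA Worker #2
  telegraf.service                     loaded active running Telegraf'

printf '%s\n' "$sample" | worker_units
# prints: openqa-worker-auto-restart@1 openqa-worker-auto-restart@2
```

The trailing xargs only joins the names onto one line and trims whitespace, so the result can be passed directly as arguments to systemctl.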

We found that the system is not able to bring up the network completely and also crashes with a kernel panic, so I am reverting to an older kernel version. systemctl status wicked* showed that wicked-nanny is running into dbus-related timeouts.

Booting into older kernel:

# grep 'submenu\|menuentry\>' /boot/grub2/grub.cfg
menuentry "Help on bootable snapshot #$snapshot_num" {
menuentry 'openSUSE Leap 15.6'  --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
submenu 'Advanced options for openSUSE Leap 15.6' --hotkey=1 $menuentry_id_option 'gnulinux-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default' --hotkey=2 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 6.4.0-150600.21-default (recovery mode)' --hotkey=3 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.4.0-150600.21-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default'  --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-advanced-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'openSUSE Leap 15.6, with Linux 5.14.21-150500.55.65-default (recovery mode)' --hotkey=1 --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.14.21-150500.55.65-default-recovery-b985246c-fac6-488c-9917-cb53e1d3fd6d' {
        menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {
worker31:~ # grub2-reboot '1>3'
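
GRUB counts menu entries from zero, and a selector of the form 'A>B' picks entry B inside submenu A. To make the numbering less error-prone, the nested entries can be listed together with their index. A minimal sketch, assuming the usual grub2-mkconfig layout (nested entries indented, titles in single quotes, the submenu closed by a "}" at the start of a line); list_submenu_entries is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the zero-based index and title of every menuentry
# nested inside the first submenu of a grub.cfg.
# Assumptions: nested entries are indented, titles are single-quoted and the
# submenu is closed by a "}" at the start of a line.
list_submenu_entries() {
    awk -F"'" '
        /^submenu /                   { in_sub = 1; n = 0; next }
        in_sub && /^}/                { exit }
        in_sub && /^[ \t]+menuentry / { printf "%d\t%s\n", n++, $2 }
    ' "$1"
}
```

Called as list_submenu_entries /boot/grub2/grub.cfg, it prints one line per nested entry; the number in the first column is what goes after the '>' in the grub2-reboot argument.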
Actions #11

Updated by okurz about 1 month ago

  • Copied to action #162293: SMART errors on bootup of w31+w32, possibly more added
Actions #12

Updated by okurz about 1 month ago

That was not successful: there was no output after the initial kernel loading. I booted into 6.4 again and then called snapper rollback $id with the $id of the snapshot from 2024-06-12, before the upgrade to 15.6 happened. After a reboot the system came up just fine again but struggles with SMART errors, for which I extracted the worker31-specific ticket #162293. I crosschecked with w32 and it behaves the same, so I assume it's a generic problem which affects at least all our happyware servers w29-w40 in the same way, possibly the o3 workers w21-w28 as well. Now doing the corresponding snapper rollback on all of them and patching the repo config as in #157975-8.

Actions #13

Updated by okurz about 1 month ago

Downgraded w29, w30, w31, w32, w33, w34, w35, w40, warm1, warm2. w36-39 are offline as they are currently not used. I then applied

sed -i -e "s@openSUSE-current@leap/\$releasever@" -e "s@15\.5/@\$releasever/@g" /etc/zypp/repos.d/* ; zypper ref && zypper -n dup --dry-run

on all of them. I also checked systemctl status and alerts. It seems we are back to status "green" for now, and all systems currently under salt control are reachable. I recommend we try the upgrade again, first with another machine, e.g. a NUE2- or PRG1-based one. openqaworker14 seems to be a good candidate as it runs only worker classes for which we have redundancy and only qemu. We can also consider w36-w40, related to #139103.
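
Whether any repo file still pins a release after such a substitution can be checked with a simple grep. A sketch under the assumption that pinned releases show up as a path component like /15.5/ or /openSUSE-current/; hardcoded_release_lines is a hypothetical name and the pattern is only a heuristic.

```shell
#!/bin/sh
# Hypothetical helper: print repo file lines whose URL path still contains
# "openSUSE-current" or a literal version such as "15.5".
# The exit status is that of grep: 0 if something was found, 1 if not.
hardcoded_release_lines() {
    grep -HnE '/(openSUSE-current|[0-9]+\.[0-9]+)/' "$1"/*
}
```

On a worker, hardcoded_release_lines /etc/zypp/repos.d should print nothing once all URLs use $releasever.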

Actions #14

Updated by okurz about 1 month ago

  • Related to action #162260: auto-update.service fails on various workers due to a package conflict added
Actions #15

Updated by okurz about 1 month ago

  • Status changed from In Progress to Blocked
  • Target version changed from Ready to Tools - Next

Reproduced the problem on o3 worker w21 in #157972. I guess we can go back to the original plan and do o3 first, so blocking on #157972.

Actions #16

Updated by okurz 16 days ago

  • Target version changed from Tools - Next to future
Actions #17

Updated by okurz 9 days ago

  • Copied to action #163472: Upgrade a single osd worker to openSUSE Leap 15.6 added