action #130585

closed

coordination #130582: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.5

Upgrade o3 workers to openSUSE Leap 15.5

Added by okurz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Organisational
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.4 and have a consistent environment

Acceptance criteria

  • AC1: all o3 worker machines run a cleanly upgraded openSUSE Leap 15.5 (no failed systemd services, no leftover .rpmnew files, etc.)
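
AC1 can be checked mechanically on each worker. The following is a minimal sketch, not the procedure used in this ticket: the function names are hypothetical, and it assumes rpm's usual leftover suffixes (.rpmnew, .rpmsave, .rpmorig) and searches below /etc unless told otherwise.

```shell
#!/bin/sh
# Hypothetical helpers for checking AC1 on a single host.

# List leftover rpm config files below a directory (default: /etc).
list_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) 2>/dev/null
}

# List failed systemd units, one per line; empty output means none failed.
list_failed_units() {
    systemctl --failed --no-legend --plain | awk '{print $1}'
}

# Succeed only if both checks come back empty.
host_is_clean() {
    [ -z "$(list_failed_units)" ] && [ -z "$(list_rpm_leftovers "${1:-/etc}")" ]
}
```

Run on a worker, `host_is_clean && echo clean` would print "clean" only when no unit has failed and no leftover file exists; the two list functions show what still needs attention otherwise.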

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the workers are only executing a few or no openQA test jobs
  • Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
  • Use the instructions from above but use transactional-update shell for transactional update workers
  • After upgrade reboot and check everything working as expected, if not rollback, e.g. with transactional-update rollback
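
The difference between regular and transactional-update workers in the suggestions above can be sketched as a small planner that only prints the steps instead of running them. This is an illustration under assumptions: the function name and the "regular"/"transactional" labels are hypothetical, and the zypper options are taken from the upgrade command used later in this ticket.

```shell
#!/bin/sh
# Print (do not run) the upgrade steps for one worker.
# $1: target release, e.g. 15.5
# $2: "transactional" or "regular" (hypothetical labels, default: regular)
upgrade_plan() {
    plan_new=$1
    plan_kind=${2:-regular}
    plan_ref="zypper --releasever=$plan_new -n --gpg-auto-import-keys ref"
    plan_dup="zypper --releasever=$plan_new -n --no-refresh dup --auto-agree-with-licenses --replacefiles --download-in-advance"
    if [ "$plan_kind" = transactional ]; then
        # On transactional hosts the zypper calls run inside a new snapshot.
        echo "transactional-update shell"
        echo "  $plan_ref"
        echo "  $plan_dup"
        echo "  exit"
        echo "reboot"
        echo "# only if checks after the reboot fail:"
        echo "transactional-update rollback"
    else
        echo "$plan_ref"
        echo "$plan_dup"
        echo "reboot"
    fi
}
```

For example, `upgrade_plan 15.5 transactional` prints the steps to follow on a transactional worker, which can be reviewed before anything is executed.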

Further details

  • Don't worry, everything can be repaired :) If by any chance a worker gets misconfigured, there are btrfs snapshots to recover from, there is IPMI Serial-over-LAN, and a reinstall is possible and not hard. There is no important data on the host (it's only an openQA worker) and there are other machines that can run jobs while one host is down for a little longer. And okurz can hold your hand :)

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M (Resolved, dheidler, 2023-06-29)
Copied from openQA Infrastructure (public) - action #111863: Upgrade o3 workers to openSUSE Leap 15.4 size:M (Resolved, tinita)
Copied to openQA Project (public) - action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S (Resolved, gpathak)

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #111863: Upgrade o3 workers to openSUSE Leap 15.4 size:M added
Actions #2

Updated by okurz over 1 year ago

  • Subject changed from Upgrade o3 workers to openSUSE Leap 15.4 size:M to Upgrade o3 workers to openSUSE Leap 15.5
  • Category set to Organisational
  • Assignee deleted (tinita)
  • Target version changed from Ready to future
Actions #3

Updated by okurz about 1 year ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz about 1 year ago

hosts="aarch64 openqaworker4 openqaworker6 openqaworker7 openqaworker19 openqaworker20 openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 qa-power8-3 rebel"
for i in $hosts; do
  echo $i && ssh -t root@$i 'grep -q VERSION=.*15.4 /etc/os-release && zypper -n --no-refresh in screen; old=15.4; new=15.5; screen -L -S upgrade sh -c "new=$new; zypper -n up --auto-agree-with-licenses --replacefiles && zypper --releasever=$new -n --gpg-auto-import-keys ref && zypper --releasever=$new -n --no-refresh dup --auto-agree-with-licenses --replacefiles --download-in-advance $interactive"'
done

While running this I found inconsistent repositories on openqaworker4, like "devel:openQA" plus "devel_openQA". I tried to rectify this situation, but that broke more things overnight:

https://suse.slack.com/archives/C02CANHLANP/p1694842897320189

(Dominique Leuenberger) Good morning! ow4 seems to have trouble testing anything. All jobs end up incomplete with this error: Reason: backend died: Can't locate object method "spew" via package "Mojo::File" at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 412.
In case anybody catches a minute, a fix would be appreciated
(Oliver Kurz) on it […] sorry, was my fault. I was preparing workers for the Leap 15.5 upgrade and found duplicate repos on w4. I removed those, and that apparently caused different repo priority combinations, as the existing non-duplicate ones were not correctly configured.

worker=openqaworker4 openqa-advanced-retrigger-jobs
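
Duplicate repositories like the ones on openqaworker4 could be spotted before touching anything. The sketch below is an assumption-laden illustration, not what was run here: it assumes the duplicates share the same baseurl under different aliases, that the repo files live in zypper's default /etc/zypp/repos.d, and the function name is hypothetical.

```shell
#!/bin/sh
# Print every baseurl that is used by more than one repo definition.
# $1: repo directory (default: /etc/zypp/repos.d)
duplicate_repo_urls() {
    cat "${1:-/etc/zypp/repos.d}"/*.repo 2>/dev/null \
        | sed -n 's/^baseurl=//p' \
        | sed 's:/*$::' \
        | sort | uniq -d
}
```

Trailing slashes are stripped before comparing so that two definitions differing only in a final "/" still count as duplicates. Empty output means no baseurl is defined twice; it says nothing about repo priorities, which were the actual cause of the breakage above.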

I re-executed the upgrade command above multiple times and now all machines are on Leap 15.5, with the exception of aarch64 and qa-power8-3, which are both currently in transition and waiting for #134864 and #132140 respectively.

To ensure a consistent updated state for all machines I checked

hosts="aarch64 openqaworker4 openqaworker6 openqaworker7 openqaworker19 openqaworker20 openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 qa-power8-3 rebel"
for i in $hosts; do
  echo $i && ssh -t root@$i 'systemctl is-enabled openqa-{auto,continuous}-update.timer'
done
aarch64
ssh: connect to host aarch64 port 22: No route to host
openqaworker4
enabled
enabled
Connection to openqaworker4 closed.
openqaworker6
enabled
enabled
Connection to openqaworker6 closed.
openqaworker7
enabled
enabled
Connection to openqaworker7 closed.
openqaworker19
enabled
enabled
Connection to openqaworker19 closed.
openqaworker20
enabled
enabled
Connection to openqaworker20 closed.
openqaworker21
enabled
enabled
Connection to openqaworker21 closed.
openqaworker22
enabled
enabled
Connection to openqaworker22 closed.
openqaworker23
enabled
enabled
Connection to openqaworker23 closed.
openqaworker24
enabled
enabled
Connection to openqaworker24 closed.
openqaworker25
disabled
disabled
Connection to openqaworker25 closed.
openqaworker26
Failed to get unit file state for openqa-auto-update.timer: No such file or directory
Connection to openqaworker26 closed.
openqaworker27
Failed to get unit file state for openqa-auto-update.timer: No such file or directory
Connection to openqaworker27 closed.
openqaworker28
enabled
enabled
Connection to openqaworker28 closed.
qa-power8-3
ssh: connect to host qa-power8-3 port 22: No route to host
rebel
enabled
enabled
Connection to rebel closed.

So w25 has the timers disabled, and on w26+w27 they are not installed at all, which matches our plan to use those as bare-metal test machines. This ticket is now also blocked on #132134.
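
The long output above boils down to one state per host. A minimal sketch of such a summary follows, assuming the message texts seen in the output above; the function names and the state labels are hypothetical.

```shell
#!/bin/sh
# Map the combined output of the timer check to a single word per host.
classify_timers() {
    case "$1" in
        "ssh: "*)                      echo unreachable ;;
        *"No such file or directory"*) echo missing ;;
        *disabled*)                    echo disabled ;;
        *enabled*)                     echo enabled ;;
        *)                             echo unknown ;;
    esac
}

# Query each host given as argument and print "host<TAB>state".
report_timers() {
    for host in "$@"; do
        out=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" \
            'systemctl is-enabled openqa-auto-update.timer openqa-continuous-update.timer' \
            2>&1 </dev/null) || true
        printf '%s\t%s\n' "$host" "$(classify_timers "$out")"
    done
}
```

With the output from this comment, `report_timers $hosts` would be expected to report openqaworker25 as disabled, openqaworker26 and openqaworker27 as missing, and aarch64 and qa-power8-3 as unreachable.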

Actions #5

Updated by okurz about 1 year ago

  • Related to action #132134: Setup new PRG2 multi-machine openQA worker for o3 size:M added
Actions #8

Updated by okurz about 1 year ago

  • Tags set to infra
  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #9

Updated by okurz about 1 year ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

#132134 resolved, unblocked

Actions #10

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready
Actions #11

Updated by okurz about 1 year ago

  • Target version changed from Ready to Tools - Next
Actions #12

Updated by okurz about 1 year ago

  • Status changed from New to Resolved
  • Assignee set to okurz
  • Target version changed from Tools - Next to Ready
okurz@new-ariel:~> for i in $hosts; do echo $i && ssh root@$i "grep VERSION /etc/os-release"; done
aarch64-o3
ssh: Could not resolve hostname aarch64-o3: Name or service not known
kerosene
ssh: Could not resolve hostname kerosene: Name or service not known
openqaworker20
ssh: connect to host openqaworker20 port 22: No route to host
openqaworker21
VERSION="15.5"
VERSION_ID="15.5"
openqaworker22
VERSION="15.5"
VERSION_ID="15.5"
openqaworker23
VERSION="15.5"
VERSION_ID="15.5"
openqaworker24
VERSION="15.5"
VERSION_ID="15.5"
openqaworker25
VERSION="15.5"
VERSION_ID="15.5"
openqaworker26
VERSION="15.5"
VERSION_ID="15.5"
openqaworker27
VERSION="15.5"
VERSION_ID="15.5"
openqaworker28
VERSION="15.5"
VERSION_ID="15.5"
openqaworker-arm21
ssh: connect to host openqaworker-arm21 port 22: No route to host
openqaworker-arm22
VERSION="15.5"
VERSION_ID="15.5"
qa-power8-3
ssh: connect to host qa-power8-3 port 22: No route to host
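
Instead of reading the grep output by eye, the release can be compared against the expected value. This is a small sketch with hypothetical function names; it relies only on os-release being a file of shell-compatible variable assignments.

```shell
#!/bin/sh
# Print VERSION_ID from an os-release file (default: /etc/os-release).
# The file is sourced in a subshell so the caller's variables stay untouched.
# Pass an absolute path, since "." searches PATH for names without a slash.
os_version_id() {
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$VERSION_ID" )
}

# Succeed if the file reports the expected release.
# $1: expected release, e.g. 15.5; $2: os-release file (optional)
is_release() {
    [ "$(os_version_id "${2:-/etc/os-release}")" = "$1" ]
}
```

On a worker, `is_release 15.5 || echo "not upgraded yet"` would flag a host that still reports another release.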
Actions #13

Updated by okurz 9 months ago

  • Copied to action #157972: Upgrade o3 workers to openSUSE Leap 15.6 size:S added