action #111866

openQA Project - coordination #111860: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.4

Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4

Added by okurz 3 months ago. Updated 26 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.3 and have a consistent environment

Acceptance criteria

  • AC1: all osd worker machines run a cleanly upgraded openSUSE Leap 15.4 (no failed systemd services, no leftover .rpmnew files, etc.)
  • AC2: openqa-monitor runs openSUSE Leap 15.4
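
The "clean upgrade" checks in AC1 can be scripted. A minimal sketch, run here against a throwaway directory standing in for /etc so it is self-contained; the file names are hypothetical:

```shell
# Hedged sketch: find leftover update artifacts after "zypper dup".
# On a real worker one would scan /etc; a temp dir with hypothetical
# file names stands in here.
root=$(mktemp -d)
touch "$root/ntp.conf" "$root/sshd_config.rpmnew" "$root/sudoers.rpmsave"
leftovers=$(find "$root" -name '*.rpmnew' -o -name '*.rpmsave')
echo "$leftovers"
rm -rf "$root"
```

On the real host one would additionally check for failed units with "systemctl --failed".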

Suggestions

Further details

  • Don't worry, everything can be repaired :) If by any chance a worker gets misconfigured, there are btrfs snapshots to recover from, there is IPMI Serial-over-LAN, a reinstall is possible and not hard, there is no important data on the host (it's only an openQA worker), and there are other machines that can take over jobs while one host is down for a little bit longer. And okurz can hold your hand :)

Related issues

Related to openQA Project - action #111992: Deal with QEMU and OVMF default resolution being 1280x800, affecting (at least) qxl size:M (Blocked, 2022-06-03)

Related to openQA Tests - action #108548: [sle][security][backlog]automation: Integrate 'secure-boot' on Power into openQA (Blocked, 2022-03-17)

Related to openQA Tests - action #114493: [qe-core][aarch64][installation]test fails in bootloader_start, needle mismatch on installer boot memu (Resolved, 2022-07-22)

Copied from openQA Infrastructure - action #99192: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M (Resolved)

Copied to openQA Infrastructure - action #114526: recover openqaworker14 (Resolved)

History

#1 Updated by okurz 3 months ago

  • Copied from action #99192: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M added

#2 Updated by okurz 3 months ago

  • Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4 size:M
  • Description updated (diff)
  • Assignee deleted (cdywan)
  • Priority changed from High to Normal
  • Target version changed from Ready to future

#3 Updated by okurz 3 months ago

  • Project changed from openQA Project to openQA Infrastructure
  • Subject changed from Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4 size:M to Upgrade osd workers and openqa-monitor to openSUSE Leap 15.4

#4 Updated by okurz about 1 month ago

  • Related to action #111992: Deal with QEMU and OVMF default resolution being 1280x800, affecting (at least) qxl size:M added

#5 Updated by okurz about 1 month ago

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from future to Ready

#6 Updated by punkioudi 29 days ago

  • Related to action #108548: [sle][security][backlog]automation: Integrate 'secure-boot' on Power into openQA added

#7 Updated by okurz 29 days ago

punkioudi I wonder: where do you see a relation to #108548? What's your expectation?

#8 Updated by okurz 27 days ago

  • Status changed from Blocked to In Progress

#9 Updated by okurz 27 days ago

  • Due date set to 2022-08-04

Upgrading openqaworker11 and openqaworker12 manually as they are currently not in salt. openqaworker12 is still on Leap 15.2, so I will do a direct 15.2->15.4 upgrade, for fun and because I am curious what happens.

Many repositories have changed their name format from "openSUSE_Leap_$releasever" to "$releasever", so we need to adapt to that.
By now there are more machines controlled by our salt structure, so let's upgrade them along the way as well.
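
The rename can be exercised safely on a copy first. A minimal sketch with a hypothetical repo file; the sed expression is the same one later run on the workers:

```shell
# Hedged sketch: rewrite the old "openSUSE_Leap_$releasever" path component
# to the new plain "$releasever" scheme. Runs against a throwaway copy,
# not the real /etc/zypp/repos.d; the repo file content is illustrative.
repodir=$(mktemp -d)
cat > "$repodir/devel_openQA.repo" <<'EOF'
[devel_openQA]
name=devel:openQA
baseurl=https://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_$releasever/
enabled=1
EOF
sed -i -e 's@/openSUSE_Leap_@/@g' "$repodir"/*.repo
fixed=$(cat "$repodir/devel_openQA.repo")
echo "$fixed"
rm -rf "$repodir"
```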

powerqaworker-qam-1.qa.suse.de has some conflicts with python2 packages; those were manually removed first:

sudo salt --no-color --state-output=changes -C 'powerqaworker-qam-1.qa.suse.de' cmd.run 'zypper -n rm -u python2-libxml2-python'

Then on all the machines:

sudo salt --no-color --state-output=changes -C 'not G@roles:webui' cmd.run '(rpm -q qemu-ovmf-x86_64 && zypper al qemu-ovmf-x86_64) ; zypper rr telegraf-monitoring && sed -i -e "s@/openSUSE_Leap_@/@g" /etc/zypp/repos.d/* && zypper -n --releasever=15.4 ref && zypper -n --releasever=15.4 dup --auto-agree-with-licenses --replacefiles --download-in-advance'

This seems to have gone fine:

$ salt \* grains.get oscodename
storage.qa.suse.de:
    openSUSE Leap 15.4
openqaworker2.suse.de:
    openSUSE Leap 15.4
openqaworker3.suse.de:
    openSUSE Leap 15.4
openqaworker9.suse.de:
    openSUSE Leap 15.4
openqaworker6.suse.de:
    openSUSE Leap 15.4
QA-Power8-5-kvm.qa.suse.de:
    openSUSE Leap 15.4
openqaworker5.suse.de:
    openSUSE Leap 15.4
openqaworker14.qa.suse.cz:
    openSUSE Leap 15.4
powerqaworker-qam-1.qa.suse.de:
    openSUSE Leap 15.4
openqa-monitor.qa.suse.de:
    openSUSE Leap 15.4
QA-Power8-4-kvm.qa.suse.de:
    openSUSE Leap 15.4
openqaworker13.suse.de:
    openSUSE Leap 15.4
grenache-1.qa.suse.de:
    openSUSE Leap 15.4
openqa.suse.de:
    openSUSE Leap 15.4
openqaworker-arm-2.suse.de:
    openSUSE Leap 15.4
openqaworker8.suse.de:
    openSUSE Leap 15.4
backup.qa.suse.de:
    openSUSE Leap 15.4
openqaworker10.suse.de:
    openSUSE Leap 15.4
openqaworker-arm-1.suse.de:
    openSUSE Leap 15.4

All fine except for openqaworker-arm-3, which repeatedly crashed. Need to try harder there. Triggered a reboot for most workers now.
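
With this many minions, eyeballing the grains listing is easy to get wrong; a field-exact scan flags stragglers automatically. A hedged sketch on illustrative captured output (the straggler openqaworker12 here is hypothetical):

```shell
# Hedged sketch: scan captured "salt \* grains.get oscodename" output and
# print every minion whose codename is not Leap 15.4. The sample is
# illustrative, not taken from the live run above.
grains='openqaworker2.suse.de:
    openSUSE Leap 15.4
openqaworker12.suse.de:
    openSUSE Leap 15.2
openqa-monitor.qa.suse.de:
    openSUSE Leap 15.4'
outdated=$(echo "$grains" | awk '/:$/ {host=$1; sub(/:$/, "", host)}
                                 /Leap/ && $NF != "15.4" {print host}')
echo "${outdated:-all on 15.4}"
```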

openqaworker8, openqaworker9 and openqaworker14 have not come up yet; openqaworker8 and 9 are at least in "maintenance" mode with a failed "openqa_nvme_format". The actual failing command is mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=1 --run /dev/nvme0n1 with "mdadm: cannot open /dev/nvme0n1: Device or resource busy". Well, that's understandable: on openqaworker8+9 nvme0n1 has three partitions and also carries the root filesystem, hence it's already "busy". Only nvme0n1p3 should be used here.
The reason seems to be this:

├─nvme0n1p2 259:2    0   100G  0 part /var/tmp
│                                     /var/spool
│                                     /var/opt
│                                     /var/log
│                                     /var/lib/pgsql
│                                     /var/lib/named
│                                     /var/lib/mysql
│                                     /var/lib/mariadb
│                                     /var/lib/mailman
│                                     /var/lib/libvirt/images
│                                     /var/lib/machines
│                                     /var/crash
│                                     /var/cache
│                                     /usr/local
│                                     /tmp
│                                     /srv
│                                     /opt
│                                     /boot/grub2/x86_64-efi
│                                     /boot/grub2/i386-pc
│                                     /.snapshots
│                                     /
└─nvme0n1p3 259:3

Debugging with "bash -ex /usr/local/bin/openqa-establish-nvme-setup" shows the check that trips up:

# lsblk --noheadings | grep -v nvme | grep "/$"
│                                     /

Maybe lsblk has changed its output format. Fixed in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/714 and patched openqaworker8+9 manually.
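
The underlying trap: the script grepped lsblk's human-readable tree, where btrfs subvolume mountpoints continue on lines without a device name, so "grep -v nvme" no longer filtered them out. Asking lsblk only for the fields needed sidesteps the layout entirely. A minimal sketch on illustrative captured output; on a real host one would feed it from "lsblk --noheadings -o NAME,MOUNTPOINT" directly:

```shell
# Hedged sketch: with "lsblk --noheadings -o NAME,MOUNTPOINT" each line is
# "NAME [MOUNTPOINT]", so an exact field match finds the device holding "/"
# without tree decorations. The sample output below is illustrative.
sample='nvme0n1
nvme0n1p1 /boot/efi
nvme0n1p2 /
nvme0n1p3'
root_part=$(echo "$sample" | awk '$2 == "/" {print $1}')
echo "$root_part"
# Only hand the spare partition to mdadm when the disk also carries "/",
# otherwise the whole disk may be used (mirrors the openqaworker8+9 case).
case "$root_part" in
    nvme0n1p*) raid_device=/dev/nvme0n1p3 ;;
    *)         raid_device=/dev/nvme0n1 ;;
esac
echo "$raid_device"
```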

#10 Updated by okurz 26 days ago

  • Related to action #114493: [qe-core][aarch64][installation]test fails in bootloader_start, needle mismatch on installer boot memu added

#11 Updated by okurz 26 days ago

#12 Updated by okurz 26 days ago

#114493 reported; this is also related to #111992, but on aarch64. Trying to install the old package from http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/noarch/?P=*qemu-uefi*

with

sudo zypper -n in --oldpackage http://download.opensuse.org/ports/aarch64/distribution/leap/15.3/repo/oss/noarch/qemu-uefi-aarch64-202008-10.8.1.noarch.rpm && sudo zypper al qemu-uefi-aarch64

That helped. openqaworker14 was suffering from the same lsblk parse problem as the other machines. qa-power8-4+qa-power8-5 might still be problematic.

#13 Updated by okurz 26 days ago

  • Due date deleted (2022-08-04)
  • Status changed from In Progress to Resolved

Upgrade done. Some machines still fail; specific tickets were created to handle them.
