action #99189

coordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3

Upgrade o3 workers to openSUSE Leap 15.3 size:M

Added by okurz 2 months ago. Updated 26 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

  • Need to upgrade workers before EOL of Leap 15.2 and have a consistent environment

Acceptance criteria

  • AC1: all o3 worker machines run a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no leftover .rpmnew files, etc.)
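
AC1 can be checked mechanically on each worker. A minimal sketch, assuming RPM's usual leftover suffixes (.rpmnew, .rpmsave, .rpmorig) and that /etc is the place to look; the function names are made up for illustration and are not part of any openQA tooling:

```shell
# List leftover RPM config files below a directory (default: /etc).
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# List failed systemd units, one per line, without header or footer.
list_failed_units() {
    systemctl --failed --no-legend --plain
}
```

On a cleanly upgraded worker both `find_rpm_leftovers` and `list_failed_units` should print nothing.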

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the workers are only executing a few or no openQA test jobs
  • Keep IPMI interface ready and test that Serial-over-LAN works for potential recovery
  • Use the instructions from above but use transactional-update shell for transactional update workers
  • After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with transactional-update rollback

Further details

  • Don't worry, everything can be repaired :) If by any chance the worker gets misconfigured, there are btrfs snapshots to recover from and the IPMI Serial-over-LAN console; a reinstall is possible and not hard; there is no important data on the host (it's only an openQA worker); and there are other machines that can run jobs while one host is down for a little longer. And okurz can hold your hand :)

Related issues

Copied from openQA Infrastructure - action #73189: Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done (Resolved)

History

#1 Updated by okurz 2 months ago

  • Copied from action #73189: Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done added

#2 Updated by okurz 2 months ago

  • Subject changed from Upgrade o3 workers to openSUSE Leap 15.2 after openqa-aarch64 already done to Upgrade o3 workers to openSUSE Leap 15.3
  • Description updated (diff)
  • Assignee deleted (cdywan)
  • Priority changed from High to Normal

#3 Updated by cdywan 2 months ago

  • Subject changed from Upgrade o3 workers to openSUSE Leap 15.3 to Upgrade o3 workers to openSUSE Leap 15.3 size:M
  • Status changed from New to Workable

#4 Updated by ggardet_arm about 1 month ago

oss-cobbler-03 is a new remote machine for aarch64 (and aarch32) jobs and is already based on Leap 15.3.
It is an Ampere eMag machine with 32 cores, 125G of RAM and 480G of SSD storage. Currently, it runs 7 workers with no problem.

#5 Updated by okurz about 1 month ago

Cool :)

#6 Updated by mkittler about 1 month ago

  • Assignee set to mkittler

#7 Updated by mkittler about 1 month ago

  • Status changed from Workable to Feedback

Not sure how to upgrade the transactional workers. I cannot directly follow the commands on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Distribution-upgrades. If I try to run them in transactional-update shell, the commands don't work because no root certificates exist in that environment. To reproduce, run openqaworker1:~ # transactional-update shell and note that /var/lib/ca-certificates/pem is empty. The problem is also reproducible on openqaworker4 and possibly all other workers which are transactional servers. Note that /etc/ssl/certs is a symlink to /var/lib/ca-certificates/pem.

#8 Updated by mkittler about 1 month ago

I've upgraded power8 and everything seemed to work well. However, it didn't come back and I cannot even reach it via ipmitool -I lanplus -C 3 -H openqaworker-power8-ipmi.suse.de -U ADMIN -P ADMIN sol activate now.
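
For checking the IPMI access before an upgrade, a small wrapper around the ipmitool call can help. This is only a sketch: the function name and the IPMI_USER and IPMI_DRY_RUN variables are made up for illustration. It uses ipmitool's -E option, which reads the password from the IPMI_PASSWORD environment variable instead of passing it on the command line:

```shell
# Run an ipmitool subcommand against a BMC over the lanplus interface.
# The password is read from the IPMI_PASSWORD environment variable (-E).
# With IPMI_DRY_RUN=1 the command line is only printed, not executed.
ipmi() {
    host="$1"; shift
    set -- ipmitool -I lanplus -C 3 -H "$host" -U "${IPMI_USER:-ADMIN}" -E "$@"
    if [ "${IPMI_DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Usage would be e.g. `ipmi openqaworker-power8-ipmi.suse.de chassis power status` to test that the BMC answers at all, followed by `ipmi openqaworker-power8-ipmi.suse.de sol activate` for the serial console.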

I had to create an Infra ticket regarding power8: https://sd.suse.com/servicedesk/customer/portal/1/SD-64563

#9 Updated by mkittler about 1 month ago

@ffogt helped recover power8:

https://bugzilla.suse.com/show_bug.cgi?id=1174166
I used the petitboot environment to chroot into the leap install and replaced kernel-kvmsmall with kernel-default
It's up!
nfs-client was not installed, also probably because of kernel-kvmsmall
I just watched it boot after a power reset, which needs some patience

Apparently the ipmi power commands and sol session worked at some point after all. It now works for me as well. I'll respond in the Infra ticket.

#10 Updated by mkittler about 1 month ago

I've now upgraded power8 and openqaworker7. Both generally work now. So aarch64, openqaworker1, openqaworker4, imagetester and rebel are still remaining.


Note that there's a failing service on openqaworker7 but it has been failing in that way at least since Sep 05 03:46:26 (which is almost as far as the logs go back):

openqaworker7:~ # systemctl status snapper-cleanup.service
● snapper-cleanup.service - Daily Cleanup of Snapper Snapshots
     Loaded: loaded (/usr/lib/systemd/system/snapper-cleanup.service; static)
     Active: failed (Result: exit-code) since Tue 2021-10-26 15:34:21 CEST; 2min 30s ago
TriggeredBy: ● snapper-cleanup.timer
       Docs: man:snapper(8)
             man:snapper-configs(5)
    Process: 23457 ExecStart=/usr/lib/snapper/systemd-helper --cleanup (code=exited, status=1/FAILURE)
   Main PID: 23457 (code=exited, status=1/FAILURE)

Okt 26 15:34:20 openqaworker7 systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running number cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: Deleting snapshot failed.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: number cleanup for 'root' failed.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running timeline cleanup for 'root'.
Okt 26 15:34:20 openqaworker7 systemd-helper[23457]: running empty-pre-post cleanup for 'root'.
Okt 26 15:34:21 openqaworker7 systemd[1]: snapper-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Okt 26 15:34:21 openqaworker7 systemd[1]: snapper-cleanup.service: Failed with result 'exit-code'.

#11 Updated by mkittler about 1 month ago

  • Status changed from Feedback to In Progress

As mentioned by andriinikitin the redirection to https can be avoided via sed -i 's,download.opensuse.org,mirrorcache.opensuse.org,g' /etc/zypp/repos.d/*.repo. That seems to work in the root-certificate-less environment of transactional-update shell. So I'm upgrading the remaining workers now.
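
The same sed replacement can be wrapped so that a backup of each repo file is kept, which makes it easy to switch back later. A rough sketch; the function name is hypothetical and the default path is the one from the command above:

```shell
# Point every *.repo file below a directory (default: /etc/zypp/repos.d)
# to mirrorcache.opensuse.org. sed -i.bak leaves the original content in
# <file>.repo.bak next to each rewritten file.
switch_to_mirrorcache() {
    repo_dir="${1:-/etc/zypp/repos.d}"
    for f in "$repo_dir"/*.repo; do
        [ -e "$f" ] || continue
        sed -i.bak 's,download.opensuse.org,mirrorcache.opensuse.org,g' "$f"
    done
}
```

Calling `switch_to_mirrorcache` without arguments inside transactional-update shell would apply it to the system repos; passing a directory allows trying it on a copy first.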

#12 Updated by mkittler about 1 month ago

The following units failed on openqaworker1 after a reboot under Leap 15.3:

openqaworker1:~ # systemctl --failed
  UNIT                                          LOAD   ACTIVE SUB    DESCRIPTION                                         
● container-openqaworker1_container_102.service loaded failed failed Podman container-openqaworker1_container_102.service
● container-openqaworker1_container_103.service loaded failed failed Podman container-openqaworker1_container_103.service

I suspect these are leftovers from experimenting with a containerized setup so I disabled them for now.

#13 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

All workers (openqaworker1 openqaworker4 openqaworker7 power8 rebel imagetester aarch64) are now running on Leap 15.3. Besides the (unimportant) services mentioned in the comments it looks good so far.

#14 Updated by okurz about 1 month ago

A whole lot of openSUSE Tumbleweed tests have been failing after the upgrade. Dimstar and fvogt narrowed down the problem to a qemu-seabios update. It was reported in https://bugzilla.suse.com/show_bug.cgi?id=1192115

#15 Updated by favogt about 1 month ago

Until the issue is fixed, the qemu-seabios package should be downgraded on the openQA workers for x86, like this:

zypper in --oldpackage https://download.opensuse.org/repositories/openSUSE:/Leap:/15.2:/Update/standard/noarch/qemu-seabios-1.12.1+-lp152.9.20.1.noarch.rpm
zypper al https://download.opensuse.org/repositories/openSUSE:/Leap:/15.2:/Update/standard/noarch/qemu-seabios-1.12.1+-lp152.9.20.1.noarch.rpm

(resp. with transactional-update pkg in and zypper al instead)

I did this on ow7, but on the transactional systems I copied the bios files and used bind mounts instead, to avoid reboots which set back the test states.

I also noticed that os-autoinst uses usb-ehci as controller by default while qemu-xhci has various advantages, and opened a PR to switch to xhci: https://github.com/os-autoinst/os-autoinst/pull/1838. Incidentally, this also appears to work around the bios issue, so the package downgrade could be omitted if qemu-xhci is used.

#17 Updated by mkittler about 1 month ago

I'm aware of the issues and also noted it in #101265#note-9.

Is there anything left to do for o3 workers (since favogt already took care of them)? And maybe we can merge https://github.com/os-autoinst/os-autoinst/pull/1838 despite missing test coverage?

#18 Updated by favogt about 1 month ago

Due to https://bugzilla.opensuse.org/show_bug.cgi?id=1192126, qemu-ovmf-x86_64 had to be downgraded to the Leap 15.2 version as well.

mkittler wrote:

And maybe we can merge https://github.com/os-autoinst/os-autoinst/pull/1838 despite missing test coverage?

In theory we could probably revert the downgrade now that it uses XHCI, but staying on the older seabios for a bit longer won't hurt I'd say.

#19 Updated by mkittler about 1 month ago

I've just checked two transactional workers and both have 1.14.0_0_g155821a-103.2 installed. Not sure about the bind mount but it looks like the workers are already back to normal anyways.

#20 Updated by favogt about 1 month ago

mkittler wrote:

I've just checked two transactional workers and both have 1.14.0_0_g155821a-103.2 installed. Not sure about the bind mount but it looks like the workers are already back to normal anyways.

Looks like there are no zypper locks defined. Either they got deleted somehow or I added them incorrectly without noticing.

#21 Updated by mkittler about 1 month ago

Then I'll leave the systems as they are except for the non-transactional worker openqaworker7 where I removed the lock.

I suppose this ticket can be considered resolved - or were there any other problems?

#22 Updated by mkittler 28 days ago

  • Status changed from Feedback to Resolved

Looks like nothing else came up so I'm closing this issue.

#23 Updated by mkittler 28 days ago

  • Status changed from Resolved to Feedback

The PR fixes only the seabios issue so I'm downgrading the transactional workers now to cover the ovmf issue as well:

transactional-update shell
zypper in --oldpackage https://download.opensuse.org/update/leap/15.2/oss/noarch/qemu-ovmf-x86_64-201911-lp152.6.17.1.noarch.rpm
zypper al qemu-ovmf
exit
reboot

So far I've done this only on openqaworker1 which is currently rebooting.

#24 Updated by mkittler 28 days ago

openqaworker1 is up again and the lock and package are in place. I've also checked the other workers but they all had the package and lock still in place. (Except for aarch64 but I suppose only the x86_64 workers are relevant here. And rebel doesn't have the package installed at all.)

#25 Updated by mkittler 27 days ago

It didn't work because it should have been zypper al qemu-ovmf-x86_64. So I downgraded the package again. Judging by the job's history the grub_test module generally works with the downgrade. I'll check tomorrow again whether the downgraded package survived the nightly update.
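
Since a lock only helps if its name matches the package name exactly, it can be worth verifying the lock after adding it. A minimal sketch, assuming the table format printed by `zypper locks` (lock name in the second '|'-separated column); the function name is made up for illustration:

```shell
# Succeed if a lock with exactly the given name appears in the table
# read from stdin, as printed by `zypper locks`.
lock_listed() {
    awk -F'|' -v name="$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $2) }   # trim the name column
        $2 == name { found = 1 }
        END { exit !found }
    '
}
```

For example, `zypper --quiet locks | lock_listed qemu-ovmf-x86_64 || echo "lock missing"`. The installed version can be checked separately with `rpm -q qemu-ovmf-x86_64`.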

#26 Updated by mkittler 26 days ago

  • Status changed from Feedback to Resolved

The downgrade survived the nightly transactional update. Also further test runs (conducted on different workers) worked.
