action #157996

closed

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6

Added by okurz 9 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Organisational
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

  • Need to upgrade machines before EOL of Leap 15.5 and have a consistent environment

Acceptance criteria

  • AC1: all LSG QE salt controlled machines run a clean upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.) except for OSD workers

Acceptance tests

  • AT1-1: sudo salt -C 'not G@roles:worker and not G@roles:webui' grains.get oscodename | grep -B1 'Leap 15.5' is empty

Suggestions

  • read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
  • Reserve some time when the related services are not heavily relied upon
  • Keep the IPMI interface ready and test that Serial-over-LAN works for potential recovery, or virt-manager access in the case of virtual machines
  • After the upgrade, reboot and check that everything works as expected; if not, roll back, e.g. with snapper rollback
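
The post-upgrade check mentioned in AC1 and in the last suggestion could be sketched roughly as below. This is only a sketch: the function names `leftover_rpm_files` and `failed_units` are made up for illustration, and it assumes that leftover config files carry the usual rpm suffixes `.rpmnew`, `.rpmsave` or `.rpmorig`.

```shell
#!/bin/sh
# Hypothetical helper: list leftover rpm config files below a directory.
# rpm leaves *.rpmnew, *.rpmsave or *.rpmorig next to changed config files.
leftover_rpm_files() {
    dir="${1:-/etc}"
    find "$dir" -type f \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) 2>/dev/null
}

# Hypothetical helper: print the names of failed systemd units, one per line.
# Empty output means no unit is in the failed state.
failed_units() {
    systemctl --failed --no-legend --plain 2>/dev/null | awk '{print $1}'
}

# Intended use on an upgraded host after the reboot:
#   leftover_rpm_files /etc
#   failed_units
```

Both functions print nothing on a clean host, so they can be combined with `salt ... cmd.run` in the same way as the acceptance test above.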

Rollback actions

Further details

  • Don't worry, everything can be repaired :) If by any chance a machine gets misconfigured, in many cases there are btrfs snapshots to recover from, the IPMI Serial-over-LAN, etc.

Related issues 4 (1 open, 3 closed)

Copied from openQA Project (public) - action #130648: Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.5 (Resolved, okurz, 2023-06-09)

Copied to openQA Project (public) - action #160089: Handle uncommented package lock on "kernel-default" and "kernel-default-base" on openqa-piworker (Resolved, jbaier_cz, 2024-05-08)

Copied to openQA Project (public) - action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M (Workable, 2024-05-08)

Copied to openQA Infrastructure (public) - action #160098: After the upgrade to Leap 15.6 osiris showed no proper mount points again for libvirt VMs size:S (Resolved, nicksinger, 2024-05-08)

Actions #1

Updated by okurz 9 months ago

  • Copied from action #130648: Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.5 added
Actions #2

Updated by okurz 9 months ago

  • Subject changed from Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.5 to Upgrade all other LSG QE salt controlled machines to openSUSE Leap 15.6
  • Description updated (diff)
  • Assignee deleted (okurz)
  • Target version changed from Ready to future
  • Start date deleted (2023-06-09)
Actions #3

Updated by okurz 8 months ago

  • Target version changed from future to Ready
Actions #4

Updated by okurz 8 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #5

Updated by okurz 8 months ago

  • Copied to action #160089: Handle uncommented package lock on "kernel-default" and "kernel-default-base" on openqa-piworker added
Actions #6

Updated by okurz 8 months ago

  • Description updated (diff)

sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui and not G@osrelease:15.6' cmd.run 'zypper ll' shows that openqa-piworker has package locks on kernel-default and kernel-default-base, with no comment explaining why:

openqa-piworker.qe.nue2.suse.org:

    # | Name                | Type    | Repository | Comment
    --+---------------------+---------+------------+--------
    1 | kernel-default      | package | (any)      |
    2 | kernel-default-base | package | (any)      |

asked in https://suse.slack.com/archives/C02AJ1E568M/p1715189271425109

I will leave that in place for now but upgrade all nevertheless.
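
To spot such unexplained locks across many hosts, the `zypper ll` table could be filtered for rows with an empty Comment column. A minimal sketch, assuming the five-column table layout shown above; the function name `locks_without_comment` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `zypper ll` table output on stdin and print the
# names of all locks whose Comment column is empty.
locks_without_comment() {
    awk -F'|' '
        # Data rows start with the lock number in the first column; the
        # header row ("#") and the separator row do not match.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            name = $2; comment = $5
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", comment)
            if (comment == "") print name
        }'
}

# Intended use on a host:
#   zypper ll | locks_without_comment
```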

sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui and not G@osrelease:15.6' cmd.run 'export new_version=15.6; zypper --releasever=$new_version ref && systemctl stop openqa-continuous-update.timer && zypper -n --releasever=$new_version dup --dry-run --auto-agree-with-licenses --replacefiles --download-in-advance'

http://download.opensuse.org/repositories/devel:/openQA:/monitoring/ is missing 15.6 as well as http://download.opensuse.org/repositories/home:/MMoese:/baremetal_support/

Enabled 15.6 in https://build.opensuse.org/projects/devel:openQA:monitoring/meta and removed EOL 15.4

sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui and not G@osrelease:15.6' cmd.run 'export new_version=15.6; zypper --releasever=$new_version ref && systemctl stop openqa-continuous-update.timer && zypper -n --releasever=$new_version dup --auto-agree-with-licenses --replacefiles --download-in-advance && reboot'
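
The dry run and the real upgrade above differ only in `--dry-run` and the trailing `reboot`. To avoid the two command lines drifting apart, they could be generated from one place. A sketch only; `build_dup_command` and its `dry`/`real` modes are names invented here, not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the per-host upgrade command line.
#   $1: target release, e.g. 15.6
#   $2: "dry" (default) adds --dry-run and omits the reboot,
#       anything else produces the real upgrade followed by a reboot.
build_dup_command() {
    new_version="$1"
    mode="${2:-dry}"
    dup_opts="--auto-agree-with-licenses --replacefiles --download-in-advance"
    cmd="zypper --releasever=$new_version ref"
    cmd="$cmd && systemctl stop openqa-continuous-update.timer"
    if [ "$mode" = "dry" ]; then
        cmd="$cmd && zypper -n --releasever=$new_version dup --dry-run $dup_opts"
    else
        cmd="$cmd && zypper -n --releasever=$new_version dup $dup_opts && reboot"
    fi
    printf '%s\n' "$cmd"
}

# Intended use with the same salt target as above:
#   sudo salt --no-color -C '<target>' cmd.run "$(build_dup_command 15.6 dry)"
```

As in the original calls, the dry run still stops openqa-continuous-update.timer, so this sketch does not change behaviour, it only removes the duplication.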

baremetal-support and backup-qam needed special handling. On backup-qam the upgrade would have pulled in multiple additional packages, hence I ran

zypper rm -u -t pattern x11 SDK-C-C++ kvm_server gnome-basic Basis-Devel

and then successfully conducted the upgrade.

jenkins is still running the zypper calls for the upgrade but is very slow to respond. It might be necessary to abort the upgrade, roll back to an older snapshot, reboot and try again.

backup-vm seems to have aborted the upgrade half-way and is now stuck with an unusable zypper:

error while loading shared libraries: libabsl_log_internal_check_op.so.2308.0.0: cannot open shared object file: No such file or directory

/var/log/salt/minion says

(314/800) Installing: libharfbuzz0-8.3.0-150600.1.2.x86_64 [....done]
(315/800) Installing: libgobject-2_0-0-2.78.3-150600.2.1.x86_64 [...done]
(316/800) Removing pkexec-121-150500.1.6.x86_64 [..
error: package pkexec-121-150500.1.6.x86_64 is not installed
error]
Removal of (109119)pkexec-121-150500.1.6.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Abort, retry, ignore? [a/r/i] (a): a
Warning: %posttrans scripts skipped while aborting:
    login_defs-4.8.1-150600.15.44.noarch
    systemd-presets-common-SUSE-15-150600.25.2.noarch
    systemd-presets-branding-openSUSE-12.2-lp156.6.1.noarch
    openssl-1_1-1.1.1w-150600.2.16.x86_64
    openSUSE-release-15.6-lp156.404.1.x86_64
    systemd-254.10-150600.1.9.x86_64
    suse-module-tools-15.6.7-150600.1.24.x86_64
    kmod-29-150600.11.3.x86_64
    udev-254.10-150600.1.9.x86_64
    systemd-network-254.10-150600.1.9.x86_64
    systemd-coredump-254.10-150600.1.9.x86_64
    mdadm-4.3-150600.1.25.x86_64
    dracut-059+suse.515.g83296e6f-150600.1.14.x86_64
    adobe-sourcesanspro-fonts-2.045-bp156.3.1.noarch
    sg3_utils-1.48+10.1532339-150600.1.2.x86_64
    shadow-4.8.1-150600.15.44.x86_64
    open-iscsi-2.1.9-150600.49.7.x86_64
    haveged-1.9.14-150600.9.4.x86_64

so reverting to an older snapshot, rebooting, trying again.

On qamaster, trying to start the jenkins VM again fails with:

Error starting domain: internal error: process exited while connecting to monitor: /usr/bin/qemu-kvm: error while loading shared libraries: libxenctrl.so.4.17: cannot open shared object file: No such file or directory
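
Both this failure and the broken zypper on backup-vm come down to binaries that can no longer resolve a shared library after a partial upgrade. Before rebooting or restarting VMs, the critical binaries could be checked with `ldd`. A minimal sketch; the function name `missing_libs` is made up, and it relies on `ldd` printing `=> not found` for unresolved libraries.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print the names of
# shared libraries that could not be resolved.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Intended use on a host:
#   ldd "$(command -v zypper)" | missing_libs
#   ldd /usr/bin/qemu-kvm | missing_libs
```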
Actions #7

Updated by okurz 8 months ago

I found that the DHCP entry for qamaster-sp is wrong. Fixed in https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/5085

The following change seems problematic:

The following 6 packages are going to change vendor:
  perl-DBD-Pg                                 
    3.18.0-lp155.2.2 -> 3.10.4-150600.12.2            x86_64  openSUSE-Leap-15.6-Oss  obs://build.opensuse.org/devel:openQA -> SUSE LLC <https://www.suse.com/>
  perl-DBD-SQLite                             
    1.740.0-lp155.4.1 -> 1.66-150300.3.9.1            x86_64  openSUSE-Leap-15.6-Oss  obs://build.opensuse.org/devel:openQA -> SUSE LLC <https://www.suse.com/>
  perl-DBI                                    
    1.643-lp155.4.1 -> 1.642-3.9.1                    x86_64  openSUSE-Leap-15.6-Oss  obs://build.opensuse.org/devel:openQA -> SUSE LLC <https://www.suse.com/>
  perl-Mojo-SQLite                            
    3.009-lp155.2.1 -> 3.006-bp156.3.1                noarch  openSUSE-Leap-15.6-Oss  obs://build.opensuse.org/devel:openQA -> openSUSE
  perl-Mojolicious                            
    9.360.0-lp155.2.1 -> 9.350.0-bp156.1.1            noarch  openSUSE-Leap-15.6-Oss  obs://build.opensuse.org/devel:openQA -> openSUSE
  perl-Perl-Tidy                              
    20240202.0.0-lp155.3.1 -> 20230912.0.0-bp156.1.1  noarch  openSUSE-Leap-15.6-Oss  obs://build.opensuse.org/devel:openQA -> openSUSE

I guess for all those we want to have an updated package available in devel:openQA:Leap:15.6. Did

for i in perl-DBD-Pg perl-DBD-SQLite perl-DBI perl-Mojo-SQLite perl-Mojolicious perl-Perl-Tidy; do osc linkpac devel:openQA:Leap:15.5 $i devel:openQA:Leap:15.6; done

and in each new package link added a comment pointing to the corresponding comment from the Leap 15.5 project.

Now https://build.opensuse.org/project/show/devel:openQA:Leap:15.6 has everything that is necessary, at least for jenkins.qe.nue2.suse.org, which IIRC uses openQA-client. With that, jenkins.qe.nue2.suse.org no longer proposes any vendor change for openQA related packages.
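
The package list for the `osc linkpac` loop above was typed by hand; it could also be extracted from the vendor-change report. A sketch under the assumption that the report has the layout quoted above (package name alone on one line, version details on the next); `vendor_change_packages` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read zypper output on stdin and print the package
# names listed in the "going to change vendor" section.
vendor_change_packages() {
    awk '
        /packages? (is|are) going to change vendor:/ { in_block = 1; next }
        # An empty line ends the section.
        in_block && NF == 0 { in_block = 0 }
        # Within the section, package names are the only single-word lines.
        in_block && NF == 1 { print $1 }
    '
}

# Intended use:
#   zypper -n --releasever=15.6 dup --dry-run ... | vendor_change_packages
```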

Actions #8

Updated by okurz 8 months ago

  • Copied to action #160095: Upgraded Leap 15.6 workers able to run s390x tests after #162683 size:M added
Actions #9

Updated by okurz 8 months ago

  • Copied to action #160098: After the upgrade to Leap 15.6 osiris showed no proper mount points again for libvirt VMs size:S added
Actions #10

Updated by okurz 8 months ago

  • Due date set to 2024-05-22
  • Status changed from In Progress to Feedback

While monitoring https://openqa.suse.de/tests?resultfilter=Failed&resultfilter=Incomplete I found multiple related failures due to incomplete upgrades or missing reboots on hypervisor hosts, so I called

failed_since="2024-05-08 17:00Z" result="result='failed'" host=openqa.suse.de comment="label:poo157996" openqa-advanced-retrigger-jobs

On s390zl13 I see

May 08 22:43:40 s390zl13 virtqemud[22259]: unsupported configuration: machine type 's390-ccw-virtio-8.2' does not support ACPI

This is probably due to https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshVirtsh.pm#L114 where os-autoinst adds such an element to the VM config; s390x VMs likely no longer support ACPI, which now causes the problem. I called snapper rollback … with a snapshot from before the Leap 15.6 upgrade on both s390zl12 and s390zl13 and will retrigger the corresponding failed tests.

Reported #160095 for the s390 problem and #160098 for a problem with drbd+libvirt on osiris.

Now

sudo salt --no-color -C 'not G@roles:worker and not G@roles:webui' cmd.run 'grep VERSION_ID /etc/os-release ; uptime' queue=True

shows that all relevant hosts except s390zl12+13 are upgraded

s390zl12.oqa.prg2.suse.org:
    VERSION_ID="15.5"
     23:20:38  up   0:25,  0 users,  load average: 0.00, 0.00, 0.00
ada.qe.prg2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   0:36,  0 users,  load average: 0.16, 0.24, 0.37
osiris-1.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:28,  0 users,  load average: 0.00, 0.17, 0.22
storage.qe.prg2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   0:35,  0 users,  load average: 0.00, 0.01, 0.06
s390zl13.oqa.prg2.suse.org:
    VERSION_ID="15.5"
     23:20:38  up   0:25,  40 users,  load average: 37.61, 34.77, 26.62
schort-server.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:25,  0 users,  load average: 0.03, 0.02, 0.05
jenkins.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   0:38,  0 users,  load average: 0.00, 0.03, 0.11
backup-vm.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:26,  0 users,  load average: 0.33, 0.08, 0.03
baremetal-support.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:25,  0 users,  load average: 0.00, 0.00, 0.00
tumblesle.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:25,  0 users,  load average: 0.08, 0.04, 0.18
monitor.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:25,  0 users,  load average: 0.25, 0.31, 0.37
qamaster.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   1:27,  0 users,  load average: 1.19, 1.84, 2.49
openqaw5-xen.qe.prg2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   0:39,  16 users,  load average: 0.90, 1.18, 0.97
unreal6.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:38  up   0:47,  16 users,  load average: 0.33, 0.55, 0.85
openqa-piworker.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:41  up   3:02,  0 users,  load average: 1.04, 0.35, 0.22
backup-qam.qe.nue2.suse.org:
    VERSION_ID="15.6"
     23:20:54  up   2:03,  0 users,  load average: 0.16, 0.09, 0.07

Putting into feedback for the next days and monitoring for the impact.
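
For longer host lists, the output of the VERSION_ID check above could be reduced to just the hosts that are not yet on the target release. A minimal sketch, assuming the salt output layout shown above (host name ending in a colon, indented result lines); `hosts_not_on` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read the salt cmd.run output on stdin and print
# "<host> <version>" for every host whose VERSION_ID differs from $1.
hosts_not_on() {
    want="$1"
    awk -v want="$want" '
        # Unindented lines ending in ":" are minion names.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        /VERSION_ID=/ {
            v = $0
            sub(/^.*VERSION_ID="?/, "", v)
            sub(/".*$/, "", v)
            if (v != want) print host, v
        }'
}

# Intended use:
#   sudo salt --no-color -C '<target>' cmd.run 'grep VERSION_ID /etc/os-release' | hosts_not_on 15.6
```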

Actions #11

Updated by okurz 8 months ago

  • Description updated (diff)
  • Status changed from Feedback to Resolved

Rollback step covered. No further problems observed. Follow-up tickets for specific problems, like the one on s390zl12+s390zl13, have been created. With this the ACs are also covered, with the mentioned exceptions.

Actions #12

Updated by okurz 7 months ago

  • Due date deleted (2024-05-22)