action #55607

closed

Upgrade all OSD workers to a supported OS version (e.g. from Leap 42.3 to 15.1) and consistent for all

Added by okurz over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
2019-07-11
Due date:
% Done:

0%

Estimated time:

Description

Acceptance criteria

  • AC1: sudo salt -C 'G@roles:worker' cmd.run 'grep VERSION= /etc/os-release' shows no 42.3 / 12-SP3 anymore
  • AC2: All are the same, e.g. 15.1
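Both criteria can be checked in one pass over the salt output. The following is only a sketch: `outdated_workers` is a hypothetical helper name, and it assumes the output format shown in this ticket (a "host:" line followed by an indented VERSION="..." line).

```shell
#!/bin/sh
# Hypothetical helper: list workers whose VERSION differs from the expected one.
# Reads salt cmd.run style output on stdin and prints "host version" for every
# host that does not report exactly the version given as first argument.
outdated_workers() {
    awk -v want="$1" '
        # A non-indented line ending in ":" names the host.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # The indented line carries VERSION="..."; take the quoted value.
        /VERSION=/ {
            split($0, parts, "\"")
            if (parts[2] != want) print host, parts[2]
        }
    '
}
```

Piping the AC1 command into `outdated_workers 15.1` should print nothing once every worker reports the same target version.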

Related issues 5 (0 open, 5 closed)

  • Related to openQA Infrastructure - action #56231: [tools][functional][u] ppc64le worker don't show SLOF boot image anymore after upgrade to Leap 15.1 (Closed, mgriessmeier, 2019-09-02 to 2019-09-06)
  • Related to openQA Tests - action #56444: [sle][functional][u][spvm] test fails in grub_test - no OS was detected by firmware (Rejected, mgriessmeier, 2019-09-04)
  • Copied from openQA Infrastructure - action #54137: Upgrade osd to a supported Leap version (from 42.3) (Resolved, okurz, 2019-07-11)
  • Copied to openQA Infrastructure - action #55616: qa-power8-4-kvm is missing many installed updates, packages unsupported (potentially other machines as well) (Resolved, okurz, 2019-07-11)
  • Copied to openQA Infrastructure - action #56135: openqaw1 gets stuck on (re-)boot trying to boot from PXE because of infinite timeout? (Rejected, okurz, 2019-07-11)

Actions #1

Updated by okurz over 4 years ago

  • Copied from action #54137: Upgrade osd to a supported Leap version (from 42.3) added
Actions #2

Updated by okurz over 4 years ago

I think nsinger upgraded all workers but it seems we are not there yet:

sudo salt -C 'G@roles:worker' cmd.run 'grep VERSION= /etc/os-release' | grep -v 15.1
QA-Power8-5-kvm.qa.suse.de:
    VERSION="12-SP3"
QA-Power8-4-kvm.qa.suse.de:
    VERSION="12-SP3"
malbec.arch.suse.de:
    VERSION="12-SP3"
powerqaworker-qam-1:
    VERSION="12-SP3"
…
openqaw2.qa.suse.de:
    VERSION="42.3"
openqaworker-arm-1.suse.de:
    VERSION="12-SP3"
openqaw1.qa.suse.de:
    VERSION="42.3"
openqaworker-arm-2.suse.de:
    VERSION="12-SP3"
grenache-1.qa.suse.de:
    VERSION="12-SP3"

EDIT: Updated after openqaworker13 was upgraded

Actions #3

Updated by okurz over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I found openqaworker13, which was not listed on https://confluence.suse.com/display/openqa/openQA , and added it. The machine had a mix of 42.3 and 15.0 repos. I harmonized them to 15.1 using $releasever; the update repo was also not enabled at all.

I wonder if there is a good way to stop openQA worker instances only when they are not currently executing jobs, so that I can gradually disable workers until all are down for a clean upgrade. For now I have used

for i in $(systemctl status openqa-worker@\* | sed -n 's/^.*instance //p') ; do systemctl status openqa-worker@$i | grep 'Cleaning up for next job' | grep -v grep && echo $i && systemctl stop openqa-worker@$i; done

That did not work, probably because salt re-enabled the worker instances again after some minutes. I wonder if I can simply disable salt-minion to work manually during the upgrade. So I upgraded to the most recent package set in openSUSE Leap 15.0 and then did the dist upgrade:

zypper --releasever=15.1 ref && zypper --releasever=15.1 -n dup --allow-vendor-change --download-only && zypper --releasever=15.1 -n dup --allow-vendor-change

rebooted, checked services and openqaworker13 happily works on jobs, e.g. https://openqa.suse.de/tests/3257910
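The "stop only idle worker instances" idea from this comment could be sketched as below. This is an illustration under assumptions, not a tested procedure: `is_idle_line` and `stop_idle_workers` are hypothetical names, the "Cleaning up for next job" log line is only the heuristic used above, a job can still start between the check and the stop, and as noted above salt may start the instances again unless that is prevented separately.

```shell
#!/bin/sh
# Heuristic taken from the comment above: an instance whose most recent log
# line contains "Cleaning up for next job" is assumed to be between jobs.
is_idle_line() {
    case "$1" in
        *"Cleaning up for next job"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Stop every active openqa-worker@N instance that currently looks idle and
# print the names of the units that were stopped.
stop_idle_workers() {
    systemctl list-units --plain --no-legend --state=active 'openqa-worker@*.service' |
    awk '{ print $1 }' |
    while read -r unit; do
        # Last journal line of the unit, message text only.
        last=$(journalctl -u "$unit" -n 1 -o cat)
        if is_idle_line "$last"; then
            systemctl stop "$unit" && echo "$unit"
        fi
    done
}
```

Calling `stop_idle_workers` repeatedly, e.g. once a minute, would take instances down one by one as they finish their jobs.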

Actions #4

Updated by okurz over 4 years ago

  • Copied to action #55616: qa-power8-4-kvm is missing many installed updates, packages unsupported (potentially other machines as well) added
Actions #5

Updated by okurz over 4 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

updated the list of still todo workers

Actions #6

Updated by okurz over 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

first, changed all repo definitions in salt to also use $releasever for the workers: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/154 and https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/156 for another improvement adding proper priorities.

Currently upgrading openqaw1, openqaw2 and openqaworker-arm-2.

  • openqaw1: upgraded but does not boot, and cscreen on qanet.qa (qanetnue) does not show it. EDIT: Solved. A look into /etc/cscreenrc revealed telnet qats03nue.qa.suse.de 7023 as the way to access openqaw1. It was stuck in an endless spinner animation waiting for PXE boot, whereas we should simply boot from the first HDD entry. TODO: Fix this to use a timeout for PXE boot and fall back to HDD boot. Apart from that it runs tests fine: https://openqa.suse.de/tests/3311632
  • openqaw2: ready for reboot but I did not trigger it, as cscreen also does not show anything from it. EDIT: Rebooted fine. Executing job https://openqa.suse.de/tests/3311672 , also looks good.
  • openqaworker-arm-2: repos repaired but not yet upgraded, don't know how to resolve conflicts:
openqaworker-arm-2:/etc/zypp/repos.d # zypper --releasever=15.1 dup --allow-vendor-change --allow-downgrade --replacefiles --download-in-advance
Warning: Enforced setting: $releasever=15.1
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Loading repository data...
Reading installed packages...
Computing distribution upgrade...
204 Problems:
Problem: nothing provides rubygem(ruby:2.5.0:suse-connect) >= 0.3.10 needed by zypper-migration-plugin-0.12.1529570802.4a668e3-lp151.1.1.noarch
Problem: nothing provides rubygem(ruby:2.5.0:cfa) needed by yast2-tftp-server-4.1.7-lp151.1.1.noarch
Problem: nothing provides yast2 >= 4.1.3 needed by yast2-sysconfig-4.1.2-lp151.1.1.noarch
…
Problem: nothing provides psmisc = 23.0 needed by psmisc-lang-23.0-lp151.6.1.noarch
Problem: nothing provides perl(:MODULE_COMPAT_5.26.1) needed by perl-strictures-2.000005-lp151.2.1.noarch
Problem: nothing provides perl(:MODULE_COMPAT_5.26.1) needed by perl-libwww-perl-6.31-lp151.2.1.noarch
Problem: nothing provides perl(:MODULE_COMPAT_5.26.1) needed by perl-constant-boolean-0.02-lp151.2.1.noarch
…

any ideas?

EDIT: I realized my mistake regarding side-grade. It's aarch64 so the repo URLs need to point to "ports/aarch64/…". So I changed all repo files accordingly and did:

zypper --releasever=15.1 dup --details --allow-vendor-change --allow-downgrade --replacefiles --auto-agree-with-licenses --download-in-advance

During shutdown the system stayed on "[ OK ] Reached target Shutdown." for about 1h(!) but eventually restarted by itself: "[5175140.553820] reboot: Restarting system". The system came up fine but some services failed to start. Checking repos I saw "Warning: The /etc/products.d/baseproduct symlink is dangling or missing!", which I fixed with sudo ln -sf openSUSE.prod /etc/products.d/baseproduct, but a subsequent zypper dup only removed an old kernel.

One problem seems to be in /etc/systemd/system/openqa_nvme_prepare.service, which is hardly upgrade-related: There are calls to mkdir which fail if the directories already exist, so I changed them to mkdir -p. https://gitlab.suse.de/okurz/salt-states-openqa/merge_requests/new?merge_request%5Bsource_branch%5D=fix%2Farm TODO check if this is now reboot- and reinstall-safe.

The next reboot left the machine without network, maybe something related to the tap devices and bridge? Killing (not terminating, killing) the wickedd processes made them respawn, which actually fixed it. I assume wickedd relies on some other information or services that were not up in time. TODO crosscheck this on the next reboots and find a fix.

Also the haveged service seems to fail often; the manual command /usr/sbin/haveged -w 1024 -v 2 -F fails with "haveged: Couldn't initialize HAVEGE rng 9". Commented in bsc#1138001.

All services started and jobs were picked up, e.g. https://openqa.suse.de/tests/3312815 . It seems we are missing "Config::Tiny" for the tests, see https://openqa.suse.de/tests/3315693/file/autoinst-log.txt . I installed it manually and created https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8337 ; https://openqa.suse.de/tests/3315695 looks fine.

Upgraded openqaworker-arm-1 as well, https://openqa.suse.de/tests/3316623 is fine.

Onto power: First I prepared the repos and synced them to all machines:

for i in powerqaworker-qam-1.qa QA-Power8-5-kvm.qa.suse.de malbec.arch.suse.de QA-Power8-4-kvm.qa.suse.de grenache-1.qa.suse.de; do rsync --delete -aHP ./ $i:/tmp/repos/; done

Then

SUSEConnect -d
rsync -aHP --delete /tmp/repos/ /etc/zypp/repos.d/
zypper --releasever=15.1 ref
zypper --releasever=15.1 dup --details --allow-vendor-change --allow-downgrade --replacefiles --auto-agree-with-licenses --download-in-advance
ln -sf openSUSE.prod /etc/products.d/baseproduct
reboot
zypper rm $(zypper packages --unneeded| awk '/^i/{ print $5 }' ORS=" ")
zypper dup
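The dangling baseproduct symlink that zypper warned about on openqaworker-arm-2 can be detected before relinking. A minimal sketch, assuming that openSUSE.prod is the right target as in the command used in this ticket; `fix_baseproduct` is a hypothetical helper name and the optional directory argument exists only to make it testable outside /etc.

```shell
#!/bin/sh
# Hypothetical helper: relink baseproduct only if it is dangling or missing.
fix_baseproduct() {
    dir=${1:-/etc/products.d}
    link="$dir/baseproduct"
    # "-e" follows symlinks, so it is false for a dangling or missing link.
    if [ ! -e "$link" ]; then
        # Same relative link target as used manually in this ticket.
        ln -sf openSUSE.prod "$link" && echo "relinked $link"
    fi
}
```

Run as root without arguments it acts on /etc/products.d and leaves a link that already resolves untouched.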

The test job on malbec after the upgrade seems fine: https://openqa.suse.de/tests/3316625 .

Trying with qa-power8-kvm.qa first, following https://wiki.suse.net/index.php?title=SUSE/QA_SLE_PPC_Infrastructure , I wanted to see if I can use a remote serial port. The Java console within the web interface needs https://serverfault.com/questions/853051/java-issue-driving-me-crazy . I could connect with ipmi after the machine ended up in "petitboot":

ipmitool -I lanplus -H qa-power8-4.qa.suse.de -U ADMIN -P XXX sol activate

but I did not know how to properly boot the machine further. I tried from the petitboot environment with

kexec -l /var/petitboot/mnt/dev/sdb2/boot/vmlinux-4.12.14-lp151.27-default --initrd=/var/petitboot/mnt/dev/sdb2/boot/initrd-4.12.14-lp151.27-default --append="root=UUID=eebe647f-e867-416e-a0fa-7a6732bfcf9d console=tty0 console=ttyS1,115200 nospec" && kexec -e

This worked fine but should be checked again. TODO check reboot safeness for qa-power8-4-kvm.qa. The boot problem could also be related to an error when generating the grub config: "/etc/default/grub: line 14: nospec': command not found". I removed the triple quoting "'" in line 14 of /etc/default/grub and the generation now works without error. I did the same on qa-power8-5-kvm. TODO crosscheck other ppc64le machines as well.
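For the crosscheck of the other ppc64le machines, a quoting problem like the "nospec': command not found" error can be caught before regenerating the grub config, because /etc/default/grub is read as a shell file. The sketch below rests on assumptions: `check_shell_defaults` is a hypothetical helper, not part of grub2, and the exact content of line 14 is not shown in this ticket. It sources the file, so it executes whatever the file contains, just as the config generation does.

```shell
#!/bin/sh
# Hypothetical helper: source a shell-style defaults file in a subshell and
# treat anything written to stderr (e.g. "command not found") as a sign of
# broken quoting. Pass a path containing a slash, e.g. /etc/default/grub.
check_shell_defaults() {
    file=$1
    # PATH points to a dummy directory so that a stray word cannot
    # accidentally run a real command; only stderr is captured.
    errors=$( (PATH=/nonexistent; . "$file") 2>&1 >/dev/null )
    if [ -n "$errors" ]; then
        printf '%s\n' "$errors" >&2
        return 1
    fi
}
```

A line such as GRUB_CMDLINE_LINUX_DEFAULT="'"console=tty0 nospec"'" (an assumed reproduction, not the original line) makes the shell try to run a command named nospec', so `check_shell_defaults /etc/default/grub` would return non-zero for it.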

openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3314309 _GROUP=0 TEST=okurz_poo55607_test_after_upgrade BUILD=X WORKER_CLASS=QA-Power8-5-kvm

Created job #3316626: sle-15-SP1-Server-DVD-Incidents-Minimal-ppc64le-Build:12406:rpmlint-mini-qam-minimal+sle15@ppc64le -> https://openqa.suse.de/t3316626 seems fine but was wrong host :D

openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 3314309 _GROUP=0 TEST=okurz_poo55607_test_after_upgrade BUILD=X WORKER_CLASS=QA-Power8-4-kvm

Created job #3316627: sle-15-SP1-Server-DVD-Incidents-Minimal-ppc64le-Build:12406:rpmlint-mini-qam-minimal+sle15@ppc64le -> https://openqa.suse.de/t3316627

qa-power8-5-kvm upgraded as well, ensured to fix /etc/default/grub before reboot (see above) and the system rebooted fine. TODO kdump failed because the boot parameters miss proper "crashkernel" information on both qa-power8-4-kvm and qa-power8-5-kvm.
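Whether the running kernel was booted with a "crashkernel" parameter can be checked directly. A minimal sketch: `has_crashkernel` is a hypothetical helper, and it only tests for the presence of the parameter, not whether its value is suitable for kdump.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line file
# (default: /proc/cmdline) contains a crashkernel= parameter.
has_crashkernel() {
    grep -Eq '(^|[[:space:]])crashkernel=' "${1:-/proc/cmdline}"
}
```

On the two machines above, `has_crashkernel || echo "crashkernel missing"` would be expected to report the missing parameter until the boot configuration is fixed.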

Upgraded powerqaworker-qam-1.qa, found no way to reach any MC, triggered a reboot. TODO the machine does not come up. As suggested by nsinger, I could log in to https://fsp1-powerqaworker-qam.qa.suse.de/ (username+password on the internal openQA wiki page) and it seems I could power the machine on/off there. For the serial port I could ssh to "hscroot@powerhmc2.arch.suse.de" with the old default root password, then use "vtmenu" to navigate to "QA-Power8-2-8247-22L-SN1010D5A" and partition 5 with name "qam-1", but the system does not seem to be operational. lssyscfg -r lpar -m "QA-Power8-2-8247-22L-SN1010D5A" | grep 'qam-1' shows the information as well. Somehow I have the feeling that what is available on "powerhmc2.arch.suse.de" does not correspond to that machine. However, it seems that powering up over the HMC fixed it: the machine is up and running. TODO Failed systemd services: "kdump", "lm_sensors", "smartd".

This leaves only grenache-1.qa. I can use ssh padmin@grenache.qa.suse.de 'mkvterm --id 3' to get a serial console. Did the complete upgrade, machine came up fine again. Failed services: "smartd.service", "systemd-modules-load.service" and "telegraf.service". Need to take a look into this later.

Looks like the problem of telegraf on ppc64le might be the same as reported in #54128#note-7 : the package is simply not provided by devel:languages:go for either aarch64 or ppc64le for Leap 15.1. Now building in https://build.opensuse.org/project/show/home:okurz:telegraf ; maybe we want to simply add the package to devel:openQA:Leap:15.1

Actions #7

Updated by okurz over 4 years ago

  • Copied to action #56135: openqaw1 gets stuck on (re-)boot trying to boot from PXE because of infinite timeout? added
Actions #8

Updated by mgriessmeier over 4 years ago

  • Related to action #56231: [tools][functional][u] ppc64le worker don't show SLOF boot image anymore after upgrade to Leap 15.1 added
Actions #9

Updated by okurz over 4 years ago

  • Related to action #56444: [sle][functional][u][spvm] test fails in grub_test - no OS was detected by firmware added
Actions #10

Updated by okurz over 4 years ago

  • Status changed from In Progress to Resolved

Noted down the problems about the failed services in #56588 . Regarding the rest I call this ticket done.
