action #157981
coordination #157969 (closed): [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6
Upgrade osd webUI host to openSUSE Leap 15.6 size:S
Description
Motivation
- Need to upgrade machines before EOL of Leap 15.5 and have a consistent environment
Acceptance criteria
- AC1: osd webui host runs a clean upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.)
- AC2: The openQA database runs the default version of PostgreSQL in current Leap
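AC1 could be checked with a small script along these lines. This is only a sketch, not tooling from this ticket: the function names are made up for illustration, and it assumes the usual rpm leftover suffixes (.rpmnew, .rpmsave, .rpmorig) and that /etc is the directory of interest.

```shell
#!/bin/bash
# Hypothetical helpers for checking AC1; names are illustrative only.

# List configuration leftovers that rpm creates during package upgrades.
# Usage: find_rpm_leftovers [DIR]   (DIR defaults to /etc, an assumption)
find_rpm_leftovers() {
    local dir="${1:-/etc}"
    find "$dir" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Print the number of failed systemd units; 0 means a clean state.
count_failed_units() {
    systemctl list-units --state=failed --no-legend --plain | wc -l
}

# Example (run as root on the upgraded host):
#   find_rpm_leftovers /etc
#   count_failed_units
```

An empty list from the first function and a count of 0 from the second would match what AC1 asks for.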
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the instance is only executing a few or no openQA test jobs
- After the upgrade, reboot and check that everything works as expected
- Consider upgrading PostgreSQL according to https://open.qa/docs/#_migrating_postgresql_database_on_opensuse
Further details
- If we lose access to the machine, we need the help of EngineeringInfrastructure, as only they have access to the VM
Updated by okurz 9 months ago
- Copied from action #130594: Upgrade osd webUI host to openSUSE Leap 15.5 added
Updated by okurz 5 months ago
- Status changed from New to In Progress
- Assignee set to okurz
In preparation of the upgrade I am already migrating postgres to 16:
oldver=15 newver=16
zypper in postgresql$newver-server postgresql$newver-contrib
sudo -u postgres /usr/lib/postgresql$newver/bin/initdb --encoding=UTF8 --locale=en_US.UTF-8 --lc-collate=C --lc-ctype=en_US.UTF-8 --lc-messages=C --lc-monetary=C --lc-numeric=C --lc-time=C -D /var/lib/pgsql/data.$newver
sudo -u postgres vimdiff /var/lib/pgsql/data.$oldver/postgresql.conf /var/lib/pgsql/data.$newver/postgresql.conf
sudo -u postgres /usr/lib/postgresql$newver/bin/pg_upgrade --check --link --old-bindir=/usr/lib/postgresql$oldver/bin --new-bindir=/usr/lib/postgresql$newver/bin --old-datadir=/var/lib/pgsql/data.$oldver --new-datadir=/var/lib/pgsql/data.$newver &&
systemctl stop openqa-webui openqa-scheduler openqa-livehandler openqa-gru postgresql &&
sudo -u postgres /usr/lib/postgresql$newver/bin/pg_upgrade --link --old-bindir=/usr/lib/postgresql$oldver/bin --new-bindir=/usr/lib/postgresql$newver/bin --old-datadir=/var/lib/pgsql/data.$oldver --new-datadir=/var/lib/pgsql/data.$newver &&
ln --force --no-dereference --relative --symbolic /var/lib/pgsql/data.$newver /var/lib/pgsql/data &&
systemctl start postgresql openqa-webui openqa-scheduler openqa-livehandler openqa-gru &&
sudo -u geekotest psql -c 'select version();' openqa
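The commands above pass the same pg_upgrade arguments twice, once with --check and once for the real run. A small helper could build that command line in one place. The following is a sketch only: the function name pg_upgrade_cmd is hypothetical, and the paths are the ones used in this comment.

```shell
#!/bin/bash
# Hypothetical helper: print the pg_upgrade command line for a migration.
# $1 = old major version, $2 = new major version,
# $3 = optional extra flag, e.g. --check for the dry run
pg_upgrade_cmd() {
    local oldver="$1" newver="$2" extra="${3:-}"
    local base=/var/lib/pgsql   # data directory layout as used in this ticket
    echo "/usr/lib/postgresql$newver/bin/pg_upgrade${extra:+ $extra} --link" \
        "--old-bindir=/usr/lib/postgresql$oldver/bin" \
        "--new-bindir=/usr/lib/postgresql$newver/bin" \
        "--old-datadir=$base/data.$oldver" \
        "--new-datadir=$base/data.$newver"
}

# Example: inspect the dry-run command before executing it as user postgres
#   pg_upgrade_cmd 15 16 --check
#   sudo -u postgres $(pg_upgrade_cmd 15 16 --check)
```

The helper only prints the command, so it can be reviewed first and reused for the next major version by changing the two numbers.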
Updated by okurz 5 months ago · Edited
Just prepared. Want to continue after EOB.
EDIT (2024-07-18 19:27Z): Done. Running pgsql 16 now. "zypper se --installed-only postgres" showed that we also had postgresql13 installed. I removed that but kept postgresql15 for now. Should delete the old data directory after some days without problems.
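Before deleting the old data directory it may be worth confirming which version the active data directory belongs to. PostgreSQL writes the major version into a PG_VERSION file inside each data directory, so a check could look like the sketch below. The function name is hypothetical and the default path is the symlink used earlier in this ticket.

```shell
#!/bin/bash
# Hypothetical helper: print the PostgreSQL major version of a data directory.
# Usage: active_pg_version [DATA_DIR]   (defaults to the symlink from this ticket)
active_pg_version() {
    local data_dir="${1:-/var/lib/pgsql/data}"
    # PG_VERSION is created by initdb and holds the major version, e.g. "16"
    cat "$data_dir/PG_VERSION"
}

# Example (as root or postgres, since the directory is not world-readable):
#   active_pg_version                        # version behind the symlink
#   active_pg_version /var/lib/pgsql/data.15 # version of the old directory
```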
Updated by nicksinger 2 months ago
- Status changed from In Progress to Feedback
Upgrade conducted. After running zypper dup, openqa-webui and openqa-gru failed with:
openqa:~ # systemctl status --failed
× openqa-gru.service - The openQA daemon for various background tasks like cleanup and saving needles
Loaded: loaded (/usr/lib/systemd/system/openqa-gru.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/openqa-gru.service.d
└─30-openqa-hook-timeout.conf, override.conf
Active: failed (Result: exit-code) since Wed 2024-10-16 18:10:59 UTC; 6min ago
Duration: 2ms
Main PID: 7394 (code=exited, status=217/USER)
Oct 16 18:10:59 openqa systemd[1]: openqa-gru.service: Scheduled restart job, restart counter is at 5.
Oct 16 18:10:59 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Oct 16 18:10:59 openqa systemd[1]: openqa-gru.service: Start request repeated too quickly.
Oct 16 18:10:59 openqa systemd[1]: openqa-gru.service: Failed with result 'exit-code'.
Oct 16 18:10:59 openqa systemd[1]: Failed to start The openQA daemon for various background tasks like cleanup and saving needles.
× openqa-webui.service - The openQA web UI
Loaded: loaded (/usr/lib/systemd/system/openqa-webui.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/openqa-webui.service.d
└─30-openqa-webui-hook-timeout.conf, storage.conf
Active: failed (Result: exit-code) since Wed 2024-10-16 18:10:58 UTC; 6min ago
Duration: 3ms
Main PID: 7357 (code=exited, status=217/USER)
Oct 16 18:10:58 openqa systemd[1]: Started The openQA web UI.
Oct 16 18:10:58 openqa (i-daemon)[7357]: openqa-webui.service: Failed to determine user credentials: No such process
Oct 16 18:10:58 openqa (i-daemon)[7357]: openqa-webui.service: Failed at step USER spawning /usr/share/openqa/script/openqa-webui-daemon: No such process
Oct 16 18:10:58 openqa systemd[1]: openqa-webui.service: Main process exited, code=exited, status=217/USER
Oct 16 18:10:58 openqa systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
Oct 16 18:14:54 openqa systemd[1]: openqa-webui.service: Unit cannot be reloaded because it is inactive.
But after a reboot all issues went away and systemd reports "State: running".
I will monitor some jobs over the evening and will resolve tomorrow if no problems arise.
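For reference, exit status 217/USER means systemd could not determine or change the user credentials configured for the unit. If the failure had persisted after the reboot, a check like the following might have helped narrow it down. This is a sketch under assumptions: the helper names are made up, and it simply compares the unit's User= setting against the local user database.

```shell
#!/bin/bash
# Hypothetical helpers for debugging status=217/USER failures.

# Return 0 if the given user name can be resolved, non-zero otherwise.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Print the User= setting of a systemd unit (empty if unset).
unit_user() {
    systemctl show --property=User --value "$1"
}

# Example (on the affected host):
#   u=$(unit_user openqa-webui.service)
#   user_exists "$u" && echo "user $u resolves" || echo "user $u not found"
```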
Updated by okurz 2 months ago
- Related to action #168337: [tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available" added
Updated by nicksinger 2 months ago
- Status changed from Feedback to In Progress
https://suse.slack.com/archives/C02CANHLANP/p1729129131475009 was reported which I looked into and caused https://progress.opensuse.org/issues/168358
Now checking again whether the additional restart of the nfs-server might already have fixed the original problem, so that we can remove the transient workaround (nolock option on the NFS client) applied by @okurz.
Updated by nicksinger 2 months ago
nicksinger wrote in #note-14:
https://suse.slack.com/archives/C02CANHLANP/p1729129131475009 was reported which I looked into and caused https://progress.opensuse.org/issues/168358
Now checking again whether the additional restart of the nfs-server might already have fixed the original problem, so that we can remove the transient workaround (nolock option on the NFS client) applied by @okurz.
We can't. But I found that adding nolock apparently also adds local_lock=all, which was enough on zl12 to make it work again, and I like it more than disabling locking completely (despite not knowing the possible problems). It might also give me a hint on what fails to write remotely and why, and how to debug/fix it.
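To see which of these options is actually in effect on a worker, the mount options can be inspected. The sketch below is an illustration only: the helper name is hypothetical, and the mount point in the usage example is an assumption about where the NFS share is mounted.

```shell
#!/bin/bash
# Hypothetical helper: check whether a comma-separated mount option string
# contains an exact option, e.g. "nolock" or "local_lock=all".
# $1 = option string, $2 = option to look for
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (mount point is an assumption, adjust to the real NFS mount):
#   opts=$(findmnt -no OPTIONS /var/lib/openqa/share)
#   has_mount_option "$opts" local_lock=all && echo "local locking active"
```

Wrapping the string in commas makes the match exact, so looking for "lock" does not accidentally match "nolock".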
Updated by openqa_review 2 months ago
- Due date set to 2024-11-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz 2 months ago
- Related to action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify added
Updated by nicksinger 2 months ago
- Status changed from In Progress to Resolved
I consider the upgrade itself done. See linked issues for the related NFS issues and how they got addressed. Consider reopening them instead of this one.
Updated by okurz 2 months ago
- Copied to action #168721: OSD openqa.ini grossly incomplete added