action #157981

closed

coordination #157969: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui and production workloads, to openSUSE Leap 15.6

Upgrade osd webUI host to openSUSE Leap 15.6 size:S

Added by okurz 9 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Organisational
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

  • Need to upgrade machines before EOL of Leap 15.5 and have a consistent environment

Acceptance criteria

  • AC1: osd webui host runs a cleanly upgraded openSUSE Leap 15.6 (no failed systemd services, no leftover .rpmnew files, etc.)
  • AC2: The openQA database runs the default version of PostgreSQL in current Leap
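
A minimal sketch of how AC1 could be checked on the host, assuming a bash shell. The function names `list_rpm_leftovers` and `count_failed_units` are made up for illustration and are not part of openQA or any existing tooling. RPM typically leaves unmerged configuration behind as `.rpmnew`, `.rpmsave` or `.rpmorig` files.

```shell
#!/bin/bash
# Hypothetical helpers for checking AC1; names are illustrative only.

# List leftover RPM config files (.rpmnew, .rpmsave, .rpmorig) below a directory.
list_rpm_leftovers() {
    local dir="${1:-/etc}"
    find "$dir" -type f \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) 2>/dev/null | sort
}

# Print the number of failed systemd units (requires systemd).
count_failed_units() {
    systemctl list-units --state=failed --no-legend --plain | grep -c .
}
```

On a cleanly upgraded host, `list_rpm_leftovers /etc` should print nothing and `count_failed_units` should print 0.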

Suggestions

Further details

  • If we lose access to the machine we need the help of EngineeringInfrastructure as only they have access to the VM

Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #168337: [tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available" (Resolved, nicksinger, 2024-10-17)

Related to openQA Infrastructure (public) - action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify (Resolved, jbaier_cz)

Copied from openQA Project (public) - action #130594: Upgrade osd webUI host to openSUSE Leap 15.5 (Resolved, okurz)

Copied to openQA Project (public) - action #168721: OSD openqa.ini grossly incomplete (Resolved, okurz, 2024-10-22)

Actions #1

Updated by okurz 9 months ago

  • Copied from action #130594: Upgrade osd webUI host to openSUSE Leap 15.5 added
Actions #2

Updated by okurz 9 months ago

  • Subject changed from Upgrade osd webUI host to openSUSE Leap 15.5 to Upgrade osd webUI host to openSUSE Leap 15.6
  • Description updated (diff)
  • Assignee deleted (okurz)
  • Target version changed from Ready to future
Actions #3

Updated by okurz 8 months ago

  • Target version changed from future to Tools - Next
Actions #4

Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

In preparation of the upgrade I am already migrating postgres to 16:

oldver=15 newver=16
zypper in postgresql$newver-server postgresql$newver-contrib
sudo -u postgres /usr/lib/postgresql$newver/bin/initdb --encoding=UTF8 --locale=en_US.UTF-8 --lc-collate=C --lc-ctype=en_US.UTF-8 --lc-messages=C --lc-monetary=C --lc-numeric=C --lc-time=C -D /var/lib/pgsql/data.$newver
sudo -u postgres vimdiff /var/lib/pgsql/data.$oldver/postgresql.conf /var/lib/pgsql/data.$newver/postgresql.conf
sudo -u postgres /usr/lib/postgresql$newver/bin/pg_upgrade --check --link \
    --old-bindir=/usr/lib/postgresql$oldver/bin --new-bindir=/usr/lib/postgresql$newver/bin \
    --old-datadir=/var/lib/pgsql/data.$oldver --new-datadir=/var/lib/pgsql/data.$newver &&
  systemctl stop openqa-webui openqa-scheduler openqa-livehandler openqa-gru postgresql &&
  sudo -u postgres /usr/lib/postgresql$newver/bin/pg_upgrade --link \
    --old-bindir=/usr/lib/postgresql$oldver/bin --new-bindir=/usr/lib/postgresql$newver/bin \
    --old-datadir=/var/lib/pgsql/data.$oldver --new-datadir=/var/lib/pgsql/data.$newver &&
  ln --force --no-dereference --relative --symbolic /var/lib/pgsql/data.$newver /var/lib/pgsql/data &&
  systemctl start postgresql openqa-webui openqa-scheduler openqa-livehandler openqa-gru &&
  sudo -u geekotest psql -c 'select version();' openqa
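
The final `select version();` call prints a string that starts with `PostgreSQL <major>.<minor>`. As a sketch of how that check could be scripted, assuming bash, the hypothetical helper `pg_major` below extracts the major version from such a string so it can be compared against `$newver`. `check_pg_major` is likewise an illustrative name and assumes the same user and database as the commands above.

```shell
# Hypothetical helper: extract the major version from "select version();" output.
pg_major() {
    # Expects a string starting with "PostgreSQL <major>.<minor> ..."
    printf '%s\n' "$1" | sed -n 's/^ *PostgreSQL \([0-9][0-9]*\).*/\1/p'
}

# Succeed only if the running server reports the expected major version.
# Assumes the geekotest user and the openqa database, as in the commands above.
check_pg_major() {
    local expected="$1" actual
    actual=$(pg_major "$(sudo -u geekotest psql -At -c 'select version();' openqa)")
    [ "$actual" = "$expected" ]
}
```

Usage could look like `check_pg_major "$newver" && echo "running PostgreSQL $newver"`.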
Actions #5

Updated by okurz 5 months ago · Edited

Just prepared. Want to continue after EOB.

EDIT (2024-07-18 19:27Z): Done. Running pgsql 16 now. zypper se --installed-only postgres showed that we also had postgresql13 installed. I removed that but kept postgresql15 for now. Should delete the old data directory after some days without problems.

Actions #6

Updated by okurz 5 months ago

  • Status changed from In Progress to New
  • Assignee deleted (okurz)
Actions #7

Updated by livdywan 5 months ago

  • Subject changed from Upgrade osd webUI host to openSUSE Leap 15.6 to Upgrade osd webUI host to openSUSE Leap 15.6 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by okurz 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #9

Updated by okurz 2 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
Actions #10

Updated by okurz 2 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

o3 webUI upgrade done. No relevant problems encountered.

Actions #11

Updated by tinita 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #12

Updated by nicksinger 2 months ago

  • Status changed from In Progress to Feedback

Upgrade conducted. After running zypper dup, openqa-webui and openqa-gru failed with:

openqa:~ # systemctl status --failed
× openqa-gru.service - The openQA daemon for various background tasks like cleanup and saving needles
     Loaded: loaded (/usr/lib/systemd/system/openqa-gru.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/openqa-gru.service.d
             └─30-openqa-hook-timeout.conf, override.conf
     Active: failed (Result: exit-code) since Wed 2024-10-16 18:10:59 UTC; 6min ago
   Duration: 2ms
   Main PID: 7394 (code=exited, status=217/USER)

Oct 16 18:10:59 openqa systemd[1]: openqa-gru.service: Scheduled restart job, restart counter is at 5.
Oct 16 18:10:59 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Oct 16 18:10:59 openqa systemd[1]: openqa-gru.service: Start request repeated too quickly.
Oct 16 18:10:59 openqa systemd[1]: openqa-gru.service: Failed with result 'exit-code'.
Oct 16 18:10:59 openqa systemd[1]: Failed to start The openQA daemon for various background tasks like cleanup and saving needles.

× openqa-webui.service - The openQA web UI
     Loaded: loaded (/usr/lib/systemd/system/openqa-webui.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/openqa-webui.service.d
             └─30-openqa-webui-hook-timeout.conf, storage.conf
     Active: failed (Result: exit-code) since Wed 2024-10-16 18:10:58 UTC; 6min ago
   Duration: 3ms
   Main PID: 7357 (code=exited, status=217/USER)

Oct 16 18:10:58 openqa systemd[1]: Started The openQA web UI.
Oct 16 18:10:58 openqa (i-daemon)[7357]: openqa-webui.service: Failed to determine user credentials: No such process
Oct 16 18:10:58 openqa (i-daemon)[7357]: openqa-webui.service: Failed at step USER spawning /usr/share/openqa/script/openqa-webui-daemon: No such process
Oct 16 18:10:58 openqa systemd[1]: openqa-webui.service: Main process exited, code=exited, status=217/USER
Oct 16 18:10:58 openqa systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
Oct 16 18:14:54 openqa systemd[1]: openqa-webui.service: Unit cannot be reloaded because it is inactive.

But after a reboot all issues went away and systemd reports "State: running".

I will monitor some jobs over the evening and will resolve tomorrow if no problems arise.
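
Exit status `217/USER` together with "Failed to determine user credentials" indicates that systemd could not resolve the account configured in `User=` at the time the unit was started. A minimal diagnostic sketch for such a case, assuming bash and systemd; the helper names `user_resolves` and `check_unit_user` are hypothetical:

```shell
# Hypothetical helper: succeed if the given account name can be resolved.
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

# Report whether the User= configured for a systemd unit can be resolved.
check_unit_user() {
    local unit="$1" user
    user=$(systemctl show --property=User --value "$unit")
    if [ -z "$user" ]; then
        echo "$unit: no User= set"
    elif user_resolves "$user"; then
        echo "$unit: user '$user' resolves"
    else
        echo "$unit: user '$user' does not resolve"
        return 1
    fi
}
```

For example, `check_unit_user openqa-webui.service` could be run before deciding whether a reboot is needed.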

Actions #13

Updated by okurz 2 months ago

  • Related to action #168337: [tools]test fails in bootloader_zkvm - auto_review:"qemu-img.*Failed to get shared.*No locks available" added
Actions #14

Updated by nicksinger 2 months ago

  • Status changed from Feedback to In Progress

https://suse.slack.com/archives/C02CANHLANP/p1729129131475009 was reported. I looked into it, which caused https://progress.opensuse.org/issues/168358
Now checking again whether the additional restart of the nfs-server has already fixed the original problem, so that we can remove the transient workaround (nolock option on the NFS client) applied by @okurz.

Actions #15

Updated by nicksinger 2 months ago

nicksinger wrote in #note-14:

https://suse.slack.com/archives/C02CANHLANP/p1729129131475009 was reported. I looked into it, which caused https://progress.opensuse.org/issues/168358
Now checking again whether the additional restart of the nfs-server has already fixed the original problem, so that we can remove the transient workaround (nolock option on the NFS client) applied by @okurz.

We can't. But I found that adding nolock apparently also adds local_lock=all, which was enough on zl12 to make it work again, and I like it more than disabling locking completely (despite not knowing the possible problems). It might also give me a hint about what fails to write remotely, why, and how to debug/fix it.
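
To see which locking options are actually in effect on a client, the mount options can be read from `/proc/mounts`. A minimal sketch, assuming bash on Linux; `has_mount_option` and `nfs_mount_options` are hypothetical helper names, and the mount point in the usage line is a placeholder:

```shell
# Hypothetical helper: succeed if a comma-separated option string
# contains the given option as an exact entry.
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the mount options of the NFS mount at the given mount point, if any.
nfs_mount_options() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { print $4 }' /proc/mounts
}
```

Usage could look like `has_mount_option "$(nfs_mount_options /path/to/nfs/mount)" local_lock=all && echo "local locking in effect"`.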

Actions #16

Updated by openqa_review 2 months ago

  • Due date set to 2024-11-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by jbaier_cz 2 months ago

  • Related to action #168544: [alert] Failed systemd services alert: check-for-kernel-crash, kdump-notify added
Actions #18

Updated by nicksinger 2 months ago

  • Status changed from In Progress to Resolved

I consider the upgrade itself done. See linked issues for the related NFS issues and how they got addressed. Consider reopening them instead of this one.

Actions #19

Updated by okurz 2 months ago

  • Due date deleted (2024-11-01)
Actions #20

Updated by okurz 2 months ago
