action #99195

coordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3

Upgrade o3 webUI host to openSUSE Leap 15.3 size:M

Added by okurz 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

  • Need to upgrade machines before EOL of Leap 15.2 and have a consistent environment

Acceptance criteria

  • AC1: o3 webui host runs a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no left over .rpm-new files, etc.)
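A minimal sketch of how the "no failed systemd services" part of AC1 could be checked. It assumes the output of `systemctl list-units --state=failed --plain --no-legend` is piped in on stdin; the helper names `failed_unit_names` and `no_failed_units` are hypothetical and do not come from the wiki.

```shell
# Print the unit name (first column) of every line read on stdin.
# Intended input: systemctl list-units --state=failed --plain --no-legend
# Assumption: "failed_unit_names" is a hypothetical helper name.
failed_unit_names() {
    awk 'NF > 0 { print $1 }'
}

# Succeed only if the input on stdin lists no failed units at all.
# Assumption: "no_failed_units" is a hypothetical helper name.
no_failed_units() {
    [ -z "$(failed_unit_names)" ]
}
```

Possible usage on the host: `systemctl list-units --state=failed --plain --no-legend | no_failed_units && echo "no failed units"`.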

Suggestions

Out of scope

  • Spawn a container instead of upgrading the host

Further details

  • If we lose access to the machine we need the help of EngineeringInfrastructure as only they have access to the VM

Related issues

Copied from openQA Infrastructure - action #75241: Upgrade o3 webUI host to openSUSE Leap 15.2 (Resolved, 2020-10-24)

Copied to openQA Infrastructure - action #99741: Minion jobs for job hooks failed silently on o3 (New, 2021-10-04)

History

#1 Updated by okurz 2 months ago

  • Copied from action #75241: Upgrade o3 webUI host to openSUSE Leap 15.2 added

#2 Updated by okurz 2 months ago

  • Subject changed from Upgrade o3 webUI host to openSUSE Leap 15.2 to Upgrade o3 webUI host to openSUSE Leap 15.3
  • Assignee deleted (mkittler)
  • Priority changed from High to Normal
  • Start date deleted (2020-10-24)

#3 Updated by cdywan 2 months ago

  • Subject changed from Upgrade o3 webUI host to openSUSE Leap 15.3 to Upgrade o3 webUI host to openSUSE Leap 15.3 size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by cdywan 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

It occurred to me that I can do this while poking at running tests, so I'm taking this now.

#5 Updated by cdywan 2 months ago

  • Status changed from In Progress to Feedback

cdywan wrote:

It occurred to me that I can do this while poking at running tests, so I'm taking this now.

Went through the upgrade as per the steps in the wiki, rebooted. Workers seem to have reconnected fine and jobs are running.

#6 Updated by okurz 2 months ago

rpmconfigcheck showed one file, /etc/postfix/master.cf.rpmnew, which I diffed with /etc/postfix/master.cf, took over some updates and then deleted the rpmnew file. I think the rest looks really good. Great work!
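The manual check described in this comment could be sketched roughly as follows. This is not how rpmconfigcheck itself works; `list_rpm_leftovers` and `diff_rpm_leftovers` are hypothetical helpers, and the default directory /etc is an assumption.

```shell
# List leftover RPM config files (*.rpmnew, *.rpmsave, *.rpmorig) below a directory.
# Assumption: "list_rpm_leftovers" is a hypothetical helper, not part of rpmconfigcheck.
list_rpm_leftovers() {
    local dir="${1:-/etc}"
    find "$dir" -type f \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) 2>/dev/null | sort
}

# Print a unified diff between each leftover file and the config file it belongs to.
# Assumption: "diff_rpm_leftovers" is a hypothetical helper name.
diff_rpm_leftovers() {
    list_rpm_leftovers "$1" | while IFS= read -r leftover; do
        # Strip the trailing .rpmnew/.rpmsave/.rpmorig to get the active config file.
        current="${leftover%.rpm*}"
        if [ -f "$current" ]; then
            # diff exits non-zero when files differ; that is expected here.
            diff -u "$current" "$leftover" || true
        fi
    done
}
```

Running `diff_rpm_leftovers /etc/postfix` would then show what the packaged master.cf.rpmnew changes relative to the active master.cf, before deciding what to take over and deleting the leftover.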

#7 Updated by cdywan 2 months ago

  • Status changed from Feedback to Resolved

okurz wrote:

rpmconfigcheck showed one file, /etc/postfix/master.cf.rpmnew, which I diffed with /etc/postfix/master.cf, took over some updates and then deleted the rpmnew file. I think the rest looks really good. Great work!

Arg, so I missed something after all... Thank you for checking!

#8 Updated by cdywan 2 months ago

  • Status changed from Resolved to Feedback

Apparently I missed something else, too:

Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.

Keeping in mind that osd uses UTC, this should fit into the time window of when I was wrapping up the upgrade as per my comment above. And there were apparmor changes, which I presumably didn't do correctly.

I also filed #99741 because this didn't trigger any alerts and was discovered by @tinita.

#9 Updated by cdywan 2 months ago

  • Copied to action #99741: Minion jobs for job hooks failed silently on o3 added

#10 Updated by cdywan 2 months ago

cdywan wrote:

Apparently I missed something else, too:

Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.

Keeping in mind that osd uses UTC, this should fit into the time window of when I was wrapping up the upgrade as per my comment above. And there were apparmor changes, which I presumably didn't do correctly.

I also filed #99741 because this didn't trigger any alerts and was discovered by @tinita.

I reset /etc/apparmor.d/local/usr.share.openqa.script.openqa to a comments-only file, which it should be after os-autoinst/openQA/pull/3847, and which I guess is what I mistook for something we needed to keep.

#11 Updated by cdywan 2 months ago

I still can't tell if the files in /etc/apparmor.d/{,local/}usr.share.openqa.script.openqa are correct. And I wasn't able to figure out how to access the most recent copy of /etc/apparmor.d/usr.share.openqa.script.openqa, if one exists, since the Backup section doesn't really explain that and it's pretty much Greek to me, if you'll excuse the pun.

#12 Updated by cdywan 2 months ago

For now I'm monitoring logs to see if the errors persist (via sudo journalctl -f -u openqa-gru). And I added the rule "/bin/sh mrix," to /etc/apparmor.d/usr.share.openqa.script.openqa.
Also tried sudo systemctl restart openqa-gru, to no apparent effect. Btw for reference o3 is on apparmor-profiles 2.13.6-1.31, as opposed to osd/2.13.4-lp152.2.3.1.
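To double-check that a rule such as the /bin/sh line mentioned in this comment actually ended up in a profile file, something along these lines might help. It is only a sketch: `profile_has_rule_for` is a hypothetical helper, and it only looks for a non-comment line whose first field is exactly the given path; it does not parse AppArmor syntax properly.

```shell
# Succeed if the profile file contains a non-comment rule line whose first
# whitespace-separated field is exactly the given path.
# Assumption: "profile_has_rule_for" is a hypothetical helper name.
profile_has_rule_for() {
    local profile_file="$1" path="$2"
    awk -v p="$path" '
        /^[[:space:]]*#/ { next }        # skip comment lines
        $1 == p          { found = 1 }   # e.g. "/bin/sh mrix," has $1 == "/bin/sh"
        END              { exit !found }
    ' "$profile_file"
}
```

Example: `profile_has_rule_for /etc/apparmor.d/usr.share.openqa.script.openqa /bin/sh && echo present`. Note that an edited profile generally only takes effect once it has been reloaded, for example with `apparmor_parser -r` on the profile file.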

#13 Updated by cdywan 2 months ago

It would seem comparing to osd was pointless since according to sudo aa-status it's currently switched off there 🤦️

Trying to see now if sudo aa-complain /usr/share/openqa/script/openqa{,-cli} yields some more information here.

#14 Updated by cdywan 2 months ago

/opt/os-autoinst-scripts/openqa-label-known-issues: line 83: hxselect: command not found
grep: write error: Broken pipe

Not sure if these are related, but while I'm at it I'm installing html-xml-utils.

Btw I also created a proof of concept for dependencies.yaml in the scripts repo, although this will need a bit of polishing before it can be used: https://github.com/os-autoinst/scripts/pull/116

#15 Updated by cdywan 2 months ago

  • Status changed from Feedback to Resolved

I'm assuming it's working now since I no longer see errors and I can see investigate jobs that spawned and finished successfully.

#16 Updated by tinita 2 months ago

  • Status changed from Resolved to Feedback

cdywan wrote:

Trying to see now if sudo aa-complain /usr/share/openqa/script/openqa{,-cli} yields some more information here.

This sets it to complain mode, and any violations are just logged (https://wiki.ubuntu.com/DebuggingApparmor#Debugging_procedure)

So if you don't see error messages in the gru journal, that's because it's in complain mode (but I don't know where the "complaints" actually end up).

So if you didn't do anything else, then this is not a fix.

#17 Updated by tinita 2 months ago

PR for apparmor profile fix: https://github.com/os-autoinst/openQA/pull/4271

In Leap 15.2, /bin/sh points to /bin/bash, while in 15.3 it points to /usr/bin/sh -> /usr/bin/bash.
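The difference in where /bin/sh ends up can be made visible by printing every hop of the symlink chain. The following is a small sketch; `show_link_chain` is a hypothetical helper name, and the limit of 40 hops is an arbitrary guard against symlink loops.

```shell
# Print each hop of a symlink chain, starting with the given path and
# ending with the first path that is not a symlink.
# Assumption: "show_link_chain" is a hypothetical helper name.
show_link_chain() {
    local path="$1" target hops=0
    echo "$path"
    while [ -L "$path" ] && [ "$hops" -lt 40 ]; do
        target=$(readlink "$path")
        case "$target" in
            /*) path="$target" ;;                     # absolute link target
            *)  path="$(dirname "$path")/$target" ;;  # relative to the link's directory
        esac
        echo "$path"
        hops=$((hops + 1))
    done
}
```

Running `show_link_chain /bin/sh` on a 15.2 and a 15.3 host should show which final path the profile has to allow.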

#18 Updated by cdywan about 2 months ago

tinita wrote:

PR for apparmor profile fix: https://github.com/os-autoinst/openQA/pull/4271

In Leap 15.2, /bin/sh points to /bin/bash, while in 15.3 it points to /usr/bin/sh -> /usr/bin/bash.

I'm wondering how you confirmed that this worked, since I seem to have seen successfully executed hooks without any errors in the entire journal 🤔️
So I guess to resolve it for good I need to find out where the presumed missing error messages end up, and document it.

#19 Updated by tinita about 2 months ago

Sorry, I forgot to add:
I did the mentioned fix locally (add /usr/bin/bash), and then did
aa-enforce /usr/share/openqa/script/openqa
to end the complain mode.
Then I saw successful hooks by looking into the minion_jobs table and I didn't see errors in the openqa-gru journal anymore.

Note that if apparmor is in complain mode, one is not supposed to see the error messages, but there will be messages in /var/log/audit/audit.log.
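A rough way to pull the relevant entries out of the audit log, as a sketch: `apparmor_events` is a hypothetical helper, and it assumes the common key="value" layout of audit records, where complain mode logs events as apparmor="ALLOWED" and enforce mode as apparmor="DENIED". The exact record layout may differ between systems.

```shell
# Print AppArmor audit events for one profile from an audit log file.
# Assumptions: "apparmor_events" is a hypothetical helper; the default log path
# and the key="value" record layout are taken from the usual audit.log format.
apparmor_events() {
    local profile="$1" log="${2:-/var/log/audit/audit.log}"
    # Keep only AppArmor ALLOWED/DENIED records, then only those of the given profile.
    grep -E 'apparmor="(ALLOWED|DENIED)"' "$log" | grep -F "profile=\"$profile\""
}
```

Example: `sudo bash -c '. ./helpers.sh; apparmor_events /usr/share/openqa/script/openqa'`, where helpers.sh is a hypothetical file holding the function above.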

Today I saw new errors though:
/opt/os-autoinst-scripts/openqa-label-known-issues: line 83: /usr/bin/hxselect: Permission denied
PR for that: https://github.com/os-autoinst/openQA/pull/4273

#20 Updated by tinita about 2 months ago

PR https://github.com/os-autoinst/openQA/pull/4273 merged, and I added the line manually on o3 to not wait until the next deployment.

#21 Updated by cdywan about 2 months ago

Post mortem:

  • [x] I filed #99741 to address the silent failures
  • [x] There's also #57239 which could have helped spot the problem earlier
  • [ ] Is there a feature request/bug on AppArmor wrt unclear error message?
  • [ ] I'll try and propose documentation for how AppArmor is handled with openQA

#22 Updated by cdywan about 2 months ago

cdywan wrote:

  • [ ] Is there a feature request/bug on AppArmor wrt unclear error message?

https://gitlab.com/apparmor/apparmor/-/issues/201

  • [ ] I'll try and propose documentation for how AppArmor is handled with openQA

https://github.com/os-autoinst/openQA/pull/4278

#23 Updated by cdywan about 1 month ago

  • Status changed from Feedback to Resolved

Including the potential upstream improvements and additions to openQA docs, I think the host is looking good at this point. And of course thanks to tinita especially.
