action #99195
closed coordination #99183: [epic] Upgrade all our infrastructure, e.g. o3+osd workers+webui, to openSUSE Leap 15.3
Upgrade o3 webUI host to openSUSE Leap 15.3 size:M
Added by okurz about 3 years ago. Updated about 3 years ago.
0%
Description
Motivation
- Need to upgrade machines before EOL of Leap 15.2 and have a consistent environment
Acceptance criteria
- AC1: o3 webui host runs a clean upgraded openSUSE Leap 15.3 (no failed systemd services, no leftover .rpmnew files, etc.)
Suggestions
- read https://progress.opensuse.org/projects/openqav3/wiki#Distribution-upgrades
- Reserve some time when the instance is only executing a few or no openQA test jobs
- After the upgrade, reboot and check that everything works as expected
Out of scope
- Spawn a container instead of upgrading the host
Further details
- If we lose access to the machine we need the help of EngineeringInfrastructure as only they have access to the VM
Updated by okurz about 3 years ago
- Copied from action #75241: Upgrade o3 webUI host to openSUSE Leap 15.2 added
Updated by okurz about 3 years ago
- Subject changed from Upgrade o3 webUI host to openSUSE Leap 15.2 to Upgrade o3 webUI host to openSUSE Leap 15.3
- Assignee deleted (mkittler)
- Priority changed from High to Normal
- Start date deleted (2020-10-24)
Updated by livdywan about 3 years ago
- Subject changed from Upgrade o3 webUI host to openSUSE Leap 15.3 to Upgrade o3 webUI host to openSUSE Leap 15.3 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan about 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
It occurred to me that I can do this while poking at running tests, so I'm taking this now.
Updated by livdywan about 3 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
It occurred to me that I can do this while poking at running tests, so I'm taking this now.
Went through the upgrade as per the steps in the wiki, rebooted. Workers seem to have reconnected fine and jobs are running.
Updated by okurz about 3 years ago
rpmconfigcheck showed one file, /etc/postfix/master.cf.rpmnew, which I diffed with /etc/postfix/master.cf, took over some updates and then deleted the rpmnew file. I think the rest looks really good. Great work!
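For reference, the check described above can be sketched in plain shell. This is a find-based approximation and not rpmconfigcheck itself; the function names `find_rpm_leftovers` and `diff_rpmnew` are made up for illustration, and the set of suffixes searched (.rpmnew, .rpmsave, .rpmorig) is an assumption about what counts as a leftover.

```shell
#!/bin/sh
# List leftover package config files below a directory (default: /etc).
# Hypothetical helper, only approximates what rpmconfigcheck reports.
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \
        \( -name '*.rpmnew' -o -name '*.rpmsave' -o -name '*.rpmorig' \) \
        2>/dev/null | sort
}

# Print a unified diff between each .rpmnew file and the config it would replace.
# Nothing is modified; merging changes and deleting the leftover stays manual.
diff_rpmnew() {
    find_rpm_leftovers "${1:-/etc}" | while IFS= read -r leftover; do
        case "$leftover" in
            # diff exits with 1 when files differ, which is expected here
            *.rpmnew) diff -u "${leftover%.rpmnew}" "$leftover" || true ;;
        esac
    done
    return 0
}

# Example (run as root to see everything below /etc):
#   find_rpm_leftovers /etc
#   diff_rpmnew /etc
```

Keeping the diff read-only is deliberate: which lines to take over from the .rpmnew file is a judgment call per file.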
Updated by livdywan about 3 years ago
- Status changed from Feedback to Resolved
okurz wrote:
rpmconfigcheck showed one file, /etc/postfix/master.cf.rpmnew, which I diffed with /etc/postfix/master.cf, took over some updates and then deleted the rpmnew file. I think the rest looks really good. Great work!
Arg, so I missed something after all... Thank you for checking!
Updated by livdywan about 3 years ago
- Status changed from Resolved to Feedback
Apparently I missed something else, too:
Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.
Keeping in mind that osd uses UTC, this should fit into the time window of when I was wrapping up the upgrade as per my comment above. And there were apparmor changes, which I presumably didn't do correctly.
I also filed #99741 because this didn't trigger any alerts and was discovered by @tinita.
Updated by livdywan about 3 years ago
- Copied to action #99741: Minion jobs for job hooks failed silently on o3 size:M added
Updated by livdywan about 3 years ago
cdywan wrote:
Apparently I missed something else, too:
Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/FinalizeResults.pm line 63.
Keeping in mind that osd uses UTC, this should fit into the time window of when I was wrapping up the upgrade as per my comment above. And there were apparmor changes, which I presumably didn't do correctly.
I also filed #99741 because this didn't trigger any alerts and was discovered by @tinita.
I reset /etc/apparmor.d/local/usr.share.openqa.script.openqa to a comments-only file, which it should be after os-autoinst/openQA/pull/3847 and which I guess is what I mistook for something we needed to keep.
Updated by livdywan about 3 years ago
I still can't tell if the files in /etc/apparmor.d/{,local/}usr.share.openqa.script.openqa are correct. And I wasn't able to figure out how to access the most recent copy of /etc/apparmor.d/usr.share.openqa.script.openqa, if one exists, since the Backup section doesn't really explain that and it's pretty much Greek to me, if you'll excuse the pun.
Updated by livdywan about 3 years ago
For now I'm monitoring logs to see if the errors persist (via sudo journalctl -f -u openqa-gru). And I added the rule "/bin/sh mrix," to /etc/apparmor.d/usr.share.openqa.script.openqa.
Also tried sudo systemctl restart openqa-gru, to no apparent effect. Btw for reference o3 is on apparmor-profiles 2.13.6-1.31, as opposed to osd/2.13.4-lp152.2.3.1.
Updated by livdywan about 3 years ago
It would seem comparing to osd was pointless since, according to sudo aa-status, it's currently switched off there 🤦️
Trying to see now if sudo aa-complain /usr/share/openqa/script/openqa{,-cli} yields some more information here.
Updated by livdywan about 3 years ago
/opt/os-autoinst-scripts/openqa-label-known-issues: line 83: hxselect: command not found
grep: write error: Broken pipe
Not sure if these are related, but while I'm at it I'm installing html-xml-utils.
Btw I also created a proof of concept for dependencies.yaml in the scripts repo, although this will need a bit of polishing before it can be used: https://github.com/os-autoinst/scripts/pull/116
Updated by livdywan about 3 years ago
- Status changed from Feedback to Resolved
I'm assuming it's working now since I no longer see errors and I can see investigate jobs that spawned and finished successfully.
Updated by tinita about 3 years ago
- Status changed from Resolved to Feedback
cdywan wrote:
Trying to see now if sudo aa-complain /usr/share/openqa/script/openqa{,-cli} yields some more information here.
This sets it to complain mode, and any violations are just logged (https://wiki.ubuntu.com/DebuggingApparmor#Debugging_procedure)
So if you don't see error messages in the gru journal, that's because it's in complain mode (but I don't know where the "complaints" are actually going to).
So if you didn't do anything else, then this is not a fix.
Updated by tinita about 3 years ago
PR for apparmor profile fix: https://github.com/os-autoinst/openQA/pull/4271
In Leap 15.2, /bin/sh points to /bin/bash, while in 15.3 it points to /usr/bin/sh -> /usr/bin/bash
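To see which paths a profile may need to allow, it can help to print every hop of the symlink chain rather than only the final target. A minimal sketch, assuming a POSIX shell with `readlink`; the helper name `symlink_chain` is hypothetical, and it only follows links on the last path component (`readlink -f` from GNU coreutils gives the fully canonical path in one step):

```shell
#!/bin/sh
# Print each hop of a symlink chain, one path per line, ending at the
# first path that is not a symlink. Relative link targets are resolved
# against the directory containing the link. Gives up after 40 hops.
symlink_chain() {
    path=$1
    hops=0
    echo "$path"
    while [ -L "$path" ] && [ "$hops" -lt 40 ]; do
        target=$(readlink "$path")
        case "$target" in
            /*) path=$target ;;                      # absolute target
            *)  path=$(dirname "$path")/$target ;;   # relative target
        esac
        echo "$path"
        hops=$((hops + 1))
    done
}

# Example:
#   symlink_chain /bin/sh
```

Running it on both the old and the new system makes the difference described above visible without guessing which intermediate path AppArmor is matching against.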
Updated by livdywan about 3 years ago
tinita wrote:
PR for apparmor profile fix: https://github.com/os-autoinst/openQA/pull/4271
In Leap 15.2, /bin/sh points to /bin/bash, while in 15.3 it points to /usr/bin/sh -> /usr/bin/bash
I'm wondering how you confirmed that this worked, since I seem to have seen successfully executed hooks without any errors in the entire journal 🤔️
So I guess to resolve it for good I need to find out where the presumed missing error messages end up, and document it.
Updated by tinita about 3 years ago
Sorry, I forgot to add:
I did the mentioned fix locally (add /usr/bin/bash), and then did aa-enforce /usr/share/openqa/script/openqa to end the complain mode.
Then I saw successful hooks by looking into the minion_jobs table and I didn't see errors in the openqa-gru journal anymore.
Note that if apparmor is in complain mode, one is not supposed to see the error messages, but there will be messages in /var/log/audit/audit.log.
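A small filter can pull the relevant records out of the audit log. This is a sketch under assumptions: the helper name `apparmor_events` is hypothetical, it relies on AppArmor records carrying `apparmor="ALLOWED"` (complain mode) or `apparmor="DENIED"` (enforce mode) together with a `profile="..."` field, and it assumes auditd is running so that records land in /var/log/audit/audit.log rather than only in the kernel log.

```shell
#!/bin/sh
# Print AppArmor records for one profile from an audit log.
# Usage: apparmor_events PROFILE [LOGFILE]
# LOGFILE defaults to /var/log/audit/audit.log (usually readable by root only).
apparmor_events() {
    profile=$1
    logfile=${2:-/var/log/audit/audit.log}
    # First keep only AppArmor verdict records, then match the profile literally.
    grep -E 'apparmor="(ALLOWED|DENIED)"' "$logfile" \
        | grep -F "profile=\"$profile\""
}

# Example:
#   sudo sh -c '. ./apparmor_events.sh; apparmor_events /usr/share/openqa/script/openqa'
```

ALLOWED records are the ones to look at while a profile is in complain mode, since each of them would become a denial once the profile is enforced again.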
Today I saw new errors though:
/opt/os-autoinst-scripts/openqa-label-known-issues: line 83: /usr/bin/hxselect: Permission denied
PR for that: https://github.com/os-autoinst/openQA/pull/4273
Updated by tinita about 3 years ago
PR https://github.com/os-autoinst/openQA/pull/4273 merged, and I added the line manually on o3 to not wait until the next deployment.
Updated by livdywan about 3 years ago
Updated by livdywan about 3 years ago
cdywan wrote:
- [ ] Is there a feature request/bug on AppArmor wrt unclear error message?
https://gitlab.com/apparmor/apparmor/-/issues/201
- [ ] I'll try and propose documentation for how AppArmor is handled with openQA
Updated by livdywan about 3 years ago
- Status changed from Feedback to Resolved
Including the potential upstream improvements and additions to openQA docs, I think the host is looking good at this point. And of course thanks to @tinita especially.
Updated by okurz over 2 years ago
- Copied to action #111869: Upgrade o3 webUI host to openSUSE Leap 15.4 size:S added