action #106543
closed coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service
Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:M
0%
Description
Acceptance criteria
- AC1: No masked worker instances on any of our OSD ppc workers
- AC2: No alerts relating to ppc job queue or other failures related to ppc machines
Suggestions
- Conduct the rollback steps described in the epic
- Crosscheck that all ppc64 OSD worker instances are fully online and are able to work on openQA jobs
- Monitor some openQA tests running on these instances, e.g. via https://openqa.suse.de
- Monitor https://monitor.qa.suse.de for related failures
- Ensure that there are no paused alerts relating to ppc that we had previously disabled
- Optional: Crosscheck with EngInfra that all ppc machines are actually connected to rack switches and no longer to core switches
- Read all comments in the epic to make sure we haven't overlooked something
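AC1 can be checked by filtering the output of systemctl list-unit-files for masked worker units. The following is a minimal sketch, not the team's actual procedure; the helper name masked_worker_units is hypothetical, and it assumes the worker units all start with "openqa-worker".

```shell
# Hypothetical helper: read `systemctl list-unit-files` output on stdin
# and print every unit that is masked and whose name starts with
# "openqa-worker". Column 1 is the unit file, column 2 is the state.
masked_worker_units() {
    awk '$2 == "masked" && $1 ~ /^openqa-worker/ { print $1 }'
}

# Possible usage on a single worker (prints nothing if AC1 holds):
#   systemctl list-unit-files --state=masked --no-legend | masked_worker_units
```

An empty result on every ppc64le worker would indicate that no worker instance is masked anymore.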
Updated by livdywan almost 3 years ago
- Subject changed from Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" to Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review almost 3 years ago
- Due date set to 2022-02-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih almost 3 years ago
- Status changed from In Progress to Feedback
I think I've reverted everything, and the machines powerqaworker-qam-1.qa.suse.de, QA-Power8-5.qa.suse.de and QA-Power8-4.qa.suse.de are back in production.
There were also some mistyped masks on powerqaworker-qam-1, which I've removed as well:
UNIT FILE                                  STATE
apparmor.service                           masked
openqa-worker-auto-restart1.service        masked
openqa-worker-auto-restart2.service        masked
openqa-worker-auto-restart3.service        masked
openqa-worker-auto-restart4.service        masked
openqa-worker-auto-restart5.service        masked
openqa-worker-auto-restart6.service        masked
openqa-worker-auto-restart\x2a.service     masked
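The names in this listing look like mistyped instances of the openqa-worker-auto-restart@.service template: the "@" separator is missing, and "\x2a" is the systemd escape for a literal "*". A rough sketch for spotting such entries follows; the helper name typoed_worker_masks is hypothetical, and it assumes that valid instances are always named openqa-worker-auto-restart@<number>.service.

```shell
# Hypothetical helper: read `systemctl list-unit-files` output on stdin
# and print masked units that start with "openqa-worker-auto-restart"
# but are NOT of the assumed valid form
# "openqa-worker-auto-restart@<number>.service".
typoed_worker_masks() {
    awk '$2 == "masked" &&
         $1 ~ /^openqa-worker-auto-restart/ &&
         $1 !~ /^openqa-worker-auto-restart@[0-9]+\.service$/ { print $1 }'
}

# Possible usage; each reported name could then be passed, single-quoted,
# to `systemctl unmask`:
#   systemctl list-unit-files --state=masked --no-legend | typoed_worker_masks
```

Single-quoting matters when unmasking, since the shell would otherwise interpret the backslash in a name such as openqa-worker-auto-restart\x2a.service.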
Updated by okurz almost 3 years ago
I found that qa-power8-4, qa-power8-5 and powerqaworker-qam-1 had still not been included in the accepted salt keys. I accepted them with for i in QA-Power8-4-kvm.qa.suse.de QA-Power8-5-kvm.qa.suse.de powerqaworker-qam-1.qa.suse.de; do sudo salt-key -y -a $i ; done
and then ran sudo salt --no-color --state-output=changes -C 'G@roles:worker and G@osarch:ppc64le' state.apply
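To find machines whose keys are not yet accepted, one could compare a list of expected minion IDs against the accepted ones. This is only a sketch: the helper name missing_minions and both input files are hypothetical, and producing the accepted list (for example from the output of salt-key --list=accepted) is left out here.

```shell
# Hypothetical helper: print every minion ID listed in the "expected"
# file ($1) that does not appear as an exact line in the "accepted"
# file ($2). Both files contain one minion ID per line.
missing_minions() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits non-zero when nothing is missing, so mask that status.
    grep -vxFf "$2" "$1" || true
}

# Possible usage, accepting each missing key as in the comment above:
#   for i in $(missing_minions expected.txt accepted.txt); do
#       sudo salt-key -y -a "$i"
#   done
```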
Updated by okurz almost 3 years ago
- Status changed from Resolved to Feedback
We are far from done, as the machines are still on openSUSE Leap 15.2.
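The installed release can be confirmed per machine by reading VERSION_ID from the os-release file. A minimal sketch, with the hypothetical helper name os_version_id; across all workers the salt osrelease grain used in the targeting expressions in this ticket would presumably give the same information.

```shell
# Hypothetical helper: print VERSION_ID from an os-release style file.
# $1 is the path to the file and defaults to /etc/os-release.
os_version_id() {
    # Source the file in a subshell so its variables do not leak
    # into the calling shell.
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$VERSION_ID" )
}

# Possible usage:
#   [ "$(os_version_id)" = "15.3" ] || echo "upgrade still pending"
```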
Updated by okurz almost 3 years ago
- Status changed from Feedback to In Progress
- Assignee changed from kraih to okurz
Executing
sudo salt --no-color --state-output=changes -C 'G@roles:worker and G@osarch:ppc64le and G@osrelease:15.2' cmd.run 'sed -i "s@openSUSE_Leap_\$releasever@\$releasever@" /etc/zypp/repos.d/NPI.repo && zypper -n --releasever=15.3 ref && zypper -n --releasever=15.3 dup --auto-agree-with-licenses --replacefiles --download-in-advance && reboot'
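The sed expression in this command rewrites the repository path so that the literal text "openSUSE_Leap_$releasever" becomes just "$releasever", which zypper then expands to the value given with --releasever. The effect can be tried on a sample line without touching any repo file. This is a sketch only: the helper name fix_repo_releasever and the example URL are made up for illustration and are not the real contents of NPI.repo.

```shell
# Hypothetical helper: apply the same substitution as in the command
# above to stdin instead of editing a repo file in place.
# "\$" matches a literal dollar sign in the pattern; in the replacement
# "$" is literal as well, so the zypper variable is kept unexpanded.
fix_repo_releasever() {
    sed 's@openSUSE_Leap_\$releasever@$releasever@'
}

# Example with a made-up baseurl line:
#   printf '%s\n' 'baseurl=http://example.invalid/repo/openSUSE_Leap_$releasever/' \
#       | fix_repo_releasever
```

Using "@" as the sed delimiter avoids having to escape the slashes that appear in repository URLs.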
Updated by okurz almost 3 years ago
- Status changed from In Progress to Resolved
- Assignee changed from okurz to kraih
I did another run of sudo salt --no-color --state-output=changes -C 'G@roles:worker and G@osarch:ppc64le' state.apply
and all looks good now. I checked the jobs currently running on those machines on openqa.suse.de, and the machines are happily working on jobs. Back to the original assignee and resolved.