action #106543

closed

coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service

Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:M

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2022-02-10
Due date:
% Done: 0%
Estimated time:
Description

Acceptance criteria

  • AC1: No masked worker instances on all our OSD ppc workers
  • AC2: No alerts relating to ppc job queue or other failures related to ppc machines

Suggestions

  • Conduct the rollback steps described in the epic
  • Crosscheck that all ppc64 OSD worker instances are fully online and are able to work on openQA jobs
  • Monitor some openQA tests running on these instances, e.g. over https://openqa.suse.de
  • Monitor https://monitor.qa.suse.de for related failures
  • Ensure that there are no paused alerts relating to ppc that we had previously disabled
  • Optional: Crosscheck with EngInfra that all ppc machines are actually connected to rack switches and not anymore to core switches
  • Read all comments in the epic to make sure we haven't overlooked something
Actions #1

Updated by livdywan almost 3 years ago

  • Subject changed from Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" to Conduct rollback steps and check impact for "All OSD PPC64LE workers except malbec appear to have horribly broken cache service" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by kraih almost 3 years ago

  • Assignee set to kraih
Actions #3

Updated by kraih almost 3 years ago

  • Status changed from Workable to In Progress
Actions #4

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-02-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by kraih almost 3 years ago

  • Status changed from In Progress to Feedback

I think I've reverted everything, and the machines powerqaworker-qam-1.qa.suse.de, QA-Power8-5.qa.suse.de and QA-Power8-4.qa.suse.de are back in production.

There were also some mistyped masks on powerqaworker-qam-1, which I've removed as well:

UNIT FILE                              STATE
apparmor.service                       masked
openqa-worker-auto-restart1.service    masked
openqa-worker-auto-restart2.service    masked
openqa-worker-auto-restart3.service    masked
openqa-worker-auto-restart4.service    masked
openqa-worker-auto-restart5.service    masked
openqa-worker-auto-restart6.service    masked
openqa-worker-auto-restart\x2a.service masked
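
For reference, a minimal sketch of how a listing like the one above could be filtered for AC1. The helper name `masked_worker_units` is hypothetical and not part of any openQA tooling; it only assumes the two-column `systemctl list-unit-files` layout shown above.

```shell
# Hypothetical helper: read `systemctl list-unit-files` output on stdin and
# print only the openqa-worker units whose state column is "masked".
masked_worker_units() {
    awk '$2 == "masked" && $1 ~ /^openqa-worker/ { print $1 }'
}

# Possible usage on a worker host (not run here):
#   systemctl list-unit-files --no-legend --state=masked | masked_worker_units
# Each unit printed could then be unmasked with:
#   sudo systemctl unmask <unit>
```

Empty output would mean no masked worker units remain on that host. Unrelated masks such as apparmor.service are deliberately ignored by the filter.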
Actions #6

Updated by kraih almost 3 years ago

  • Status changed from Feedback to Resolved
Actions #7

Updated by okurz almost 3 years ago

  • Due date deleted (2022-02-26)
Actions #8

Updated by okurz almost 3 years ago

I found that qa-power8-4, qa-power8-5 and powerqaworker-qam-1 had (still) not been included in the accepted salt keys. I have now run

for i in QA-Power8-4-kvm.qa.suse.de QA-Power8-5-kvm.qa.suse.de powerqaworker-qam-1.qa.suse.de; do sudo salt-key -y -a $i ; done

and then

sudo salt --no-color --state-output=changes -C 'G@roles:worker and G@osarch:ppc64le' state.apply
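
A rough sketch of how the accepted keys could be cross-checked afterwards. The helper `missing_salt_keys` is hypothetical, and it assumes that `salt-key -l accepted` prints one minion ID per line below an "Accepted Keys:" header.

```shell
# Hypothetical helper: read `salt-key -l accepted` output on stdin and print
# every minion ID passed as an argument that is NOT in the accepted list.
missing_salt_keys() {
    accepted=$(cat)
    for id in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output
        printf '%s\n' "$accepted" | grep -Fxq -- "$id" || printf '%s\n' "$id"
    done
}

# Possible usage on the salt master (not run here):
#   sudo salt-key --no-color -l accepted | missing_salt_keys \
#       QA-Power8-4-kvm.qa.suse.de QA-Power8-5-kvm.qa.suse.de powerqaworker-qam-1.qa.suse.de
```

No output would mean all listed minions have accepted keys.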

Actions #9

Updated by okurz almost 3 years ago

  • Status changed from Resolved to Feedback

We are actually far from done, as the machines are still on openSUSE Leap 15.2.

Actions #10

Updated by okurz almost 3 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from kraih to okurz

Executing

sudo salt --no-color --state-output=changes -C 'G@roles:worker and G@osarch:ppc64le and G@osrelease:15.2' cmd.run 'sed -i "s@openSUSE_Leap_\$releasever@\$releasever@" /etc/zypp/repos.d/NPI.repo && zypper -n --releasever=15.3 ref && zypper -n --releasever=15.3 dup --auto-agree-with-licenses --replacefiles --download-in-advance && reboot'
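
To illustrate what the sed expression in that command does, here is a small sketch run against a made-up line. The baseurl is hypothetical, since the real contents of /etc/zypp/repos.d/NPI.repo are not shown in this ticket.

```shell
# Hypothetical repo line; only the "openSUSE_Leap_$releasever" part matters.
sample='baseurl=https://example.invalid/repo/openSUSE_Leap_$releasever/'

# Same expression as in the command above: the escaped \$ keeps the shell from
# expanding $releasever, and sed treats a "$" in mid-pattern as a literal.
rewritten=$(printf '%s\n' "$sample" | sed "s@openSUSE_Leap_\$releasever@\$releasever@")

printf '%s\n' "$rewritten"
# prints: baseurl=https://example.invalid/repo/$releasever/
```

With the prefix removed, the path component is whatever zypper substitutes for $releasever, which the command sets to 15.3 via --releasever=15.3.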
Actions #11

Updated by okurz almost 3 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from okurz to kraih

After another run of sudo salt --no-color --state-output=changes -C 'G@roles:worker and G@osarch:ppc64le' state.apply, all looks good now. I checked the jobs currently running on those machines on openqa.suse.de, and the machines are happily working on jobs. Back to the original assignee and resolved.
