action #157726
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org)
Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-18
Due date:
% Done:
0%
Estimated time:
Tags:
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2415705
worker37.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker36.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker38.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker39.oqa.prg2.suse.org:
Minion did not return. [Not connected]
Acceptance criteria
- AC1: osd-deployment passes again
- AC2: All of w36-w39 run OSD production jobs
Suggestions
- DONE Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- DONE Remove machine XYZ from production
ssh osd "sudo salt-key -y -d XYZ"
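The single-host removal above can be batched over all four affected workers. A minimal dry-run sketch (assumption: to be executed on OSD; it only prints the `salt-key` commands so they can be reviewed before running):

```shell
# Build the key-removal commands for the four affected workers (dry run:
# the commands are printed, not executed).
cmds=$(for i in 36 37 38 39; do
  echo "sudo salt-key -y -d worker$i.oqa.prg2.suse.org"
done)
echo "$cmds"
```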
- Retrigger failed osd deployment CI pipeline
- Confirm if this is one or multiple, possibly already known issues
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure machines are back in production
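For the last step, a hedged verification sketch (assumption: run on OSD; hostnames taken from the pipeline output above; `DRY_RUN=1`, the default here, only prints the `salt` commands instead of executing them):

```shell
# Check that all four workers respond to salt before considering them
# back in production. Set DRY_RUN=0 on OSD to actually run the commands.
DRY_RUN=${DRY_RUN:-1}
checks=""
for i in 36 37 38 39; do
  host="worker$i.oqa.prg2.suse.org"
  cmd="sudo salt --timeout=30 '$host*' test.ping"
  if [ "$DRY_RUN" = 1 ]; then
    echo "$cmd"
    checks="$checks $host"
  else
    eval "$cmd"
  fi
done
```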
Rollback steps
- https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
for i in 36 37 38 39 ; do sudo salt-key -y -a worker$i.oqa.prg2.suse.org; done && sleep 30 && for i in 36 37 38 39 ; do sudo salt --state-output=changes "worker$i*" state.apply; done