action #157726
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org)
Start date: 2024-03-18
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2415705
worker37.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker36.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker38.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker39.oqa.prg2.suse.org:
Minion did not return. [Not connected]
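To see at a glance which hosts are affected, the salt output above could be filtered for the "Minion did not return" marker. The following is only a sketch: `unresponsive_minions` is a hypothetical helper name, and it assumes the output has the shape shown above, a host name line ending in `:` followed by a status line.

```shell
# Hypothetical helper (not part of salt or osd-deployment): read salt output
# on stdin and print every host whose status line says the minion did not return.
unresponsive_minions() {
    awk '
        # A non-indented line ending in ":" is treated as a host name line.
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # A following "did not return" line marks that host as unresponsive.
        /Minion did not return/ && host != "" { print host; host = "" }
    '
}
```

Saving the job log to a file and running `unresponsive_minions < job.log` (the file name is an assumption) would list one host per line.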
Acceptance criteria
- AC1: osd-deployment passes again
Suggestions
- DONE Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- DONE Remove machine XYZ from production
ssh osd "sudo salt-key -y -d XYZ"
- Retrigger failed osd deployment CI pipeline
- Confirm if this is one or multiple, possibly already known issues
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure machines are back in production
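The removal step above uses the placeholder XYZ. As a sketch, the same command can be generated for each of the four affected workers. `removal_commands` is a hypothetical helper that only prints the commands for review and does not run them; the host names are taken from the ticket title and the observation.

```shell
# Hypothetical helper: print (do not execute) the salt-key removal command
# from the suggestion above for every host given as an argument.
removal_commands() {
    for host in "$@"; do
        printf 'ssh osd "sudo salt-key -y -d %s"\n' "$host"
    done
}

# Print the commands for the four workers named in this ticket.
removal_commands \
    worker36.oqa.prg2.suse.org \
    worker37.oqa.prg2.suse.org \
    worker38.oqa.prg2.suse.org \
    worker39.oqa.prg2.suse.org
```

Printing first keeps the destructive `salt-key -d` call a deliberate step: the lines can be checked and then pasted into a shell one by one.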
Rollback steps
- https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
for i in 36 37 38 39 ; do sudo salt-key -y -a worker$i.oqa.prg2.suse.org; done && sleep 30 && for i in 36 37 38 39 ; do sudo salt --state-output=changes "worker$i*" state.apply; done
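The rollback one-liner can also be written as a small function with a dry-run switch, so the sequence can be reviewed before anything is changed. This is a sketch only: `run`, `bring_back_workers`, and the `DRY_RUN` variable are hypothetical names, not part of salt. The salt commands themselves are the ones from the one-liner above.

```shell
# Hypothetical wrapper: with DRY_RUN=1 (the default here) print the command,
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Accept the salt keys of the given worker numbers, wait, then apply the
# salt states, stopping at the first command that fails.
bring_back_workers() {
    for i in "$@"; do
        run sudo salt-key -y -a "worker$i.oqa.prg2.suse.org" || return 1
    done
    run sleep 30 || return 1
    for i in "$@"; do
        run sudo salt --state-output=changes "worker$i*" state.apply || return 1
    done
}
```

`bring_back_workers 36 37 38 39` would print the planned commands, and `DRY_RUN=0 bring_back_workers 36 37 38 39` would execute them on the salt master.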
Updated by okurz about 1 month ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by okurz about 1 month ago
- Related to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added
Updated by okurz about 1 month ago
- Description updated (diff)
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
- Target version changed from Ready to future
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2415675 continued
blocked on #157666
Updated by okurz about 1 month ago
- Related to coordination #157669: websockets+scheduler improvements added