action #157726
openQA Project (public) - coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org)
Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-18
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2415705
worker37.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker36.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker38.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker39.oqa.prg2.suse.org:
Minion did not return. [Not connected]
Acceptance criteria
- AC1: osd-deployment passes again
- AC2: All w37-w39 run OSD production jobs
Suggestions
- DONE: Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- DONE: Remove machine XYZ from production: `ssh osd "sudo salt-key -y -d XYZ"`
- Retrigger failed osd deployment CI pipeline
- Confirm if this is one or multiple, possibly already known issues
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure machines are back in production
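The removal step above can be sketched as a loop over the four workers named in this ticket. `print_removal_cmds` is a helper name introduced here for illustration; it only prints the `salt-key` commands so they can be reviewed first and then, if correct, piped to `ssh osd sh`:

```shell
#!/bin/sh
# Sketch: remove worker36-39 from salt-controlled production.
# Only prints the commands; review them, then run on OSD (or pipe to `ssh osd sh`).
print_removal_cmds() {
  for i in 36 37 38 39; do
    echo "sudo salt-key -y -d worker$i.oqa.prg2.suse.org"
  done
}

print_removal_cmds
```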
Rollback steps
- https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
`for i in 36 37 38 39; do sudo salt-key -y -a worker$i.oqa.prg2.suse.org; done && sleep 30 && for i in 36 37 38 39; do sudo salt --state-output=changes "worker$i*" state.apply; done`
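After re-adding the keys and applying the state, per-worker connectivity can be checked with salt's standard `test.ping` module. A sketch in the same style as above (`print_check_cmds` is a hypothetical helper; it only prints the commands, which would be run on OSD):

```shell
#!/bin/sh
# Sketch: generate per-worker salt connectivity checks after the rollback.
# Prints the commands only; run them on OSD to verify minions respond.
print_check_cmds() {
  for i in 36 37 38 39; do
    echo "sudo salt worker$i.oqa.prg2.suse.org test.ping"
  done
}

print_check_cmds
```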
Updated by okurz 9 months ago
- Related to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added
Updated by okurz 9 months ago
- Related to coordination #157669: websockets+scheduler improvements to support more online worker instances added
Updated by okurz 5 months ago
As discussed today between nicksinger and me, we can keep worker3[6-9] offline for the time being, especially as there is less load over the summer and we can save electrical energy. When more workers are needed, they can be brought back into production at any time.
Updated by okurz 3 months ago
- Related to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by okurz 3 months ago
- Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added