action #106666
closed
Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de"
Description
Motivation
Sometimes we disable single worker instances on openQA workers (e.g. https://progress.opensuse.org/issues/106257#note-9). If we use the mask approach, our deployment pipeline fails because our states try to start every worker instance configured in the "numofworkers" field (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L44). This happens here: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L190-194
So even commenting out the affected instances wouldn't work.
Suggestions
The following flow would allow us to just comment out instances in addition to masking them manually:
- Iterate over every key for each worker (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L52) and use their instance number to explicitly start them
- Take the last, explicitly defined instance number, subtract it from "numofworkers", start only the remaining instances
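The two steps above can be sketched roughly as follows. This is only one reading of the suggestion, not the code from our states: the function names, the shape of the input (a plain list of explicitly defined instance numbers) and the unit name template "openqa-worker-auto-restart@N.service" are assumptions made for illustration.

```python
def instances_to_start(numofworkers, explicit_instances):
    """Return the instance numbers to start, following the two-step flow.

    numofworkers: total count from the pillar field of the same name.
    explicit_instances: instance numbers that have their own key in the
    worker config (assumed input shape; commented-out ones are absent).
    """
    # Step 1: every explicitly defined instance is started by its number.
    explicit = sorted(set(explicit_instances))
    # Step 2: subtract the last explicit number from numofworkers and
    # start only that many instances following it.
    last = explicit[-1] if explicit else 0
    remaining = max(numofworkers - last, 0)
    implicit = [last + offset for offset in range(1, remaining + 1)]
    return explicit + implicit


def unit_names(instances, template="openqa-worker-auto-restart@{}.service"):
    """Map instance numbers to systemd unit names (template is an assumption)."""
    return [template.format(number) for number in instances]


if __name__ == "__main__":
    # Instance 3 is commented out in the config, so it is never started.
    print(unit_names(instances_to_start(6, [1, 2, 4])))
```

With "numofworkers" set to 6 and instances 1, 2 and 4 defined explicitly, the sketch starts 1, 2, 4, 5 and 6 and leaves the commented-out instance 3 alone.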
Updated by okurz almost 3 years ago
- Priority changed from Normal to Low
- Target version set to future
Updated by nicksinger almost 3 years ago
- Has duplicate action #106753: openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de added
Updated by okurz almost 3 years ago
- Subject changed from Improve worker startup in our salt states to Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de"
- Priority changed from Low to High
- Target version changed from future to Ready
Updated by nicksinger almost 3 years ago
- Status changed from New to Feedback
- Assignee set to nicksinger
My thinking in the initial suggestion wasn't right. It can cause problems if workers "in the middle" of our workers list (in the states) are masked. But this allowed me to come up with a pretty clean solution in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/651
Updated by nicksinger almost 3 years ago
- Status changed from Feedback to Resolved
With that merged and a minor whitespace fix (https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/652) we have working deployments back \o/ https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/320133
Updated by okurz almost 3 years ago
That looks good, thank you! I just wonder what we could use as an "alert" that some units might still be masked, which may be something that was forgotten after manual work. Salt now no longer helps us to bring back services that are intended to run but currently don't
Updated by nicksinger almost 3 years ago
- Related to action #106832: Monitor masked units on our infrastructure added
Updated by nicksinger almost 3 years ago
okurz wrote:
That looks good, thank you! I just wonder what we could use as an "alert" that some units might still be masked, which may be something that was forgotten after manual work. Salt now no longer helps us to bring back services that are intended to run but currently don't
I've noted this down in https://progress.opensuse.org/issues/106832
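As a rough idea of what such a check could look like (not what the related ticket implements), the sketch below lists masked units by parsing the output of "systemctl list-unit-files --state=masked --no-legend". The function names are hypothetical and the parsing assumes the usual whitespace-separated columns with the unit file name first and its state second.

```python
import subprocess


def parse_masked_units(listing):
    """Extract unit names whose state column reads 'masked'.

    listing: text as printed by 'systemctl list-unit-files --no-legend'
    (assumed layout: unit file name, state, optional further columns).
    """
    masked = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "masked":
            masked.append(fields[0])
    return masked


def list_masked_units():
    """Ask systemd for masked unit files on the local host."""
    result = subprocess.run(
        ["systemctl", "list-unit-files", "--state=masked", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_masked_units(result.stdout)


if __name__ == "__main__":
    for unit in list_masked_units():
        print(f"masked: {unit}")
```

An alert could then fire whenever the returned list contains worker units that are expected to run.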