action #106666

closed

Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de"

Added by nicksinger about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-02-11
Due date:
% Done: 0%
Estimated time:

Description

Motivation

It can happen that we disable single worker instances on openQA workers (e.g. https://progress.opensuse.org/issues/106257#note-9). If we use the mask approach, our deployment pipeline fails because our states try to start every worker instance configured in the "numofworkers" field (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L44). This happens here: https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls#L190-194
So even commenting out the affected instances wouldn't work.

Suggestions

The following flow would allow us to just comment out instances in addition to masking them manually:

  1. Iterate over every key for each worker (https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L52) and use their instance number to explicitly start them
  2. Take the last explicitly defined instance number, subtract it from "numofworkers", and start only the remaining instances
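
The two steps above could be sketched roughly as follows. This is only an illustration in Python, not the actual Jinja in our states: the function names and the way the explicit instance numbers are passed in are made up for this example, and it assumes that the "remaining" instances are the ones numbered after the last explicitly defined one.

```python
def instances_to_start(numofworkers, explicit_instances):
    """Return the worker instance numbers to start (hypothetical helper).

    numofworkers: total number of instances configured for the host.
    explicit_instances: instance numbers that have their own key in the
    worker config; a commented-out instance is simply absent from it.
    """
    explicit = sorted(set(explicit_instances))

    # Step 1: explicitly start every instance that has its own key.
    to_start = list(explicit)

    # Step 2: subtract the last explicit instance number from numofworkers
    # and start only that many instances, numbered after the last explicit one.
    last = explicit[-1] if explicit else 0
    remaining = max(numofworkers - last, 0)
    to_start.extend(range(last + 1, last + remaining + 1))
    return to_start


def unit_names(instances):
    """Map instance numbers to openqa-worker-auto-restart unit names."""
    return [f"openqa-worker-auto-restart@{n}.service" for n in instances]


if __name__ == "__main__":
    # Instance 3 is commented out, so it is skipped.
    print(unit_names(instances_to_start(6, [1, 2, 4])))
```

With "numofworkers" set to 6 and explicit keys for instances 1, 2 and 4, this selects instances 1, 2, 4, 5 and 6. In this sketch an instance without its own key that lies after the last explicit one would still be started, so it cannot be skipped by commenting alone.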

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #106832: Monitor masked units on our infrastructure (Resolved, okurz, 2022-02-15)
Has duplicate openQA Infrastructure - action #106753: openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de (Rejected, 2022-02-14)

Actions #1

Updated by okurz about 2 years ago

  • Priority changed from Normal to Low
  • Target version set to future
Actions #2

Updated by nicksinger about 2 years ago

  • Has duplicate action #106753: openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de added
Actions #3

Updated by okurz about 2 years ago

  • Subject changed from Improve worker startup in our salt states to Improve worker startup in our salt states or "openqa-worker-auto-restart repeatedly failing on grenache-1.qa.suse.de"
  • Priority changed from Low to High
  • Target version changed from future to Ready
Actions #4

Updated by nicksinger about 2 years ago

  • Status changed from New to Feedback
  • Assignee set to nicksinger

My thinking in the initial suggestion wasn't right: it can cause problems if workers "in the middle" of our workers list (in the states) are masked. But working through it allowed me to come up with a pretty clean solution in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/651

Actions #5

Updated by nicksinger about 2 years ago

  • Status changed from Feedback to Resolved
Actions #6

Updated by okurz about 2 years ago

That looks good, thank you! I just wonder what we could use as an "alert" that some units might still be masked, which may be something that was forgotten after manual work. Salt now does not help us bring back services that are intended to run but currently don't.

Actions #7

Updated by nicksinger about 2 years ago

  • Related to action #106832: Monitor masked units on our infrastructure added
Actions #8

Updated by nicksinger about 2 years ago

okurz wrote:

That looks good, thank you! I just wonder what we could use as an "alert" that some units might still be masked, which may be something that was forgotten after manual work. Salt now does not help us bring back services that are intended to run but currently don't.

I've noted this down in https://progress.opensuse.org/issues/106832
