ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures
Whenever we reduce "numofworkers" in salt pillars, the systemd services of the openQA worker instances that are no longer configured are not disabled and/or not stopped. This can cause multiple problems, e.g. worker instances left without a valid configuration or without tap devices, see #62853
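To illustrate the intended behavior, here is a minimal sketch that picks the worker units to stop and disable for a given "numofworkers". It is not the salt state from the repository. It assumes worker slots run as systemd template instances named openqa-worker@<N>.service, and the function names are made up for this example.

```python
import re

# Assumption: worker slots are systemd template instances named
# "openqa-worker@<N>.service"; adjust the pattern if the units differ.
UNIT_PATTERN = re.compile(r"^openqa-worker@(\d+)\.service$")


def surplus_worker_units(active_units, numofworkers):
    """Return worker units whose instance number exceeds numofworkers.

    Units that do not match the worker pattern are ignored. The result is
    sorted by instance number.
    """
    surplus = []
    for unit in active_units:
        match = UNIT_PATTERN.match(unit)
        if match and int(match.group(1)) > numofworkers:
            surplus.append((int(match.group(1)), unit))
    return [unit for _, unit in sorted(surplus)]


def stop_and_disable_commands(active_units, numofworkers):
    """Build (but do not run) one systemctl command per surplus unit."""
    return [
        ["systemctl", "disable", "--now", unit]
        for unit in surplus_worker_units(active_units, numofworkers)
    ]
```

The commands are only built, not executed, so the selection can be checked before anything is stopped on a worker host.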
- Subject changed from ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars to ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures
Just to make it clear I'm also adding the message as in poo#66907#note-10: 'And in the meantime I got access to OSD workers, so I will try to help by maintaining ARM workers and when needed, I will mask unwanted workers which should not be there or restart the network interfaces etc.'
I'm wondering why the existing code doesn't already cover https://progress.opensuse.org/issues/63874. It looks like it should do exactly what the ticket asks for. The code has already been present for 2 years: https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/e80327e29fce8f6f39051167d389c3cf44099a45
That's maybe because openqa-worker.target still gets started¹ and it simply pulls in as many worker slots as there are pool directories. So the mentioned salt code might work, but its effect could be undone again by starting openqa-worker.target. Note that the number of worker slots for openqa-worker.target to pull in is determined by running a systemd generator which checks for the pool directories present under
¹ It shouldn't be started anymore as it is disabled and no dependencies seem to pull it in. It nevertheless gets started and I still have to find out why.
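To check a host for this situation, a rough sketch like the following could list the slots that the target would still pull in. It is not the actual systemd generator. It assumes the pool directories are numerically named subdirectories of a base directory passed in as pool_base, and both that parameter and the function names are made up for this example.

```python
import re
from pathlib import Path


def worker_slots_from_pool(pool_base):
    """Return the sorted slot numbers found as numeric subdirectories.

    Assumption: each worker slot has a pool directory named after its
    instance number directly below pool_base.
    """
    base = Path(pool_base)
    if not base.is_dir():
        return []
    return sorted(
        int(entry.name)
        for entry in base.iterdir()
        if entry.is_dir() and re.fullmatch(r"[0-9]+", entry.name)
    )


def stale_slots(pool_base, numofworkers):
    """Return slots that have a pool directory but exceed numofworkers."""
    return [n for n in worker_slots_from_pool(pool_base) if n > numofworkers]
```

Any slot reported by stale_slots would come back whenever the target is started, regardless of what the salt state disabled before.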
#14 Updated by mkittler about 2 months ago
- Assignee set to mkittler
After removing the worker target this might even work: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/454
I can try to activate an additional worker slot somewhere and check whether it'll be stopped and disabled on the next salt run.
openqaworker-arm-1. It should be disabled/stopped automatically on the next salt run.
#15 Updated by mkittler about 2 months ago
It didn't work. See https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/455 for details and a fix.
#16 Updated by mkittler about 2 months ago
- Status changed from New to Resolved
The SR has been merged and it works now, e.g. running salt -l debug openqaworker-arm-1.suse.de state.sls_id stop_and_disable_all_not_configured_workers openqa.worker on OSD stops and disables openqaworker-arm-1 and also doesn't cause any problems if there aren't any workers to stop. (Works also when applying everything via salt openqaworker-arm-1.suse.de state.apply.)