



action #63874


ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures

Added by okurz almost 5 years ago. Updated almost 4 years ago.




Whenever we reduce "numofworkers" in salt pillars, the systemd services of the surplus openQA worker instances are not disabled and/or not stopped. This can cause multiple problems, e.g. worker instances left without a valid configuration or without tap devices, see #62853
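
To illustrate the mismatch described above, here is a minimal sketch that works out which worker instances are left over after "numofworkers" has been reduced. It assumes instances are numbered 1..numofworkers; the function name and the sample numbers are hypothetical and not taken from the salt states.

```python
def stale_instances(running, numofworkers):
    """Return the instance numbers that are running but no longer configured.

    Assumption: configured instances are numbered 1..numofworkers.
    """
    configured = set(range(1, numofworkers + 1))
    return sorted(set(running) - configured)


# Hypothetical example: four instances are still running after the pillar
# value has been reduced to 2.
print(stale_instances([1, 2, 3, 4], 2))
```

Every number returned here corresponds to a service instance that would have to be stopped and disabled to match the pillar again.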

Related issues 4 (0 open, 4 closed)

Related to openQA Project (public) - coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons (Resolved, okurz, 2020-04-01, 2020-09-30)

Related to openQA Project (public) - action #66376: MM tests fail in obscure way when tap device is not present (Resolved, okurz, 2020-05-04)

Has duplicate openQA Tests (public) - action #66907: Multimachine test fails in setup for ARM workers (Rejected, okurz, 2020-05-15)

Copied from openQA Tests (public) - action #63853: [tools] broken /etc/sysconfig/network/ifcfg-br1 (Resolved, okurz, 2020-02-26)

Actions #1

Updated by okurz almost 5 years ago

  • Copied from action #63853: [tools] broken /etc/sysconfig/network/ifcfg-br1 added
Actions #2

Updated by pcervinka almost 5 years ago

  • Blocks action #66907: Multimachine test fails in setup for ARM workers added
Actions #3

Updated by okurz over 4 years ago

  • Subject changed from ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars to ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures
Actions #4

Updated by okurz over 4 years ago

  • Blocks deleted (action #66907: Multimachine test fails in setup for ARM workers)
Actions #5

Updated by okurz over 4 years ago

  • Has duplicate action #66907: Multimachine test fails in setup for ARM workers added
Actions #6

Updated by sebchlad over 4 years ago

Just to make it clear I'm also adding the message as in poo#66907#note-10: 'And in the meantime I got access to OSD workers, so I will try to help by maintaining ARM workers and when needed, I will mask unwanted workers which should not be there or restart the network interfaces etc.'

Actions #7

Updated by okurz over 4 years ago

  • Target version set to Ready
Actions #8

Updated by okurz over 4 years ago

  • Tags changed from caching, openQA, sporadic, arm, ipmi, worker to worker
Actions #9

Updated by okurz over 4 years ago

  • Related to coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons added
Actions #10

Updated by okurz over 4 years ago

  • Related to action #66376: MM tests fail in obscure way when tap device is not present added
Actions #11

Updated by okurz about 4 years ago

  • Target version changed from Ready to future
Actions #13

Updated by mkittler almost 4 years ago

I'm wondering why the existing code doesn't already cover this. It looks like it should do exactly what the ticket asks for. The code has already been present for 2 years.

That's maybe because the worker target still gets started¹ and it simply pulls in as many worker slots as there are pool directories. So the mentioned salt code might work, but its effect could be negated again by starting the worker target. Note that the number of worker slots to pull in is determined by a systemd generator which checks for the pool directories present under /var/lib/openqa/pool.

¹ It shouldn't be started anymore as it is disabled and no dependencies seem to pull it in. It nevertheless gets started and I still have to find out why.
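
The pool-directory check mentioned above can be sketched roughly as follows. This is not the actual systemd generator, only an illustration of the idea; that only purely numeric directory names count as worker slots is an assumption, and count_pool_slots is a hypothetical name.

```python
import os
import tempfile


def count_pool_slots(pool_root):
    """Count subdirectories of pool_root whose names are purely numeric.

    Assumption: each numeric directory stands for one worker slot.
    """
    with os.scandir(pool_root) as entries:
        return sum(1 for e in entries if e.is_dir() and e.name.isdigit())


# Demo against a throwaway directory instead of the real /var/lib/openqa/pool.
with tempfile.TemporaryDirectory() as root:
    for name in ("1", "2", "3", "cache"):
        os.mkdir(os.path.join(root, name))
    print(count_pool_slots(root))
```

If the slot count is derived this way, a leftover pool directory is enough to bring a worker slot back, regardless of what "numofworkers" says.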

Actions #14

Updated by mkittler almost 4 years ago

  • Assignee set to mkittler

After removing the worker target this might even work.

I can try to activate an additional worker slot somewhere and check whether it'll be stopped and disabled on the next salt run.

Enabled/started openqa-worker-auto-restart@42 on openqaworker-arm-1. It should be disabled/stopped automatically on the next salt run.

Actions #15

Updated by mkittler almost 4 years ago

Actions #16

Updated by mkittler almost 4 years ago

  • Status changed from New to Resolved

The SR has been merged and it works now, e.g. running salt -l debug state.sls_id stop_and_disable_all_not_configured_workers openqa.worker on OSD stops and disables openqa-worker-auto-restart@42 on openqaworker-arm-1 and also doesn't cause any problems if there aren't any workers to stop. (Works also when applying everything via salt state.apply.)
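
For readers without access to the salt states, the effect described above can be approximated with the sketch below. It is not the merged implementation, whose contents are not shown in this ticket. It only builds the systemctl calls for openqa-worker-auto-restart@N units whose instance number exceeds "numofworkers" and does not execute anything; the function name and the example values are hypothetical.

```python
import re

# Unit naming as used in this ticket, e.g. openqa-worker-auto-restart@42
UNIT_RE = re.compile(r"^openqa-worker-auto-restart@(\d+)\.service$")


def commands_for_unconfigured(unit_names, numofworkers):
    """Build 'systemctl disable --now' calls for units beyond numofworkers.

    The commands are returned as argument lists and are not run here.
    """
    commands = []
    for name in unit_names:
        match = UNIT_RE.match(name)
        if match and int(match.group(1)) > numofworkers:
            commands.append(["systemctl", "disable", "--now", name])
    return commands


# Hypothetical input: two configured slots plus the manually enabled slot 42.
units = [
    "openqa-worker-auto-restart@1.service",
    "openqa-worker-auto-restart@2.service",
    "openqa-worker-auto-restart@42.service",
]
for cmd in commands_for_unconfigured(units, 2):
    print(" ".join(cmd))
```

An empty result means there is nothing to stop, which matches the case where no surplus workers exist.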

