openSUSE Project Management Tool
openQA Infrastructure - action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures
https://progress.opensuse.org/issues/63874?journal_id=281143 2020-02-26T21:07:26Z okurz okurz@suse.com
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-3 priority-lowest closed" href="/issues/63853">action #63853</a>: [tools] broken /etc/sysconfig/network/ifcfg-br1</i> added</li></ul>
https://progress.opensuse.org/issues/63874?journal_id=300739 2020-05-15T13:19:56Z pcervinka pcervinka@suse.com
<ul><li><strong>Blocks</strong> <i><a class="issue tracker-4 status-6 priority-4 priority-default closed" href="/issues/66907">action #66907</a>: Multimachine test fails in setup for ARM workers</i> added</li></ul>
https://progress.opensuse.org/issues/63874?journal_id=300847 2020-05-17T12:17:53Z okurz okurz@suse.com
<ul><li><strong>Subject</strong> changed from <i>ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars</i> to <i>ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures</i></li></ul>
https://progress.opensuse.org/issues/63874?journal_id=300853 2020-05-17T12:18:47Z okurz okurz@suse.com
<ul><li><strong>Blocks</strong> deleted (<i><a class="issue tracker-4 status-6 priority-4 priority-default closed" href="/issues/66907">action #66907</a>: Multimachine test fails in setup for ARM workers</i>)</li></ul>
https://progress.opensuse.org/issues/63874?journal_id=300862 2020-05-17T12:18:53Z okurz okurz@suse.com
<ul><li><strong>Has duplicate</strong> <i><a class="issue tracker-4 status-6 priority-4 priority-default closed" href="/issues/66907">action #66907</a>: Multimachine test fails in setup for ARM workers</i> added</li></ul>
https://progress.opensuse.org/issues/63874?journal_id=300928 2020-05-18T06:29:18Z sebchlad sebastian.chlad@suse.com
<ul></ul><p>To make it clear, I'm also adding the message from poo#66907#note-10: 'And in the meantime I got access to OSD workers, so I will try to help by maintaining ARM workers and when needed, I will mask unwanted workers which should not be there or restart the network interfaces etc.'</p>
https://progress.opensuse.org/issues/63874?journal_id=315375 2020-07-29T07:09:24Z okurz okurz@suse.com
<ul><li><strong>Target version</strong> set to <i>Ready</i></li></ul>
https://progress.opensuse.org/issues/63874?journal_id=316849 2020-08-06T11:35:07Z okurz okurz@suse.com
<ul><li><strong>Tags</strong> changed from <i>caching, openQA, sporadic, arm, ipmi, worker</i> to <i>worker</i></li></ul>
https://progress.opensuse.org/issues/63874?journal_id=317047 2020-08-07T09:10:29Z okurz okurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-6 status-3 priority-5 priority-high3 closed behind-schedule" href="/issues/65118">coordination #65118</a>: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons</i> added</li></ul>
https://progress.opensuse.org/issues/63874?journal_id=317053 2020-08-07T09:10:43Z okurz okurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-4 priority-default closed" href="/issues/66376">action #66376</a>: MM tests fail in obscure way when tap device is not present</i> added</li></ul>
https://progress.opensuse.org/issues/63874?journal_id=351568 2020-11-17T12:54:10Z okurz okurz@suse.com
<ul><li><strong>Target version</strong> changed from <i>Ready</i> to <i>future</i></li></ul>
https://progress.opensuse.org/issues/63874?journal_id=381761 2021-02-05T13:48:11Z mkittler marius.kittler@suse.com
<ul></ul><p>see <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/438#note_293207" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/438#note_293207</a></p>
https://progress.opensuse.org/issues/63874?journal_id=382618 2021-02-15T10:22:57Z mkittler marius.kittler@suse.com
<ul></ul><blockquote>
<p>I'm wondering why the existing code doesn't already cover <a href="https://progress.opensuse.org/issues/63874" class="external">https://progress.opensuse.org/issues/63874</a>. It looks like it should do exactly what the ticket asks for. The code has already been present for 2 years: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/e80327e29fce8f6f39051167d389c3cf44099a45" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/e80327e29fce8f6f39051167d389c3cf44099a45</a></p>
</blockquote>
<p>That may be because <code>openqa-worker.target</code> still gets started¹ and simply pulls in as many worker slots as there are pool directories. So the mentioned salt code might work, but its effect could be undone again by starting <code>openqa-worker.target</code>. Note that the number of worker slots that <code>openqa-worker.target</code> pulls in is determined by a systemd generator, which checks for the pool directories present under <code>/var/lib/openqa/pool</code>.</p>
<p>¹ It shouldn't be started anymore as it is disabled and no dependencies seem to pull it in. It nevertheless gets started and I still have to find out why.</p>
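<p>As a rough illustration of the generator behaviour described above, the following sketch derives one worker unit per pool directory. It is not the actual openQA generator: the function name <code>worker_units_from_pool</code>, the assumption that pool directories are named by their numeric slot, and the unit template <code>openqa-worker@N.service</code> are all assumptions made for this example.</p>

```python
import re
import tempfile
from pathlib import Path
from typing import List


def worker_units_from_pool(pool_root: Path,
                           template: str = "openqa-worker@{}.service") -> List[str]:
    """Return one unit name per numerically named directory under pool_root.

    Hypothetical helper: the real generator may use different naming rules.
    """
    slots = sorted(
        int(entry.name)
        for entry in pool_root.iterdir()
        if entry.is_dir() and re.fullmatch(r"[0-9]+", entry.name)
    )
    return [template.format(slot) for slot in slots]


def demo_units() -> List[str]:
    """Build a throwaway pool layout and list the units it would pull in."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Slot 42 stands for a leftover pool directory of a removed worker.
        for name in ("1", "2", "42", "lost+found"):
            (root / name).mkdir()
        return worker_units_from_pool(root)


if __name__ == "__main__":
    print(demo_units())
```

<p>In this model a leftover pool directory such as <code>42</code> is enough for the target to pull the slot back in, no matter what "numofworkers" says, which would match the symptom described in this comment.</p>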
https://progress.opensuse.org/issues/63874?journal_id=382972 2021-02-16T09:47:59Z mkittler marius.kittler@suse.com
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul><p>After removing the worker target this might even work: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/454" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/454</a></p>
<p>I can try to activate an additional worker slot somewhere and check whether it'll be stopped and disabled on the next salt run.</p>
<hr>
<p>Enabled/started <code>openqa-worker-auto-restart@42</code> on <code>openqaworker-arm-1</code>. It should be disabled/stopped automatically on the next salt run.</p>
https://progress.opensuse.org/issues/63874?journal_id=383947 2021-02-19T12:37:31Z mkittler marius.kittler@suse.com
<ul></ul><p>It didn't work. See <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/455" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/455</a> for details and a fix.</p>
https://progress.opensuse.org/issues/63874?journal_id=384019 2021-02-19T16:06:50Z mkittler marius.kittler@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li></ul><p>The SR has been merged and it works now, e.g. running <code>salt -l debug openqaworker-arm-1.suse.de state.sls_id stop_and_disable_all_not_configured_workers openqa.worker</code> on OSD stops and disables <code>openqa-worker-auto-restart@42</code> on <code>openqaworker-arm-1</code>. It also doesn't cause any problems if there aren't any workers to stop, and it works as well when applying everything via <code>salt openqaworker-arm-1.suse.de state.apply</code>.</p>
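<p>The selection that such a state has to make can be sketched as follows. This is not the code from the merge request: the function name <code>units_to_stop</code> and the assumption that configured slots are numbered 1 to "numofworkers" are hypothetical and only serve to illustrate the idea.</p>

```python
import re
from typing import Iterable, List

# Matches e.g. "openqa-worker-auto-restart@42" with or without ".service".
UNIT_PATTERN = re.compile(r"^openqa-worker-auto-restart@([0-9]+)(?:\.service)?$")


def units_to_stop(active_units: Iterable[str], numofworkers: int) -> List[str]:
    """Return the worker units whose slot number exceeds numofworkers.

    Assumption for this sketch: configured slots are 1..numofworkers.
    Units that are not worker instances are ignored.
    """
    extra = []
    for unit in active_units:
        match = UNIT_PATTERN.match(unit)
        if match and int(match.group(1)) > numofworkers:
            extra.append(unit)
    return extra


if __name__ == "__main__":
    active = [
        "openqa-worker-auto-restart@1.service",
        "openqa-worker-auto-restart@42.service",
        "sshd.service",
    ]
    print(units_to_stop(active, numofworkers=2))
```

<p>An empty result simply means there is nothing to stop or disable, which corresponds to the "no workers to stop" case mentioned above.</p>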