<p><strong>openQA Project - action #88459: low-prio single-machine jobs can starve out high-prio multi-machine tests</strong><br>
<a href="https://progress.opensuse.org/issues/88459" class="external">https://progress.opensuse.org/issues/88459</a></p>
<p><strong>mkittler</strong> wrote on 2021-02-18T09:43:09Z:</p>
<ul><li><strong>Assignee</strong> set to <i>mkittler</i></li></ul><blockquote>
<p>Review the scheduling code which AFAIR should try to prefer multi-machine jobs by blocking relevant worker slots until enough slots are free to execute the cluster</p>
</blockquote>
<p>There's indeed such code. It should prioritize jobs that are part of relevant parallel clusters and at some point even "hold a worker" to avoid starvation.</p>
<p>Judging by the logs, the code is triggered, e.g.:</p>
<pre><code>[2021-02-18T10:32:57.0587 CET] [debug] [pid:1401] Holding worker 967 for job 5480608 to avoid starvation
</code></pre>
<p>However, when looking at <a href="https://openqa.suse.de/admin/workers/967" class="external">https://openqa.suse.de/admin/workers/967</a> it turned out that the worker had actually just started working on the single-machine job <a href="https://openqa.suse.de/tests/5480515" class="external">https://openqa.suse.de/tests/5480515</a> (instead of <a href="https://openqa.suse.de/tests/5480608" class="external">https://openqa.suse.de/tests/5480608</a>, which was still scheduled). So despite the log message it doesn't really work. I've also seen a few similar cases (found via <code>grep 'Discarding job' /var/log/openqa_scheduler</code> and <code>grep 'Holding worker' /var/log/openqa_scheduler</code> on OSD).</p>
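<p>For context, here is a heavily simplified, hypothetical sketch of the "hold a worker" idea. This is not the actual openQA code (the real logic lives in <code>OpenQA::Scheduler::Model::Jobs</code> and differs in many details); all names below are made up for illustration:</p>
<pre><code>#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, simplified sketch of the starvation protection idea;
# all names are illustrative, not the real openQA implementation.
sub schedule_cluster {
    my ($cluster_jobs, $free_workers, $held_workers) = @_;

    # Not enough free slots to start the whole parallel cluster at once?
    if (@$cluster_jobs > @$free_workers) {
        # Reserve one free slot for the cluster so low-prio
        # single-machine jobs cannot occupy every slot forever.
        my $held = shift @$free_workers;
        push @$held_workers, $held;
        print "Holding worker $held for job $cluster_jobs->[0] to avoid starvation\n";
        return 0;    # cluster stays scheduled, retried on the next tick
    }

    # Enough slots: consume one free slot per job in the cluster.
    print "Assigning jobs @$cluster_jobs\n";
    splice @$free_workers, 0, scalar @$cluster_jobs;
    return 1;
}

# Example: a 3-job cluster with only 2 free worker slots holds one slot.
my @free = (967, 968);
my @held;
schedule_cluster([5480608, 5480609, 5480610], \@free, \@held);
</code></pre>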
<p><strong>mkittler</strong> wrote on 2021-02-18T10:30:55Z:</p>
<p>Judging by the full scheduler log, the situation described in the previous comment is actually not a problem at all. The scheduler run where the held job was assigned looks like this:</p>
<pre><code>[2021-02-18T10:27:52.0180 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5480608 (with priority 0)
[2021-02-18T10:27:52.0180 CET] [debug] [pid:1401] Holding worker 599 for job 5480608 to avoid starvation
[2021-02-18T10:27:52.0181 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5481843 (with priority 8)
[2021-02-18T10:27:52.0181 CET] [debug] [pid:1401] Discarding job 5481843 (with priority 8) due to incomplete parallel cluster
[2021-02-18T10:27:52.0181 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5481845 (with priority 10)
[2021-02-18T10:27:52.0181 CET] [debug] [pid:1401] Discarding job 5481843 (with priority 10) due to incomplete parallel cluster
[2021-02-18T10:27:52.0181 CET] [debug] [pid:1401] Need to schedule 1 parallel jobs for job 5482359 (with priority 30)
…
[2021-02-18T10:27:52.0182 CET] [debug] [pid:1401] Need to schedule 1 parallel jobs for job 5480515 (with priority 50)
[2021-02-18T10:27:52.0182 CET] [debug] [pid:1401] Need to schedule 1 parallel jobs for job 5480516 (with priority 50)
[2021-02-18T10:27:52.0182 CET] [debug] [pid:1401] Need to schedule 1 parallel jobs for job 5480517 (with priority 50)
…
</code></pre>
<p>So we're just "holding" a different worker for that job. This should be ok as well because it doesn't matter which worker we're "holding" as long as we're holding some worker.</p>
<p>Judging by the rest of the log the held worker is never used for a different job within the same scheduler run. And due to their increased priority the parallel jobs are also tried to be scheduled at the very beginning of each scheduler run.</p>
<hr>
<p>I still see one flaw: it looks like we only ever hold one worker per parallel parent. The code was likely written with simple parallel clusters of one parent and one child in mind. However, when a parent has multiple children, holding only one worker is not sufficient. E.g. the parallel clusters which are currently starving on OSD consist of one parent and 2 children, so we would need to hold at least 2 workers until a 3rd worker becomes available. If that turns out to be the case, a fix could look like the sketch below.</p>
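<p>A hypothetical fix sketch, replacing the single <code>shift</code> in the snippet from the first comment (again, illustrative names only, not the actual openQA code):</p>
<pre><code># Hypothetical fix sketch: hold every free matching worker (up to the
# cluster size) instead of only one, so a 3-job cluster with 2 free
# workers would hold both until a 3rd slot frees up.
while (@$free_workers and scalar @$held_workers < scalar @$cluster_jobs) {
    my $held = shift @$free_workers;
    push @$held_workers, $held;
    print "Holding worker $held for job $cluster_jobs->[0] to avoid starvation\n";
}
</code></pre>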
<p><strong>mkittler</strong> wrote on 2021-02-18T12:55:27Z:</p>
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li></ul><p>I've created a unit test to check whether the flaw mentioned in my last comment is actually present: <a href="https://github.com/os-autoinst/openQA/pull/3741" class="external">https://github.com/os-autoinst/openQA/pull/3741</a></p>
<p>It doesn't seem that way. If there are 3 parallel jobs (like the jobs I've seen on OSD) but only 2 matching workers, the 2 matching workers are actually "held".</p>
<p>Of course, if there's only one matching worker, the scheduler can only hold that one worker as we cannot reserve busy workers. I've also tested this by putting <code>pop @mocked_free_workers;</code> before the last <code>OpenQA::Scheduler::Model::Jobs-&gt;singleton-&gt;schedule</code> in my unit test. Then only one <code>Holding worker …</code> line is logged, like on OSD. I would assume that this is simply the case we see on OSD, so there's still no problem.</p>
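<p>Roughly, the test idea looks like this (a simplified sketch, not the actual code from the PR; the worker-creation helper is hypothetical):</p>
<pre><code># Sketch of the test idea (simplified; see the PR for the real test).
# Schedule a cluster of 3 parallel jobs with 3 matching free workers:
my @mocked_free_workers = create_mocked_workers(3);    # hypothetical helper
OpenQA::Scheduler::Model::Jobs->singleton->schedule;
# expected: all 3 jobs are assigned

# Simulate only 2 matching free workers: now the scheduler should log
# "Holding worker …" twice instead of assigning them to other jobs.
pop @mocked_free_workers;
OpenQA::Scheduler::Model::Jobs->singleton->schedule;
</code></pre>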
<p><strong>mkittler</strong> wrote on 2021-02-18T13:28:43Z:</p>
<p>I've been checking more concrete examples on OSD. It looks like there actually are cases where 2 workers are held for a 3-job cluster:</p>
<pre><code>[2021-02-18T14:08:27.0654 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5482683 (with priority 0)
[2021-02-18T14:08:27.0654 CET] [debug] [pid:1401] Holding worker 927 for job 5482686 to avoid starvation
[2021-02-18T14:08:27.0654 CET] [debug] [pid:1401] Holding worker 1254 for job 5482683 to avoid starvation
</code></pre>
<p>Here 5482686 and 5482683 are part of the same cluster. In this example we see that it also works with mixed <code>WORKER_CLASS</code>es (<code>tap</code> and <code>sap_sle15</code>).</p>
<hr>
<p>Here is the first time this job reached prio zero and the worker reservation kicked in:</p>
<pre><code>[2021-02-18T13:10:48.0651 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5482683 (with priority 0)
[2021-02-18T13:10:48.0651 CET] [debug] [pid:1401] Holding worker 600 for job 5482683 to avoid starvation
</code></pre>
<p>It took a while to get there (compare the timestamps):</p>
<pre><code>[2021-02-18T12:15:51.0257 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5482683 (with priority 50)
…
[2021-02-18T12:19:13.0863 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5482683 (with priority 50)
[2021-02-18T12:19:13.0863 CET] [debug] [pid:1401] Discarding job 5482683 (with priority 50) due to incomplete parallel cluster
…
[2021-02-18T13:10:47.0569 CET] [debug] [pid:1401] Need to schedule 3 parallel jobs for job 5482683 (with priority 1)
[2021-02-18T13:10:47.0569 CET] [debug] [pid:1401] Discarding job 5482683 (with priority 1) due to incomplete parallel cluster
</code></pre>
<p>Getting from prio 50 to prio 0 took roughly 55 minutes here. We could make this automatic prio bumping more aggressive, e.g. by making the per-tick step configurable.</p>
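<p>A hypothetical sketch of what such a configurable step could look like (the env variable name is made up for illustration; the actual implementation is in the PR linked in the next comment):</p>
<pre><code># Hypothetical sketch: lower the prio value of starving parallel jobs by
# a configurable amount per scheduler tick instead of a fixed step.
# The env variable name is made up for illustration.
my $offset = $ENV{OPENQA_SCHEDULER_PRIO_OFFSET} // 1;
for my $job (@starving_parallel_jobs) {
    my $new_prio = $job->priority - $offset;
    $new_prio = 0 if $new_prio < 0;    # prio 0 triggers the worker holding
    $job->update({priority => $new_prio});
}
</code></pre>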
<p><strong>mkittler</strong> wrote on 2021-02-18T15:47:53Z:</p>
<p>PR for bumping the prio more aggressively (configurable via an env variable): <a href="https://github.com/os-autoinst/openQA/pull/3742" class="external">https://github.com/os-autoinst/openQA/pull/3742</a></p>
<hr>
<blockquote>
<p>[16:23] fvogt: As already explained, it takes a while until the prio decreases. The worker is not really aggressively doing that and only puts other workers on hold if the prio reaches 0.<br>
[16:24] I'd argue that it should do that if there's no higher prio job</p>
</blockquote>
<p>It would make sense to implement this as well.</p>
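<p>A hypothetical sketch of that suggested condition (an illustrative fragment, not the actual openQA code):</p>
<pre><code>use List::Util qw(min);

# Hypothetical sketch of fvogt's suggestion: put workers on hold not only
# once the cluster's prio value reaches 0, but already when no other
# scheduled job has a better (lower) prio value.
# @other_scheduled_jobs and $cluster_prio are assumed to be set elsewhere.
my $best_other_prio = min(map { $_->priority } @other_scheduled_jobs);
my $hold_workers = $cluster_prio <= 0
  || !defined($best_other_prio)
  || $cluster_prio <= $best_other_prio;
</code></pre>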
<p><strong>openqa_review</strong> wrote on 2021-02-19T04:07:42Z:</p>
<ul><li><strong>Due date</strong> set to <i>2021-03-05</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
<p><strong>mkittler</strong> wrote on 2021-02-19T16:40:31Z:</p>
<blockquote>
<p>I've now made the prio decrease 10 times faster on o3. Feel free to adjust this yourself as needed (adjust the env variable via <code>systemctl edit openqa-scheduler</code>). But note that the change is not actually effective yet because <a href="https://github.com/os-autoinst/openQA/pull/3742" class="external">https://github.com/os-autoinst/openQA/pull/3742</a> has not been deployed yet.</p>
</blockquote>
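<p>For reference, such an env variable can be set via a systemd drop-in, e.g. (the variable name below is a placeholder; the actual name is defined in the PR):</p>
<pre><code># opened via: systemctl edit openqa-scheduler
# (creates /etc/systemd/system/openqa-scheduler.service.d/override.conf)
# The variable name is a placeholder, check the PR for the actual one.
[Service]
Environment="OPENQA_SCHEDULER_PRIO_OFFSET=10"
</code></pre>
<p>Afterwards <code>systemctl restart openqa-scheduler</code> applies the change.</p>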
<p><strong>mkittler</strong> wrote on 2021-02-22T14:37:01Z:</p>
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p>The change making the prio decrease 10 times faster should now be effective on o3.</p>
<p><strong>favogt</strong> wrote on 2021-03-03T13:41:09Z:</p>
<p>mkittler wrote:</p>
<blockquote>
<p>The change to make the prio decreasing 10 times faster should now be effective on o3.</p>
</blockquote>
<p>Looking at the current state of the TW group, there are a few normal-prio tests and the usual amount of low-prio investigation jobs queued, but the multi-machine tests have mostly finished already. So this indeed seems to work as it should!</p>
<p><strong>ggardet_arm</strong> wrote on 2021-03-03T13:43:27Z:</p>
<p>At least on aarch64, multi-machine tests are now scheduled properly!</p>
<p><strong>okurz</strong> wrote on 2021-03-05T07:36:31Z:</p>
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Thanks all for the feedback. This should be enough to call this done.</p>