openSUSE Project Management Tool | https://progress.opensuse.org/ | 2020-03-31T11:49:01Z
openQA Project - action #65082: flaky/sporadic unstable failure in t/43-scheduling-and-worker-scalability.t | https://progress.opensuse.org/issues/65082?journal_id=289602 | 2020-03-31T11:49:01Z | okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p><a href="https://github.com/os-autoinst/openQA/pull/2885" class="external">https://github.com/os-autoinst/openQA/pull/2885</a></p>
https://progress.opensuse.org/issues/65082?journal_id=291257 | 2020-04-07T09:51:50Z | okurz (okurz@suse.com)
<p><a href="https://github.com/os-autoinst/openQA/pull/2882" class="external">https://github.com/os-autoinst/openQA/pull/2882</a> is the first part, from Martchus, and bumps the internal timeout for waiting for jobs to finish. <a href="https://github.com/os-autoinst/openQA/pull/2885#issuecomment-610291020" class="external">https://github.com/os-autoinst/openQA/pull/2885#issuecomment-610291020</a> describes a new, unexpected problem in the scheduler "full" test.</p>
https://progress.opensuse.org/issues/65082?journal_id=300757 | 2020-05-15T14:26:58Z | livdywan (liv.dywan@suse.com)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/300757/diff?detail_id=297742">diff</a>)</li></ul>
https://progress.opensuse.org/issues/65082?journal_id=310087 | 2020-06-26T10:15:36Z | okurz (okurz@suse.com)
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Workable</i></li><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li><li><strong>Priority</strong> changed from <i>High</i> to <i>Low</i></li></ul><p>I have not seen the test module t/43-scheduling-and-worker-scalability.t fail at the top level since we introduced RETRY=3 and since my recent changes to the test setup, e.g. using dynamically chosen ports that are ensured to be free. To complete this ticket, one would still need to ensure stability without RETRY=3.</p>
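<p>To illustrate why a retry setting can hide sporadic failures at the top level, here is a minimal sketch of a retry wrapper. The function name <code>retry_cmd</code> and its "maximum number of attempts" semantics are assumptions for illustration; this is not the actual RETRY implementation used by openQA.</p>

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): run a command up to a
# maximum number of attempts and succeed as soon as one attempt passes.
retry_cmd() {
    attempts=$1
    shift
    n=1
    while :; do
        # A single passing attempt makes the whole invocation pass,
        # so earlier sporadic failures become invisible to the caller.
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
    done
}

# Example (assumed invocation):
#   retry_cmd 3 prove -l t/43-scheduling-and-worker-scalability.t
```

<p>With such a wrapper, a test that fails only sporadically almost always passes overall, which is why stability still has to be checked without it.</p>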
https://progress.opensuse.org/issues/65082?journal_id=314398 | 2020-07-23T09:26:08Z | okurz (okurz@suse.com)
<ul><li><strong>Target version</strong> set to <i>Ready</i></li></ul>
https://progress.opensuse.org/issues/65082?journal_id=385111 | 2021-02-23T15:00:20Z | okurz (okurz@suse.com)
<p><a href="https://github.com/os-autoinst/openQA/pull/3714" class="external">https://github.com/os-autoinst/openQA/pull/3714</a> is a related change that tried to stabilize t/43-scheduling-and-worker-scalability.t. Maybe that is already enough and the test can now be run without retries.</p>
https://progress.opensuse.org/issues/65082?journal_id=385129 | 2021-02-23T15:35:00Z | mkittler (marius.kittler@suse.com)
<p>The fix will not help with this one because it only prevents us from running into <code>BAIL_OUT('Unable to assign jobs to (idling) workers');</code>. Previously it was possible that not a single worker had been <em>fully</em> registered yet. However, in the case of this ticket the test had already progressed further, but the jobs were not processed fast enough.</p>
<p>In fact, I was just able to reproduce this case locally (with the previous fix present) via <code>time env runs=400 SCALABILITY_TEST=1 "$OPENQA_BASEDIR/repos/okurz-github-scripts/count_fail_ratio" openqa-test t/43-scheduling-and-worker-scalability.t</code>:</p>
<pre><code> not ok 2 - all jobs done
# Failed test 'all jobs done'
# at t/43-scheduling-and-worker-scalability.t line 211.
# got: '4'
# expected: '5'
not ok 3 - all jobs passed
# Failed test 'all jobs passed'
# at t/43-scheduling-and-worker-scalability.t line 212.
# got: '4'
# expected: '5'
# All jobs:
# - id: 1, state: done, result: passed, reason: none
# - id: 2, state: done, result: passed, reason: none
# - id: 3, state: assigned, result: none, reason: none
# - id: 4, state: done, result: passed, reason: none
# - id: 5, state: done, result: passed, reason: none
1..3
# Looks like you failed 2 tests of 3.
not ok 2 - assign and run jobs
</code></pre>
<p>But that was also the only failing test run out of 400. Normally it takes 6 polling attempts (<code># Waiting until all jobs are done, try 6</code>) on my machine. In the failing run, 125 attempts had already been made (<code># Waiting until all jobs are done, try 125</code>).</p>
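<p>The idea of measuring a fail ratio by repeating a test many times can be sketched as follows. This is a simplified, hypothetical stand-in, not the actual <code>count_fail_ratio</code> script; the function name <code>fail_ratio</code> and its output format are assumptions.</p>

```shell
#!/bin/sh
# Hypothetical sketch (not the real count_fail_ratio): run a command a
# given number of times and report how many of the runs failed.
fail_ratio() {
    runs=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        # Count every non-zero exit status as one failed run.
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    done
    echo "fails: $fails/$runs"
}

# Example (assumed invocation):
#   fail_ratio 400 prove -l t/43-scheduling-and-worker-scalability.t
```

<p>One failure in 400 runs corresponds to a fail ratio of 0.25%, so a large number of runs is needed to observe a failure this rare at all.</p>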
https://progress.opensuse.org/issues/65082?journal_id=402066 | 2021-04-29T07:42:20Z | okurz (okurz@suse.com)
<ul><li><strong>Target version</strong> changed from <i>Ready</i> to <i>future</i></li></ul>