openSUSE Project Management Tool: Issues | https://progress.opensuse.org/ | 2024-03-11T10:27:19Z | openSUSE Project Management Tool | Redmine
openQA Project - action #157018 (Resolved): [sporadic] Build failed in Jenkins: submit-openQA-TW-... | https://progress.opensuse.org/issues/157018 | 2024-03-11T10:27:19Z | tinita (tina.mueller+trick-redmine@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>Date: Sat, 9 Mar 2024 03:49:48 +0100 (CET)
See &lt;http://jenkins.qa.suse.de/job/submit-openQA-TW-to-oS_Fctry/1001/display/redirect&gt;
Changes:
------------------------------------------
[...truncated 4.20 MiB...]
&lt;result project="devel:openQA:tested" repository="openSUSE_Factory" arch="x86_64" code="blocked" state="blocked"&gt;
&lt;status package="openQA" code="blocked"&gt;
+ echo 'Waiting while openQA is in progress'
Waiting while openQA is in progress
...
4.6.1709822711.90519fe6
4.6.1709822711.90519fe6
4.6.1709822711.90519fe6
4.6.1709822711.90519fe6' openSUSE:Factory
Server returned an error: HTTP Error 503: Service Unavailable
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Short unavailabilities of OBS are covered with retry</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Use <a href="https://build.opensuse.org/package/show/openSUSE:Factory/retry" class="external">https://build.opensuse.org/package/show/openSUSE:Factory/retry</a> in the corresponding script from github.com/os-autoinst/scripts/</li>
</ul>
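<p>The retry idea from AC1 can be sketched as follows. This is only an illustration in Python of retrying with exponential backoff; it is not the <code>retry</code> tool linked above, and the function name, parameters and defaults are assumptions made for this example.</p>

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 5, base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action until it succeeds or the attempts are used up.

    Hypothetical helper: the delay doubles after every failed attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

<p>A short OBS outage such as the "HTTP Error 503" above would then only fail the job if it outlasts all attempts. The <code>sleep</code> parameter exists so the waiting can be replaced in tests.</p>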
openQA Tests - action #122830 (Resolved): [tools][openQA-in-openQA][sporadic] test fails in login... | https://progress.opensuse.org/issues/122830 | 2023-01-09T10:57:40Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install+publish@64bit-2G fails in<br>
<a href="https://openqa.opensuse.org/tests/3022232/modules/login/steps/1" class="external">login</a></p>
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails since (at least) Build <a href="https://openqa.opensuse.org/tests/3022232" class="external">:TW.14539</a> (current job)</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>Last good: <a href="https://openqa.opensuse.org/tests/3022225" class="external">:TW.14538</a> (or more recent)</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_install%2Bpublish&version=Tumbleweed" class="external">latest</a></p>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Find out why the needle match does not get the area anyway. Maybe openQA ends up clicking on the popup</li>
<li>Check whether we already have code handling similar popups, or try to disable the survey (app.normandy.enabled = false)</li>
<li>From the video it looks racy: the login matches just before the popup comes up
<ul>
<li><a href="https://www.askvg.com/tip-disable-surveys-rate-your-experience-out-of-date-notifications-in-firefox/" class="external">https://www.askvg.com/tip-disable-surveys-rate-your-experience-out-of-date-notifications-in-firefox/</a></li>
</ul></li>
</ul>
openQA Project - action #122440 (Resolved): [sporadic] openQA Assetpack download can fail on init... | https://progress.opensuse.org/issues/122440 | 2022-12-25T11:25:10Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://openqa.opensuse.org/tests/2978761/logfile?filename=openqa_webui-openqa_nohup_out.txt" class="external">https://openqa.opensuse.org/tests/2978761/logfile?filename=openqa_webui-openqa_nohup_out.txt</a> shows</p>
<pre><code>[info] Caching "https://cdn.datatables.net/1.10.16/css/dataTables.bootstrap4.css" to "/root/openQA/script/../assets/cache/cdn.datatables.net/1.10.16/css/dataTables.bootstrap4.css".
[info] Caching "https://cdn.jsdelivr.net/npm/fork-awesome@1.2.0/css/fork-awesome.min.css" to "/root/openQA/script/../assets/cache/cdn.jsdelivr.net/npm/fork-awesome@1.2.0/css/fork-awesome.min.css".
[warn] [AssetPack] Unable to download https://raw.githubusercontent.com/bootstrapthemesco/bootstrap-4-multi-dropdown-navbar/beta2.0/css/bootstrap-4-navbar.css: Connect timeout
Could not find input asset "https://raw.githubusercontent.com/bootstrapthemesco/bootstrap-4-multi-dropdown-navbar/beta2.0/css/bootstrap-4-navbar.css". at /usr/lib/perl5/vendor_perl/5.36.0/Mojolicious/Plugin/AssetPack.pm line 172.
openQA is licensed GPL-2.0 - Version 4.6.1671708203.c9f8b10
</code></pre>
<p>The openqa-investigate retry jobs passed, so this is a sporadic download issue.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> The openQA asset handling ensures that temporary network issues such as "Connect timeout" are handled by retrying</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>We can probably retry on download issues</li>
<li>Check the code in the AssetPack Mojolicious plugin. If necessary, propose a solution upstream; alternatively, handle it downstream with some hard-core log parsing and just retry :)</li>
</ul>
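<p>The downstream "log parsing and retry" alternative could look roughly like this. It is a sketch in Python, assuming the warning always has the shape of the <code>[AssetPack] Unable to download ...: Connect timeout</code> line quoted above; the function names and the retry policy are hypothetical.</p>

```python
import re
from typing import List

# Pattern modelled on the "[warn] [AssetPack] Unable to download <url>: Connect timeout"
# line from the log excerpt; other failure messages are deliberately not matched.
ASSET_TIMEOUT = re.compile(r"\[AssetPack\] Unable to download (\S+): Connect timeout")


def timed_out_assets(log_text: str) -> List[str]:
    """Return the URLs of all assets whose download hit a connect timeout."""
    return ASSET_TIMEOUT.findall(log_text)


def should_retry(log_text: str) -> bool:
    """Hypothetical policy: retry the startup only if a connect timeout was logged."""
    return bool(timed_out_assets(log_text))
```

<p>Matching only "Connect timeout" keeps permanent errors, e.g. a removed asset, from being retried forever.</p>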
openQA Project - action #121042 (Resolved): [sporadic] typing issue in comments UI test size:M | https://progress.opensuse.org/issues/121042 | 2022-11-28T13:08:12Z | mkittler (marius.kittler@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>[12:42:18] t/ui/15-comments.t ......................... 12/? get: unexpected alert open: {Alert text : The comment text mustn't be empty.} at /home/squamata/project/t/ui/../lib/OpenQA/SeleniumTest.pm:81 at /home/squamata/project/t/ui/../lib/OpenQA/SeleniumTest.pm line 84.
OpenQA::SeleniumTest::__ANON__(Test::Selenium::Chrome=HASH(0x55cbb5d49500), "Error while executing command: get: unexpected alert open: {A"..., HASH(0x55cbb2ea54f0), HASH(0x55cbb621f7c0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Selenium/Remote/Driver.pm line 356
Selenium::Remote::Driver::catch {...} ("Error while executing command: get: unexpected alert open: {A"...) called at /usr/lib/perl5/vendor_perl/5.26.1/Try/Tiny.pm line 123
Try::Tiny::try(CODE(0x55cbb6206820), Try::Tiny::Catch=REF(0x55cbb61f4f10)) called at /usr/lib/perl5/vendor_perl/5.26.1/Selenium/Remote/Driver.pm line 361
Selenium::Remote::Driver::__ANON__(CODE(0x55cbb5ac8590), Test::Selenium::Chrome=HASH(0x55cbb5d49500), HASH(0x55cbb2ea54f0), HASH(0x55cbb621f7c0)) called at (eval 1713)[/usr/lib/perl5/vendor_perl/5.26.1/Class/Method/Modifiers.pm:89] line 1
Selenium::Remote::Driver::__ANON__(Test::Selenium::Chrome=HASH(0x55cbb5d49500), HASH(0x55cbb2ea54f0), HASH(0x55cbb621f7c0)) called at (eval 1715)[/usr/lib/perl5/vendor_perl/5.26.1/Class/Method/Modifiers.pm:148] line 2
Selenium::Remote::Driver::_execute_command(Test::Selenium::Chrome=HASH(0x55cbb5d49500), HASH(0x55cbb2ea54f0), HASH(0x55cbb621f7c0)) called at /usr/lib/perl5/vendor_perl/5.26.1/Selenium/Remote/Driver.pm line 946
Selenium::Remote::Driver::get(Test::Selenium::Chrome=HASH(0x55cbb5d49500), "/group_overview/1001") called at t/ui/15-comments.t line 493
main::__ANON__() called at /usr/lib/perl5/5.26.1/Test/Builder.pm line 309
eval {...} called at /usr/lib/perl5/5.26.1/Test/Builder.pm line 309
Test::Builder::subtest(Test::Builder=HASH(0x55cba96c1160), "group overview: /group_overview/1001", CODE(0x55cbb61ec710)) called at /usr/lib/perl5/5.26.1/Test/More.pm line 807
Test::More::subtest("group overview: /group_overview/1001", CODE(0x55cbb61ec710)) called at t/ui/15-comments.t line 505
main::__ANON__() called at /usr/lib/perl5/5.26.1/Test/Builder.pm line 309
eval {...} called at /usr/lib/perl5/5.26.1/Test/Builder.pm line 309
Test::Builder::subtest(Test::Builder=HASH(0x55cba96c1160), "editing when logged in as regular user", CODE(0x55cbb61e0128)) called at /usr/lib/perl5/5.26.1/Test/More.pm line 807
Test::More::subtest("editing when logged in as regular user", CODE(0x55cbb61e0128)) called at t/ui/15-comments.t line 506
[12:42:18] t/ui/15-comments.t ......................... 13/? # Tests were run but no plan was declared and done_testing() was not seen.
[12:42:18] t/ui/15-comments.t ......................... Dubious, test returned 254 (wstat 65024, 0xfe00)
All 13 subtests passed
</code></pre>
<p>(see <a href="https://app.circleci.com/pipelines/github/os-autoinst/openQA/10659/workflows/5fa05316-b883-48e5-9680-571ca7baf77b/jobs/99866/steps" class="external">https://app.circleci.com/pipelines/github/os-autoinst/openQA/10659/workflows/5fa05316-b883-48e5-9680-571ca7baf77b/jobs/99866/steps</a>)</p>
<p>I've encountered it only once so far and retrying helped. Likely it happens only very rarely.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Sporadic issue no longer appears</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Use the CircleCI SSH feature to debug (since it has not been reproduced locally)</li>
</ul>
openQA Project - action #113138 (Resolved): sporadic failure in openQA test "t/ui/23-audit-log.t"... | https://progress.opensuse.org/issues/113138 | 2022-07-01T08:30:32Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://app.circleci.com/pipelines/github/os-autoinst/openQA/9895/workflows/bd749c46-c563-4fcd-9d17-6195780465f3/jobs/93334/steps" class="external">https://app.circleci.com/pipelines/github/os-autoinst/openQA/9895/workflows/bd749c46-c563-4fcd-9d17-6195780465f3/jobs/93334/steps</a> shows</p>
<pre><code>[06:59:35] t/ui/23-audit-log.t ........................ 12/?
# Failed test 'correct number of elements'
# at t/ui/23-audit-log.t line 132.
# got: '9'
# expected: '1'
# less than a minute ago Demo job_create {
# "TEST": "foo",
# "id": 1
# }, less than a minute ago Demo comment_create {
# "id": 1,
# "job_id": 1
# }, less than a minute ago Demo table_create {
# "backend": "qemu",
# "description": null,
# "id": 1,
# "name": "foo",
# "settings": [],
# "table": "Machines"
# }, less than a minute ago Demo table_create {
# "description": null,
# "id": 1,
# "name": "testsuite",
# "settings": [],
# "table": "TestSuites"
# }, less than a minute ago Demo table_create {
# "arch": "x86_64",
# "description": null,
# "distri": "opensuse",
# "flavor": "DVD",
# "id": 1,
# "name": "",
# "settings": [],
# "table": "Products",
# "version": "13.2"
# }, less than a minute ago Demo user_login null, less than a minute ago system startup openQA restarted, less than a minute ago Demo user_login null, less than a minute ago system startup openQA restarted
# Looks like you failed 1 test of 18.
[06:59:35] t/ui/23-audit-log.t ........................ 13/?
# Failed test 'clickable events'
# at t/ui/23-audit-log.t line 140.
[06:59:35] t/ui/23-audit-log.t ........................ 14/? # Looks like you failed 1 test of 14.
[06:59:35] t/ui/23-audit-log.t ........................ Dubious, test returned 1 (wstat 256, 0x100)
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> t/ui/23-audit-log.t stable in many runs</li>
<li><strong>AC2:</strong> t/ui/23-audit-log.t is not included in tools/unstable_tests.txt</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Read about the history of latest changes in <a href="https://progress.opensuse.org/issues/111539" class="external">https://progress.opensuse.org/issues/111539</a>, maybe okurz already tried to run 1000 runs locally and couldn't reproduce or something.</li>
<li>Try it out locally with coverage enabled and see if it can be reproduced. Maybe try NON_HEADLESS=1. Or try to spot the mistake by code analysis.</li>
<li>Make the test race-free</li>
<li>Remove the test module from tools/unstable_tests.txt</li>
<li>Optional: Try to gather statistics from CircleCI because we have similar problems reappearing</li>
</ul>
openQA Project - action #95995 (Resolved): [sporadic][openqa-in-openqa] Test openqa_from_git eve... | https://progress.opensuse.org/issues/95995 | 2021-07-26T08:51:35Z | ilausuch (ilausuch@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In this test run <a href="https://openqa.opensuse.org/tests/1857293#step/openqa_webui/33" class="external">https://openqa.opensuse.org/tests/1857293#step/openqa_webui/33</a> there is a problem. It seems that the server is not responsive.</p>
<pre><code># Test died: command 'while ! [ -f nohup.out ]; do sleep 1 ; done && grep -qP "Listening at.*(127.0.0.1|localhost)" <(tail -f -n0 nohup.out) ' timed out at openqa//tests/install/openqa_webui.pm line 68.
</code></pre>
<p><a href="https://openqa.opensuse.org/tests/1857293/logfile?filename=openqa_webui-openqa_nohup_out.txt" class="external">https://openqa.opensuse.org/tests/1857293/logfile?filename=openqa_webui-openqa_nohup_out.txt</a> shows</p>
<pre><code>[warn] [AssetPack] Unable to download https://cdnjs.cloudflare.com/ajax/libs/chosen/1.7.0/chosen.css: Connect timeout
Could not find input asset "https://cdnjs.cloudflare.com/ajax/libs/chosen/1.7.0/chosen.css". at /usr/lib/perl5/vendor_perl/5.32.1/Mojolicious/Plugin/AssetPack.pm line 172.
</code></pre>
<p>which <em>maybe</em> is causing the problem, maybe not.</p>
<a name="Acceptance-Criteria"></a>
<h2 >Acceptance Criteria<a href="#Acceptance-Criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC 1</strong>: The above timeout does not appear again in at least 10 consecutive rounds</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Crosscheck in a passed test if the asset connect timeout warning also shows up to prevent us following a "red herring"</li>
<li>DONE: Check if the above download URL from asset definitions can work -> the link <a href="https://cdnjs.cloudflare.com/ajax/libs/chosen/1.7.0/chosen.css" class="external">https://cdnjs.cloudflare.com/ajax/libs/chosen/1.7.0/chosen.css</a> works</li>
<li>Try to reproduce locally as well as use <a href="https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation" class="external">https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation</a> to get statistics of failures</li>
<li>Prevent the timeout either at a low level, e.g. in asset preparation, or at a high level, e.g. by retrying within the openQA-in-openQA tests</li>
</ul>
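<p>The check that timed out above waits for a "Listening at" line in <code>nohup.out</code>. A minimal sketch of such a wait with an explicit deadline, written in Python for illustration only (the function name and defaults are assumptions, and this is not the code in <code>openqa_webui.pm</code>):</p>

```python
import re
import time
from pathlib import Path


def wait_for_line(path: Path, pattern: str, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Poll path until its content matches pattern or timeout seconds have passed."""
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        # The whole file is re-read on every poll, so a line that was written
        # before polling started is still found.
        if path.exists() and regex.search(path.read_text(errors="replace")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

<p>Returning <code>False</code> instead of dying lets the caller decide whether to retry the web UI startup or to fail the test.</p>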
openQA Project - action #89899 (Resolved): Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_d... | https://progress.opensuse.org/issues/89899 | 2021-03-11T07:00:28Z | okurz (okurz@suse.com)
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>See <a class="issue tracker-6 status-3 priority-4 priority-default closed child parent" title="coordination: [epic] Let's make codecov reports reliable (Resolved)" href="https://progress.opensuse.org/issues/55364">#55364</a> : codecov reports often report about coverage changes which are obviously not related to the actual changes of a PR, e.g. when documentation is changed. We can already trust our coverage analysis more but should have only coverage changes reported for actual changes we introduced in a pull request.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> t/ui/27-plugin_obs_rsync_status_details.t no longer shows up with changed code coverage in unrelated changes</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Try to reproduce locally with <code>rm -rf cover_db/ && make coverage KEEP_DB=1 TESTS=t/ui/27-plugin_obs_rsync_status_details.t</code></li>
<li>Check coverage details in the generated HTML report, e.g. call <code>firefox cover_db/coverage.html</code></li>
<li>Fix uncovered lines with "uncoverable" statements, see previous commits adding these comments or look into <a href="https://metacpan.org/pod/Devel::Cover#UNCOVERABLE-CRITERIA" class="external">https://metacpan.org/pod/Devel::Cover#UNCOVERABLE-CRITERIA</a> or other means</li>
<li>Retry multiple times to check for flakiness</li>
</ul>
openQA Project - action #78019 (Rejected): [sporadic] os-autoinst t/18-backend-qemu.t timed out i... | https://progress.opensuse.org/issues/78019 | 2020-11-16T13:30:30Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://build.opensuse.org/package/live_build_log/devel:openQA:TestGithub:OPR-1567/os-autoinst/openSUSE_Factory/x86_64" class="external">https://build.opensuse.org/package/live_build_log/devel:openQA:TestGithub:OPR-1567/os-autoinst/openSUSE_Factory/x86_64</a> shows</p>
<pre><code>[ 191s] 3: ./12-bmwqemu.t ........................... ok
[ 191s] 3: ./15-logging.t ........................... ok
[ 191s] 3: ./16-send_with_fd.t ...................... ok
[ 191s] 3: ./17-basetest.t .......................... ok
[ 191s] 3: Bailout called. Further testing stopped: test exceeds runtime limit of '10' seconds
[ 191s] 3: FAILED--Further testing stopped: test exceeds runtime limit of '10' seconds
[ 191s] 3/3 Test #3: test-perl-testsuite ..............***Failed 109.48 sec
[ 191s]
</code></pre>
<p>This likely means that the <em>next</em> test module, which is not explicitly mentioned here, timed out. In alphabetical order that would be "t/18-backend-qemu.t".</p>
<p>Locally I ran <code>count_fail_ratio prove -v -I . -I external/os-autoinst-common/lib -l --timer t/18-backend-qemu.t</code>, using <a href="https://github.com/okurz/scripts/tree/master/count_fail_ratio" class="external">https://github.com/okurz/scripts/tree/master/count_fail_ratio</a>, and observed a very consistent runtime of 1s.</p>
<p>Running the tests with caa97e71 checked out I can reproduce the same times. So if this is a regression, it could be in dependencies.</p>
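<p>Measuring a fail ratio as described above boils down to running the same command many times and counting non-zero exits. The following Python sketch shows that idea only; it is not the <code>count_fail_ratio</code> script, and the function name and the default number of runs are assumptions.</p>

```python
import subprocess
from typing import Sequence


def fail_ratio(command: Sequence[str], runs: int = 20) -> float:
    """Run command `runs` times and return the fraction of non-zero exits."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    failures = 0
    for _ in range(runs):
        # Output is discarded; only the exit status counts as pass or fail.
        result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures / runs
```

<p>For example, <code>fail_ratio(["prove", "-l", "t/18-backend-qemu.t"], runs=100)</code> would give a rough estimate of how often the module fails locally.</p>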
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Stable test locally, in Travis CI and OBS</li>
<li><strong>AC2:</strong> No significant increase in runtime</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>If a quick fix is not feasible, please at least prevent the flaky test result, e.g. by bumping the timeout in the test module and referencing this ticket</li>
<li>As old Travis CI logs give no indication of the runtime of individual test modules, I suggest introducing an environment variable, as we already have for openQA, that adds the prove option <code>--timer</code>, and setting it within CI but not by default locally</li>
<li>Crosscheck locally, compare to old results</li>
<li>If necessary bump up the timeout, but ensure that the performance regression addressed in caa97e71 does not return</li>
</ul>
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>Retrigger test</p>
openQA Project - action #75265 (Resolved): sporadic errors in test suite of perl-Mojo-IOLoop-Read... | https://progress.opensuse.org/issues/75265 | 2020-10-25T13:44:15Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p><a href="https://build.opensuse.org/package/live_build_log/devel:openQA:Leap:15.1/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Leap_15.1/x86_64">https://build.opensuse.org/package/live_build_log/devel:openQA:Leap:15.1/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Leap_15.1/x86_64</a> shows</p>
<pre><code>[ 97s] TEST error print
[ 97s] TEST error print
[ 97s] t/01_run.t ............... ok
[ 108s] t/02_parallel.t .......... ok
[ 109s] Can't use an undefined value as filehandle reference at lib/Mojo/IOLoop/ReadWriteProcess.pm line 298.
[ 126s] t/03_func.t .............. ok
[ 127s] t/04_queues.t ............ ok
[ 127s] t/05_serialize.t ......... ok
[ 131s] t/06_events.t ............ ok
[ 191s]
[ 191s] # Failed test 'collect_status fired 8 times'
[ 191s] # at t/07_autodetect.t line 411.
[ 191s] # got: undef
[ 191s] # expected: '1'
[ 191s]
[ 191s] # Failed test 'new_subprocess fired 7 times'
[ 191s] # at t/07_autodetect.t line 412.
[ 191s] # got: '9'
[ 191s] # expected: '8'
[ 191s]
[ 191s] # Failed test 'detection works'
[ 191s] # at t/07_autodetect.t line 414.
[ 191s] # got: '9'
[ 191s] # expected: '8'
[ 191s] # bless( {
[ 191s] # '_status' => -1,
[ 191s] # 'args' => [],
[ 191s] # 'error' => bless( [], 'Mojo::Collection' ),
[ 191s] # 'error_stream' => bless( \*Symbol::GEN122, 'IO::Handle' ),
[ 191s] # 'events' => {},
[ 191s] # 'execute' => '/home/abuild/rpmbuild/BUILD/Mojo-IOLoop-ReadWriteProcess-0.28/t/data/subreaper/roulette.sh',
[ 191s] # 'process_id' => 1744,
[ 191s] # 'read_stream' => bless( \*Symbol::GEN120, 'IO::Handle' ),
[ 191s] # 'separate_err' => 1,
[ 191s] # 'session' => bless( {
[ 191s] # 'collect_status' => 1,
[ 191s] # 'events' => {
[ 191s] # 'collected' => [
[ 191s] # sub { "DUMMY" }
[ 191s] # ],
[ 191s] # 'collected_orphan' => [
[ 191s] # sub { "DUMMY" }
[ 191s] # ]
[ 191s] # },
[ 191s] # 'handler' => undef,
[ 191s] # 'orphans' => {
[ 191s] # '1744' => bless( {
[ 191s] # '_status' => 256,
[ 191s] # 'process_id' => 1744
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1748' => bless( {
[ 191s] # '_status' => 256,
[ 191s] # 'process_id' => 1748
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1749' => bless( {
[ 191s] # '_status' => 256,
[ 191s] # 'process_id' => 1749
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1755' => bless( {
[ 191s] # '_status' => 256,
[ 191s] # 'process_id' => 1755
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1756' => bless( {
[ 191s] # '_status' => 256,
[ 191s] # 'process_id' => 1756
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1762' => bless( {
[ 191s] # '_status' => 0,
[ 191s] # 'process_id' => 1762
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1763' => bless( {
[ 191s] # '_status' => 0,
[ 191s] # 'process_id' => 1763
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1773' => bless( {
[ 191s] # '_status' => 0,
[ 191s] # 'process_id' => 1773
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' ),
[ 191s] # '1774' => bless( {
[ 191s] # '_status' => 0,
[ 191s] # 'process_id' => 1774
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' )
[ 191s] # },
[ 191s] # 'process_table' => {
[ 191s] # '1744' => \$VAR1
[ 191s] # },
[ 191s] # 'subreaper' => 1
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess::Session' ),
[ 191s] # 'set_pipes' => 1,
[ 191s] # 'subreaper' => 1,
[ 191s] # 'write_stream' => bless( \*Symbol::GEN121, 'IO::Handle' )
[ 191s] # }, 'Mojo::IOLoop::ReadWriteProcess' )
[ 191s] # Looks like you failed 3 tests of 4.
[ 191s]
[ 191s] # Failed test 'subreaper_bash_roulette'
[ 191s] # at t/07_autodetect.t line 418.
[ 191s] 0 at t/07_autodetect.t line 414.
[ 191s] # Tests were run but no plan was declared and done_testing() was not seen.
[ 191s] # Looks like your test exited with 255 just after 7.
[ 191s] t/07_autodetect.t ........
[ 191s] Dubious, test returned 255 (wstat 65280, 0xff00)
[ 191s] Failed 1/7 subtests
[ 191s] (less 2 skipped subtests: 4 okay)
[ 193s] t/08_ioloop.t ............ ok
[ 194s] t/09_session.t ........... ok
[ 195s] t/10_cgroupv1.t .......... ok
[ 196s] t/10_cgroupv2.t .......... ok
[ 197s] t/11_containers.t ........ skipped: This test works only if you have cgroups permissions
[ 255s] t/12_mocked_container.t .. ok
[ 256s] t/13_shared.t ............ skipped: Skipped unless TEST_SHARED is set
[ 256s]
[ 256s] Test Summary Report
[ 256s] -------------------
[ 256s] t/07_autodetect.t (Wstat: 65280 Tests: 7 Failed: 1)
[ 256s] Failed test: 7
[ 256s] Non-zero exit status: 255
[ 256s] Parse errors: No plan found in TAP output
[ 256s] Files=15, Tests=62, 162 wallclock secs ( 0.25 usr 0.15 sys + 11.89 cusr 2.20 csys = 14.49 CPU)
[ 256s] Result: FAIL
[ 256s] Failed 1/15 test programs. 1/62 subtests failed.
</code></pre>
<p>So there is unhandled output in t/01_run.t, the Perl warning "Can't use an undefined value as filehandle reference at lib/Mojo/IOLoop/ReadWriteProcess.pm line 298." in t/03_func.t and errors in t/07_autodetect.t.</p>
<p>Then in <a href="https://build.opensuse.org/package/live_build_log/devel:openQA:Leap:15.2/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Leap_15.2/aarch64">https://build.opensuse.org/package/live_build_log/devel:openQA:Leap:15.2/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Leap_15.2/aarch64</a></p>
<pre><code>[ 187s] t/11_containers.t ........ skipped: This test works only if you have cgroups permissions
[ 211s]
[ 211s] # Failed test 'procs interface contains the added pids'
[ 211s] # at t/12_mocked_container.t line 37.
[ 211s] # got: ''
[ 211s] # expected: '1785
[ 211s] # '
[ 211s] # 1785
[ 244s] # Looks like you failed 1 test of 43.
[ 244s]
[ 244s] # Failed test 'container_3'
[ 244s] # at t/12_mocked_container.t line 258.
[ 244s] # Looks like you failed 1 test of 3.
[ 244s] t/12_mocked_container.t ..
[ 244s] Dubious, test returned 1 (wstat 256, 0x100)
[ 244s] Failed 1/3 subtests
</code></pre>
<p>and in<br>
<a href="https://build.opensuse.org/package/live_build_log/devel:openQA:Leap:15.2/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Leap_15.2/x86_64">https://build.opensuse.org/package/live_build_log/devel:openQA:Leap:15.2/perl-Mojo-IOLoop-ReadWriteProcess/openSUSE_Leap_15.2/x86_64</a></p>
<pre><code>[ 70s] t/03_func.t .............. ok
[ 71s] t/04_queues.t ............ ok
[ 3679s] qemu-system-x86_64: terminating on signal 15 from pid 18026 ()
Job seems to be stuck here, killed. (after 3600 seconds of inactivity)
</code></pre>
<p>So the job got stuck in the tests.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> All tests included in OBS checks run stably, i.e. tests are either stabilized or excluded from running</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Try to reproduce the problems locally and either fix them or exclude them from checks within the spec file with corresponding comments</li>
</ul>
openQA Tests - action #69787 (Resolved): [qe-core][qam][sporadic] test fails in rsync_client not ... | https://progress.opensuse.org/issues/69787 | 2020-08-10T15:11:33Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>openQA test in scenario sle-15-Server-DVD-Updates-x86_64-qam-rsync-client@64bit fails in<br>
<a href="https://openqa.suse.de/tests/4544454/modules/rsync_client/steps/10" class="external">rsync_client</a><br>
with</p>
<pre><code>[2020-08-10T16:49:17.503 CEST] [debug] barrier wait 'rsync_setup'
[2020-08-10T16:49:17.503 CEST] [debug] tests/console/rsync_client.pm:28 called lockapi::barrier_wait
[2020-08-10T16:49:17.503 CEST] [debug] <<< testapi::record_info(title="Paused", output="Wait for rsync_setup (on parent job)", result="ok")
[2020-08-10T16:49:17.584 CEST] [info] ::: lockapi::_try_lock: Retry 1 of 7...
[2020-08-10T16:49:27.662 CEST] [info] ::: lockapi::_try_lock: Retry 2 of 7...
[2020-08-10T16:49:37.736 CEST] [info] ::: lockapi::_try_lock: Retry 3 of 7...
[2020-08-10T16:49:47.810 CEST] [info] ::: lockapi::_try_lock: Retry 4 of 7...
[2020-08-10T16:49:57.876 CEST] [info] ::: lockapi::_try_lock: Retry 5 of 7...
[2020-08-10T16:50:07.944 CEST] [info] ::: lockapi::_try_lock: Retry 6 of 7...
[2020-08-10T16:50:18.008 CEST] [info] ::: lockapi::_try_lock: Retry 7 of 7...
[2020-08-10T16:50:28.008 CEST] [debug] tests/console/rsync_client.pm:28 called lockapi::barrier_wait
[2020-08-10T16:50:28.009 CEST] [debug] <<< bmwqemu::mydie(cause_of_death="barrier 'rsync_setup': lock owner already finished")
[2020-08-10T16:50:28.089 CEST] [info] ::: basetest::runtest: # Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41.
</code></pre>
<p>whereas the corresponding server code is</p>
<pre><code>[2020-08-10T16:54:21.049 CEST] [debug] barrier wait 'rsync_setup'
[2020-08-10T16:54:21.049 CEST] [debug] tests/console/rsync_server.pm:67 called lockapi::barrier_wait
[2020-08-10T16:54:21.049 CEST] [debug] <<< testapi::record_info(title="Paused", output="Wait for rsync_setup (on parent job)", result="ok")
[2020-08-10T16:54:21.091 CEST] [debug] barrier 'rsync_setup' not released, sleeping 5s
[2020-08-10T16:54:26.121 CEST] [debug] barrier 'rsync_setup' not released, sleeping 5s
…
[2020-08-10T16:59:32.771 CEST] [debug] barrier 'rsync_setup' not released, sleeping 5s
[2020-08-10T16:59:36.015 CEST] [debug] backend got TERM
[2020-08-10T16:59:36.015 CEST] [debug] autotest received signal TERM, saving results of current test before exiting
</code></pre>
<p>So the client already gave up waiting after about a minute, whereas the server had not even reached this point.</p>
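<p>The timing gap can be worked out from the timestamps in the two excerpts. The Python snippet below is just that arithmetic, assuming both logs use the same clock; the variable and function names are chosen for this example.</p>

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two timestamps in the log format used above."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the client and server excerpts (timezone suffix dropped).
client_wait_start = "2020-08-10T16:49:17.503"
client_gave_up = "2020-08-10T16:50:28.008"
server_reached_barrier = "2020-08-10T16:54:21.049"

client_window = seconds_between(client_wait_start, client_gave_up)
server_lag = seconds_between(client_gave_up, server_reached_barrier)
print(f"client waited {client_window:.3f}s")
print(f"server arrived {server_lag:.3f}s after the client gave up")
```

<p>The client window comes out at roughly 70 seconds, while the server reached its <code>barrier_wait</code> almost four minutes after the client had died, so a few more retries alone would not have bridged the gap in this run.</p>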
<a name="Reproducible"></a>
<h2 >Reproducible<a href="#Reproducible" class="wiki-anchor">¶</a></h2>
<p>Fails often but not always. <a class="user active user-mention" href="https://progress.opensuse.org/users/11830">@dzedro</a> tends to label such failures with <a class="issue tracker-6 status-3 priority-5 priority-high3 closed behind-schedule" title="coordination: [epic] multimachine test fails with symptoms &quot;websocket refusing connection&quot; and other unclear re... (Resolved)" href="https://progress.opensuse.org/issues/65118">#65118</a>. All tests labeled that way can be investigated for this issue.</p>
<a name="Expected-result"></a>
<h2 >Expected result<a href="#Expected-result" class="wiki-anchor">¶</a></h2>
<p>A good case: <a href="https://openqa.suse.de/tests/4542505" class="external">20200810-1</a></p>
<p>The test should be robust enough to cover the synchronisation period needed on both the client and the server side.</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Always latest result in this scenario: <a href="https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam-rsync-client&version=15" class="external">latest</a></p>
openQA Project - action #64776 (Resolved): [cache][worker] cache service suddenly stopped to down... | https://progress.opensuse.org/issues/64776 | 2020-03-24T16:20:22Z | okurz (okurz@suse.com)
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Observed on openqaworker6.s.d as reported in <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [functional][y] Test that test data files exist in case mentioned in the schedule (Resolved)" href="https://progress.opensuse.org/issues/64752">#64752</a> (ticket lost due to db recovery): the cache service all of a sudden failed to download any newly requested assets. Jobs that rely on assets still existing in the cache are still running fine, while any job that triggers a request to download a new asset ends up incomplete. There are no obvious mentions in the log files of what went wrong.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> There is more detail than just "Failed to download", accessible to users</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Investigate whether more details could be provided and forwarded to the logs as well as to the reason shown on the job details web page.</li>
<li>If possible, prevent the situation that was observed in <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [functional][y] Test that test data files exist in case mentioned in the schedule (Resolved)" href="https://progress.opensuse.org/issues/64752">#64752</a></li>
</ul>
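<p>The idea of forwarding more than just "Failed to download" can be sketched as follows. This is an illustrative Python sketch only and not the cache service's actual code; <code>DownloadError</code>, <code>download_with_details</code> and the <code>fetch</code> callback are hypothetical names. The point is that the concrete cause of the last failed attempt is kept and ends up in the message a user would see.</p>

```python
class DownloadError(Exception):
    """Download failure that carries the underlying cause (hypothetical helper)."""

    def __init__(self, asset, reason, attempts):
        super().__init__(
            f'Failed to download "{asset}" after {attempts} attempt(s): {reason}'
        )
        self.asset = asset
        self.reason = reason
        self.attempts = attempts


def download_with_details(asset, fetch, max_attempts=3):
    """Call fetch(asset) up to max_attempts times.

    `fetch` is an assumed callback that returns the downloaded data or raises.
    If every attempt fails, a DownloadError with the last cause is raised.
    """
    last_reason = "unknown"
    for _attempt in range(max_attempts):
        try:
            return fetch(asset)
        except Exception as exc:
            # Keep the concrete cause instead of swallowing it
            last_reason = f"{type(exc).__name__}: {exc}"
    raise DownloadError(asset, last_reason, max_attempts)
```

<p>The resulting message could then be written to the logs and used as the reason on the job details page, so that users see why the download failed.</p>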
<a name="Workaround"></a>
<h2 >Workaround<a href="#Workaround" class="wiki-anchor">¶</a></h2>
<p>In <a class="issue tracker-4 status-3 priority-4 priority-default closed child" title="action: [functional][y] Test that test data files exist in case mentioned in the schedule (Resolved)" href="https://progress.opensuse.org/issues/64752">#64752</a> it helped to restart the cache service and the cache service minion.</p>
<a name="Further-details"></a>
<h2 >Further details<a href="#Further-details" class="wiki-anchor">¶</a></h2>
<p>Original subject content: <code>just "Failed to download", no further reason.</code></p>
openQA Infrastructure - action #64685 (Resolved): openqaworker1 showing NVMe problems "kernel: nv... https://progress.opensuse.org/issues/64685 2020-03-20T13:20:34Z okurz okurz@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<pre><code>[20/03/2020 11:39:23] <DimStar> Martchus_: any idea what's up here? https://openqa.opensuse.org/tests/overview?arch=&machine=&modules=&todo=1&distri=microos&distri=opensuse&version=Tumbleweed&build=20200318&groupid=1# ?
[20/03/2020 11:52:02] <guillaume_g> Defolos: fyi, https://bugzilla.opensuse.org/show_bug.cgi?id=1167232
[20/03/2020 11:52:05] <|Anna|> openSUSE bug 1167232 in openSUSE Tumbleweed "Vagrant Tumbleweed 20200317 fails due to unsupported configuration PS2" [Normal, New]
[20/03/2020 11:53:02] <DimStar> okurz: the nvme issues we had last time around was ow4, right?
</code></pre>
<p>On w1, the system journal shows:</p>
<pre><code>nvme nvme0: Abort status: 0x0
</code></pre>
openQA Infrastructure - action #64580 (Workable): Detect and recover from I/O blocked worker mach... https://progress.opensuse.org/issues/64580 2020-03-18T15:53:16Z okurz okurz@suse.com
<a name="Motivation"></a>
<h2 >Motivation<a href="#Motivation" class="wiki-anchor">¶</a></h2>
<p>In <a class="issue tracker-4 status-3 priority-3 priority-lowest closed" title="action: all arm worker die after some time (Resolved)" href="https://progress.opensuse.org/issues/41882">#41882</a> we identified arm machines becoming completely unresponsive; we now detect these situations automatically and recover from them. But there are also cases where systems are I/O blocked: the machine still responds to ping but is not "usable". In this situation the machine can still have openQA jobs assigned, which are then stuck for many hours. Also, the machine is not detected as broken in grafana and hence is never recovered automatically. We should detect a situation like this and recover automatically.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Machines that stay in an I/O blocked state for multiple minutes/hours are detected and recovered, e.g. with a reboot, similar to or the same as for "worker completely down"</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Check whether grafana already has measurements that could be used to trigger alerts, which in turn trigger reboot actions in the same way as <a href="https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1" class="external">https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1</a></li>
<li>If not, find an additional measurement/alert for this purpose</li>
<li>Ensure the alerts and notification configurations are covered in salt</li>
</ul>
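<p>As one possible additional measurement, the Linux pressure stall information in <code>/proc/pressure/io</code> (available on kernels with PSI enabled) reports the share of time in which tasks were stalled on I/O. The following Python sketch is an assumption about how such a check could look and is not taken from the existing monitoring setup; the function names, the 90% threshold and the choice of the <code>avg300</code> window are hypothetical.</p>

```python
def parse_pressure(text):
    """Parse the content of a /proc/pressure/* file into nested dicts.

    Example input line: "full avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_io_blocked(text, threshold=90.0, window="avg300"):
    """Return True if the "full" stall percentage in `window` reaches `threshold`.

    threshold and window are illustrative assumptions, not tuned values.
    """
    full = parse_pressure(text).get("full", {})
    return full.get(window, 0.0) >= threshold


if __name__ == "__main__":
    # Reads the real file only when run directly on a PSI-enabled system
    with open("/proc/pressure/io") as handle:
        print("I/O blocked:", is_io_blocked(handle.read()))
```

<p>A value like this could be collected as a metric and used for an alert that triggers the same recovery action as for a completely unresponsive worker.</p>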