action #67972
closedone module failed to upload results ending an otherwise fine job as failed (was: Some tests are flagged as failed whereas they passed)
Description
Observation¶
Some tests are flagged as failed whereas they passed. But one test result is none
and no screenshot uploaded.
Example: https://openqa.opensuse.org/tests/1296756
Within OSD infrastructure, probably since last deployment we observe the same, e.g.
https://openqa.suse.de/tests/4348455 - worker.log: Unable to read result-consoletest_finish.json: No such file or directory
https://openqa.suse.de/tests/4348369 - worker.log: Unable to read result-disable_grub_timeout.json: No such file or directory
https://openqa.suse.de/tests/4357369
Reproducible¶
Fails since (at least) Build 20200615-1 (current job)
Expected result¶
Last good: 20200614-1 (or more recent)
Further details¶
Always latest result in this scenario: latest
Updated by okurz over 4 years ago
- Subject changed from Some tests are flagged as failed whereas they passed to one module failed to upload results ending an otherwise fine job as failed (was: Some tests are flagged as failed whereas they passed)
- Category set to Regressions/Crashes
It is expected behaviour that a job can only end up as "passed" if all test modules are "passed" as well. So the question is really why this one job module failed to upload.
Updated by livdywan over 4 years ago
It seems the welcome test finishes successfully for all intents and purposes, by the time we get to ||| finished welcome we're past default an unset result to ok
.
The os-autoinst logs seem to match the last succesful run as well.
Updated by okurz over 4 years ago
- Priority changed from Normal to Low
- Target version set to future
- Difficulty set to hard
Unless we can reproduce this I consider the issue infeasible to be fixed.
@ggardet_arm I hope you can help to find jobs where this happens again if at all.
Updated by okurz over 4 years ago
- Has duplicate action #68104: [tools] test fails being unable to read result-<test>.json: No such file or directory added
Updated by okurz over 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Low to Urgent
- Target version changed from future to Ready
- Difficulty deleted (
hard)
seems to be a problem we recently introduced and it can be reproduced on osd as well, see #68104
I merged in the content from #68104
Likely the change is one of the following from http://mailman.suse.de/mlarch/SuSE/openqa/2020/openqa.2020.06/msg00003.html :
- Update to version 4.6.1591699942.8543386ed:
* Dependency cron 200609
* Add notes about arbitrary reasons
* Remove out-dated comments; stopping will no longer be delayed as stated
* Use newly introduced constants for worker commands and stop reasons
* Improve dealing with worker commands and stop reasons
* Add missing copyright header in constants modules
* Remove stop type 'scheduler_abort' which is not used anymore
* Improve coding style in set_running
* Prevent Perl warning in downloader code
* Improve UI for selecting the display order of builds
* Select interval checkbox for interval from query parameter if applicable
* Improve wording in the index page form, mainly to improve consistency
* Improve coding style in JavaScript/CSS/template files
* Avoid accumulating `comments_page` query parameters when using pagination
* Fix typo in pagination code
* Add autorefresh in the index page (#3141)
* Maintain pagination when an user role changes
* Fix pagination in the group overview page
* Dependency cron 200605
* Turn two schema functions into methods
* Hide livestream when loading test details and job is currently uploading
* Update out-dated comment; the image upload is async via Mojo::IOLoop->subprocess
* t: Use Test::MockModule->redefine in t/28-logging.t
* Only reload test schedule when test_order.json has been changed
* Make restart button work again after the job state changed
* Dependency cron 200604
* t: Allow to configure amount of unstable/rejective workers in 05-scheduler-full.t
* t: Simplify worker startup in 05-scheduler-full with map
* t: Improve test description strings in 05-scheduler-full
* t: Fix warning in 28-logging.t with strict MockModule and Test::Warnings
* Extract methods for "ignore_known_images" and "ignore_known_files" in Worker::Job
* Simplify Worker::Job slightly
* Remove unnecessary single-line step in Worker::Job
* Use the more idiomatic ->find instead of ->single
* Replace duplicate method OpenQA::Schema::Result::Workers::to_string
* Delete unnecessary pre-declaration of log_warning in OpenQA::Log
* Generalize handling the video
* Remove dead code from OpenQA::WebAPI::Plugin::ObsRsync
* doc: Improve virtio/svirt serial terminal docs
* Fix tests on OBS when "ps" is not installed
* Use :report_warnings feature of Test::Warnings
* Mark "25-cache-service.t" as unstable test as observed in circleci
* Mark old issues and pull requests as stale
* Extract methods for "ignore_known_images" and "ignore_known_files" in Worker::Job
* Simplify Worker::Job slightly
* Remove unnecessary single-line step in Worker::Job
* t: Fix "Dubious, test returned 9" in 05-scheduler-full.t
* t: Fix unstable test 'Allocated maximum number' in 05-scheduler-full
* t: Sleep only when necessary when waiting for worker
* t: Fix sporadic test failures on mis-assignment of jobs in scheduling
* t: Increase robustness of 05-scheduler-full with reduced scheduler tick
* t: Prevent use of occupied port in all full-stack/scalability tests
* Makefile: Bump timeout limit for various tests for slow circleci
* t: Provide unique process name for worker in scalability test as well
* t: More cleanup in t/05-scheduler-full.t
* Tue Jun 02 2020 coolo@suse.com
- Update to version 4.6.1591092399.b9fb68988:
* t: Make output nicer on time limit
* Dependency cron 200601
* Avoid stopping the live log when the worker already works on different job
* worker: Keep uploading results until logs and assets have been uploaded
* worker: Log every 10th uploaded chunk to autoinst-log.txt as well
* Keep 'live' tab shown while uploading for improved feedback
* Avoid rendering 'live' tab when job is already done
* Update comments in worker function _upload_result
* Fix style issue in running.js so jshint shows no errors
* Do not shadow $errors in utils-job-templates test
* Capture base_state.json JSON error warning
* Add '--merge' option to prove
* Show an explicit warning when failing to wakeup scheduler
* Prefetch parent job in JobModules::job_modules()
* Capture screenshot created message in screenshot test
* Dependency cron 200529
* t: Make process handling more robust with IPC::Run
* Log errors when claiming OpenID identity
* Fix visual artifact in audio result box
* t: Use OpenQA::Test::TimeLimit in basic.t for a start
* t: Add test module to check for test runtime
os-autoinst changes:
* Tue Jun 09 2020 okurz@suse.com
- Update to version 4.6.1591696864.96b3acb2:
* Move MakeMaker dependency to build_requires
* Fix missing qemu-kvm dependency on non-x86_64
* Slightly simplify amt backend
* spec: Fix 'update-deps' not to be packaged
* Deps: rename %requires to %main_requires
* Fix wrong comment about replacerect
* Fix YAML::PP dependency
* Add flags for non-verbose output to ffmpeg example
* Use :report_warnings feature of Test::Warnings
* Update documentation of videoencoder.cpp
* Support using an external video encoder
* Mark old issues and pull requests as stale
* Capture isotovideo output in isotovideo test
* Dependencies: Add code to update Dockerfile
* Makefiles: set CXXFLAGS not CFLAGS or CPPFLAGS
* Improve coding style in code file migration of QEMU backend
* Improve 'Migrate to file failed [...]' error message to include unit
* make update-deps
* Dependencies: add tools/update-deps
* Add YAML::PP dependency
* Simplify use of Test::Output with bare blocks rather than sub refs
Judging from the changelog text I suspect it is one of
* Dependency cron 200609
* Add notes about arbitrary reasons
* Remove out-dated comments; stopping will no longer be delayed as stated
* Use newly introduced constants for worker commands and stop reasons
* Improve dealing with worker commands and stop reasons
* Remove stop type 'scheduler_abort' which is not used anymore
* Improve coding style in set_running
* Prevent Perl warning in downloader code
* Dependency cron 200605
* Hide livestream when loading test details and job is currently uploading
* Update out-dated comment; the image upload is async via Mojo::IOLoop->subprocess
* Only reload test schedule when test_order.json has been changed
* Dependency cron 200604
* Dependency cron 200601
* Avoid stopping the live log when the worker already works on different job
* worker: Keep uploading results until logs and assets have been uploaded
* worker: Log every 10th uploaded chunk to autoinst-log.txt as well
* Keep 'live' tab shown while uploading for improved feedback
* Avoid rendering 'live' tab when job is already done
* Update comments in worker function _upload_result
* Do not shadow $errors in utils-job-templates test
* Capture base_state.json JSON error warning
* Prefetch parent job in JobModules::job_modules()
* Capture screenshot created message in screenshot test
* Dependency cron 200529
One can reduce the diff further by looking into actual source code changes.
EDIT: Based on looking into actual source code changes:
36d40a917 Dependency cron 200609
989c72f81 Use newly introduced constants for worker commands and stop reasons
86da6e1e8 Improve dealing with worker commands and stop reasons
c12830442 Remove stop type 'scheduler_abort' which is not used anymore
a58b3e925 Dependency cron 200605
3a566d824 Turn two schema functions into methods
0d1061afd Only reload test schedule when test_order.json has been changed
857c44176 Dependency cron 200604
f3c04a623 Avoid stopping the live log when the worker already works on different job
a61c520a9 worker: Keep uploading results until logs and assets have been uploaded
aab58ce63 worker: Log every 10th uploaded chunk to autoinst-log.txt as well
ea13c22d6 Prefetch parent job in JobModules::job_modules()
3d5a2b7fd Dependency cron 200529
Updated by okurz over 4 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
We suspect 0d1061afd816b190fdd265d7da412ce8f78bbe3b to be the culprit. I will prepare a partial revert of that commit and check what can be kept from tests.
Updated by tinita over 4 years ago
I was debugging, and found out that I can reproduce it by adding a sleep here:
# os-autoinst/basetest.pm
sub runtest {
...
$self->done();
sleep 5; # happens sometimes, using 10 seconds it happens always
$self->{execution_time} = execution_time($starttime);
...
In that case I get the Unable to read result-boot.json: No such file or directory
message, the test is labeled as none
, and the job is marked as failed.
So a timing issue as expected. That might give us something to work with for fixing it.
Updated by okurz over 4 years ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
merged.
@tinita feel free to take over as you worked on reproducing the problem locally.
Updated by tinita over 4 years ago
The fix (revert) has been deployed.
Created a new PR for https://progress.opensuse.org/issues/58700
Updated by tinita about 4 years ago
- Status changed from Feedback to Resolved
Please reopen if necessary.