action #80226
opencoordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA
job incomplete with autoinst-log.txt ending just in the middle
0%
Description
Observation¶
https://openqa.suse.de/tests/5057715/file/autoinst-log.txt shows at the end of the log file
[0m[2020-11-23T13:50:54.432 CET] [debug] tests/installation/await_install.pm:129 called testapi::check_screen
[2020-11-23T13:50:54.432 CET] [debug] <<< testapi::check_screen(mustmatch=[
"rebootnow",
"yast2_wrong_digest",
"yast2_package_retry",
"yast_error",
"ERROR-removing-package",
"DIALOG-packages-notifications"
], timeout=30)
[37m[2020-11-23T13:50:54.523 CET] [debug] no match: 29.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:50:55.524 CET] [debug] no match: 28.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:50:56.525 CET] [debug] no match: 27.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:50:57.526 CET] [debug] no match: 26.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:50:59.609 CET] [debug] WARNING: check_asserted_screen took 1.17 seconds for 18 candidate needles - make your needles more specific
[0m[37m[2020-11-23T13:50:59.610 CET] [debug] no match: 25.9s, best candidate: rebootnow-20160504 (0.51)
[0m[37m[2020-11-23T13:50:59.710 CET] [debug] no match: 24.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:00.703 CET] [debug] no match: 23.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:01.703 CET] [debug] no match: 22.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:02.704 CET] [debug] no match: 21.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:04.787 CET] [debug] WARNING: check_asserted_screen took 1.17 seconds for 18 candidate needles - make your needles more specific
[0m[37m[2020-11-23T13:51:04.787 CET] [debug] no match: 20.8s, best candidate: DIALOG-packages-notifications-20191220 (0.39)
[0m[37m[2020-11-23T13:51:04.889 CET] [debug] no match: 19.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:05.884 CET] [debug] no match: 18.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:06.882 CET] [debug] no match: 17.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:07.883 CET] [debug] no match: 16.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:09.965 CET] [debug] WARNING: check_asserted_screen took 1.17 seconds for 18 candidate needles - make your needles more specific
[0m[37m[2020-11-23T13:51:09.965 CET] [debug] no match: 15.6s, best candidate: DIALOG-packages-notifications-20191220 (0.39)
[0m[37m[2020-11-23T13:51:10.061 CET] [debug] no match: 14.4s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[0m[37m[2020-11-23T13:51:10.971 CET] [debug] no change: 13.4s
[0m[37m[2020-11-23T13:51:11.972 CET] [debug] no change: 12.4s
[0m[37m[2020-11-23T13:51:12.973 CET] [debug] no change: 11.4
and that's it. Nothing more.
https://openqa.suse.de/tests/5057715/file/worker-log.txt looks pretty complete:
[2020-11-23T13:51:08.0820 CET] [debug] [pid:9516] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5057715/status
[2020-11-23T13:51:13.0051 CET] [debug] [pid:9516] Stopping job 5057715 from openqa.suse.de: 05057715-sle-15-SP3-Migration-from-SLE12-SPx-ppc64le-Build88.1-offline_sles12sp4_ltss_pscc_lp_def_full@ppc64le - reason: died
[2020-11-23T13:51:13.0052 CET] [debug] [pid:9516] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5057715/status
[2020-11-23T13:51:13.0053 CET] [debug] [pid:9516] Registered process:63427
[2020-11-23T13:51:15.0053 CET] [debug] [pid:9516] Registered process:63482
[2020-11-23T13:51:17.0085 CET] [info] [pid:68579] Uploading patch_sle-rpm-qa.txt
[2020-11-23T13:51:17.0121 CET] [info] [pid:68579] Uploading record_disk_info-btrfs-filesystem-df.txt
[2020-11-23T13:51:17.0148 CET] [info] [pid:68579] Uploading record_disk_info-btrfs-filesystem-usage.txt
[2020-11-23T13:51:17.0177 CET] [info] [pid:68579] Uploading record_disk_info-df.txt
[2020-11-23T13:51:17.0215 CET] [info] [pid:68579] Uploading record_disk_info-snapper-list.txt
[2020-11-23T13:51:17.0286 CET] [info] [pid:68579] Uploading vars.json
[2020-11-23T13:51:17.0323 CET] [info] [pid:68579] Uploading autoinst-log.txt
[2020-11-23T13:51:17.0367 CET] [info] [pid:68579] Uploading worker-log.txt
Acceptance criteria¶
- AC1: If the autoinst-log.txt is incomplete and not even showing that the job finished, the job should automatically retrigger
Suggestions¶
- Research how we currently automatically retrigger incompletes within the worker code
- Try to reproduce this error in tests by simulating premature isotovideo endings where isotovideo does not even have a proper chance to run till the end
- Extend to retrigger if there was no proper exit code or anything from isotovideo
Updated by livdywan over 3 years ago
Discussing this in the standup a bit:
- Do we know that the upload succeeded?
Uploading autoinst-log.txt
suggests yes, but maybe it wasn't - Did something crash while
autoinst-log.txt
was written? - Is retrigger what we want - since there might be a bug here preventing the logs from being written, ostensibly we'll just retrigger the same bug
- Do we know that e.g. isotovideo finished writing logs to say "should be complete, check for the last message"?
- Who writes
autoinst-log.txt
? - Is this the same upload code as for assets? Maybe we should unify these to make it more robust?
- Shuold we checksum uploaded files to ensure it's complete
Updated by Xiaojing_liu over 3 years ago
cdywan wrote:
Discussing this in the standup a bit:
- Do we know that the upload succeeded?
Uploading autoinst-log.txt
suggests yes, but maybe it wasn't- Did something crash while
autoinst-log.txt
was written?
I tried to reproduce this issue but failed. Not sure how to reproduce it.
- Is retrigger what we want - since there might be a bug here preventing the logs from being written, ostensibly we'll just retrigger the same bug
I think retriggering is a good way, because this issue is random, and it hasn't happened in recent days.
- Do we know that e.g. isotovideo finished writing logs to say "should be complete, check for the last message"?
there should be something like Isotovideo exit status: 1
in autoinst-log.txt when isotovideo finishes.
- Who writes
autoinst-log.txt
?
If I understand correctly isotovide
and worker
both write it.
Is this the same upload code as for assets? Maybe we should unify these to make it more robust?
Shuold we checksum uploaded files to ensure it's complete
Could we simply retrigger it if the log file is incomplete by checking Isotovideo exit status: 1
?
Updated by okurz over 3 years ago
- Status changed from Workable to New
moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size