Project

General

Profile

action #80226

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA

job incomplete with autoinst-log.txt ending just in the middle

Added by okurz about 1 year ago. Updated 9 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2020-11-23
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

https://openqa.suse.de/tests/5057715/file/autoinst-log.txt shows at the end of the log file

[2020-11-23T13:50:54.432 CET] [debug] tests/installation/await_install.pm:129 called testapi::check_screen
[2020-11-23T13:50:54.432 CET] [debug] <<< testapi::check_screen(mustmatch=[
  "rebootnow",
  "yast2_wrong_digest",
  "yast2_package_retry",
  "yast_error",
  "ERROR-removing-package",
  "DIALOG-packages-notifications"
], timeout=30)
[2020-11-23T13:50:54.523 CET] [debug] no match: 29.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:50:55.524 CET] [debug] no match: 28.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:50:56.525 CET] [debug] no match: 27.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:50:57.526 CET] [debug] no match: 26.9s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:50:59.609 CET] [debug] WARNING: check_asserted_screen took 1.17 seconds for 18 candidate needles - make your needles more specific
[2020-11-23T13:50:59.610 CET] [debug] no match: 25.9s, best candidate: rebootnow-20160504 (0.51)
[2020-11-23T13:50:59.710 CET] [debug] no match: 24.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:00.703 CET] [debug] no match: 23.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:01.703 CET] [debug] no match: 22.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:02.704 CET] [debug] no match: 21.8s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:04.787 CET] [debug] WARNING: check_asserted_screen took 1.17 seconds for 18 candidate needles - make your needles more specific
[2020-11-23T13:51:04.787 CET] [debug] no match: 20.8s, best candidate: DIALOG-packages-notifications-20191220 (0.39)
[2020-11-23T13:51:04.889 CET] [debug] no match: 19.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:05.884 CET] [debug] no match: 18.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:06.882 CET] [debug] no match: 17.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:07.883 CET] [debug] no match: 16.6s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:09.965 CET] [debug] WARNING: check_asserted_screen took 1.17 seconds for 18 candidate needles - make your needles more specific
[2020-11-23T13:51:09.965 CET] [debug] no match: 15.6s, best candidate: DIALOG-packages-notifications-20191220 (0.39)
[2020-11-23T13:51:10.061 CET] [debug] no match: 14.4s, best candidate: DIALOG-packages-notifications-20191220 (0.18)
[2020-11-23T13:51:10.971 CET] [debug] no change: 13.4s
[2020-11-23T13:51:11.972 CET] [debug] no change: 12.4s
[2020-11-23T13:51:12.973 CET] [debug] no change: 11.4

and that's it. Nothing more.

https://openqa.suse.de/tests/5057715/file/worker-log.txt looks pretty complete:

[2020-11-23T13:51:08.0820 CET] [debug] [pid:9516] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5057715/status
[2020-11-23T13:51:13.0051 CET] [debug] [pid:9516] Stopping job 5057715 from openqa.suse.de: 05057715-sle-15-SP3-Migration-from-SLE12-SPx-ppc64le-Build88.1-offline_sles12sp4_ltss_pscc_lp_def_full@ppc64le - reason: died
[2020-11-23T13:51:13.0052 CET] [debug] [pid:9516] REST-API call: POST http://openqa.suse.de/api/v1/jobs/5057715/status
[2020-11-23T13:51:13.0053 CET] [debug] [pid:9516] Registered process:63427
[2020-11-23T13:51:15.0053 CET] [debug] [pid:9516] Registered process:63482
[2020-11-23T13:51:17.0085 CET] [info] [pid:68579] Uploading patch_sle-rpm-qa.txt
[2020-11-23T13:51:17.0121 CET] [info] [pid:68579] Uploading record_disk_info-btrfs-filesystem-df.txt
[2020-11-23T13:51:17.0148 CET] [info] [pid:68579] Uploading record_disk_info-btrfs-filesystem-usage.txt
[2020-11-23T13:51:17.0177 CET] [info] [pid:68579] Uploading record_disk_info-df.txt
[2020-11-23T13:51:17.0215 CET] [info] [pid:68579] Uploading record_disk_info-snapper-list.txt
[2020-11-23T13:51:17.0286 CET] [info] [pid:68579] Uploading vars.json
[2020-11-23T13:51:17.0323 CET] [info] [pid:68579] Uploading autoinst-log.txt
[2020-11-23T13:51:17.0367 CET] [info] [pid:68579] Uploading worker-log.txt

Acceptance criteria

  • AC1: If the autoinst-log.txt is incomplete and not even showing that the job finished, the job should automatically retrigger

Suggestions

  • Research how we currently automatically retrigger incompletes within the worker code
  • Try to reproduce this error in tests by simulating premature isotovideo endings where isotovideo does not even have a proper chance to run till the end
  • Extend to retrigger if there was no proper exit code or anything from isotovideo

History

#1 Updated by cdywan 9 months ago

Discussing this in the standup a bit:

  • Do we know that the upload succeeded? Uploading autoinst-log.txt suggests yes, but maybe it wasn't
  • Did something crash while autoinst-log.txt was written?
  • Is retrigger what we want - since there might be a bug here preventing the logs from being written, ostensibly we'll just retrigger the same bug
  • Do we know that e.g. isotovideo finished writing logs to say "should be complete, check for the last message"?
  • Who writes autoinst-log.txt?
  • Is this the same upload code as for assets? Maybe we should unify these to make it more robust?
  • Shuold we checksum uploaded files to ensure it's complete

#2 Updated by Xiaojing_liu 8 months ago

cdywan wrote:

Discussing this in the standup a bit:

  • Do we know that the upload succeeded? Uploading autoinst-log.txt suggests yes, but maybe it wasn't
  • Did something crash while autoinst-log.txt was written?

I tried to reproduce this issue but failed. Not sure how to reproduce it.

  • Is retrigger what we want - since there might be a bug here preventing the logs from being written, ostensibly we'll just retrigger the same bug

I think retriggering is a good way, because this issue is random, and it hasn't happened in recent days.

  • Do we know that e.g. isotovideo finished writing logs to say "should be complete, check for the last message"?

there should be something like Isotovideo exit status: 1 in autoinst-log.txt when isotovideo finishes.

  • Who writes autoinst-log.txt?

If I understand correctly isotovide and worker both write it.

  • Is this the same upload code as for assets? Maybe we should unify these to make it more robust?

  • Shuold we checksum uploaded files to ensure it's complete

Could we simply retrigger it if the log file is incomplete by checking Isotovideo exit status: 1?

#3 Updated by okurz 5 months ago

  • Status changed from Workable to New

moving all tickets without size confirmation by the team back to "New". The team should move the tickets back after estimating and agreeing on a consistent size

#4 Updated by okurz 5 months ago

  • Target version changed from Ready to future

#5 Updated by okurz 9 days ago

  • Parent task changed from #62420 to #102909

Also available in: Atom PDF