coordination #62420

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

[epic] Distinguish all types of incompletes

Added by okurz 12 months ago. Updated about 14 hours ago.

Status: Blocked
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-12-12
Due date: 2021-01-26
% Done: 46%
Estimated time: (Total: 0.00 h)
Difficulty:

Description

Motivation

As a test reviewer, I want to understand the reason for incompletes so that I know who should do what to fix them.

Acceptance criteria

  • AC1: All incompletes provide more details about the reason for the incompletion
  • AC2: Types of incompletes are distinguishable without needing to read logs
  • AC3: The incomplete reason is visualized in the UI (not only in logs)
  • AC4: If the reason is not known all available log details are accessible from the job

Acceptance tests

  • AT1-1: Given an incomplete job, When reading out the job from the openQA database, Then the incomplete reason is given
  • AT1-2: Same as AT1-1 but for a failed job, Then no incomplete reason is given
  • AT2-1: Given an incomplete job, When reading out the job over the API, Then the incomplete reason is rendered
  • AT3-1: Given an incomplete job, When showing job details in the webui, Then the incomplete reason is visible (not only in logs)
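
AT1-1 and AT1-2 can be read as a simple contract on a job record. The following is only a minimal sketch of that contract: the dictionary keys "result" and "reason" and the function name are assumptions for illustration, not the actual openQA schema or API.

```python
def check_reason_contract(job):
    """Return the acceptance tests violated by one job record (hypothetical shape)."""
    violations = []
    result = job.get("result")
    reason = job.get("reason")
    # AT1-1: an incomplete job must carry a reason.
    if result == "incomplete" and not reason:
        violations.append("AT1-1: incomplete job without reason")
    # AT1-2: a failed job must not carry an incomplete reason.
    if result == "failed" and reason:
        violations.append("AT1-2: failed job with incomplete reason")
    return violations


if __name__ == "__main__":
    print(check_reason_contract({"result": "incomplete", "reason": "setup failure"}))
    print(check_reason_contract({"result": "incomplete"}))
```

The same check could be applied to whatever the API renders for AT2-1, provided the rendered job exposes the two fields under these or similar names.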

Suggestions

  • Check where "setup failure" and other results are set without further details from other services, e.g. the cache service, and ensure that there is a hint about the problem source.
  • Extend the API to also accept the incomplete reason
  • Extend the UI to also show the incomplete reason
  • Split "setup failure" into more specific types
  • For example, we use "setup failure" in multiple cases but forward the result to the webui only as strings in the log files. In the case here we do not even have an information string that points out what the real problem was. We could at least show the available logs or even extract log excerpts, but I guess we already have this covered by showing the logs in the details tab when there is no other information available.

Further details

What should we do when the systemd service was asked to stop because we want to reboot the machine? IMHO we should abort as fast as possible on TERM but provide better information. This, together with the experience with different sources of incompletes over the past months, brings me to the conclusion that we just want to pass the internal "reason" we already have on the worker to the webui, e.g. the existing "setup-failure". "worker-shutdown" can be another reason. Based on the reason we can also decide whether to auto-duplicate: no retrigger for a "compilation-error", but a retrigger for "worker-shutdown".
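
The reason-based retrigger decision could look roughly like the following sketch. Only the two decisions named above are taken from the text. The table name, the function name, the default of "no retrigger" for unlisted reasons, and the idea of treating the text before the first colon as the category are all assumptions.

```python
# Retrigger decisions named in the description; any other reason falls back
# to the default, which is an assumption of this sketch.
RETRY_BY_REASON = {
    "worker-shutdown": True,
    "compilation-error": False,
}


def should_auto_duplicate(reason, default=False):
    """Decide whether a job with the given reason should be retriggered."""
    # Treat "category: details" as category only and normalize spaces,
    # so "worker shutdown: reboot" maps to "worker-shutdown".
    category = reason.split(":", 1)[0].strip().replace(" ", "-")
    return RETRY_BY_REASON.get(category, default)


if __name__ == "__main__":
    for reason in ("worker-shutdown", "compilation-error", "setup-failure"):
        print(reason, should_auto_duplicate(reason))
```

Keeping the decision in one table means that a new reason only needs one new entry instead of another conditional in the worker or the webui.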

Started as ticket "Improve reporting on incompletes with result "setup-failure" and no further explanation".

See #62237. There were many incompletes without much detail, e.g. https://openqa.suse.de/tests/3795872 shows just:

[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] +++ setup notes +++
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Running on QA-Power8-5-kvm:6 (Linux 4.12.14-lp151.27-default #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838) ppc64le)
[2020-01-17T11:01:50.0997 CET] [info] [pid:110583] +++ worker notes +++
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
[2020-01-17T11:01:51.0002 CET] [info] [pid:21796] Uploading autoinst-log.txt

The result is "setup failure" (also spelled "setup-failure"), but it is worker-internal and not forwarded to the webui.
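
To illustrate how little information the log carries, the worker-internal result can be pulled out of such an excerpt with one regular expression. This is a sketch against the log format shown above only. The function name is hypothetical, and normalizing "setup failure" to "setup-failure" is an assumption made to reconcile the two spellings.

```python
import re

# Matches lines such as
# "[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure"
RESULT_LINE = re.compile(
    r"^\[[^\]]+\] \[info\] \[pid:\d+\] Result: (?P<result>.+)$",
    re.MULTILINE,
)


def worker_result(log_text):
    """Return the normalized worker result from a log excerpt, or None."""
    match = RESULT_LINE.search(log_text)
    if match is None:
        return None
    # Assumption: both spellings refer to the same result.
    return match.group("result").strip().replace(" ", "-")


if __name__ == "__main__":
    sample = (
        "[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50\n"
        "[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure\n"
    )
    print(worker_result(sample))
```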


Subtasks

action #45062: Better visualization of incompletes - show module in which incomplete happens (Resolved, okurz)

action #59926: test incompletes in middle of execution with auto_review:"Unexpected end of data 0", system journal shows "kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]" (New)

coordination #61922: [epic] Incomplete jobs with no logs at all (Resolved, mkittler)

action #62984: Fix problem with job-worker assignment resulting in API errors (Resolved, mkittler)

action #63718: incomplete reason with just "quit"/"died" could provide more information (Resolved, mkittler)

action #64854: qemu-img error message is incorrectly tried to be parsed as JSON auto_review:"malformed JSON string" (Resolved, tinita)

action #64857: Put single-line error messages into incomplete reason for "died" (Resolved, cdywan)

action #64884: Distinguish test contributor errors from unexpected backend crashes (Resolved, mkittler)

action #64917: auto_review:"(?s)qemu-img.*runcmd.*failed with exit code 1" sometimes but no apparent error message (Resolved, okurz)

action #66066: incomplete with reason "died: terminated prematurely" but log shows error 404 failing to download asset into cache auto_review:"(?s)Download.*failed: 404.*No scripts" (Rejected, okurz)

action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry (Resolved, mkittler)

action #69448: test incompletes with auto_review:"(?s)was downloaded by.*details.*unavailable here.*Failed to download":retry , not helpful details (Workable)

coordination #69451: [epic] test incompletes with "(?s)Download.*successful.*Failed to download":retry, not helpful details (Blocked, okurz)

action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback (Resolved, kraih)

action #69691: Improve incomplete output for qemu related problems, e.g. auto_review:"Failed to allocate KVM HPT.*Cannot allocate memory":retry instead of "can't open qmp" (Workable)

action #71185: job incompletes with auto_review:"setup failure: Cache service status error: Premature connection close":retry and does not retry, should we just automatically retry the connection? (Resolved, okurz)

action #71188: job incomplete with auto_review:"backend died: QEMU exited unexpectedly, see log for details" and no other obvious information in the logfile what went wrong (Workable)

action #71227: job incompletes with auto_review:"backend died: 'current_console' is not set at /usr/lib/os-autoinst/backend/baseclass.pm line 932." (Workable)

action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded (Resolved, mkittler)

action #73273: job incompletes with auto_review:"setup failure: Cache service status error from API.* file is not a database .*":retry (Workable)

action #73282: auto_review:"setup failure: Cache service status error from API: Minion job.*Worker went away":retry (Workable)

action #73285: test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error (Blocked, okurz)

action #73288: auto_review:"setup failure: Cache service status error from API: Minion job.*Job terminated unexpectedly":retry (Workable)

action #73294: auto_review:"isotovideo died: needles_dir not found" should be 'tests died' or something similar obvious to test maintainers that they need to act (Workable)

action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*" (Resolved, kraih)

action #73369: Job incompletes with auto_review:"(?s)backend died: runcmd .*qemu-img create -f qcow2 .* failed with exit code 1: 'Formatting .*" on o3 (Workable)

action #73375: Job incompletes with reason auto_review:"(?m)api failure$" (and no further details) (Workable)

action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry (Resolved, Xiaojing_liu)

action #73525: Job incompletes with auto_review:"backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm.*":retry (New)

action #75388: Explicit error feedback to test reviewers on wrong test API usage (Workable)

action #78055: job incomplete exiting prematurely before reaching needle check timeout auto_review:"(?s)called testapi::assert_screen.*no match: [^-0]+\.[0-9]s,[^\n]*\n[^\n]*backend process exited: 0.*\[autotest\] process exited: 1":retry (New)

action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry (Resolved, mkittler)

openQA Infrastructure - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily (Resolved, nicksinger)

action #80118: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry, not effective on osd, or second fix needed (Resolved, okurz)

action #80226: job incomplete with autoinst-log.txt ending just in the middle (Workable)

action #80334: job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger (In Progress, Xiaojing_liu)

action #80356: incompletes with auto_review:"Cache service.*error: Connection refused":retry (Workable)

openQA Infrastructure - action #80408: revert longer timeout override for openQA services as we could not see less problems with corrupted worker cache (Resolved, nicksinger)

openQA Tests - action #80776: [jeos] job incomplete auto_review:"(?s)(podman|docker).*Virtio terminal and svirt serial terminal do not support send_key":retry (New)

action #80778: job incompletes with "Virtio terminal and svirt serial terminal do not support send_key", we should change the reason message to be ignored by auto-review, but be clear for the test maintainer (Workable)


Related issues

Related to openQA Project - action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download (Resolved, 2019-10-08)

Related to openQA Project - action #54557: All openQA tests incomplete but neither package build nor unit tests fail when a new file is added to os-autoinst without mentioning in Makefile.am (Resolved, 2019-07-23)

Related to openQA Project - action #54869: improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.) (Resolved, 2019-07-30)

Related to openQA Project - action #55415: restart "recent incompletes" or "incompletes of today" easier then over web UI (New, 2019-08-13)

Related to openQA Project - action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade (Resolved, 2019-10-02, 2020-04-09)

Related to openQA Project - action #60458: Improve consistency of job states/results (was: /tests/overview shows Passed: 0, Failed: 0 in summary but nothing else for a build that consists of single incomplete job only) (Resolved, 2019-11-30)

Related to openQA Project - action #43631: [tools] Job terminated by a SIGTERM, ending up incomplete, unclear reason for stopping even though test could have looked green so far, "Result: done" (Resolved, 2018-11-09)

Related to openQA Project - action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong (Resolved, 2019-11-29)

Related to openQA Project - action #34783: Don't let jobs incomplete if mandatory resources are missing (Resolved, 2018-04-12)

Copied from openQA Project - action #62237: many incompletes with just "setup failure" and no further information (Resolved, 2020-01-17)

History

#1 Updated by okurz 12 months ago

  • Copied from action #62237: many incompletes with just "setup failure" and no further information added

#2 Updated by okurz 12 months ago

  • Description updated (diff)

#3 Updated by okurz 12 months ago

  • Subject changed from Improve reporting on incompletes with result "setup-failure" and no further explanation to [epic] Distinguish all types of incompletes
  • Description updated (diff)

#4 Updated by okurz 12 months ago

  • Related to action #59926: test incompletes in middle of execution with auto_review:"Unexpected end of data 0", system journal shows "kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]" added

#5 Updated by okurz 12 months ago

  • Related to action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download added

#6 Updated by okurz 12 months ago

  • Related to action #54557: All openQA tests incomplete but neither package build nor unit tests fail when a new file is added to os-autoinst without mentioning in Makefile.am added

#7 Updated by okurz 12 months ago

  • Related to action #54869: improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.) added

#8 Updated by okurz 12 months ago

  • Related to action #55415: restart "recent incompletes" or "incompletes of today" easier then over web UI added

#9 Updated by okurz 12 months ago

  • Related to action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade added

#10 Updated by okurz 12 months ago

  • Related to action #60458: Improve consistency of job states/results (was: /tests/overview shows Passed: 0, Failed: 0 in summary but nothing else for a build that consists of single incomplete job only) added

#11 Updated by okurz 12 months ago

  • Related to action #43631: [tools] Job terminated by a SIGTERM, ending up incomplete, unclear reason for stopping even though test could have looked green so far, "Result: done" added

#12 Updated by okurz 12 months ago

  • Related to action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong added

#13 Updated by okurz 12 months ago

  • Related to action #34783: Don't let jobs incomplete if mandatory resources are missing added

#14 Updated by mkittler 12 months ago

Split "setup failure" into more specific types

Note that the existing error classes make sense for the worker's internal error handling. I wouldn't change them for the sake of displaying better error messages on the web UI. To me it makes more sense to simply forward the concrete error message or to have additional error (sub-)categories.

[...] but I guess we already have this covered by showing the logs in the details tab when there is no other information available.

That's true. So simply forwarding the concrete error message wouldn't be that much of an advantage anymore (as long as we ensure the relevant messages end up in the os-autoinst log and not only in the worker log). That speaks more for having additional error categories.

Not sure what kind of test you have in mind for AT1-1 and AT1-2. I doubt you want to test, e.g. DBIx here and the way we would be using it would be already covered by AT2 and AT3.

The other points make sense to me.

#15 Updated by okurz 12 months ago

  • Parent task set to #39719

#16 Updated by mkittler 12 months ago

  • Due date set to 2020-02-03

due to changes in a related task

#17 Updated by okurz 10 months ago

mkittler wrote:

Split "setup failure" into more specific types

Note that the existing error classes make sense for the worker's internal error handling. I wouldn't change them for the sake of displaying better error messages on the web UI. To me it makes more sense to simply forward the concrete error message or to have additional error (sub-)categories.

[...] but I guess we already have this covered by showing the logs in the details tab when there is no other information available.

That's true. So simply forwarding the concrete error message wouldn't be that much of an advantage anymore (as long as we ensure the relevant messages end up in the os-autoinst log and not only in the worker log). That speaks more for having additional error categories.

Not sure what kind of test you have in mind for AT1-1 and AT1-2. I doubt you want to test, e.g. DBIx here and the way we would be using it would be already covered by AT2 and AT3.

The acceptance tests specified here can be used for manual validation, based on the automated tests we put in place with the individual pull requests. With our recent work I am confident we have covered AT1-1, AT1-2, AT2-1 and AT3-1. I have executed all ATs manually for validation.

What I see as next tasks: We should forward more internal failure reasons, e.g. for "setup failure" as well as "died". E.g. from https://openqa.suse.de/tests?&resultfilter=Incomplete currently I can find https://openqa.suse.de/tests/4035208 which gives as reason "died: terminated prematurely, see log output for details" . In the logs we see

[2020-03-25T13:05:27.966 CET] [debug] Backend process died, backend errors are reported below in the following lines:
runcmd failed with exit code 1 at /usr/lib/os-autoinst/osutils.pm line 121.

[2020-03-25T13:05:27.966 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2020-03-25T13:05:27.968 CET] [debug] flushing frames
[2020-03-25T13:05:28.052 CET] [debug] sending magic and exit
[2020-03-25T13:05:28.052 CET] [debug] received magic close
[2020-03-25T13:05:28.057 CET] [debug] backend process exited: 0
failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 141.

In this example I think that when the message after "Backend process died" is only a single line, we can put it into the reason directly. We also need something better than "runcmd failed with exit code 1"; what actually happened is not obvious to me.
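
The single-line rule could be sketched as follows, based only on the log excerpt above. The function names are hypothetical, and the heuristic that the backend message ends at the first blank or timestamped line is an assumption about the log layout, not confirmed behavior of os-autoinst.

```python
import re

MARKER = "Backend process died, backend errors are reported below in the following lines:"
FALLBACK_REASON = "died: terminated prematurely, see log output for details"
# Regular log lines start with a timestamp such as "[2020-03-25T13:05:27.966 CET]".
TIMESTAMPED = re.compile(r"^\[\d{4}-\d{2}-\d{2}T")


def backend_error_lines(log_text):
    """Collect the untimestamped lines that directly follow the marker line."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if MARKER in line:
            collected = []
            for follow in lines[index + 1:]:
                # Assumption: a blank or timestamped line ends the message.
                if not follow.strip() or TIMESTAMPED.match(follow):
                    break
                collected.append(follow.strip())
            return collected
    return []


def reason_from_log(log_text):
    """Use the backend message as reason only if it is exactly one line."""
    errors = backend_error_lines(log_text)
    if len(errors) == 1:
        return "died: " + errors[0]
    return FALLBACK_REASON
```

For the excerpt above this would yield the "runcmd failed with exit code 1 ..." line as reason, which is more specific than the generic fallback but still does not explain what actually went wrong.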

Another example is https://openqa.suse.de/tests/4034770 with "died: terminated prematurely, see log output for details" with the logs

[2020-03-25T11:22:20.886 CET] [debug] <<< testapi::record_soft_failure(reason="bsc#1167633 - No desktop session in tty2 on SLED")
Can't call method "record_soft_failure_result" on an undefined value at /usr/lib/os-autoinst/testapi.pm line 184.
Compilation failed in require at /usr/bin/isotovideo line 289.

We could put the two lines into the reason.

https://openqa.suse.de/tests/4034740 has

[2020-03-25T10:28:19.341 CET] [debug] error on tests/console/validate_fs_table.pm: Global symbol "$pattern" requires explicit package name (did you forget to declare "my $pattern"?) at /var/lib/openqa/pool/19/os-autoinst-distri-opensuse/tests/console/validate_fs_table.pm line 25.
Compilation failed in require at (eval 199) line 1.

error on tests/console/validate_fs_table.pm: Global symbol "$pattern" requires explicit package name (did you forget to declare "my $pattern"?) at /var/lib/openqa/pool/19/os-autoinst-distri-opensuse/tests/console/validate_fs_table.pm line 25.
Compilation failed in require at (eval 199) line 1.
Compilation failed in require at /usr/bin/isotovideo line 289.

which looks like the same problem reported twice.
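
If such repeated messages were put into the reason, they could be collapsed first. The sketch below is an illustration only: the function name is hypothetical, and stripping a leading "[timestamp] [level] " prefix before comparing lines is an assumption derived from the excerpt above.

```python
import re

# Leading "[2020-03-25T10:28:19.341 CET] [debug] " style prefix.
PREFIX = re.compile(r"^\[[^\]]+\] \[\w+\] ")


def unique_messages(log_text):
    """Return the distinct messages of a log excerpt in first-seen order."""
    seen = set()
    unique = []
    for line in log_text.splitlines():
        message = PREFIX.sub("", line).strip()
        if message and message not in seen:
            seen.add(message)
            unique.append(message)
    return unique
```

Applied to the excerpt above, this would leave the "Global symbol" error, the "(eval 199)" line and the final isotovideo line, each once.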

EDIT: Split out as #64857

#18 Updated by okurz 10 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

Waiting for recently added subtasks

#19 Updated by okurz 9 months ago

  • Due date set to 2020-04-25

due to changes in a related task: #66066

#20 Updated by okurz 8 months ago

  • Due date set to 2020-05-25

due to changes in a related task: #64917

#21 Updated by okurz 8 months ago

  • Due date changed from 2020-05-25 to 2020-06-09

due to changes in a related task: #64917

#22 Updated by okurz 7 months ago

  • Due date changed from 2020-06-09 to 2020-04-25

due to changes in a related task: #64917

#23 Updated by okurz 7 months ago

  • Target version set to Ready

#24 Updated by okurz 6 months ago

https://openqa.suse.de/tests/4536282# shows the reason "isotovideo died: unable to handle generated assets: machine not shut down when uploading disks", which is clearly a problem in the test schedule. I wonder what we could do to make it more obvious in the reason that this is something for the test maintainer and not, e.g., the instance admin.

#25 Updated by szarate 3 months ago

  • Tracker changed from action to coordination
  • Status changed from Blocked to New

#27 Updated by okurz 3 months ago

  • Status changed from New to Blocked

#28 Updated by okurz 3 months ago

  • Related to action #73525: Job incompletes with auto_review:"backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm.*":retry added
