coordination #62420

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

[epic] Distinguish all types of incompletes

Added by okurz almost 5 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-12-12
Due date:
% Done: 100%
Estimated time: (Total: 120.00 h)

Description

Motivation

As a test reviewer, I want to understand the reason for incompletes so that I know who should do what to fix them.

Acceptance criteria

  • AC1: All incompletes provide more details about the reason for the incompletion
  • AC2: Types of incompletes are distinguishable without needing to read logs
  • AC3: The incomplete reason is visualized in the UI (not only in logs)
  • AC4: If the reason is not known all available log details are accessible from the job

Acceptance tests

  • AT1-1: Given an incomplete job, When reading out the job from the openQA database, Then the incomplete reason is given
  • AT1-2: Same as AT1-1 but for a failed job, Then no incomplete reason is given
  • AT2-1: Given an incomplete job, When reading out the job over the API, Then the incomplete reason is rendered
  • AT3-1: Given an incomplete job, When showing job details in the webui, Then the incomplete reason is visible (not only in logs)
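
AT1-1, AT1-2 and AT2-1 could be expressed as a small sketch like the following. It is only an illustration in Python, not openQA code: the `Job` class, its `result` and `reason` fields, and the helpers `incomplete_reason` and `render_job` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical stand-in for a job record; not the real openQA schema."""
    result: str                   # e.g. "incomplete", "failed"
    reason: Optional[str] = None  # free-text reason as stored with the job


def incomplete_reason(job):
    """AT1-1/AT1-2: only incomplete jobs report an incomplete reason."""
    return job.reason if job.result == "incomplete" else None


def render_job(job):
    """AT2-1: render the job the way an API response might (assumed shape)."""
    data = {"result": job.result}
    reason = incomplete_reason(job)
    if reason is not None:
        data["reason"] = reason
    return data
```

With these definitions an incomplete job renders with a "reason" entry, while a failed job renders with its result only.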

Suggestions

  • Check where "setup failure" and other results are reported without further details from other services, e.g. cacheservice, and ensure that there is a hint about the problem source.
  • Extend the API to also accept the incomplete reason
  • Extend the UI to also show the incomplete reason
  • Split "setup failure" into more specific types
  • For example, we use "setup failure" in multiple cases but do not forward the result to the webui, except as strings in the log files. In the case here we do not even have an information string that points out what the real problem was or is. We could at least show the available logs or even extract log excerpts, but I guess this is already covered by showing the logs in the details tab when no other information is available.

Further details

What should we do when the systemd service was asked to stop because we want to reboot the machine? IMHO we should abort as fast as possible on TERM but provide better information. This and all the experience with different sources of incompletes over the past months bring me to the conclusion that we just want to pass the internal "reason" we already have on the worker to the webui, e.g. "setup-failure" as we already have. "worker-shutdown" can be another reason. Based on the reason we can also decide whether we should auto-duplicate: for a "compilation-error", no retrigger; for "worker-shutdown", yes.
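
The retrigger rule in the paragraph above could be sketched roughly as follows. This is a minimal illustration in Python, not openQA's actual implementation: the names `RETRY_BY_REASON` and `should_auto_duplicate` are hypothetical, and defaulting unknown reasons to "no retrigger" is an assumption of this sketch. Only the two reasons "compilation-error" and "worker-shutdown" and their outcomes come from the text.

```python
# Hypothetical sketch: decide whether to auto-duplicate an incomplete job
# based on the reason string forwarded by the worker.

# Only these two entries are taken from the ticket text.
RETRY_BY_REASON = {
    "compilation-error": False,  # no retrigger, as stated in the ticket
    "worker-shutdown": True,     # retrigger, as stated in the ticket
}


def should_auto_duplicate(reason):
    """Return True if a job with this incomplete reason should be retriggered.

    Unknown or missing reasons default to False (an assumption of this
    sketch), so that a reviewer has to look at them.
    """
    if not reason:
        return False
    # Reasons may carry details after a colon, e.g. "setup failure: ...",
    # so only the part before the first colon is used as the category.
    category = reason.split(":", 1)[0].strip()
    return RETRY_BY_REASON.get(category, False)
```

Keeping the decision in one table means a new reason only needs one new entry, and anything not listed stays visible to reviewers instead of being retried silently.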

Started as ticket "Improve reporting on incompletes with result "setup-failure" and no further explanation".

See #62237. There were many incompletes without much detail, e.g. https://openqa.suse.de/tests/3795872 shows just

[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] +++ setup notes +++
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Start time: 2020-01-17 09:56:50
[2020-01-17T10:56:50.0830 CET] [info] [pid:110583] Running on QA-Power8-5-kvm:6 (Linux 4.12.14-lp151.27-default #1 SMP Fri May 10 14:13:15 UTC 2019 (862c838) ppc64le)
[2020-01-17T11:01:50.0997 CET] [info] [pid:110583] +++ worker notes +++
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] End time: 2020-01-17 10:01:50
[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure
[2020-01-17T11:01:51.0002 CET] [info] [pid:21796] Uploading autoinst-log.txt

We have the result "setup failure" (also spelled "setup-failure"), but this is worker-internal and not forwarded to the webui.
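
As an illustration of how the worker-internal result could be read out of a log like the excerpt above, here is a minimal Python sketch. It assumes the line layout shown in that excerpt; the names `RESULT_LINE`, `extract_result` and `normalize_reason` are hypothetical and not part of openQA.

```python
import re

# Matches worker log lines such as
# "[2020-01-17T11:01:50.0998 CET] [info] [pid:110583] Result: setup failure"
RESULT_LINE = re.compile(
    r"^\[[^\]]*\] \[\w+\] \[pid:\d+\] Result: (?P<result>.+)$"
)


def extract_result(log_text):
    """Return the text after "Result:" from the last matching line, or None."""
    result = None
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            result = match.group("result").strip()
    return result


def normalize_reason(result):
    """Map both spellings, "setup failure" and "setup-failure", to one form."""
    return result.strip().lower().replace(" ", "-")
```

Applied to the excerpt above, `extract_result` yields "setup failure", which `normalize_reason` turns into "setup-failure".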


Subtasks 24 (0 open, 24 closed)

action #45062: Better visualization of incompletes - show module in which incomplete happens | Resolved | okurz | 2018-12-12
coordination #61922: [epic] Incomplete jobs with no logs at all | Resolved | mkittler | 2020-02-03
action #62984: Fix problem with job-worker assignment resulting in API errors | Resolved | mkittler | 2020-02-03
action #63718: incomplete reason with just "quit"/"died" could provide more information | Resolved | mkittler | 2020-02-21
action #64854: qemu-img error message is incorrectly tried to be parsed as JSON auto_review:"malformed JSON string" | Resolved | tinita | 2020-03-26
action #64857: Put single-line error messages into incomplete reason for "died" | Resolved | livdywan | 2020-03-26
action #64884: Distinguish test contributor errors from unexpected backend crashes | Resolved | mkittler | 2020-03-26
action #64917: auto_review:"(?s)qemu-img.*runcmd.*failed with exit code 1" sometimes but no apparent error message | Resolved | okurz | 2020-03-26
action #66066: incomplete with reason "died: terminated prematurely" but log shows error 404 failing to download asset into cache auto_review:"(?s)Download.*failed: 404.*No scripts" | Rejected | okurz | 2020-04-25
action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry | Resolved | mkittler | 2020-05-18
action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback | Resolved | kraih | 2020-08-04
action #71185: job incompletes with auto_review:"setup failure: Cache service status error: Premature connection close":retry and does not retry, should we just automatically retry the connection? | Resolved | okurz | 2020-09-10
action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded | Resolved | mkittler | 2020-07-30
action #73285: test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error | Resolved | okurz | 2020-07-30
action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*" | Resolved | kraih | 2020-10-14
action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry | Resolved | Xiaojing_liu | 2020-10-15
action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry | Resolved | mkittler | 2020-11-18
openQA Infrastructure - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily | Resolved | nicksinger
action #80118: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry, not effective on osd, or second fix needed | Resolved | okurz
action #80334: job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger | Resolved | Xiaojing_liu | 2020-11-25
openQA Infrastructure - action #80408: revert longer timeout override for openQA services as we could not see less problems with corrupted worker cache | Resolved | nicksinger | 2020-11-26
Containers and images - action #80776: [jeos] job incomplete auto_review:"(?s)(podman|docker).*Virtio terminal and svirt serial terminal do not support send_key":retry | Resolved | ybonatakis
action #89614: openqa workers on `ip-172-25-5-39` fails with no clue on failure | Resolved | ggardet_arm | 2021-03-08
action #90974: Make it obvious if qemu gets terminated unexpectedly due to out-of-memory | Resolved | Xiaojing_liu

Related issues 10 (1 open, 9 closed)

Related to openQA Project - action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download | Resolved | mkittler | 2019-10-08
Related to openQA Project - action #54557: All openQA tests incomplete but neither package build nor unit tests fail when a new file is added to os-autoinst without mentioning in Makefile.am | Resolved | okurz | 2019-07-23
Related to openQA Project - action #54869: improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.) | Resolved | okurz | 2019-07-30
Related to openQA Project - action #55415: restart "recent incompletes" or "incompletes of today" easier then over web UI | New | | 2019-08-13
Related to openQA Project - action #57620: job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade | Resolved | okurz | 2019-10-02 | 2020-04-09
Related to openQA Project - action #60458: Improve consistency of job states/results (was: /tests/overview shows Passed: 0, Failed: 0 in summary but nothing else for a build that consists of single incomplete job only) | Resolved | mkittler | 2019-11-30
Related to openQA Project - action #43631: [tools] Job terminated by a SIGTERM, ending up incomplete, unclear reason for stopping even though test could have looked green so far, "Result: done" | Resolved | okurz | 2018-11-09
Related to openQA Project - action #60443: job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong | Resolved | okurz | 2019-11-29
Related to openQA Project - action #34783: Don't let jobs incomplete if mandatory resources are missing | Resolved | mkittler | 2018-04-12
Copied from openQA Project - action #62237: many incompletes with just "setup failure" and no further information | Resolved | okurz | 2020-01-17
