coordination #39719

closed

[saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

Added by okurz over 5 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-05-23
Due date:
% Done: 100%
Estimated time: (Total: 128.00 h)

Description

User Story

As a reviewer of failed openQA tests, I want known failures of jobs to be marked as such automatically, regardless of the error source, so that I do not waste time investigating known failures

Acceptance criteria

  • AC1: If a job fails for any reason that is already "known" in the context of the current openQA instance, human reviewers need to spend no further "test review" effort

Suggestions

  • Provide a mechanism to match regexes in serial0.txt (as provided by the existing "serial exception catching" feature), based on patterns defined in the test distribution
  • Same for autoinst-log.txt
  • Provide patterns defined in os-autoinst for backend-specific issues, e.g. "key event queue full" -> look for that string in os-autoinst to find the existing code that handles it
  • Same as above, but with patterns defined in instance-specific configuration, e.g. workers.ini (managed by salt for SLE)
  • Maybe the same based on needles? But maybe the current approach of using the "workaround" property and always preferring soft-fail needles is already good enough :)
  • It might be necessary to redefine "soft-fail" as "known issue" and nothing more. The "known failure" detection could then set a job to soft-failed with a reference to the known issue and immediately abort further execution of the job. This prevents the job from failing at a sporadic later step, which would require openQA comments to provide a label
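The pattern-matching suggestions above could be sketched roughly as follows. This is a minimal Python illustration, not existing openQA or os-autoinst code: the `KnownFailure` record, its field names and `find_known_failure` are hypothetical, and which log file each message appears in is assumed. The first pattern reuses the message and bug reference from subtask #38621.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KnownFailure:
    """One known-failure pattern; every field name here is hypothetical."""
    pattern: str   # regex searched for in the log text
    log_file: str  # log the pattern applies to, e.g. "serial0.txt"
    issue: str     # issue reference to attach to the job
    source: str    # where the pattern was defined


# Illustrative entries only. The log file assignments are assumptions and
# "<issue reference>" is a placeholder, not a real ticket.
PATTERNS = [
    KnownFailure(r"Module is not signed with expected PKCS#7 message",
                 "serial0.txt", "bsc#1093659", "test distribution"),
    KnownFailure(r"key event queue full",
                 "autoinst-log.txt", "<issue reference>", "os-autoinst"),
]


def find_known_failure(logs: Dict[str, str],
                       patterns: List[KnownFailure]) -> Optional[KnownFailure]:
    """Return the first pattern that matches its own log file, or None."""
    for known in patterns:
        text = logs.get(known.log_file, "")
        if re.search(known.pattern, text):
            return known
    return None
```

Patterns from instance-specific configuration could be appended to the same list, so a single lookup covers all three sources. A caller could then use the returned `issue` to label the job instead of leaving it for manual review.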

Further details

Definitions:

  • "known" means that a certain symptom of a test failure has been described, e.g. with a matching pattern, in a test distribution, in os-autoinst or maybe in openQA itself, as in the Jenkins plugin mentioned below
  • "test review" means what we currently do in openSUSE or SLE: providing job labels with issue references in openQA comments, which are carried over. So far this only works within individual scenarios

See https://wiki.jenkins.io/display/JENKINS/Build+Failure+Analyzer for an example. This Jenkins plugin uses a "knowledge base" of "known failures" that are global to the Jenkins instance. Each one is defined with a description and pattern matching, e.g. "build log parsing", and a failure is marked as known when any log content matches an existing pattern
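The knowledge-base idea could look roughly like the following Python sketch. It is not the Jenkins plugin's data model or API: `KNOWLEDGE_BASE`, `analyze` and `needs_review` are hypothetical names, and the two entries are illustrative, with regexes borrowed from strings that appear in subtask subjects of this ticket.

```python
import re
from typing import Dict, List

# Hypothetical instance-global knowledge base: each entry pairs a
# human-readable description with a regex. The descriptions are made up.
KNOWLEDGE_BASE = [
    {"description": "Disk full on worker",
     "pattern": r"No space left on device"},
    {"description": "Asset download failed",
     "pattern": r"(?s)Download.*failed: 404"},
]


def analyze(logs: Dict[str, str]) -> List[str]:
    """Return descriptions of all entries whose pattern matches any log."""
    found = []
    for entry in KNOWLEDGE_BASE:
        regex = re.compile(entry["pattern"])
        if any(regex.search(text) for text in logs.values()):
            found.append(entry["description"])
    return found


def needs_review(logs: Dict[str, str]) -> bool:
    """A failure needs human review only if no known cause matches."""
    return not analyze(logs)
```

Unlike per-scenario label carry-over, such a list applies to every job on the instance, which is the property the acceptance criterion asks for.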


Subtasks 61 (0 open, 61 closed)

coordination #19720: [epic] Simplify investigation of job failures | Resolved | okurz | 2019-12-17
action #61103: Use CodeMirror to render diffs in the Investigation tab | Rejected | okurz | 2019-12-17
action #69085: Make "last good" a link to a job instead of plain job ID | Resolved | okurz | 2020-07-17
action #69088: Present changes between packages on openQA worker machines in "investigation" | Resolved | ilausuch | 2020-07-17
coordination #91518: [epic] Provide 'first bad' vs. 'last good' difference in investigation info | Resolved | okurz | 2021-04-21
action #91521: link to "first bad" in investigation tab | Resolved | osukup | 2021-04-21
action #92188: test reviewers are pointed to the "first bad vs. last good" comparison if current job is not already the first bad | Resolved | tinita
action #91527: Cleanup logging in autoinst-log.txt | Resolved | ilausuch | 2021-04-21
action #91878: Improve git log entries in failed test investigation | Resolved | ybonatakis | 2021-04-27
action #92731: clickable git log entries in investigation tab | Resolved | ybonatakis
action #92746: Log viewer in openQA webUI with color parsing | Resolved | mkittler | 2021-05-17
action #93940: text thumbnail preview feels inconsistent to other screenshots size:M | Resolved | osukup | 2021-06-14
action #95581: ci: Use a git commit message style checker size:S | Resolved | VANASTASIADIS | 2021-07-16
action #101533: Make text thumbnails easily distinguishable from info thumbnails | Resolved | mkittler | 2021-10-27
action #101725: Improve text result preview font size in chromium based browsers | Resolved | dheidler | 2021-10-29
openQA Tests - action #38621: [functional][y] test fails in welcome - "Module is not signed with expected PKCS#7 message" (bsc#1093659) - Use serial exception catching feature from openQA to make sure the jobs reference the bug, e.g. as label | Resolved | riafarov | 2018-05-23
action #60560: Self-investigate potential reasons for failures in openQA | Resolved | okurz | 2019-12-03
coordination #62420: [epic] Distinguish all types of incompletes | Resolved | okurz | 2018-12-12
action #45062: Better visualization of incompletes - show module in which incomplete happens | Resolved | okurz | 2018-12-12
coordination #61922: [epic] Incomplete jobs with no logs at all | Resolved | mkittler | 2020-02-03
action #62984: Fix problem with job-worker assignment resulting in API errors | Resolved | mkittler | 2020-02-03
action #63718: incomplete reason with just "quit"/"died" could provide more information | Resolved | mkittler | 2020-02-21
action #64854: qemu-img error message is incorrectly tried to be parsed as JSON auto_review:"malformed JSON string" | Resolved | tinita | 2020-03-26
action #64857: Put single-line error messages into incomplete reason for "died" | Resolved | livdywan | 2020-03-26
action #64884: Distinguish test contributor errors from unexpected backend crashes | Resolved | mkittler | 2020-03-26
action #64917: auto_review:"(?s)qemu-img.*runcmd.*failed with exit code 1" sometimes but no apparent error message | Resolved | okurz | 2020-03-26
action #66066: incomplete with reason "died: terminated prematurely" but log shows error 404 failing to download asset into cache auto_review:"(?s)Download.*failed: 404.*No scripts" | Rejected | okurz | 2020-04-25
action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry | Resolved | mkittler | 2020-05-18
action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback | Resolved | kraih | 2020-08-04
action #71185: job incompletes with auto_review:"setup failure: Cache service status error: Premature connection close":retry and does not retry, should we just automatically retry the connection? | Resolved | okurz | 2020-09-10
action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded | Resolved | mkittler | 2020-07-30
action #73285: test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error | Resolved | okurz | 2020-07-30
action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*" | Resolved | kraih | 2020-10-14
action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry | Resolved | Xiaojing_liu | 2020-10-15
action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry | Resolved | mkittler | 2020-11-18
openQA Infrastructure - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily | Resolved | nicksinger
action #80118: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry, not effective on osd, or second fix needed | Resolved | okurz
action #80334: job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger | Resolved | Xiaojing_liu | 2020-11-25
openQA Infrastructure - action #80408: revert longer timeout override for openQA services as we could not see less problems with corrupted worker cache | Resolved | nicksinger | 2020-11-26
action #89614: openqa workers on `ip-172-25-5-39` fails with no clue on failure | Resolved | ggardet_arm | 2021-03-08
action #90974: Make it obvious if qemu gets terminated unexpectedly due to out-of-memory | Resolved | Xiaojing_liu
QA - action #52655: [epic] Move openqa-review from cron-jobs on lord.arch to a more sustainable long-term solution | Resolved | okurz | 2021-04-19
QA - action #91356: Save openqa-review reports as gitlab CI artifacts | Resolved | osukup | 2021-04-19
QA - action #93710: Reference individual openqa-review reports in gitlab CI artifacts, e.g. using gitlab pages | Resolved | livdywan
action #75232: error message when worker has no network (yet): Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86." | Resolved | livdywan | 2020-10-24
QA - coordination #77899: [epic] Extend "auto-review" for failed jobs as well | Resolved | okurz | 2020-11-26
QA - action #80414: [proof-of-concept] Extend "auto-review" for failed jobs as well, start with o3 | Resolved | okurz | 2020-11-26
QA - action #80418: [learning] Fix parse errors in "openqa-investigate" "parse error: Invalid numeric literal at line 1, column 10" | Resolved | mkittler | 2020-11-26
QA - action #80806: Extend "auto-review" for failed jobs as well - Generalize openqa-monitor-investigation-candidates to look at more than just one job group | Resolved | okurz | 2020-12-07
QA - action #80808: Extend "auto-review" for failed jobs as well - enable same as on o3 but on osd | Resolved | okurz | 2020-12-07
QA - action #77944: Run "auto-review" more often but alarm less | Resolved | okurz | 2020-11-14
action #80264: multimachine tests unable to get vars from its pair job | Resolved | mkittler | 2020-11-24
action #80412: tests fail with auto_review:"(?s)version is 4\.6\.1606298538\.191b5988.*Can.*t locate object method.*code.*via package":retry | Resolved | okurz | 2020-11-24
action #80772: [jeos] auto_review:"(?s)GENERAL_HW_FLASH_CMD.*No space left on device":retry incomplete in flash script | Resolved | ggardet_arm | 2020-12-07
action #80774: [jeos] auto_review:"(?s)GENERAL_HW_FLASH_CMD.*No route to host":retry incomplete in flash script | Resolved | ggardet_arm | 2020-12-07
coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd | Resolved | okurz | 2020-12-08
action #80736: Trigger 'auto-review' from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause" | Resolved | okurz
action #80826: Trigger 'auto-review' from within openQA when jobs incomplete on osd as well | Resolved | okurz | 2020-12-08
action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3 | Resolved | okurz | 2020-12-08
action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd | Resolved | okurz
action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios | Resolved | mkittler | 2021-01-07
Actions

Related issues 8 (5 open, 3 closed)

Related to openQA Project - action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed | Rejected | okurz | 2016-11-25
Related to openQA Project - coordination #13812: [epic][dashboard] openQA Dashboard ideas | New | 2017-01-10
Related to openQA Tests - action #42446: [qe-core][functional] many opensuse tests fail in desktop_runner or gimp or other modules in what I think is boo#1105691 – can we detect this bug from the journal and track as soft-fail? | New | 2018-10-13
Related to openQA Project - action #40382: Make "ignored" issues more prominent (was: create new state "ignored") | Workable | 2018-08-29
Related to openQA Tests - action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen | Resolved | oorlov | 2018-11-14
Related to openQA Project - action #57452: Automatic summary of failures | Rejected | 2019-09-27
Related to openQA Project - action #45011: Allow detection of known failures at the autoinst-log.txt | Workable | 2018-12-11
Copied to openQA Project - coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA | New | 2018-04-16
