
Category:

Feature requests

Target version:

Ready

Start date:

2018-05-23

Due date:

% Done:

100%

Estimated time:

(Total: 128.00 h)

Description

User Story¶

As a reviewer of failed openQA tests, I want known job failures to be marked as such automatically, regardless of the error source, so that I do not waste time investigating them

Acceptance criteria¶

AC1: If a job fails for any reason that is already "known" in the context of the current openQA instance, human reviewers need no further "test review" effort

Suggestions¶

Provide a mechanism to match regexes against serial0.txt (as provided by the existing "serial exception catching" feature) based on patterns defined in the test distribution
Do the same for autoinst-log.txt
Provide patterns defined in os-autoinst for backend-specific problems, e.g. the "key event queue full" error -> look for that string in os-autoinst to find the existing code that handles it
Same as above, but with patterns defined in instance-specific configuration, e.g. workers.ini (managed by salt for SLE)
Maybe the same based on needles? But maybe the current approach, using the "workaround" property and always preferring soft-fail needles, is already good enough :)
It might be necessary to redefine "soft-fail" as "known issue" and nothing more. The "known failure" detection could then set a job to soft-failed with a reference to the known issue and immediately abort further execution. This would prevent the job from failing at a sporadic later step, which would require openQA comments to provide a label
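
The pattern-matching idea from the suggestions above could look roughly like the following. This is only a minimal sketch: the `KnownFailure` structure, the `find_known_failure` helper and the placeholder issue reference are hypothetical and not part of os-autoinst or openQA. The two example patterns are taken from this ticket ("key event queue full" and the PKCS#7 message tracked as bsc#1093659).

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class KnownFailure:
    """One known-failure entry (hypothetical structure)."""
    pattern: str    # regex searched in the log text
    reference: str  # issue reference to attach to the job
    source: str     # where the pattern would be defined


# Illustrative entries only. Real patterns would come from the test
# distribution, os-autoinst or instance-specific configuration.
KNOWN_FAILURES = [
    KnownFailure(r"key event queue full", "example#1 (placeholder)", "os-autoinst"),
    KnownFailure(
        r"Module is not signed with expected PKCS#7 message",
        "bsc#1093659",
        "test distribution",
    ),
]


def find_known_failure(
    log_text: str, entries: Iterable[KnownFailure]
) -> Optional[KnownFailure]:
    """Return the first entry whose pattern matches anywhere in the log."""
    for entry in entries:
        if re.search(entry.pattern, log_text):
            return entry
    return None


# Usage: scan the content of a log such as serial0.txt or autoinst-log.txt.
serial_log = "[  12.3] Module is not signed with expected PKCS#7 message\n"
match = find_known_failure(serial_log, KNOWN_FAILURES)
print(match.reference if match else "unknown failure")
```

Keeping the source of each pattern next to the pattern itself makes it possible to merge entries from several places into one list while still showing reviewers where a match was defined.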

Further details¶

Definitions:

"known" means that a certain symptom of a test failure has been described, e.g. with a matching pattern, in a test distribution, in os-autoinst or maybe in openQA itself, as with the Jenkins plugin mentioned below
"test review" means what we currently do in openSUSE or SLE: providing job labels with issue references in openQA comments, which are carried over. So far this only works within individual scenarios

See https://wiki.jenkins.io/display/JENKINS/Build+Failure+Analyzer for an example. This Jenkins plugin uses a "knowledge base" of "known failures" that is global to the Jenkins instance. Each entry is defined with a description and pattern matching, e.g. "build log parsing", and a failure is marked as known when any log content matches an existing pattern

Subtasks 62 (0 open — 62 closed)

coordination #19720: [epic] Simplify investigation of job failures

Resolved

2019-12-17

action #61103: Use CodeMirror to render diffs in the Investigation tab

Rejected

2019-12-17

action #69085: Make "last good" a link to a job instead of plain job ID

Resolved

2020-07-17

action #69088: Present changes between packages on openQA worker machines in "investigation"

Resolved

ilausuch

2020-07-17

coordination #91518: [epic] Provide 'first bad' vs. 'last good' difference in investigation info

Resolved

2021-04-21

action #91521: link to "first bad" in investigation tab

Resolved

osukup

2021-04-21

action #92188: test reviewers are pointed to the "first bad vs. last good" comparison if current job is not already the first bad

Resolved

tinita

action #91527: Cleanup logging in autoinst-log.txt

Resolved

ilausuch

2021-04-21

action #91878: Improve git log entries in failed test investigation

Resolved

ybonatakis

2021-04-27

action #92731: clickable git log entries in investigation tab

Resolved

ybonatakis

action #92746: Log viewer in openQA webUI with color parsing

Resolved

2021-05-17

action #93940: text thumbnail preview feels inconsistent to other screenshots size:M

Resolved

osukup

2021-06-14

action #95581: ci: Use a git commit message style checker size:S

Resolved

VANASTASIADIS

2021-07-16

action #101533: Make text thumbnails easily distinguishable from info thumbnails

Resolved

2021-10-27

action #101725: Improve text result preview font size in chromium based browsers

Resolved

dheidler

2021-10-29

openQA Tests (public) - action #38621: [functional][y] test fails in welcome - "Module is not signed with expected PKCS#7 message" (bsc#1093659) - Use serial exception catching feature from openQA to make sure the jobs reference the bug, e.g. as label

Resolved

riafarov

2018-05-23

action #60560: Self-investigate potential reasons for failures in openQA

Resolved

2019-12-03

coordination #62420: [epic] Distinguish all types of incompletes

Resolved

2018-12-12

action #45062: Better visualization of incompletes - show module in which incomplete happens

Resolved

2018-12-12

coordination #61922: [epic] Incomplete jobs with no logs at all

Resolved

2020-02-03

action #62984: Fix problem with job-worker assignment resulting in API errors

Resolved

2020-02-03

action #63718: incomplete reason with just "quit"/"died" could provide more information

Resolved

2020-02-21

action #64854: qemu-img error message is incorrectly tried to be parsed as JSON auto_review:"malformed JSON string"

Resolved

tinita

2020-03-26

action #64857: Put single-line error messages into incomplete reason for "died"

Resolved

livdywan

2020-03-26

action #64884: Distinguish test contributor errors from unexpected backend crashes

Resolved

2020-03-26

action #64917: auto_review:"(?s)qemu-img.*runcmd.*failed with exit code 1" sometimes but no apparent error message

Resolved

2020-03-26

action #66066: incomplete with reason "died: terminated prematurely" but log shows error 404 failing to download asset into cache auto_review:"(?s)Download.*failed: 404.*No scripts"

Rejected

2020-04-25

action #67000: Job incompletes due to malformed worker cache database disk image with auto_review:"Cache service status error.*(database disk image is malformed|Specified job ID is invalid).*":retry

Resolved

2020-05-18

action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback

Resolved

kraih

2020-08-04

action #71185: job incompletes with auto_review:"setup failure: Cache service status error: Premature connection close":retry and does not retry, should we just automatically retry the connection?

Resolved

2020-09-10

action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded

Resolved

2020-07-30

action #73285: test incompletes with auto_review:"(?s)Download of.*processed[^:].*Failed to download":retry , not helpful details about reason of error

Resolved

2020-07-30

action #73339: auto_review:"setup failure: Cache service status error from API: Minion job.* failed: Can't use an undefined value as a HASH reference at.*"

Resolved

kraih

2020-10-14

action #73396: job incompletes with auto_review:"setup failure: Failed to rsync tests: exit code 23":retry

Resolved

Xiaojing_liu

2020-10-15

action #78169: after osd-deploy 2020-11-18 incompletes with auto_review:"Cache service (status error from API|.*error 500: Internal Server Error)":retry

Resolved

2020-11-18

openQA Infrastructure (public) - action #80106: corrupted worker cache sqlite: Enlarge systemd service kill timeout temporarily

Resolved

nicksinger

action #80118: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry, not effective on osd, or second fix needed

Resolved

action #80334: job incompletes with auto_review:"(?s)terminated prematurely with corrupted state file.*No space left on device":retry , should automatically retrigger

Resolved

Xiaojing_liu

2020-11-25

openQA Infrastructure (public) - action #80408: revert longer timeout override for openQA services as we could not see less problems with corrupted worker cache

Resolved

nicksinger

2020-11-26

Containers and images - action #80776: [jeos] job incomplete auto_review:"(?s)(podman|docker).*Virtio terminal and svirt serial terminal do not support send_key":retry

Resolved

ybonatakis

action #89614: openqa workers on `ip-172-25-5-39` fails with no clue on failure

Resolved

ggardet_arm

2021-03-08

action #90974: Make it obvious if qemu gets terminated unexpectedly due to out-of-memory

Resolved

Xiaojing_liu

QA (public) - action #52655: [epic] Move openqa-review from cron-jobs on lord.arch to a more sustainable long-term solution

Resolved

2021-04-19

QA (public) - action #91356: Save openqa-review reports as gitlab CI artifacts

Resolved

osukup

2021-04-19

QA (public) - action #93710: Reference individual openqa-review reports in gitlab CI artifacts, e.g. using gitlab pages

Resolved

livdywan

action #75232: error message when worker has no network (yet): Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86."

Resolved

livdywan

2020-10-24

QA (public) - coordination #77899: [epic] Extend "auto-review" for failed jobs as well

Resolved

2020-11-26

QA (public) - action #80414: [proof-of-concept] Extend "auto-review" for failed jobs as well, start with o3

Resolved

2020-11-26

QA (public) - action #80418: [learning] Fix parse errors in "openqa-investigate" "parse error: Invalid numeric literal at line 1, column 10"

Resolved

2020-11-26

QA (public) - action #80806: Extend "auto-review" for failed jobs as well - Generalize openqa-monitor-investigation-candidates to look at more than just one job group

Resolved

2020-12-07

QA (public) - action #80808: Extend "auto-review" for failed jobs as well - enable same as on o3 but on osd

Resolved

2020-12-07

QA (public) - action #77944: Run "auto-review" more often but alarm less

Resolved

2020-11-14

action #80264: multimachine tests unable to get vars from its pair job

Resolved

2020-11-24

action #80412: tests fail with auto_review:"(?s)version is 4\.6\.1606298538\.191b5988.*Can.*t locate object method.*code.*via package":retry

Resolved

2020-11-24

action #80772: [jeos] auto_review:"(?s)GENERAL_HW_FLASH_CMD.*No space left on device":retry incomplete in flash script

Resolved

ggardet_arm

2020-12-07

action #80774: [jeos] auto_review:"(?s)GENERAL_HW_FLASH_CMD.*No route to host":retry incomplete in flash script

Resolved

ggardet_arm

2020-12-07

coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd

Resolved

2020-12-08

action #80736: Trigger 'auto-review' from within openQA when jobs incomplete (or fail) , for testing: auto_review:"tests died: unable to load main.pm, check the log for the cause"

Resolved

action #80826: Trigger 'auto-review' from within openQA when jobs incomplete on osd as well

Resolved

2020-12-08

action #80830: Trigger 'openqa-investigate' from within openQA when jobs fail on o3

Resolved

2020-12-08

action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd

Resolved

action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios

Resolved

2021-01-07

Related issues 8 (5 open — 3 closed)

Related to openQA Project (public) - action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed

Rejected

2016-11-25

Related to openQA Project (public) - coordination #13812: [epic][dashboard] openQA Dashboard ideas

New

2017-01-10

Related to openQA Tests (public) - action #42446: [qe-core][functional] many opensuse tests fail in desktop_runner or gimp or other modules in what I think is boo#1105691 – can we detect this bug from the journal and track as soft-fail?

New

2018-10-13

Related to openQA Project (public) - action #40382: Make "ignored" issues more prominent (was: create new state "ignored")

Workable

2018-08-29

Related to openQA Tests (public) - action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen

Resolved

oorlov

2018-11-14

Related to openQA Project (public) - action #57452: Automatic summary of failures

Rejected

2019-09-27

Related to openQA Project (public) - action #45011: Allow detection of known failures at the autoinst-log.txt

Workable

2018-12-11

Copied to openQA Project (public) - coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

New

2018-04-16


History


Updated by okurz over 6 years ago

Related to action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed added


Updated by okurz over 6 years ago

Related to action #38621: [functional][y] test fails in welcome - "Module is not signed with expected PKCS#7 message" (bsc#1093659) - Use serial exception catching feature from openQA to make sure the jobs reference the bug, e.g. as label added


Updated by okurz over 6 years ago

Related to coordination #13812: [epic][dashboard] openQA Dashboard ideas added


Updated by okurz over 6 years ago

Related to deleted (action #38621: [functional][y] test fails in welcome - "Module is not signed with expected PKCS#7 message" (bsc#1093659) - Use serial exception catching feature from openQA to make sure the jobs reference the bug, e.g. as label)


Updated by nicksinger over 6 years ago

Another idea which could be checked/better reported to the user:

If a crucial component in the "os-autoinst chain" fails (e.g. xterm for IPMI jobs), openQA could easily report this earlier. As it is right now, the job stalls (hangs as "running") and only shows a black screen. Example: https://openqa.suse.de/tests/1970948 (look for "PermissionError" in autoinst-log.txt)


Updated by coolo over 6 years ago

Target version set to future

IMO this is best handled by an automated review from outside. The problem is not so much detecting the issue but how to handle it. For some projects/objects you would retrigger, for others you would prefer defining a label.


Updated by okurz over 6 years ago

"outside", yes, I agree. It should be outside what is currently defined as "openQA", but we could still call it "the openQA ecosystem", so I guess this issue tracker is still best suited. Some parts are already covered by the proof-of-concept that detects known failures in the serial port output.


Updated by coolo over 6 years ago

I don't disagree with the issue tracker - I just don't want a High priority epic in my 'to be sorted' list


Updated by okurz over 6 years ago

Related to action #42446: [qe-core][functional] many opensuse tests fail in desktop_runner or gimp or other modules in what I think is boo#1105691 – can we detect this bug from the journal and track as soft-fail? added


#10

Updated by okurz over 6 years ago

Subject changed from [epic] Detect "known failures" and mark jobs as such to [functional][y][u][epic] Detect "known failures" and mark jobs as such

Trying to bring it forward with help of QSF again…


#11

Updated by okurz over 6 years ago

Related to action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame added


#12

Updated by okurz over 6 years ago

Related to deleted (action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame)


#13

Updated by okurz over 6 years ago

Blocks action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame added


#14

Updated by okurz over 6 years ago

Related to action #40382: Make "ignored" issues more prominent (was: create new state "ignored") added


#15

Updated by okurz over 6 years ago

https://github.com/os-autoinst/os-autoinst/pull/1052 to "Add option to override status of test modules with soft-fail"


#16

Updated by okurz over 6 years ago

Status changed from New to Feedback
Assignee set to okurz


-> https://github.com/os-autoinst/os-autoinst/pull/1062

#17

Updated by okurz over 6 years ago

The feature is not working as intended because in https://github.com/os-autoinst/os-autoinst/blob/master/basetest.pm#L286 we overwrite the result again. I am trying to simply remove that method :)

Also presented my idea to riafarov and we identified one problematic scenario: What if we force the status of a parent job to "softfail"? For now openQA would still trigger the downstream jobs, which would then most likely fail because a module in the parent job failed. In the worst case the downstream jobs would even end up incomplete because the HDD image was never published properly. We should avoid this.


#18

Updated by okurz about 6 years ago

Related to action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen added


#19

Updated by szarate about 6 years ago

Related to action #45011: Allow detection of known failures at the autoinst-log.txt added


#20

Updated by szarate about 6 years ago

I see that one of the suggestions on this ticket was exactly what poo#45011 is about :)


#21

Updated by agraul about 6 years ago

Related to deleted (action #45011: Allow detection of known failures at the autoinst-log.txt)


#22

Updated by agraul about 6 years ago

Blocked by action #45011: Allow detection of known failures at the autoinst-log.txt added


#23

Updated by agraul about 6 years ago

Status changed from Feedback to Blocked

#45011


#24

Updated by okurz about 6 years ago

Due date changed from 2018-08-28 to 2019-03-12

due to changes in a related task


#25

Updated by okurz almost 6 years ago

Due date changed from 2019-03-12 to 2019-06-30

due to changes in a related task


#26

Updated by okurz over 5 years ago

Assignee changed from okurz to riafarov

Moving to the new QSF-y PO after I moved to the "tools" team. I mainly checked the subject line, so in individual cases you might not agree to take the ticket over completely into QSF-y. Feel free to reassign to me or someone else in that case. Thanks.


#27

Updated by riafarov over 5 years ago

Blocks deleted (action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame)


#28

Updated by riafarov over 5 years ago

Due date changed from 2019-06-30 to 2019-08-06

due to changes in a related task


#29

Updated by riafarov over 5 years ago

Due date changed from 2019-08-06 to 2019-12-31

due to changes in a related task


#30

Updated by okurz over 5 years ago

Related to action #57452: Automatic summary of failures added


#31

Updated by okurz about 5 years ago

Using https://github.com/os-autoinst/scripts/blob/master/monitor-openqa_job and https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues I set up a GitLab CI pipeline in https://gitlab.suse.de/openqa/auto-review/ that automatically labels (and restarts) incompletes for which we know the reasons. The approach could also be extended to cover more than just incompletes.
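
Many subtask subjects in this ticket carry a marker of the form `auto_review:"<regex>"`, optionally followed by `:retry`. The following is a minimal sketch of how such a marker could drive the "label and restart" decision. It does not reproduce the actual openqa-label-known-issues script: the marker syntax is inferred from the subtask subjects only, and `parse_auto_review`, `decide` and the returned dictionary are hypothetical names chosen for illustration.

```python
import re
from typing import NamedTuple, Optional

# Marker syntax as it appears in the subtask subjects of this ticket.
AUTO_REVIEW = re.compile(r'auto_review:"(?P<pattern>.+?)"(?P<retry>:retry)?')


class AutoReview(NamedTuple):
    pattern: str  # regex to search for in the job reason or log
    retry: bool   # True if the subject ends the marker with ":retry"


def parse_auto_review(subject: str) -> Optional[AutoReview]:
    """Extract the regex and the retry flag from a ticket subject."""
    m = AUTO_REVIEW.search(subject)
    if m is None:
        return None
    return AutoReview(m.group("pattern"), m.group("retry") is not None)


def decide(subject: str, ticket: str, job_output: str) -> Optional[dict]:
    """Return a label/restart decision if the ticket matches the job output."""
    rule = parse_auto_review(subject)
    if rule is None or not re.search(rule.pattern, job_output):
        return None
    return {"label": ticket, "restart": rule.retry}


# Usage with the subject of subtask #71827 from this ticket.
subject = (
    'test incompletes with auto_review:"(?s)Failed to download.*'
    'Asset was pruned immediately after download":retry'
)
output = "Failed to download foo.qcow2\nAsset was pruned immediately after download"
print(decide(subject, "poo#71827", output))
```

Keeping the pattern in the ticket subject means the tracker itself acts as the knowledge base, so adding a new known issue does not require changing any code.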


#32

Updated by okurz about 5 years ago

Related to coordination #19720: [epic] Simplify investigation of job failures added


#33

Updated by riafarov about 5 years ago

Assignee changed from riafarov to okurz

As it's mainly the tools team working on this epic, @okurz, I will set you as assignee to track the progress. Feel free to change it; I rely on your expertise to pick a more suitable person if it's not you. Thanks!


#34

Updated by okurz about 5 years ago

Subject changed from [functional][y][u][epic] Detect "known failures" and mark jobs as such to [epic] Detect "known failures" and mark jobs as such

that's ok, it's me :)

There is currently only one open subtask though, #46988 on QSF-u.


#35

Updated by okurz about 5 years ago

Due date changed from 2019-12-31 to 2020-12-31

due to changes in a related task


#36

Updated by okurz almost 5 years ago

Subject changed from [epic] Detect "known failures" and mark jobs as such to [saga] Detect "known failures" and mark jobs as such


#37

Updated by okurz almost 5 years ago

Subject changed from [saga] Detect "known failures" and mark jobs as such to [saga][epic] Detect "known failures" and mark jobs as such


#38

Updated by SLindoMansilla almost 5 years ago

Due date changed from 2020-12-31 to 2020-03-27

due to changes in a related task: #46988


#39

Updated by okurz over 4 years ago

Due date changed from 2020-06-09 to 2020-03-27

due to changes in a related task: #62420


#40

Updated by okurz over 4 years ago

Due date changed from 2018-08-28 to 2020-03-27

due to changes in a related task: #38621


#41

Updated by szarate over 4 years ago

Tracker changed from action to coordination
Status changed from Blocked to New


#42

Updated by szarate over 4 years ago

For the reason for the tracker change, see: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html


#43

Updated by okurz over 4 years ago

Status changed from New to Blocked
Target version changed from future to Ready

Discussed the topic of "auto-review" with the SUSE QA Tools team. The general opinion was that this epic is interesting to follow up on, so I am putting it into the backlog now.


#44

Updated by okurz about 4 years ago

Subject changed from [saga][epic] Detect "known failures" and mark jobs as such to [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier


#45

Updated by livdywan about 4 years ago

Once again wondering: where's the due date coming from? It's not visible. Do we need to go through every single ticket again to check?


#46

Updated by okurz about 4 years ago

Maybe the API helps to find that easily but in this case it's #80264


#47

Updated by okurz almost 4 years ago

Subject changed from [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier to [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues


#48

Updated by okurz about 3 years ago

Copied to coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA added


#49

Updated by okurz about 3 years ago

Blocked by deleted (action #45011: Allow detection of known failures at the autoinst-log.txt)


#50

Updated by okurz about 3 years ago

Related to action #45011: Allow detection of known failures at the autoinst-log.txt added
