action #39719

[saga][epic] Detect "known failures" and mark jobs as such

Added by okurz over 1 year ago. Updated 5 days ago.

Status: Blocked
Start date: 23/05/2018
Priority: High
Due date: 31/12/2020
Assignee: okurz
% Done: 39%
Category: Feature requests
Estimated time: 50.00 hours
Target version: QA - future
Difficulty:
Duration: 682

Description

User Story

As a reviewer of failed openQA tests, I want known job failures to be marked as such automatically, regardless of the error source, so that I do not waste time investigating them

Acceptance criteria

  • AC1: If a job fails for any reason that is already "known" in the context of the current openQA instance, human reviewers do not need to spend any further "test review" effort on it

Suggestions

  • Provide a mechanism to match regexes in serial0.txt (as the existing "serial exception catching" feature does), based on patterns defined in the test distribution
  • Provide the same for autoinst-log.txt
  • Provide patterns defined in os-autoinst for backend-specific failures, e.g. the "key event queue full" error -> look for that string in os-autoinst to find existing code that handles it
  • Same as above, but with patterns defined in instance-specific configuration, e.g. workers.ini (managed by salt for SLE)
  • Maybe the same based on needles? But maybe the current approach of using the "workaround" property and always preferring soft-fail needles is already good enough :)
  • It might be necessary to redefine "soft-fail" as "known issue" and nothing more. The "known failure" detection could then set a job to soft-failed with a reference to the known issue and immediately abort further execution of the job. This would prevent the job from failing at a sporadic later step, which would in turn require openQA comments to provide a label
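
The pattern sources suggested above could be combined roughly as follows. This is a minimal sketch and not existing openQA or os-autoinst code: `KnownFailure`, `find_known_failure` and the issue reference `placeholder#0` are hypothetical. Only the log file names and the two quoted messages are taken from this ticket.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class KnownFailure:
    """One known symptom (hypothetical structure, not an openQA type)."""
    source: str    # where the pattern is defined, e.g. "test distribution"
    log_file: str  # log to search, e.g. "serial0.txt"
    pattern: str   # regular expression describing the symptom
    issue: str     # issue reference to attach to the job


def find_known_failure(
    logs: Dict[str, str], known: Iterable[KnownFailure]
) -> Optional[KnownFailure]:
    """Return the first known failure whose pattern occurs in its log file."""
    for failure in known:
        content = logs.get(failure.log_file, "")
        if re.search(failure.pattern, content):
            return failure
    return None


# Messages quoted in this ticket; the second issue reference is a placeholder.
KNOWN = [
    KnownFailure("test distribution", "serial0.txt",
                 r"Module is not signed with expected PKCS#7 message",
                 "bsc#1093659"),
    KnownFailure("os-autoinst", "autoinst-log.txt",
                 r"key event queue full", "placeholder#0"),
]

# Made-up log contents for illustration.
logs = {
    "serial0.txt": "boot ok\n",
    "autoinst-log.txt": "sending key ret\nkey event queue full\n",
}
match = find_known_failure(logs, KNOWN)
print(match.issue if match else "unknown failure, needs human review")
```

Keeping the source next to each pattern would allow instance-specific entries (e.g. from workers.ini) to be listed separately from those shipped with the test distribution or os-autoinst.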

Further details

Definitions:

  • "known" means that a certain symptom of a test failure has been described, e.g. with a matching pattern, in a test distribution, in os-autoinst or maybe in openQA itself, as in the Jenkins plugin mentioned below
  • "test review" means what we currently do in openSUSE or SLE: providing job labels with issue references in openQA comments, which are then carried over. So far this carry-over only works within individual scenarios

See https://wiki.jenkins.io/display/JENKINS/Build+Failure+Analyzer for an example. This Jenkins plugin uses a "knowledge base" of "known failures" that is global to the Jenkins instance. Each entry is defined with a description and pattern matching, e.g. "build log parsing", and a failure is marked as known when any log content matches an existing pattern


Subtasks

openQA Tests - action #38621: [functional][y] test fails in welcome - "Module is not si... Resolved riafarov
openQA Tests - action #46988: [functional][u] Detect known bugs from system journal Workable
action #60560: Self-investigate potential reasons for failures in openQA Resolved okurz
action #62420: [epic] Distinguish all types of incompletes Blocked okurz
action #45062: [feature][tools] Better visualization of incompletes - sh... Workable
action #61922: [epic] Incomplete jobs with no logs at all Resolved mkittler
action #62984: Fix problem with job-worker assignment resulting in API e... Resolved mkittler
action #63718: incomplete reason with just "quit"/"died" could provide m... Resolved mkittler
action #64854: qemu-img error message is incorrectly tried to be parsed ... Resolved tinita
action #64857: Put single-line error messages into incomplete reason for... In Progress cdywan
action #64884: Distinguish test contributor errors from unexpected backe... Workable
action #64917: qemu-img create sometimes fails with exit code 1 but no a... Workable
action #63065: [gsoc] dynamic detection of error conditions from test re... New


Related issues

Related to openQA Project - action #13242: WDYT: For every job that does not have a label or bugref,... Rejected 25/11/2016
Related to openQA Project - action #13812: [epic][dashboard] openQA Dashboard ideas Blocked 10/01/2017
Related to openQA Tests - action #42446: [functional][u] many opensuse tests fail in desktop_runne... Blocked 13/10/2018
Related to openQA Project - action #40382: Make "ignored" issues more prominent (was: create new sta... Workable 29/08/2018
Related to openQA Tests - action #43784: [functional][y][sporadic] test fails in yast2_snapper now... Resolved 14/11/2018
Related to openQA Project - action #57452: Automatic summary of failures Rejected 27/09/2019
Related to openQA Project - action #19720: Simplify investigation of job failures Feedback 17/12/2019
Blocked by openQA Project - action #45011: [tools] Allow detection of known failures at the autoinst... Workable 11/12/2018

History

#1 Updated by okurz over 1 year ago

  • Related to action #13242: WDYT: For every job that does not have a label or bugref, retrigger some times to see if it's sporadic. Like rescheduling on incomplete but on failed added

#2 Updated by okurz over 1 year ago

  • Related to action #38621: [functional][y] test fails in welcome - "Module is not signed with expected PKCS#7 message" (bsc#1093659) - Use serial exception catching feature from openQA to make sure the jobs reference the bug, e.g. as label added

#3 Updated by okurz over 1 year ago

  • Related to action #13812: [epic][dashboard] openQA Dashboard ideas added

#4 Updated by okurz over 1 year ago

  • Related to deleted (action #38621: [functional][y] test fails in welcome - "Module is not signed with expected PKCS#7 message" (bsc#1093659) - Use serial exception catching feature from openQA to make sure the jobs reference the bug, e.g. as label)

#5 Updated by nicksinger over 1 year ago

Another idea which could be checked/better reported to the user:

If a crucial component in the "os-autoinst-chain" fails (e.g. xterm for ipmi jobs), openQA could easily report this earlier. As it is right now, the job stalls (hangs as "running") and only shows a black screen. Example: https://openqa.suse.de/tests/1970948 (look for "PermissionError" in the autoinst-log.txt)
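
Such early reporting could look roughly like the following. This is only a sketch under assumptions: `FATAL_PATTERNS` and `first_fatal_line` are hypothetical names, the sample log lines are made up, and only the "PermissionError" string comes from the example above.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical list of fatal symptoms; "PermissionError" is the string
# mentioned in the comment above.
FATAL_PATTERNS = [re.compile(r"PermissionError")]


def first_fatal_line(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (line number, line) of the first fatal symptom, else None.

    The iterable is consumed lazily, so reading stops as soon as a
    symptom appears instead of waiting for the job to time out.
    """
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in FATAL_PATTERNS):
            return number, line.rstrip("\n")
    return None


# Made-up log excerpt for illustration.
log = [
    "starting xterm\n",
    "PermissionError: [Errno 13] Permission denied\n",
    "this line is never inspected\n",
]
print(first_fatal_line(log))
```

Because the function accepts any iterable of lines, it could in principle be fed from a log file that is still being written, which is what reporting "earlier" would need.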

#6 Updated by coolo over 1 year ago

  • Target version set to future

IMO this is best handled by an automated review from outside. The problem is not so much detecting the issue as how to handle it. For some projects/objects you would do a retrigger, for others you would prefer defining a label.

#7 Updated by okurz over 1 year ago

"outside", yes, I agree. Should be outside what is currently defined as "openQA" but it could be that we still call it "the openQA ecosystem" so I guess this issue tracker is still best suited. Some parts we have already covered with the proof-of-concept of detecting known failures in the serial port output.

#8 Updated by coolo over 1 year ago

I don't disagree with the issue tracker - I just don't want a High priority epic in my 'to be sorted' list

#9 Updated by okurz over 1 year ago

  • Related to action #42446: [functional][u] many opensuse tests fail in desktop_runner or gimp or other modules in what I think is boo#1105691 – can we detect this bug from the journal and track as soft-fail? added

#10 Updated by okurz over 1 year ago

  • Subject changed from [epic] Detect "known failures" and mark jobs as such to [functional][y][u][epic] Detect "known failures" and mark jobs as such

Trying to bring it forward with help of QSF again…

#11 Updated by okurz over 1 year ago

  • Related to action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame added

#12 Updated by okurz over 1 year ago

  • Related to deleted (action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame)

#13 Updated by okurz over 1 year ago

  • Blocks action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame added

#14 Updated by okurz over 1 year ago

  • Related to action #40382: Make "ignored" issues more prominent (was: create new state "ignored") added

#15 Updated by okurz over 1 year ago

https://github.com/os-autoinst/os-autoinst/pull/1052 to "Add option to override status of test modules with soft-fail"

#16 Updated by okurz over 1 year ago

  • Status changed from New to Feedback
  • Assignee set to okurz

#17 Updated by okurz over 1 year ago

The feature is not working as intended because we overwrite the result again in https://github.com/os-autoinst/os-autoinst/blob/master/basetest.pm#L286 . I am trying to simply remove that method :)

-> https://github.com/os-autoinst/os-autoinst/pull/1062

I also presented my idea to riafarov and we identified one problematic scenario: What if we force the status of a parent job to "softfail"? For now openQA would still trigger the downstream jobs, which would then most likely fail because a module in the parent job failed. In the worst case the downstream jobs would even become incomplete because the HDD image was never published properly. We should avoid this though.
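
One way to reason about this scenario is an explicit check before downstream jobs are triggered. The sketch below is purely illustrative and does not describe how openQA schedules jobs: `ParentJob`, its fields and `should_trigger_children` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParentJob:
    """Hypothetical view of a parent job, not an openQA data structure."""
    result: str             # e.g. "passed", "softfailed" or "failed"
    forced_softfail: bool   # True if a failed module was overridden to soft-fail
    assets_published: bool  # True if e.g. the HDD image was published


def should_trigger_children(parent: ParentJob) -> bool:
    """Decide whether downstream jobs have a chance to run meaningfully."""
    if parent.result not in ("passed", "softfailed"):
        return False
    # A forced soft-fail hides a real module failure; without the published
    # assets the children would only end up incomplete.
    if parent.forced_softfail and not parent.assets_published:
        return False
    return True


print(should_trigger_children(ParentJob("softfailed", True, False)))
print(should_trigger_children(ParentJob("softfailed", True, True)))
```

Under these assumptions a forced soft-fail only blocks downstream jobs when the assets they depend on are missing.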

#18 Updated by okurz over 1 year ago

  • Related to action #43784: [functional][y][sporadic] test fails in yast2_snapper now reproducibly not exiting the "show differences" screen added

#19 Updated by szarate over 1 year ago

  • Related to action #45011: [tools] Allow detection of known failures at the autoinst-log.txt added

#20 Updated by szarate over 1 year ago

I see that one of the suggestions on this ticket was exactly what poo#45011 is about :)

#21 Updated by agraul about 1 year ago

  • Related to deleted (action #45011: [tools] Allow detection of known failures at the autoinst-log.txt)

#22 Updated by agraul about 1 year ago

  • Blocked by action #45011: [tools] Allow detection of known failures at the autoinst-log.txt added

#23 Updated by agraul about 1 year ago

  • Status changed from Feedback to Blocked

#24 Updated by okurz about 1 year ago

  • Due date changed from 28/08/2018 to 12/03/2019

due to changes in a related task

#25 Updated by okurz about 1 year ago

  • Due date changed from 12/03/2019 to 30/06/2019

due to changes in a related task

#26 Updated by okurz 11 months ago

  • Assignee changed from okurz to riafarov

Moving this to the new QSF-y PO after I moved to the "tools" team. I mainly checked the subject line, so in individual instances you might not agree to take it over completely into QSF-y. Feel free to reassign to me or someone else in this case. Thanks.

#27 Updated by riafarov 11 months ago

  • Blocks deleted (action #27004: [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame)

#28 Updated by riafarov 9 months ago

  • Due date changed from 30/06/2019 to 06/08/2019

due to changes in a related task

#29 Updated by riafarov 8 months ago

  • Due date changed from 06/08/2019 to 31/12/2019

due to changes in a related task

#30 Updated by okurz 6 months ago

#31 Updated by okurz 4 months ago

Using https://github.com/os-autoinst/scripts/blob/master/monitor-openqa_job and https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues I set up a GitLab CI pipeline in https://gitlab.suse.de/openqa/auto-review/ that automatically labels (and restarts) incompletes for which we know the reasons. The approach could also be extended to cover more than just incompletes.
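
The "label and restart" decision can be sketched as below. This does not reproduce the linked scripts: `KnownIssue`, `plan_actions`, the pattern and the sample reason strings are made up for illustration, and the mapping to poo#64917 is only an example.

```python
import re
from typing import List, NamedTuple


class KnownIssue(NamedTuple):
    pattern: str   # regex matched against the incomplete reason
    ticket: str    # issue reference used as label
    restart: bool  # whether a retry is expected to help


def plan_actions(reason: str, known_issues: List[KnownIssue]) -> List[str]:
    """Return the actions to take for one incomplete job."""
    for issue in known_issues:
        if re.search(issue.pattern, reason):
            actions = [f"label {issue.ticket}"]
            if issue.restart:
                actions.append("restart")
            return actions
    return []  # unknown reason: leave the job for human review


# Made-up knowledge base; pattern and reason text are illustrative only.
KNOWN_ISSUES = [KnownIssue(r"qemu-img create.*exit code 1", "poo#64917", True)]

print(plan_actions("qemu-img create failed with exit code 1", KNOWN_ISSUES))
print(plan_actions("died", KNOWN_ISSUES))
```

Storing the restart decision per known issue reflects the earlier point that some failures call for a retrigger while others only need a label.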

#32 Updated by okurz 4 months ago

  • Related to action #19720: Simplify investigation of job failures added

#33 Updated by riafarov 3 months ago

  • Assignee changed from riafarov to okurz

As it's mainly the tools team working on this epic, @okurz I will set you as the assignee to track the progress. Feel free to change it, I rely on your expertise to set a more suitable person if it's not you. Thanks!

#34 Updated by okurz 3 months ago

  • Subject changed from [functional][y][u][epic] Detect "known failures" and mark jobs as such to [epic] Detect "known failures" and mark jobs as such

that's ok, it's me :)

There is currently only one open subtask, #46988, on QSF-u though.

#35 Updated by okurz 3 months ago

  • Due date changed from 31/12/2019 to 31/12/2020

due to changes in a related task

#36 Updated by okurz about 1 month ago

  • Subject changed from [epic] Detect "known failures" and mark jobs as such to [saga] Detect "known failures" and mark jobs as such

#37 Updated by okurz about 1 month ago

  • Subject changed from [saga] Detect "known failures" and mark jobs as such to [saga][epic] Detect "known failures" and mark jobs as such
