[epic] Detect "known failures" and mark jobs as such
|Category:|Feature requests|Estimated time:|50.00 hours|
|Target version:|QA - future|
As a reviewer of failed openQA tests, I want known failures of jobs to be marked as such automatically, regardless of the error source, so that I do not waste time investigating known failures
- AC1: If a job fails for any reason that is already "known" in the context of the current openQA instance, no further "test review" effort is needed from human reviewers
- Provide a mechanism to match on regex in serial0.txt (as provided by existing "serial exception catching"-feature) based on patterns defined in the test distribution
- Same for autoinst-log.txt
- Provide patterns defined in os-autoinst for backend-specific issues, e.g. the "key event queue full" message -> look for that string in os-autoinst to find the existing code that handles it
- Same as above but with patterns defined in instance-specific configuration, e.g. workers.ini (managed by salt for SLE)
- Maybe the same based on needles? But maybe the current approach, using the "workaround" property with soft-fail needles always being preferred, is already good enough :)
- It might be necessary to redefine "soft-fail" as "known issue" and nothing more. The "known failure" detection could then set a job to soft-failed with a reference to the known issue and immediately abort further execution of the job. This would prevent the job from failing at a sporadic later step, which would require openQA comments to provide a label
- "known" means that a certain symptom of a test failure has been described, e.g. with a matching pattern, in either a test distribution, os-autoinst or maybe openQA itself, as with the Jenkins plugin mentioned below
- "test review" means what we currently do in openSUSE or SLE by providing job labels with issue references in openQA comments which are carried over – which so far only works within individual scenarios
See https://wiki.jenkins.io/display/JENKINS/Build+Failure+Analyzer for an example. This Jenkins plugin uses a "knowledge base" of "known failures" that is global to the Jenkins instance. Each entry is defined with a description and a matching pattern, e.g. for "build log parsing", and a failure is marked as known when any log content matches an existing pattern
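A minimal sketch of how such a knowledge base could be matched against job logs, assuming the patterns from all sources (test distribution, os-autoinst, instance configuration) have already been loaded into one list. The names `KnownFailure` and `find_known_failure`, the issue references and the "Kernel panic" pattern are hypothetical and only serve as illustration; nothing here reflects existing openQA or os-autoinst code.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class KnownFailure:
    """One knowledge base entry (hypothetical structure)."""
    pattern: str    # regular expression searched for in the log text
    reference: str  # issue reference to attach to the job
    log_file: str   # log the pattern applies to, e.g. "serial0.txt"
    source: str     # where the pattern was defined


def find_known_failure(logs: Dict[str, str],
                       known: List[KnownFailure]) -> Optional[KnownFailure]:
    """Return the first entry whose pattern matches its log file, else None."""
    for entry in known:
        text = logs.get(entry.log_file, "")
        if re.search(entry.pattern, text, flags=re.MULTILINE):
            return entry
    return None


# Hypothetical entries; only the "key event queue full" string is from the ticket.
KNOWN = [
    KnownFailure(r"key event queue full", "example#1",
                 "autoinst-log.txt", "os-autoinst"),
    KnownFailure(r"Kernel panic - not syncing", "example#2",
                 "serial0.txt", "test distribution"),
]

logs = {
    "serial0.txt": "[ 12.3] Kernel panic - not syncing: VFS: Unable to mount root fs\n",
    "autoinst-log.txt": "no backend problems logged\n",
}

match = find_known_failure(logs, KNOWN)
print(match.reference if match else "unknown failure")
```

Keeping the `source` field on each entry would allow a reviewer to see whether a match came from the test distribution, os-autoinst or the instance configuration.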
#5 Updated by nicksinger over 1 year ago
Another idea which could be checked/better reported to the user:
If a crucial component in the "os-autoinst-chain" fails (e.g. xterm for ipmi jobs), openQA could easily report this earlier. As it is right now, the job stalls (hangs as "running") and only shows a black screen. Example: https://openqa.suse.de/tests/1970948 (look for "PermissionError" in the autoinst-log.txt)
#7 Updated by okurz over 1 year ago
"outside", yes, I agree. Should be outside what is currently defined as "openQA" but it could be that we still call it "the openQA ecosystem" so I guess this issue tracker is still best suited. Some parts we have already covered with the proof-of-concept of detecting known failures in the serial port output.
#15 Updated by okurz over 1 year ago
https://github.com/os-autoinst/os-autoinst/pull/1052 to "Add option to override status of test modules with soft-fail"
#17 Updated by okurz over 1 year ago
The feature is not working as intended because in https://github.com/os-autoinst/os-autoinst/blob/master/basetest.pm#L286 we overwrite the result again. I am trying to simply remove that method :)
I also presented my idea to riafarov and we identified one problematic scenario: What if we force the status of a parent job to "softfail"? For now openQA would still trigger the downstream jobs, which would then most likely fail because a module in the parent job failed. In the worst case the downstream jobs would even end up incomplete because the HDD image was never published properly. We should avoid this.
- Assignee changed from okurz to riafarov
Moving this to the new QSF-y PO after I moved to the "tools" team. I mainly checked the subject line, so in individual instances you might not agree to take it over completely into QSF-y. Feel free to reassign to me or someone else in this case. Thanks.
Using https://github.com/os-autoinst/scripts/blob/master/monitor-openqa_job and https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues I set up a GitLab CI pipeline in https://gitlab.suse.de/openqa/auto-review/ that automatically labels (and restarts) incompletes for which we know the reasons. The approach could also be extended to cover more than just incompletes.
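The label-and-restart decision could look roughly like the following sketch. It is not taken from the linked scripts: `KnownIssue`, `Action`, `review_incomplete`, the comment text and the example entry are all hypothetical, and the sketch assumes the job result and failure reason are already available as plain strings.

```python
import re
from typing import List, NamedTuple, Optional


class KnownIssue(NamedTuple):
    reference: str  # issue reference used for the label (hypothetical)
    pattern: str    # regex matched against the job's failure reason
    restart: bool   # whether a matching job should be retriggered


class Action(NamedTuple):
    comment: str    # comment text to post on the job (assumed format)
    restart: bool   # whether to restart the job


def review_incomplete(result: str, reason: str,
                      issues: List[KnownIssue]) -> Optional[Action]:
    """Decide what to do with a job; None means a human has to review it."""
    if result != "incomplete":
        return None  # this sketch only handles incompletes
    for issue in issues:
        if re.search(issue.pattern, reason):
            return Action(f"Known issue: {issue.reference}", issue.restart)
    return None


# Hypothetical knowledge base with a single entry.
ISSUES = [KnownIssue("example#3", r"PermissionError", restart=True)]

print(review_incomplete("incomplete", "backend died: PermissionError", ISSUES))
print(review_incomplete("failed", "backend died: PermissionError", ISSUES))
```

Extending the approach beyond incompletes would mainly mean relaxing the `result` check and matching against more inputs than the failure reason.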