coordination #77899

openQA Project - coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

[epic] Extend "auto-review" for failed jobs as well

Added by okurz 2 months ago. Updated 22 days ago.

Status: Workable
Priority: Normal
Assignee:
Target version:
Start date: 2020-11-26
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Motivation

SUSE QEM in particular suffers from the workload of manually reviewing openQA test results due to the comparatively high false-positive rate (after GM the product is of higher quality than products still in development before GM). The existing scenario-based "label carry-over" is much less useful for the current setup of QAM scenarios, which are spread over many different job groups. With "auto-review" we have a good solution to handle known incompletes, retrigger automatically where it makes sense, and find new, unknown incompletes easily. As "auto-review" works regardless of the result of the job and depends only on the list of jobs passed to it, we should evaluate extending it to handle unlabeled failed results as well.

Acceptance criteria

  • AC1: Failed openQA jobs whose log(s) match a regex specified in progress tickets with "auto_review", as is already done for incomplete jobs, are labeled with the corresponding ticket
  • AC2: No gitlab CI pipelines monitored by the team SUSE QE Tools fail when unlabeled, unknown failed jobs are encountered
  • AC3: Same for o3 and osd
  • AC4: Power users know about the feature and how it can be used
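
A minimal sketch of the matching described in AC1, assuming each ticket contributes one extended regex that is tested against a job's log file. The ticket ids, the regexes and the tab-separated `TICKETS` layout are hypothetical and only illustrate the idea; how the regex is stored in and read from a ticket is not covered here.

```shell
#!/bin/sh
# Sketch only: the ticket list and its "<id><TAB><regex>" layout are
# hypothetical stand-ins for data that would come from progress tickets.
TAB=$(printf '\t')
TICKETS="1001${TAB}Connection refused.*qemu
1002${TAB}No space left on device"

# Print "#<id>" for the first ticket whose regex matches the log file.
# Prints nothing if no ticket matches, i.e. the job stays unlabeled.
label_for_log() {
    log_file=$1
    printf '%s\n' "$TICKETS" | while IFS="$TAB" read -r id regex; do
        if grep -qE -- "$regex" "$log_file"; then
            echo "#$id"
            break
        fi
    done
}
```

A caller could run `label_for_log autoinst-log.txt` (file name chosen for illustration) and only comment on the job when the output is non-empty, which also fits AC2: an unmatched job is simply left alone instead of failing the pipeline.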

Suggestions

  • Don't fail gitlab CI pipelines when failed jobs are not known, as SUSE QE Tools can't handle that load of unreviewed, new, failed tests and should not be concerned about that
  • Start with o3 as "testbed" and extend to osd if the process on o3 runs in a convincing way
  • Consider including the solution within openQA itself, e.g. as a plugin, triggering a synchronous action when a job finishes and after automatic label carry-over did not find a convincing candidate
  • Consider caching tickets to reduce repeated loading from the redmine API while still ensuring that ticket updates, e.g. fixed auto-review regexes, take effect, e.g. only cache for 10s or 1m
  • Present to power users, e.g. documentation, blog article, feature video, workshop
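
The short-lived ticket cache suggested above could look roughly like the following sketch. The cache file location, the `CACHE_TTL` variable and the `fetch_tickets` function are hypothetical; `fetch_tickets` merely stands in for whatever query is sent to the redmine API.

```shell
#!/bin/sh
# Sketch only: file name, TTL variable and fetch_tickets are hypothetical.
CACHE_FILE=${CACHE_FILE:-/tmp/auto-review-tickets.cache}
CACHE_TTL=${CACHE_TTL:-60}  # seconds; the suggestion above is 10s to 1m

# Hypothetical stand-in for the real query against the redmine API.
fetch_tickets() {
    echo "1001 example ticket"
}

# Print the ticket list, refreshing the cache file once it is at least
# CACHE_TTL seconds old. The first line of the file stores the fetch time.
# Relies on "date +%s", which is widely available but not strictly POSIX.
cached_tickets() {
    now=$(date +%s)
    stamp=0
    if [ -f "$CACHE_FILE" ]; then
        read -r stamp < "$CACHE_FILE" || stamp=0
    fi
    if [ $((now - ${stamp:-0})) -ge "$CACHE_TTL" ]; then
        { echo "$now"; fetch_tickets; } > "$CACHE_FILE"
    fi
    tail -n +2 "$CACHE_FILE"
}
```

With a TTL in the range of seconds to a minute, a burst of finished jobs shares one API request, while a corrected regex in a ticket is picked up at the latest one TTL after the update.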

Subtasks

action #80414: [proof-of-concept] Extend "auto-review" for failed jobs as well, start with o3 (Resolved, okurz)

action #80418: [learning] Fix parse errors in "openqa-investigate" "parse error: Invalid numeric literal at line 1, column 10" (Resolved, mkittler)

action #80806: Extend "auto-review" for failed jobs as well - Generalize openqa-monitor-investigation-candidates to look at more than just one job group (Resolved, okurz)

action #80808: Extend "auto-review" for failed jobs as well - enable same as on o3 but on osd (Resolved, okurz)


Related issues

Copied to QA - action #77944: Run "auto-review" more often but alarm less (Resolved, 2020-11-14)

History

#1 Updated by okurz 2 months ago

  • Description updated (diff)
  • Target version set to Ready

#2 Updated by okurz 2 months ago

  • Copied to action #77944: Run "auto-review" more often but alarm less added

#3 Updated by okurz 2 months ago

  • Parent task set to #39719

#4 Updated by okurz about 2 months ago

  • Tracker changed from action to coordination
  • Subject changed from Extend "auto-review" for failed jobs as well to [epic] Extend "auto-review" for failed jobs as well
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by okurz about 2 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

tracking both subtasks

#6 Updated by okurz about 2 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

With both current subtasks resolved I see the proof-of-concept successfully in place. As next steps I recommend extending the approach to a selected product or job group on osd as well as to all "non-development" job groups on o3. Anyone can specify the next subtasks for this and follow up in them.

#7 Updated by okurz about 1 month ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

blocked on subtasks

#8 Updated by okurz about 1 month ago

  • Status changed from Blocked to Workable

All current subtasks resolved. Latest results in #80806#note-4

  • Switching off the triggers for investigation jobs or even the complete schedule from the gitlab CI pipeline

I disabled both "daily" and "hourly" schedules on
https://gitlab.suse.de/openqa/auto-review/-/pipeline_schedules
with corresponding comments in the schedule names, e.g. "DISABLED:, replaced by job-done-hooks, see https://progress.opensuse.org/issues/77899 - hourly". Let's see if o3 and osd run fine based on job-done-hooks alone.

  • Check if auto-review is still correctly triggered for both o3 and osd

I checked with

for i in o3 osd; do ssh $i "sudo -u geekotest psql openqa -c \"select jobs.id, result_dir,t_finished from comments,jobs,users where comments.user_id = users.id and comments.job_id = jobs.id and username ~ 'auto-review' order by id DESC limit 10;\""; done

and the last time "auto-review" had to comment was some days ago. So the process seems to work in general.

TODO

  • Update description of epic
  • Try to post comments as "auto-review", not "geekotest", with corresponding user account keys etc., e.g. check what user "geekotest" is doing: either just use a new api-key and secret, or sudo?
