action #109920

closed

coordination #102915: [saga][epic] Automated classification of failures

QA - coordination #94105: [epic] Use feedback from openqa-investigate to automatically inform on github pull requests, open tickets, weed out automatically failed tests

Identify reproducible product issues using openqa-investigate size:M

Added by okurz over 2 years ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2023-03-23
Due date:
% Done: 100%
Estimated time:

Description

Motivation

See parent #94105 where we identified multiple user stories regarding creating tickets or identifying direct or indirect users of openQA based on openqa-investigate results. As a next step we could try to identify product issues from openqa-investigate results, in particular the step "S3: retry X, last_good_test X, last_good_build V, last_good_test+build V -> reproducible product issue => if QAM test write comment on IBS/OBS or smelt, for non-QAM report product bug" from https://progress.opensuse.org/issues/94105#Suggestions

Acceptance criteria

  • AC1: On failed openQA jobs with openqa-investigate info "retry X, last_good_test X, last_good_build V, last_good_test+build V" (X: failed, V: passed) a comment is written pointing to a likely product regression
  • AC2: No such comment is written on other jobs
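
The pattern in AC1 and AC2 can be sketched as a small predicate. This is only an illustration and not code from openqa-investigate: the function name, the dictionary layout and the result strings "passed"/"failed" are assumptions, and the job names follow the investigation job names used further down in this ticket.

```python
# Expected result per investigation job for a likely product regression.
# X = failed, V = passed, following the notation of the acceptance criteria.
# The mapping and the result strings are assumptions made for this sketch.
PRODUCT_REGRESSION_PATTERN = {
    "retry": "failed",
    "last_good_tests": "failed",
    "last_good_build": "passed",
    "last_good_tests_and_build": "passed",
}


def is_likely_product_regression(results):
    """Return True only if every investigation job has the expected result.

    `results` maps an investigation job name to its result string.
    A missing job or any other result means no comment should be written (AC2).
    """
    return all(
        results.get(name) == expected
        for name, expected in PRODUCT_REGRESSION_PATTERN.items()
    )
```

Requiring an exact match for all four jobs keeps AC2 simple: any incomplete or differing combination falls through without a comment.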

Suggestions

  • Take a look at how we identify likely sporadic issues as a result of the "retry" job in https://github.com/os-autoinst/scripts/blob/master/openqa-investigate#L136=
  • Then using #110176 try to fan-in on the results of multiple investigation jobs to find the jobs with the combination "retry X, last_good_test X, last_good_build V, last_good_test+build V" (X: failed, V: passed). The challenge is that job done hooks are called on a single job so one would need to identify other sibling investigation jobs. And any other job can finish sooner than the others. Maybe we just call this investigation step on the "last_good_test" and if other jobs are not finished by then, then trigger another incarnation of the same minion job with a delay (exponential back-off?). Communicate by exit code? This would also avoid the need to run job_done_hooks on passed jobs.
  • https://github.com/os-autoinst/scripts/pull/170 might give good ideas how to find "related test results" and read them out
  • Extend existing unit tests
  • Then add an openQA comment stating the observation about a likely product regression
  • Note: The bash script openqa-investigate itself must not know anything about "openQA minion jobs" or schedule any
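
The "exponential back-off?" idea from the suggestions above could look roughly like the following. This is a minimal sketch; the function name, the base delay and the cap are hypothetical values chosen for illustration and do not come from openQA or Minion.

```python
def backoff_delay(attempt, base_seconds=60, max_seconds=3600):
    """Delay in seconds before re-running the hook for the given attempt.

    The delay doubles with every attempt (0, 1, 2, ...) and is capped at
    max_seconds. All default values are assumptions for this sketch.
    """
    return min(base_seconds * 2 ** attempt, max_seconds)
```

With these assumed defaults the first retries would wait 60, 120 and 240 seconds, and later retries would never wait longer than one hour.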

Concrete proposal

When do we want to run the logic regarding product regressions?

  • The simple retry is failing
  • The "last good" is also failing
  • The "last good build" is also failing (older image)

What does "fanning in" mean here exactly?

  • We need to consult relevant jobs, and we may need to consider the last/penultimate/nth job for reference
  • See also https://en.wikipedia.org/wiki/Fan-in for where the term comes from

Suggestion 1

Create 3 investigate jobs first and save the ids, then create the simple retry job with a new setting that lists the other ids:

id1 = systemd-networkd:investigate:last_good_tests:c2c7d0f5ef0e75043509bf7fe1324a81eee077e3: http://openqa.opensuse.org/t3071082
id2 = systemd-networkd:investigate:last_good_build:369.2: http://openqa.opensuse.org/t3071083
id3 = systemd-networkd:investigate:last_good_tests_and_build:c2c7d0f5ef0e75043509bf7fe1324a81eee077e3+369.2: http://openqa.opensuse.org/t3071084

job4: systemd-networkd:investigate:retry: http://openqa.opensuse.org/t3071081 OTHER_INVESTIGATE_JOBS=id1,id2,id3

Then, only in the hook script of the simple retry job, we check whether the other 3 jobs are finished.

  • if not: return 142, so the minion job will be run again later
  • if yes: get the results of all 4 jobs and decide from that whether it is a product issue
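
The check described in Suggestion 1 could be sketched as follows. The setting name OTHER_INVESTIGATE_JOBS and the exit code 142 are taken from this ticket; the function names, the `job_states` mapping and the assumption that a finished job reports the state "done" are illustrative only. How the states are fetched from openQA is deliberately left out.

```python
RETRY_LATER = 142  # exit code from Suggestion 1: run the minion job again later


def parse_other_job_ids(setting):
    """Turn the value of OTHER_INVESTIGATE_JOBS, e.g. "1,2,3", into a list of ints."""
    return [int(part) for part in setting.split(",") if part.strip()]


def hook_exit_code(other_ids, job_states):
    """Decide whether the hook can evaluate results now or must be retried.

    `job_states` is a hypothetical mapping of job id to state string.
    A job that is unknown or not in state "done" counts as unfinished.
    """
    if any(job_states.get(job_id) != "done" for job_id in other_ids):
        return RETRY_LATER
    return 0
```

For example, `hook_exit_code(parse_other_job_ids("3071082,3071083,3071084"), states)` would yield 142 as long as any of the three sibling jobs is still running.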

Suggestion 2

Write a comment like:

* systemd-networkd:investigate:retry: http://openqa.opensuse.org/t3071081
* systemd-networkd:investigate:last_good_tests:c2c7d0f5ef0e75043509bf7fe1324a81eee077e3: http://openqa.opensuse.org/t3071082
* systemd-networkd:investigate:last_good_build:369.2: http://openqa.opensuse.org/t3071083
* systemd-networkd:investigate:last_good_tests_and_build:c2c7d0f5ef0e75043509bf7fe1324a81eee077e3+369.2: http://openqa.opensuse.org/t3071084

IDS: 3071081,3071082,3071083,3071084

And then in the hook script go to the investigate origin job, fetch comments and parse the ids out. Maybe not the best idea though.
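
Parsing the ids out of such a comment could be as small as the sketch below. It assumes the comment contains a line of exactly the form "IDS: id,id,..." as in the example above; the function name is hypothetical.

```python
import re


def parse_ids_from_comment(text):
    """Extract the job ids from an "IDS: 1,2,3" line of a comment.

    Returns an empty list if the comment has no such line.
    """
    match = re.search(r"^IDS:\s*(\d+(?:\s*,\s*\d+)*)\s*$", text, flags=re.MULTILINE)
    if match is None:
        return []
    return [int(job_id) for job_id in match.group(1).split(",")]
```

The sketch also shows why this approach is fragile: any change to the comment format silently yields an empty list.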


Related issues 4 (0 open, 4 closed)

Copied to QA - action #110176: [spike solution] [timeboxed:10h] Restart hook script in delayed minion job based on exit code size:M (Resolved, kraih, 2022-06-15)
Copied to openQA Project - action #124991: Copy ids of other investigate jobs to retry job (Rejected, okurz, 2023-02-23)
Copied to openQA Project - action #132272: Identify reproducible *TEST* issues (not product issues anymore) using openqa-investigate size:M (Resolved, tinita)
Copied to openQA Project - action #132332: Multiple investigation comments for multimachine tests size:M (Resolved, tinita, 2023-03-23)

Actions #1

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #2

Updated by mkittler over 2 years ago

  • Copied to action #110176: [spike solution] [timeboxed:10h] Restart hook script in delayed minion job based on exit code size:M added
Actions #3

Updated by okurz over 2 years ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #4

Updated by okurz over 2 years ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
Actions #5

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #6

Updated by okurz over 2 years ago

  • Subject changed from Identify reproducible product issues using openqa-investigate to Identify reproducible product issues using openqa-investigate size:M
  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #7

Updated by okurz about 2 years ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

#95783 resolved, work can continue

Actions #8

Updated by okurz about 2 years ago

  • Project changed from QA to openQA Project
Actions #9

Updated by okurz about 2 years ago

  • Category set to Feature requests
Actions #10

Updated by okurz almost 2 years ago

  • Priority changed from Normal to High
Actions #11

Updated by livdywan over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to tinita

We discussed it briefly, and since #98862 looks to involve some changes in the logical dependency between the scripts we decided to consider this blocked (we didn't previously think it had to be).

Assigning Tina to ensure we have someone to track the blocker. Of course anyone else can still pick it up afterwards.

Actions #12

Updated by okurz over 1 year ago

  • Status changed from Blocked to Workable
  • Assignee deleted (tinita)
Actions #13

Updated by livdywan over 1 year ago

  • Description updated (diff)
Actions #14

Updated by tinita over 1 year ago

  • Copied to action #124991: Copy ids of other investigate jobs to retry job added
Actions #15

Updated by tinita over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to tinita

I created a subtask #124991 because we identified this as a task which could be done on its own.

Actions #16

Updated by okurz over 1 year ago

  • Description updated (diff)

Fixed links to other tickets in the description

Actions #17

Updated by okurz over 1 year ago

Next to "Suggestion 1" and "Suggestion 2" I am considering a third alternative. Don't we already delay evaluation in minion jobs when we detect that "other" investigation jobs are not finished yet? Then wouldn't it be the logical next step to continue in the "else"-branch of that evaluation if it turns out that an investigation job is the last one finishing to collect results from all other jobs?

Actions #18

Updated by tinita over 1 year ago

okurz wrote:

Next to "Suggestion 1" and "Suggestion 2" I am considering a third alternative. Don't we already delay evaluation in minion jobs when we detect that "other" investigation jobs are not finished yet? Then wouldn't it be the logical next step to continue in the "else"-branch of that evaluation if it turns out that an investigation job is the last one finishing to collect results from all other jobs?

Yes, and as we already discussed in last week's mob session and in today's unblock, the delay is the easiest part: it's already implemented, we just need to use it.
The TODO is to find out the ids of the other investigation jobs, to check if they are finished and to read their test results. And the majority of people in the mob session seemed to be against parsing the comment of the original job.

edit: oh wait, now reading more carefully:

Don't we already delay evaluation in minion jobs when we detect that "other" investigation jobs are not finished yet?

No. We check chained jobs. That's different from "other investigation jobs".

Actions #19

Updated by okurz over 1 year ago

Sure. I am simply suggesting to not rule out options yet but to explore all in a spike solution task

Actions #20

Updated by tinita over 1 year ago

  • Status changed from Blocked to Workable
  • Assignee deleted (tinita)

See spike solution in #126527

Actions #21

Updated by okurz over 1 year ago

  • Priority changed from Normal to High
Actions #22

Updated by tinita over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #23

Updated by openqa_review over 1 year ago

  • Due date set to 2023-07-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #24

Updated by tinita over 1 year ago

I pushed an update to the draft PR, but still need more tests.

Actions #25

Updated by tinita over 1 year ago

Actions #26

Updated by tinita over 1 year ago

  • Status changed from In Progress to Feedback
Actions #28

Updated by okurz over 1 year ago

  • Copied to action #132272: Identify reproducible *TEST* issues (not product issues anymore) using openqa-investigate size:M added
Actions #29

Updated by tinita over 1 year ago

https://github.com/os-autoinst/scripts/pull/244 Make sure exit code gets propagated to caller (merged)

Actions #30

Updated by tinita over 1 year ago

  • Status changed from Feedback to Resolved
Actions #31

Updated by tinita over 1 year ago

  • Status changed from Resolved to In Progress

There is some weird thing happening on osd: https://openqa.suse.de/tests/11507412#comments

Actions #32

Updated by tinita over 1 year ago

  • Status changed from In Progress to Resolved

This is something that presumably happened before and is not related to my change. Will create a new ticket.

Actions #33

Updated by tinita over 1 year ago

  • Copied to action #132332: Multiple investigation comments for multimachine tests size:M added
Actions #34

Updated by livdywan over 1 year ago

  • Description updated (diff)
Actions #35

Updated by okurz about 1 year ago

  • Due date deleted (2023-07-08)