action #124274

closed

openQA reports non-sporadic issue when retry job just softfailed size:M

Added by clanig about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2023-02-10
Due date:
% Done: 0%
Estimated time:

Description

Motivation

I got the feedback that the retry job failed for osd#10460004.

Investigate retry job: https://openqa.suse.de/t10461879 failed, likely not a sporadic failure

However, the retry job merely softfailed, and the step that previously failed is entirely green:
osd#10461879

Acceptance Criteria

  • AC1: A softfailed retry job is not handled as a failure in this case

Suggestions


Related issues: 1 (0 open, 1 closed)

Related to openQA Project - action #98862: Comment about intermittent/sporadic test issues on original job if openqa-investigate retry job passes size:M (Resolved, tinita, 2021-09-18)

Actions #1

Updated by okurz about 1 year ago

  • Tags set to reactive work
  • Target version set to Ready
Actions #2

Updated by mkittler about 1 year ago

We have job_done_hook = env host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*' grep_timeout=60 nice ionice -c idle /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook configured as the generic hook, which is not tied to a particular job result. I'm wondering where we make sure not to fall into the "Investigate retry job: … failed" assumption for passed/softfailed jobs. Since we have no job_done_hook_enable_… = 1 settings, the hook script actually only runs for failed, incomplete or timeout_exceeded results.

Since the job has _TRIGGER_JOB_DONE_HOOK=1, the generic hook script is triggered for this particular job after all, regardless of its result. We apparently don't do any extra check in openqa-label-known-issues-and-investigate-hook to avoid falling into the "Investigate retry job: … failed" assumption, so this is what happens. We should probably have an extra check there. I'm not sure where the _TRIGGER_JOB_DONE_HOOK=1 job setting comes from and why it was added.
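The trigger logic described in the two paragraphs above can be modelled roughly as follows. This is a hypothetical Python sketch, not openQA's implementation: the function name generic_hook_runs and its parameters are made up for illustration, and the set of default results is taken from the comment above.

```python
# Hypothetical model of when the generic job_done_hook runs, based only on
# the behaviour described in this ticket (not openQA's actual code).
DEFAULT_HOOK_RESULTS = {"failed", "incomplete", "timeout_exceeded"}


def generic_hook_runs(result: str, enabled_results: set[str], job_settings: dict[str, str]) -> bool:
    """Return True if the generic job_done_hook would run for a finished job.

    result          -- the job's final result, e.g. "softfailed"
    enabled_results -- results that have job_done_hook_enable_<result> = 1
    job_settings    -- the job's settings, e.g. {"_TRIGGER_JOB_DONE_HOOK": "1"}
    """
    # A per-job opt-in triggers the hook regardless of the result.
    if job_settings.get("_TRIGGER_JOB_DONE_HOOK") == "1":
        return True
    # Otherwise only the default results plus explicitly enabled ones qualify.
    return result in DEFAULT_HOOK_RESULTS or result in enabled_results


# The situation in this ticket: no job_done_hook_enable_* settings, but the
# investigate:retry job carries _TRIGGER_JOB_DONE_HOOK=1.
print(generic_hook_runs("softfailed", set(), {}))                              # False
print(generic_hook_runs("softfailed", set(), {"_TRIGGER_JOB_DONE_HOOK": "1"}))  # True
```

So a softfailed retry job only reaches the hook script because of the per-job setting, which is why the script itself needs to check the result.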

Actions #3

Updated by tinita about 1 year ago

mkittler wrote:

I'm not sure where the _TRIGGER_JOB_DONE_HOOK=1 job setting comes from and why it was added.

_TRIGGER_JOB_DONE_HOOK=1 was added by me as part of #98862 for investigate:retry jobs.

We need to run the hook script to report when a retry job passes.
For that I also had to enable job_done_hook, and I guess it is now also called for softfailed jobs. Should I configure job_done_hook_passed instead?

Actions #4

Updated by tinita about 1 year ago

  • Related to action #98862: Comment about intermittent/sporadic test issues on original job if openqa-investigate retry job passes size:M added
Actions #5

Updated by mkittler about 1 year ago

  • Assignee set to mkittler

For that I also had to enable job_done_hook, and I guess it is now also called for softfailed jobs. Should I configure job_done_hook_passed instead?

I don't think so. The "if openqa-investigate retry job passes" part in #98862 is likely also supposed to include softfails.

I suppose I will just add a check to skip writing these comments for passed/softfailed jobs.

Actions #6

Updated by mkittler about 1 year ago

  • Assignee deleted (mkittler)

Or maybe let's estimate it first.

Actions #7

Updated by mkittler about 1 year ago

It would be good to estimate this with @okurz to clarify whether we can really treat "softfailed" as "passed" here.

Actions #8

Updated by okurz about 1 year ago

In general, users expect the investigation jobs to tell whether the *same* issue happens again. We assume that if a job fails again, it is likely the same issue, even though that is not true in general. IMHO that assumption is still fine for the purposes of openqa-investigate. Regarding failed vs. softfailed: I assume we only trigger openqa-investigate for failed jobs in the first place, hence we want to know whether the retry jobs fail. So in my understanding, all jobs with an "ok" result should be treated the same.
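A minimal sketch of the classification suggested here, assuming the "ok" results are exactly passed and softfailed. The helper retry_reproduced_failure is hypothetical and only illustrates treating every ok result the same; it is not code from openqa-investigate.

```python
# Assumption: the "ok" results are passed and softfailed.
OK_RESULTS = frozenset({"passed", "softfailed"})


def retry_reproduced_failure(result: str) -> bool:
    """Hypothetical helper: does the retry job's result suggest the issue recurred?"""
    # Every ok result counts the same. Any other result is assumed to be the
    # same issue again, an approximation accepted for openqa-investigate.
    return result not in OK_RESULTS


for result in ("passed", "softfailed", "failed"):
    print(result, retry_reproduced_failure(result))
# passed False
# softfailed False
# failed True
```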

Actions #9

Updated by okurz about 1 year ago

mkittler wrote:

I suppose I will just add a check to skip writing these comments for passed/softfailed jobs.

As discussed in the weekly on 2023-03-10, we do have the feature to write a comment if a retry job passes, so we should ensure that every "ok" result is treated the same. As a softfail effectively means "known issue", the reason for the job failure cannot be the same as the original "new, unreviewed issue".

Actions #10

Updated by livdywan about 1 year ago

  • Subject changed from openQA reports non-sporadic issue when retry job just softfailed to openQA reports non-sporadic issue when retry job just softfailed size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by tinita about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #12

Updated by tinita about 1 year ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/scripts/pull/221 Treat softfailed as passed in openqa-investigate
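The effect of the change can be sketched as a before/after comparison. This is a hypothetical Python illustration, not the actual code of openqa-label-known-issues-and-investigate-hook; the old behaviour (only "passed" counted as success) is inferred from this ticket, and the function names are made up. Returning None means no "failed" comment is written; the separate comment for passed retry jobs from #98862 is not modelled.

```python
from typing import Optional

# Message as quoted in this ticket; the URL is filled in per retry job.
RETRY_FAILED = "Investigate retry job: {url} failed, likely not a sporadic failure"


def failure_comment_old(result: str, url: str) -> Optional[str]:
    # Inferred old behaviour: only "passed" counted as success, so a
    # softfailed retry job fell through to the "failed" message.
    if result == "passed":
        return None
    return RETRY_FAILED.format(url=url)


def failure_comment_fixed(result: str, url: str) -> Optional[str]:
    # After the fix: softfailed is treated like passed, so no "failed"
    # comment is written for either ok result.
    if result in ("passed", "softfailed"):
        return None
    return RETRY_FAILED.format(url=url)


url = "https://openqa.suse.de/t10461879"
print(failure_comment_old("softfailed", url))    # the misleading comment from this ticket
print(failure_comment_fixed("softfailed", url))  # None
```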

Actions #13

Updated by tinita about 1 year ago

  • Status changed from Feedback to Resolved

The PR was merged, so I'm resolving this. @clanig, let us know if you see something unexpected again, thanks.
