action #124274
closed
openQA reports non-sporadic issue when retry job just softfailed size:M
Added by clanig almost 2 years ago.
Updated over 1 year ago.
Category:
Regressions/Crashes
Description
Motivation
I got the feedback that the retry job failed for osd#10460004.
Investigate retry job: https://openqa.suse.de/t10461879 failed, likely not a sporadic failure
However, the retry job only softfailed, and the step that failed previously is entirely green:
osd#10461879
Acceptance Criteria
- AC1: A softfailed retry job is not handled as a failure in this case
Suggestions
- Tags set to reactive work
- Target version set to Ready
We have job_done_hook = env host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*' grep_timeout=60 nice ionice -c idle /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook configured so the hook script runs regardless of the job's result. I'm wondering where we take care not to run into the "Investigate retry job: … failed" assumption for passed/softfailed jobs. Since we have no job_done_hook_enable_… = 1 settings the hook script is actually only running for failed, incomplete or timeout_exceeded results.
Since the job has _TRIGGER_JOB_DONE_HOOK=1, the generic hook script is triggered for this particular job after all (regardless of the result). We apparently don't do any extra checks in openqa-label-known-issues-and-investigate-hook to avoid running into the "Investigate retry job: … failed" assumption, so this is what's happening. Supposedly we should have an extra check there. I'm not sure where the _TRIGGER_JOB_DONE_HOOK=1 job setting comes from and why it was added.
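The behaviour described in this comment can be modelled roughly as follows. This is only an illustrative Python sketch, not openQA or os-autoinst-scripts code: the names `DEFAULT_HOOK_RESULTS`, `hook_is_triggered` and `naive_retry_comment` are hypothetical, and the default result set and the effect of _TRIGGER_JOB_DONE_HOOK=1 are taken from the comment above.

```python
# Hypothetical model of the behaviour described in the comment above;
# this is not actual openQA or os-autoinst-scripts code.

# Results for which the hook runs when no per-result enable setting is
# configured, as stated in the comment.
DEFAULT_HOOK_RESULTS = {"failed", "incomplete", "timeout_exceeded"}


def hook_is_triggered(result: str, job_settings: dict) -> bool:
    """Return True if the job-done hook would run for a job with this result."""
    # A job carrying _TRIGGER_JOB_DONE_HOOK=1 triggers the hook for any result.
    if str(job_settings.get("_TRIGGER_JOB_DONE_HOOK", "0")) == "1":
        return True
    return result in DEFAULT_HOOK_RESULTS


def naive_retry_comment(job_url: str) -> str:
    """Model of the unchecked assumption: a finished retry job has failed."""
    # The result is never inspected, so a softfailed retry job gets this text too.
    return f"Investigate retry job: {job_url} failed, likely not a sporadic failure"
```

In this model a softfailed retry job with _TRIGGER_JOB_DONE_HOOK=1 still reaches `naive_retry_comment`, which matches the misleading comment reported in the description.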
mkittler wrote:
I'm not sure where the _TRIGGER_JOB_DONE_HOOK=1 job setting comes from and why it was added.
_TRIGGER_JOB_DONE_HOOK=1 was added by me as part of #98862 for investigate:retry jobs.
We need to run the hook script in order to report when a retry job passed.
For that I also needed to enable job_done_hook, and I guess this is now also called for softfailed. Should I rather configure job_done_hook_passed instead?
- Related to action #98862: Comment about intermittent/sporadic test issues on original job if openqa-investigate retry job passes size:M added
For that I also needed to enable job_done_hook, and I guess this is now also called for softfailed. Should I rather configure job_done_hook_passed instead?
I don't think so. The "if openqa-investigate retry job passes" part in #98862 is likely also supposed to include softfails.
I suppose I will just add a check to skip writing these comments for passed/softfailed jobs.
- Assignee deleted (mkittler)
Or maybe let's estimate it first.
It would be good to estimate this with @okurz to clarify whether we can really treat "softfailed" as "passed" here.
In general, what users commonly expect is that the investigation jobs tell whether the *same* issue happens again. We make the assumption that if a job fails again then it's likely the same issue, even though that will not generally be true. IMHO that assumption is still fine for the sake of openqa-investigate. Regarding failed vs. softfailed, I assume we only trigger openqa-investigate in the first place for failed jobs, hence we want to know if retry jobs fail. So in my understanding all jobs with an "ok" result should be treated the same.
mkittler wrote:
I suppose I will just add a check to skip writing these comments for passed/softfailed jobs.
In the weekly on 2023-03-10 we clarified that we do have the feature to write a comment if a job passes, so we should ensure that any "ok" result is treated the same. Since a softfail effectively means "known issue", the reason for the job failure cannot be the same as the original "new, unreviewed issue".
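The handling agreed on here could look roughly like the sketch below. It is a minimal Python illustration under stated assumptions, not the change that was merged: the names `OK_RESULTS` and `classify_retry` are hypothetical, the "ok" set contains only the two results named in this thread, and returning no conclusion for other results is an assumption of this sketch.

```python
# Hypothetical sketch of the agreed handling; not the merged change.
from typing import Optional

# Results treated as "ok" in this discussion.
OK_RESULTS = {"passed", "softfailed"}


def classify_retry(result: str) -> Optional[str]:
    """Classify the original failure based on the retry job's result."""
    if result in OK_RESULTS:
        # Any "ok" result is treated the same: the original issue did not recur.
        return "likely sporadic"
    if result == "failed":
        # The retry failed again, so the same issue is assumed.
        return "likely not sporadic"
    # Assumption of this sketch: other results allow no conclusion.
    return None
```

With this check, `classify_retry("softfailed")` and `classify_retry("passed")` give the same answer, which is what AC1 asks for.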
- Subject changed from openQA reports non-sporadic issue when retry job just softfailed to openQA reports non-sporadic issue when retry job just softfailed size:M
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to tinita
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
The PR was merged, resolving. @clanig let us know if you see something unexpected again, thanks