action #124274
closed
openQA reports non-sporadic issue when retry job just softfailed size:M
Added by clanig over 2 years ago.
Updated about 2 years ago.
Category: Regressions/Crashes
Description
Motivation
I got the feedback that the retry job failed for osd#10460004.
Investigate retry job: https://openqa.suse.de/t10461879 failed, likely not a sporadic failure
However, the retry job just softfailed and the affected step that previously failed is entirely green: osd#10461879
Acceptance Criteria
- AC1: softfailed not handled as failure in this case
Suggestions
- Tags set to reactive work
- Target version set to Ready
We have
job_done_hook = env host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*' grep_timeout=60 nice ionice -c idle /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
configured so the hook script runs regardless of the job's result. I'm wondering where we take care not to run into the "Investigate retry job: … failed" assumption for passed/softfailed jobs. Since we have no job_done_hook_enable_… = 1 settings the hook script is actually only running for failed, incomplete or timeout_exceeded results.
Since the job has _TRIGGER_JOB_DONE_HOOK=1, the generic hook script is triggered for this particular job after all (regardless of the result). We apparently don't do any extra checks in openqa-label-known-issues-and-investigate-hook to avoid running into the "Investigate retry job: … failed" assumption, so this is what's happening. Supposedly we should have an extra check there. I'm not sure where the _TRIGGER_JOB_DONE_HOOK=1 job setting comes from and why it was added.
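The trigger behaviour described in this comment can be sketched as a small shell function. This is only an illustrative model of what the comment states, not openQA's actual implementation; the function name hook_would_run and its arguments are hypothetical.

```shell
#!/bin/bash
# Illustrative model only, based on the description in this ticket.
# hook_would_run is a hypothetical name, not part of openQA.
# $1 = job result
# $2 = value of the job's _TRIGGER_JOB_DONE_HOOK setting (may be empty)
hook_would_run() {
    local result=$1 trigger=${2:-}
    # The job setting forces the hook regardless of the result.
    if [ "$trigger" = 1 ]; then
        return 0
    fi
    # Without any job_done_hook_enable_… settings, only these results
    # trigger the generic hook script.
    case "$result" in
        failed|incomplete|timeout_exceeded) return 0 ;;
        *) return 1 ;;
    esac
}

if hook_would_run softfailed 1; then
    echo "hook runs for a softfailed retry job with _TRIGGER_JOB_DONE_HOOK=1"
fi
if ! hook_would_run softfailed; then
    echo "hook is skipped for a softfailed job without the setting"
fi
```

Under this model a softfailed retry job still reaches the hook script, which is why the check on the result has to happen inside the script itself.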
mkittler wrote:
I'm not sure where the _TRIGGER_JOB_DONE_HOOK=1 job setting comes from and why it was added.
_TRIGGER_JOB_DONE_HOOK=1 was added by me as part of #98862 for investigate:retry jobs.
We need to run the hook script in order to report when a retry job passed.
For that I also needed to enable job_done_hook, and I guess this is now also called for softfailed. Should I rather configure job_done_hook_passed instead?
- Related to action #98862: Comment about intermittent/sporadic test issues on original job if openqa-investigate retry job passes size:M added
For that I also needed to enable job_done_hook, and I guess this is now also called for softfailed. Should I rather configure job_done_hook_passed instead?
I don't think so. The "if openqa-investigate retry job passes" part in #98862 is likely also supposed to include softfails.
I suppose I will just add a check to skip writing this comment for passed/softfailed jobs.
- Assignee deleted (mkittler)
Or maybe let's estimate it first.
It would be good to estimate this with @okurz to clarify whether we can really treat "softfailed" as "passed" here.
In general, what users commonly expect is that the investigation jobs tell whether the same issue happens again. We make the assumption that if a job fails again then it is likely the same issue, even though that will not be generally true. IMHO that assumption is still fine for the sake of openqa-investigate. Regarding failed vs. softfailed: I assume we only trigger openqa-investigate in the first place for failed jobs, hence we want to know if retry jobs fail. So in my understanding all jobs with an "ok" result should be treated the same.
mkittler wrote:
I suppose I will just add a check to skip writing this comment for passed/softfailed jobs.
As discussed in the weekly 2023-03-10, we clarified that we do have the feature to write a comment if a job passes, so we should ensure that we treat any "ok" result the same. As a soft-fail effectively means "known issue", the reason for the job failure cannot be the same as the original "new, unreviewed issue".
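The agreed behaviour could look roughly like the following sketch. It is not the code from the merged PR; the function names is_ok_result and retry_verdict are hypothetical, and treating every non-"ok" result as "likely not a sporadic failure" is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from os-autoinst-scripts.

# Returns 0 when a job result counts as "ok" (passed or softfailed).
is_ok_result() {
    case "$1" in
        passed|softfailed) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the verdict for a finished retry job.
# $1 = result of the retry job
# Assumption: any result that is not "ok" is reported as a repeated failure.
retry_verdict() {
    if is_ok_result "$1"; then
        echo "likely a sporadic failure"
    else
        echo "likely not a sporadic failure"
    fi
}

for result in passed softfailed failed; do
    echo "$result: $(retry_verdict "$result")"
done
```

Keeping the "ok" classification in one helper means passed and softfailed cannot drift apart if more call sites are added later.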
- Subject changed from openQA reports non-sporadic issue when retry job just softfailed to openQA reports non-sporadic issue when retry job just softfailed size:M
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to tinita
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
The PR was merged, resolving. @clanig let us know if you see something unexpected again, thanks