action #115178
closed
openqa-investigate: Ensure proper error handling size:M
Description
Observation
In #114451 I found that openqa-investigate can abort the hook script, so the bisect script is never run.
https://openqa.suse.de/tests/9298857
Also the error messages are not really helpful:
% env script_dir=/opt/os-autoinst-scripts enable_force_result=true host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*' grep_timeout=60 nice ionice -c idle /tmp/openqa-label-known-issues-and-investigate-hook 9298857
Skipping investigation of job 9298857: job cluster is already being investigated, see comment on job 9298857
403 Forbidden
404 Not Found
{"error_status":404}
% echo $?
255
Acceptance criteria
- AC1: openqa-investigate usually exits with zero (unless there is a fatal problem)
- AC2: Errors from fetching urls should be handled better so we know which URL resulted in an error
- AC3: Bisect jobs are run even if the investigation failed
- AC4: :investigate: jobs are not bisected again
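For AC2, a minimal sketch of a fetch wrapper that names the failing URL. The function names `do_fetch` and `fetch_url` are hypothetical and not taken from os-autoinst/scripts. The sketch assumes curl is used and relies on its `--fail` option to turn HTTP errors such as 403 or 404 into a non-zero exit code.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not the actual os-autoinst/scripts code.

# Fetch one URL. --fail makes curl exit non-zero on HTTP errors (e.g. 403, 404).
do_fetch() {
    curl --silent --show-error --fail "$1"
}

# Fetch a URL and, on failure, report which URL failed before returning
# the original exit code to the caller.
fetch_url() {
    local url="$1" out rc=0
    out=$(do_fetch "$url" 2>&1) || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "Error fetching '$url' (rc=$rc): $out" >&2
        return "$rc"
    fi
    printf '%s\n' "$out"
}
```

With a wrapper like this, a bare "404 Not Found" in the output would instead carry the URL that caused it.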
Updated by tinita about 2 years ago
- Copied from action #114451: Incidents from all test issues variables are collected during bisect size:M added
Updated by okurz about 2 years ago
- Category changed from Regressions/Crashes to Feature requests
- Target version set to Ready
Updated by okurz about 2 years ago
- Subject changed from openqa-investigate: Improve error handling to openqa-investigate: Ensure proper error handling
- Category changed from Feature requests to Regressions/Crashes
Updated by mkittler about 2 years ago
- Subject changed from openqa-investigate: Ensure proper error handling to openqa-investigate: Ensure proper error handling size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita about 2 years ago
Just to clarify: I looked at the linked test https://openqa.suse.de/tests/9298857#comments and saw that investigate jobs were triggered, but no bisect jobs.
I ran the hook manually with line 18 (echo "$test" | "$script_dir/openqa-investigate") commented out, and then the bisect jobs were triggered.
So I assumed openqa-investigate must have returned a non-zero code so the hook script was aborted. But I don't know what happened because we don't have enough log data in the journal.
And the minion job just has this:
https://openqa.suse.de/minion/jobs?id=5039650
notes:
hook_cmd: env enable_force_result=true from_email=openqa-review@suse.de notification_address=discuss-openqa-auto-r-aaaagmhuypu2hq2kmzgovutmqm@suse.slack.com
host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*'
grep_timeout=60 nice ionice -c idle /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
hook_rc: 255
hook_result: "{\"id\":582257}\n"
To look up a job in the database:
select id, task, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where jsonb_typeof(args->1) = 'number' and cast(args->1 as int) = 9298857;
When I reran the hook script after enabling line 18 again, I got the error messages I pasted in the ticket description, but of course I don't know if these are the same errors that happened in the automatic run.
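The abort described above, where a non-zero exit from openqa-investigate stops the hook before the bisect step, could be avoided along these lines. This is only a sketch under assumptions: it assumes the hook runs under `set -e`, and `investigate`, `bisect` and `run_hook` are hypothetical stand-ins rather than the real scripts.

```shell
#!/bin/bash
set -e

# Hypothetical stand-ins for openqa-investigate and the bisect script.
investigate() {
    echo "404 Not Found" >&2
    return 255
}
bisect() {
    echo "bisect triggered for job $1"
}

run_hook() {
    local job="$1" rc=0
    # Record the exit code instead of letting `set -e` abort the hook here.
    investigate "$job" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "investigate failed for job $job (rc=$rc), continuing with bisect" >&2
    fi
    bisect "$job"
}
```

Calling `run_hook 9298857` logs the investigate failure to stderr and still reaches the bisect step.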
Updated by okurz about 2 years ago
I found other, maybe related problems.
Observation
https://openqa.suse.de/tests/9318205#comments shows a comment "Automatic bisect jobs:" but no investigation jobs. However, the job is being retried automatically due to RETRY, and the final job in the retry chain, https://openqa.suse.de/tests/9319846#comments, shows investigation jobs but no bisect jobs.
Expectation
- E1: For a job that is automatically retried, no investigation and no bisect jobs are triggered
- E2: For an "aggregate tests" job that was created as a clone of other jobs with automatic retry, investigation+bisect jobs are triggered
Updated by tinita about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by tinita about 2 years ago
https://github.com/os-autoinst/scripts/pull/178 - Do not run bisect jobs when job has been retried
Updated by okurz about 2 years ago
tinita wrote:
https://github.com/os-autoinst/scripts/pull/178 - Do not run bisect jobs when job has been retried
merged. That should cover E1 from #115178#note-7
Updated by tinita about 2 years ago
https://github.com/os-autoinst/scripts/pull/179 - Localize return code in openqa-investigate
Fixes AC1
Updated by okurz about 2 years ago
The following looks good:
- https://openqa.suse.de/tests/9330269#comments is a job that was restarted with RETRY and apparently has no investigate and no bisect jobs as expected
- https://openqa.suse.de/tests/9331646#comments shows both investigation as well as bisect jobs
so I see both E1 and E2 from #115178#note-7 covered.
Updated by okurz about 2 years ago
https://openqa.suse.de/tests/9335507#comments might be a regression. The job is sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-ltp-syscalls-ipc:investigate:retry@64bit-virtio-vga so it's an investigate job itself but has bisect jobs when it should not have any.
Updated by tinita about 2 years ago
- Due date changed from 2022-08-16 to 2022-08-23
I don't think that's a regression. Until now, the bisect script had no logic to skip investigate jobs.
But I can work on this.
Bumping due date also due to absence during the next days.
Updated by tinita about 2 years ago
AC4: https://github.com/os-autoinst/scripts/pull/183 - Do not bisect investigate jobs themselves (merged)
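A check of the kind AC4 asks for might look like the following. It is a sketch only, not the code from the pull request: the function name `is_investigate_job` is hypothetical, and it assumes investigate jobs can be recognized by ":investigate:" in the job name, as in the example job name quoted earlier.

```shell
#!/bin/bash
# Hypothetical check; assumes investigate jobs carry ":investigate:" in their name.
is_investigate_job() {
    case "$1" in
        *:investigate:*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A bisect script could call this with the job name and exit early when it returns zero.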
Updated by tinita about 2 years ago
- Due date changed from 2022-08-23 to 2022-08-26
Bumped due date because of absence and additional tasks that were not originally part of the ticket
Updated by tinita about 2 years ago
- Status changed from In Progress to Feedback
AC3: https://github.com/os-autoinst/scripts/pull/186 - Add error-handling for openqa-investigate
Updated by tinita about 2 years ago
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/scripts/pull/186 merged.
I think all ACs are fulfilled.