action #115178: openqa-investigate: Ensure proper error handling size:M - openQA Project - openSUSE Project Management Tool

action #115178

closed

openqa-investigate: Ensure proper error handling size:M

Added by tinita about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee: tinita
Category: Regressions/Crashes
Target version: Ready
Start date: 2022-07-21
Due date: 2022-08-26

Description

Observation

In #114451 I found that openqa-investigate can abort the hook script, so the bisect script is never run:
https://openqa.suse.de/tests/9298857
Also, the error messages are not really helpful:

% env script_dir=/opt/os-autoinst-scripts enable_force_result=true host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*' grep_timeout=60 nice ionice -c idle /tmp/openqa-label-known-issues-and-investigate-hook 9298857
Skipping investigation of job 9298857: job cluster is already being investigated, see comment on job 9298857
403 Forbidden
404 Not Found
{"error_status":404}
% echo $?
255

Acceptance criteria

AC1: openqa-investigate exits with zero unless there is a fatal problem
AC2: Errors from fetching URLs are handled better, so we know which URL resulted in an error
AC3: Bisect jobs are run even if the investigation failed
AC4: :investigate: jobs are not bisected again
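
For AC2, a minimal sketch of what "know which URL resulted in an error" could look like. This is not taken from os-autoinst/scripts: the function name fetch_url and the fetch_cmd variable are hypothetical, and fetch_cmd exists only so that curl can be replaced by a stub.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of os-autoinst/scripts.
# fetch_cmd is an assumption of this sketch: it lets callers swap curl for a stub.
fetch_cmd="${fetch_cmd:-curl}"

# fetch_url URL
# Prints the response body on success. On failure it names the URL and the
# exit code on stderr and returns that exit code.
fetch_url() {
    local url="$1"
    local out
    local rc=0
    # curl flags: -f fail on HTTP errors, -s silent, -S still show error messages
    out=$("$fetch_cmd" -fsS "$url") || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "Error fetching '$url' (exit code $rc)" >&2
        return "$rc"
    fi
    printf '%s\n' "$out"
}
```

With something like this, a bare "404 Not Found" would be followed by a line that names the failing URL, and the caller can still decide whether the failure is fatal.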

Related issues 1 (0 open — 1 closed)

Updated by tinita about 2 years ago

Copied from action #114451: Incidents from all test issues variables are collected during bisect size:M added

Updated by tinita about 2 years ago

Description updated (diff)

Updated by okurz about 2 years ago

Category changed from Regressions/Crashes to Feature requests
Target version set to Ready

Updated by okurz about 2 years ago

Subject changed from openqa-investigate: Improve error handling to openqa-investigate: Ensure proper error handling
Category changed from Feature requests to Regressions/Crashes

Updated by mkittler about 2 years ago

Subject changed from openqa-investigate: Ensure proper error handling to openqa-investigate: Ensure proper error handling size:M
Description updated (diff)
Status changed from New to Workable

Updated by tinita about 2 years ago

Just to clarify: I looked at the linked test https://openqa.suse.de/tests/9298857#comments and saw that investigate jobs were triggered, but no bisect jobs.

I ran the hook manually with line 18 commented out (echo "$test" | "$script_dir/openqa-investigate"), and then the bisect jobs were triggered.
So I assumed that openqa-investigate returned a non-zero exit code, which aborted the hook script. But I don't know what actually happened, because we don't have enough log data in the journal.
The minion job only has this:
https://openqa.suse.de/minion/jobs?id=5039650

notes:
  hook_cmd: env enable_force_result=true from_email=openqa-review@suse.de notification_address=discuss-openqa-auto-r-aaaagmhuypu2hq2kmzgovutmqm@suse.slack.com
    host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*'
    grep_timeout=60 nice ionice -c idle /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: 255
  hook_result: "{\"id\":582257}\n"

To look up a job in the database:

select id, task, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where jsonb_typeof(args->1) = 'number' and cast(args->1 as int) = 9298857;

When I reran the hook script after enabling line 18 again, I got the error messages I pasted in the ticket description, but of course I don't know if these are the same errors that happened in the automatic run.
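
To illustrate the suspected failure mode and AC3, here is a small sketch. It assumes the hook runs under `set -e`, which is a guess and not confirmed above, and the functions investigate, bisect and run_hook are hypothetical stand-ins for the real scripts.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two steps of the hook; the real steps are
# openqa-investigate and the bisect script.
investigate() { echo "404 Not Found" >&2; return 255; }
bisect() { echo "bisect triggered for $1"; }

# run_hook JOB_ID
# Runs in a subshell so that 'set -e' does not leak into the caller.
run_hook() (
    set -e
    job_id="$1"
    rc=0
    # '|| rc=$?' records the failure instead of letting 'set -e' abort here
    investigate "$job_id" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "investigate failed for job $job_id (exit code $rc), continuing" >&2
    fi
    bisect "$job_id"
)

run_hook 9298857
```

Without the `|| rc=$?` part, a shell running with `set -e` would stop at the failing investigate call and never reach bisect, which would match what the comments on the job show.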

Updated by okurz about 2 years ago

I found other, maybe related problems.

Observation

https://openqa.suse.de/tests/9318205#comments shows a comment "Automatic bisect jobs:" but no investigation jobs. However, the job is being retried automatically due to RETRY, and the final job in the retry chain, https://openqa.suse.de/tests/9319846#comments, shows investigation jobs but no bisect jobs.

Expectation

E1: For a job that is automatically retried, no investigation and no bisect jobs are triggered
E2: For an "aggregate tests" job that was created as a clone of other jobs with automatic retry, investigation and bisect jobs are triggered
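
One way E1 could be checked, as a rough sketch only. It assumes that a retried job can be recognized by a non-null clone_id in its job details, and the function was_retried and the example value 1234567 are made up for illustration.

```shell
#!/usr/bin/env bash
# was_retried CLONE_ID
# Assumption of this sketch: CLONE_ID is the clone_id value of a job, which is
# empty or "null" as long as the job has not been cloned or retried.
was_retried() {
    case "$1" in
        ''|null) return 1 ;;
        *) return 0 ;;
    esac
}

# 1234567 is a made-up example value, not a real job id from this ticket
for clone_id in null 1234567; do
    if was_retried "$clone_id"; then
        echo "clone_id=$clone_id: retried, skip investigation and bisect"
    else
        echo "clone_id=$clone_id: not retried, investigation and bisect may run"
    fi
done
```

The check only looks at the job itself, so the last job of a retry chain, which has no clone yet, would still be investigated and bisected as E2 expects.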

Updated by tinita about 2 years ago

Status changed from Workable to In Progress
Assignee set to tinita

Updated by tinita about 2 years ago

https://github.com/os-autoinst/scripts/pull/178 - Do not run bisect jobs when job has been retried

#10

Updated by okurz about 2 years ago

tinita wrote:

https://github.com/os-autoinst/scripts/pull/178 - Do not run bisect jobs when job has been retried

merged. That should cover E1 from #115178#note-7

#11

Updated by tinita about 2 years ago

https://github.com/os-autoinst/scripts/pull/179 - Localize return code in openqa-investigate
Fixes AC1
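
The PR content is not quoted here, so the following only illustrates the general shell pitfall that the title hints at: a return-code variable that is not declared local leaks out of the function that sets it. The function names leaky_step and scoped_step are hypothetical.

```shell
#!/usr/bin/env bash
# Global return code, as a script might use it for its final exit status
rc=0

# Without 'local', the assignment overwrites the global rc
leaky_step() {
    false || rc=$?
    return 0
}

# With 'local', the failure stays inside the function
scoped_step() {
    local rc=0
    false || rc=$?
    return 0
}

scoped_step
echo "after scoped_step: rc=$rc"   # prints rc=0
leaky_step
echo "after leaky_step: rc=$rc"    # prints rc=1
```

If a script ends with an exit based on the global rc, a handled failure inside a helper like leaky_step would still turn into a non-zero exit code.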

#12

Updated by okurz about 2 years ago

The following looks good:

https://openqa.suse.de/tests/9330269#comments is a job that was restarted with RETRY and apparently has no investigate and no bisect jobs as expected
https://openqa.suse.de/tests/9331646#comments shows both investigation as well as bisect jobs

so I see both E1 and E2 from #115178#note-7 covered.

#13

Updated by okurz about 2 years ago

https://openqa.suse.de/tests/9335507#comments might be a regression. The job is sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-ltp-syscalls-ipc:investigate:retry@64bit-virtio-vga so it's an investigate job itself but has bisect jobs when it should not have any.
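
A guard for AC4 could be as small as a match on the job name. This is a sketch under the assumption that investigate jobs can always be recognized by ":investigate:" in their name, as in the job above; the function name is_investigate_job is hypothetical.

```shell
#!/usr/bin/env bash
# is_investigate_job NAME
# Succeeds if NAME contains the ':investigate:' marker.
is_investigate_job() {
    case "$1" in
        *:investigate:*) return 0 ;;
        *) return 1 ;;
    esac
}

name='sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-ltp-syscalls-ipc:investigate:retry@64bit-virtio-vga'
if is_investigate_job "$name"; then
    echo "skipping bisect: this is an investigate job"
fi
```

The bisect step would call such a check before triggering anything, so that investigate jobs never spawn bisect jobs of their own.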

#14