action #115178

openqa-investigate: Ensure proper error handling size:M

Added by tinita about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2022-07-21
Due date:
2022-08-26
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

In #114451 I found that openqa-investigate can abort the hook script, in which case the bisect script is never run.
https://openqa.suse.de/tests/9298857
Also, the error messages are not very helpful:

% env script_dir=/opt/os-autoinst-scripts enable_force_result=true host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*' grep_timeout=60 nice ionice -c idle /tmp/openqa-label-known-issues-and-investigate-hook 9298857
Skipping investigation of job 9298857: job cluster is already being investigated, see comment on job 9298857
403 Forbidden
404 Not Found
{"error_status":404}
% echo $?
255

Acceptance criteria

  • AC1: openqa-investigate usually exits with zero (unless there is a fatal problem)
  • AC2: Errors from fetching URLs are handled better so that we know which URL resulted in an error
  • AC3: Bisect jobs are run even if the investigation failed
  • AC4: :investigate: jobs are not bisected again
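
AC2 could be approached along the following lines. This is a minimal sketch, not the code in os-autoinst/scripts: fetch_url, curl_fetch and the fetcher variable are hypothetical names, and curl is assumed as the HTTP client.

```bash
#!/usr/bin/env bash

# Hypothetical default fetcher; assumes curl is the HTTP client in use.
curl_fetch() {
    curl --silent --show-error --fail --location "$1"
}

# Name of the command used for fetching; overridable, e.g. in tests.
fetcher=${fetcher:-curl_fetch}

# Fetch a URL and print its body. On failure, name the URL on stderr
# and pass the fetcher's exit code on to the caller.
fetch_url() {
    local url=$1 out rc=0
    out=$("$fetcher" "$url") || rc=$?
    if ((rc != 0)); then
        echo "Error fetching '$url' (exit code $rc)" >&2
        return "$rc"
    fi
    printf '%s\n' "$out"
}
```

Because the exit code is captured with `|| rc=$?`, a caller running under `set -e` is not aborted inside the helper and can decide for itself whether a failed fetch is fatal.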

Related issues

Copied from openQA Project - action #114451: Incidents from all test issues variables are collected during bisect size:M (Resolved, 2022-07-21, 2022-09-07)

History

#1 Updated by tinita about 2 months ago

  • Copied from action #114451: Incidents from all test issues variables are collected during bisect size:M added

#2 Updated by tinita about 2 months ago

  • Description updated (diff)

#3 Updated by okurz about 2 months ago

  • Category changed from Concrete Bugs to Feature requests
  • Target version set to Ready

#4 Updated by okurz about 2 months ago

  • Subject changed from openqa-investigate: Improve error handling to openqa-investigate: Ensure proper error handling
  • Category changed from Feature requests to Concrete Bugs

#5 Updated by mkittler about 2 months ago

  • Subject changed from openqa-investigate: Ensure proper error handling to openqa-investigate: Ensure proper error handling size:M
  • Description updated (diff)
  • Status changed from New to Workable

#6 Updated by tinita about 2 months ago

Just to clarify: I looked at the linked test https://openqa.suse.de/tests/9298857#comments and saw that investigate jobs were triggered, but no bisect jobs.

I ran the hook manually but commented out line 18 (echo "$test" | "$script_dir/openqa-investigate"), and then the bisect jobs were triggered.
So I assumed openqa-investigate must have returned a non-zero code so the hook script was aborted. But I don't know what happened because we don't have enough log data in the journal.
And the minion job just has this:
https://openqa.suse.de/minion/jobs?id=5039650

notes:
  hook_cmd: env enable_force_result=true from_email=openqa-review@suse.de notification_address=discuss-openqa-auto-r-aaaagmhuypu2hq2kmzgovutmqm@suse.slack.com
    host=openqa.suse.de exclude_group_regex='.*(Development|Public Cloud|Released|Others|Kernel|Virtualization).*'
    grep_timeout=60 nice ionice -c idle /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook
  hook_rc: 255
  hook_result: "{\"id\":582257}\n"

To look up a job in the database:

select id, task, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished
  from minion_jobs
  where jsonb_typeof(args->1) = 'number' and cast(args->1 as int) = 9298857;

When I reran the hook script after enabling line 18 again, I got the error messages I pasted in the ticket description, but of course I don't know if these are the same errors that happened in the automatic run.
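
The failure mode suspected above, a non-zero exit from one step aborting the rest of the hook, can be avoided by capturing the exit code. A minimal sketch, assuming the hook runs with `set -e`; investigate, bisect and run_hook are hypothetical stand-ins, not the real scripts:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for openqa-investigate: fails like the observed run.
investigate() {
    echo "404 Not Found" >&2
    return 255
}

# Hypothetical stand-in for the bisect script.
bisect() {
    echo "bisect triggered for job $1"
}

run_hook() {
    local job_id=$1 rc=0
    # Capture the exit code so that `set -e` does not abort the hook here.
    investigate "$job_id" || rc=$?
    if ((rc != 0)); then
        echo "investigation of job $job_id failed with exit code $rc" >&2
    fi
    # Bisection runs regardless of the investigation result.
    bisect "$job_id"
}

run_hook 9298857
```

Without the `|| rc=$?`, the failing investigate call would end the script before bisect is reached, which matches the behaviour assumed in this comment.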

#7 Updated by okurz about 2 months ago

I found other, maybe related problems.

Observation

https://openqa.suse.de/tests/9318205#comments shows a comment "Automatic bisect jobs:" but no investigation jobs. However, the job is being retried automatically due to RETRY. And the final job in the retry chain https://openqa.suse.de/tests/9319846#comments shows investigation jobs but no bisect jobs.

Expectation

  • E1: For a job that is automatically retried, no investigation and no bisect jobs are triggered
  • E2: For an "aggregate tests" job that was created as a clone of other jobs with automatic retry, investigation+bisect jobs are triggered

#8 Updated by tinita about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#9 Updated by tinita about 2 months ago

https://github.com/os-autoinst/scripts/pull/178 - Do not run bisect jobs when job has been retried

#10 Updated by okurz about 1 month ago

tinita wrote:

https://github.com/os-autoinst/scripts/pull/178 - Do not run bisect jobs when job has been retried

merged. That should cover E1 from #115178#note-7

#11 Updated by tinita about 1 month ago

https://github.com/os-autoinst/scripts/pull/179 - Localize return code in openqa-investigate
Fixes AC1
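
The pull request is not reproduced here. As general Bash background that may be relevant to its title: a variable assigned inside a function is global unless declared with `local`, so a return-code variable can leak into the caller. The function names below are hypothetical and purely illustrative:

```bash
#!/usr/bin/env bash

# Assigns to the global `rc`, clobbering the caller's value.
leaky() {
    rc=1
}

# `local` confines `rc` to this function.
contained() {
    local rc=1
}

rc=0
leaky
after_leaky=$rc

rc=0
contained
after_contained=$rc

echo "after leaky: rc=$after_leaky, after contained: rc=$after_contained"
# prints "after leaky: rc=1, after contained: rc=0"
```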

#12 Updated by okurz about 1 month ago

The following looks good:

so I see both E1 and E2 from #115178#note-7 covered.

#13 Updated by okurz about 1 month ago

https://openqa.suse.de/tests/9335507#comments might be a regression. The job is sle-12-SP5-JeOS-for-kvm-and-xen-Updates-x86_64-jeos-ltp-syscalls-ipc:investigate:retry@64bit-virtio-vga so it's an investigate job itself but has bisect jobs when it should not have any.

#14 Updated by tinita about 1 month ago

  • Due date changed from 2022-08-16 to 2022-08-23

I think that's not a regression. Until now, the bisect script had no logic to skip investigate jobs.

But I can work on this.

Bumping due date also due to absence during the next days.

#15 Updated by tinita about 1 month ago

  • Description updated (diff)

#16 Updated by tinita about 1 month ago

AC4: https://github.com/os-autoinst/scripts/pull/183 - Do not bisect investigate jobs themselves (merged)
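
A check of this kind could look roughly like the following. It is a sketch only, not the code from the pull request: is_investigate_job and maybe_bisect are hypothetical names, and the job name is assumed to be available as a plain string.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the job name contains ":investigate:".
is_investigate_job() {
    local name=$1
    [[ $name == *:investigate:* ]]
}

# Hypothetical wrapper: decide whether a job should be bisected.
maybe_bisect() {
    local name=$1
    if is_investigate_job "$name"; then
        echo "skip"
    else
        echo "bisect"
    fi
}
```

With the job name from #115178#note-13, which contains ":investigate:retry", maybe_bisect would report "skip".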

#17 Updated by tinita about 1 month ago

  • Due date changed from 2022-08-23 to 2022-08-26

Bumped due date due to absence and additional tasks that were originally not part of the ticket

#18 Updated by tinita about 1 month ago

  • Status changed from In Progress to Feedback

AC3: https://github.com/os-autoinst/scripts/pull/186 - Add error-handling for openqa-investigate

#19 Updated by tinita about 1 month ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/scripts/pull/186 merged.

I think all AC are fulfilled.
