coordination #56999

open

[epic] Run post-fail hook when cancelling cluster jobs?

Added by AdamWill over 4 years ago. Updated about 2 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2019-09-17
Due date:
% Done:
0%

Estimated time:

Description

Motivation

"post_fail_hooks" are a powerful concept in os-autoinst. In a test cluster, however, parent jobs are by default cancelled immediately when a child job aborts, so no post_fail_hook on the parent has a chance to execute.

Acceptance criteria

  • AC1: There is an obvious way to run post_fail_hooks for parent jobs in a cluster when children are about to fail

Suggestions

  • Try to keep the tests "running" by using a barrier in the post_fail_hook of both child and parent jobs, ensuring every job has had the chance to execute its post_fail_hook
  • If the above works well enough, cover this in the documentation; otherwise, accommodate this use case in the openQA logic that aborts parent jobs when children are stopped
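To illustrate the ordering guarantee the barrier suggestion relies on, here is a minimal Python simulation using `threading.Barrier`. It is only a sketch under stated assumptions: threads stand in for the jobs of one parallel cluster, the job names and the `post_fail_hook`/`run_job` functions are hypothetical, and it assumes every job in the cluster actually reaches its post_fail_hook. Real os-autoinst tests are written in Perl and would presumably use the lockapi barrier functions instead. The point shown is that no job can finish (and thereby trigger cancellation of the others) until every job has uploaded its logs.

```python
import threading
import time

# Simulation only: threads stand in for openQA jobs in one parallel cluster.
# This is NOT os-autoinst code; job names are hypothetical.
JOBS = ["server", "client"]

events = []
events_lock = threading.Lock()
# One party per job: no job passes the barrier until every job has arrived.
hook_barrier = threading.Barrier(len(JOBS))


def record(msg):
    with events_lock:
        events.append(msg)


def post_fail_hook(name, delay):
    # Stand-in for uploading logs; the delay makes one job slower than the other.
    time.sleep(delay)
    record(f"{name}: logs uploaded")
    # Keep this job "running" until every other job has uploaded its logs too.
    hook_barrier.wait()


def run_job(name, delay):
    post_fail_hook(name, delay)
    # Only after the barrier does the job finish. Assumption: in the real
    # scheduler, a finished failing job is what lets the others be cancelled.
    record(f"{name}: finished")


threads = [
    threading.Thread(target=run_job, args=(name, 0.1 * i))
    for i, name in enumerate(JOBS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(events)
```

Whatever the relative timing of the jobs, every "logs uploaded" event is recorded before any "finished" event, because `Barrier.wait()` only returns once all parties have called it.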

Further details

Original motivation

So, there's a Fedora update where a FreeIPA client job fails:

https://openqa.fedoraproject.org/tests/452797

Now it'd be great to know why this test is failing! Unfortunately, when it fails, the server job that it runs in parallel with:

https://openqa.fedoraproject.org/tests/452794

just gets cancelled as 'parallel_failed'. Notably, its post_fail_hook is not run, so we don't get any logs from the server end. Since the client test appears to be failing because something went wrong on the server end, we can't debug the problem at all: we have no logs from the server and no good way to get them out of it.

Would it perhaps be good to (possibly optionally, somehow) run the post_fail_hook of a job before cancelling it as parallel_failed?
