action #56999

[epic] Run post-fail hook when cancelling cluster jobs?

Added by AdamWill 5 months ago. Updated 3 months ago.

Status: Workable
Start date: 17/09/2019
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Feature requests
Target version: Ready
Difficulty: hard
Duration:

Description

Motivation

"post_fail_hooks" are a powerful concept in os-autoinst. In a test cluster, parent jobs are by default cancelled immediately when their children abort, so the parent's post_fail_hook never gets a chance to execute.

Acceptance criteria

  • AC1: There is an obvious way to run post_fail_hooks for parent jobs in a cluster when children are about to fail

Suggestions

  • Try to keep the tests "running" by using a barrier in the post_fail_hook of children and parent jobs to ensure every job has had the chance to execute its post_fail_hook
  • If the above works well enough, cover it in the documentation; otherwise accommodate this use case in the openQA logic that aborts parents when children are stopped
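The barrier idea from the first suggestion can be illustrated with a small, self-contained simulation. This is only a sketch in plain Python threads, not os-autoinst test code: the function `run_cluster`, the job names, and the log strings are all hypothetical, and `threading.Barrier` merely stands in for a lock API barrier shared by the cluster jobs. It assumes the parent learns about the child's failure while it is still running, which is exactly the part that does not hold in openQA today.

```python
import threading


def run_cluster(child_fails, timeout=5.0):
    """Simulate one parent and one child job that share a post-fail barrier.

    Hypothetical model: every job that enters its post_fail_hook collects
    its logs and then waits on a barrier, so no job finishes (and triggers
    cancellation of the others) before all hooks have run.
    """
    events = []
    events_lock = threading.Lock()
    failed = threading.Event()    # set when the child fails
    decided = threading.Event()   # set once the child's outcome is known
    hooks_done = threading.Barrier(2)  # one slot per job in the cluster

    def log(message):
        with events_lock:
            events.append(message)

    def post_fail_hook(job):
        log(f"{job}: post_fail_hook collected logs")
        # Stay "running" until every job has collected its logs.
        hooks_done.wait(timeout)

    def child():
        if child_fails:
            failed.set()
            decided.set()
            post_fail_hook("child")
            log("child: result failed")
        else:
            decided.set()
            log("child: result passed")

    def parent():
        # Stands in for waiting on the children to complete.
        decided.wait(timeout)
        if failed.is_set():
            post_fail_hook("parent")
            log("parent: result parallel_failed")
        else:
            log("parent: result passed")

    threads = [threading.Thread(target=parent), threading.Thread(target=child)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    for line in run_cluster(child_fails=True):
        print(line)
```

In the failing case both hooks are recorded before either job reports its final result, which is the ordering AC1 asks for. Whether a real parent job can reach such a barrier before it is cancelled is the open question of this ticket.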

Further details

Original motivation

So, there's a Fedora update where a FreeIPA client job fails:

https://openqa.fedoraproject.org/tests/452797

Now it'd be great to know why this test is failing! Unfortunately, when it fails, the server job that it runs in parallel with:

https://openqa.fedoraproject.org/tests/452794

just gets cancelled as 'parallel_failed'. Notably, its post_fail_hook is not run...so we don't get any logs from the server end. So because the client test appears to be failing because something went wrong on the server end, we just can't debug the problem at all, because we've got no logs from the server, and no very good way to get logs out of the server end.

Would it perhaps be good to (possibly optionally, somehow) run the post_fail_hook of a job before cancelling it as parallel_failed?

History

#1 Updated by AdamWill 5 months ago

  • Category set to Feature requests

#2 Updated by okurz 4 months ago

I don't think it would be good to teach openQA to know what "post_fail_hooks" are, as basically this is an os-autoinst feature that is just displayed differently. However, I think that instead of immediately failing a job in the cluster, which then cancels all parallel jobs, the post_fail_hook of any cluster job can trigger and wait for parallel jobs to execute their post_fail_hooks based on mutexes and barriers, giving them time to collect the necessary data before finally failing.

#3 Updated by AdamWill 4 months ago

note, when I filed this issue I didn't actually realize we're right in the 'cancel cluster jobs' logic I already poked with https://pagure.io/fesco/issue/1858#comment-506261 , but I realized shortly after filing...

anyway, in outline I agree your solution should do the trick, it's just a case of figuring out an implementation. Since I already poked this code once I'll see if I can come up with something, if I can find the time (we're in Fedora 31 crunch right now).

#5 Updated by coolo 4 months ago

  • Subject changed from Run post-fail hook when cancelling cluster jobs? to [epic] Run post-fail hook when cancelling cluster jobs?
  • Target version set to Ready
  • Difficulty set to hard

A tricky one to implement

#6 Updated by mkittler 3 months ago

That could work similarly to how the developer mode hooks into the test execution:

  1. The worker would send a command to os-autoinst's command server to "flag" the job as parallel failed. It would not stop the isotovideo process tree on its own like it does now.
  2. The regular test execution is paused on the next isotovideo query. The problem is that we can not easily run the post fail hook within the autotest process as usual. The blocking nature of the autotest process is the tricky part.
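A toy model of the two steps above, as a single-threaded Python sketch. Everything here is hypothetical: `JobRun`, `flag_parallel_failed`, and the `flag_after` parameter are illustration names, not part of os-autoinst or the openQA worker. The sketch only shows the control flow (flag the job, then react at the next checkpoint) and deliberately sidesteps the blocking autotest process, which is the hard part named in step 2.

```python
class JobRun:
    """Toy stand-in for a running test job (not a real os-autoinst class)."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.parallel_failed = False  # would be set via the command server
        self.log = []

    def flag_parallel_failed(self):
        # Step 1: the worker only flags the job instead of stopping it.
        self.parallel_failed = True

    def post_fail_hook(self):
        self.log.append("post_fail_hook")

    def run(self, flag_after=None):
        """Run all steps; flag_after simulates the worker's flag arriving
        right after the step with that index has finished."""
        for index, step in enumerate(self.steps):
            # Step 2: checkpoint standing in for "the next isotovideo query".
            if self.parallel_failed:
                self.post_fail_hook()
                return "parallel_failed"
            self.log.append(step)
            if flag_after == index:
                self.flag_parallel_failed()
        return "passed"


if __name__ == "__main__":
    job = JobRun(["boot", "install", "verify"])
    print(job.run(flag_after=0), job.log)
```

In this simplified model a flag that arrives after the last step is ignored and the job still passes; a real implementation would have to decide what happens in that window.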

#7 Updated by okurz 3 months ago

@AdamWill would you agree that this can be solved on the level of tests? If yes I would actually reject the ticket for openQA unless we want to provide this hint in the documentation of course.

#8 Updated by AdamWill 3 months ago

Um. Well. Thinking about it again, I'm not sure. Here's okurz's idea again:

"the post_fail_hook of any cluster job can trigger and wait for parallel jobs to execute their post_fail_hooks based on mutexes and barriers"

so, I'm having a little trouble seeing how that's going to work exactly. In the example, we have this flow:

  1. Server test preps server-y stuff
  2. Server test sets a mutex to tell child test to go ahead and do child-y stuff, then does wait_for_children; to wait until child completes
  3. Child test fails, runs its post-fail hook, quits
  4. Server notices child has quit and dies as parallel_failed

So...I'm not sure exactly how I'd change that here. I guess we'd have to not use wait_for_children;, right? We'd have to tell the server to wait for the children some other way. Have a barrier called 'children_complete' and have each child wait on that barrier when it finishes, have the server wait for it with check_dead_job ? But the docs say check_dead_job "will kill all jobs waiting in +barrier_wait+ if one of the cluster jobs dies", which doesn't really sound like what we want.

So...I guess I'm not quite sure how I'd go about doing this purely in tests, is what I'm saying.

#9 Updated by okurz 3 months ago

Well, haven't tried myself but basically what I was thinking: If you want to do something from tests, regardless if on parent or child, you need to keep the tests "running" and technically in a post_fail_hook the test is still running, so just use another barrier as is also used for startup, etc., which one can reach in normal test flow as well as in the post_fail_hook.

#10 Updated by okurz 3 months ago

  • Description updated (diff)
  • Status changed from New to Workable
