[epic] Run post-fail hook when cancelling cluster jobs?
"post_fail_hooks" are a powerful concept for os-autoinst. In a test cluster parent jobs by default are immediately cancelled when children abort so no post_fail_hook on the parent have a chance to execute.
- AC1: There is an obvious way to run post_fail_hooks for parent jobs in a cluster when children are about to fail
- Try to keep the tests "running" by using a barrier in the
post_fail_hookof children and parent jobs to ensure every job had the chance to execute it's
- If the above works well enough, cover this in the documentation; otherwise, accommodate this use case in the logic of openQA that aborts parents when children are stopped
So, there's a Fedora update where a FreeIPA client job fails:
now it'd be great to know why this test is failing! Unfortunately, when it fails, the server job that it runs in parallel with:
just gets cancelled as 'parallel_failed'. Notably, its post_fail_hook is not run, so we don't get any logs from the server end. And since the client test appears to be failing because something went wrong on the server end, we can't debug the problem at all: we have no logs from the server and no good way to get logs out of the server end.
Would it perhaps be good to (possibly optionally, somehow) run the post_fail_hook of a job before cancelling it as parallel_failed?
I don't think it would be good to teach openQA what "post_fail_hooks" are, as this is essentially an os-autoinst feature that is just displayed differently. However, instead of immediately failing a job in the cluster (which then cancels all parallel jobs), the post_fail_hook of any cluster job could trigger and wait for parallel jobs to execute their post_fail_hooks, based on mutexes and barriers, giving them time to collect the necessary data before finally failing.
note, when I filed this issue I didn't actually realize we're right in the 'cancel cluster jobs' logic I already poked with https://pagure.io/fesco/issue/1858#comment-506261 , but I realized shortly after filing...
anyway, in outline I agree your solution should do the trick, it's just a case of figuring out an implementation. Since I already poked this code once I'll see if I can come up with something, if I can find the time (we're in Fedora 31 crunch right now).
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8671 tries to solve the same problem.
That could work similarly to how the developer mode hooks into the test execution:
- The worker would send a command to os-autoinst's command server to "flag" the job as parallel failed. It would not stop the isotovideo process tree on its own like it does now.
- The regular test execution is paused on the next isotovideo query. The problem is that we cannot easily run the post_fail_hook within the autotest process as usual; the blocking nature of the autotest process is the tricky part.
Um. Well. Thinking about it again, I'm not sure. Here's okurz's idea again:
"the post_fail_hook of any cluster job can trigger and wait for parallel jobs to execute their post_fail_hooks based on mutexes and barriers"
so, I'm having a little trouble seeing how that's going to work exactly. In the example, we have this flow:
- Server test preps server-y stuff
- Server test sets a mutex to tell child test to go ahead and do child-y stuff, then calls wait_for_children to wait until the child completes
- Child test fails, runs its post-fail hook, quits
- Server notices child has quit and dies as parallel_failed
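The flow above could be sketched roughly like this on the server side, using the standard lockapi/mmapi helpers (the mutex name 'server_ready' is just illustrative):

```perl
use base 'basetest';
use strict;
use warnings;
use testapi;
use lockapi;
use mmapi;

sub run {
    # ... prep server-y stuff ...

    # signal the child test that it may go ahead and do child-y stuff
    mutex_create('server_ready');

    # block until all children have finished; if a child fails, the
    # server is cancelled as parallel_failed while sitting here, so
    # its post_fail_hook never runs
    wait_for_children;
}

1;
```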
So...I'm not sure exactly how I'd change that here. I guess we'd have to not use wait_for_children, right? We'd have to tell the server to wait for the children some other way. Have a barrier called 'children_complete' and have each child wait on that barrier when it finishes, have the server wait for it with check_dead_job? But the docs say check_dead_job "will kill all jobs waiting in +barrier_wait+ if one of the cluster jobs dies", which doesn't really sound like what we want.
So...I guess I'm not quite sure how I'd go about doing this purely in tests, is what I'm saying.
Well, I haven't tried it myself, but basically what I was thinking: if you want to do something from tests, regardless of whether on parent or child, you need to keep the tests "running", and technically in a post_fail_hook the test is still running. So just use another barrier, as is also used for startup etc., which one can reach in the normal test flow as well as in the post_fail_hook.
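A sketch of that idea, assuming a one-parent/one-child cluster; the barrier name 'post_run_sync' is hypothetical, and every job in the cluster would need the same pattern so the barrier is reached whether a job passes or fails:

```perl
use base 'basetest';
use strict;
use warnings;
use testapi;
use lockapi;

sub run {
    my ($self) = @_;
    # The parent creates the barrier; 2 = parent + one child.
    barrier_create('post_run_sync', 2) if get_var('IS_PARENT');

    # ... regular test flow ...

    # On success the barrier is reached at the end of the normal flow.
    barrier_wait('post_run_sync');
}

sub post_fail_hook {
    my ($self) = @_;
    $self->SUPER::post_fail_hook;    # collect logs first

    # On failure the same barrier is reached here instead, keeping this
    # job "running" long enough for parallel jobs to execute their own
    # post_fail_hooks before anything is cancelled.
    barrier_wait('post_run_sync');
}

1;
```

The point is that both code paths end on the same barrier_wait, so the cluster only tears down after every job has had its chance to collect logs.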