coordination #56999
[epic] Run post-fail hook when cancelling cluster jobs?
Status: open
Description
Motivation
"post_fail_hooks" are a powerful concept for os-autoinst. In a test cluster parent jobs by default are immediately cancelled when children abort so no post_fail_hook on the parent have a chance to execute.
Acceptance criteria
- AC1: There is an obvious way to run post_fail_hooks for parent jobs in a cluster when children are about to fail
Suggestions
- Try to keep the tests "running" by using a barrier in the post_fail_hook of children and parent jobs to ensure every job has the chance to execute its post_fail_hook
- If the above works well enough, cover this in the documentation; otherwise accommodate this use case in the logic of openQA that aborts parents when children are stopped
Further details
Original motivation
So, there's a Fedora update where a FreeIPA client job fails:
https://openqa.fedoraproject.org/tests/452797
now it'd be great to know why this test is failing! Unfortunately, when it fails, the server job that it runs in parallel with:
https://openqa.fedoraproject.org/tests/452794
just gets cancelled as 'parallel_failed'. Notably, its post_fail_hook is not run...so we don't get any logs from the server end. And because the client test appears to be failing due to something going wrong on the server end, we just can't debug the problem at all: we've got no logs from the server, and no good way to get logs out of the server end.
Would it perhaps be good to (possibly optionally, somehow) run the post_fail_hook of a job before cancelling it as parallel_failed?
Updated by okurz about 5 years ago
I don't think it would be good to teach openQA what "post_fail_hooks" are, as this is basically an os-autoinst feature that is just displayed differently. However, I think that instead of immediately failing a job in the cluster, which then cancels all parallel jobs, the post_fail_hook of any cluster job can trigger and wait for parallel jobs to execute their post_fail_hooks based on mutexes and barriers, giving them time to collect the necessary data before finally failing.
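A minimal sketch of that idea, assuming a two-job cluster and a hypothetical barrier POST_FAIL_SYNC created before anything can fail (the barrier name, the job count and the log path are illustrative, not from the ticket):

use lockapi;    # barrier_create, barrier_wait
use testapi;    # upload_logs

# run once before any job can fail, e.g. in the parent's first test module:
barrier_create('POST_FAIL_SYNC', 2);    # 2 = number of jobs in the cluster

# shared post_fail_hook for every job in the cluster:
sub post_fail_hook {
    my ($self) = @_;
    upload_logs('/var/log/messages');    # collect this job's data first
    barrier_wait('POST_FAIL_SYNC');      # then wait until every other job had
                                         # the chance to collect its data too
}

The barrier only releases once every cluster job reaches it, so jobs that do not fail must reach it in their normal flow as well; that wrinkle is exactly what the following comments discuss.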
Updated by AdamWill about 5 years ago
Note: when I filed this issue I didn't actually realize this is right in the 'cancel cluster jobs' logic I already poked at with https://pagure.io/fesco/issue/1858#comment-506261 , but I realized shortly after filing...
Anyway, in outline I agree your solution should do the trick; it's just a case of figuring out an implementation. Since I already poked this code once I'll see if I can come up with something, if I can find the time (we're in Fedora 31 crunch right now).
Updated by okurz about 5 years ago
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8671 tries to solve the same problem.
Updated by coolo about 5 years ago
- Subject changed from Run post-fail hook when cancelling cluster jobs? to [epic] Run post-fail hook when cancelling cluster jobs?
- Target version set to Ready
- Difficulty set to hard
A tricky one to implement
Updated by mkittler about 5 years ago
That could work similarly to how the developer mode hooks into the test execution:
- The worker would send a command to os-autoinst's command server to "flag" the job as parallel_failed. It would not stop the isotovideo process tree on its own like it does now.
- The regular test execution would be paused on the next isotovideo query. The problem is that we cannot easily run the post fail hook within the autotest process as usual; the blocking nature of the autotest process is the tricky part.
Updated by okurz about 5 years ago
@AdamWill would you agree that this can be solved on the level of tests? If yes, I would actually reject the ticket for openQA, unless of course we want to provide this hint in the documentation.
Updated by AdamWill about 5 years ago
Um. Well. Thinking about it again, I'm not sure. Here's okurz's idea again:
"the post_fail_hook of any cluster job can trigger and wait for parallel jobs to execute their post_fail_hooks based on mutexes and barriers"
so, I'm having a little trouble seeing how that's going to work exactly. In the example, we have this flow:
- Server test preps server-y stuff
- Server test sets a mutex to tell the child test to go ahead and do child-y stuff, then does wait_for_children to wait until the child completes
- Child test fails, runs its post-fail hook, quits
- Server notices the child has quit and dies as parallel_failed
So...I'm not sure exactly how I'd change that here. I guess we'd have to not use wait_for_children, right? We'd have to tell the server to wait for the children some other way. Have a barrier called 'children_complete', have each child wait on that barrier when it finishes, and have the server wait for it with check_dead_job? But the docs say check_dead_job "will kill all jobs waiting in +barrier_wait+ if one of the cluster jobs dies", which doesn't really sound like what we want.
So...I guess I'm not quite sure how I'd go about doing this purely in tests, is what I'm saying.
Updated by okurz about 5 years ago
Well, I haven't tried it myself, but basically what I was thinking is: if you want to do something from tests, regardless of whether on parent or child, you need to keep the tests "running", and technically in a post_fail_hook the test is still running. So just use another barrier, as is also used for startup etc., which one can reach in the normal test flow as well as in the post_fail_hook.
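A sketch of that suggestion for the server/parent side, with a hypothetical barrier LOGS_COLLECTED that both the normal flow and the post_fail_hook reach (all names are invented for illustration; the child side would mirror this):

use lockapi;    # barrier_wait
use mmapi;      # wait_for_children
use testapi;    # upload_logs

sub run {
    # ... set up server-y stuff, mutex for the children, run the actual test ...
    barrier_wait('LOGS_COLLECTED');    # the normal flow reaches the barrier too
    wait_for_children;
}

sub post_fail_hook {
    my ($self) = @_;
    upload_logs('/var/log/server.log');    # grab the server-side logs
    barrier_wait('LOGS_COLLECTED');        # same barrier as in run(), so each
                                           # job passes it exactly once on
                                           # whichever path it takes
}

Each job passes the barrier exactly once whether it succeeds or fails; a job that is stopped before ever reaching it would still deadlock the others, which is where check_dead_job or plain timeouts come back into play.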
Updated by okurz about 5 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 4 years ago
- Priority changed from Normal to Low
So, coming back to my previous comment, I still think this can (only) be solved on the level of tests. No question that it is not obvious that this can be done, or how, but I do not see openQA itself – or rather os-autoinst – being able to provide that differently.
Updated by mkittler over 4 years ago
My thoughts about implementing this on test level (see the sketch after this list):
1. Let's say we have two parallel jobs A and B, and A fails. So far A's post fail hook will be executed. After that B is "forcefully" stopped by the worker, which can only be prevented by letting A live longer.
2. So one would wait for a barrier in A's post fail hook to make it live longer.
3. Now it becomes tricky: how would you signal B to stop and execute its own post fail hook? As @AdamWill already mentioned, B might be busy with wait_for_children. In general B might be busy running arbitrary test code. This code would need to be interleaved with checks for A having entered its post fail hook; I am not sure how to do that. We could use barrier_try_wait on the barrier used in A's post fail hook and let the test fail if we can acquire the barrier. But that would let A finish and therefore lead to B being stopped by the worker after all. So we would need A's post fail hook to wait for a 2nd barrier (after waiting for the 1st barrier) to make it live long enough.
4. The 2nd barrier would then be acquired at the end of B's post fail hook. That shouldn't block B, and A would unblock as well. In the end, both post fail hooks would have been executed.
5. Of course this only works in one direction. To make it work in the other direction one would need to do 1. to 4. again, swapping A and B.
6. I guess mutexes could be used as well for just two jobs. However, in practice we likely have more jobs. I am not sure how well this approach would work for more than just 2 jobs considering that it already sounds overwhelmingly complicated for two.
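A rough sketch of steps 1. to 4. in the A-to-B direction, with hypothetical barriers PFH_ENTERED and PFH_DONE (each created for 2 jobs somewhere during setup); the unsolved part from 3. is how to interleave the barrier_try_wait check into B's real test code:

use lockapi;    # barrier_wait, barrier_try_wait
use testapi;    # upload_logs

package job_a_module;    # job A, the one failing first (steps 1., 2., end of 3.)
sub post_fail_hook {
    my ($self) = @_;
    barrier_wait('PFH_ENTERED');    # stay alive; releases once B joins below
    barrier_wait('PFH_DONE');       # keep living until B's hook has finished
}

package job_b_module;    # job B
# A check like this would have to be interleaved into B's long-running test
# code instead of blocking indefinitely in wait_for_children (step 3.):
sub check_parallel_failure {
    die 'parallel job entered its post fail hook'
        if barrier_try_wait('PFH_ENTERED');    # non-blocking; dying here makes
                                               # B execute its own post_fail_hook
}

# job B (step 4.):
sub post_fail_hook {
    my ($self) = @_;
    upload_logs('/var/log/messages');    # collect B's data
    barrier_wait('PFH_DONE');            # releases A and B together
}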
@okurz Maybe I'm overlooking something, but the setup I've come up with doesn't seem as easy as you make it sound. For instance, I'm wondering how you would implement it with just one barrier. And the part "This code would need to be interleaved …" seems very problematic/impractical.
I previously mentioned a way to implement this within os-autoinst/openQA, but the blocking nature of the autotest process is likely the problem there. An alternative would be handling SIGINT/SIGTERM within autotest. The post fail hook could then be executed from the signal handler. The way the worker currently forcefully stops the backend prevents that, but it would be nice to change that anyway (e.g. https://github.com/os-autoinst/openQA/pull/3152).
Updated by okurz over 4 years ago
mkittler wrote:
The post fail hook could […] be executed from the signal handler. The way the worker currently forcefully stops the backend prevents that […]
Running actual test code from the signal handler would be possible but IMHO very risky, as it can lead to a lot of problematic situations where you still want to force the process to stop. Then you need to handle receiving multiple signals and be more strict about it, e.g. on the first TERM still try to execute the post_fail_hook code, but if a subsequent TERM is received, abort even that and stop earlier.
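A sketch of that escalation as plain Perl signal handling; nothing like this exists in os-autoinst today, and run_post_fail_hook is a hypothetical stand-in for invoking the current test module's hook:

use strict;
use warnings;

sub run_post_fail_hook {
    # hypothetical: would call the current test module's post_fail_hook
}

my $term_count = 0;
$SIG{TERM} = sub {
    if (++$term_count == 1) {
        # first TERM: still give the post_fail_hook a chance to collect data
        eval { run_post_fail_hook() };
        exit 1;
    }
    # a second TERM arrives while the hook is still running: abort even that
    exit 1;
};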
Updated by okurz over 4 years ago
mkravec described how the CaaSP tests (now MicroOS/Kubic) are/were achieving this, based on what I suggested, by waiting in post_fail_hooks.
The idea is that in the parent job you wait_for_children in the post_fail_hook (this prevents "cancelled" jobs), like here:
# Controller job is parent. If it fails we need to export deployment logs from child jobs
# Without this post_fail_hook they would stop with parallel_failed result
sub post_fail_hook {
    # Variable to enable failed cluster debug
    sleep if check_var('DEBUG_SLEEP', 'controller');
    # Destroy barriers and create mutexes to avoid deadlock
    barrier_destroy 'NODES_ONLINE';
    barrier_try_wait 'DELAYED_NODES_ONLINE';
    …
    # Wait for log export from all nodes
    wait_for_children;
}
Then in the child jobs you export logs in the post_run_hook and finish:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8750/files#diff-e8ef7e60908d387354fd6385e14abb65L105
sub post_run_hook {
    # Some nodes were removed & powered off during test run
    return if query_isotovideo('backend_is_shutdown');
    export_cluster_logs;
}
I suggest we abstract that and put it into the openQA documentation.
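One possible shape for such an abstraction, as a hypothetical shared base class (all names are invented for illustration; export_cluster_logs stands in for whatever log collection the tests actually need):

package parallel_cluster_base;
use base 'basetest';
use testapi;    # get_var, query_isotovideo, upload_logs
use mmapi;      # wait_for_children

# Parent jobs: keep "running" in the post_fail_hook until all children had
# the chance to export their logs, mirroring the pattern above.
sub post_fail_hook {
    my ($self) = @_;
    wait_for_children if get_var('CLUSTER_PARENT');    # hypothetical setting
}

# Child jobs: export logs regardless of the module's result.
sub post_run_hook {
    return if query_isotovideo('backend_is_shutdown');
    upload_logs('/var/log/messages');    # stand-in for export_cluster_logs
}

1;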
Updated by mkittler over 4 years ago
The code from the CaaSP tests cuts some corners:
- I don't see how the approach implements AC1. It by no means ensures that the parent isn't simply stopped, as it currently is, when a child fails. Therefore the post_fail_hook can not be executed on the parent when children are failing.
- It only works the other way around from what is stated in AC1. So the children will not be immediately stopped when the parent fails. That's basically steps 1. and 2. from my previous comment.
- It completely skips the tricky part mentioned in 3. of my previous comment. That means the children will not be explicitly stopped and would keep running (likely until some timeout occurs).
- It does not even actually run the post_fail_hook of the children (and exports logs in the post_run_hook instead).
Maybe I'm missing something here, but I don't see how the approach fulfills the acceptance criteria; or perhaps the acceptance criteria can be changed.
Updated by szarate about 4 years ago
- Tracker changed from action to coordination
- Status changed from Workable to New
- Difficulty deleted (hard)
Updated by szarate about 4 years ago
For the reason of the tracker change, see: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html