coordination #56999

Updated by okurz over 4 years ago

## Motivation 
 "post_fail_hooks" are a powerful concept for os-autoinst. In a test cluster, parent jobs are by default cancelled immediately when children abort, so no post_fail_hook on the parent has a chance to execute. 

 ## Acceptance criteria 

 * **AC1:** There is an obvious way to run post_fail_hooks for parent jobs in a cluster when children are about to fail 

 ## Suggestions 

 * Try to keep the tests "running" by using a barrier in the `post_fail_hook` of children and parent jobs to ensure every job has had the chance to execute its `post_fail_hook` 
 * If the above works well enough, cover this in documentation; otherwise accommodate this use case in the logic of openQA that aborts parents when children are stopped 
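
 The barrier idea from the first suggestion can be illustrated with a small standalone simulation. This is only a sketch of the concept in plain Python, not openQA or os-autoinst code: the jobs are modelled as threads, and `run_cluster`, `post_fail_hook` and the job names are hypothetical names chosen for illustration. It assumes every job in the cluster enters its `post_fail_hook` once any job fails, which is exactly the behaviour this ticket asks for.

```python
import threading


def run_cluster(job_names, failing_job):
    """Simulate a job cluster whose post_fail_hooks all wait on one barrier."""
    barrier = threading.Barrier(len(job_names))  # one slot per job in the cluster
    failure = threading.Event()                  # set as soon as one job fails
    events = []
    lock = threading.Lock()

    def record(entry):
        with lock:
            events.append(entry)

    def post_fail_hook(name):
        # Collect logs first, then wait until every other job has done the same.
        record(("logs_collected", name))
        barrier.wait(timeout=5)
        record(("finished", name))

    def job(name):
        if name == failing_job:
            failure.set()            # this job fails and notifies the cluster
        else:
            failure.wait(timeout=5)  # the other jobs learn about the failure
        post_fail_hook(name)

    threads = [threading.Thread(target=job, args=(n,)) for n in job_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for event in run_cluster(["server", "client"], failing_job="client"):
        print(event)
```

 In this model no job is recorded as finished until every job has collected its logs, because each job records its logs before reaching the barrier and the barrier only releases once all jobs have arrived.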

 ## Further details 

 ### Original motivation 

 So, there's a Fedora update where a FreeIPA client job fails: 

 https://openqa.fedoraproject.org/tests/452797 

 Now it'd be great to know why this test is failing! Unfortunately, when it fails, the *server* job that it runs in parallel with: 

 https://openqa.fedoraproject.org/tests/452794 

 just gets cancelled as 'parallel_failed'. Notably, its post_fail_hook is not run, so we don't get any logs from the server end. Since the client test appears to be failing because something went wrong on the server end, we can't debug the problem at all: we have no logs from the server and no good way to get them. 

 Would it perhaps be good to (possibly optionally, somehow) run the post_fail_hook of a job before cancelling it as parallel_failed?
