action #112256
opencoordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
Some children of parent job not cancelled (or later, restarted) when parent `parallel_failed` due to another child's parallel job failing
Description
See here:
https://openqa.stg.fedoraproject.org/tests/1851730#dependencies
We have a parent job, server_cockpit_default, with three children: realmd_join_cockpit, server_cockpit_basic, and server_cockpit_updates. One of the children, realmd_join_cockpit, is parallel with two other tests - realmd_join_sssd and server_role_deploy_domain_controller - which do not depend on server_cockpit_default.
It looks like what happened is that, while server_cockpit_default was running, realmd_join_sssd failed. openQA parallel_failed realmd_join_cockpit because it was parallel with realmd_join_sssd, and then also decided to parallel_fail server_cockpit_default, I guess because it's the parent of realmd_join_cockpit. However, the other children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - were not cancelled. They were run, and failed immediately because they could not find the expected disk image that server_cockpit_default would have uploaded if it were not cancelled.
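To make the sequence above easier to follow, here is a toy model of the job graph in Python. It is not openQA code: the dictionary layout, the function names, and the propagation rule (every job parallel with the failed one is stopped, plus the parent of any stopped job) are assumptions chosen to reproduce what was observed.

```python
# Toy model of the job graph from this report; NOT openQA code.
# Parent -> children that need the disk image the parent uploads.
CHAINED_CHILDREN = {
    "server_cockpit_default": [
        "realmd_join_cockpit",
        "server_cockpit_basic",
        "server_cockpit_updates",
    ],
}

# Jobs that are parallel with each other.
PARALLEL_CLUSTER = {
    "realmd_join_sssd",
    "realmd_join_cockpit",
    "server_role_deploy_domain_controller",
}


def observed_parallel_failed(failed_job):
    """Jobs stopped as a consequence of failed_job (assumed rule, see lead-in)."""
    if failed_job not in PARALLEL_CLUSTER:
        return set()
    # Assumption: every other job in the parallel cluster is stopped.
    affected = PARALLEL_CLUSTER - {failed_job}
    # The parent of any stopped child is stopped as well.
    for parent, children in CHAINED_CHILDREN.items():
        if affected & set(children):
            affected.add(parent)
    return affected


def stranded_children(failed_job):
    """Children whose parent was stopped, but which were not stopped themselves."""
    affected = observed_parallel_failed(failed_job)
    stranded = set()
    for parent, children in CHAINED_CHILDREN.items():
        if parent in affected:
            stranded |= set(children) - affected
    return stranded


if __name__ == "__main__":
    print(sorted(stranded_children("realmd_join_sssd")))
```

In this model, a failure of realmd_join_sssd leaves server_cockpit_basic and server_cockpit_updates stranded: their parent never uploads the image, yet nothing stops them from being scheduled.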
We have a plugin downstream - https://src.fedoraproject.org/rpms/openqa/blob/rawhide/f/FedoraUpdateRestart.pm - which restarts update tests on first failure, by calling `$job->auto_duplicate;` (I intend to move to the newish built-in implementation of this, the `RETRY` variable, but haven't yet). In this case, that plugin kicked in and restarted job 1851739, the realmd_join_sssd job that failed. It seems that restarting that job caused its parallel siblings - including realmd_join_cockpit - to be restarted, and that also caused realmd_join_cockpit's parent, server_cockpit_default, to be restarted. However, server_cockpit_default's other two children were not restarted.
So we wind up with a situation where server_cockpit_default and realmd_join_cockpit both passed (on the restart), but the other two children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - are incomplete, and restarting the current instance of server_cockpit_default does not restart them, because they are not children of it. I also cannot restart the children directly, because openQA knows the disk image they need is missing.
The only way to get all the tests run, I think, would be to re-trigger the tests entirely. I can do that, but it seems like something should be improved here, though I'm not sure what. Don't parallel_fail the parent of a job that's being parallel_failed if it has other children?
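As a rough sketch of the options, the snippet below compares two possible policies on the same toy graph: sparing the parent when it has other children (the idea floated above), or stopping the parent together with all of its remaining children. The policy names `spare_parent` and `stop_siblings`, the data layout, and the function are hypothetical; none of this is taken from openQA.

```python
# Toy comparison of two candidate policies; NOT openQA code.
CHAINED_CHILDREN = {
    "server_cockpit_default": [
        "realmd_join_cockpit",
        "server_cockpit_basic",
        "server_cockpit_updates",
    ],
}
PARALLEL_CLUSTER = {
    "realmd_join_sssd",
    "realmd_join_cockpit",
    "server_role_deploy_domain_controller",
}


def jobs_to_stop(failed_job, policy):
    """Return the jobs to stop after failed_job fails, under a hypothetical policy."""
    if policy not in ("spare_parent", "stop_siblings"):
        raise ValueError(f"unknown policy: {policy}")
    # Assumption: every other job in the parallel cluster is stopped.
    cluster_hit = PARALLEL_CLUSTER - {failed_job}
    stopped = set(cluster_hit)
    for parent, children in CHAINED_CHILDREN.items():
        children = set(children)
        hit = cluster_hit & children
        if not hit:
            continue  # this parent is unrelated to the failure
        others = children - hit
        if policy == "spare_parent":
            # Only stop the parent if no other child still needs it.
            if not others:
                stopped.add(parent)
        else:  # stop_siblings
            # Stop the parent and every child that depends on it.
            stopped.add(parent)
            stopped |= others
    return stopped


if __name__ == "__main__":
    for policy in ("spare_parent", "stop_siblings"):
        print(policy, sorted(jobs_to_stop("realmd_join_sssd", policy)))
```

Either policy avoids the inconsistent state: with `spare_parent` the parent keeps running and the other two children can still find their image, and with `stop_siblings` the whole family is stopped together, so it could presumably be restarted as a unit.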
Updated by AdamWill over 2 years ago
Note, this is with a git snapshot from about a week ago. I don't recall this happening with older builds. I suspect this got changed by the changes @mkittler made around this area in April, e.g. 255d69a7a626af589f62853fdea83decdfff96ee.
Updated by okurz over 2 years ago
- Priority changed from Normal to Low
- Target version set to future
- Parent task set to #103962
yes, I expect changes from this year could influence the behaviour
Updated by mkittler over 2 years ago
- Related to coordination #110458: [epic] Improve `RETRY=…`-behavior for jobs with dependencies added
Updated by mkittler over 2 years ago
Unfortunately the `RETRY` feature also doesn't handle things sensibly when certain dependencies are involved, see #110458. Likely both tickets are about a similar issue.
Updated by AdamWill over 2 years ago
Yeah, I've noticed the same since switching to `RETRY`. I assume they wind up going down the same codepath (either `RETRY` uses `$job->auto_duplicate` or does effectively the same thing).
Updated by okurz over 2 years ago
- Parent task changed from #103962 to #112862
Move future ideas to the actual "Future ideas" tracker #112862