action #112256
opencoordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
Some children of parent job not cancelled (or later, restarted) when parent `parallel_failed` due to another child's parallel job failing
Description
See here:
https://openqa.stg.fedoraproject.org/tests/1851730#dependencies
We have a parent job, server_cockpit_default, with three children: realmd_join_cockpit, server_cockpit_basic, and server_cockpit_updates. One of the children, realmd_join_cockpit, runs in parallel with two other tests - realmd_join_sssd and server_role_deploy_domain_controller - which do not depend on server_cockpit_default.
It looks like what happened is that, while server_cockpit_default was running, realmd_join_sssd failed. openQA marked realmd_join_cockpit as parallel_failed because it was parallel with realmd_join_sssd, and then also decided to mark server_cockpit_default as parallel_failed, I guess because it's the parent of realmd_join_cockpit. However, the other children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - were not cancelled. They ran, and failed immediately because they could not find the disk image that server_cockpit_default would have uploaded if it had not been cancelled.
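To make the sequence above easier to follow, here is a toy model of the job graph in Python. It is not openQA code: the dictionary names and the `observed_propagation` and `stranded_children` helpers are made up for illustration, and treating all three parallel jobs as one cluster that is stopped together is an assumption (the ticket only confirms this for realmd_join_cockpit).

```python
# Toy model of the job graph from this ticket; not openQA code.

# child -> chained parent whose uploaded disk image the child needs
CHAINED_PARENT = {
    "realmd_join_cockpit": "server_cockpit_default",
    "server_cockpit_basic": "server_cockpit_default",
    "server_cockpit_updates": "server_cockpit_default",
}

# Jobs that run in parallel with each other (assumed to form one cluster).
PARALLEL_CLUSTER = {
    "realmd_join_cockpit",
    "realmd_join_sssd",
    "server_role_deploy_domain_controller",
}


def observed_propagation(failed_job):
    """Return the jobs marked parallel_failed, following the behaviour described above."""
    stopped = set()
    if failed_job in PARALLEL_CLUSTER:
        # Every other member of the parallel cluster is stopped.
        stopped = PARALLEL_CLUSTER - {failed_job}
    # The chained parent of any stopped job is stopped as well.
    parents = {CHAINED_PARENT[job] for job in stopped if job in CHAINED_PARENT}
    return stopped | parents


def stranded_children(stopped):
    """Children whose parent was stopped but which were not stopped themselves."""
    return sorted(
        child
        for child, parent in CHAINED_PARENT.items()
        if parent in stopped and child not in stopped
    )


stopped = observed_propagation("realmd_join_sssd")
stranded = stranded_children(stopped)
print(stranded)  # ['server_cockpit_basic', 'server_cockpit_updates']
```

The two stranded jobs are exactly the ones that later ran and failed because the disk image was missing.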
We have a plugin downstream - https://src.fedoraproject.org/rpms/openqa/blob/rawhide/f/FedoraUpdateRestart.pm - which restarts update tests on first failure, by calling $job->auto_duplicate; (I intend to move to the newish built-in implementation of this, the RETRY variable, but haven't yet). In this case, that plugin kicked in and restarted job 1851739, the realmd_join_sssd job that failed. It seems that restarting that job caused its parallel siblings - including realmd_join_cockpit - to be restarted, and that also caused realmd_join_cockpit's parent, server_cockpit_default, to be restarted. However, server_cockpit_default's other two children were not restarted.
So we wind up with a situation where server_cockpit_default and realmd_join_cockpit both passed (on the restart), but the other two children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - are incomplete, and restarting the current instance of server_cockpit_default does not restart them, because they are not children of it. I also cannot restart the children directly, because openQA knows the disk image they need is missing.
The only way to get all the tests run, I think, would be to re-trigger the tests entirely. I can do that, but it seems like something should be improved here, though I'm not sure what. Maybe openQA should not parallel_fail the parent of a job that's being parallel_failed if that parent has other children?
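As a rough illustration of that suggestion, the sketch below compares the observed behaviour with the proposed rule on the same job names. It is a toy model only, not openQA code: the `CHILDREN` table, the `parents_to_stop` function and the `spare_shared_parents` flag are all hypothetical names made up for this example.

```python
# Toy comparison of two propagation policies; not openQA code.

# chained parent -> its children
CHILDREN = {
    "server_cockpit_default": [
        "realmd_join_cockpit",
        "server_cockpit_basic",
        "server_cockpit_updates",
    ],
}


def parents_to_stop(stopped_child, spare_shared_parents):
    """Return the parents that would be parallel_failed along with stopped_child.

    With spare_shared_parents=True, a parent that has other children is left
    running so those children can still use the image it uploads.
    """
    result = []
    for parent, children in CHILDREN.items():
        if stopped_child not in children:
            continue
        has_other_children = len(children) > 1
        if spare_shared_parents and has_other_children:
            continue
        result.append(parent)
    return result


observed = parents_to_stop("realmd_join_cockpit", spare_shared_parents=False)
proposed = parents_to_stop("realmd_join_cockpit", spare_shared_parents=True)
print(observed)  # ['server_cockpit_default']
print(proposed)  # []
```

Under the proposed rule server_cockpit_default would keep running, so server_cockpit_basic and server_cockpit_updates would still find their disk image. Whether this is the right fix, as opposed to also cancelling and restarting the other children, is an open question.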