action #112256

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

Some children of parent job not cancelled (or later, restarted) when parent `parallel_failed` due to another child's parallel job failing

Added by AdamWill over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2022-06-09
Due date:
% Done: 0%
Estimated time:
Description

See here:

https://openqa.stg.fedoraproject.org/tests/1851730#dependencies

We have a parent job, server_cockpit_default, with three children: realmd_join_cockpit, server_cockpit_basic, and server_cockpit_updates. One of the children, realmd_join_cockpit, runs in parallel with two other tests - realmd_join_sssd and server_role_deploy_domain_controller - which do not depend on server_cockpit_default.

It looks like what happened is that, while server_cockpit_default was running, realmd_join_sssd failed. openQA parallel_failed realmd_join_cockpit because it was parallel with realmd_join_sssd, and then also decided to parallel_fail server_cockpit_default, I guess because it's the parent of realmd_join_cockpit. However, the other children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - were not cancelled. They were run, and failed immediately because they could not find the expected disk image that server_cockpit_default would have uploaded if it were not cancelled.
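To make the relationships easier to follow, here is a toy model of the jobs named above. It is only a sketch of the behaviour as I described it, not openQA code: the names `CHAINED_CHILDREN`, `PARALLEL_GROUP`, `observed_stop_set` and `orphaned_children` are made up for illustration, and it assumes every parallel sibling of the failed job gets stopped, which I have not verified for server_role_deploy_domain_controller.

```python
# Toy model of the jobs named in this report. This is NOT openQA code;
# every name defined here is hypothetical and exists only for illustration.

# Chained dependency: the children need the disk image the parent uploads.
CHAINED_CHILDREN = {
    "server_cockpit_default": [
        "realmd_join_cockpit",
        "server_cockpit_basic",
        "server_cockpit_updates",
    ],
}

# Jobs that run in parallel with each other.
PARALLEL_GROUP = {
    "realmd_join_cockpit",
    "realmd_join_sssd",
    "server_role_deploy_domain_controller",
}


def chained_parents(job):
    """Return the set of chained parents of a job."""
    return {p for p, kids in CHAINED_CHILDREN.items() if job in kids}


def observed_stop_set(failed_job):
    """Jobs stopped as a side effect of failed_job, per the report."""
    stopped = set()
    if failed_job in PARALLEL_GROUP:
        # Assumption: every parallel sibling of the failed job is stopped.
        stopped = PARALLEL_GROUP - {failed_job}
    # Reported behaviour: the chained parent of a stopped job is stopped too.
    for job in sorted(stopped):
        stopped = stopped | chained_parents(job)
    return stopped


def orphaned_children(stopped):
    """Chained children left scheduled although their parent was stopped."""
    orphans = set()
    for parent, kids in CHAINED_CHILDREN.items():
        if parent in stopped:
            orphans |= {k for k in kids if k not in stopped}
    return orphans


stopped = observed_stop_set("realmd_join_sssd")
print(sorted(orphaned_children(stopped)))
# prints ['server_cockpit_basic', 'server_cockpit_updates']
```

The two jobs printed are the ones that ran anyway and failed immediately for lack of the disk image.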

We have a plugin downstream - https://src.fedoraproject.org/rpms/openqa/blob/rawhide/f/FedoraUpdateRestart.pm - which restarts update tests on first failure by calling `$job->auto_duplicate;` (I intend to move to the newish built-in implementation of this, the RETRY variable, but haven't yet). In this case, that plugin kicked in and restarted job 1851739, the realmd_join_sssd job that failed. It seems that restarting that job caused its parallel siblings - including realmd_join_cockpit - to be restarted, and that in turn caused realmd_join_cockpit's parent, server_cockpit_default, to be restarted. However, server_cockpit_default's other two children were not restarted.

So we wind up with a situation where server_cockpit_default and realmd_join_cockpit both passed (on the restart), but the other two children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - are incomplete, and restarting the current instance of server_cockpit_default does not restart them, because they are not children of it. I also cannot restart the children directly, because openQA knows the disk image they need is missing.

The only way to get all the tests run, I think, would be to re-trigger the tests entirely. I can do that, but it seems like something should be improved here, though I'm not sure what. Maybe openQA should not parallel_fail the parent of a job that's being parallel_failed if that parent has other children?
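As a rough illustration of that question, the sketch below compares two hypothetical policies on the same jobs: `spare_parent` leaves a chained parent alone when it still has children that are not being stopped, while `stop_children` stops the parent and all of its remaining children together. Neither is how openQA is implemented; the function `plan_stop_set`, the policy names and the data structures are invented for this example, and the model only looks one level up the chain.

```python
# Hypothetical policy comparison for this report. NOT openQA code; the
# function, policy names and data structures are invented for illustration.

CHAINED_CHILDREN = {
    "server_cockpit_default": [
        "realmd_join_cockpit",
        "server_cockpit_basic",
        "server_cockpit_updates",
    ],
}

PARALLEL_GROUP = {
    "realmd_join_cockpit",
    "realmd_join_sssd",
    "server_role_deploy_domain_controller",
}


def plan_stop_set(failed_job, policy):
    """Return the jobs to stop when failed_job fails, under a given policy."""
    stopped = set()
    if failed_job in PARALLEL_GROUP:
        # Assumption: every parallel sibling of the failed job is stopped.
        stopped = PARALLEL_GROUP - {failed_job}

    # Single pass over chained parents (one level only, enough for this case).
    for parent, kids in CHAINED_CHILDREN.items():
        if not stopped & set(kids):
            continue  # none of this parent's children is affected
        other_kids = set(kids) - stopped
        if policy == "spare_parent":
            # Only stop the parent if no unaffected child still needs it.
            if not other_kids:
                stopped.add(parent)
        elif policy == "stop_children":
            # Stop the parent and everything that depends on it.
            stopped.add(parent)
            stopped |= other_kids
        else:
            raise ValueError(f"unknown policy: {policy}")
    return stopped


print(sorted(plan_stop_set("realmd_join_sssd", "spare_parent")))
print(sorted(plan_stop_set("realmd_join_sssd", "stop_children")))
```

With `spare_parent`, server_cockpit_default keeps running and can still upload the disk image for server_cockpit_basic and server_cockpit_updates. With `stop_children`, all of its children are stopped together, so a later restart would at least have a consistent set of jobs to work with.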


Related issues (1 open, 0 closed)

Related to openQA Project (public) - coordination #110458: [epic] Improve `RETRY=…`-behavior for jobs with dependencies (New, 2022-04-29)
