action #112256

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

Some children of parent job not cancelled (or later, restarted) when parent `parallel_failed` due to another child's parallel job failing

Added by AdamWill almost 2 years ago. Updated over 1 year ago.

Status: New
Priority: Low
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2022-06-09
Due date:
% Done: 0%
Estimated time:
Description

See here:

https://openqa.stg.fedoraproject.org/tests/1851730#dependencies

We have a parent job, server_cockpit_default, with three children: realmd_join_cockpit, server_cockpit_basic, and server_cockpit_updates. One of the children, realmd_join_cockpit, is parallel with two other tests, realmd_join_sssd and server_role_deploy_domain_controller, neither of which depends on server_cockpit_default.

It looks like what happened is that, while server_cockpit_default was running, realmd_join_sssd failed. openQA parallel_failed realmd_join_cockpit because it was parallel with realmd_join_sssd, and then also decided to parallel_fail server_cockpit_default, I guess because it's the parent of realmd_join_cockpit. However, the other children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - were not cancelled. They were run, and failed immediately because they could not find the expected disk image that server_cockpit_default would have uploaded if it were not cancelled.
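To make the sequence above concrete, here is a toy model of the dependency graph. It is only a sketch of the behaviour as observed, not openQA's scheduler code; the dictionary names and both functions are made up for illustration. It tracks only the jobs the report says were stopped; what happened to server_role_deploy_domain_controller is not stated, so that job is left out.

```python
# Toy model of the jobs in this report; NOT openQA code.
# child -> chained parent (the child needs the parent's uploaded disk image)
CHAINED_PARENT = {
    "realmd_join_cockpit": "server_cockpit_default",
    "server_cockpit_basic": "server_cockpit_default",
    "server_cockpit_updates": "server_cockpit_default",
}

# failed job -> parallel siblings the report says were parallel_failed
PARALLEL_SIBLINGS = {
    "realmd_join_sssd": ["realmd_join_cockpit"],
}


def observed_stopped_jobs(failed_job):
    """Jobs stopped as a consequence of failed_job (assumed rule, per the report)."""
    stopped = set(PARALLEL_SIBLINGS.get(failed_job, []))
    # The chained parent of each stopped sibling was stopped as well.
    for job in list(stopped):
        parent = CHAINED_PARENT.get(job)
        if parent is not None:
            stopped.add(parent)
    return stopped


def orphaned_children(stopped):
    """Chained children whose parent was stopped but which kept running."""
    return {
        child
        for child, parent in CHAINED_PARENT.items()
        if parent in stopped and child not in stopped
    }


stopped = observed_stopped_jobs("realmd_join_sssd")
orphans = orphaned_children(stopped)
print(sorted(stopped))
print(sorted(orphans))
```

The second set holds the two jobs that ran anyway and failed for lack of the disk image.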

We have a plugin downstream - https://src.fedoraproject.org/rpms/openqa/blob/rawhide/f/FedoraUpdateRestart.pm - which restarts update tests on first failure, by calling $job->auto_duplicate; (I intend to move to the newish built-in implementation of this, the RETRY variable, but haven't yet). In this case, that plugin kicked in and restarted job 1851739, the realmd_join_sssd job that failed. It seems that restarting that job caused its parallel siblings - including realmd_join_cockpit - to be restarted, and that also caused realmd_join_cockpit's parent, server_cockpit_default, to be restarted. However, server_cockpit_default's other two children were not restarted.

So we wind up with a situation where server_cockpit_default and realmd_join_cockpit both passed (on the restart), but the other two children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - are incomplete, and restarting the current instance of server_cockpit_default does not restart them, because they are not children of it. I also cannot restart the children directly, because openQA knows the disk image they need is missing.
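The resulting state can be sketched the same way. This is a hypothetical model of the restart, assuming each restarted job becomes a new job and that dependencies are only re-created between jobs that were restarted together. Job names stand in for job IDs, and the "@restart" suffix and function names are invented for the example.

```python
# Toy model of the restart described above; NOT openQA code.
old_parent_of = {
    "realmd_join_cockpit": "server_cockpit_default",
    "server_cockpit_basic": "server_cockpit_default",
    "server_cockpit_updates": "server_cockpit_default",
}

# Jobs the report says were restarted after realmd_join_sssd failed.
restarted = {"realmd_join_sssd", "realmd_join_cockpit", "server_cockpit_default"}


def clone_name(job):
    """Hypothetical label for the new job created by a restart."""
    return job + "@restart"


def clone_dependencies(parent_of, restarted_jobs):
    """Re-create chained dependencies, cloning only the restarted jobs."""
    new_parent_of = {}
    for child, parent in parent_of.items():
        if child in restarted_jobs:
            new_parent = clone_name(parent) if parent in restarted_jobs else parent
            new_parent_of[clone_name(child)] = new_parent
        else:
            # Jobs that were not restarted still point at the old parent.
            new_parent_of[child] = parent
    return new_parent_of


def children_of(parent, parent_of):
    return sorted(c for c, p in parent_of.items() if p == parent)


new_parent_of = clone_dependencies(old_parent_of, restarted)
new_children = children_of("server_cockpit_default@restart", new_parent_of)
old_children = children_of("server_cockpit_default", new_parent_of)
print(new_children)
print(old_children)
```

In this model the new server_cockpit_default has a single child, so restarting it can never reach server_cockpit_basic or server_cockpit_updates.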

The only way to get all the tests run, I think, would be to re-trigger the tests entirely. I can do that, but it seems like something should be improved here, though I'm not sure what. Don't parallel_fail the parent of a job that's being parallel_failed if it has other children?
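As a rough illustration of that suggestion (one possible rule, not a proposed patch), the check could look like the sketch below. The function names are hypothetical and the graph is the same toy model of this report's jobs.

```python
# Sketch of the suggested rule; NOT openQA code.
CHAINED_PARENT = {
    "realmd_join_cockpit": "server_cockpit_default",
    "server_cockpit_basic": "server_cockpit_default",
    "server_cockpit_updates": "server_cockpit_default",
}


def chained_children(parent):
    return {c for c, p in CHAINED_PARENT.items() if p == parent}


def may_stop_parent(parent, children_being_stopped):
    """Stop the parent only if none of its other children still need it."""
    return chained_children(parent) <= set(children_being_stopped)


# Only realmd_join_cockpit is being parallel_failed: keep the parent running.
only_one = may_stop_parent("server_cockpit_default", {"realmd_join_cockpit"})

# All three children are being stopped: the parent has no remaining consumers.
all_three = may_stop_parent(
    "server_cockpit_default",
    {"realmd_join_cockpit", "server_cockpit_basic", "server_cockpit_updates"},
)
print(only_one, all_three)
```

The opposite policy, stopping every chained child whenever the parent is stopped, would also avoid the inconsistent state; which of the two is preferable is left open here.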


Related issues 1 (1 open, 0 closed)

Related to openQA Project - coordination #110458: [epic] Improve `RETRY=…`-behavior for jobs with dependencies (New, 2022-04-29)

Actions #1

Updated by AdamWill almost 2 years ago

Note, this is with a git snapshot from about a week ago. I don't recall this happening with older builds. I suspect the behaviour changed with the work @mkittler did in this area in April, e.g. 255d69a7a626af589f62853fdea83decdfff96ee.

Actions #2

Updated by okurz almost 2 years ago

  • Priority changed from Normal to Low
  • Target version set to future
  • Parent task set to #103962

Yes, I expect that changes made this year could have influenced the behaviour.

Actions #3

Updated by mkittler almost 2 years ago

  • Related to coordination #110458: [epic] Improve `RETRY=…`-behavior for jobs with dependencies added
Actions #4

Updated by mkittler almost 2 years ago

Unfortunately the RETRY feature also doesn't handle things sensibly when certain dependencies are involved, see #110458. Likely both tickets are about a similar issue.

Actions #5

Updated by AdamWill almost 2 years ago

Yeah, I've noticed the same since switching to RETRY. I assume they wind up going down the same codepath (either RETRY uses $job->auto_duplicate or does effectively the same thing).

Actions #6

Updated by okurz over 1 year ago

  • Parent task changed from #103962 to #112862

Move future ideas to the actual "Future ideas" tracker #112862
