action #112256: Some children of parent job not cancelled (or later, restarted) when parent `parallel_failed` due to another child's parallel job failing - openQA Project (public) - openSUSE Project Management Tool

open

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

Added by AdamWill almost 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Low
Category: Regressions/Crashes
Target version: QA (public) - future
Start date: 2022-06-09

Description

See here:

https://openqa.stg.fedoraproject.org/tests/1851730#dependencies

We have a parent job, server_cockpit_default, with three children: realmd_join_cockpit, server_cockpit_basic, and server_cockpit_updates. One of the children, realmd_join_cockpit, is parallel with two other tests - realmd_join_sssd and server_role_deploy_domain_controller - which do not depend on server_cockpit_default.

It looks like what happened is that, while server_cockpit_default was running, realmd_join_sssd failed. openQA marked realmd_join_cockpit as parallel_failed because it was parallel with realmd_join_sssd, and then also decided to parallel_fail server_cockpit_default, I guess because it's the parent of realmd_join_cockpit. However, the other children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - were not cancelled. They ran anyway, and failed immediately because they could not find the disk image that server_cockpit_default would have uploaded had it not been cancelled.
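
To make the scenario easier to follow, here is a minimal Python sketch of the dependency graph and of the propagation described above. It is a toy model, not openQA's code: the dictionaries and both functions are hypothetical names invented for illustration. It also assumes that server_role_deploy_domain_controller is treated like any other parallel sibling, which is not confirmed here.

```python
# Toy model of the scenario; NOT openQA code. All names are illustrative.

# Chained dependencies: child -> parent whose disk image the child needs.
CHAINED_PARENT = {
    "realmd_join_cockpit": "server_cockpit_default",
    "server_cockpit_basic": "server_cockpit_default",
    "server_cockpit_updates": "server_cockpit_default",
}

# Jobs that run in parallel with each other.
PARALLEL_CLUSTER = {
    "realmd_join_cockpit",
    "realmd_join_sssd",
    "server_role_deploy_domain_controller",
}


def observed_propagation(failed_job):
    """Jobs that end up parallel_failed when failed_job fails, following
    the behaviour described in this ticket (an assumption, not the real logic)."""
    affected = set()
    if failed_job in PARALLEL_CLUSTER:
        # Parallel siblings of the failed job are parallel_failed.
        affected |= PARALLEL_CLUSTER - {failed_job}
    # The chained parent of any affected job is parallel_failed as well.
    affected |= {CHAINED_PARENT[j] for j in affected if j in CHAINED_PARENT}
    return affected


def stranded_children(affected):
    """Chained children whose parent was stopped but which were not stopped."""
    return {
        child
        for child, parent in CHAINED_PARENT.items()
        if parent in affected and child not in affected
    }


affected = observed_propagation("realmd_join_sssd")
stranded = stranded_children(affected)
print(sorted(stranded))  # ['server_cockpit_basic', 'server_cockpit_updates']
```

The two stranded jobs are exactly the ones that later ran and failed for lack of the disk image.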

We have a plugin downstream - https://src.fedoraproject.org/rpms/openqa/blob/rawhide/f/FedoraUpdateRestart.pm - which restarts update tests on first failure, by calling $job->auto_duplicate; (I intend to move to the newish built-in implementation of this, the RETRY variable, but haven't yet). In this case, that plugin kicked in and restarted job 1851739, the realmd_join_sssd job that failed. It seems that restarting that job caused its parallel siblings - including realmd_join_cockpit - to be restarted, and that also caused realmd_join_cockpit's parent, server_cockpit_default, to be restarted. However, server_cockpit_default's other two children were not restarted.

So we wind up with a situation where server_cockpit_default and realmd_join_cockpit both passed (on the restart), but the other two children of server_cockpit_default - server_cockpit_updates and server_cockpit_basic - are incomplete, and restarting the current instance of server_cockpit_default does not restart them, because they are not children of it. I also cannot restart the children directly, because openQA knows the disk image they need is missing.
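
The end state can be sketched as two separate instances of the parent, each with its own list of children. This is only an illustration of the situation described above: `JobInstance` and `reachable_by_restart` are hypothetical names, and the result strings are taken from the description rather than from openQA's database.

```python
# Toy model of the state after the restart; NOT openQA code.
# Class and function names are made up for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class JobInstance:
    name: str
    result: str
    children: List["JobInstance"] = field(default_factory=list)


# Original instances: the parent was parallel_failed, so no disk image exists.
old_parent = JobInstance("server_cockpit_default", "parallel_failed")
old_parent.children = [
    JobInstance("realmd_join_cockpit", "parallel_failed"),
    JobInstance("server_cockpit_basic", "incomplete"),
    JobInstance("server_cockpit_updates", "incomplete"),
]

# Clones created by the restart: only the parent and one child were cloned.
new_parent = JobInstance("server_cockpit_default", "passed")
new_parent.children = [JobInstance("realmd_join_cockpit", "passed")]


def reachable_by_restart(parent):
    """Names of the jobs a restart of this parent instance could cover."""
    return {parent.name} | {child.name for child in parent.children}


# Children of the old parent that no restart of the new parent can reach.
never_rerun = sorted(
    child.name
    for child in old_parent.children
    if child.name not in reachable_by_restart(new_parent)
)
print(never_rerun)
```

In this model the two remaining children hang off the old parent instance only, which is why restarting the new one never picks them up.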

The only way to get all the tests run, I think, would be to re-trigger the tests entirely. I can do that, but it seems like something should be improved here, though I'm not sure what. Don't parallel_fail the parent of a job that's being parallel_failed if it has other children?
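
As a rough sketch of what "improved" could mean, the toy model below compares two possible policies: the one suggested above (leave the parent alone if it has other children), and the alternative of stopping the parent together with all of its chained children. Neither is openQA's actual behaviour or a proposed patch; the policy names, `propagate`, and `is_consistent` are hypothetical.

```python
# Toy comparison of two possible policies; NOT openQA code.

CHAINED_PARENT = {
    "realmd_join_cockpit": "server_cockpit_default",
    "server_cockpit_basic": "server_cockpit_default",
    "server_cockpit_updates": "server_cockpit_default",
}
PARALLEL_CLUSTER = {
    "realmd_join_cockpit",
    "realmd_join_sssd",
    "server_role_deploy_domain_controller",
}


def children_of(parent):
    return {c for c, p in CHAINED_PARENT.items() if p == parent}


def propagate(failed_job, policy):
    """Return the set of jobs stopped when failed_job fails, under a policy."""
    siblings = PARALLEL_CLUSTER - {failed_job} if failed_job in PARALLEL_CLUSTER else set()
    stopped = set(siblings)
    for job in siblings:
        parent = CHAINED_PARENT.get(job)
        if parent is None:
            continue
        other_children = children_of(parent) - siblings
        if policy == "spare_shared_parent":
            # Only stop the parent if nobody else still needs it.
            if not other_children:
                stopped.add(parent)
        elif policy == "cancel_whole_family":
            # Stop the parent and every chained child that depends on it.
            stopped.add(parent)
            stopped |= other_children
        else:
            raise ValueError(f"unknown policy: {policy}")
    return stopped


def is_consistent(stopped):
    """True if no chained child is left runnable while its parent is stopped."""
    return all(
        parent not in stopped or child in stopped
        for child, parent in CHAINED_PARENT.items()
    )


spared = propagate("realmd_join_sssd", "spare_shared_parent")
family = propagate("realmd_join_sssd", "cancel_whole_family")
print(is_consistent(spared), is_consistent(family))
```

Both policies avoid the stranded-children state in this model. The first keeps server_cockpit_default running for the sake of its other children, while the second stops the whole family so that a later restart could bring all of it back together.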

Related issues: 1 (1 open, 0 closed)

Updated by AdamWill almost 3 years ago

Note, this is with a git snapshot from about a week ago. I don't recall this happening with older builds. I suspect this got changed by the changes @mkittler made around this area in April, e.g. 255d69a7a626af589f62853fdea83decdfff96ee.

Updated by okurz almost 3 years ago

Priority changed from Normal to Low
Target version set to future
Parent task set to #103962

Yes, I expect changes from this year could influence the behaviour.

Updated by mkittler almost 3 years ago

Related to coordination #110458: [epic] Improve `RETRY=…`-behavior for jobs with dependencies added

Updated by mkittler almost 3 years ago

Unfortunately the RETRY feature also doesn't handle things sensibly when certain dependencies are involved, see #110458. Likely both tickets are about a similar issue.

Updated by AdamWill almost 3 years ago

Yeah, I've noticed the same since switching to RETRY. I assume they wind up going down the same codepath (either RETRY uses $job->auto_duplicate or does effectively the same thing).

Updated by okurz almost 3 years ago

Parent task changed from #103962 to #112862

Move future ideas to the actual "Future ideas" tracker #112862
