Project

General

Profile

Actions

action #150917

closed

Restarting a job together with failed children will break dependencies of the new job

Added by MDoucha 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-11-15
Due date:
% Done:

0%

Estimated time:

Description

Restarting a job together with failed children will break dependencies of the new job

Restarting a job with parent/child jobs using the Skip restarting "OK" (passed/softfailed) children drop-down link will break dependencies of the new job. In addition to the restarted children, it'll also link to all parents and children of the original job:
https://openqa.suse.de/tests/12817616#dependencies

The broken dependencies then prevent further restarts.

Acceptance criteria

  • AC1: After restarting a chained parent (using "Skip restarting OK" but actually regardless of the option) none of the chained children have two parents
  • AC2: The same counts for other dependency types.

Suggestion

  • Find out why we have not seen this in before. Does nobody else use that or care about it?
  • Replicate the scenario by extending/adjusting the unit tests
  • Clone the scenario locally to test this for real (no reason to run any jobs, just fake results via DB updates)
Actions #1

Updated by okurz 5 months ago

  • Category set to Support
  • Target version set to Ready
Actions #2

Updated by mkittler 5 months ago

  • Just for the record: The linked dependency tree contains only chained dependencies. So I guess one would try a tree with chained dependencies when reproducing but it would also make sense to later check whether other dependency types are affected as well.
  • When I remember correctly, it is wanted/expected behavior that restarting a child but not its chained parent will lead to the not-restarted parent becoming the chained parent of the restarted child as well. I think this is wanted because otherwise the newly restarted child would not have a parent at all and e.g. cloning that job would not automatically clone any parent. It would also be a bit intransparent that this job is actually a chained child. And without the dependency the chained child would also not have access to private assets created by the parent. The downside of this is of course that it can lead to very big dependency trees in which is is not obvious at all what jobs have been restarted. Not relevant here as we actually restart the parent here. (So all parents are restarted here and only some of the children not.)
  • I'm not sure how this cluster looked before the restarting. It would be great to have a simpler case with a before and after state.
  • I'm not sure what dependencies are broken exactly here. It would be great to know the expected outcome.
Actions #3

Updated by MDoucha 5 months ago

mkittler wrote in #note-2:

  • I'm not sure how this cluster looked before the restarting. It would be great to have a simpler case with a before and after state.
  • I'm not sure what dependencies are broken exactly here. It would be great to know the expected outcome.

The dependency tree before restarting looked exactly like this (different build, same job schedule):
https://openqa.suse.de/tests/12808877#dependencies

Attaching the restarted job to parents is OK. Attaching it to children of another job is not. That's what broke in the example in ticket description.

Actions #4

Updated by livdywan 5 months ago

  • Description updated (diff)
Actions #5

Updated by mkittler 5 months ago

  • Category changed from Support to Feature requests
  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #6

Updated by mkittler 5 months ago

Looks that in my commit 5f55768e5b36770199c9aaf52157be569f613e76 I actually added tests asserting that the dependency creation happens as it currently happens but with a note that it isn't clear whether this is wanted. Considering this ticket this was likely the wrong decisions so I'll change it now.

When reproducing this for real I also ran into a UI issue for which I have already created a PR: https://github.com/os-autoinst/openQA/pull/5364

Actions #7

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

PR to fix the actual issue: https://github.com/os-autoinst/openQA/pull/5365

I'll definitely fix the problematic behavior for this scenario because I've tested it locally after cloning https://openqa.suse.de/tests/12808877#dependencies with child jobs. I invoked update jobs set state = 'done', result = 'passed' where state = 'scheduled' and TEST like '%ltp%'; which conveniently made all jobs appear passed except for the child kernel-live-patching@ppc64le-virtio. When then restarting the parent skipping "ok" children one ends up with a very simply dependency tree of only the restarted parent and the single child kernel-live-patching@ppc64le-virtio. I think that's how it is supposed to work.

None of the existing tests broke after the change (except the one case which we want to change). That's good and I currently also cannot think of a case that would break due to the early-return introduced by this PR.

Actions #8

Updated by mkittler 5 months ago

My changes have been deployed on OSD as of Saturday.

@MDoucha So you may want to try it out in production. Otherwise we can also just resolve this issue (and possibly reopen it later if it doesn't behave as expected after all).

Actions #9

Updated by MDoucha 5 months ago

mkittler wrote in #note-8:

My changes have been deployed on OSD as of Saturday.

@MDoucha So you may want to try it out in production. Otherwise we can also just resolve this issue (and possibly reopen it later if it doesn't behave as expected after all).

Tested, looks good: https://openqa.suse.de/tests/12856086#dependencies
Original jobs: https://openqa.suse.de/tests/12848951#dependencies

Actions #10

Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

ok, thx

Actions

Also available in: Atom PDF