action #70720: Unable to restart a child from START_DIRECTLY_AFTER_TEST chain if another child has been restarted already - openQA Project (public) - openSUSE Project Management Tool

action #70720

closed

Unable to restart a child from START_DIRECTLY_AFTER_TEST chain if another child has been restarted already

Added by ggardet_arm over 4 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2020-08-31

Due date:

% Done:

Estimated time:

Description

Observation¶

I had 2 jobs to restart (in skipped state). Both are children of RPi_flash_firmware@RPi3 and use START_DIRECTLY_AFTER_TEST.
I could restart the first job properly, but when I tried to restart the other child, I got an error:

Errors occurred when restarting jobs:
  Job 1375370 has already been cloned as 1377993

You can try to restart https://openqa.opensuse.org/tests/1375372

Steps to reproduce¶

Find a cluster of directly-chained jobs with a parent, here called A, and at least two children, called B1 and B2
Restart one child job, B1
Observe that B1 is cloned to B1' and A is cloned to A'
Navigate to the second child job, B2
Observe that the retrigger button is available
Observe that clicking on the retrigger button yields "Job A has already been cloned as A'" but B2 is not retriggered

Effect: There is no way to restart B2 via the UI, and the cluster relationship is not obvious
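
The steps above can be reproduced with a small toy model. This is only a sketch of the observed behavior, not openQA code: the `Job`, `Cluster` and `AlreadyCloned` names are hypothetical, and the model assumes that restarting a child always tries to clone its directly-chained parent first.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    id: int
    parent_id: Optional[int] = None  # directly-chained parent, if any
    clone_id: Optional[int] = None   # set once the job has been cloned


class AlreadyCloned(Exception):
    """Raised when a job that already has a clone is cloned again."""


class Cluster:
    """Toy stand-in for the job table (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.jobs: Dict[int, Job] = {}
        self._next_id = 1

    def add(self, parent_id: Optional[int] = None) -> Job:
        job = Job(self._next_id, parent_id)
        self.jobs[job.id] = job
        self._next_id += 1
        return job

    def _clone(self, job: Job, parent_id: Optional[int]) -> Job:
        if job.clone_id is not None:
            raise AlreadyCloned(
                f"Job {job.id} has already been cloned as {job.clone_id}"
            )
        clone = self.add(parent_id)
        job.clone_id = clone.id
        return clone

    def restart(self, job: Job) -> Job:
        # Assumed behavior: the parent is cloned implicitly, siblings are not touched.
        new_parent_id = None
        if job.parent_id is not None:
            new_parent_id = self._clone(self.jobs[job.parent_id], None).id
        return self._clone(job, new_parent_id)


if __name__ == "__main__":
    cluster = Cluster()
    a = cluster.add()
    b1 = cluster.add(a.id)
    b2 = cluster.add(a.id)
    cluster.restart(b1)      # clones A to A' and B1 to B1'
    try:
        cluster.restart(b2)  # fails because A already has a clone
    except AlreadyCloned as error:
        print(error)
```

In this model B2 ends up without a clone and without any link to A', which matches the effect described above.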

Acceptance criteria¶

AC1: Every job without a clone can be retriggered directly or indirectly or is linked to what can be considered a clone

Problem¶

Regression introduced with #68956 .

Suggestions¶

In the "Steps to reproduce" example above, the actual problem already starts with retriggering B1 to B1', which implicitly clones A to A' but does not touch the other siblings. Retriggering the parent A to A' would have correctly cloned all children. Maybe the easiest option for now is to just prevent retriggering a child of a directly-chained dependency and only allow retriggering the parent. Based on this I propose:

Given the API endpoint of the retrigger button click, e.g. the "duplicate" method,
If job has directly-chained parent and job has no clone,
Then explain that the proper way is to retrigger parent and link there and mention API alternatives
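
The Given/If/Then proposal can be sketched as a small guard function. This is a minimal illustration under assumptions, not the actual "duplicate" implementation: the `Job` fields, the `check_restart` name and the message wording are hypothetical; only the `/tests/<id>` link form is taken from the observation above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    directly_chained_parent_id: Optional[int] = None
    clone_id: Optional[int] = None


def check_restart(job: Job) -> Optional[str]:
    """Return a message explaining why the restart is refused, or None if it is fine."""
    parent_id = job.directly_chained_parent_id
    if parent_id is not None and job.clone_id is None:
        return (
            f"Job {job.id} is directly chained to parent job {parent_id}; "
            f"restart the parent instead (/tests/{parent_id}) or use the API."
        )
    return None
```

A caller would show the returned message instead of cloning the job, so the sibling relationship is never broken in the first place.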

Workaround¶

Instead of trying to restart failed children, restart the parent via the API and skip passed children, following https://open.qa/docs/#_further_notes_2
As an alternative to START_DIRECTLY_AFTER_TEST one can define a specific "machine" with a specific worker class that is only fulfilled by a single, unique worker instance
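
The first workaround could look roughly like the following helper, which only builds the URL for a POST request and does not send anything. The route `/api/v1/jobs/<id>/restart` and the `skip_ok_result_children` parameter are assumptions here and should be checked against the documentation linked above; `parent_restart_url` is a hypothetical name.

```python
from urllib.parse import urlencode, urljoin


def parent_restart_url(host: str, parent_job_id: int,
                       skip_passed_children: bool = True) -> str:
    """Build the URL for restarting a parent job via the API (to be sent as POST)."""
    # Assumed route; verify against the openQA API documentation.
    url = urljoin(host, f"/api/v1/jobs/{parent_job_id}/restart")
    # Assumed parameter name for skipping children that already passed.
    query = {"skip_ok_result_children": 1} if skip_passed_children else {}
    return f"{url}?{urlencode(query)}" if query else url


if __name__ == "__main__":
    # 1375370 is the parent job from the error message in the observation.
    print(parent_restart_url("https://openqa.opensuse.org", 1375370))
```

The request itself needs API credentials, so in practice it would be sent with an authenticated client rather than a plain HTTP call.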

Related issues 2 (1 open — 1 closed)

Updated by ggardet_arm over 4 years ago

Description updated (diff)

Updated by mkittler over 4 years ago

The best solution which is currently available is mentioned within the documentation: https://open.qa/docs/#_further_notes_2

This feature has actually already been requested multiple times. See the comments in #68956, #69979, https://github.com/os-autoinst/openQA/pull/3262#issuecomment-673382461 and https://github.com/os-autoinst/openQA/pull/3300. So the overall problem is tricky and cannot be trivially solved. My best idea so far is noted down in #70618.

This ticket is basically just the first point from #70618 where I claim it isn't much of an improvement. But if you really just wanted to be able to restart the job regardless of how many times the parent is executed, implementing this first step alone might already be an improvement. I'm keeping this ticket open as it is not an exact duplicate of #70618.

Updated by mkittler over 4 years ago

Related to action #70618: Automatically avoid restarting the directly chained parent if possible to save time added

Updated by okurz over 4 years ago

Category set to Feature requests
Priority changed from High to Low
Target version set to future

@ggardet_arm I hope the different possibilities as described by mkittler help you for the time being. With this I would like to regard this feature request as low prio and actually not schedule it to be done by the SUSE QA Tools team, hence selecting the "future" target version. Cool ideas suggested in pull requests by any contributor are always welcome though :)

Updated by mkittler over 4 years ago

Related to action #69979: Advanced job restarting via the web UI added

Updated by okurz over 4 years ago

Description updated (diff)

Updated by okurz over 4 years ago

Description updated (diff)

Updated by okurz over 4 years ago

Description updated (diff)
Category changed from Feature requests to Regressions/Crashes
Status changed from New to Workable
Priority changed from Low to High
Target version changed from future to Ready

Based on more feedback I invested more time into the topic. The impact of the problem is a bit bigger than I expected. After #68956 a user can easily end up unable to retrigger a job at all, even though other jobs in the same cluster could be retriggered with a simple click of a button in the web UI.

Maybe, instead of allowing the operation to clone a child and its parent while destroying the relationship to the other children, we should just point to the parent so that the user restarts that one. Updated the ticket description with "steps to reproduce", "suggestions", etc.

Updated by mkittler over 4 years ago

Assignee set to mkittler

#10

Updated by mkittler over 4 years ago

PR following the suggestion: https://github.com/os-autoinst/openQA/pull/3401

So this doesn't really allow restarting the child. In fact one receives the same error as before. The change is to already show an error when restarting the first child, with a suggestion to restart the parent instead. Similar to the missing assets detection, one can click on the force button to restart the job nevertheless (including its direct parent).

Of course it would still be possible to actually allow restarting the parent by restarting its clone (or re-using the existing clone if it is still scheduled). However, without fully implementing #70618 that's likely not very useful because the parent would be run unnecessarily often.

#11

Updated by mkittler over 4 years ago

Status changed from Workable to In Progress

#12

Updated by mkittler over 4 years ago

Status changed from In Progress to Feedback

The PR has been merged. As mentioned in the previous comment, I implemented @okurz's suggestion, which doesn't really allow restarting the child. However, I suppose that's good enough for now and the real solution (at least the best I can currently think of) would be #70618.

#13

Updated by mkittler over 4 years ago

Status changed from Feedback to Resolved
