action #70720

Unable to restart a child from START_DIRECTLY_AFTER_TEST chain if another child has been restarted already

Added by ggardet_arm 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2020-08-31
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

I had two jobs to restart (both in skipped state). Both are children of RPi_flash_firmware@RPi3 and use START_DIRECTLY_AFTER_TEST.
I can restart the first job properly, but when I try to restart the other child, I get an error:

Errors occurred when restarting jobs:
  Job 1375370 has already been cloned as 1377993

You can try to restart https://openqa.opensuse.org/tests/1375372

Steps to reproduce

  • Find a cluster of directly-chained jobs with a parent, here called A, and at least two children, called B1 and B2
  • Restart one child job, B1
  • Observe that B1 is cloned to B1' and A is cloned to A'
  • Navigate to the second child job, B2
  • Observe that the retrigger button is available
  • Observe that clicking the retrigger button yields "Job A has already been cloned as A'" and B2 is not retriggered

Effect: There is no way to restart B2 via the UI, and the cluster relationship is not obvious
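
The steps above can be modelled with a small, self-contained sketch. This is not openQA code: the `Job` class, the `restart_child` function and the `AlreadyCloned` exception are hypothetical names made up for illustration, and the sketch only assumes the behaviour described in this ticket, namely that restarting a child also clones its directly-chained parent but leaves siblings untouched.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    parent: Optional[str] = None  # directly-chained parent, if any
    clone: Optional[str] = None   # name of the clone, if already cloned


class AlreadyCloned(Exception):
    """Hypothetical error type mirroring the message shown in the web UI."""


def restart_child(jobs: Dict[str, Job], child_name: str) -> List[str]:
    """Clone a child and its directly-chained parent, ignoring siblings."""
    child = jobs[child_name]
    to_clone = [jobs[child.parent], child] if child.parent else [child]
    # Refuse the whole operation if any involved job already has a clone.
    for job in to_clone:
        if job.clone is not None:
            raise AlreadyCloned(
                f"Job {job.name} has already been cloned as {job.clone}"
            )
    created = []
    for job in to_clone:
        job.clone = job.name + "'"
        created.append(job.clone)
    return created


jobs = {
    "A": Job("A"),
    "B1": Job("B1", parent="A"),
    "B2": Job("B2", parent="A"),
}

print(restart_child(jobs, "B1"))  # clones A and B1
try:
    restart_child(jobs, "B2")     # fails because A already has a clone
except AlreadyCloned as err:
    print(err)
```

In this model B2 ends up without a clone and without any way to get one, which matches the effect described above.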

Acceptance criteria

  • AC1: Every job without a clone can be retriggered directly or indirectly or is linked to what can be considered a clone

Problem

Regression introduced with #68956.

Suggestions

In the above example "Steps to reproduce" the actual problem already starts with retriggering B1 to B1' which implicitly clones A to A' but does not touch other siblings. Retriggering the parent A to A' would have correctly cloned all children. Maybe it is the easiest option for now to just prevent retriggering a child of a directly-chained dependency and only allow retriggering the parent. Based on this I propose:

  • Given the API endpoint behind the retrigger button click, e.g. the "duplicate" method: if the job has a directly-chained parent and the job has no clone, then explain that the proper way is to retrigger the parent, link to the parent and mention API alternatives

Workaround

  • Instead of trying to restart failed children, restart the parent via the API and skip passed children, following https://open.qa/docs/#_further_notes_2
  • As an alternative to START_DIRECTLY_AFTER_TEST one can define a specific "machine" with a specific worker class that is only fulfilled by a single, unique worker instance
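
As a rough illustration of the first workaround, the following sketch picks the children that would need to run again when the parent is restarted and passed children are skipped. It does not call openQA; `children_to_rerun` is a hypothetical helper, and the set of results treated as "passed" is an assumption. The actual API options are described in the documentation linked above.

```python
from typing import Dict, List

# Assumption: which results count as "passed" for the purpose of skipping.
OK_RESULTS = {"passed", "softfailed"}


def children_to_rerun(child_results: Dict[str, str]) -> List[str]:
    """Return the children that still need to run after a parent restart."""
    return sorted(
        name for name, result in child_results.items()
        if result not in OK_RESULTS
    )


results = {"B1": "passed", "B2": "skipped", "B3": "failed"}
print(children_to_rerun(results))
```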

Related issues

Related to openQA Project - action #70618: Automatically avoid restarting the directly chained parent if possible to save time (New, 2020-08-27)

Related to openQA Project - action #69979: Advanced job restarting via the web UI (New, 2020-08-13)

History

#1 Updated by ggardet_arm 5 months ago

  • Description updated (diff)

#2 Updated by mkittler 5 months ago

The best solution which is currently available is mentioned within the documentation: https://open.qa/docs/#_further_notes_2

This feature has actually already been requested multiple times. See the comments in #68956, #69979, https://github.com/os-autoinst/openQA/pull/3262#issuecomment-673382461 and https://github.com/os-autoinst/openQA/pull/3300. So the overall problem is tricky and cannot be trivially solved. My best idea so far is noted down in #70618.

This ticket is basically just the first point from #70618, where I claim it isn't much of an improvement. But if you really just want to be able to restart the job regardless of how many times the parent is executed, implementing this first step alone might already be an improvement. I'm keeping this ticket open as it is not an exact duplicate of #70618.

#3 Updated by mkittler 5 months ago

  • Related to action #70618: Automatically avoid restarting the directly chained parent if possible to save time added

#4 Updated by okurz 5 months ago

  • Category set to Feature requests
  • Priority changed from High to Low
  • Target version set to future

ggardet_arm I hope the different possibilities as described by mkittler help you for the time being. With this I would like to regard this feature request as low prio and actually not schedule it to be done by the SUSE QA Tools team, hence selecting the "future" target version. Cool ideas suggested in pull requests by any contributor are always welcome though :)

#5 Updated by mkittler 5 months ago

  • Related to action #69979: Advanced job restarting via the web UI added

#6 Updated by okurz 5 months ago

  • Description updated (diff)

#7 Updated by okurz 5 months ago

  • Description updated (diff)

#8 Updated by okurz 4 months ago

  • Description updated (diff)
  • Category changed from Feature requests to Concrete Bugs
  • Status changed from New to Workable
  • Priority changed from Low to High
  • Target version changed from future to Ready

Based on more feedback I invested more time into the topic. The impact of the problem is a bit bigger than I expected. Since #68956 a user can easily end up unable to retrigger a job at all, even though other jobs in the same cluster could be retriggered with a simple click of a button in the web UI.

Maybe, instead of allowing an operation that clones a child and its parent but destroys the relationship to the other children, we should just point to the parent so that one is restarted. Updated ticket description with "steps to reproduce", "suggestions", etc.

#9 Updated by mkittler 4 months ago

  • Assignee set to mkittler

#10 Updated by mkittler 4 months ago

PR following the suggestion: https://github.com/os-autoinst/openQA/pull/3401

So this doesn't really allow restarting the child. In fact one receives the same error as before. The change is that an error is now shown already when restarting the first child, with a suggestion to restart the parent instead. Similar to the missing assets detection, one can click on the force button to restart the job nevertheless (including its direct parent).
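
The behaviour described in this comment can be sketched as follows. This is a simplified model and not the code from the PR: `Job`, `RestartRefused` and `jobs_to_clone` are hypothetical names, and the job ids are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    id: int
    directly_chained_parent: Optional[int] = None  # id of the parent, if any


class RestartRefused(Exception):
    """Hypothetical error telling the user to restart the parent instead."""


def jobs_to_clone(job: Job, force: bool = False) -> List[int]:
    """Return the ids that a restart of `job` would clone."""
    parent = job.directly_chained_parent
    if parent is None:
        return [job.id]
    if not force:
        # Refuse already on the first child and point to the parent.
        raise RestartRefused(
            f"Job {job.id} has a directly chained parent ({parent}); "
            "restart the parent instead or force the restart"
        )
    # Forced restart: the job is cloned together with its direct parent.
    return [parent, job.id]


parent = Job(id=1)
child = Job(id=2, directly_chained_parent=1)

try:
    jobs_to_clone(child)
except RestartRefused as err:
    print(err)

print(jobs_to_clone(child, force=True))
```

Forcing the restart in this model still clones only the one child and its parent, so the siblings remain affected in the same way as before.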

Of course it would still be possible to actually allow restarting the parent by restarting its clone (or re-using the existing clone if it is still scheduled). However, without fully implementing #70618 that's likely not very useful because the parent would be run unnecessarily often.

#11 Updated by mkittler 4 months ago

  • Status changed from Workable to In Progress

#12 Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

The PR has been merged. As mentioned in the previous comment, I've implemented okurz's suggestion, which doesn't really allow restarting the child. However, I suppose that's good enough for now and the real solution (at least the best I can currently think of) would be #70618.

#13 Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved
