action #70720
closed
Unable to restart a child from START_DIRECTLY_AFTER_TEST chain if another child has been restarted already
Description
Observation
I had 2 jobs to restart (in skipped state). Both are children of RPi_flash_firmware@RPi3 and use START_DIRECTLY_AFTER_TEST.
I can restart the first job properly, but when I try to restart the other child, I get an error:
Errors occurred when restarting jobs:
Job 1375370 has already been cloned as 1377993
You can try to restart https://openqa.opensuse.org/tests/1375372
Steps to reproduce
- Find cluster of directly-chained jobs with parent, here called A, and at least two children, called B1 and B2
- Restart one child job, B1
- Observe that B1 is cloned to B1' and A is cloned to A'
- Navigate to second child job, B2
- Observe that retrigger button is available
- Observe that clicking on retrigger button yields "Job A has already been cloned as A'" but B2 is not retriggered
Effect: There is no way to restart B2 via the web UI, and the cluster relationship is not obvious
Acceptance criteria
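The steps above can be replayed with a small toy model. This is only a sketch of the observed behavior, not openQA's actual implementation: the `Job` and `Cluster` classes and the `restart` method are hypothetical names, and the model assumes that restarting a child always tries to clone its directly-chained parent as well.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    parent: Optional[str] = None  # directly-chained parent, if any
    clone: Optional[str] = None   # name of the clone once the job was restarted


class Cluster:
    """Hypothetical in-memory stand-in for a cluster of directly-chained jobs."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def add(self, name: str, parent: Optional[str] = None) -> None:
        self.jobs[name] = Job(name, parent)

    def restart(self, name: str) -> List[str]:
        """Restart a job plus its directly-chained parent.

        Returns a list of error messages; an empty list means success.
        """
        job = self.jobs[name]
        to_clone = [job]
        if job.parent is not None:
            # The parent is cloned first so the child's clone can point to it.
            to_clone.insert(0, self.jobs[job.parent])

        errors = [
            f"Job {j.name} has already been cloned as {j.clone}"
            for j in to_clone
            if j.clone is not None
        ]
        if errors:
            return errors  # nothing is cloned, which is what blocks B2

        for j in to_clone:
            clone_name = j.name + "'"
            new_parent = self.jobs[j.parent].clone if j.parent else None
            j.clone = clone_name
            self.jobs[clone_name] = Job(clone_name, new_parent)
        return []


cluster = Cluster()
cluster.add("A")
cluster.add("B1", parent="A")
cluster.add("B2", parent="A")

first = cluster.restart("B1")   # clones A to A' and B1 to B1'
second = cluster.restart("B2")  # fails because A already has a clone
```

In this model the second call returns the error about A and leaves B2 without a clone, which matches the effect described above.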
- AC1: Every job without a clone can be retriggered directly or indirectly or is linked to what can be considered a clone
Problem
Regression introduced with #68956.
Suggestions
In the above example "Steps to reproduce" the actual problem already starts with retriggering B1 to B1' which implicitly clones A to A' but does not touch other siblings. Retriggering the parent A to A' would have correctly cloned all children. Maybe it is the easiest option for now to just prevent retriggering a child of a directly-chained dependency and only allow retriggering the parent. Based on this I propose:
- Given the API endpoint of the retrigger button click, e.g. the "duplicate" method, If the job has a directly-chained parent and no clone, Then explain that the proper way is to retrigger the parent, link to the parent and mention API alternatives
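The proposed rule can be sketched as a simple guard. This is an illustration only: the `Job` fields and the `restart_hint` function are hypothetical and do not mirror openQA's schema or its "duplicate" method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    directly_chained_parent_id: Optional[int] = None  # hypothetical field name
    clone_id: Optional[int] = None                    # hypothetical field name


def restart_hint(job: Job) -> Optional[str]:
    """Return a hint instead of restarting, or None if restarting is fine.

    Implements the rule: job has a directly-chained parent and no clone
    => point the user to the parent and mention the API.
    """
    if job.directly_chained_parent_id is not None and job.clone_id is None:
        return (
            f"Job {job.id} has a directly-chained parent. "
            f"Restart parent job {job.directly_chained_parent_id} instead, "
            "or restart the cluster over the API."
        )
    return None


parent = Job(id=1)
child = Job(id=2, directly_chained_parent_id=1)
hint_for_child = restart_hint(child)
hint_for_parent = restart_hint(parent)
```

With this guard the first child restart already produces the hint, so the cluster never reaches the state where siblings are left behind.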
Workaround
- Instead of trying to restart failed children, restart the parent via the API and skip passed children, following https://open.qa/docs/#_further_notes_2
- As an alternative to START_DIRECTLY_AFTER_TEST one can define a specific "machine" with a specific worker class that is only fulfilled by a single, unique worker instance
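As a rough illustration of the first workaround, the snippet below only builds the URL for a restart request on the parent job. The host and job ID are placeholders, and the query parameter `skip_ok_result_children` is an assumption that should be checked against the linked documentation. The request itself would have to be sent as an authenticated POST, which this sketch does not do.

```python
from urllib.parse import urlencode


def restart_parent_url(host: str, parent_job_id: int,
                       skip_passed_children: bool = True) -> str:
    """Build the restart URL for a parent job (hypothetical helper).

    The parameter name "skip_ok_result_children" is an assumption;
    verify it against the openQA documentation before relying on it.
    """
    params = {"skip_ok_result_children": 1} if skip_passed_children else {}
    query = f"?{urlencode(params)}" if params else ""
    return f"{host.rstrip('/')}/api/v1/jobs/{parent_job_id}/restart{query}"


# Placeholder host and job ID, not taken from the ticket
url = restart_parent_url("https://openqa.example.org/", 1234)
```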
Updated by mkittler over 4 years ago
The best solution which is currently available is mentioned within the documentation: https://open.qa/docs/#_further_notes_2
This feature has actually been requested multiple times already. See the comments in #68956, #69979, https://github.com/os-autoinst/openQA/pull/3262#issuecomment-673382461 and https://github.com/os-autoinst/openQA/pull/3300. The overall problem is tricky and cannot be solved trivially. My best idea so far is noted down in #70618.
This ticket is basically just the first point from #70618, where I claim it isn't much of an improvement. But if you really just want to be able to restart the job regardless of how many times the parent is executed, implementing this first step alone might already be an improvement. I'm keeping this ticket open as it is not an exact duplicate of #70618.
Updated by mkittler over 4 years ago
- Related to action #70618: Automatically avoid restarting the directly chained parent if possible to save time added
Updated by okurz over 4 years ago
- Category set to Feature requests
- Priority changed from High to Low
- Target version set to future
@ggardet_arm I hope the different possibilities as described by mkittler help you for the time being. With this I would like to regard this feature request as low prio and actually not schedule it to be done by the SUSE QA Tools team, hence selecting the "future" target version. Cool ideas suggested in pull requests by any contributor are always welcome though :)
Updated by mkittler over 4 years ago
- Related to action #69979: Advanced job restarting via the web UI added
Updated by okurz over 4 years ago
- Description updated (diff)
- Category changed from Feature requests to Regressions/Crashes
- Status changed from New to Workable
- Priority changed from Low to High
- Target version changed from future to Ready
Based on more feedback I invested more time into the topic. The impact of the problem is a bit bigger than I expected. Since #68956 a user can easily end up unable to retrigger a job at all, even though other jobs in the same cluster can be retriggered with a simple click of a button in the web UI.
Maybe, instead of allowing an operation that clones a child and its parent but destroys the relationship to the other children, we should just point to the parent and ask the user to restart that one. Updated the ticket description with "steps to reproduce", "suggestions", etc.
Updated by mkittler over 4 years ago
PR following the suggestion: https://github.com/os-autoinst/openQA/pull/3401
So this doesn't really allow restarting the child. In fact one receives the same error as before. The change is that an error is now already shown when restarting the first child, with a suggestion to restart the parent instead. Similar to the missing assets detection, one can click on the force button to restart the job nevertheless (including its direct parent).
Of course it would still be possible to actually allow restarting the parent by restarting its clone (or re-using the existing clone if it is still scheduled). However, without fully implementing #70618 that's likely not very useful because the parent would be run unnecessarily often.
Updated by mkittler over 4 years ago
- Status changed from Workable to In Progress
Updated by mkittler over 4 years ago
- Status changed from In Progress to Feedback