action #112868
opencoordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
Helpful instructions to prevent incomplete cluster restarts
Start date: 2022-06-22
Due date:
% Done: 0%
Estimated time:
Description
Motivation
In a case like
https://openqa.suse.de/tests/8966763#dependencies
a job did not pass, so users might want to restart it. Trying to retrigger it using the button in the webUI shows an error:
Errors occurred when restarting jobs:
Job 8966755 already has clone 8998406
First, it is inconvenient that only the job IDs are shown and no links are rendered. Second, the user still wants to restart the job but cannot. In the above example, 8966755 is the serial parent "create_hdd_ha_textmode_maintenance", which already has a clone, 8998406, which was likely created when a job in another sub-cluster was retriggered.
Suggestions
- In https://github.com/os-autoinst/openQA/blob/1fa560517e812a3886219eb3667e9fc05f9f873d/lib/OpenQA/Schema/Result/Jobs.pm#L625
  - Extend the die message, maybe with proper URLs. Can we do that here?
  - Add explanations to the die message that tell the user what options they have, e.g. include the text from the section "Workaround" further below
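To illustrate what an extended message could look like, here is a minimal sketch in Python. It is not the Perl code in Jobs.pm; the function name `clone_error_message`, its parameters, and the hint wording are hypothetical. Only the `/tests/<id>` URL pattern is taken from the link in the "Motivation" section.

```python
# Hypothetical illustration only: not the openQA implementation, which is written in Perl.
def clone_error_message(job_id: int, clone_id: int, base_url: str) -> str:
    """Build an error text that includes URLs and a hint about the user's options."""
    # The /tests/<id> pattern follows the job link quoted in this ticket.
    job_url = f"{base_url}/tests/{job_id}"
    clone_url = f"{base_url}/tests/{clone_id}"
    return (
        f"Job {job_id} ({job_url}) already has clone {clone_id} ({clone_url}). "
        "Hint: restart the serial parent of the complete cluster "
        "instead of any child job."
    )


if __name__ == "__main__":
    print(clone_error_message(8966755, 8998406, "https://openqa.suse.de"))
```

Whether the URLs can be rendered as clickable links depends on how the webUI displays the error, which this sketch does not address.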
Further details
See https://suse.slack.com/archives/C02CANHLANP/p1655887247175179 for details
Workaround
- To avoid this problem, retrigger the serial parent of the multiple sub-clusters to achieve consistent results
- To fix the situation once an incomplete cluster has already been created, delete the serial parent job that prevents cloning of the original failed job, then restart the serial parent of the complete cluster (instead of any child job)
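Finding the job to restart can be sketched as walking up the chained dependencies until a job without a serial parent is reached. The following Python sketch assumes a hypothetical mapping from each job ID to its single serial parent (or `None`); openQA's real dependency data is richer, and the direct parent/child link between the two example IDs is assumed for illustration.

```python
# Hypothetical illustration: the data layout is an assumption, not openQA's schema.
from typing import Dict, Optional


def serial_root(job_id: int, serial_parent: Dict[int, Optional[int]]) -> int:
    """Follow serial-parent links upward and return the topmost job."""
    seen = set()
    while serial_parent.get(job_id) is not None:
        if job_id in seen:
            # Guard against malformed input that contains a cycle.
            raise ValueError(f"dependency cycle detected at job {job_id}")
        seen.add(job_id)
        job_id = serial_parent[job_id]
    return job_id


if __name__ == "__main__":
    # Assumed example: 8966763 depends on the serial parent 8966755.
    parents = {8966763: 8966755, 8966755: None}
    print(serial_root(8966763, parents))
```

The returned ID is the job to restart instead of the failed child.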