action #81859
coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd
openqa-investigate triggers incomplete sets for multi-machine scenarios
Description
Observation
From https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/425#note_286971 "MM tests are not restarted properly, e.g. of 3 node MM test are restarted only two nodes and twice, node one with supportserver and node two with supportserver. https://openqa.suse.de/tests/5252051#dependencies"
Acceptance criteria
- AC1: No incomplete multi-machine clusters are triggered by "openqa-investigate", i.e. either failing tests in multi-machine scenarios trigger correct multi-machine clusters or no wrong investigation jobs are triggered in these cases
Suggestions
- Explore if the parameter "--clone-children" would help to trigger correct multi-machine test clusters for investigation
- Try to find a correct way to trigger multi-machine tests for investigation or, if that is not possible, exclude them from investigation jobs
Updated by okurz over 3 years ago
As wished by dzedro I accepted an MR to disable investigation jobs for OSD maintenance scenarios although there is nothing OSD nor maintenance specific going on but maybe it makes him less grumpy :) After improving how investigation jobs are triggered for multi-machine scenarios we can re-enable openqa-investigate for this specific use case as well.
Updated by okurz over 3 years ago
- Related to action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd added
Updated by okurz over 3 years ago
- Description updated (diff)
- Status changed from New to Workable
The openqa-clone-job parameter "--clone-children" has been mentioned. It likely comes with caveats. I assume either mkittler or Xiaojing_liu would be able to come up with a viable solution :)
Updated by dzedro over 3 years ago
I'm not sure whether MM jobs with 3+ nodes can be cloned and, if so, whether the clone job needs special options or the cloned job must contain PARALLEL_WITH with all nodes.
I created these 3-node jobs to reproduce what was happening with the clone: https://openqa.suse.de/tests/5274167#dependencies
The result of 2x 2 nodes being restarted out of 3 happens with a "simple" clone_job, e.g. when clone_job is run on node 1 and node 2.
As a result, only 2 of the 3 nodes are cloned/restarted: https://openqa.suse.de/tests/5289094#dependencies
sudo -u geekotest /usr/share/openqa/script/clone_job.pl --from localhost --host localhost --skip-download --skip-chained-deps 5274167
Cloning dependencies of sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit
Created job #5289085: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_support_server@64bit -> http://localhost/t5289085
Created job #5289086: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit -> http://localhost/t5289086
sudo -u geekotest /usr/share/openqa/script/clone_job.pl --from localhost --host localhost --skip-download --skip-chained-deps --clone-children 5274167
Cloning dependencies of sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit
Created job #5289093: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_support_server@64bit -> http://localhost/t5289093
Created job #5289094: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit -> http://localhost/t5289094
Updated by mkittler over 3 years ago
@dzedro You need to clone the parallel parent, which is the job that has no PARALLEL_WITH setting itself but is mentioned as PARALLEL_WITH in another job. So the job you need to clone is usually "the server", e.g. script/openqa-clone-job … --clone-children https://openqa.suse.de/tests/5274166 clones 3 jobs here.
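The rule above can be sketched in a few lines of Python. This is only an illustration: the dict layout (test name mapped to its settings), the function name and the test names are hypothetical, and it assumes that PARALLEL_WITH holds a comma-separated list of test names.

```python
def find_parallel_parents(cluster):
    """Return the names of jobs that have no PARALLEL_WITH setting themselves
    but are referenced in the PARALLEL_WITH setting of at least one other job.

    `cluster` maps a test name to its settings dict (hypothetical layout).
    """
    referenced = set()
    for settings in cluster.values():
        value = settings.get("PARALLEL_WITH", "")
        # Assumption: multiple parents are given as a comma-separated list.
        referenced.update(name.strip() for name in value.split(",") if name.strip())
    return sorted(
        name
        for name, settings in cluster.items()
        if not settings.get("PARALLEL_WITH") and name in referenced
    )


# Illustrative three-node cluster; the test names are made up.
cluster = {
    "support_server": {},
    "node01": {"PARALLEL_WITH": "support_server"},
    "node02": {"PARALLEL_WITH": "support_server"},
}
print(find_parallel_parents(cluster))
```

Cloning the job returned here with --clone-children is what would cover all three jobs; cloning node01 or node02 directly would not.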
I see the following problems:
- The clone job script operates non-atomically. So it can happen that one job of the cluster is cloned successfully and the next one fails. I'm not aware that the successfully cloned job would be discarded again in this case. Hence this looks like a recipe for ending up with half-scheduled clusters anyway.
- openqa-investigate considers each job (it possibly ends up cloning) individually. This way it might end up cloning the same cluster twice if multiple jobs within the same cluster are selected to be cloned. This also means that openqa-investigate possibly ends up cloning just the parallel child instead of the parent. Even with --clone-children this would not recreate the whole cluster, as mentioned before.
- Using the clone job script also means that we're relying on the scheduler to repair half-assigned MM clusters and that directly chained dependencies are not supported at all.
Updated by mkittler over 3 years ago
By the way, the restart API of openQA already does the right thing for each dependency type automatically and in an atomic way. If I remember correctly, it also returns the IDs of jobs which have been restarted, so openqa-investigate could take these into account to avoid restarting the same cluster twice. So it seems tempting to simply use that API instead of the clone-job script. However, it has the following limitations so far:
- It only works within the same instance. That shouldn't be a problem here.
- It does not allow changing settings. That's not much effort to implement and likely useful anyway.
- The new job is always considered a clone of the original job, and a job can only be restarted if it has no clone yet. I suppose we would need a "detached" mode for the restart API to circumvent that. Likely not much effort to implement.
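How the returned IDs could be used to avoid restarting the same cluster twice might look roughly like this. It is a sketch under assumptions: `restart` is a hypothetical callable standing in for the restart API call, and it is assumed to return a mapping from every restarted old job ID to its new job ID.

```python
def restart_once_per_cluster(candidate_ids, restart):
    """Restart each candidate unless an earlier restart already covered it.

    `restart` is a hypothetical callable that takes a job ID and returns a
    dict mapping every restarted old job ID to its new job ID.
    """
    covered = {}
    for job_id in candidate_ids:
        if job_id in covered:
            # This job was already restarted as part of another job's cluster.
            continue
        covered.update(restart(job_id))
    return covered


# Fake restart for illustration: jobs 1, 2 and 3 form one cluster, 7 is standalone.
calls = []

def fake_restart(job_id):
    calls.append(job_id)
    members = {1, 2, 3} if job_id in {1, 2, 3} else {job_id}
    return {old: old + 100 for old in members}

result = restart_once_per_cluster([2, 3, 7, 1], fake_restart)
print(calls, result)
```

With the fake data above, only the first cluster member and the standalone job trigger an actual restart call.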
Updated by okurz over 3 years ago
mkittler wrote:
By the way, the restart API of openQA already does the right thing for each dependency type automatically and in an atomic way. If I remember correctly, it also returns the IDs of jobs which have been restarted, so openqa-investigate could take these into account to avoid restarting the same cluster twice. So it seems tempting to simply use that API instead of the clone-job script. However, it has the following limitations so far:
- It only works within the same instance. That shouldn't be a problem here.
- It does not allow changing settings. That's not much effort to implement and likely useful anyway.
- The new job is always considered a clone of the original job, and a job can only be restarted if it has no clone yet. I suppose we would need a "detached" mode for the restart API to circumvent that. Likely not much effort to implement.
Would that mean that we move more functionality from the openqa-clone-job script into a lower layer and reuse it for the API and the clone-job script?
But another question: Would that fix the original issue in the best way? As an alternative I see that within openqa-investigate we look if the clone-candidate has siblings and a parent and clone the parent instead of the clone-candidate?
Updated by mkittler over 3 years ago
Would that mean that we move more functionality from the openqa-clone-job script into a lower layer and reuse it for the API and the clone-job script?
That's at least not what I meant in my previous comment. I thought that openqa-investigate would migrate to use the restart API via openqa-cli. I also don't think we can easily move any functionality from openqa-clone-job into the restart API because that script is supposed to work between multiple web UIs.
Would that fix the original issue in the best way?
That's the "cleanest" solution I can currently think of, in the sense that we can reuse all the dependency handling already provided by the restart API and don't have to do a lot of manual calls from the outside, and that changing settings when restarting a job is beneficial anyway. Not sure whether it is the best™ solution, though.
As an alternative I see that within openqa-investigate we look if the clone-candidate has siblings and a parent and clone the parent instead of the clone-candidate?
We could do that. We would then still need to keep track of the job IDs which have actually been cloned, and the output of the clone script isn't meant to be parsed (so far). We would also still not have solved problems "1." and "3." I've mentioned in #81859#note-7. (Problem "3." is likely not so important.)
Updated by okurz over 3 years ago
as discussed in meeting:
- turn to epic
- first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"
- next steps: Extend API to support an atomic operation for "list of jobs with dependencies", then potentially use that for openqa-investigate/client/openqa-clone-job
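The first step could be sketched as follows. This is not the actual implementation: it assumes that the job details expose "parents" and "children" as dicts keyed by dependency type name ("Chained", "Directly chained", "Parallel") with lists of job IDs, and the function names are hypothetical.

```python
import logging

# Assumption: dependency kinds that tie jobs into a cluster which must be
# cloned as a whole; plain chained dependencies are deliberately not listed.
CLUSTER_KINDS = ("Parallel", "Directly chained")


def has_cluster_dependencies(job):
    """Return True if the job has any parent or child of a cluster kind."""
    for relation in ("parents", "children"):
        deps = job.get(relation) or {}
        if any(deps.get(kind) for kind in CLUSTER_KINDS):
            return True
    return False


def should_investigate(job):
    """Abort early for jobs that belong to a multi-machine cluster."""
    if has_cluster_dependencies(job):
        logging.debug("Skipping job %s: unsupported multi-machine cluster", job.get("id"))
        return False
    return True


# Illustrative job details; IDs and layout are made up.
node = {"id": 2, "parents": {"Parallel": [1], "Chained": []}, "children": {}}
single = {"id": 9, "parents": {"Chained": [8]}, "children": {"Parallel": []}}
print(should_investigate(node), should_investigate(single))
```

Checking both parents and children means the check triggers regardless of whether the candidate is the parallel parent or one of its children.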
Updated by mkittler over 3 years ago
Extend API to support an atomic operation for "list of jobs with dependencies"
For the mere "listing" of dependencies we don't need an atomic operation. The job creation should be atomic in the sense that multiple jobs which belong to the same cluster are created by one API call internally using one DB transaction. So the problematic part is how to do the cloning/restarting. (See my comments #81859#note-7 and #81859#note-8.)
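What "atomic" means here can be shown with a small, self-contained Python sketch using sqlite3. It does not reflect openQA's actual schema or code; the table and function names are hypothetical and only demonstrate that a cluster is created either completely or not at all within one transaction.

```python
import sqlite3


def create_cluster(conn, job_names):
    """Insert all jobs of a cluster in one transaction; return their IDs."""
    with conn:  # commits on success, rolls back if an exception escapes
        return [
            conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,)).lastrowid
            for name in job_names
        ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# A complete cluster is stored as a whole.
create_cluster(conn, ["support_server", "node01", "node02"])

# The second insert violates NOT NULL, so the first one is rolled back too.
try:
    create_cluster(conn, ["support_server", None])
except sqlite3.IntegrityError:
    pass

job_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(job_count)
```

The failed second call leaves no half-created cluster behind, which is the property the clone script lacks when it creates jobs one by one.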
Updated by mkittler over 3 years ago
first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"
PR for that: https://github.com/os-autoinst/scripts/pull/67
Updated by openqa_review over 3 years ago
- Due date set to 2021-02-16
Setting due date based on mean cycle time of SUSE QE Tools
Updated by acarvajal over 3 years ago
Hello. I have been seeing the same issue in the HA job groups as well.
I have submitted https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/440 to temporarily disable it. I hope 'HA' is precise enough and does not accidentally disable this for other groups. A quick search through the group names makes me think 'HA' will be good enough, but I will wait for reviews.
Updated by acarvajal over 3 years ago
mkittler wrote:
first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"
PR for that: https://github.com/os-autoinst/scripts/pull/67
Sorry, have not seen this message before submitting the MR.
I will keep an eye on the HA group today, and close the MR if I see that the incomplete re-triggers are gone.
Updated by mkittler over 3 years ago
Thanks because I've admittedly only tested the PR locally and we'll yet have to see how well it works in production.
Updated by acarvajal over 3 years ago
mkittler wrote:
Thanks because I've admittedly only tested the PR locally and we'll yet have to see how well it works in production.
Judging by the HA group yesterday and today, no jobs were automatically re-triggered with :investigate:last_good_tests:, so I think it's working.
The only odd thing I saw was that many jobs were cancelled as obsolete, but I don't believe it is related to this.
Updated by okurz over 3 years ago
- Status changed from Workable to Feedback
@mkittler after your changes are effective also on OSD I can continue in #81868 . As the ticket is phrased we are ok to just prevent incomplete sets being triggered, we do not necessarily need to fix that (now or ever). So, can you do a final check and resolve the ticket?
Updated by mkittler over 3 years ago
I've checked jobs which finished within the last 7 days. There are still jobs cloned with chained dependencies, see:
select jobs.id, child_job_id, parent_job_id, comments.text from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id or jobs.id = job_dependencies.child_job_id where jobs.id >= 5409964 and comments.text like '%Automatic investigation jobs%' and dependency = 1;
But there are no more jobs cloned with other dependencies, see:
select jobs.id, child_job_id, parent_job_id, comments.text from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id or jobs.id = job_dependencies.child_job_id where jobs.id >= 5409964 and comments.text like '%Automatic investigation jobs%' and dependency != 1;
The investigate script still creates an empty comment on these jobs, which could be improved.
Updated by mkittler over 3 years ago
PR for avoiding the empty comment: https://github.com/os-autoinst/scripts/pull/68
Updated by mkittler over 3 years ago
- Status changed from Feedback to Resolved
So, can you do a final check and resolve the ticket?
The mentioned PR has been merged so I'd consider this done as well. We can decide later whether it makes sense to tackle the problems mentioned in #81859#note-7.
Updated by okurz about 3 years ago
- Copied to action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M added