action #81859

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #80828: [epic] Trigger 'auto-review' and 'openqa-investigate' from within openQA when jobs incomplete or fail on o3+osd

openqa-investigate triggers incomplete sets for multi-machine scenarios

Added by okurz 5 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-01-07
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

From https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/425#note_286971 "MM tests are not restarted properly, e.g. of 3 node MM test are restarted only two nodes and twice, node one with supportserver and node two with supportserver. https://openqa.suse.de/tests/5252051#dependencies"

Acceptance criteria

  • AC1: No incomplete multi-machine clusters are triggered by "openqa-investigate", i.e. either failing tests in multi-machine scenarios trigger correct multi-machine clusters or no wrong investigation jobs are triggered in these cases

Suggestions

  • Explore if the parameter "--clone-children" would help to trigger correct multi-machine test clusters for investigation
  • Try to find a correct way to trigger multi-machine tests for investigation or if not possible exclude from investigation jobs

Related issues

Related to openQA Project - action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd (Resolved)

History

#1 Updated by okurz 5 months ago

As wished by dzedro I accepted an MR to disable investigation jobs for OSD maintenance scenarios although there is nothing OSD nor maintenance specific going on but maybe it makes him less grumpy :) After improving how investigation jobs are triggered for multi-machine scenarios we can re-enable openqa-investigate for this specific use case as well.

#3 Updated by okurz 5 months ago

  • Related to action #81206: Trigger 'openqa-investigate' from within openQA when jobs fail on osd added

#4 Updated by okurz 5 months ago

  • Parent task set to #80828

#5 Updated by okurz 5 months ago

  • Description updated (diff)
  • Status changed from New to Workable

The openqa-clone-job parameter "--clone-children" has been mentioned. It likely comes with caveats. I assume either mkittler or Xiaojing_liu would be able to come up with a viable solution :)

#6 Updated by dzedro 5 months ago

I'm not sure whether MM jobs with 3+ nodes can be cloned and, if so, whether the clone needs special options or the cloned job must contain PARALLEL_WITH listing all nodes.
I created these 3-node jobs to reproduce what was happening with the clone: https://openqa.suse.de/tests/5274167#dependencies
The result of 2x 2 nodes being restarted out of 3 happens with a "simple" clone_job, e.g. when clone_job is run on node 1 and node 2.
As a result, only 2 of 3 nodes are cloned/restarted: https://openqa.suse.de/tests/5289094#dependencies

sudo -u geekotest /usr/share/openqa/script/clone_job.pl --from localhost --host localhost --skip-download --skip-chained-deps 5274167
Cloning dependencies of sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit
Created job #5289085: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_support_server@64bit -> http://localhost/t5289085
Created job #5289086: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit -> http://localhost/t5289086

sudo -u geekotest /usr/share/openqa/script/clone_job.pl --from localhost --host localhost --skip-download --skip-chained-deps --clone-children 5274167
Cloning dependencies of sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit
Created job #5289093: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_support_server@64bit -> http://localhost/t5289093
Created job #5289094: sle-12-SP3-Server-DVD-x86_64-Build12sp3-qam_ha_rolling_update_node01@64bit -> http://localhost/t5289094

#7 Updated by mkittler 5 months ago

dzedro You need to clone the parallel parent, which is the job that has no PARALLEL_WITH setting itself but is referenced via PARALLEL_WITH in another job. So the job you need to clone is usually "the server", e.g. script/openqa-clone-job … --clone-children https://openqa.suse.de/tests/5274166 clones 3 jobs here.


I see the following problems:

  1. The clone job script operates non-atomically. So it can happen that one job of the cluster is cloned successfully and the next one fails. I'm not aware that the successfully cloned job would be discarded again in this case. Hence this looks like a recipe for ending up with half-scheduled clusters anyways.
  2. openqa-investigate considers each job (that it possibly ends up cloning) individually. This way it might end up cloning the same cluster twice if multiple jobs within the same cluster are selected to be cloned. This also means that openqa-investigate possibly ends up cloning just the parallel child instead of the parent. Even with --clone-children this would not recreate the whole cluster, as mentioned before.
  3. Using the clone job script also means that we're relying on the scheduler to repair half-assigned MM clusters and that directly chained dependencies are not supported at all.
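Problem 2 could be sketched roughly as follows: resolve every clone candidate to its parallel parent first, then clone each parent only once. This is only an illustration. The dict-based job representation, the job names and IDs, and the helpers `parallel_parent` and `clone_targets` are all hypothetical and not taken from openQA or openqa-investigate.

```python
def parallel_parent(job, jobs_by_name):
    """Follow PARALLEL_WITH until reaching a job that names no parallel parent."""
    seen = set()
    while job.get("PARALLEL_WITH"):
        if job["id"] in seen:
            raise ValueError("cyclic PARALLEL_WITH chain")
        seen.add(job["id"])
        # Assumption: PARALLEL_WITH is a comma-separated list of job names;
        # for this sketch only the first entry is followed.
        first_parent = job["PARALLEL_WITH"].split(",")[0].strip()
        job = jobs_by_name[first_parent]
    return job


def clone_targets(candidates, jobs_by_name):
    """Return the IDs of the cluster parents to clone, each listed only once."""
    roots = []
    for job in candidates:
        root_id = parallel_parent(job, jobs_by_name)["id"]
        if root_id not in roots:
            roots.append(root_id)
    return roots


# Hypothetical 3-node cluster: two nodes running in parallel with one server.
jobs_by_name = {
    "support_server": {"id": 1, "name": "support_server"},
    "node01": {"id": 2, "name": "node01", "PARALLEL_WITH": "support_server"},
    "node02": {"id": 3, "name": "node02", "PARALLEL_WITH": "support_server"},
}
```

With this data, selecting both nodes as candidates yields a single clone target (the server) instead of two partial clusters.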

#8 Updated by mkittler 5 months ago

By the way, the restart API of openQA already does the right thing for each dependency type automatically and in an atomic way. If I remember correctly, it also returns the IDs of the jobs which have been restarted, so openqa-investigate could take these into account to avoid restarting the same cluster twice. So it seems tempting to simply use that API instead of the clone-job script. However, it has the following limitations so far:

  1. It only works within the same instance. That shouldn't be a problem here.
  2. It does not allow changing settings. That's not much effort to implement and likely useful anyways.
  3. The new job is always considered a clone of the original job, and a job can only be restarted if it has no clone yet. I suppose we would need a "detached" mode for the restart API to circumvent that. Likely not much effort to implement.
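To illustrate how returned job IDs could prevent restarting the same cluster twice, here is a minimal sketch. It assumes a callable `restart(job_id)` that returns the IDs of all original jobs covered by that restart; this callable, `fake_restart`, and the job IDs are stand-ins for illustration and do not reflect the actual restart API response format.

```python
def restart_once_per_cluster(candidate_ids, restart):
    """Restart each candidate unless an earlier restart already covered it.

    `restart` is a hypothetical callable that takes a job ID and returns the
    IDs of all jobs that were restarted together with it.
    """
    handled = set()
    issued = []
    for job_id in candidate_ids:
        if job_id in handled:
            continue  # this job's cluster was already restarted
        handled.update(restart(job_id))
        handled.add(job_id)
        issued.append(job_id)
    return issued


def fake_restart(job_id):
    """Stand-in for the real API: jobs 1, 2 and 3 form one cluster."""
    cluster = {1, 2, 3}
    return sorted(cluster) if job_id in cluster else [job_id]
```

Here `restart_once_per_cluster([2, 3, 7], fake_restart)` issues restarts for jobs 2 and 7 only, because restarting job 2 already covers job 3.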

#9 Updated by okurz 5 months ago

mkittler wrote:

By the way, the restart API of openQA already does the right thing for each dependency type automatically and in an atomic way. If I remember correctly, it also returns the IDs of the jobs which have been restarted, so openqa-investigate could take these into account to avoid restarting the same cluster twice. So it seems tempting to simply use that API instead of the clone-job script. However, it has the following limitations so far:

  1. It only works within the same instance. That shouldn't be a problem here.
  2. It does not allow changing settings. That's not much effort to implement and likely useful anyways.
  3. The new job is always considered a clone of the original job, and a job can only be restarted if it has no clone yet. I suppose we would need a "detached" mode for the restart API to circumvent that. Likely not much effort to implement.

Would that mean that we move more functionality from the openqa-clone-job script into a lower layer and reuse it for the API and the clone-job script?

But another question: Would that fix the original issue in the best way? As an alternative I see that within openqa-investigate we look if the clone-candidate has siblings and a parent and clone the parent instead of the clone-candidate?

#10 Updated by mkittler 5 months ago

Would that mean that we move more functionality from the openqa-clone-job script into a lower layer and reuse it for the API and the clone-job script?

That's at least not what I meant in my previous comment. I thought that openqa-investigate would migrate to use the restart API via openqa-cli. I also don't think we can easily move any functionality from openqa-clone-job into the restart API because that script is supposed to work between multiple web UIs.

Would that fix the original issue in the best way?

That's the "cleanest" solution I can currently think of, in the sense that we can reuse all the dependency handling already provided by the restart API and don't have to do lots of manual calls from the outside. Being able to change settings when restarting a job is beneficial anyways. Not sure whether it is the best™ solution, though.

As an alternative I see that within openqa-investigate we look if the clone-candidate has siblings and a parent and clone the parent instead of the clone-candidate?

We could do that. However, we would still need to keep track of the job IDs which have actually been cloned, and the output of the clone script isn't meant to be parsed (so far). We would also still not have solved problems "1." and "3." which I mentioned in #81859#note-7. (Problem "3." is likely not so important.)

#11 Updated by okurz 5 months ago

as discussed in meeting:

  • turn to epic
  • first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"
  • next steps: Extend API to support an atomic operation for "list of jobs with dependencies", then potentially use that for openqa-investigate/client/openqa-clone-job
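The first step (detect siblings and abort early) could look roughly like the sketch below. It assumes the job is available as a dict with `parents` and `children` entries that each map a dependency type such as `Parallel` to a list of job IDs. That shape, the logger name and the function names are assumptions made for illustration, not the actual openqa-investigate implementation.

```python
import logging

log = logging.getLogger("investigate-sketch")


def has_parallel_dependencies(job):
    """Return True if the job has any parallel parent or parallel child."""
    for relation in ("parents", "children"):
        if job.get(relation, {}).get("Parallel"):
            return True
    return False


def should_investigate(job):
    """Abort early for jobs that belong to a multi-machine cluster."""
    if has_parallel_dependencies(job):
        log.debug("job %s: unsupported multi-machine cluster", job.get("id"))
        return False
    return True
```

Chained dependencies are deliberately not treated as a reason to skip here, since the step discussed in the meeting only concerns multi-machine clusters.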

#12 Updated by mkittler 5 months ago

Extend API to support an atomic operation for "list of jobs with dependencies"

For the mere "listing" of dependencies we don't need an atomic operation. The job creation should be atomic in the sense that multiple jobs which belong to the same cluster are created by one API call internally using one DB transaction. So the problematic part is how to do the cloning/restarting. (See my comments #81859#note-7 and #81859#note-8.)
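As a generic illustration of "one API call internally using one DB transaction", the following sketch inserts all jobs of a cluster inside a single transaction so that either every job is created or none is. It uses an in-memory SQLite database with a made-up `jobs` table; the schema, the UNIQUE constraint (only there to provoke a failure) and the function names are assumptions and unrelated to openQA's real schema.

```python
import sqlite3


def new_db():
    """Create an in-memory database with a minimal, made-up jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    return conn


def create_cluster(conn, job_names):
    """Insert all jobs of one cluster atomically and return their new IDs."""
    ids = []
    # The connection context manager commits on success and rolls back
    # the whole transaction if any statement raises.
    with conn:
        for name in job_names:
            cur = conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
            ids.append(cur.lastrowid)
    return ids
```

If the second insert of a cluster fails, the first one is rolled back as well, which is the property the non-atomic clone script lacks.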

#13 Updated by mkittler 5 months ago

  • Assignee set to mkittler

#14 Updated by mkittler 5 months ago

first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"

PR for that: https://github.com/os-autoinst/scripts/pull/67

#15 Updated by openqa_review 4 months ago

  • Due date set to 2021-02-16

Setting due date based on mean cycle time of SUSE QE Tools

#16 Updated by acarvajal 4 months ago

Hello. I have been seeing the same issue in the HA job groups as well.

I have submitted https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/440 to temporarily disable it. I hope 'HA' is precise enough and does not accidentally disable this for other groups. A quick search through the group names makes me think 'HA' will be good enough, but I will wait for reviews.

#17 Updated by acarvajal 4 months ago

mkittler wrote:

first step as subtask: Detect if there are any siblings for investigate candidate and abort early, optional debug log message about "unsupported multi-machine cluster"

PR for that: https://github.com/os-autoinst/scripts/pull/67

Sorry, I had not seen this message before submitting the MR.

I will keep an eye on the HA group today and close the MR if I see that the incomplete re-triggers are gone.

#18 Updated by mkittler 4 months ago

Thanks, because I've admittedly only tested the PR locally and we have yet to see how well it works in production.

#19 Updated by acarvajal 4 months ago

mkittler wrote:

Thanks, because I've admittedly only tested the PR locally and we have yet to see how well it works in production.

Judging by the HA group yesterday and today, no jobs were automatically re-triggered with :investigate:last_good_tests:, so I think it's working.

The only odd thing I saw was many jobs being cancelled as obsolete, but I don't believe it is related to this.

#20 Updated by okurz 4 months ago

  • Status changed from Workable to Feedback

mkittler after your changes are effective on OSD as well, I can continue in #81868. As the ticket is phrased, we are ok to just prevent incomplete sets from being triggered; we do not necessarily need to fix that (now or ever). So, can you do a final check and resolve the ticket?

#21 Updated by mkittler 4 months ago

I've been checking jobs which finished within the last 7 days. There are still jobs cloned with chained dependencies, see:

select jobs.id, child_job_id, parent_job_id, comments.text from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id or jobs.id = job_dependencies.child_job_id where jobs.id >= 5409964 and comments.text like '%Automatic investigation jobs%' and dependency = 1;

But there are no more jobs cloned with other dependencies, see:

select jobs.id, child_job_id, parent_job_id, comments.text from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id or jobs.id = job_dependencies.child_job_id where jobs.id >= 5409964 and comments.text like '%Automatic investigation jobs%' and dependency != 1;

The investigate script still creates an empty comment on these jobs, which could be improved.

#22 Updated by mkittler 4 months ago

PR for avoiding the empty comment: https://github.com/os-autoinst/scripts/pull/68

#23 Updated by mkittler 4 months ago

  • Status changed from Feedback to Resolved

So, can you do a final check and resolve the ticket?

The mentioned PR has been merged so I'd consider this done as well. We can decide later whether it makes sense to tackle the problems mentioned in #81859#note-7.

#24 Updated by okurz 3 months ago

  • Due date deleted (2021-02-16)
