
action #95783

coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens

coordination #103971: [epic] Easy *re*-triggering and cloning of multi-machine tests

Provide support for multi-machine scenarios handled by openqa-investigate size:M

Added by okurz 10 months ago. Updated 25 days ago.

Status:
Blocked
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

See #81859#note-7

Acceptance criteria

  • AC1: "openqa-investigate" triggers complete sets of multi-machine scenarios for investigation
  • AC2: The investigation multi-machine scenarios are categorized like other investigation jobs, e.g. ":investigate:" in the name, outside any job group, no build, etc.

Suggestions

  • Have openqa-clone-job's --skip-chained-deps option only affect chained dependencies but not directly chained dependencies
    • Add --skip-directly-chained-deps
  • Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.
  • Make openqa-investigate use that hook and investigate all jobs that weren't successful. It should use --max-depth 0 to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times as openqa-clone-job will already handle dependencies as needed (and therefore might clone already multiple jobs we need to investigate in one go). The tracking should be easy because openqa-clone-job --json-output has already been implemented.
  • Add an opt-out so users who consider these tests a waste of time won't complain, e.g. configurable via a test variable
    • We could also make it opt-in, i.e. keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable
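The tracking suggested above could be sketched roughly like this. This is a minimal Python sketch of the logic only (the real scripts are shell), and the JSON shape returned by `openqa-clone-job --json-output` (a map from original to new job id) is an assumption here, not the documented format:

```python
import json


def investigate_tree(failed_job_ids, clone):
    """Given all failed jobs of a finished dependency tree, investigate each
    one, but skip jobs already cloned as part of an earlier cluster clone.

    `clone(job_id)` stands in for invoking
    `openqa-clone-job --max-depth 0 --json-output <url>` and returning its
    JSON output as a string. Returns the set of handled job ids."""
    handled = set()
    for job_id in failed_job_ids:
        if job_id in handled:
            continue  # already cloned as a dependency of a previous job
        cloned = json.loads(clone(job_id))
        # record every job the clone call touched, not just the root
        handled.update(int(orig) for orig in cloned)
    return handled
```

With `--max-depth 0` one clone call covers the whole parallel cluster, so a second failed job from the same cluster is skipped without a second clone call.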

Out of scope

  • Multiple root jobs. We can consider that a future ticket for now.
  • Spawning too many investigation jobs under high load. We could consider such jobs as low priority and drop them (user story, not technical definition).

Related issues

Related to openQA Project - action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:MResolved

Related to openQA Project - action #71809: Enable multi-machine jobs trigger without "isos post"Resolved2020-09-24

Related to openQA Project - action #69976: Show dependency graph for cloned jobsResolved2020-08-13

Related to QA - action #107014: trigger openqa-trigger-bisect-jobs from our automatic investigations whenever the cause is not already known size:MResolved2022-02-17

Related to openQA Project - action #110518: Call job_done_hooks if requested by test setting (not only openQA config as done so far) size:MResolved2021-09-18

Related to openQA Project - action #110530: Do NOT call job_done_hooks if requested by test settingResolved2021-09-18

Copied from openQA Project - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenariosResolved2021-01-07

History

#1 Updated by okurz 10 months ago

  • Copied from action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios added

#2 Updated by okurz 10 months ago

Maybe this can be handled more easily by allowing the restart API route to take optional test parameters, the same as openqa-clone-job does.

#3 Updated by okurz 6 months ago

  • Related to action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M added

#4 Updated by okurz 5 months ago

  • Related to action #71809: Enable multi-machine jobs trigger without "isos post" added

#5 Updated by okurz 5 months ago

  • Description updated (diff)

fixed typo in description

#6 Updated by okurz 5 months ago

  • Parent task set to #103971

#7 Updated by okurz 4 months ago

  • Target version changed from future to Ready

#8 Updated by mkittler 4 months ago

  • Assignee set to mkittler

#10 Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

All PRs for openQA have been merged. I'm waiting for the deployment on o3 to test the investigation script.

#11 Updated by mkittler 4 months ago

I checked out https://github.com/os-autoinst/scripts/pull/127 on o3 15 minutes ago, so recent investigation runs have been using my changes. So far I couldn't spot any problems in the logs. Investigation jobs are also still being created, e.g. https://openqa.opensuse.org/tests/2170735, https://openqa.opensuse.org/tests/2170567, https://openqa.opensuse.org/tests/2170569. They look good (name is correct, outside of group, origin referenced, job not shown as clone). So I suppose it is safe to merge the PR.

#12 Updated by mkittler 4 months ago

I encountered a few problems when looking into the logs again today so I created https://github.com/os-autoinst/scripts/pull/128 and https://github.com/os-autoinst/scripts/pull/129 to address them.

#13 Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress

guillaume_g reported broken links in https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$Vh0JxSRfc4-GGh6Znu2FtDA1mKew1_Wt3kitW7owOFM?via=libera.chat&via=matrix.org&via=gnome.org . I think we have a recent regression in openqa-investigate, see the broken links in https://openqa.opensuse.org/tests/2172673#comments which should be absolute links. mkittler is informed and is on it.

#14 Updated by mkittler 4 months ago

The problem mentioned in the previous comment should be fixed now while https://github.com/os-autoinst/scripts/pull/129 is still pending.


I also got feedback. There are multiple remaining issues:

  1. Since we don't consider investigation jobs as clones, our usual restriction of displaying only the latest jobs in the "clone/restart chain" in the dependency tab doesn't apply. Therefore we see one big dependency tree containing all the jobs. It doesn't mean there's a problem with restarting the dependency cluster in general, but the way it is displayed is rather confusing. This problem is related to #69976 (and might be solved within that ticket).
  2. It is problematic if there are multiple failing "root" jobs (like here: https://openqa.suse.de/tests/8088091). The investigate script then restarts one of these "root" jobs first, thereby also restarting the rest of the dependency tree reachable from that "root" going down the tree. It would, however, not go up and also restart the other "root" job. That one would then be restarted separately, which might lead to restarting some jobs further down the dependency tree multiple times as they have already been restarted for the first "root" job.
    1. I suppose we would need to also go up the tree when restarting the first "root" job if it has failed.
    2. If we follow the previous idea we would restart multiple failing jobs at the same time in the same way (although their "last good" jobs might differ). That's a limitation we'd likely have to accept.
  3. The investigation can restart many jobs. Especially when we are short on resources it seems like a waste to trigger so many jobs (e.g. PowerPC jobs are problematic right now).

I'm wondering whether I should re-add the guard that avoids investigating MM jobs until these problems have been resolved.

#15 Updated by okurz 4 months ago

mkittler wrote:

I also got feedback. There are multiple remaining issues:
[…]

For now I tend to keep MM investigation jobs enabled. I understand "1." as something that is only visual and I assume that test reviewers will learn how to interpret the extended dependency trees correctly. The problem "2." regarding root jobs I don't fully understand, but as you see it as an acceptable limitation, so do I for now :) . "3." shouldn't stop us: we should reward maintainers of tests with a low fail ratio by giving them additional tools in the form of investigation jobs in case tests do fail. This means that maintainers of unstable tests need to wait longer for all jobs to complete, which might be a good motivation for them to stabilize tests or optimize runtime :) . In most cases test reviewers would likely restart jobs anyway in case of failures to find out reproducibility. That is wasteful and causes busy-waiting of reviewers, whereas investigation jobs can already provide that information at the time of review because they are triggered as soon as a job fails and not only when jobs are reviewed for the first time.

#16 Updated by mkittler 4 months ago

I thought about it a bit further:

  • If we implement #69976, that won't help with 1. because the implementation will rely on clone_id being set. (At least I don't see any other way.)
  • It shouldn't be a problem, though: I suppose if 2. is fixed then 1. is fixed as well because then the "weird overlap"¹ between the new and old dependency tree is gone.
  • For 3. we can likely come up with an exclude regex. (Maybe the existing exclude regexes are even already enough.)

So I'll create a draft to show how 2. would look. DONE: https://github.com/os-autoinst/openQA/pull/4498


¹ Once all parents are cloned as well, jobs in the new dependency tree will not refer to any job in the old dependency tree anymore. The current problem is that new children still have parents from the old cluster (as those parents are not cloned too), effectively making it one big dependency tree.

#17 Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

#18 Updated by dzedro 4 months ago

Will the PR also fix these lonely support servers? I assume they are created by openqa-investigate:
https://openqa.suse.de/tests/8119478
https://openqa.suse.de/tests/8119479

#19 Updated by tinita 4 months ago

I added another test to openqa-investigate: https://github.com/os-autoinst/scripts/pull/133

#20 Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress

Some problems have been found with this, e.g. reported in https://suse.slack.com/archives/C02CANHLANP/p1644483914332169 . mkittler can you please look into https://openqa.suse.de/tests/8133359#comments to understand why there are 4 retry jobs and 6 last good build ones? As discussed we merged https://github.com/os-autoinst/scripts/pull/131 and https://github.com/os-autoinst/scripts/pull/134 to mitigate some problems. Please closely observe the impact.

#21 Updated by mkittler 4 months ago

With https://github.com/os-autoinst/scripts/pull/131 and https://github.com/os-autoinst/scripts/pull/134 being merged we're basically back to where we were before. There are still multiple problems to address.

  1. I suppose the fix for 2. (from #note-14) worked but had the side effect of restarting even more jobs, which can be problematic (see 3. from #note-14).
  2. We still accidentally restarted jobs twice because the tracking of already restarted jobs only works within the scope of one openqa-investigate run. We somehow need to ensure that further openqa-investigate invocations skip jobs which have already been investigated as a dependency.
    1. I suppose jobs like "08134285-sle-15-SP4-Online-ppc64le-RAID1:investigate:retry*:investigate:last_good_build*:91.2@ppc64le-hmc-4disk" (mind the 2nd ":investigate:…" suffix; https://openqa.suse.de/tests/8134285) are also due to that, but it could as well be a problem within the general skipping logic or the skipping logic within one script invocation.
    2. Maybe the investigate script could record all restarted job IDs in a persistent text file to keep track of restarted jobs beyond the scope of one execution.
    3. Or we finally make it a real openQA feature - although doing the tracking within openQA itself would likely also be quite some work.
  3. Sometimes users also restart the jobs manually, e.g. https://openqa.suse.de/tests/8133359. In this case it is really annoying that #69976 is still unresolved because knowing the original dependency tree would be very useful.
  4. That we get many investigation jobs, e.g. as shown in https://openqa.suse.de/tests/8133359#comment-484933, is unfortunately "normal" if we really want this feature. Of the 10 jobs in the mentioned comment, 8 are actually valid (and only two of them are cases as mentioned in 2.1). If that is too much we would need to reconsider the whole feature.
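The persistent-text-file idea from 2.2 could look roughly like this. This is a minimal Python sketch of the logic only; the function name is made up, and concurrent invocations would additionally need file locking (e.g. flock), omitted here for brevity:

```python
import os


def mark_investigated(job_id, state_file):
    """Record a restarted job id in a plain text file so that later
    openqa-investigate invocations can skip it.

    Returns True if the job was newly recorded, False if it had already
    been investigated by a previous invocation."""
    handled = set()
    if os.path.exists(state_file):
        with open(state_file) as f:
            handled = {line.strip() for line in f}
    if str(job_id) in handled:
        return False  # already investigated in an earlier run
    with open(state_file, "a") as f:
        f.write(f"{job_id}\n")
    return True
```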

#22 Updated by cdywan 4 months ago

dzedro wrote:

Will the PR also fix these lonely support servers? I assume they are created by openqa-investigate:
https://openqa.suse.de/tests/8119478
https://openqa.suse.de/tests/8119479

These don't have investigate in the name, so to my understanding they couldn't have been.

#23 Updated by okurz 4 months ago

  • Priority changed from Low to Urgent

mkittler, the situation seems to be severely broken now.

I executed

echo https://openqa.suse.de/tests/8140400 | host=openqa.suse.de openqa-investigate

with git commit 054cb9c of os-autoinst/scripts, getting https://openqa.suse.de/tests/8140400#comment-485767 which looks complete with 4 investigation jobs triggered. With current master (3ae1f1a) I got https://openqa.suse.de/tests/8140400#comment-485765 with only two jobs, the "retry" and "last_good_build" ones. So what we observed today in the review training session was a regression introduced by one of your recent changes.

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

#24 Updated by cdywan 4 months ago

okurz wrote:

a regression by one of your recent changes.

Just to be clear, this is only preventing investigation jobs from running and it's not interfering with cloning or jobs outside of that? So an immediate revert or rollback may not be necessary (that would be my thought anyway).

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

Does that mean the "is clone" flag is not set reliably? Maybe it'd be best to identify where this happens and extend test coverage before attempting a fix.

#25 Updated by mkittler 4 months ago

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

Right, the clone=0 flag only means the restarted jobs won't be considered clones. But if the jobs have already been restarted it is of no help.


Maybe it is best to revert the changes to the scripts repo completely then. At least I likely won't manage to debug and fix the problems today.

#27 Updated by openqa_review 3 months ago

  • Due date set to 2022-02-26

Setting due date based on mean cycle time of SUSE QE Tools

#28 Updated by okurz 3 months ago

  • Related to action #69976: Show dependency graph for cloned jobs added

#29 Updated by okurz 3 months ago

  • Due date deleted (2022-02-26)
  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to Normal

The urgency was addressed with https://github.com/os-autoinst/scripts/pull/135 merged and deployed.

As discussed we pull in #69976 first and let's see if that helps us.

#30 Updated by mkittler 3 months ago

  • Status changed from Blocked to Feedback

#69976 has been resolved so I'm unblocking the issue.

However, there are still open questions which should be answered before proceeding:

  1. Should I follow the "restart API approach"? I would need to change clone=0 so it ignores whether a job has already been restarted (and restarts it anyway). Not sure what side effects this will have.
  2. I could also implement the remaining ACs of #103971 and keep using the clone script here. I wouldn't have to care about the restart API anymore, but it comes with its own difficulties (like implementing those ACs in the first place, and parsing the output of the clone script to keep track of all jobs which have been cloned).
  3. Regardless of whether I choose 1. or 2., the problem of keeping track of already restarted jobs between multiple invocations needs to be solved. Maybe it makes sense to simply store handled job IDs in a simple SQLite file? (SQLite would implement searching, updating and locking for us, that's why I bring it up. So it might even be simpler to use than writing to a text file but still just a single additional file. The required SQL should be trivial.)
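The SQLite idea from point 3 could be sketched like this (a minimal Python sketch; the table and function names are made up for illustration - the real script would presumably shell out to the sqlite3 CLI instead):

```python
import sqlite3


def mark_handled(db_path, job_id):
    """Record a restarted job id in a single-file SQLite database.

    SQLite gives us searching, updating and locking for free: the PRIMARY
    KEY constraint rejects duplicates, so a second invocation for the same
    job simply gets False back."""
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS handled_jobs (job_id INTEGER PRIMARY KEY)")
        try:
            with con:  # commits on success
                con.execute("INSERT INTO handled_jobs VALUES (?)", (job_id,))
            return True
        except sqlite3.IntegrityError:
            return False  # already recorded by an earlier invocation
    finally:
        con.close()
```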

#31 Updated by okurz 3 months ago

mkittler wrote:

#69976 has been resolved so I'm unblocking the issue.

However, there are still open questions which should be answered before proceeding:

  1. Should I follow the "restart API approach"? I would need to change clone=0 so it ignores whether a job has already been restarted (and restarts it anyway). Not sure what side effects this will have.

Don't we already have a "force" flag which we use when assets are missing and then the user can force-restart if they still want to?

  1. I could also implement the remaining ACs of #103971 and keep using the clone script here. I wouldn't have to care about the restart API anymore, but it comes with its own difficulties (like implementing those ACs in the first place, and parsing the output of the clone script to keep track of all jobs which have been cloned).

How about giving it a try with "openqa-clone-job" for a limited time and see where we can go?

  1. Regardless of whether I choose 1. or 2., the problem of keeping track of already restarted jobs between multiple invocations needs to be solved. Maybe it makes sense to simply store handled job IDs in a simple SQLite file? (SQLite would implement searching, updating and locking for us, that's why I bring it up. So it might even be simpler to use than writing to a text file but still just a single additional file. The required SQL should be trivial.)

SQLite sounds like overkill, but you might be right that it could be easier than parsing a text file in a custom format. However, so far we had stateless scripts, which is even simpler if we can sustain that. As you mentioned, I think it's a good idea to bring this ticket up in the unblock.

#32 Updated by mkittler 3 months ago

Don't we already have a "force" flag which we use when assets are missing and then the user can force-restart if they still want to?

No, I don't think it already allows forcing this (it only allows ignoring missing assets and some settings problematic for directly chained dependencies).

How about giving it a try with "openqa-clone-job" for a limited time and see where we can go?

The "openqa-clone-job" script won't behave very nicely; that's why we currently skip those jobs in the first place. I'll note down more details in the summary of today's conversation.

#33 Updated by mkittler 3 months ago

Recap for those who aren't familiar with the topic:

  1. What is special about MM jobs and how does it relate to openqa-investigate?
    1. Those jobs have parallel job dependencies - so the jobs are scheduled to run at the same time. The scheduler code itself works just fine, so that's actually not a concern here.
    2. Points to improve are:
      1. Posting jobs (via the jobs post API) which have those dependencies. It is possible but only in a non-atomic way, which is problematic as we can end up with half-running parallel jobs.
      2. Thus the openqa-clone-job script, which uses the jobs post API, is affected by those problems.
      3. Thus the openqa-investigate script is affected by those problems as well. In addition, we need to take care not to create redundant investigation jobs.
  2. How does the restart API come into the picture?
    1. The openqa-investigate script could utilize it instead of the openqa-clone-job script as it doesn't need to do inter-openQA-instance cloning.
    2. Note that the openqa-clone-job script is nevertheless used by users, and 1.2.1 and 1.2.2 also impair users. So we still need to take care of openqa-clone-job - regardless of how we handle the investigation.

Summary of today's discussion:

  1. Which dependent jobs do we want/need to restart when investigating?

    1. The general goal is to avoid producing results which are not needed.
    2. It depends on the dependency type:
      1. If a child fails:
        1. Chained parents don't need to be restarted. (Unless they failed, but then the child will be skipped and is thus not investigated anyways.)
        2. Directly chained parents need to be restarted so the chain is not broken. The whole direct chain of parents needs to be restarted (recursively).
        3. Parallel parents need to be restarted as e.g. the "client" job needs the "server" job to run. Presumably parallel siblings can affect each other so the parallel parent's other children and their parallel dependencies need to be restarted as well (resulting in restarting the whole parallel cluster).
      2. If a parent fails:
        1. Chained children don't need to be restarted. (We are mainly interested in finding out why the parent fails, not in producing some further results for the children.)
        2. Directly chained children don't need to be restarted. (Same counts as for regularly chained children.)
        3. Parallel children need to be restarted as e.g. a server crash can maybe only be reproduced if there's a client connecting to the server. Presumably nested parallel children are important as well so they need to be restarted as well (resulting in restarting the whole parallel cluster).
      3. If a "job in the middle" fails (a job which has parents and children at the same time):
        1. Both previous points apply. So parents and children need (or don't need) to be restarted as explained in the previous points.
      4. Note that "chained" and "directly chained" (and "parallel") are distinct dependency types. A dependency only has one of these types and a directly chained dependency is not a chained dependency at the same time. So 1.2.1.1 and 1.2.1.2 don't contradict each other.
  2. Can we investigate each failure "in isolation"?

    1. First an example what "in isolation" would mean:
      1. Assume we have 2 failed parallel children within the same cluster.
      2. Assume we would create 4 investigation jobs per failed child without considering dependencies.
      3. For each investigation job we would clone the whole cluster as explained in 1.2.1, let's say 3 jobs.
      4. That would make 24 clones in total (number of failed jobs * number of investigation jobs per failed job * number of dependent jobs to be cloned per job).
    2. This might be acceptable in general …
    3. … but we need to think at least about making exceptions as well.
    4. Alternative: Somehow "merge" 2.1.2 and 2.1.3 so we would only have X investigation jobs per dependency tree.
      1. In the example from 2.1 we would end up with "only" 12 jobs (number of investigation jobs per failed job * number of dependent jobs to be cloned per job).
      2. To achieve that we needed to keep track of which jobs we have already investigated which could be done on different levels:
        1. The openqa-investigate script keeps track, e.g.
          1. using a SQLite file as suggested in #95783#note-30 (or some other persistent storage).
          2. by adding a special comment or job setting in all cloned jobs (basically utilizing openQA's database).
        2. openQA invokes the post-fail-hook per "dependency tree" and not per job.
          1. So the post-fail-hook would receive a list of all failed job IDs within the cluster and not just a single job ID.
          2. The investigate script would then loop over these job IDs and skip jobs which have already been cloned in a previous iteration (or only create a comment there).
        3. For openQA the previous point boils down to invoking hooks only for dependency trees where all jobs have been cancelled/done. The problem here is that multiple jobs can end up cancelled/done at the same time.
          1. Maybe Minion locks can help here. However, it would be very problematic to run only one finalize_job_results task at a time (finalize_job_results tasks would pile up because of a blocker - we allow hook scripts to run for 5 minutes and it can indeed sometimes take a while in practice). So more fine-grained locking would be required. Unfortunately we don't have the concept of a "dependency tree ID" in openQA (which could simply be used as the lock name).
          2. Maybe there's a way to query whether a dependency tree is "pending" in a single SQL query to avoid the race condition.
          3. If the previous is not possible, we could use a database transaction to avoid the race condition. The following should do the trick, right? $schema->storage->dbh->prepare('SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY DEFERRABLE;')->execute();
  3. What would be necessary to change within the openqa-clone script to implement 1.?

    1. Support for posting multiple jobs at once (so it happens atomically) and use the API in the clone script.
      1. For parallel dependencies we could skip this, relying on the scheduler's ability to repair half-scheduled clusters. However, that wouldn't cover all cases and might not work nicely as not enough worker slots might be available.
    2. Support for cloning only parallel children (for 1.2.2.3) but not any kind of chained children (for 1.2.2.1 and for 1.2.2.2).
      1. There's already --clone-children but I suppose it affects all kinds of children. We'd likely need --clone-parallel-children in addition.
    3. Note that skipping chained parents (for 1.2.1.1) while still cloning directly chained and parallel parents (for 1.2.1.2 and 1.2.1.3) should already be possible by specifying --skip-chained-deps.
    4. Note that we could skip 3.2 at the cost of also cloning all kinds of child jobs (per investigation).
    5. For 2.4.2 we would need to implement a machine-readable output format to keep track of the cloned jobs unless we decide for 2.4.2.2.
  4. What would be necessary to change within the restart API to implement 1.?

    1. Rules for restarting dependent jobs are already mostly in line with 1.
    2. Add a flag to skip restarting chained and directly chained children (for 1.2.2.1 and for 1.2.2.2) which would effectively only restart parallel children (for 1.2.2.3).
    3. Add a flag to force the restart even though the job (or some other job in the cluster) has already been restarted.
    4. As per 1. we don't necessarily restart the full dependency tree. So we need to add a flag to avoid creating dependencies between restarted jobs and not-restarted jobs. This is to avoid a connection between the old and the restarted dependency tree which would effectively make it one big dependency tree.
    5. I suppose the previous point is only a display issue so we could skip it.
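The "single SQL query" idea for checking whether a dependency tree is still pending could be sketched as follows. This is only an illustration: the table and column names are assumptions modeled on openQA's schema (jobs, job_dependencies), and SQLite stands in for openQA's actual PostgreSQL database (the recursive CTE syntax shown is valid in both):

```python
import sqlite3

PENDING_TREE_QUERY = """
WITH RECURSIVE
  -- undirected edge list so the traversal walks the tree both up and down
  edges(a, b) AS (
    SELECT parent_job_id, child_job_id FROM job_dependencies
    UNION
    SELECT child_job_id, parent_job_id FROM job_dependencies
  ),
  tree(job_id) AS (
    SELECT :job_id
    UNION
    SELECT e.b FROM edges e JOIN tree t ON e.a = t.job_id
  )
SELECT count(*) FROM jobs j JOIN tree t ON j.id = t.job_id
WHERE j.state NOT IN ('done', 'cancelled')
"""


def pending_jobs_in_tree(con, job_id):
    """Count jobs in the dependency tree containing job_id that are not yet
    done or cancelled; 0 would mean the hook may fire for this tree."""
    return con.execute(PENDING_TREE_QUERY, {"job_id": job_id}).fetchone()[0]
```

Whether a single query like this actually avoids the race condition under concurrent job-finalization still depends on the transaction isolation level, as discussed above.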

#34 Updated by mkittler 3 months ago

Once https://github.com/os-autoinst/openQA/pull/4537 has been merged, point 3.1 from my previous comment is done. This only leaves 3.2 and 3.5 (the latter only if we decide for 2.4.2.1), which should be simple to implement. So maybe it actually makes the most sense to stick with the clone-job script at this point.

#35 Updated by mkittler 3 months ago

PR for implementing --clone-parallel-children in openqa-clone-job: https://github.com/os-autoinst/openQA/pull/4551

With that the only thing left to use openqa-clone-job in openqa-investigate for parallel dependencies would be an additional hook within openQA that would only fire for the whole dependency cluster. (As explained in 2.4.2 there would be other alternatives but an additional hook seems like the best solution.)

#36 Updated by mkittler 3 months ago

Looks like we currently invoke openqa-clone-job with --skip-chained-deps. That breaks 1.2.1.2 (directly chained parents need to be restarted). So I suppose --skip-chained-deps should be changed to only affect chained deps but not directly chained deps (which it currently also skips).

There's one more problem to sort out: even if openQA invokes the hook script for the entire dependency tree, we need to find out which of those jobs should be the "root" job to clone. Or we simply use --max-depth 0 to ensure a parallel cluster is fully cloned in any case and simply ignore jobs we've already cloned. That means we need JSON output in openqa-clone-job after all, to keep track of that (within one openqa-investigate run).

#37 Updated by mkittler 2 months ago

PR for JSON output in openqa-clone-job: https://github.com/os-autoinst/openQA/pull/4564

#38 Updated by okurz 2 months ago

  • Due date set to 2022-03-25

#39 Updated by mkittler 2 months ago

  • Due date changed from 2022-03-25 to 2022-04-25

Changing the due date since we wanted to take a pause. Then the next steps would be:

  1. Have openqa-clone-job's --skip-chained-deps option only affect chained dependencies but not directly chained dependencies. Possibly add --skip-directly-chained-deps in case that's wanted after all. (for 1.2.1.2)
  2. Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.
  3. Make openqa-investigate use that hook and investigate all jobs that weren't successful. It should use --max-depth 0 to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times as openqa-clone-job will already handle dependencies as needed (and therefore might clone already multiple jobs we need to investigate in one go). The tracking should be easy because openqa-clone-job --json-output has already been implemented.
  4. -> #110530 : add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests as a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient.
    1. We could also make it opt-in, i.e. keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.

#40 Updated by mkittler about 2 months ago

  • Related to action #107014: trigger openqa-trigger-bisect-jobs from our automatic investigations whenever the cause is not already known size:M added

#41 Updated by mkittler about 2 months ago

I've added #107014 as related because I suppose everything applies to openqa-trigger-bisect-jobs as well if we run it as a hook script in production.

#42 Updated by mkittler about 1 month ago

  • Due date changed from 2022-04-25 to 2022-05-02

Since okurz gave the impulse to delay this, I suppose he as product owner should decide when we want to resume here. So I'm setting the due date to next week to decide by then when we want to continue. Additionally, some feedback on my last comments would be appreciated.

#43 Updated by cdywan about 1 month ago

mkittler wrote:

  1. Maybe add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests as a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient.
    1. We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.

I'd suggest making it opt-out via a job setting provided we have follow-up plans for the "problematic" cases mentioned above:

  • Multiple root jobs. We can consider that a future ticket for now.
  • Spawning too many investigation jobs under high load. We could consider such jobs as low priority and drop them (user story, not technical definition).

#44 Updated by okurz 26 days ago

  • Description updated (diff)
  • Due date changed from 2022-05-02 to 2022-05-09

#45 Updated by okurz 25 days ago

  • Related to action #110518: Call job_done_hooks if requested by test setting (not only openQA config as done so far) size:M added

#46 Updated by okurz 25 days ago

  • Related to action #110530: Do NOT call job_done_hooks if requested by test setting added

#47 Updated by okurz 25 days ago

  • Due date deleted (2022-05-09)
  • Status changed from Feedback to Blocked

mkittler, please wait for #110530 first.

#48 Updated by okurz 25 days ago

okurz and mkittler discussed:

mkittler wrote:

Changing the due date since we wanted to make a pause. Then the next steps would be:

  1. Have openqa-clone-job's --skip-chained-deps option only affect chained dependencies but not directly chained dependencies. Possibly add --skip-directly-chained-deps in case that's wanted after all. (for 1.2.1.2)

I suggest to treat directly chained dependencies as unsupported for now -> so in openqa-investigate itself, instead of skipping both parallel and directly chained deps, please only skip directly chained deps as unsupported.

  1. Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.

Wait for #110176 first

  1. Make openqa-investigate use that hook and investigate all jobs that weren't successful. It should use --max-depth 0 to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times as openqa-clone-job will already handle dependencies as needed (and therefore might clone already multiple jobs we need to investigate in one go). The tracking should be easy because openqa-clone-job --json-output has already been implemented.

This could be directly done but in general we better wait for #110530

  1. -> #110530 : add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests as a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient.
    1. We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.

#49 Updated by okurz 25 days ago

  • Subject changed from Provide support for multi-machine scenarios handled by openqa-investigate to Provide support for multi-machine scenarios handled by openqa-investigate size:M
