action #95783

closed

coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens

coordination #103971: [epic] Easy *re*-triggering and cloning of multi-machine tests

Provide support for multi-machine scenarios handled by openqa-investigate size:M

Added by okurz over 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #81859#note-7

Acceptance criteria

  • AC1: "openqa-investigate" triggers complete sets of multi-machine scenarios for investigation
  • AC2: The investigation multi-machine scenarios are categorized like other investigation jobs, e.g. ":investigate:" in the name, outside any job group, no build, etc.

Suggestions

  • Have openqa-clone-job's --skip-chained-deps option only affect chained dependencies but not directly chained dependencies
  • Add --skip-directly-chained-deps
  • Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.
  • Make openqa-investigate use that hook and investigate all jobs that weren't successful. It should use --max-depth 0 to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times, as openqa-clone-job will already handle dependencies as needed (and therefore might already clone multiple jobs we need to investigate in one go). The tracking should be easy because openqa-clone-job --json-output has already been implemented; a rough sketch follows after this list.
  • Add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests a waste of time won't complain, e.g. configurable via a test variable
    • We could also make it opt-in, i.e. keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.
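
A minimal shell sketch of that tracking idea, purely illustrative: it assumes that openqa-clone-job --json-output lists the IDs of all cloned jobs as numbers in its JSON output (the exact format may differ) and that jq is available; error handling and the usual investigation settings are omitted:

# hypothetical sketch: clone the whole cluster of a failed job and remember which jobs were covered
handled_jobs=" "   # space-separated job IDs already investigated in this run

investigate_job() {
    local job=$1 out
    # skip jobs that were already cloned as part of an earlier cluster in this run
    case "$handled_jobs" in
        *" $job "*) echo "Skipping $job: already investigated"; return 0;;
    esac
    # clone the full parallel cluster (--max-depth 0) but leave chained parents alone
    out=$(openqa-clone-job --json-output --skip-chained-deps --max-depth 0 \
        "https://openqa.opensuse.org/tests/$job") || return 1
    # record every job ID occurring in the JSON output (assumption about the output's shape)
    handled_jobs="$handled_jobs$(echo "$out" | jq -r '.. | numbers' | tr '\n' ' ')"
}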

Out of scope

  • Multiple root jobs. We can consider that a future ticket for now.
  • Spawning too many investigation jobs under high load. We could consider such jobs as low priority and drop them (user story, not technical definition).

Related issues 6 (0 open, 6 closed)

  • Related to openQA Project (public) - action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M (Resolved, mkittler)
  • Related to openQA Project (public) - action #71809: Enable multi-machine jobs trigger without "isos post" (Resolved, mkittler, 2020-09-24)
  • Related to openQA Project (public) - action #69976: Show dependency graph for cloned jobs (Resolved, mkittler, 2020-08-13)
  • Related to openQA Project (public) - action #110518: Call job_done_hooks if requested by test setting (not only openQA config as done so far) size:M (Resolved, mkittler, 2021-09-18)
  • Related to openQA Project (public) - action #110530: Do NOT call job_done_hooks if requested by test setting (Resolved, mkittler, 2021-09-18)
  • Copied from openQA Project (public) - action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios (Resolved, mkittler, 2021-01-07)
Actions #1

Updated by okurz over 3 years ago

  • Copied from action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios added
Actions #2

Updated by okurz over 3 years ago

Maybe this can be handled more easily by allowing the restart API route to take optional test parameters, the same as openqa-clone-job does.

Actions #3

Updated by okurz almost 3 years ago

  • Related to action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M added
Actions #4

Updated by okurz almost 3 years ago

  • Related to action #71809: Enable multi-machine jobs trigger without "isos post" added
Actions #5

Updated by okurz almost 3 years ago

  • Description updated (diff)

fixed typo in description

Actions #6

Updated by okurz almost 3 years ago

  • Parent task set to #103971
Actions #7

Updated by okurz almost 3 years ago

  • Target version changed from future to Ready
Actions #8

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler
Actions #10

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

All PRs for openQA have been merged. I'm waiting for the deployment on o3 to test the investigation script.

Actions #11

Updated by mkittler almost 3 years ago

I checked out https://github.com/os-autoinst/scripts/pull/127 on o3 15 minutes ago, so it has been using my changes for recent investigation runs. So far I couldn't spot any problems in the logs. Investigation jobs are also still being created, e.g. https://openqa.opensuse.org/tests/2170735, https://openqa.opensuse.org/tests/2170567, https://openqa.opensuse.org/tests/2170569. They look good (name is correct, outside of any group, origin referenced, job not shown as clone). So I suppose it is safe to merge the PR.

Actions #12

Updated by mkittler almost 3 years ago

I encountered a few problems when looking into the logs again today so I created https://github.com/os-autoinst/scripts/pull/128 and https://github.com/os-autoinst/scripts/pull/129 to address them.

Actions #13

Updated by okurz almost 3 years ago

  • Status changed from Feedback to In Progress

A problem report about broken links was raised by guillaume_g in https://matrix.to/#/!ilXMcHXPOjTZeauZcg:libera.chat/$Vh0JxSRfc4-GGh6Znu2FtDA1mKew1_Wt3kitW7owOFM?via=libera.chat&via=matrix.org&via=gnome.org . I think we have a recent regression in openqa-investigate, see the broken links in https://openqa.opensuse.org/tests/2172673#comments which should be absolute links. mkittler is informed and is on it.

Actions #14

Updated by mkittler almost 3 years ago

The problem mentioned in the previous comment should be fixed now while https://github.com/os-autoinst/scripts/pull/129 is still pending.


I also got feedback. There are multiple remaining issues:

  1. Since we don't consider investigation jobs as clones, our usual restriction of displaying only the latest jobs in the "clone/restart chain" in the dependency tab doesn't apply. Therefore we see one big dependency tree containing all the jobs. It doesn't mean there's a problem with restarting the dependency cluster in general, but the way it is displayed is rather confusing. This problem is related to #69976 (and might be solved within that ticket).
  2. It is problematic if there are multiple failing "root" jobs (like here: https://openqa.suse.de/tests/8088091). The investigate script then restarts one of these "root" jobs first and thereby also restarts the rest of the dependency tree which is reachable from that "root" going down the tree. It would however not go up and also restart the other "root" job. That job would then be restarted separately, which might lead to restarting some jobs down the dependency tree multiple times as they have already been restarted for the first "root" job.
    1. I suppose we would also need to go up the tree when restarting the first "root" job if it has failed.
    2. If we follow the previous idea we would restart multiple failing jobs at the same time in the same way (although their "last good" jobs might differ). That's a limitation we'd likely have to accept.
  3. The investigation can restart many jobs. Especially when we are short on resources it seems like a waste to trigger so many jobs (e.g. PowerPC jobs are problematic right now).

I'm wondering whether I should re-add the guard that avoids investigating MM jobs until these problems have been resolved.

Actions #15

Updated by okurz almost 3 years ago

mkittler wrote:

I also got feedback. There are multiple remaining issues:
[…]

For now I tend to keep MM investigation jobs enabled. I understand "1." as something that is only visual and I assume that test reviewers will learn how to interpret the extended dependency trees correctly. The problem "2." regarding root jobs I don't fully understand, but as you stated that you see it as an acceptable limitation, so do I for now :) . "3." shouldn't stop us. We should reward maintainers of tests with a low fail-ratio by giving them additional tools in the form of investigation jobs in case tests do fail. This means that maintainers of unstable tests need to wait longer for all jobs to complete, which might be a good motivation for them to stabilize tests or optimize runtime :) In most cases test reviewers would likely restart jobs anyway in case of failures to find out about reproducibility. This is wasteful and causes busy-waiting of reviewers, whereas investigation jobs can already provide that information at the time of review because they are triggered as soon as a job fails and not just when jobs are reviewed for the first time.

Actions #16

Updated by mkittler almost 3 years ago

I thought about it a bit further:

  • If we implement #69976 that won't help us with 1. because the implementation will rely on clone_id being set. (At least I don't see any other way.)
  • It shouldn't be a problem, though: I suppose if 2. is fixed then 1. is fixed as well because then the "weird overlap"¹ between the new and old dependency tree is gone.
  • For 3. we can likely come up with an exclude regex. (Maybe the existing exclude regexes are even already enough.)

So I'll create a draft to show how 2. would look. DONE: https://github.com/os-autoinst/openQA/pull/4498


¹ Jobs in the new dependency tree should not refer to any job in the old dependency tree anymore (which is what effectively makes it one big dependency tree) because all parents would be cloned; the problem here is that new children still keep parents from the old cluster when those parents are not cloned as well.

Actions #17

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback
Actions #18

Updated by dzedro almost 3 years ago

Will the PR also fix these lonely support servers? I assume they are created by openqa-investigate:
https://openqa.suse.de/tests/8119478
https://openqa.suse.de/tests/8119479

Actions #19

Updated by tinita almost 3 years ago

I added another test to openqa-investigate: https://github.com/os-autoinst/scripts/pull/133

Actions #20

Updated by okurz almost 3 years ago

  • Status changed from Feedback to In Progress

Some problems have been found with this, e.g. reported in https://suse.slack.com/archives/C02CANHLANP/p1644483914332169 . @mkittler can you please look into https://openqa.suse.de/tests/8133359#comments to understand why there are 4 retry jobs and 6 last good build ones? As discussed we merged https://github.com/os-autoinst/scripts/pull/131 and https://github.com/os-autoinst/scripts/pull/134 to mitigate some problems. Please closely observe the impact.

Actions #21

Updated by mkittler almost 3 years ago

With https://github.com/os-autoinst/scripts/pull/131 and https://github.com/os-autoinst/scripts/pull/134 being merged we're basically back to where we were before. There are still multiple problems to address.

  1. I suppose the fix for 2. (from #note-14) worked but had the side-effect of restarting even more jobs, which can be problematic (see 3. from #note-14).
  2. We still accidentally restarted jobs twice because the tracking of already restarted jobs only works within the scope of one openqa-investigate run. We somehow need to ensure that further openqa-investigate invocations skip jobs which have already been investigated as a dependency.
    1. I suppose jobs like "08134285-sle-15-SP4-Online-ppc64le-RAID1:investigate:retry*:investigate:last_good_build*:91.2@ppc64le-hmc-4disk" (mind the 2nd ":investigate:…" suffix; https://openqa.suse.de/tests/8134285) are also due to that, but it could also be a problem within the general skipping logic or within the skipping logic of a single script invocation.
    2. Maybe the investigate script could record all restarted job IDs in a persistent text file to keep track of restarted jobs beyond the scope of one execution.
    3. Or we finally make it a real openQA feature, although doing the tracking within openQA itself would likely also be quite some work.
  3. Sometimes users also seem to restart the jobs manually, e.g. https://openqa.suse.de/tests/8133359. In this case it is really annoying that #69976 is still unresolved because knowing the original dependency tree would be very useful.
  4. That we get many investigation jobs, e.g. as shown in https://openqa.suse.de/tests/8133359#comment-484933, is unfortunately "normal" if we really want this feature. Of the 10 jobs in the mentioned comment, 8 are actually valid (and only two of them are cases as mentioned in 2.1). If that is too much we would need to reconsider the whole feature.
Actions #22

Updated by livdywan almost 3 years ago

dzedro wrote:

Will the PR fix also this lonely support servers, I assume they are created by openqa-investigate ?
https://openqa.suse.de/tests/8119478
https://openqa.suse.de/tests/8119479

These don't have investigate in the name, so to my understanding they couldn't have been.

Actions #23

Updated by okurz almost 3 years ago

  • Priority changed from Low to Urgent

@mkittler the situation seems to be severely broken now.

I executed

echo https://openqa.suse.de/tests/8140400 | host=openqa.suse.de openqa-investigate

with git commit 054cb9c of os-autoinst/scripts, getting https://openqa.suse.de/tests/8140400#comment-485767 which looks complete with 4 investigation jobs triggered. With current master (3ae1f1a) I got https://openqa.suse.de/tests/8140400#comment-485765 with only two jobs, the "retry" and the "last_good_build" one. So what we observed today in the review training session was a regression introduced by one of your recent changes.

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

Actions #24

Updated by livdywan almost 3 years ago

okurz wrote:

a regression by one of your recent changes.

Just to be clear, this is only preventing investigation jobs from running and it's not interfering with cloning or jobs outside of that? So an immediate revert or rollback may not be necessary (that would be my thought anyway).

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

Does that mean the "is clone" flag is not set reliably? Maybe it'd be best to identify where this happens and extend test coverage before attempting a fix.

Actions #25

Updated by mkittler almost 3 years ago

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

Right, the clone=0 flag only means the restarted jobs won't be considered clones. But if the jobs have already been restarted it doesn't help.


Maybe it is best to revert the changes to the scripts repo completely then. At least I likely won't manage to debug and fix the problems today.

Actions #27

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-02-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #28

Updated by okurz almost 3 years ago

  • Related to action #69976: Show dependency graph for cloned jobs added
Actions #29

Updated by okurz almost 3 years ago

  • Due date deleted (2022-02-26)
  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to Normal

Urgency was addressed with https://github.com/os-autoinst/scripts/pull/135 merged and deployed.

As discussed we pull in #69976 first and let's see if that helps us.

Actions #30

Updated by mkittler almost 3 years ago

  • Status changed from Blocked to Feedback

#69976 has been resolved so I'm unblocking the issue.

However, there are still open questions which should be answered before proceeding:

  1. Should I follow the "restart API approach"? I would need to change clone=0 so it ignores whether a job has already been restarted (and restarts it anyway). Not sure what side-effects this would have.
  2. I could also implement the remaining ACs of #103971 and keep using the clone script here. I wouldn't have to care about the restart API anymore but it comes with its own difficulties (like implementing those ACs in the first place and parsing the output of the clone script to keep track of all jobs which have been cloned).
  3. Regardless of whether I'd choose 1. or 2., the problem of keeping track of already restarted jobs between multiple invocations needs to be solved. Maybe it makes sense to simply store handled job IDs in a simple SQLite file? (SQLite would implement searching, updating and locking for us, that's why I bring it up. So it might even be simpler to use than writing to a text file while still being just a single additional file. The required SQL should be trivial.) A small sketch of that idea follows below.
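
A minimal sketch of the SQLite idea, purely illustrative: it assumes the sqlite3 command-line client is available on the machine running the script and that it exits non-zero on a constraint violation; the file path is made up:

# hypothetical: persistent record of already investigated job IDs across openqa-investigate runs
db=/var/lib/openqa-investigate/handled-jobs.sqlite
sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS handled_jobs (job_id INTEGER PRIMARY KEY);'

mark_job_handled() {
    # succeeds (and records the ID) only the first time a job is seen; SQLite's own locking
    # makes this safe even if several script invocations run concurrently
    sqlite3 "$db" "INSERT INTO handled_jobs (job_id) VALUES ($1);" 2>/dev/null
}

mark_job_handled 8134285 && echo "investigating job 8134285" || echo "job 8134285 already handled"
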
Actions #31

Updated by okurz almost 3 years ago

mkittler wrote:

#69976 has been resolved so I'm unblocking the issue.

However, there are still open questions which should be answered before proceeding:

  1. Should I follow the "restart API approach"? I would need to change clone=0 so it ignores whether a job has already been restarted (and restarts it anyway). Not sure what side-effects this would have.

Don't we already have a "force" flag which we use when assets are missing and then the user can force-restart if they still want to?

  2. I could also implement the remaining ACs of #103971 and keep using the clone script here. I wouldn't have to care about the restart API anymore but it comes with its own difficulties (like implementing those ACs in the first place and parsing the output of the clone script to keep track of all jobs which have been cloned).

How about giving it a try with "openqa-clone-job" for a limited time and see where we can go?

  3. Regardless of whether I'd choose 1. or 2., the problem of keeping track of already restarted jobs between multiple invocations needs to be solved. Maybe it makes sense to simply store handled job IDs in a simple SQLite file? (SQLite would implement searching, updating and locking for us, that's why I bring it up. So it might even be simpler to use than writing to a text file while still being just a single additional file. The required SQL should be trivial.)

SQLite sounds like overkill but you might be right that it could be easier than parsing a text file in a custom format. However, so far we had stateless scripts, which is even simpler if we can sustain that. As you mentioned, I think it's a good idea to bring this ticket up in the unblock meeting.

Actions #32

Updated by mkittler almost 3 years ago

Don't we already have a "force" flag which we use when assets are missing and then the user can force-restart if they still want to?

No, I don't think it already allows forcing this (only ignoring missing assets and some settings problematic for directly chained dependencies).

How about giving it a try with "openqa-clone-job" for a limited time and see where we can go?

The "openqa-clone-job" script won't behave very nicely. That's why we currently skip those jobs in the first place. I'll note down more details in the summary of conversation from today.

Actions #33

Updated by mkittler almost 3 years ago

Recap for those who aren't familiar with the topic:

  1. What is special about MM jobs and how does it relate to openqa-investigate?
    1. Those jobs have parallel job dependencies - so the jobs are scheduled to run at the same time. The scheduler code itself works just fine so that's actually not a concern here.
    2. Points to improve are:
      1. Posting jobs (via the jobs post API) which have those dependencies. It is possible but only in a non-atomic way which is problematic as we end up with half-running parallel jobs.
      2. Thus the openqa-clone-job script, which uses the jobs post API, is affected by those problems.
      3. Thus the openqa-investigate script is affected by those problems. In addition, we need to take care not to create redundant investigation jobs.
  2. How does the restart API come into the picture?
    1. The openqa-investigate script could utilize it instead of the openqa-clone-job script as it doesn't need to do inter-openQA-instance cloning.
    2. Note that the openqa-clone-job script is nevertheless used by users and 1.2.1 and 1.2.2 are also impairing users. So we still need to take care of openqa-clone-job - regardless of how we handle the investigation.

Summary of today's discussion:

  1. Which dependent jobs do we want/need to restart when investigating?

    1. The general goal is to avoid producing results which are not needed.
    2. It depends on the dependency type:
      1. If a child fails:
        1. Chained parents don't need to be restarted. (Unless they failed, but then the child will be skipped and thus not investigated anyway.)
        2. Directly chained parents need to be restarted so the chain is not broken. The whole direct chain of parents needs to be restarted (recursively).
        3. Parallel parents need to be restarted as e.g. the "client" job needs the "server" job to run. Presumably parallel siblings can affect each other so the parallel parent's other children and their parallel dependencies need to be restarted as well (resulting in restarting the whole parallel cluster).
      2. If a parent fails:
        1. Chained children don't need to be restarted. (We are mainly interested in finding out why the parent fails, not in producing some further results for the children.)
        2. Directly chained children don't need to be restarted. (The same applies as for regularly chained children.)
        3. Parallel children need to be restarted as e.g. a server crash can maybe only be reproduced if there's a client connecting to the server. Presumably nested parallel children are important as well so they need to be restarted as well (resulting in restarting the whole parallel cluster).
      3. If a "job in the middle" fails (a job which has parents and children at the same time):
        1. Both previous points apply. So parents and children need (or don't need) to be restarted as explained in the previous points.
      4. Note that "chained" and "directly chained" (and "parallel") are distinct dependency types. A dependency only has one of these types and a directly chained dependency is not a chained dependency at the same time. So 1.2.1.1 and 1.2.1.2 don't contradict each other.
  2. Can we investigate each failure "in isolation"?

    1. First an example what "in isolation" would mean:
      1. Assume we have 2 failed parallel children within the same cluster.
      2. Assume we would create 4 investigation jobs per failed child without considering dependencies.
      3. For each investigation job we would clone the whole cluster as explained in 1.2.1, let's say 3 jobs.
      4. That would make 24 clones in total (number of failed jobs * number of investigation jobs per failed job * number of dependent jobs to be cloned per job).
    2. This might be acceptable in general …
    3. … but we need to think at least about making exceptions as well.
    4. Alternative: Somehow "merge" 2.1.2 and 2.1.3 so we would only have X investigation jobs per dependency tree.
      1. In the example from 2.1 we would end up with "only" 12 jobs (number of investigation jobs per failed job * number of dependent jobs to be cloned per investigation job).
      2. To achieve that we would need to keep track of which jobs we have already investigated, which could be done on different levels:
        1. The openqa-investigate script keeps track, e.g.
          1. using a SQLite file as suggested in #95783#note-30 (or some other persistent storage).
          2. by adding a special comment or job setting in all cloned jobs (basically utilizing openQA's database).
        2. openQA invokes the post-fail-hook per "dependency tree" and not per job.
          1. So the post-fail-hook would receive a list of all failed job IDs within the cluster and not just a single job ID.
          2. The investigate script would then loop over these job IDs and skip jobs which have already been cloned in a previous iteration (or only create a comment there).
        3. For openQA the previous point boils down to invoking hooks only for dependency trees where all jobs have been cancelled/done. The problem here is that multiple jobs can end up cancelled/done at the same time.
          1. Maybe Minion locks can help here. However, it would be very problematic to run only one finalize_job_results task at the same time (e.g. finalize_job_results will pile up because there's a blocker - we allow hook scripts to run 5 minutes and it can sometimes indeed take a while in practice). So a more fine-grained locking would be required. Unfortunately we don't have the concept of a "dependency tree ID" in openQA (which could simply be used as lock name).
          2. Maybe there's a way to query whether a dependency tree is "pending" in a single SQL query to avoid the race condition.
          3. If the previous is not possible, we could use a database transaction to avoid the race condition. The following should do the trick, right? $schema->storage->dbh->prepare('SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY DEFERRABLE;')->execute();
  3. What would be necessary to change within the openqa-clone script to implement 1.?

    1. Support for posting multiple jobs at once (so it happens atomically) and use the API in the clone script.
      1. For parallel dependencies we could skip this relying on the scheduler's ability to repair half-scheduled clusters. However, that doesn't cover and might not work nicely as not enough worker slots might be available.
    2. Support for cloning only parallel children (for 1.2.2.3) but not any kind of chained children (for 1.2.2.1 and for 1.2.2.2).
      1. There's already --clone-children but I suppose it affects all kinds of children. We'd likely need --clone-parallel-children in addition.
    3. Note that skipping chained parents (for 1.2.1.1) while still cloning directly chained and parallel parents (for 1.2.1.2 and 1.2.1.3) should already be possible by specifying --skip-chained-deps.
    4. Note that we could skip 3.2 at the cost of also cloning all kinds of child jobs (per investigation).
    5. For 2.4.2 we would need to implement a machine-readable output format to keep track of the cloned jobs unless we decide for 2.4.2.2.
  4. What would be necessary to change within the restart API to implement 1.?

    1. Rules for restarting dependent jobs are already mostly in line with 1.
    2. Add a flag to skip restarting chained and directly chained children (for 1.2.2.1 and for 1.2.2.2) which would effectively only restart parallel children (for 1.2.2.3).
    3. Add a flag to force the restart even though the job (or some other job in the cluster) has already been restarted.
    4. As per 1. we don't necessarily restart the full dependency tree, so we need to add a flag to avoid creating dependencies between restarted jobs and not-restarted jobs. This is to avoid a connection between the old and the restarted dependency tree which would make it effectively one big dependency tree.
    5. I suppose the previous point is only a displaying issue so we could skip it.
Actions #34

Updated by mkittler almost 3 years ago

Once https://github.com/os-autoinst/openQA/pull/4537 has been merged, point 3.1 from my previous comment is done. This only leaves 3.2 and 3.5 (the latter only if we decide for 2.4.2.1), which should be simple to implement. So maybe it actually makes the most sense to stick with the clone-job script at this point.

Actions #35

Updated by mkittler almost 3 years ago

PR for implementing --clone-parallel-children in openqa-clone-job: https://github.com/os-autoinst/openQA/pull/4551

With that, the only thing left to make openqa-investigate use openqa-clone-job for parallel dependencies would be an additional hook within openQA that only fires once for the whole dependency cluster. (As explained in 2.4.2 there would be other alternatives but an additional hook seems like the best solution.)

Actions #36

Updated by mkittler almost 3 years ago

Looks like we currently invoke openqa-clone-job with --skip-chained-deps. That breaks 1.2.1.2 (directly chained parents need to be restarted). So I suppose --skip-chained-deps should be changed to only affect chained deps but not directly chained deps (which it currently also skips).

There's one more problem to sort out: even if openQA invokes the hook script for the entire dependency tree, we need to find out which of those jobs should be the "root" job to clone. Or we simply use --max-depth 0 to ensure a parallel cluster is fully cloned in any case and simply ignore jobs we've already cloned. That means we would need JSON output in openqa-clone-job after all to keep track of that (within one openqa-investigate run).

Actions #37

Updated by mkittler over 2 years ago

PR for JSON output in openqa-clone-job: https://github.com/os-autoinst/openQA/pull/4564

Actions #38

Updated by okurz over 2 years ago

  • Due date set to 2022-03-25
Actions #39

Updated by mkittler over 2 years ago

  • Due date changed from 2022-03-25 to 2022-04-25

Changing the due date since we wanted to take a pause. Then the next steps would be:

  1. Have openqa-clone-job's --skip-chained-deps option only affect chained dependencies but not directly chained dependencies. Possibly add --skip-directly-chained-deps in case that's wanted after all. (for 1.2.1.2)
  2. Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.
  3. Make openqa-investigate use that hook and investigate all jobs that weren't successful. It should use --max-depth 0 to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times, as openqa-clone-job will already handle dependencies as needed (and therefore might already clone multiple jobs we need to investigate in one go). The tracking should be easy because openqa-clone-job --json-output has already been implemented.
  4. -> #110530 : add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests as a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient.
    1. We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.
Actions #41

Updated by mkittler over 2 years ago

I've added #107014 as related because I suppose everything applies to openqa-trigger-bisect-job as well if we run it as a hook script in production.

Actions #42

Updated by mkittler over 2 years ago

  • Due date changed from 2022-04-25 to 2022-05-02

Since @okurz gave the impulse to delay this, I suppose he as product owner should decide when we want to resume here. So I'm setting the due date to next week to decide by then when we want to continue. Additionally, some feedback about my last comments is appreciated.

Actions #43

Updated by livdywan over 2 years ago

mkittler wrote:

  1. Maybe add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests as a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient.
    1. We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.

I'd suggest making it opt-out via a job setting provided we have follow-up plans for the "problematic" cases mentioned above:

  • Multiple root jobs. We can consider that a future ticket for now.
  • Spawning too many investigation jobs under high load. We could consider such jobs as low priority and drop them (user story, not technical definition).
Actions #44

Updated by okurz over 2 years ago

  • Description updated (diff)
  • Due date changed from 2022-05-02 to 2022-05-09
Actions #45

Updated by okurz over 2 years ago

  • Related to action #110518: Call job_done_hooks if requested by test setting (not only openQA config as done so far) size:M added
Actions #46

Updated by okurz over 2 years ago

  • Related to action #110530: Do NOT call job_done_hooks if requested by test setting added
Actions #47

Updated by okurz over 2 years ago

  • Due date deleted (2022-05-09)
  • Status changed from Feedback to Blocked

@mkittler wait for #110530 first please

Actions #48

Updated by okurz over 2 years ago

okurz and mkittler discussed:

mkittler wrote:

Changing the due date since we wanted to make a pause. Then the next steps would be:

  1. Have openqa-clone-job's --skip-chained-deps option only affect chained dependencies but not directly chained dependencies. Possibly add --skip-directly-chained-deps in case that's wanted after all. (for 1.2.1.2)

I suggest for now to treat directly chained dependencies as unsupported -> so please, in openqa-investigate itself, instead of skipping both parallel and directly-chained deps, only skip directly-chained deps as unsupported for now

  2. Add a hook in openQA to invoke a script once all jobs in a dependency tree are done.

Wait for #110176 first

  3. Make openqa-investigate use that hook and investigate all jobs that weren't successful. It should use --max-depth 0 to ensure parallel clusters are always fully cloned (so it is not necessary to distinguish between parallel parents and children). It needs to keep track of handled job IDs to avoid investigating jobs multiple times, as openqa-clone-job will already handle dependencies as needed (and therefore might already clone multiple jobs we need to investigate in one go). The tracking should be easy because openqa-clone-job --json-output has already been implemented.

This could be done directly, but in general we'd better wait for #110530

  4. -> #110530 : add an opt-out (e.g. by specifying a certain test variable) so users who consider these tests as a waste of time won't complain. I suppose making this configurable via a test variable should be sufficient.
    1. We could also make it opt-in. So we'd keep the current behavior of skipping the investigation of jobs with parallel and directly chained dependencies unless a user specifies some test variable.
Actions #49

Updated by okurz over 2 years ago

  • Subject changed from Provide support for multi-machine scenarios handled by openqa-investigate to Provide support for multi-machine scenarios handled by openqa-investigate size:M
Actions #51

Updated by okurz over 2 years ago

  • Status changed from Blocked to Workable
Actions #52

Updated by mkittler over 2 years ago

The next steps would be (obsoleting #95783#note-39):

  1. #110176 has been implemented. That means we could now check within the hook script whether all dependencies have finished and, if not, retry later (see the sketch after this list).
  2. We should also check whether the current job is a parallel child. If it is, we'd just skip the job completely to avoid cloning jobs multiple times (cloning the parent will clone the children).
  3. When investigating the parallel parent I suppose the "worst" result within the cluster should be assumed (e.g. a passing parallel parent would be treated as if it had failed if at least one parallel child failed).
  4. Since we defined directly chained children out-of-scope we don't need to care about --skip-chained-deps interfering with them (that flag is currently used by the investigate script).
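
A rough sketch of the checks from points 1 and 2, assuming openqa-cli and jq are available and that the jobs API route reports dependencies as job.parents.Parallel / job.children.Parallel; the exit code used to ask openQA to re-run the hook later is only a placeholder:

# hypothetical pre-checks at the start of the hook script
job_id=$1
client() { openqa-cli api --host https://openqa.opensuse.org "$@"; }
job_json=$(client "jobs/$job_id") || exit 1

# skip parallel children entirely: cloning the parallel parent clones them anyway
if [ "$(echo "$job_json" | jq '.job.parents.Parallel | length')" -gt 0 ]; then
    echo "Skipping $job_id: parallel child, handled via its parent"
    exit 0
fi

# postpone while any parallel child of this job is still pending
for child in $(echo "$job_json" | jq -r '.job.children.Parallel[]?'); do
    state=$(client "jobs/$child" | jq -r '.job.state')
    if [ "$state" != done ] && [ "$state" != cancelled ]; then
        echo "Postponing to investigate job $job_id: waiting until pending dependencies have finished"
        exit "${RETRY_EXIT_CODE:-142}"   # placeholder for whatever exit code #110176 defined for "retry later"
    fi
done
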
Actions #53

Updated by mkittler over 2 years ago

Draft: https://github.com/os-autoinst/scripts/pull/170

Unfortunately, considering the result (of the whole parallel cluster) isn't that easy. At the point where we're currently handling dependencies the result has already been evaluated.

Actions #54

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress

I updated the draft https://github.com/os-autoinst/scripts/pull/170 to check the result accordingly. This change means we need to run the investigate script for all jobs (regardless of the result) and do the check for the result within the script (considering the whole job cluster).

I have only tested the jq commands in my local shell so far.

Actions #55

Updated by openqa_review over 2 years ago

  • Due date set to 2022-07-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #56

Updated by mkittler over 2 years ago

I now added some tests. The changes should basically work.

To enable the hook script for all job results the following change could be done: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/707

Then I'd still need to implement skipping other results in the labeling script (as it would then also be called for all job results) and of course implement that way of configuring it in openQA upstream.

Actions #57

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #58

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

I've just learned that executing the hook script for all jobs is a no-go. I assumed we had agreed on doing this kind of logic in the hook script because #112523 was very much in line with that. Well, back to the drawing board.

Note that the general problem we need to resolve here is synchronization (of the investigation of parallel jobs). The question is just where this synchronization is supposed to happen. If we don't synchronize it properly we could either accidentally miss or duplicate the effort.

There are multiple approaches:

  1. Call the hook script for all job results and do the synchronization within the hook script. (This is the approach I would have taken.)
    1. The hook script determines whether a cluster has finished (and postpones until then using #112523) and whether a cluster contains a failed job.
    2. The hook script only considers parent jobs to avoid duplicated investigations and therefore needs to be called for all job results.
      1. This might be problematic for clusters with multiple parents but could be solved by:
        1. Ensuring we really find the top-level parent in the cluster. (We will fail to find the top-level parent in case of cyclic dependencies. It should be fine not to support that but we need to prevent any endless loops in our code.)
        2. If we clone the top-level parent with --max-depth 0 we can ensure we're cloning the full cluster (0 means infinity here). Since we're using --skip-chained-deps and not --clone-children this should not lead to cloning any unwanted jobs outside the cluster.
    3. This is deemed too expensive. However, no other technicalities would prevent the approach.
  2. Call the hook script still only for failed jobs and abuse openQA's comment system to do the synchronization within the hook script.
    1. [same as 1.1] The hook script determines whether a cluster has finished (and postpones until then using #112523) and whether a cluster contains a failed job.
    2. The hook script switches to investigate the parallel parent if it is called for a parallel child. This means we would duplicate the effort if there are multiple failures within the same cluster so we need to synchronize:
      1. The hook script writes the investigation comment before starting the investigation, e.g. "Spawning investigation jobs".
      2. The hook script checks whether another investigation comment has been created in the meantime and only proceeds if its own comment has the lower ID. Otherwise it deletes its comment and aborts.
      3. The hook script edits the comment with the actual contents after spawning the investigation jobs.
      4. Point 1.2.1 applies here as well.
    3. Not sure yet what technicalities will get in the way.
  3. Call the hook script still only for failed jobs and track already investigated jobs within the hook script (e.g. relying on an SQLite database).
    1. No further details as we likely don't want that approach anyway.
  4. Call the hook script for the whole cluster providing the hook script with the appropriate job ID to focus the investigation on. This means the synchronization happens in openQA.
    1. No further logic in the hook script is required but the additional openQA upstream feature might get a little involved.
    2. openQA's implementation would need to take 1.2.1 into account as well when providing the appropriate job ID. So we don't lose that complexity.

I suppose we should go with approach 2 (still taking 1.2.1 into account of course).

Actions #59

Updated by mkittler over 2 years ago

The idea to use --max-depth 0 to fix 1.2.1 is actually very good because it has the nice side-effect of cloning the full parallel cluster regardless of where we start. Normally the depth is limited to direct children, so if we clone a parallel parent for its child then that's it. However, with --max-depth 0 we will actually go down the tree again and basically clone the full cluster. I've just tested it locally and it really behaves as expected.

So we could avoid finding the top-level parent. Actually we also should avoid that because otherwise the case where there are multiple roots would be problematic (as there is no single top-level parent then). However, I suppose for (ab)using comments for synchronization (see approach 2. of the previous comment) we still need to decide on one job within the cluster to write the comment on. Otherwise, if different concurrent invocations read/write comments on different jobs, it wouldn't work. I would suggest simply writing the comment on the job with the lowest ID in the cluster. That job can be determined very easily and there is no ambiguity (unlike the top-level parent job, which is harder to find, might not be unique, and is problematic with dependency cycles). A rough sketch of this combined with the comment-based locking follows below.
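
To make this concrete, here is a rough, hypothetical sketch of that synchronization (approach 2 from the previous comment plus the lowest-ID rule). It assumes openqa-cli and jq; the comment routes (GET/POST/DELETE jobs/<id>/comments) exist in the openQA API, but the returned JSON shapes and the text matching are assumptions, and nested clusters would need a full cluster walk instead of only looking at direct parallel relations:

job_id=$1   # the failed job the hook script was called for
client() { openqa-cli api --host https://openqa.opensuse.org "$@"; }

# choose the job with the lowest ID among the job and its direct parallel relations
sync_job=$(client "jobs/$job_id" \
    | jq -r '[.job.id, (.job.parents.Parallel[]?), (.job.children.Parallel[]?)] | min')

# write a placeholder comment first, then check whether ours is the oldest investigation comment
comment_id=$(client -X POST "jobs/$sync_job/comments" text='Spawning investigation jobs' | jq -r '.id')
oldest=$(client "jobs/$sync_job/comments" \
    | jq -r '[.[] | select(.text | test("investigation")) | .id] | min')
if [ "$comment_id" != "$oldest" ]; then
    # another hook invocation was faster: drop our placeholder and abort
    client -X DELETE "jobs/$sync_job/comments/$comment_id" > /dev/null
    exit 0
fi
# ... spawn the investigation jobs for the whole cluster, then edit the comment with the real content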

Actions #60

Updated by okurz over 2 years ago

@mkittler a very nice write-up indeed. I agree that we should go with approach 2. Eventually in the future we might need approach 4 for other reasons. That is ok as we consider os-autoinst/scripts also an experimental ground for potential later openQA built-in features.

Actions #61

Updated by livdywan over 2 years ago

One thought regarding 2.2.2 - maybe for better debugging it's worth leaving all comments, e.g. have something like "Triggered investigation jobs because of ..." where the reason it was spawned would still be visible in case it doesn't work as expected and the comment can provide some context

Actions #62

Updated by mkittler over 2 years ago

  • Status changed from Feedback to In Progress

https://github.com/os-autoinst/scripts/pull/170 was merged two days ago and I haven't received any feedback from users since then. (Previously they complained quite quickly if the investigate script had done a bad job dealing with parallel clusters.)

So I've just checked parallel jobs being investigated on OSD myself via select jobs.id from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id where job_dependencies.dependency = 2 and comments.text like '%investigation%' and t_finished > '2022-07-11' order by jobs.id desc limit 25;.

All jobs I've looked into look good, e.g. on https://openqa.suse.de/tests/9111048#comments and https://openqa.suse.de/tests/9105044#comments we can see that the parallel parent was selected correctly for the sync comment and the whole cluster was cloned for each investigation job. The same holds for https://openqa.suse.de/tests/9109613#dependencies which is also part of a bigger dependency tree and where only jobs from its parallel cluster have been cloned (as expected).

On o3 I only found the job https://openqa.opensuse.org/tests/2464678#dependencies (and its clones). This job failed because its parallel job wasn't scheduled correctly. (I haven't investigated why.) The investigation was postponed once because not all dependencies were done:

Jul 12 10:25:02 ariel openqa-gru[4018]: Postponing to investigate job 2464678: waiting until pending dependencies have finished

Apparently postponing doesn't work because no automatic investigation was triggered later. I need to look into it because it also doesn't seem to work on OSD. Then the job was manually restarted by ggardet_arm. That didn't work either. The job ended up as parallel_failed despite not even having a parallel dependency within the dependency tree. However, it is likely just a displaying issue because the investigation actually cloned the cluster correctly (see https://openqa.opensuse.org/tests/2465350). The strangeness of that parallel dependency is maybe something to look into but out of the scope of this ticket.

So I guess at least the investigation itself works as expected except for the postponing case. I'll need to look into that.

Actions #64

Updated by mkittler over 2 years ago

The postponing now works. The job https://openqa.suse.de/tests/9126523 has been postponed¹ and was then investigated later. However, the job was actually postponed needlessly. Maybe it could still be optimized so jobs are only postponed if there are pending jobs within the same parallel cluster (and not just any pending jobs within the related dependency tree).


¹

…
Jul 14 11:12:22 openqa openqa-gru[5421]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Jul 14 11:13:57 openqa openqa-gru[8560]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Jul 14 11:15:22 openqa openqa-gru[10801]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Jul 14 11:17:17 openqa openqa-gru[13377]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Actions #65

Updated by mkittler over 2 years ago

PR to implement the improvement mentioned in the previous comment: https://github.com/os-autoinst/scripts/pull/173

Actions #66

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

I've checked a few more jobs after the PR was merged and it all looks good.

Also the feedback comment is written on the right job: https://openqa.suse.de/tests/9105078#comment-565424

So I suppose we can finally consider this ticket resolved.

Actions #67

Updated by okurz over 2 years ago

  • Due date deleted (2022-07-19)