action #95783: Provide support for multi-machine scenarios handled by openqa-investigate size:M - openQA Project (public) - openSUSE Project Management Tool

#1

Updated by okurz over 3 years ago

Copied from action #81859: openqa-investigate triggers incomplete sets for multi-machine scenarios added

#2

Updated by okurz over 3 years ago

#3

Updated by okurz over 3 years ago

Related to action #103425: Ratio of multi-machine tests alerting with ratio_mm_failed 5.280 size:M added

#4

Updated by okurz over 3 years ago

Related to action #71809: Enable multi-machine jobs trigger without "isos post" added

#5

Updated by okurz over 3 years ago

Description updated (diff)

#6

Updated by okurz over 3 years ago

Parent task set to #103971

#7

Updated by okurz about 3 years ago

Target version changed from future to Ready

#8

Updated by mkittler about 3 years ago

Assignee set to mkittler

#9

Updated by mkittler about 3 years ago

Status changed from New to In Progress

#10

Updated by mkittler about 3 years ago

Status changed from In Progress to Feedback

#11

Updated by mkittler about 3 years ago

#12

Updated by mkittler about 3 years ago

#13

Updated by okurz about 3 years ago

Status changed from Feedback to In Progress

#14

Updated by mkittler about 3 years ago

#15

Updated by okurz about 3 years ago

#16

Updated by mkittler about 3 years ago

#17

Updated by mkittler about 3 years ago

Status changed from In Progress to Feedback

#18

Updated by dzedro about 3 years ago

#19

Updated by tinita about 3 years ago

#20

Updated by okurz about 3 years ago

Status changed from Feedback to In Progress

#21

Updated by mkittler about 3 years ago

#22

Updated by livdywan about 3 years ago

#23

Updated by okurz about 3 years ago

Priority changed from Low to Urgent

@mkittler the situation seems to be severely broken now.

I executed

echo https://openqa.suse.de/tests/8140400 | host=openqa.suse.de openqa-investigate

with the git commit 054cb9c of os-autoinst scripts getting https://openqa.suse.de/tests/8140400#comment-485767 which looks complete with 4 investigation jobs triggered. With current master 3ae1f1a I got https://openqa.suse.de/tests/8140400#comment-485765 with only two jobs, the "retry" and "last_good_build". So what we observed today in the review training session was a regression by one of your recent changes.

An additional problem seems to be that we can't call openqa-investigate on a job that already has clones which is also a regression.

#24

Updated by livdywan about 3 years ago

#25

Updated by mkittler about 3 years ago

#26

Updated by mkittler about 3 years ago

#27

Updated by openqa_review about 3 years ago

Due date set to 2022-02-26

#28

Updated by okurz about 3 years ago

Related to action #69976: Show dependency graph for cloned jobs added

#29

Updated by okurz about 3 years ago

Due date deleted (~~2022-02-26~~)
Status changed from In Progress to Blocked
Priority changed from Urgent to Normal

#30

Updated by mkittler about 3 years ago

Status changed from Blocked to Feedback

#31

Updated by okurz about 3 years ago

#32

Updated by mkittler about 3 years ago

#33

Updated by mkittler about 3 years ago

Recap for those who aren't familiar with the topic:

What is special about MM jobs and how does it relate to openqa-investigate?
1. Those jobs have parallel job dependencies - so the jobs are scheduled to run at the same time. The scheduler code itself works just fine so that's actually not a concern here.
2. Points to improve are:
  1. Posting jobs (via the jobs post API) which have those dependencies. It is possible but only in a non-atomic way which is problematic as we end up with half-running parallel jobs.
  2. Thus the openqa-clone-job script which is using the jobs post API is affected by that problem.
  3. Thus the openqa-investigate script is affected by those problems. In addition, we need to take care not to create redundant investigation jobs.
How does the restart API come into the picture?
1. The openqa-investigate script could utilize it instead of the openqa-clone-job script as it doesn't need to do inter-openQA-instance cloning.
2. Note that the openqa-clone-job script is nevertheless used by users and 1.2.1 and 1.2.2 are also impairing users. So we still need to take care of openqa-clone-job - regardless of how we handle the investigation.

Summary of today's discussion:

Which dependent jobs do we want/need to restart when investigating?
1. The general goal is to avoid producing results which are not needed.
2. It depends on the dependency type:
  1. If a child fails:
    1. Chained parents don't need to be restarted. (Unless they failed, but then the child will be skipped and is thus not investigated anyways.)
    2. Directly chained parents need to be restarted so the chain is not broken. The whole direct chain of parents needs to be restarted (recursively).
    3. Parallel parents need to be restarted as e.g. the "client" job needs the "server" job to run. Presumably parallel siblings can affect each other so the parallel parent's other children and their parallel dependencies need to be restarted as well (resulting in restarting the whole parallel cluster).
  2. If a parent fails:
    1. Chained children don't need to be restarted. (We are mainly interested in finding out why the parent fails, not in producing some further results for the children.)
    2. Directly chained children don't need to be restarted. (The same applies as for regularly chained children.)
    3. Parallel children need to be restarted as e.g. a server crash can maybe only be reproduced if there's a client connecting to the server. Presumably nested parallel children are important as well so they need to be restarted as well (resulting in restarting the whole parallel cluster).
  3. If a "job in the middle" fails (a job which has parents and children at the same time):
    1. Both previous points apply. So parents and children need (or don't need) to be restarted as explained in the previous points.
  4. Note that "chained" and "directly chained" (and "parallel") are distinct dependency types. A dependency only has one of these types and a directly chained dependency is not a chained dependency at the same time. So 1.2.1.1 and 1.2.1.2 don't contradict each other.
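The restart rules above can be sketched as a graph traversal. This is only a minimal illustration, not the logic of openQA's restart API or of openqa-clone-job: the function name, the tuple format and the dependency type strings are hypothetical, and it assumes that the rules are applied transitively to every job that gets restarted.

```python
from collections import defaultdict

# Hypothetical labels for the three distinct dependency types
CHAINED, DIRECTLY_CHAINED, PARALLEL = "chained", "directly_chained", "parallel"


def jobs_to_restart(failed_job, dependencies):
    """Return the set of job IDs to restart when investigating failed_job.

    dependencies is an iterable of (parent, child, dependency_type) tuples.
    """
    parents = defaultdict(list)   # child -> [(parent, type)]
    children = defaultdict(list)  # parent -> [(child, type)]
    for parent, child, dep_type in dependencies:
        parents[child].append((parent, dep_type))
        children[parent].append((child, dep_type))

    to_restart = set()
    queue = [failed_job]
    while queue:
        job = queue.pop()
        if job in to_restart:
            continue  # already handled, also guards against cycles
        to_restart.add(job)
        for parent, dep_type in parents[job]:
            # Chained parents are skipped; directly chained and parallel
            # parents are followed recursively.
            if dep_type in (DIRECTLY_CHAINED, PARALLEL):
                queue.append(parent)
        for child, dep_type in children[job]:
            # Only parallel children are followed, which pulls in the
            # whole parallel cluster.
            if dep_type == PARALLEL:
                queue.append(child)
    return to_restart


example = [
    (0, 1, CHAINED),           # chained parent of the server: not restarted
    (1, 2, PARALLEL),          # server -> failed client
    (1, 3, PARALLEL),          # server -> sibling client
    (4, 2, DIRECTLY_CHAINED),  # directly chained parent of the failed client
    (2, 5, CHAINED),           # chained child of the failed client: not restarted
]
print(sorted(jobs_to_restart(2, example)))
```

In this example the failed client 2 pulls in its parallel parent 1, the sibling 3 and the directly chained parent 4, while the chained parent 0 and the chained child 5 are left alone.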
Can we investigate each failure "in isolation"?
1. First an example what "in isolation" would mean:
  1. Assume we have 2 failed parallel children within the same cluster.
  2. Assume we would create 4 investigation jobs per failed child without considering dependencies.
  3. For each investigation job we would clone the whole cluster as explained in 1.2.1, let's say 3 jobs.
  4. That would make 24 clones in total (number of failed jobs * number of investigation jobs per failed job * number of dependent jobs to be cloned per job).
2. This might be acceptable in general …
3. … but we need to think at least about making exceptions as well.
4. Alternative: Somehow "merge" 2.1.2 and 2.1.3 so we would only have X investigation jobs per dependency tree.
  1. In the example from 2.1 we would end up with "only" 12 jobs (number of investigation jobs per dependency tree * number of dependent jobs to be cloned per investigation job).
  2. To achieve that we would need to keep track of which jobs we have already investigated, which could be done on different levels:
    1. The openqa-investigate script keeps track, e.g.
      1. using a SQLite file as suggested in #95783#note-30 (or some other persistent storage).
      2. by adding a special comment or job setting in all cloned jobs (basically utilizing openQA's database).
    2. openQA invokes the post-fail-hook per "dependency tree" and not per job.
      1. So the post-fail-hook would receive a list of all failed job IDs within the cluster and not just a single job ID.
      2. The investigate script would then loop over these job IDs and skip jobs which have already been cloned in a previous iteration (or only create a comment there).
    3. For openQA the previous point boils down to invoking hooks only for dependency trees where all jobs have been cancelled/done. The problem here is that multiple jobs can end up cancelled/done at the same time.
      1. Maybe Minion locks can help here. However, it would be very problematic to run only one finalize_job_results task at the same time (e.g. finalize_job_results will pile up because there's a blocker - we allow hook scripts to run 5 minutes and it can sometimes indeed take a while in practice). So a more fine-grained locking would be required. Unfortunately we don't have the concept of a "dependency tree ID" in openQA (which could simply be used as lock name).
      2. Maybe there's a way to query whether a dependency tree is "pending" in a single SQL query to avoid the race condition.
      3. If the previous is not possible, we could use a database transaction to avoid the race condition. The following should do the trick, right? $schema->storage->dbh->prepare('SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY DEFERRABLE;')->execute();
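The clone counts from the example in this section can be checked with a few lines of arithmetic. The function names are hypothetical. The merged count is read here as 4 investigation jobs per dependency tree, each cloning the 3 jobs of the cluster, which is an interpretation of the example rather than something stated explicitly.

```python
def clones_in_isolation(failed_jobs, investigations_per_failed_job, cluster_size):
    """Every failed job is investigated on its own and clones the whole cluster."""
    return failed_jobs * investigations_per_failed_job * cluster_size


def clones_when_merged(investigations_per_tree, cluster_size):
    """Investigations are created once per dependency tree, not per failed job."""
    return investigations_per_tree * cluster_size


# 2 failed parallel children, 4 investigation jobs, cluster of 3 jobs
print(clones_in_isolation(2, 4, 3))
print(clones_when_merged(4, 3))
```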
What would be necessary to change within the openqa-clone script to implement 1.?
1. Support for posting multiple jobs at once (so it happens atomically) and use the API in the clone script.
  1. For parallel dependencies we could skip this relying on the scheduler's ability to repair half-scheduled clusters. However, that doesn't cover and might not work nicely as not enough worker slots might be available.
2. Support for cloning only parallel children (for 1.2.2.3) but not any kind of chained children (for 1.2.2.1 and for 1.2.2.2).
  1. There's already --clone-children but I suppose it affects all kinds of children. We'd likely need --clone-parallel-children in addition.
3. Note that skipping chained parents (for 1.2.1.1) while still cloning directly chained and parallel parents (for 1.2.1.2 and 1.2.1.3) should already be possible by specifying --skip-chained-deps.
4. Note that we could skip 3.2 at the cost of also cloning all kinds of child jobs (per investigation).
5. For 2.4.2 we would need to implement a machine-readable output format to keep track of the cloned jobs unless we decide for 2.4.2.2.
What would be necessary to change within the restart API to implement 1.?
1. Rules for restarting dependent jobs are already mostly according to 1.
2. Add a flag to skip restarting chained and directly chained children (for 1.2.2.1 and for 1.2.2.2) which would effectively only restart parallel children (for 1.2.2.3).
3. Add a flag to force the restart even though the job (or some other job in the cluster) has already been restarted.
4. As of 1. we don't necessarily restart the full dependency tree. So we need to add a flag to avoid creating dependencies between restarted jobs and not restarted jobs. This is to avoid a connection between the old and the restarted dependency tree making it effectively one big dependency tree.
5. I suppose the previous point is only a displaying issue so we could skip it.

#34

Updated by mkittler about 3 years ago

#35

Updated by mkittler about 3 years ago

#36

Updated by mkittler about 3 years ago

#37

Updated by mkittler about 3 years ago

#38

Updated by okurz about 3 years ago

Due date set to 2022-03-25

#39

Updated by mkittler about 3 years ago

Due date changed from 2022-03-25 to 2022-04-25

#40

Updated by mkittler about 3 years ago

Related to action #107014: trigger openqa-trigger-bisect-jobs from our automatic investigations whenever the cause is not already known size:M added

#41

Updated by mkittler about 3 years ago

#42

Updated by mkittler almost 3 years ago

Due date changed from 2022-04-25 to 2022-05-02

#43

Updated by livdywan almost 3 years ago

#44

Updated by okurz almost 3 years ago

Description updated (diff)
Due date changed from 2022-05-02 to 2022-05-09

#45

Updated by okurz almost 3 years ago

Related to action #110518: Call job_done_hooks if requested by test setting (not only openQA config as done so far) size:M added

#46

Updated by okurz almost 3 years ago

Related to action #110530: Do NOT call job_done_hooks if requested by test setting added

#47

Updated by okurz almost 3 years ago

Due date deleted (~~2022-05-09~~)
Status changed from Feedback to Blocked

#48

Updated by okurz almost 3 years ago

#49

Updated by okurz almost 3 years ago

Subject changed from Provide support for multi-machine scenarios handled by openqa-investigate to Provide support for multi-machine scenarios handled by openqa-investigate size:M

#50

Updated by mkittler almost 3 years ago

Related to action #110176: [spike solution] [timeboxed:10h] Restart hook script in delayed minion job based on exit code size:M added

#51

Updated by okurz almost 3 years ago

Status changed from Blocked to Workable

#52

Updated by mkittler almost 3 years ago

#53

Updated by mkittler almost 3 years ago

#54

Updated by mkittler almost 3 years ago

Status changed from Workable to In Progress

#55

Updated by openqa_review almost 3 years ago

Due date set to 2022-07-19

#56

Updated by mkittler almost 3 years ago

#57

Updated by okurz almost 3 years ago

Description updated (diff)

#58

Updated by mkittler almost 3 years ago

Status changed from In Progress to Feedback

I've just learned that executing the hook script for all jobs is a no-go. I assumed we had agreed on doing this kind of logic in the hook script because #112523 was very much in-line with that. Well, back to the drawing board.

Note that the general problem we need to resolve here is synchronization (of the investigation of parallel jobs). The question is just where this synchronization is supposed to happen. If we don't synchronize it properly we could either accidentally miss or duplicate the effort.

There are multiple approaches:

Call the hook script for all job results and do the synchronization within the hook script. (This is the approach I would have taken.)
1. The hook script determines whether a cluster has finished (and postpones until then using #112523) and whether a cluster contains a failed job.
2. The hook script only considers parent jobs to avoid duplicated investigations and therefore needs to be called for all job results.
  1. This might be problematic for clusters with multiple parents but could be solved by:
    1. Ensuring we really find the top-level parent in the cluster. (We will fail to find the top-level parent in case of cyclic dependencies. It should be fine to not support it but we need to prevent any endless loops in our code.)
    2. If we clone the top-level parent with --max-depth 0 we can ensure we're cloning the full cluster (0 means infinity here). Since we're using --skip-chained-deps and not --clone-children this should not lead to cloning any unwanted jobs outside the cluster.
3. This is deemed too expensive. However, no other technicalities would prevent the approach.
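Finding the top-level parent as described in 1.2.1 could look roughly like the following. This is a sketch under assumptions, not the code of the investigate script: the function name and the edge format are hypothetical, and picking the lowest job ID among several top-level parents is just one way to make the choice independent of the job the hook was called for.

```python
from collections import defaultdict


def top_level_parent(job, parallel_edges):
    """Return the top-level parallel parent of the cluster containing job.

    parallel_edges is an iterable of (parent, child) job ID pairs.
    Returns None if the cluster has no job without a parent (cyclic case).
    """
    neighbours = defaultdict(set)
    has_parent = set()
    for parent, child in parallel_edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
        has_parent.add(child)

    cluster, queue = set(), [job]
    while queue:
        current = queue.pop()
        if current in cluster:
            continue  # the visited set prevents endless loops on cycles
        cluster.add(current)
        queue.extend(neighbours[current])

    roots = sorted(j for j in cluster if j not in has_parent)
    return roots[0] if roots else None


print(top_level_parent(3, [(1, 2), (1, 3)]))  # single parent
print(top_level_parent(3, [(1, 3), (2, 3)]))  # two parents, lowest ID wins
print(top_level_parent(1, [(1, 2), (2, 1)]))  # cyclic, unsupported
```

Because the whole cluster is collected first, every job of the cluster resolves to the same parent, so investigations triggered from different failed children end up on the same job.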
Call the hook script still only for failed jobs and abuse openQA's comment system to do the synchronization within the hook script.
1. [same as 1.1] The hook script determines whether a cluster has finished (and postpones until then using #112523) and whether a cluster contains a failed job.
2. The hook script switches to investigate the parallel parent if it is called for a parallel child. This means we would duplicate the effort if there are multiple failures within the same cluster so we need to synchronize:
  1. The hook script writes the investigation comment before starting the investigation, e.g. "Spawning investigation jobs".
  2. The hook script checks whether another investigation comment has been created in the meantime and only proceeds if its own comment has the lower ID. Otherwise it deletes its comment and aborts.
  3. The hook script edits the comment with the actual contents after spawning the investigation jobs.
  4. Point 1.2.1 applies here as well.
3. Not sure yet what technicalities will get in the way.
Call the hook script still only for failed jobs and track already investigated jobs within the hook script (e.g. relying on an SQLite database).
1. No further details as we likely don't want that approach anyways.
Call the hook script for the whole cluster providing the hook script with the appropriate job ID to focus the investigation on. This means the synchronization happens in openQA.
1. No further logic in the hook script is required but the additional openQA upstream feature might get a little involved.
2. openQA's implementation would need to take 1.2.1 into account as well when providing the appropriate job ID. So we don't lose that complexity.

I suppose we should go with approach 2 (still taking 1.2.1 into account of course).
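The comment-based synchronization of approach 2 can be sketched with an in-memory stand-in for the comments of a job. Everything here is hypothetical (the CommentStore class, the function names and the assumption that both the placeholder and the final comment contain the word "investigation"); the real hook script would talk to openQA's comment API instead.

```python
import itertools

MARKER = "investigation"  # assumed to appear in placeholder and final comment


class CommentStore:
    """In-memory stand-in for job comments with increasing comment IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.comments = {}  # comment ID -> (job ID, text)

    def add(self, job_id, text):
        comment_id = next(self._next_id)
        self.comments[comment_id] = (job_id, text)
        return comment_id

    def edit(self, comment_id, text):
        job_id, _ = self.comments[comment_id]
        self.comments[comment_id] = (job_id, text)

    def delete(self, comment_id):
        del self.comments[comment_id]

    def investigation_comment_ids(self, job_id):
        return [cid for cid, (jid, text) in self.comments.items()
                if jid == job_id and MARKER in text]


def announce(store, parent_job_id):
    """Step 1: write the placeholder before starting the investigation."""
    return store.add(parent_job_id, "Spawning investigation jobs")


def may_proceed(store, parent_job_id, own_comment_id):
    """Step 2: proceed only if our comment has the lowest ID, else clean up."""
    if min(store.investigation_comment_ids(parent_job_id)) == own_comment_id:
        return True
    store.delete(own_comment_id)
    return False


# Two hook invocations for failed children of the same parallel parent 100
store = CommentStore()
first = announce(store, 100)
second = announce(store, 100)
print(may_proceed(store, 100, first))   # the first invocation investigates
print(may_proceed(store, 100, second))  # the second one aborts
# Step 3: replace the placeholder with the actual contents
store.edit(first, "Automatic investigation jobs: retry, last_good_build")
```

The sketch relies on comment IDs being assigned in a strictly increasing order, which is what makes "lowest ID wins" a usable tie-breaker.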

#59

Updated by mkittler almost 3 years ago

#60

Updated by okurz almost 3 years ago

#61

Updated by livdywan almost 3 years ago

#62

Updated by mkittler over 2 years ago

Status changed from Feedback to In Progress

https://github.com/os-autoinst/scripts/pull/170 was merged 2 days ago. I haven't received any feedback from users since then. (Before, they complained quite quickly if the investigate script had done a bad job dealing with parallel clusters.)

So I've just checked parallel jobs being investigated on OSD myself via select jobs.id from jobs join comments on jobs.id = comments.job_id join job_dependencies on jobs.id = job_dependencies.parent_job_id where job_dependencies.dependency = 2 and comments.text like '%investigation%' and t_finished > '2022-07-11' order by jobs.id desc limit 25;.

All jobs I've looked into look good, e.g. on https://openqa.suse.de/tests/9111048#comments and https://openqa.suse.de/tests/9105044#comments we can see that the parallel parent was selected correctly for the sync comment and the whole cluster was cloned for each investigation job. The same applies to https://openqa.suse.de/tests/9109613#dependencies which is also part of a bigger dependency tree and only jobs from its parallel cluster have been cloned (as expected).

On o3 I only found the job https://openqa.opensuse.org/tests/2464678#dependencies (and its clones). This job failed because its parallel job hasn't been scheduled correctly. (I haven't investigated why.) The investigation was postponed once because not all dependencies were done:

Jul 12 10:25:02 ariel openqa-gru[4018]: Postponing to investigate job 2464678: waiting until pending dependencies have finished

Apparently postponing doesn't work because no automatic investigation was triggered later. I need to look into it because it also doesn't seem to work on OSD. Then the job was manually restarted by ggardet_arm. That didn't work either. The job ended up as parallel failed despite not even having a parallel dependency within the dependency tree. However, it is likely just a displaying issue because the investigation actually cloned the cluster correctly (see https://openqa.opensuse.org/tests/2465350). The strangeness of that parallel dependency is maybe something to look into but out of the scope of this ticket.

So I guess at least the investigation itself works as expected except for the postponing case. I'll need to look into that.

#63

Updated by mkittler over 2 years ago

#64

Updated by mkittler over 2 years ago

The postponing now works. The job https://openqa.suse.de/tests/9126523 has been postponed¹ and was then investigated later. However, the job was actually postponed needlessly. Maybe it could still be optimized so jobs are only postponed if there are pending jobs within the same parallel cluster (and not just any pending jobs within the related dependency tree).

¹

…
Jul 14 11:12:22 openqa openqa-gru[5421]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Jul 14 11:13:57 openqa openqa-gru[8560]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Jul 14 11:15:22 openqa openqa-gru[10801]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
Jul 14 11:17:17 openqa openqa-gru[13377]: Postponing to investigate job 9126523: waiting until pending dependencies have finished
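The optimization mentioned above (postponing only for pending jobs within the same parallel cluster) could be sketched as follows. The function names and data formats are hypothetical, and the set of pending states is an assumption about openQA's job states rather than something taken from this ticket.

```python
# Assumed set of "pending" job states
PENDING = {"scheduled", "assigned", "setup", "running", "uploading"}


def parallel_cluster(job, parallel_edges):
    """Collect all jobs connected to job via parallel dependencies."""
    neighbours = {}
    for parent, child in parallel_edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    cluster, queue = set(), [job]
    while queue:
        current = queue.pop()
        if current not in cluster:
            cluster.add(current)
            queue.extend(neighbours.get(current, ()))
    return cluster


def should_postpone(job, parallel_edges, states):
    """Postpone only if another job of the same parallel cluster is pending."""
    return any(states[other] in PENDING
               for other in parallel_cluster(job, parallel_edges)
               if other != job)


# Job 3 is a chained child outside the parallel cluster of jobs 1 and 2,
# so it does not delay the investigation of job 2 while it is still running.
states = {1: "done", 2: "done", 3: "running"}
print(should_postpone(2, [(1, 2)], states))
```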

#65

Updated by mkittler over 2 years ago

#66

Updated by mkittler over 2 years ago

Status changed from In Progress to Resolved

#67

Updated by okurz over 2 years ago

Due date deleted (~~2022-07-19~~)


