action #69979

coordination #103962: [saga][epic] Easy multi-machine handling: MM-tests as first-class citizens

coordination #103971: [epic] Easy *re*-triggering and cloning of multi-machine tests

Advanced job restarting via the web UI

Added by mkittler almost 2 years ago. Updated 4 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-08-13
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

As a user of the web UI, I want the advanced restart actions to be available in the UI so that I do not have to learn or use the API.

Acceptance criteria

Out of scope

Any additional UI extensions, e.g. to change test parameters on restart

Further notes

  • This should be fairly simple to implement. Just create another icon next to the help icon and show some checkboxes somewhere. If there's no fitting place, just put them into a modal dialog.
  • Compare to github_merge_button_with_options.png but use radio buttons
  • For simplicity I would not touch the other places where the web UI shows the restart button.

Workaround

  • Instead of trying to restart failed children, restart the parent over the API and skip passed children, following https://open.qa/docs/#_further_notes_2
  • As an alternative to START_DIRECTLY_AFTER_TEST, one can define a specific "machine" with a specific worker class that is only fulfilled by a single, unique worker instance
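
A minimal sketch of the first workaround, assuming the restart route is `POST /api/v1/jobs/<id>/restart` and that it accepts the `skip_parents` and `skip_ok_result_children` parameters named in this ticket's discussion. The helper name `restart_url` and the example host are hypothetical. The function only builds the URL; it does not send or authenticate the request:

```python
from urllib.parse import urlencode


def restart_url(host, job_id, skip_parents=False, skip_ok_result_children=False):
    """Build the URL for a restart request (assumed route and parameter names)."""
    params = {}
    if skip_parents:
        # Restart only the job itself, not its directly chained parent.
        params["skip_parents"] = 1
    if skip_ok_result_children:
        # Do not re-run children that already have an OK result.
        params["skip_ok_result_children"] = 1
    url = f"{host.rstrip('/')}/api/v1/jobs/{job_id}/restart"
    query = urlencode(params)
    return f"{url}?{query}" if query else url
```

For example, `restart_url("https://openqa.example.com", 1234, skip_ok_result_children=True)` yields a URL ending in `jobs/1234/restart?skip_ok_result_children=1`. An authenticated API client is still needed to actually issue the POST.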

Related issues

Related to openQA Project - action #70618: Automatically avoid restarting the directly chained parent if possible to save time (New, 2020-08-27)

Related to openQA Project - action #69976: Show dependency graph for cloned jobs (Resolved, 2020-08-13)

Related to openQA Project - action #70720: Unable to restart a child from START_DIRECTLY_AFTER_TEST chain if another child has been restarted already (Resolved, 2020-08-31)

Related to openQA Project - action #68956: Restart the parent and child jobs of a test in a START_DIRECTLY_AFTER_TEST test chain (Resolved, 2020-07-14)

Related to openQA Project - action #104241: Retrigger the original/initial job chain after parts have been retriggered (New, 2021-12-22)

History

#1 Updated by okurz almost 2 years ago

Please use the ticket template from https://progress.opensuse.org/projects/openqav3/wiki#Feature-requests , e.g. you are missing the section headers for the ACs, the motivation (ideally in the form of a user story) and suggestions

#2 Updated by okurz almost 2 years ago

  • Category set to Feature requests
  • Priority changed from Normal to Low
  • Target version set to future

#3 Updated by xlai over 1 year ago

mkittler okurz, Thanks a lot for the discussions, brilliant ideas and efforts on ticket https://progress.opensuse.org/issues/68956, which make it possible for us to improve the openQA test efficiency of the virtualization group. We really appreciate that!

If we apply it in SLE15SP3, during which we will review and restart failed (or even passed) jobs in every daily build, it is quite likely that:
1) in a dependency/chain tree, several leaf tests fail (mostly due to IPMI instability, some due to immature code)
2) we only get the results we need after more than one rerun, restarting different sets of branches from the original full tree and not only from the incomplete tree left by the last restart.

If the web UI can provide this capability, it will greatly facilitate our team members' daily review and considerably improve the efficiency of restarted tests. It will give us enough confidence to officially apply this "START_DIRECTLY_AFTER_TEST" chain method in the virtualization job group, in which, according to our evaluation and design, 200~300 testsuites will be distributed across 20+ trees. Otherwise, we are really worried about the upcoming constraints on our daily tasks.

So I would like to request this feature, "Advanced job restarting via the web UI", ASAP so that we can catch up with SLE15SP3 early alpha testing after trying, tuning and configuring based on it. Thanks in advance!

@Julie_CAO FYI.

#4 Updated by ph03nix over 1 year ago

Hey there! I would like to raise awareness about some upcoming issues with the new approach:

  1. Once a child is restarted, the full dependency tree is gone. Which means you always end up in individual subtrees, where a bunch of jobs are successful, while other jobs are not visible any more. This makes openQA reviews a major inconvenience and people will complain about this very soon. This also increases the risk of losing test coverage, as its siblings will not be displayed any more
  2. Point 2 in the previous post is indirectly asking for the old behaviour
  3. Cropping the dependency graph eliminates possibilities for fine-tuning (e.g. further restarting siblings, which was possible before). The previously mentioned Point 2 asks in a way for this functionality again, but in a sub-branch - I'm wondering if this makes things more complicated than it must be. We then have multiple dependency trees for one single test case? How do we keep track of them?

So perhaps it's worth considering the following suggestions:

  1. The newly upcoming issues might suggest that restarting parent jobs for children is necessary but perhaps should not be the default, as it is destructive
  2. The old behaviour is apparently still needed, also if the new approach is active (e.g. restarting siblings)
  3. Don't crop the dependency graph when restarting the parents. This point is tricky. What happens to already completed jobs in another subtree?

I fearlessly steal the graph from Julie_CAO to illustrate the behaviour :-)

Assuming job D gets restarted but we want to keep jobs C, E and F in the graph, I suggest keeping them in their current state. If needed, they can then be restarted in the web UI as well (with the old restart behaviour).

I can see that this new behaviour still requires the traditional restart method, and we can avoid lots of upcoming issues by allowing it.
I would even suggest restoring the old default and adding "restart child+parents" as an additional option in the web UI, because the old behaviour is a) non-destructive, b) arguably what a user expects when clicking "restart this job" and c) probably (and arguably) used more often than restarting child+parent. Also, some of the new feature requests are for things that were possible before.

So, the new feature of restarting a child+parent is definitely good and adds possibilities to openQA, but I'm wondering if the Beijing colleagues would like to keep this as default behaviour? Perhaps we should consider this as additional but not default feature in openQA? What do you think?

#5 Updated by mkittler over 1 year ago

Once a child is restarted, the full dependency tree is gone. Which means you always end up in individual subtrees, where a bunch of jobs are successful, while other jobs are not visible any more. This makes openQA reviews a major inconvenience and people will complain about this very soon. This also increases the risk of losing test coverage, as its siblings will not be displayed any more

I've created #69976 for that.

it is also very likely that we get what we need after more than once of rerun by restarting different branch sets from original full tree, but not only from the last time's restarted incomplete tree.

So far we've never allowed restarting a job which has already been cloned. Allowing that now would mean changing the database schema and every piece of code using it. I doubt we want to go down that road. I know that it is annoying that you can not restart a directly chained child after you have already restarted its directly chained sibling. That's what you're after, right? Maybe I'll find a solution for this.

Cropping the dependency graph

Sorry, I'm not following what you mean by that. Note that this is the ticket about advanced restarting which is not that much concerned with the dependency graph.

The newly upcoming issues might suggest that restarting parent jobs for children is necessary but perhaps should not be the default, as it is destructive

The directly chained dependency would not work without restarting the parent. I'm afraid that directly chained dependencies come with certain disadvantages which can not be easily solved. Restarting directly chained parent is the right thing to do. It merely reveals a disadvantage of using directly chained dependencies. Please think twice whether you actually need to use directly chained dependencies in the first place. If skipping the parent works for you then you likely have a very constrained setup and can use regularly chained dependencies.

The old behaviour is apparently still needed, also if the new approach is active (e.g. restarting siblings)

If you restart directly chained siblings one after each other without restarting the parent you'll break the direct chain. If it worked for you before that means using regularly chained dependencies would be sufficient in your case.

Don't crop the dependency graph when restarting the parents. This point is tricky. What happens to already completed jobs in another subtree?

I still don't follow. I can just redirect you to #69976 where I already noted down some thoughts about displaying the dependency tree for cloned jobs.

So, the new feature of restarting a child+parent is definitely good and adds possibilities to openQA

Restarting the parent is not a feature. It is a bug fix. Without that fix, directly chained dependencies don't work when restarting them and are basically just as good as regularly chained dependencies. When I implemented directly chained dependencies it was a mistake not to restart the parent as well. It should have been like this from the beginning.

Perhaps we should consider this as additional but not default feature in openQA? What do you think?

No, we shouldn't. I consider it a bug fix to make something which was broken before working again. If you didn't notice it was broken then please just use regularly chained dependencies because in your setup you're clearly not benefiting from the actual feature which the directly chained dependencies provide.


I'll look into your comments with the example dependency tree later.

#6 Updated by ph03nix over 1 year ago

mkittler wrote:

Once a child is restarted, the full dependency tree is gone. Which means you always end up in individual subtrees, where a bunch of jobs are successful, while other jobs are not visible any more. This makes openQA reviews a major inconvenience and people will complain about this very soon. This also increases the risk of losing test coverage, as its siblings will not be displayed any more

I've created #69976 for that.

it is also very likely that we get what we need after more than once of rerun by restarting different branch sets from original full tree, but not only from the last time's restarted incomplete tree.

So far we've never allowed restarting a job which has already been cloned. Allowing that now would mean changing the database schema and every piece of code using it. I doubt we want to go down that road. I know that it is annoying that you can not restart a directly chained child after you have already restarted its directly chained sibling. That's what you're after, right? Maybe I'll find a solution for this.

Cropping the dependency graph

Sorry, I'm not following what you mean by that. Note that this is the ticket about advanced restarting which is not that much concerned with the dependency graph.

I just checked now, and this behaviour seems to be gone in the meantime. What I observed when this whole issue arose was that when I restarted a child job, the dependency tree of the cloned job got cropped down to the child and its parent. For instance:

When I now restart the qam-kvm-storage job, the restarted job ends up in a dependency graph that contains only qam-kvm-install@64bit-ipmi and qam-kvm-storage@64bit-ipmi.
I wanted to post some graphs but I cannot reproduce this issue. Looks like it got solved in the meantime.

However, in https://github.com/os-autoinst/openQA/pull/3300#issuecomment-673318353 I illustrated this issue and you can see it in the output of openqa-mon:

This shows that the cloned job qam-kvm-hotplugging@64bit-ipmi has only its direct parent (qam-kvm-install@64bit-ipmi) in the dependencies. The dependency tree of the restarted job has been cropped down to child+parent only. That's what I meant.

The newly upcoming issues might suggest that restarting parent jobs for children is necessary but perhaps should not be the default, as it is destructive

The directly chained dependency would not work without restarting the parent. I'm afraid that directly chained dependencies come with certain disadvantages which can not be easily solved. Restarting directly chained parent is the right thing to do. It merely reveals a disadvantage of using directly chained dependencies. Please think twice whether you actually need to use directly chained dependencies in the first place. If skipping the parent works for you then you likely have a very constrained setup and can use regularly chained dependencies.

The old behaviour is apparently still needed, also if the new approach is active (e.g. restarting siblings)

If you restart directly chained siblings one after each other without restarting the parent you'll break the direct chain. If it worked for you before that means using regularly chained dependencies would be sufficient in your case.

The reason chain dependencies are used is the requirement, that the children are executed successively on the same worker. IMHO [1] that's not possible with regularly chained dependencies. When you look at the above posted dependency graph, we need to ensure, that all jobs after qam-kvm-install run on the same IPMI worker.
And since we are using an ipmi-worker pool that contains multiple workers, openQA would schedule the runs on different workers.

Don't crop the dependency graph when restarting the parents. This point is tricky. What happens to already completed jobs in another subtree?

I still don't follow. I can just redirect you to #69976 where I already noted down some thoughts about displaying the dependency tree for cloned jobs.

This issue seems to have been resolved in the meantime. If it reoccurs, I will make sure to document it accordingly.

So, the new feature of restarting a child+parent is definitely good and adds possibilities to openQA

Restarting the parent is not a feature. It is a bug fix. Without that fix, directly chained dependencies don't work when restarting them and are basically just as good as regularly chained dependencies. When I implemented directly chained dependencies it was a mistake not to restart the parent as well. It should have been like this from the beginning.

Perhaps we should consider this as additional but not default feature in openQA? What do you think?

No, we shouldn't. I consider it a bug fix to make something which was broken before working again. If you didn't notice it was broken then please just use regularly chained dependencies because in your setup you're clearly not benefiting from the actual feature which the directly chained dependencies provide.


I'll look into your comments with the example dependency tree later.

I understand that. What options do we have now to achieve the following?

  1. Ensure jobs are all executed on the same worker, one after another [2]
  2. Having the possibility to restart jobs without triggering the whole hypervisor install procedure (parent job)

Requirement 1 cannot be fulfilled by regularly chained dependency, and the bugfix breaks requirement 2 for directly chained dependencies and is effectively breaking our workflow.
The workaround of using openqa-client with skip-parents=1 works for now, but this can only be a temporary solution.
We are also definitely open to a different approach, but regularly chained dependencies don't seem to be an option either.

To keep matters simple: I understand that the concept of directly chained dependencies intends to restart child+parent to guarantee consistency. We built our workflow while this bug was present, which is unfortunate. But since advanced job restarting is already planned and would probably resolve our issues and some of our colleagues' issues as well, we consider our issues as being resolved soon. In the end, the issue of the cropped dependency tree was much bigger, but that seems to be resolved in the meantime.

I really think that making advanced job restarting possible via the web UI would not only solve the issues in our virtualization workflow but also allow for better fine-tuning of more complex dependency trees. I really believe that this will bring a big value to a lot of users. Thanks for working on this!

We are definitely looking forward to seeing this happen soon :-)

[1] From http://open.qa/docs/: "The difference between CHAINED and DIRECTLY_CHAINED dependencies is that DIRECTLY_CHAINED means the tests must run directly after another on the same worker slot." - The latter is a strict requirement for our runs, as all tests need to be executed on the same hypervisor.
[2] Restarting jobs was always a bit tricky, as they ended up on a random worker, so we sometimes had to restart a job until it finally ended up on the right one (Yes we are aware that this is nasty)

#7 Updated by Julie_CAO over 1 year ago

mkittler wrote:

So far we've never allowed restarting a job which has already been cloned. Allowing that now would mean changing the database schema and every piece of code using it. I doubt we want to go down that road. I know that it is annoying that you can not restart a directly chained child after you have already restarted its directly chained sibling. That's what you're after, right? Maybe I'll find a solution for this.

Hi Marius,
That is what my team is after. This feature almost blocks our test design. I understand that the change on the openQA side would be huge, but restarting an original job is quite necessary for our tests. By "original" I mean a job which has not been restarted. We just want it to be restartable at any time under any condition; even an imperfect solution is OK for us.

#8 Updated by xlai over 1 year ago

Julie_CAO wrote:

mkittler wrote:

So far we've never allowed restarting a job which has already been cloned. Allowing that now would mean changing the database schema and every piece of code using it. I doubt we want to go down that road. I know that it is annoying that you can not restart a directly chained child after you have already restarted its directly chained sibling. That's what you're after, right? Maybe I'll find a solution for this.

mkittler, Yes, we want this badly :-). Thanks for the upcoming solution. Really appreciate your effort on it!

From this ticket's description AC0, it seems that this solution is not in this ticket's scope. How can we follow it? Will you open a new ticket?

FYI: with ticket #69979, the above solution (in a new ticket?) and ticket #69976, the usability and flexibility of the web UI will be waaaay better. The major requirements of our practical job group review, rerun and maintenance work will be met. Thanks for the great help!

#9 Updated by mkittler over 1 year ago

The reason chain dependencies are used is the requirement, that the children are executed successively on the same worker. IMHO [1] that's not possible with regularly chained dependencies. When you look at the above posted dependency graph, we need to ensure, that all jobs after qam-kvm-install run on the same IPMI worker. And since we are using an ipmi-worker pool that contains multiple workers, openQA would schedule the runs on different workers.

Then you were just lucky that the previous restarting behavior worked for you because without restarting the parent openQA ensures by no means that jobs run on the same worker. Without restarting the parent it (obviously) can also not guarantee that no other possibly disturbing jobs ran on the worker between running the parent and one of the children.

Requirement 1 cannot be fulfilled by regularly chained dependency, and the bugfix breaks requirement 2 for directly chained dependencies and is effectively breaking our workflow.

The requirement 1 was never fulfilled when you restarted the job as stated in requirement 2, unless you were lucky that the same worker was chosen again or you just had one matching worker anyway.

I understand that the concept of directly chained dependencies intends to restart child+parent to guarantee consistency.

That is not true. The main reason to restart directly chained parents is not to guarantee consistency. It is to guarantee that the children run on the same worker as the parent and that no other jobs ran in between.

I really believe that this will bring a big value to a lot of users.

Me, too. Especially if we have built the basic UI other features like amending test variables are conceivable as well.

Thanks for working on this!

This issue is set to the future target version so I'm actually not working on it right now. However, maybe I can convince okurz to move the feature to ready. It shouldn't take very long to implement it then.

"original" here, I mean the job which has not restarted.
We just want it is allowed to be restarted anytime in any condition, even an non-perfect solution is ok for us.
Yes, we want this badly :-). Thanks for the upcoming solution. Really appreciate your effort on it!
FYI. With ticket #69979, the above solution(in a new ticket?) […] the usability&flexibility of webui will be waaaay better.

Well, the advanced restarting would allow skipping the parent and that would allow restarting the "original" job. However, I'd like to emphasize that skipping the parent restart means the children will not necessarily run on the same worker the parent ran on. So this is not really a good solution and also the reason why I've changed the default behavior. And yes, I'm repeating myself - but considering your excitement for being able to skip the parent easily it just seems that you did not understand that when skipping the parent the job might not run on the same worker as its directly chained parent.
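
The trade-off between the restart options discussed here can be illustrated with a small model. This is a simplified sketch, not openQA's actual implementation: it assumes a single tree of directly chained jobs, treats "passed" and "softfailed" as OK results, and uses the hypothetical names `jobs_to_restart`, `parent_of` and `result_of`:

```python
from typing import Dict, List, Optional, Set

# Results treated as "OK" in this model (an assumption for illustration).
OK_RESULTS = {"passed", "softfailed"}


def jobs_to_restart(
    job: str,
    parent_of: Dict[str, Optional[str]],
    result_of: Dict[str, str],
    skip_parents: bool = False,
    skip_ok_result_children: bool = False,
) -> Set[str]:
    """Return the jobs that restarting `job` re-runs in this simplified model."""
    selected = {job}

    # Unless skipped, walk up the chain of directly chained parents.
    if not skip_parents:
        current = parent_of.get(job)
        while current is not None and current not in selected:
            selected.add(current)
            current = parent_of.get(current)

    # Invert the parent map so that children can be looked up.
    children_of: Dict[str, List[str]] = {}
    for child, parent in parent_of.items():
        if parent is not None:
            children_of.setdefault(parent, []).append(child)

    # Walk down from every selected job and pull in its children.
    stack = list(selected)
    while stack:
        current = stack.pop()
        for child in children_of.get(current, []):
            if child in selected:
                continue
            if skip_ok_result_children and result_of.get(child) in OK_RESULTS:
                continue  # leave children that already have an OK result alone
            selected.add(child)
            stack.append(child)
    return selected
```

With qam-kvm-install as parent of qam-kvm-storage (failed, hypothetically) and qam-kvm-hotplugging (passed, hypothetically), restarting qam-kvm-storage selects all three jobs by default, only qam-kvm-storage with `skip_parents=True`, and the parent plus qam-kvm-storage with `skip_ok_result_children=True`. Only the variants that include the parent keep the same-worker guarantee described above.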

#10 Updated by mkittler over 1 year ago

Because you seem to want to avoid restarting the directly chained parent so badly: What do you think about this idea? #70618

Of course I can not promise that we will find the time to implement it any time soon and I hope I haven't overlooked some obstacles.

#11 Updated by Julie_CAO over 1 year ago

mkittler wrote:

"original" here, I mean the job which has not restarted.
We just want it is allowed to be restarted anytime in any condition, even an non-perfect solution is ok for us.
Yes, we want this badly :-). Thanks for the upcoming solution. Really appreciate your effort on it!
FYI. With ticket #69979, the above solution(in a new ticket?) […] the usability&flexibility of webui will be waaaay better.

Well, the advanced restarting would allow skipping the parent and that would allow restarting the "original" job. However, I'd like to emphasize that skipping the parent restart means the children will not necessarily run on the same worker the parent ran on. So this is not really a good solution and also the reason why I've changed the default behavior. And yes, I'm repeating myself - but considering your excitement for being able to skip the parent easily it just seems that you did not understand that when skipping the parent the job might not run on the same worker than its directly chained parent.

No, my team only expects to be able to restart the "original" job. We do not have the requirement to skip the parent. But we don't mind the web UI providing the possibility to skip the parent, as it will help other teams.

#12 Updated by xlai over 1 year ago

mkittler, maybe I can introduce the people involved here to help you better understand the different needs and where they come from. Two different groups of openQA users are using this START_DIRECTLY_AFTER_TEST feature and have commented on this ticket: the QA SLE team (Julie_CAO & xlai) and the QA maintenance team (ph03nix).

  • The QA SLE team wants to restart the parent when rerunning a child (we requested feature #68956 for this). To further improve web UI usability and flexibility, we would like to request 3 new capabilities in the web UI if possible:

    • 1) this ticket #69979 (high need), to let the web UI restart button rerun with "skip_ok_result_children", mainly for rerun efficiency
    • 2) no ticket yet (high need): as Julie_CAO replied, the ability to rerun "original" jobs
    • 3) ticket #69976 (middle need), Show dependency graph for cloned jobs
  • The QA maintenance team wants the ability to skip parent jobs when rerunning child jobs. A more complete list of what they want is:

    • 1) this ticket #69979 (high need), to let the web UI restart button rerun with "skip_parent", to fix their current test issues
    • 2) no ticket yet (middle need?), the ability to rerun "original" jobs
    • 3) ticket #69976 (middle need?), Show dependency graph for cloned jobs
    • 4) ticket #70618 (high need?): Avoid restarting the directly chained parent if possible

I am not sure whether I got the priorities of needs 2), 3) and 4) right for QA maintenance. Please correct me if I am wrong, @ph03nix.

#13 Updated by mkittler over 1 year ago

Thanks for giving that overview.

There's indeed no ticket for rerunning the "original" job but it would also need to be allowed when implementing #70618. Note that restarting the "original" job is currently prevented because the parent would be cloned twice. I would fix it as part of #70618 by simply restarting the most recent clone of the parent (taking care not to abort it if it is still running).

Oh, and for the record and to avoid misunderstandings: With restarting the "original" job we mean restarting a directly chained child (from the original dependency tree) after one of its direct siblings (and therefore its direct parent) has already been restarted.
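
The idea of restarting "the most recent clone of the parent" amounts to following the chain of clones until a job is reached that has not been cloned itself. A minimal sketch, assuming each job records the ID of its clone in a mapping (the names `latest_clone` and `clone_of` are hypothetical, and this is not the actual openQA code):

```python
from typing import Dict, Optional


def latest_clone(job_id: int, clone_of: Dict[int, Optional[int]]) -> int:
    """Follow the clone chain from `job_id` to the job that has no clone yet."""
    seen = {job_id}
    current = job_id
    while clone_of.get(current) is not None:
        current = clone_of[current]
        if current in seen:
            # Guard against inconsistent data that would loop forever.
            raise ValueError(f"clone chain of job {job_id} contains a cycle")
        seen.add(current)
    return current
```

If job 100 was cloned as 105 and 105 as 110, `latest_clone(100, {100: 105, 105: 110})` returns 110. Restarting an "original" child would then attach to job 110 instead of cloning job 100 a second time.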

#14 Updated by xlai over 1 year ago

mkittler wrote:

Thanks for giving that overview.

There's indeed no ticket for rerunning the "original" job but it would also need to be allowed when implementing #70618. Note that restarting the "original" job is currently prevented because the parent would be cloned twice. I would fix it as part of #70618 by simply restarting the most recent clone of the parent (taking care not to abort it if it is still running).

Oh, and for the record and to avoid misunderstandings: With restarting the "original" job we mean restarting a directly chained child (from the original dependency tree) after one of its direct siblings (and therefore its direct parent) has already been restarted.

My pleasure! Thanks for the big support on all direct chain related features and issues! I can imagine that this feature will be widely used in our virtualization testing (200~300 testsuites) once the main features are done, and likewise by the QA Maintenance team. It will also be a mature and good enough feature for any jobs from any team that run on bare-metal/IPMI machines and need to be chained directly for test efficiency and flexibility. Thank you very much for this wonderful result!

#15 Updated by ph03nix over 1 year ago

Hi Marius

mkittler wrote:

The reason chain dependencies are used is the requirement, that the children are executed successively on the same worker. IMHO [1] that's not possible with regularly chained dependencies. When you look at the above posted dependency graph, we need to ensure, that all jobs after qam-kvm-install run on the same IPMI worker. And since we are using an ipmi-worker pool that contains multiple workers, openQA would schedule the runs on different workers.

Then you were just lucky that the previous restarting behavior worked for you because without restarting the parent openQA ensures by no means that jobs run on the same worker. Without restarting the parent it (obviously) can also not guarantee that no other possibly disturbing jobs ran on the worker between running the parent and one of the children.

We are painfully aware of this ;-)
See footnote [2] above, where I stated that we had to click several times until the new job happened to land on the same worker. This was actually the reason why we started to experiment with assigning jobs to a specific worker, so that we can avoid this issue. It comes with other caveats though, so we are not sure if this is going to become the new norm.

It's hacky, but as of now the best approach we have. And we are constantly improving it, so one fine day, we will have a nice and clean testsuite that is green and green only ;-)

Requirement 1 cannot be fulfilled by regularly chained dependency, and the bugfix breaks requirement 2 for directly chained dependencies and is effectively breaking our workflow.

The requirement 1 was never fulfilled when you restarted the job as stated in requirement 2, unless you were lucky that the same worker was chosen again or you just had one matching worker anyway.

I understand that the concept of directly chained dependencies intends to restart child+parent to guarantee consistency.

That is not true. The main reason to restart directly chained parents is not to guarantee consistency. It is to guarantee that the children run on the same worker as the parent and that no other jobs ran in between.

True, that's what I meant - consistency was perhaps the wrong wording.

I really believe that this will bring a big value to a lot of users.

Me, too. Especially if we have built the basic UI other features like amending test variables are conceivable as well.

Thanks for working on this!

This issue is set to the future target version so I'm actually not working on it right now. However, maybe I can convince okurz to move the feature to ready. It shouldn't take very long to implement it then.

"original" here, I mean the job which has not restarted.
We just want it is allowed to be restarted anytime in any condition, even an non-perfect solution is ok for us.
Yes, we want this badly :-). Thanks for the upcoming solution. Really appreciate your effort on it!
FYI. With ticket #69979, the above solution(in a new ticket?) […] the usability&flexibility of webui will be waaaay better.

Well, the advanced restarting would allow skipping the parent and that would allow restarting the "original" job. However, I'd like to emphasize that skipping the parent restart means the children will not necessarily run on the same worker the parent ran on. So this is not really a good solution and also the reason why I've changed the default behavior. And yes, I'm repeating myself - but considering your excitement for being able to skip the parent easily it just seems that you did not understand that when skipping the parent the job might not run on the same worker as its directly chained parent.

Also: #70618 would be a very nice addition IMHO. Good idea!

#16 Updated by okurz over 1 year ago

  • Related to action #70618: Automatically avoid restarting the directly chained parent if possible to save time added

#17 Updated by okurz over 1 year ago

  • Related to action #69976: Show dependency graph for cloned jobs added

#18 Updated by mkittler over 1 year ago

  • Related to action #70720: Unable to restart a child from START_DIRECTLY_AFTER_TEST chain if another child has been restarted already added

#19 Updated by okurz over 1 year ago

I added an implementation suggestion based on what a comparable button looks like on GitHub. Given that the functionality is provided over the API, I still consider this low prio and keep it in the "future" target version, i.e. the SUSE QA tools team does not plan to implement this anytime soon (not within the next months or years). However, given that the business logic does not need to change, it should be feasible for any outside contributor to implement. We are always happy to receive pull requests from anyone :)

#20 Updated by okurz over 1 year ago

  • Description updated (diff)

#21 Updated by okurz over 1 year ago

  • Related to action #68956: Restart the parent and child jobs of a test in a START_DIRECTLY_AFTER_TEST test chain added

#22 Updated by xlai 6 months ago

okurz mkittler Please allow me to restate how important this ticket, and its further sub-requests in https://progress.opensuse.org/issues/69979#note-12, are to virtualization testing.
Virtualization tests rely on baremetal machine testing, which is carried out with openqa ipmi backend jobs. An important characteristic of virtualization testing is its very large host and guest matrix: on each product host, a long list of guests is tested with many further guest feature tests. This gives us a difficulty: if we put all guest tests for the same product host into one testsuite, the openqa job becomes quite long (over 10 hours) and is more likely to fail somewhere. So we have to break them down into smaller testsuites with fewer guests on the same product host. However, that means a single product host needs multiple testsuites, and thus multiple host installations and further environment setups. This brings serious redundancy and wasted efficiency.
This is the background for requiring the START_DIRECTLY_AFTER_TEST feature. With it, we will be able to reorganize the testsuites: make even smaller testsuites to get shorter jobs with less redundancy in setting up the host environment, and pursue highly efficient testing. Thanks a lot to both of you for providing the initial code support.

However, we have not yet restructured the testsuites on a large scale. The main obstacle is that we cannot rerun failed jobs with high efficiency and flexibility via the webui (this ticket). Yes, there is a workaround using the command line; however, it is not that easy to use with daily incoming builds and several hundred testsuites (after breaking them down into smaller ones) to handle.

Looking forward to your further support! Thanks a lot!

#23 Updated by xlai 5 months ago

okurz We would recommend giving this ticket HIGH priority, based on its importance to virtualization testing, which is described in https://progress.opensuse.org/issues/69979#note-22. Do you think that makes sense?

#24 Updated by okurz 5 months ago

xlai wrote:

However, we have not yet restructured the testsuites on a large scale. The main obstacle is that we cannot rerun failed jobs with high efficiency and flexibility via the webui (this ticket). Yes, there is a workaround using the command line; however, it is not that easy to use with daily incoming builds and several hundred testsuites (after breaking them down into smaller ones) to handle.

Please elaborate on that.

  1. Why do you need to manually restart so many tests at all?
  2. Why are the existing ways using the API and command line tools not easy to use?

xlai wrote:

okurz We would recommend giving this ticket HIGH priority, based on its importance to virtualization testing, which is described in https://progress.opensuse.org/issues/69979#note-22. Do you think that makes sense?

We commonly don't regard any feature requests as "High" priority. Please see my reasoning and explanation for the prioritization in #69979#note-19

#25 Updated by xlai 5 months ago

okurz wrote:

xlai wrote:

However, we have not yet restructured the testsuites on a large scale. The main obstacle is that we cannot rerun failed jobs with high efficiency and flexibility via the webui (this ticket). Yes, there is a workaround using the command line; however, it is not that easy to use with daily incoming builds and several hundred testsuites (after breaking them down into smaller ones) to handle.

Please elaborate on that.

  1. Why do you need to manually restart so many tests at all?

Potentially, any job may need to be rerun for whatever reason, especially when there are infra issues such as network or repository problems, a new image still syncing to the desired URL, or ipmi not functioning stably enough (ipmi backend workers only), etc.

  2. Why are the existing ways using the API and command line tools not easy to use?

For the same reasons that openqa has a webui: usability, fewer human errors, automatic handling of complex dependency relationships between jobs, etc.

xlai wrote:

okurz We would recommend giving this ticket HIGH priority, based on its importance to virtualization testing, which is described in https://progress.opensuse.org/issues/69979#note-22. Do you think that makes sense?

We commonly don't regard any feature requests as "High" priority. Please see my reasoning and explanation for the prioritization in #69979#note-19

If your group does not give any feature request high priority, that is fine. I do not mean to break your internal priority rules. But we hope there can be a target plan for it.

And I really do not fully agree with #69979#note-19. If you look again at #69979#note-12, you can see that we need not only a standalone ticket (this one), but also related ones, such as 2) no ticket yet (high need), as Julie_CAO replied, the ability to rerun "original" jobs, and 3) ticket #69976 (medium need), Show dependency graph for cloned jobs, and others. We need a series of related tickets.

IMHO, contributing them is far beyond our knowledge. They need a very deep understanding of how openqa works internally (scheduling, job relationships, webui, etc.) to carefully design and implement these related tickets one by one. To be honest, I have some knowledge of the openqa-webui and os-autoinst code; however, sometimes I do not even understand some of mkittler's discussions/explanations. That's obviously not his fault; it is because of my limited understanding of the whole openqa design. IMHO again, it makes more sense to let one expert openqa engineer think about them carefully and carry them out. We can fully accept it being done slowly if he can only work on it part-time, rather than it never being started.

I have a gut feeling that our understanding of this "whole START_DIRECTLY_AFTER_TEST feature" (the initial code support, the later fixes #68956/#70720, the TODO items #70618/#69976/#69979, and some known TODO items that have no tickets yet but that mkittler knows well) is similar to mkittler's, but your understanding of it is a little far from ours. I could be wrong. But at least mkittler should have thought about it more than anyone else because he has been working on it. May I suggest that you two talk about this so that everyone can be on the same page?

#26 Updated by cdywan 5 months ago

xlai wrote:

okurz wrote:

Please elaborate on that.

  1. Why do you need to manually restart so many tests at all?

Potentially, any job may need to be rerun for whatever reason, especially when there are infra issues such as network or repository problems, a new image still syncing to the desired URL, or ipmi not functioning stably enough (ipmi backend workers only), etc.

  2. Why are the existing ways using the API and command line tools not easy to use?

For the same reasons that openqa has a webui: usability, fewer human errors, automatic handling of complex dependency relationships between jobs, etc.

This ticket, as I understand it, is about providing the equivalent of the existing API/CLI for restarting jobs, with no new features besides the UX based on that. And that is not high priority relative to adding underlying functionality that doesn't exist yet.

xlai If I understand you correctly you're also talking about related points that are covered in other tickets, or still need tickets to be filed. For instance #69976 and #70618. It might be a good idea to re-assess those if they block you.

#27 Updated by xlai 5 months ago

cdywan wrote:

This ticket, as I understand it, is about providing the equivalent of the existing API/CLI for restarting jobs, with no new features besides the UX based on that. And that is not high priority relative to adding underlying functionality that doesn't exist yet.

xlai If I understand you correctly you're also talking about related points that are covered in other tickets, or still need tickets to be filed. For instance #69976 and #70618. It might be a good idea to re-assess those if they block you.

cdywan Exactly! You got my point :-). We need the series of tickets that makes this complex "START_DIRECTLY_AFTER_TEST" feature complete from backend to webui, in a practical, easy-to-use way, for daily openqa QA routines. Thanks for agreeing to re-assess those tickets from the openqa expert perspective. This means a lot to us!

#28 Updated by okurz 5 months ago

xlai wrote:

cdywan Exactly! You got my point :-). We need the series of tickets that makes this complex "START_DIRECTLY_AFTER_TEST" feature complete from backend to webui, in a practical, easy-to-use way, for daily openqa QA routines. Thanks for agreeing to re-assess those tickets from the openqa expert perspective. This means a lot to us!

Yes, we already looked into further planning some time ago, e.g. 5 days ago when we created #103962, which #69976 is now part of, and I estimate that we can work on that during the next year. Feature-wise, this year was mostly impacted by #91646, which wasn't exactly planned beforehand.

xlai wrote:

  1. Why do you need to manually restart so many tests at all?

Potentially, any job may need to be rerun for whatever reason, especially when there are infra issues such as network or repository problems, a new image still syncing to the desired URL, or ipmi not functioning stably enough (ipmi backend workers only), etc.

I strongly recommend looking into better solutions which already exist and have already helped others a lot, e.g.

  2. Why are the existing ways using the API and command line tools not easy to use?

For the same reasons that openqa has a webui: usability, fewer human errors, automatic handling of complex dependency relationships between jobs, etc.

Can you elaborate on "automatic handling of complex dependency relationships between jobs"?

xlai wrote:
And I really do not agree fully with #69979#note-19

I would like to explain further what I mean by "Given that the functionality is provided over API I still consider this of low prio": As cdywan explained, it's important (for us) to consider whether different ways currently exist to achieve the same thing, e.g. not possible over the UI but possible over the API. In such a case the priority for us is usually lower than for other feature requests where the user does not have any way to achieve a certain workflow at all.

IMHO, contributing them is far beyond our knowledge.

Sure, I understand. My related statement was only that outside contributors should be able to cover that, not necessarily from your team.

If you look again at #69979#note-12, you can see that we need not only a standalone ticket (this one), but also related ones, such as 2) no ticket yet (high need), as Julie_CAO replied, the ability to rerun "original" jobs […]

It would be beneficial if you reported a separate, dedicated ticket about '2) the ability to rerun "original" jobs', as this might touch more areas which are relevant for many more users as well.

#29 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready

#30 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback

#32 Updated by cdywan 5 months ago

okurz wrote:

I am giving this a try
https://github.com/os-autoinst/openQA/pull/4417

Merged

#33 Updated by Julie_CAO 5 months ago

  • Related to action #104241: Retrigger the original/initial job chain after parts have been retriggered added

#34 Updated by xlai 5 months ago

xlai wrote:

It would be beneficial if you reported a separate, dedicated ticket about '2) the ability to rerun "original" jobs', as this might touch more areas which are relevant for many more users as well.

@Julie_CAO will open it and share the ticket number later.

The ticket has been created as #104241 and added as a related ticket. Thanks to @Julie_CAO.

#35 Updated by okurz 5 months ago

https://openqa.suse.de/tests/7893640#dependencies currently shows how restarting with "skip_ok_result_children" retriggered tests while keeping "passed" results in children in place

openQA_Screenshot_20211222_084122_parent_retriggered_with_skip_ok_result_children.png

I realized that clicking one of the links in the pull-down menu triggers the correct action, but the webUI does not update, so it is not obvious to the user that the desired action had any effect until the page is refreshed manually.
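
For readers who want to see what the restart variants discussed in this ticket could look like from the API side, here is a minimal sketch. It only builds the request URL and sends nothing. The helper name `build_restart_url`, the host name, the route `/api/v1/jobs/<id>/restart` and the `skip_parents` flag are assumptions for illustration; only `skip_ok_result_children` is named in this thread. Authentication, which a real POST request would need, is left out.

```python
from urllib.parse import urlencode


def build_restart_url(host, job_id, skip_parents=False, skip_ok_result_children=False):
    """Build the URL for a restart request (hypothetical helper, no request is sent).

    The route and the `skip_parents` flag name are assumptions; only
    `skip_ok_result_children` is mentioned in the ticket discussion.
    """
    flags = {}
    if skip_parents:
        # Assumed flag: restart the job without restarting its parent.
        flags["skip_parents"] = 1
    if skip_ok_result_children:
        # Flag named in the ticket: keep children that already passed.
        flags["skip_ok_result_children"] = 1
    url = f"{host.rstrip('/')}/api/v1/jobs/{int(job_id)}/restart"
    query = urlencode(flags)
    return f"{url}?{query}" if query else url


# Example: restart a parent job while keeping passed children in place.
print(build_restart_url("https://openqa.example.com", 7893640,
                        skip_ok_result_children=True))
```

The point of the web UI pull-down menu is that users do not have to assemble such requests by hand; each menu entry would correspond to one combination of these flags.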

#36 Updated by okurz 5 months ago

https://github.com/os-autoinst/openQA/pull/4425 to fix the styling and redirection to the cloned job

#37 Updated by okurz 5 months ago

  • Parent task set to #103971

#38 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

All in place now.
