Project

General

Profile

Actions

action #138593

closed

coordination #154768: [saga][ux] State-of-art user experience for openQA

coordination #157345: [epic] Improved test reviewer user experience

Restart of scheduled products is prone to retriggers by humans size:S

Added by waynechen55 9 months ago. Updated 15 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-10-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

I found many scheduled test suites waiting for ipmi workers, but the number of running tests (also using ipmi worker) is less than the number of ipmi workers. It is very obvious and can be seen easily in openQA Build page like this https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=28.1&groupid=263.

I learned that llzhao reran the whole product, because she did not find a way to rerun only her job group. Maybe this can be considered within this feature request or a new one.

Steps to reproduce

  • Will update if there are known steps help reproduce

Impact

Workers are already assigned to tests that are supposed to stop. So resource is wasted.

Problem

After investigating this a while, I found some workers are executing some tests secretly, which means it can not be seen from above openQA Build page, for example:

All test suites were already cancelled.

But after navigating into specific worker page, I found it was executing a test suite secretly, for example:

Because the running one is not on top, so it can not be seen by just look up in openQA job group or build page.

Let's take worker sapworker3:2 and test suite sle-15-SP6-Online-x86_64-Build28.1-uefi-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem as an example. The scheduled test run https://openqa.suse.de/tests/12668997 was already cancelled, and it is expected to stop. But it was then triggered and run again secretly as another test run https://openqa.suse.de/tests/12664633 without being displayed in openQA job group or build page. Although the test number 12664633 is smaller than 12668997, the 12668997 was cancelled hours earlier. And I confirmed sapworker3:2 is assigned to test 12664633 after it became idle, and at that time point, test 12668997 was already cancelled hours ago. The 12664633 is hidden in sapworker3:2 worker page, for example:

So I have to click the "working" button right to sapworker3:2 to discover it.

I did not manually triggered or scheduled any relevant test for this issue.

Suggestions

  • Show running test run always on top if feasible
  • Better schedule and trigger logic
  • Make the "restart" icon for products a proper button with a frame and a label like "Restart the whole product"
    • Move the button out of the area with the labels (results, product, assigned) e.g. more to the right
  • Additionally consider a confirmation dialogue (maybe with xy jobs restarted if possible)

Workaround

n/a

Acceptance Criteria

  • AC1: Accidental retriggers of complete products from test details pages are less likely

Out of scope

  • Partial scheduling of products #157342

Files

Selection_284.png (49.9 KB) Selection_284.png waynechen55, 2023-10-26 07:41
Selection_283.png (110 KB) Selection_283.png waynechen55, 2023-10-26 07:43
Selection_285.png (121 KB) Selection_285.png waynechen55, 2023-10-26 07:49
Selection_284.png (49.9 KB) Selection_284.png waynechen55, 2023-10-26 08:06
Selection_283.png (110 KB) Selection_283.png waynechen55, 2023-10-26 08:06
Selection_285.png (121 KB) Selection_285.png waynechen55, 2023-10-26 08:06
Selection_287.png (78.2 KB) Selection_287.png waynechen55, 2023-11-03 00:06
Selection_288.png (99.2 KB) Selection_288.png waynechen55, 2023-11-03 00:06
6th_rounds.png (39.8 KB) 6th_rounds.png Julie_CAO, 2023-11-03 02:21
5th_rounds.png (45 KB) 5th_rounds.png Julie_CAO, 2023-11-03 02:21
4th_rounds.png (44 KB) 4th_rounds.png Julie_CAO, 2023-11-03 02:21
3rd_rounds.png (45.4 KB) 3rd_rounds.png Julie_CAO, 2023-11-03 02:21
screenshot__Y_M_D__H_m_S.png (92.9 KB) screenshot__Y_M_D__H_m_S.png re-scheduling of https://openqa.suse.de/tests/14610658 mkittler, 2024-06-19 12:36

Related issues 3 (1 open2 closed)

Related to openQA Project - action #157342: Partial product re-scheduling scheduled whole buildResolvedokurz2024-03-15

Actions
Related to openQA Project - action #162608: Rescheduling product from the webUI adds invalid settingsResolvedmkittler2024-06-20

Actions
Copied to openQA Project - action #156277: [qe-core] Some jobs are retriggered by geekotest (likely obs-sync plugin) without clear answer whyNewszarate2023-10-26

Actions
Actions #1

Updated by waynechen55 9 months ago

  • Subject changed from [openQA][worker][assignment][job] Some jobs triggered and executed secretly without being noticed to [openQA][worker][schedule][assignment][job] Some jobs triggered and executed secretly without being noticed
Actions #2

Updated by Julie_CAO 9 months ago

There were two rounds of product triggering:

Scheduled product 1980120

Time User Status Distri Version Flavor Arch Build ISO Actions
about 16 hours ago geekotest scheduled SLE 15-SP6 Online x86_64 28.1 SLE-15-SP6-Online-x86_64-Build28.1-Media1.iso

Scheduled product 1980232

Time User Status Distri Version Flavor Arch Build ISO Actions
about 14 hours ago geekotest scheduled SLE 15-SP6 Online x86_64 28.1 SLE-15-SP6-Online-x86_64-Build28.1-Media1.iso
Actions #3

Updated by okurz 9 months ago

  • Subject changed from [openQA][worker][schedule][assignment][job] Some jobs triggered and executed secretly without being noticed to [openQA][worker][schedule][assignment][job][qe-core] Some jobs triggered and executed secretly without being noticed

@qe-core this looks like an issue in obs-rsync-trigger. Can you take a look?

Actions #4

Updated by okurz 9 months ago

  • Target version set to QE-Core: Ready

Updated by waynechen55 9 months ago

Build28.1 was triggered again yesterday. This issue must be resolved ASAP.


Actions #6

Updated by waynechen55 9 months ago

  • Priority changed from Normal to High

Updated by Julie_CAO 9 months ago

For build #28.1, except those two rounds of build initially, there have been 4 more rounds in these 4 days. It has exausted lots of resources and brought us much troubles to handle these tests. @xlai

3rd_rounds
4th_round
5th_round
6th_round

Actions #8

Updated by xlai 9 months ago

I raised the topic of multiple openqa triggering of same build , in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. Hope more attention can be given and avoid it.

@ybonatakis @emiura I am not sure if I am reaching the same ones that have openqa accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If it is correct, I would appreciate if you can share why you need to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!

@szarate Hello Santi, would you please help a look why two times of geekotest triggering for the build? Any bug to fix on CI tool?

Actions #9

Updated by ybonatakis 9 months ago

xlai wrote in #note-8:

I raised the topic of multiple openqa triggering of same build , in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. Hope more attention can be given and avoid it.

@ybonatakis @emiura I am not sure if I am reaching the same ones that have openqa accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If it is correct, I would appreciate if you can share why you need to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!

This meant to run only HPC test group from my side. Usually it just do that. I dont know why Virtualization tests re-triggered.

@szarate Hello Santi, would you please help a look why two times of geekotest triggering for the build? Any bug to fix on CI tool?

Actions #10

Updated by xlai 9 months ago

ybonatakis wrote in #note-9:

xlai wrote in #note-8:

I raised the topic of multiple openqa triggering of same build , in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. Hope more attention can be given and avoid it.

@ybonatakis @emiura I am not sure if I am reaching the same ones that have openqa accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If it is correct, I would appreciate if you can share why you need to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!

This meant to run only HPC test group from my side. Usually it just do that. I dont know why Virtualization tests re-triggered.

@ybonatakis thanks for the reply. Would you please paste your trigger command here, so that the relevant team may help to find out the reason? From https://openqa.suse.de/admin/productlog?id=1988393, there is a long successful job list which seems much more than HPC only.

BTW, did you specify "TEST=" in your command? If nothing given, but only mediums info provided, all tests binded to that would be rerun. That used to happen on OSD before by accident.

Actions #11

Updated by emiura 9 months ago

xlai wrote in #note-8:

I raised the topic of multiple openqa triggering of same build , in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. Hope more attention can be given and avoid it.

@ybonatakis @emiura I am not sure if I am reaching the same ones that have openqa accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If it is correct, I would appreciate if you can share why you need to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!

@szarate Hello Santi, would you please help a look why two times of geekotest triggering for the build? Any bug to fix on CI tool?

@xlai
I'm sorry for the trouble, my intention was to restart the teststo check if registry related errors had been fixed. It will not happen again.

Actions #12

Updated by xlai 9 months ago

emiura wrote in #note-11:

@xlai
I'm sorry for the trouble, my intention was to restart the teststo check if registry related errors had been fixed. It will not happen again.

No worries. Thank you! @emiura

Actions #13

Updated by waynechen55 8 months ago

For Build34.1 and Build37.1, the historical scheduled products are as below:

1992810 a day ago   epaolantonio    scheduled   SLE 15-SP6  Online  x86_64  37.1    SLE-15-SP6-Online-x86_64-Build37.1-Media1.iso   
1990407 8 days ago  geekotest   scheduled   SLE 15-SP6  Online  x86_64  34.1    SLE-15-SP6-Online-x86_64-Build34.1-Media1.iso   
Actions #14

Updated by waynechen55 8 months ago

@ybonatakis It seems you scheduled test for Build39.1 again. All virtualization tests are running again now.

about an hour ago ybonatakis scheduled sle 15-SP6 Online x86_64 39.1 SLE-15-SP6-Online-x86_64-Build39.1-Media1.iso

Why do you need to do this ?

Actions #15

Updated by xlai 5 months ago

Similar thing happens again. Individuals triggers full build test accidentally. It's also raised and discussed in https://suse.slack.com/archives/C0564EU665S/p1709176558631769.

Actions #16

Updated by szarate 5 months ago

  • Copied to action #156277: [qe-core] Some jobs are retriggered by geekotest (likely obs-sync plugin) without clear answer why added
Actions #17

Updated by szarate 5 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from [openQA][worker][schedule][assignment][job][qe-core] Some jobs triggered and executed secretly without being noticed to Restart of scheduled products is prone to retriggers by humans
  • Description updated (diff)
  • Assignee set to okurz
  • Priority changed from High to Normal
  • Target version changed from QE-Core: Ready to future
  • Parent task set to #154768

Following up my message over slack, on @xlai's thread:

Shall we take some actions to avoid that? Eg, forbid it on the web UI to rerun scheduled product? Or bump access right for that button?
One of the things I was discussing days ago with @Oliver Kurz was the possibility of having RBAC - Role Based Access Control in openQA so that things like restarting a whole product is behind a specific permission. because, at the moment, everybody basically can.
Another possibility is fixing the UI, as mentioned by @Xiaoli Ai, but eventually, weโ€™ll have to get there.

We can focus first on moving the restart button for scheduled products behind an admin permission. Look into the RBAC later

Actions #18

Updated by szarate 5 months ago

  • Priority changed from Normal to High
Actions #19

Updated by xlai 5 months ago

Just for clarification:
The historical comments and the original ticket description may mix two issues up.
- In this ticket, please focus on what the ticket title gives -- Restart of scheduled products is prone to retriggers by humans, which will need the help from openqa tool to improve. For anyone helps with it, please focus on related comments for it only.
- the other issue of multiple triggers by geekotest has been scoped into a new ticket #156277.

Actions #20

Updated by ybonatakis 5 months ago

I am new to the discussion but as a user who have fell into the mistake of trigger incidentally jobs for other job groups I want to say that I disagree with the suggestion of an admin permission for that. Usually the problem is the misuse or misunderstanding of pressing a button or of openqa-cli. Instead, I would like to have a prompt and a warning before the execution.
For example when i do an post isos from the console, and if this has a results to schedule a whole job group, I would expect a prompt which shows me what it is going to schedule and ask for confirmation to continue.

Actions #21

Updated by okurz 5 months ago

I also question both the ACs

  • "RBAC is implemented or an Epic for its implementation exists": IMHO that already exists. We already have three different levels with user, operator, admin. What do you see as missing here?
  • Restart of Scheduled produtcs is put behind a higher level permission (Admin?) Same as ybonatakis stated that wouldn't help: we have many admins and they can still make a mistake. And others might still have a valid need to retrigger. I would like us to check why we even have that retrigger button on the job details page as we already have one in the "scheduled products" page where I think it shows more clearly that many jobs would be affected
Actions #22

Updated by Julie_CAO 5 months ago

The entire build 59.2 was retriggered again ...

about 11 hours ago  coolo   scheduled   SLE 15-SP6  Online  x86_64  59.2    SLE-15-SP6-Online-x86_64-Build59.2-Media1.iso
Actions #23

Updated by Julie_CAO 5 months ago

One more retrigger ...

about 12 hours ago  dheidler    scheduled   SLE 15-SP6  Online  x86_64  59.2    SLE-15-SP6-Online-x86_64-Build59.2-Media1.iso
Actions #24

Updated by okurz 4 months ago

  • Status changed from New to Feedback

No response on #138593-21, lowering prio

Actions #25

Updated by okurz 4 months ago

  • Priority changed from High to Low
Actions #26

Updated by xlai 4 months ago

okurz wrote in #note-24:

No response on #138593-21, lowering prio

I would like us to check why we even have that retrigger button on the job details page as we already have one in the "scheduled products" page where I think it shows more clearly that many jobs would be affected

I vote for this too. Maybe removing the button of rescheduling products from job details page can decrease such accidental trigger. Anyone would like to do so, better to go to "scheduled products" page.

Then how about those accidental triggers from command line via openqa-cli?

Actions #27

Updated by okurz 4 months ago

  • Description updated (diff)
  • Category set to Feature requests
  • Status changed from Feedback to New
  • Assignee deleted (okurz)

ok. I replaced the two questionable ACs with "* AC1: Accidental retriggers of complete products from test details pages are less likely"

Actions #28

Updated by okurz 4 months ago

  • Related to action #157342: Partial product re-scheduling scheduled whole build added
Actions #29

Updated by Julie_CAO 2 months ago

one more time again today:

about 4 hours ago   llzhao  scheduled   SLE 15-SP6  Online  x86_64  88.1    SLE-15-SP6-Online-x86_64-Build88.1-Media1.iso
Actions #30

Updated by xlai 2 months ago

  • Status changed from New to Workable
  • Priority changed from Low to High

Given the large impact of accidental re-triggering of whole product and not very high effort to remove the button from job details page, suggest to bump the priority and do it sooner.

@okurz Is it possible to plan into tools team's recent sprints?

Actions #31

Updated by okurz 2 months ago

  • Status changed from Workable to New
  • Priority changed from High to Normal
  • Target version changed from future to Ready

Careful when changing the fields. This ticket is not workable according to https://progress.opensuse.org/projects/qa/wiki/Tools#Definition-of-READY-for-new-features

We will look into this with high priority. Be aware though that we don't have sprints.

Removing the button might be easy but also it was added based on a certain request so it's not as easy as that. But I am certain we will be able to find a better solution

Actions #32

Updated by xlai 2 months ago

okurz wrote in #note-31:

Careful when changing the fields. This ticket is not workable according to https://progress.opensuse.org/projects/qa/wiki/Tools#Definition-of-READY-for-new-features

Oops, sorry. Which item do you think is missing for workable?

We will look into this with high priority. Be aware though that we don't have sprints.

I see.

Removing the button might be easy but also it was added based on a certain request so it's not as easy as that. But I am certain we will be able to find a better solution

Cool, thanks!
BTW, just for your information, I learned that llzhao reran the whole product, because she did not find a way to rerun only her job group. Maybe this can be considered within this feature request or a new one.

Actions #33

Updated by okurz 2 months ago

xlai wrote in #note-32:

okurz wrote in #note-31:

Careful when changing the fields. This ticket is not workable according to https://progress.opensuse.org/projects/qa/wiki/Tools#Definition-of-READY-for-new-features

Oops, sorry. Which item do you think is missing for workable?

The ticket has good details but we need the tools team to estimate the ticket and verify that we have all necessary information about the goal and starting steps

BTW, just for your information, I learned that llzhao reran the whole product, because she did not find a way to rerun only her job group. Maybe this can be considered within this feature request or a new one.

That would be related to
#157342 but "just one job group" is a kinda special request. We will have to see if it fits within what we can do with the webUI or just CLI tooling

Actions #34

Updated by livdywan 2 months ago

  • Subject changed from Restart of scheduled products is prone to retriggers by humans to Restart of scheduled products is prone to retriggers by humans size:S
  • Description updated (diff)
Actions #35

Updated by livdywan 2 months ago

  • Status changed from New to Workable
Actions #36

Updated by mkittler about 2 months ago

  • Assignee set to mkittler
Actions #37

Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress
Actions #38

Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback
Actions #39

Updated by mkittler about 2 months ago

The PR has been merged and deployed (on o3 at least). I got a "thumbs-up" reaction on the PR by @Julie_CAO.

The feedback about the buttons I got on Slack is also good. The explicitness also triggered some questions - but I guess that's a good thing. Some minor tweaks to apply (considering the Slack conversation so far):

  • Don't show "Actions:" at all if there are no actions to show. It already doesn't show if the user has not the right permissions but it shows on jobs that have already been restarted. Maybe it would make more sense to make that explicit (instead of hiding the button completely), e.g. show the restart button but disabled and with a popover stating that the job has already a clone so it cannot be restarted.
  • Maybe mention in the help popover for restating where to find the advanced options. (The "<" on the left is maybe a bit too subtle on its own.)
Actions #40

Updated by okurz about 1 month ago

  • Due date set to 2024-06-10
Actions #41

Updated by xlai about 1 month ago

Hi @mkittler ,
Thanks for the effort on this.

To me, it is much more differentiated for the two rerun buttons now. I believe it will give less chance that the rerun button is clicked by mistake. Nice improvement!

I still have some confusion here on the explanation for the button Reschedule product from here in job details page. Would you please explain the difference between it and rescheduling product tests in https://openqa.suse.de/admin/productlog?

Actions #42

Updated by mkittler about 1 month ago

@xlai Note that the "Reschedule from here" button was added for #124469. It is not just a shortcut for the button on the product log page. The requirement was to re-schedule a product only partially starting from a specific job. Only this "start job" and its child dependencies (but not parents) were supposed to be scheduled again. So that is what this feature does - as explained in the help text/icon next to the button.


I'll see whether it makes sense to implement ideas from #138593#note-39 after the bootstrap migration (because it would cause conflicts).

Actions #43

Updated by okurz about 1 month ago

  • Due date changed from 2024-06-10 to 2024-06-24
  • Status changed from Feedback to Workable

#159408 has sufficiently progressed. You can now continue here.

Actions #44

Updated by mkittler about 1 month ago

  • Status changed from Workable to Feedback

PR to implement additional improvements: https://github.com/os-autoinst/openQA/pull/5688

I would consider the ticket resolved when it has been merged.

Actions #45

Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved
Actions #46

Updated by okurz about 1 month ago

  • Due date deleted (2024-06-24)
  • Parent task changed from #154768 to #157345
Actions #47

Updated by xlai about 1 month ago

  • Status changed from Resolved to Workable

@okurz @mkittler Hi guys, given the explanation in https://progress.opensuse.org/issues/138593#note-42, it seems there was misunderstanding before in the discussion. It turns out that clicking the "wrong" rerun button in job detail page, would not have caused so many large scale tests, eg those actions of rerunning HPC job group, but in the end virtualization jobs were all rerun, which had no dependency to HPC jobs at all. Then it seems that the most "wrong" product rerun actually came from cmd openqa-cli.

Would you please have a look at comment https://progress.opensuse.org/issues/138593#note-10 and https://progress.opensuse.org/issues/138593#note-20, and help find some way to decrease such accidental triggers from command line via openqa-cli, eg by adding some warnings and confirmation before proceeding?

Sorry for the late request due to misunderstanding before. But this is crucial to avoid accidental product trigger in future.

Actions #48

Updated by mkittler about 1 month ago

What was the openqa-cli command that was invoked accidentally? And what was the command that should have been invoked instead? With that information we could think how to improve openqa-cli. However, we should keep in mind that breaking compatibility of openqa-cli is probably a bad idea as well. Maybe we can create a new user friendly sub command for whatever the use case was or add a --dry-run feature to the existing command and advertise its use.

Actions #49

Updated by xlai about 1 month ago

mkittler wrote in #note-48:

What was the openqa-cli command that was invoked accidentally? And what was the command that should have been invoked instead? With that information we could think how to improve openqa-cli. However, we should keep in mind that breaking compatibility of openqa-cli is probably a bad idea as well. Maybe we can create a new user friendly sub command for whatever the use case was or add a --dry-run feature to the existing command and advertise its use.

@ybonatakis Would you please provide some input for above based on your #note-20?

ybonatakis wrote in #note-20:

I am new to the discussion but as a user who have fell into the mistake of trigger incidentally jobs for other job groups I want to say that I disagree with the suggestion of an admin permission for that. Usually the problem is the misuse or misunderstanding of pressing a button or of openqa-cli. Instead, I would like to have a prompt and a warning before the execution.
For example when i do an post isos from the console, and if this has a results to schedule a whole job group, I would expect a prompt which shows me what it is going to schedule and ask for confirmation to continue.

Actions #50

Updated by xlai about 1 month ago ยท Edited

@mkittler
This morning, there is one more problematic product rerun from job details page via Reschedule product from here button. To me, it seems a bug of the button that it does not only trigger CHILD jobs, but only jobs without dependency at all.

Rescheduled product link is https://openqa.suse.de/admin/productlog?id=2180274. From it, you can see overall 974 jobs got rerun.

The owner (@tinawang123 ) did not remember exactly the job that she clicked the button. But she confirmed below:

More background info:

  • why she clicks that button Reschedule product from here: she updated some settings, then click the Restart this and dependent jobs won't have the updated setting, so she clicked it

Maybe this is one of the main root causes for the initial problem -- virtualization jobs got rerun multiple times? Please help have a look. Thanks.

Actions #51

Updated by mkittler about 1 month ago

Ok, that's useful information. The number of re-triggered jobs should indeed not exceed 900 in that case. I will import a production database dumb and see whether I can reproduce the problem.

Note that she was also generally correct to click the button to make use of updated settings from tables like test suites. This aspect was one of the reasons why we initially added the feature. It might be helpful to know what settings have changed. If dependency-related settings have changed this might make a difference.

Actions #52

Updated by xlai about 1 month ago

mkittler wrote in #note-51:

Ok, that's useful information. The number of re-triggered jobs should indeed not exceed 900 in that case. I will import a production database dumb and see whether I can reproduce the problem.

Glad to know that. Thank you!

Note that she was also generally correct to click the button to make use of updated settings from tables like test suites. This aspect was one of the reasons why we initially added the feature. It might be helpful to know what settings have changed. If dependency-related settings have changed this might make a difference.

I asked @tinawang123 , she said she only added this setting START_AFTER_TEST: 'create_hdd_textmode', no other setting change.

Actions #53

Updated by mkittler 30 days ago

  • Status changed from Workable to In Progress

I was able to reproduce the problem, see https://github.com/os-autoinst/openQA/pull/5706. I also mentioned in that PR that there are unfortunately more problems with the partial rescheduling.

Actions #54

Updated by openqa_review 29 days ago

  • Due date set to 2024-07-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #55

Updated by mkittler 29 days ago

  • Status changed from In Progress to Feedback

PR to fix dependencies: https://github.com/os-autoinst/openQA/pull/5707

After that is merged and deployed I would consider this issue resolved.

Actions #56

Updated by xlai 28 days ago

Ok, that's useful information. The number of re-triggered jobs should indeed not exceed 900 in that case. I will import a production database dumb and see whether I can reproduce the problem.

@mkittler Thanks for the fixes. May I ask for some data? With your above 2 fixes, how many jobs shall actually be rerun for the scene that the problem was triggered?

Actions #57

Updated by okurz 28 days ago

@xlai I can't directly answer your question but as of today https://github.com/os-autoinst/openQA/pull/5707 is deployed also on OSD. Changelog notification message in https://mailman.suse.de/mlarch/SuSE/openqa/2024/openqa.2024.06/msg00014.html . If you are using the different retriggering features maybe you will observe the change in behaviour and can verify that it's working better now.

Actions #58

Updated by mkittler 28 days ago

@mkittler Thanks for the fixes. May I ask for some data? With your above 2 fixes, how many jobs shall actually be rerun for the scene that the problem was triggered?

When I re-scheduled https://openqa.suse.de/tests/14611144#dependencies locally (after importing the OSD production database locally) only that exact job was re-scheduled. But in general it really depends on from where you do the re-scheduling (hence the button is on a concrete job details page) and what test suites are defined at this time. So for instance I also re-scheduled https://openqa.suse.de/tests/14610658#dependencies. This job has children which also have children and parallel dependencies so this entire sub-tree was re-scheduled (16 jobs, see attached screenshot).

Another side-note from the behavior I noticed when extending unit tests: The algorithm never goes up the dependency chain. So if e.g. job C is supposed to start after A and B (but there is no dependency between A and B; job C simply has multiple chained parents) and you re-trigger from A you get A and C but not B. (And if you re-trigger from B you get B and C but not A.)

I hope now that the actual behavior is in-line with the documentation the documentation also makes more sense.

Actions #59

Updated by xlai 27 days ago

Thank you all. I will follow the new deployed fixes for a while.

Actions #60

Updated by mkittler 27 days ago

  • Related to action #162608: Rescheduling product from the webUI adds invalid settings added
Actions #61

Updated by livdywan 15 days ago

  • Due date deleted (2024-07-02)
  • Priority changed from Normal to Low

Let's give it more time. Not in a rush.

Actions #62

Updated by okurz 15 days ago

  • Status changed from Feedback to Resolved
  • Priority changed from Low to Normal

I don't think we should wait until SLE15SP7 builds are in a usable enough state to further verify this. We can assume that this works fine using the current verification and having automated tests. And as xlai stated she will be able to comment later at any time in case of unexpected issues still rising up.

Actions

Also available in: Atom PDF