action #138593

open

coordination #154768: [saga][ux] State-of-art user experience for openQA

Restart of scheduled products is prone to retriggers by humans

Added by waynechen55 6 months ago. Updated about 1 month ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2023-10-26
Due date:
% Done: 0%
Estimated time:

Description

Observation

I found many scheduled test suites waiting for ipmi workers, although the number of running tests (also using ipmi workers) is lower than the number of ipmi workers. This is easy to see on the openQA build page, for example https://openqa.suse.de/tests/overview?distri=sle&version=15-SP6&build=28.1&groupid=263.

Steps to reproduce

  • Will update if steps to reproduce become known

Impact

Workers are assigned to tests that were supposed to be stopped, so resources are wasted.

Problem

After investigating for a while, I found that some workers are executing tests that are not visible on the openQA build page above, for example:

All test suites were already cancelled.

But after navigating to the page of a specific worker, I found it was executing a test suite that was not shown elsewhere, for example:

Because the running job is not the latest one, it cannot be seen by just looking at the openQA job group or build page.

Let's take worker sapworker3:2 and test suite sle-15-SP6-Online-x86_64-Build28.1-uefi-gi-guest_sles12sp5-on-host_developing-kvm@64bit-ipmi-large-mem as an example. The scheduled test run https://openqa.suse.de/tests/12668997 was already cancelled and was expected to stay stopped. But the test suite was then triggered and run again as another test run, https://openqa.suse.de/tests/12664633, without being displayed on the openQA job group or build page. Although the job number 12664633 is smaller than 12668997, 12668997 had been cancelled hours earlier. I confirmed that sapworker3:2 was assigned to test 12664633 after it became idle, and at that point test 12668997 had already been cancelled for hours. Test 12664633 is only visible on the sapworker3:2 worker page, for example:

So I have to click the "working" button to the right of sapworker3:2 to discover it.

I did not manually trigger or schedule any relevant test for this issue.

Suggestions

  • Show running test run always on top if feasible
  • Better schedule and trigger logic

Workaround

n/a

Acceptance Criteria

  • AC1: Accidental retriggers of complete products from test details pages are less likely

Files

Selection_284.png (49.9 KB) Selection_284.png waynechen55, 2023-10-26 07:41
Selection_283.png (110 KB) Selection_283.png waynechen55, 2023-10-26 07:43
Selection_285.png (121 KB) Selection_285.png waynechen55, 2023-10-26 07:49
Selection_284.png (49.9 KB) Selection_284.png waynechen55, 2023-10-26 08:06
Selection_283.png (110 KB) Selection_283.png waynechen55, 2023-10-26 08:06
Selection_285.png (121 KB) Selection_285.png waynechen55, 2023-10-26 08:06
Selection_287.png (78.2 KB) Selection_287.png waynechen55, 2023-11-03 00:06
Selection_288.png (99.2 KB) Selection_288.png waynechen55, 2023-11-03 00:06
6th_rounds.png (39.8 KB) 6th_rounds.png Julie_CAO, 2023-11-03 02:21
5th_rounds.png (45 KB) 5th_rounds.png Julie_CAO, 2023-11-03 02:21
4th_rounds.png (44 KB) 4th_rounds.png Julie_CAO, 2023-11-03 02:21
3rd_rounds.png (45.4 KB) 3rd_rounds.png Julie_CAO, 2023-11-03 02:21

Related issues 2 (2 open, 0 closed)

Related to openQA Project - action #157342: Partial product re-scheduling scheduled whole build (New, 2024-03-15)
Copied to openQA Project - action #156277: [qe-core] Some jobs are retriggered by geekotest (likely obs-sync plugin) without clear answer why (New, szarate, 2023-10-26)

Actions #1

Updated by waynechen55 6 months ago

  • Subject changed from [openQA][worker][assignment][job] Some jobs triggered and executed secretly without being noticed to [openQA][worker][schedule][assignment][job] Some jobs triggered and executed secretly without being noticed
Actions #2

Updated by Julie_CAO 6 months ago

There were two rounds of product triggering:

Scheduled product 1980120

Time User Status Distri Version Flavor Arch Build ISO Actions
about 16 hours ago geekotest scheduled SLE 15-SP6 Online x86_64 28.1 SLE-15-SP6-Online-x86_64-Build28.1-Media1.iso

Scheduled product 1980232

Time User Status Distri Version Flavor Arch Build ISO Actions
about 14 hours ago geekotest scheduled SLE 15-SP6 Online x86_64 28.1 SLE-15-SP6-Online-x86_64-Build28.1-Media1.iso
Actions #3

Updated by okurz 6 months ago

  • Subject changed from [openQA][worker][schedule][assignment][job] Some jobs triggered and executed secretly without being noticed to [openQA][worker][schedule][assignment][job][qe-core] Some jobs triggered and executed secretly without being noticed

@qe-core this looks like an issue in obs-rsync-trigger. Can you take a look?

Actions #4

Updated by okurz 6 months ago

  • Target version set to QE-Core: Ready

Actions #5

Updated by waynechen55 5 months ago

Build28.1 was triggered again yesterday. This issue must be resolved ASAP.


Actions #6

Updated by waynechen55 5 months ago

  • Priority changed from Normal to High

Actions #7

Updated by Julie_CAO 5 months ago

For build 28.1, in addition to the two initial rounds, there have been 4 more rounds in these 4 days. This has exhausted a lot of resources and caused us a lot of trouble handling these tests. @xlai

3rd_rounds
4th_round
5th_round
6th_round

Actions #8

Updated by xlai 5 months ago

I raised the topic of multiple openQA triggers of the same build in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. I hope it gets more attention so that this can be avoided.

@ybonatakis @emiura I am not sure whether I am reaching the same people as the openQA accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If so, I would appreciate it if you could share why you needed to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!

@szarate Hello Santi, would you please help look into why geekotest triggered the build twice? Is there a bug to fix in the CI tool?

Actions #9

Updated by ybonatakis 5 months ago

xlai wrote in #note-8:

> I raised the topic of multiple openQA triggers of the same build in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. I hope it gets more attention so that this can be avoided.
>
> @ybonatakis @emiura I am not sure whether I am reaching the same people as the openQA accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If so, I would appreciate it if you could share why you needed to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!

This was meant to run only the HPC test group from my side. Usually it does just that. I don't know why the Virtualization tests were re-triggered.

> @szarate Hello Santi, would you please help look into why geekotest triggered the build twice? Is there a bug to fix in the CI tool?

Actions #10

Updated by xlai 5 months ago

ybonatakis wrote in #note-9:

> xlai wrote in #note-8:
>
> > I raised the topic of multiple openQA triggers of the same build in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. I hope it gets more attention so that this can be avoided.
> >
> > @ybonatakis @emiura I am not sure whether I am reaching the same people as the openQA accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If so, I would appreciate it if you could share why you needed to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!
>
> This was meant to run only the HPC test group from my side. Usually it does just that. I don't know why the Virtualization tests were re-triggered.

@ybonatakis thanks for the reply. Would you please paste your trigger command here, so that the relevant team can help find the reason? https://openqa.suse.de/admin/productlog?id=1988393 shows a long list of successfully scheduled jobs, which seems to be much more than HPC only.

BTW, did you specify "TEST=" in your command? If it is not given and only the medium information is provided, all tests bound to that medium are rerun. That has happened on OSD before by accident.
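
The "TEST=" pitfall described above could be caught by a small pre-flight check before a product post is sent. The following is only a minimal sketch in Python, assuming the parameters are given as KEY=value strings as on the command line. The helper names `parse_params` and `is_unfiltered_product_post` are hypothetical and not part of openQA or openqa-cli, and treating `TEST` as the only filter key is an assumption taken from this comment.

```python
def parse_params(args):
    """Turn ['KEY=value', ...] into a dict; entries without '=' are ignored."""
    params = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep:
            params[key] = value
    return params


def is_unfiltered_product_post(args, filter_keys=("TEST",)):
    """Return True if no non-empty filter (by default TEST=) narrows the post.

    Hypothetical helper: a True result means the post would likely schedule
    every test bound to the medium, so the caller should warn or ask first.
    """
    params = parse_params(args)
    return not any(params.get(key) for key in filter_keys)


if __name__ == "__main__":
    # Example parameters modelled on the build discussed in this ticket.
    full = ["DISTRI=sle", "VERSION=15-SP6", "FLAVOR=Online",
            "ARCH=x86_64", "BUILD=28.1"]
    print(is_unfiltered_product_post(full))
    # 'some_test_suite' is a placeholder name, not a real test suite.
    print(is_unfiltered_product_post(full + ["TEST=some_test_suite"]))
```

A wrapper script around the trigger command could call such a check and refuse to continue, or ask for confirmation, when it returns True.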

Actions #11

Updated by emiura 5 months ago

xlai wrote in #note-8:

> I raised the topic of multiple openQA triggers of the same build in https://suse.slack.com/archives/C02CANHLANP/p1698981583137759. I hope it gets more attention so that this can be avoided.
>
> @ybonatakis @emiura I am not sure whether I am reaching the same people as the openQA accounts mentioned in https://progress.opensuse.org/issues/138593#note-7. Please forgive me if not. If so, I would appreciate it if you could share why you needed to rerun such a large range of tests, including virtualization, for build 28.1. Thanks!
>
> @szarate Hello Santi, would you please help look into why geekotest triggered the build twice? Is there a bug to fix in the CI tool?

@xlai
I'm sorry for the trouble. My intention was to restart the tests to check whether the registry-related errors had been fixed. It will not happen again.

Actions #12

Updated by xlai 5 months ago

emiura wrote in #note-11:

> @xlai
> I'm sorry for the trouble. My intention was to restart the tests to check whether the registry-related errors had been fixed. It will not happen again.

No worries. Thank you! @emiura

Actions #13

Updated by waynechen55 5 months ago

For Build34.1 and Build37.1, the historical scheduled products are as below:

1992810 a day ago   epaolantonio    scheduled   SLE 15-SP6  Online  x86_64  37.1    SLE-15-SP6-Online-x86_64-Build37.1-Media1.iso   
1990407 8 days ago  geekotest   scheduled   SLE 15-SP6  Online  x86_64  34.1    SLE-15-SP6-Online-x86_64-Build34.1-Media1.iso   
Actions #14

Updated by waynechen55 5 months ago

@ybonatakis It seems you scheduled tests for Build39.1 again. All virtualization tests are running again now.

about an hour ago ybonatakis scheduled sle 15-SP6 Online x86_64 39.1 SLE-15-SP6-Online-x86_64-Build39.1-Media1.iso

Why do you need to do this?

Actions #15

Updated by xlai about 2 months ago

A similar thing happened again: individuals accidentally trigger full build tests. It was also raised and discussed in https://suse.slack.com/archives/C0564EU665S/p1709176558631769.

Actions #16

Updated by szarate about 2 months ago

  • Copied to action #156277: [qe-core] Some jobs are retriggered by geekotest (likely obs-sync plugin) without clear answer why added
Actions #17

Updated by szarate about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from [openQA][worker][schedule][assignment][job][qe-core] Some jobs triggered and executed secretly without being noticed to Restart of scheduled products is prone to retriggers by humans
  • Description updated (diff)
  • Assignee set to okurz
  • Priority changed from High to Normal
  • Target version changed from QE-Core: Ready to future
  • Parent task set to #154768

Following up on my message on Slack, in @xlai's thread:

Shall we take some actions to avoid that? E.g., forbid rerunning a scheduled product on the web UI? Or raise the access rights required for that button?
One of the things I was discussing days ago with @Oliver Kurz was the possibility of having RBAC (Role Based Access Control) in openQA, so that things like restarting a whole product are behind a specific permission, because at the moment basically everybody can.
Another possibility is fixing the UI, as mentioned by @Xiaoli Ai, but eventually, we’ll have to get there.

We can focus first on moving the restart button for scheduled products behind an admin permission and look into RBAC later.

Actions #18

Updated by szarate about 2 months ago

  • Priority changed from Normal to High
Actions #19

Updated by xlai about 2 months ago

Just for clarification:
The historical comments and the original ticket description may mix up two issues.
- In this ticket, please focus on what the ticket title says: restart of scheduled products is prone to retriggers by humans, which needs improvements in the openQA tool. Anyone helping with it, please focus only on the related comments.
- The other issue, multiple triggers by geekotest, has been moved to a new ticket, #156277.

Actions #20

Updated by ybonatakis about 2 months ago

I am new to the discussion, but as a user who has made the mistake of accidentally triggering jobs for other job groups, I want to say that I disagree with the suggestion of an admin permission for that. Usually the problem is misuse or misunderstanding of a button or of openqa-cli. Instead, I would like to have a prompt and a warning before the execution.
For example, when I do a post to isos from the console and this would result in scheduling a whole job group, I would expect a prompt which shows me what is going to be scheduled and asks for confirmation to continue.
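
The prompt suggested above could look roughly like the following. This is only a minimal sketch in Python and not existing openQA or openqa-cli behavior. It assumes the list of jobs that would be scheduled is already known to the caller as (job group, test suite) pairs, which openQA may not currently expose before scheduling. The names `summarize` and `confirm_schedule` are hypothetical.

```python
from collections import Counter


def summarize(jobs):
    """Count planned jobs per job group; `jobs` is a list of (group, test) pairs."""
    return Counter(group for group, _test in jobs)


def confirm_schedule(jobs, ask=input, out=print):
    """Show what would be scheduled and return True only on an explicit 'yes'.

    `ask` and `out` are injectable so the prompt can be tested or scripted.
    """
    counts = summarize(jobs)
    out(f"About to schedule {len(jobs)} job(s) in {len(counts)} job group(s):")
    for group, count in sorted(counts.items()):
        out(f"  {group}: {count}")
    answer = ask("Continue? Type 'yes' to proceed: ")
    # Anything other than an explicit 'yes' (including just Enter) aborts.
    return answer.strip().lower() == "yes"
```

Defaulting to "no" on empty input is a deliberate choice here: an accidental Enter should abort rather than schedule a whole build.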

Actions #21

Updated by okurz about 2 months ago

I also question both ACs:

  • "RBAC is implemented or an Epic for its implementation exists": IMHO that already exists. We already have three different levels: user, operator, admin. What do you see as missing here?
  • "Restart of scheduled products is put behind a higher level permission (Admin?)": As ybonatakis stated, that wouldn't help: we have many admins and they can still make mistakes. And others might still have a valid need to retrigger. I would like us to check why we even have that retrigger button on the job details page, as we already have one on the "scheduled products" page, where I think it shows more clearly that many jobs would be affected.
Actions #22

Updated by Julie_CAO about 1 month ago

The entire build 59.2 was retriggered again ...

about 11 hours ago  coolo   scheduled   SLE 15-SP6  Online  x86_64  59.2    SLE-15-SP6-Online-x86_64-Build59.2-Media1.iso
Actions #23

Updated by Julie_CAO about 1 month ago

One more retrigger ...

about 12 hours ago  dheidler    scheduled   SLE 15-SP6  Online  x86_64  59.2    SLE-15-SP6-Online-x86_64-Build59.2-Media1.iso
Actions #24

Updated by okurz about 1 month ago

  • Status changed from New to Feedback

No response on #138593-21, lowering prio

Actions #25

Updated by okurz about 1 month ago

  • Priority changed from High to Low
Actions #26

Updated by xlai about 1 month ago

okurz wrote in #note-24:

> No response on #138593-21, lowering prio

> I would like us to check why we even have that retrigger button on the job details page as we already have one in the "scheduled products" page where I think it shows more clearly that many jobs would be affected

I vote for this too. Removing the button for rescheduling products from the job details page may reduce such accidental triggers. Anyone who wants to reschedule a product can go to the "scheduled products" page instead.

Then what about accidental triggers from the command line via openqa-cli?

Actions #27

Updated by okurz about 1 month ago

  • Description updated (diff)
  • Category set to Feature requests
  • Status changed from Feedback to New
  • Assignee deleted (okurz)

ok. I replaced the two questionable ACs with "* AC1: Accidental retriggers of complete products from test details pages are less likely"

Actions #28

Updated by okurz about 1 month ago

  • Related to action #157342: Partial product re-scheduling scheduled whole build added