
action #104209

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

[qem] dashboard.qam.suse.de checkpoints for aggregates

Added by hurhaj 5 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: -
Target version: future
Start date: 2021-12-21
Due date: -
% Done: 0%
Estimated time: -

Description

Scenario:
An update has been in the queue for a week, but the qam-openqa review is not yet approved. Looking into d.q.s.d/blocked you can see that some aggregate job groups failed and some are still running. But checking past runs, one can see that all of them have already passed at some point since the update was added.

The problem is that currently the dashboard wants everything green at the same time before approving an update. That rarely happens (let's not get sidetracked on this claim). A solution would be for the dashboard to check each aggregate job group only until it is green for the first time, and then to disregard any later runs.


Related issues

Related to QA - action #97274: qam dashboard improvement ideas (New, 2021-06-29)

Related to QA - action #97118: enhance bot automatic approval: check multiple days (New, 2021-08-18)

History

#1 Updated by okurz 5 months ago

  • Target version set to future

A valid feature request for the future. I assume that means that some component would need to always store the highest "watermark" per incident and approve as soon as the watermark exceeds a threshold.
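A minimal sketch of what such a component could look like, in Python, assuming a simple in-memory store; the class and method names (`FirstGreenTracker`, `record_result`, `is_approvable`) are illustrative and not part of any existing dashboard or bot API:

```python
from collections import defaultdict


class FirstGreenTracker:
    """Track, per incident, which aggregate job groups have passed at
    least once since the incident entered the queue (the "watermark")."""

    def __init__(self, required_groups):
        # aggregate job groups that must each have passed once for approval
        self.required_groups = set(required_groups)
        # incident id -> set of groups that have been green at least once
        self.passed_once = defaultdict(set)

    def record_result(self, incident_id, group, result):
        # a later failure never resets an earlier pass: the watermark
        # only ever rises
        if result == "passed":
            self.passed_once[incident_id].add(group)

    def is_approvable(self, incident_id):
        # approve as soon as every required group has been green once
        return self.required_groups <= self.passed_once[incident_id]
```

The point matching the ticket's proposal is in `record_result`: any result other than "passed" is simply ignored, so once a group has been green for the first time all later runs are disregarded.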

#2 Updated by mgrifalconi 5 months ago

Hello, that's great to see more agreement on this topic, but I feel this is a duplicate of a problem already mentioned here:

#3 Updated by okurz 5 months ago

  • Related to action #97274: qam dashboard improvement ideas added

#4 Updated by okurz 5 months ago

  • Related to action #97118: enhance bot automatic approval: check multiple days added

#5 Updated by hurhaj 5 months ago

mgrifalconi wrote:

Hello, that's great to see more agreement on this topic, but I feel this is a duplicate of a problem already mentioned here:

Right, it seems like two similar solutions for the same problem. Anyway, I guess this is currently the biggest issue in openQA review. Not only are updates not being approved, but the reviewer has to check every failed and still-running job group. So I believe this feature should be planned for the near future :)

#6 Updated by okurz 5 months ago

hurhaj wrote:

So I believe this feature should be planned for the near future :)

I assume that the effort to implement this properly would be pretty high because we also need traceability of why releases have been approved. And if we just say "random jobs happened to pass at random times but not all together" then this will be hard to follow unless we save the corresponding test results from the times at which they passed. I would be surprised if handling the actual job failures in openQA is not a better solution. There are multiple ways to make "failed jobs disappear", among others (a sketch of option 5 follows below):

  1. Fix the actual test failure cause (always the preferred choice of course)
  2. Implement a workaround with a soft-failure so that other test modules are still executed
  3. Retriggering known sporadic issues, at best with auto-review
  4. Remove failing tests (or move to the development groups)
  5. Overwrite the results using http://open.qa/docs/#_overwrite_result_of_job
  6. Retry of openQA jobs based on test variables

If all of that does not work I think we got a more severe problem than needing this dashboard feature. So how can I help to make people use the above or is there something else missing we can do?

EDIT: 2021-12-22 Added https://github.com/os-autoinst/openQA/pull/4422 as an additional option
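For illustration, option 5 boils down to posting a force_result label comment on the affected job; the exact label syntax and the required operator permissions are described in the linked documentation, so treat the comment text below as an assumption. A small Python sketch using openqa-cli, with a made-up job id and ticket reference:

```python
import subprocess


def force_result(job_id, new_result, ticket):
    """Overwrite the result of an openQA job (option 5 above) by posting
    a label:force_result comment via openqa-cli."""
    # the exact comment syntax and permissions are documented at
    # http://open.qa/docs/#_overwrite_result_of_job
    subprocess.run(
        [
            "openqa-cli", "api", "-X", "POST",
            f"jobs/{job_id}/comments",
            f"text=label:force_result:{new_result} {ticket}",
        ],
        check=True,
    )


# hypothetical usage: mark a known-sporadic failure as softfailed
force_result(1234567, "softfailed", "poo#104209")
```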

#7 Updated by hurhaj 5 months ago

That's what we're doing, and we are still in the situation that made me decide to open this issue.

We're releasing around 70 updates (not packages) weekly. The situation in the repository changes too fast to hope for an ideal state. Meanwhile, I'm going through dozens of unapproved updates that have been hanging in the queue for weeks, hoping I didn't miss something.

#8 Updated by hurhaj 5 months ago

okurz wrote:

I assume that the effort to implement this properly would be pretty high

Fully aware, I'm not expecting a Christmas miracle here.

because we also need traceability of why releases have been approved

Maybe the bot could both comment in IBS with links to the passed jobs and approve?
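A sketch of what such a traceability comment could contain, assuming the dashboard knows, per aggregate job group, the id of the run that passed; the message format and names here are hypothetical, and actually posting to IBS (e.g. via the OBS API or osc) is left out:

```python
def build_approval_comment(incident_id, passed_jobs):
    """Render an IBS comment linking the openQA jobs that justified the
    approval, so the decision stays traceable after the fact."""
    lines = [f"Approved: every aggregate job group for incident "
             f"{incident_id} passed at least once:"]
    for group, job_id in sorted(passed_jobs.items()):
        # link the concrete run that was green, even if later runs failed
        lines.append(f"- {group}: https://openqa.suse.de/tests/{job_id}")
    return "\n".join(lines)


# hypothetical usage with made-up incident and job ids
print(build_approval_comment("SUSE:Maintenance:104209", {
    "SLE 15 SP3 Aggregates": 7654321,
    "SLE 15 SP2 Aggregates": 7654322,
}))
```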

#9 Updated by okurz 5 months ago

hurhaj wrote:

That's what we're doing, and we are still in the situation that made me decide to open this issue.

It feels like dozens of QA engineers in QE still don't put in the work that is needed to stabilize unstable or false-positive tests. We already try in multiple issues to address not only the technical parts but also the process-related ones, e.g. in #96543, #95479, #91649, #103656, #102197, #101355, #101187

We're releasing around 70 updates (not packages) weekly. The situation in the repository changes too fast to hope for an ideal state. Meanwhile, I'm going through dozens of unapproved updates that have been hanging in the queue for weeks, hoping I didn't miss something.

That's of course not how it should be, and it shouldn't be necessary for you to clean up the queue this way.

One other thought: Would it help to move more tests from aggregate into incident tests?

hurhaj wrote:

because we also need traceability of why releases have been approved

Maybe the bot could both comment in IBS with links to the passed jobs and approve?

That would be a list of individual jobs, as there would be no corresponding view in openQA showing the test results for incidents at the different points in time.

#10 Updated by okurz 5 months ago

  • Parent task set to #80194

#11 Updated by okurz 4 months ago

Discussed with hurhaj. We agreed that with an increasing number of products, pending maintenance updates and tests, the situation that "not all tests can ever pass at the same time" becomes more likely.

okurz wrote:

I assume that the effort to implement this properly would be pretty high because we also need traceability of why releases have been approved. And if we just say "random jobs happened to pass at random times but not all together" then this will be hard to follow unless we save the corresponding test results from the times at which they passed.

We agreed that this is a risk, although in hurhaj's experience such cases, where an after-the-fact investigation would be needed, either did not happen at all or happened so seldom that we don't need to care about them.

I would be surprised if handling the actual job failures in openQA is not a better solution. There are multiple ways to make "failed jobs disappear", among others:

  1. Fix the actual test failure cause (always the preferred choice of course)
  2. Implement a workaround with a soft-failure so that other test modules are still executed
  3. Retriggering known sporadic issues, at best with auto-review
  4. Remove failing tests (or move to the development groups)
  5. Overwrite the results using http://open.qa/docs/#_overwrite_result_of_job
  6. Retry of openQA jobs based on test variables

Of all of the above options, only 5. "Overwrite the results using http://open.qa/docs/#_overwrite_result_of_job" can be carried out without running tests again, which has the benefit that no additional waiting time is needed. I presented https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger to hurhaj, and in particular the feature to use auto-review+force-result.

According to hurhaj that would not help in the following case: update X is faulty and blocks approval of Y because aggregate tests including X+Y fail, there is no time to reject X and rerun all tests without X, or the likelihood that a newly added update Z blocks approval of Y is high. The requirement here would be similar to the one in #95479: mark a failed job as acceptable, i.e. the same as passed, but only for Y.
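A sketch of that requirement in Python, following the same first-green idea as the earlier tracker example; all class and method names are illustrative, and the point is that marking a failure as acceptable affects only the chosen incident:

```python
from collections import defaultdict


class SelectiveApprovalTracker:
    """Approval state where a failed aggregate job can be declared
    acceptable for specific incidents only (cf. #95479)."""

    def __init__(self, required_groups):
        self.required_groups = set(required_groups)
        # incident id -> groups that passed (or were accepted) at least once
        self.passed_once = defaultdict(set)

    def record_result(self, incident_id, group, result):
        if result == "passed":
            self.passed_once[incident_id].add(group)

    def mark_acceptable(self, incident_id, group):
        # treat the group's failure as a pass, but only for this incident:
        # update Y can be approved while faulty update X stays blocked by
        # the very same failed aggregate job
        self.passed_once[incident_id].add(group)

    def is_approvable(self, incident_id):
        return self.required_groups <= self.passed_once[incident_id]


# hypothetical usage: one failed aggregate run, accepted only for Y
tracker = SelectiveApprovalTracker({"aggregates"})
tracker.record_result("X", "aggregates", "failed")
tracker.record_result("Y", "aggregates", "failed")
tracker.mark_acceptable("Y", "aggregates")
assert tracker.is_approvable("Y") and not tracker.is_approvable("X")
```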
