action #131279

closed

QA - coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #99306: [epic] Future improvements: Make reviewing openQA results per squad easier

[timeboxed:6h][spike solution] a single command line or openQA webUI search view to show all tests blocking an incident by squad size:S

Added by okurz 11 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

From #121246-15: "We'd need to look for all the tests that are failing for a given incident, using the same TEST_ISSUES for both, Aggregates and Incidents". So what is needed is a single command line or openQA webUI search view to show all tests blocking an incident by squad. After #117655 and #119746 we should combine both.

Suggestions

  • We can get a job for a particular incident (#117655#note-33)
    • openqa-cli api --o3 /job_settings/jobs key=*_TEST_ISSUES list_value=1234567
    • openqa-cli api --osd /job_settings/jobs key=LTSS_TEST_ISSUES list_value=20988
    • Note there is an implicit enforced limit of 20000 jobs here (see also the later suggestion)
  • We have support for group globbing (#134933#note-32)
  • "Squads" could be mapped into openQA, for example with special job settings, e.g. QE Core ensures that all their tests are triggered with _SQUAD='QE Core' so that one can filter by that
  • Explore removing the key/job ID limit, or add a way to override it (and/or make a follow-up ticket to finally introduce a trigram GIN index for fast text searching without limits on keys)
  • Maybe add an openqa-cli sub-command "blocked" for this (similar to the recently introduced monitor sub-command, make a follow-up ticket)
  • This doesn't need to be specific to squads/blocking tests (openQA itself should not know about these SUSE specific concepts)
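
The lookup behind key=*_TEST_ISSUES list_value=<id> can be sketched offline as follows. This is only an illustration of the matching idea, not the openQA implementation: the flat (job_id, key, value) tuples, the function name jobs_for_incident and the sample data are assumptions made for this example.

```python
from fnmatch import fnmatchcase


def jobs_for_incident(job_settings, key_glob, incident_id):
    """Return sorted job IDs whose matching setting lists the incident.

    job_settings is an iterable of (job_id, key, value) tuples, a
    hypothetical flat view of job settings used only for this sketch.
    """
    wanted = str(incident_id)
    job_ids = set()
    for job_id, key, value in job_settings:
        # Only consider settings whose key matches the glob, e.g. *_TEST_ISSUES
        if not fnmatchcase(key, key_glob):
            continue
        # The value is treated as a comma-separated list of incident IDs,
        # so "209880" must not match a search for "20988".
        if wanted in (item.strip() for item in value.split(",")):
            job_ids.add(job_id)
    return sorted(job_ids)


# Made-up sample data for illustration only
SAMPLE_SETTINGS = [
    (101, "OS_TEST_ISSUES", "20988,21001"),
    (102, "LTSS_TEST_ISSUES", "20988"),
    (103, "LTSS_TEST_ISSUES", "209880"),
    (104, "BUILD", "20988"),
]

print(jobs_for_incident(SAMPLE_SETTINGS, "*_TEST_ISSUES", 20988))
```

Matching whole list items rather than substrings is the important design choice here, because incident IDs can be prefixes of each other.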

Files

Screenshot_20240305_092052.png (58.1 KB), okurz, 2024-03-05 08:21

Related issues 2 (2 open, 0 closed)

Copied to openQA Project - action #156547: A single API route to show all not-ok tests blocking a SLE maintenance incident size:M (Workable)

Copied to openQA Project - action #156553: [timeboxed:10h][spike solution] openQA webUI search view to show all tests blocking an incident by squad - take 2 (Blocked, okurz)

Actions #2

Updated by okurz 8 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

Both blockers resolved

Actions #3

Updated by okurz 8 months ago

  • Target version changed from Ready to future
Actions #4

Updated by okurz 4 months ago

  • Target version changed from future to Tools - Next

With both #117655 and #119746 resolved we can continue here.

Actions #5

Updated by mgrifalconi 4 months ago · Edited

Hello! Thanks for working on these cool topics to help improve efficiency and developer experience related to update approvals!
I will leave a few thoughts here and I am happy to follow up anytime.

Feature and usage:

  • "Per review squad" is currently translated to "Filter using regex on Job Group name". I fear that something might slip through if you do not use the correct regex for your squad. I wonder if it's possible/meaningful to allow something like "Show me (all/for this incident) failures grouped by squad, with this definition of squads: Core: [core maintenance updates], SAP: [CONTAINS(SAP OR HA)], Orphaned: (all the rest)". This would highlight the job groups that nobody is looking into so that the issue can be addressed, and make it crystal clear for stakeholders (like the maintenance coordinator) not only which job group is failing but also which squad to contact.
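
A minimal sketch of such a squad definition with an "Orphaned" fallback could look like the following. The rule table, the job group names and the job dictionary keys ("id", "group") are assumptions made for this example, not an existing openQA feature.

```python
import re
from collections import defaultdict

# Hypothetical squad definitions: squad name -> regex on the job group name.
# Rules are checked in insertion order and the first match wins.
SQUAD_RULES = {
    "Core": re.compile(r"core maintenance updates", re.IGNORECASE),
    "SAP": re.compile(r"\b(SAP|HA)\b"),
}


def group_failures_by_squad(failed_jobs, rules=SQUAD_RULES):
    """Group job IDs by squad; unmatched job groups land in 'Orphaned'."""
    grouped = defaultdict(list)
    for job in failed_jobs:
        squad = "Orphaned"
        for name, pattern in rules.items():
            if pattern.search(job["group"]):
                squad = name
                break
        grouped[squad].append(job["id"])
    return dict(grouped)


# Made-up sample data for illustration only
SAMPLE_FAILED_JOBS = [
    {"id": 1, "group": "Core Maintenance Updates"},
    {"id": 2, "group": "SAP/HA Maintenance Updates"},
    {"id": 3, "group": "Kernel Maintenance Updates"},
]

print(group_failures_by_squad(SAMPLE_FAILED_JOBS))
```

The explicit "Orphaned" bucket is what makes unowned job groups visible instead of silently dropping them when no regex matches.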

API usage to also take decisions on update approvals in future, to use direct data instead of the cached qem-dashboard data.
Would like to raise 2 possible issues to be verified:

SMELT incident IDs can be reused for multiple Release Requests (RR), and the process currently uses the incident ID to tag a test that is crucial for the RR approval. The bot/dashboard combo works around this by deleting some openQA results (from the dashboard DB) to prevent issues (see https://github.com/openSUSE/qem-dashboard/pull/78/files ), but this makes the bot approval logic complex and shared between bot and dashboard code. It would be nice to switch from SMELT ID to IBS RR ID (or just add the RR on top) to resolve the issue at the origin.

RRs are not unique either, but in a different way: an RR can be revoked and then reopened (maybe with different content to test? to be checked). I know the bot recognizes (some) changes and re-triggers incident tests, but what about aggregates? Is there a chance they could be wrongly considered for the approval decision? Also, incident channels could be changed while the incident/RR combo is being tested, causing some confusion on the bot side. If this proves to be a real issue, one idea would be to make sure test results related to an older 'version' of an RR are not considered and the bot waits for new ones. Maybe also add a timestamp of the latest SMELT incident/IBS RR change to the SMELT-ID/RR combo?

Should we discuss this topic in a separate ticket? I still see these as problems relevant to this task, but maybe not tightly related since the solution would likely be implemented on the bot side. I just want to make sure this API feature can provide usable data, not mixing in results of older/irrelevant RRs.

I expect the valid argument that these are rare corner cases, but we should also consider that we are here to catch corner cases. Complex updates that get modified while being tested should get more attention, not less, IMO.

Actions #6

Updated by okurz 4 months ago

mgrifalconi wrote in #note-5:

Hello! Thanks for working on these cool topics to help improve efficiency and developer experience related to update approvals!
I will leave a few thoughts here and I am happy to follow up anytime.

Feature and usage:

  • "Per review squad" right now is translated to "Filter using regex on Job Group name".

My original idea was to use any entity that defines "Maintainer:", that can be job groups, test suites, test modules. Is that better than only job group name or do you have another better idea?

API usage to also take decisions on update approvals in future, to use direct data instead of the cached qem-dashboard data.
Would like to raise 2 possible issues to be verified: […] Should we discuss this topic on a separate ticket?

Yes, please. Better copy your text to other specific tickets to not confuse anyone wanting to pick up this ticket here which is about "shows tests blocking by squad".

Actions #7

Updated by mgrifalconi 4 months ago · Edited

Created #153886

Actions #8

Updated by okurz 3 months ago

  • Target version changed from Tools - Next to Ready
Actions #9

Updated by livdywan 3 months ago

  • Description updated (diff)
Actions #10

Updated by livdywan 3 months ago

  • Subject changed from [timeboxed:6h][spike solution] a single command line or openQA webUI search view to show all tests blocking an incident by squad to [timeboxed:6h][spike solution] a single command line or openQA webUI search view to show all tests blocking an incident by squad size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by ybonatakis 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #12

Updated by openqa_review 2 months ago

  • Due date set to 2024-03-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by ybonatakis 2 months ago

  • Status changed from In Progress to Feedback

I don't think I have anything significant to add beyond what has already been said in other tickets.

openqa-cli api --osd /job_settings/jobs key="*_TEST_ISSUES" list_value=<update_id> only returns the job IDs.
There is no way to pass a parameter to the API to filter for failed jobs.

propose: Make /job_settings/jobs/failed with some optional parameters per groupid
disadvantage: still not clear map between groupid and squad
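
What such a filter could do with job data can be sketched like this. It is an offline illustration only: the job dictionary keys ("id", "group_id", "result"), the function name and the set of results counted as ok are assumptions for this example and should be checked against the openQA version in use.

```python
# Results counted as "ok" in this sketch; everything else is not-ok.
# Assumption for illustration, verify against the openQA instance.
OK_RESULTS = frozenset({"passed", "softfailed"})


def not_ok_job_ids(jobs, group_ids=None):
    """Return IDs of jobs with a not-ok result, optionally limited to groups."""
    selected = []
    for job in jobs:
        if job["result"] in OK_RESULTS:
            continue
        # group_ids=None means "no group restriction"
        if group_ids is not None and job["group_id"] not in group_ids:
            continue
        selected.append(job["id"])
    return selected


# Made-up sample data for illustration only
SAMPLE_JOBS = [
    {"id": 1, "group_id": 10, "result": "failed"},
    {"id": 2, "group_id": 10, "result": "passed"},
    {"id": 3, "group_id": 20, "result": "incomplete"},
    {"id": 4, "group_id": 30, "result": "softfailed"},
]

print(not_ok_job_ids(SAMPLE_JOBS))
print(not_ok_job_ids(SAMPLE_JOBS, group_ids={20}))
```

Filtering on "not ok" rather than only "failed" also catches incomplete jobs, which block an incident just as much.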

The most direct approach is through http://dashboard.qam.suse.de/ and the UI. The query can be as simple as http://dashboard.qam.suse.de/blocked?group_names=hpc%2C+kernel&incident=32192
However, this URL can't be used from the CLI.
dashboard.qam.suse.de lacks an API endpoint for /blocked, which I think would be useful.
I experimented with other API calls, but they are not designed to associate incidents with failures per openQA group ID (aka squad).

for instance

curl https://dashboard.qam.suse.de/api/incidents/32192 | jq -r '[.channels]' |grep -E 'HPC|Micro'

can get all the groups which run incident tests and filter per product. But it is not possible to filter the results.

propose: extend the API with /blocked. I think it would provide everything, including the results, which would then be easier to filter with CLI commands. I guess the dashboard can be extended to provide a squad map as well, without the need to make openQA squad agnostic.
disadvantage: not suitable for O3??!

Actions #14

Updated by okurz 2 months ago

  • Status changed from Feedback to Workable

ybonatakis wrote in #note-13:

[…]
propose: Make /job_settings/jobs/failed with some optional parameters per groupid
disadvantage: still not clear map between groupid and squad

OK, try that. About "still not clear map between groupid and squad": either think of a proposal and ask review squad members for feedback on it, or ask them beforehand what they think might work for the mapping.

The most direct approach is through http://dashboard.qam.suse.de/ and the UI. The query can be as simple as http://dashboard.qam.suse.de/blocked?group_names=hpc%2C+kernel&incident=32192
[…]
disadvantage: not suitable for O3??!

Exactly. One more reason why this ticket is about "openQA-only", so don't extend dashboard.qam.suse.de but make openQA work

Actions #15

Updated by ybonatakis 2 months ago

Actually, I was wrong in #note-13 about the lack of a dashboard API. There is an API call

http://dashboard.qam.suse.de/app/api/blocked

which provides all the required metadata like failed, groupid, etc.

Actions #16

Updated by ybonatakis 2 months ago

  • Status changed from Workable to In Progress
Actions #17

Updated by ybonatakis 2 months ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5497
just a very basic attempt, but I expect a lot of discussion

Actions #18

Updated by okurz 2 months ago

  • Copied to action #156547: A single API route to show all not-ok tests blocking a SLE maintenance incident size:M added
Actions #19

Updated by okurz 2 months ago

  • Copied to action #156553: [timeboxed:10h][spike solution] openQA webUI search view to show all tests blocking an incident by squad - take 2 added
Actions #20

Updated by okurz 2 months ago

  • Due date deleted (2024-03-12)
  • Status changed from Feedback to Resolved

We can wait a bit longer for discussion around that PR. As follow-ups I created #156547 and #156553 so we can resolve here. Thank you for your work.

Actions #22

Updated by ybonatakis 2 months ago

An idea to bring the concept of squads into OSD

Add another property to the job group properties, e.g. named owner, and maybe add another table in the DB to associate owners with job_groups.

I am against the special job settings, e.g. _SQUAD='QE Core', as a cheap solution. Job settings can easily get messy and they will require more maintenance in the long term IMO.

Actions #23

Updated by okurz 2 months ago

ybonatakis wrote in #note-22:

An idea to bring the concept of squads into OSD

Add another property to the job group properties, e.g. named owner, and maybe add another table in the DB to associate owners with job_groups.

We should be careful with such an approach: Most openQA instances do not need ownership in the job groups, so I don't see a widespread benefit in adding another table for that. Also, as noted in #131279-6, ownership can apply to more than just job groups, e.g. parent job groups (those are separate from job groups), test suites, test modules, maybe also based on product, machine, etc. So if we had a table for job group ownership, there would still be a need to declare ownership in other areas.
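
To illustrate why a job-group-only table would not be enough, here is a minimal sketch of resolving an owner when ownership may be declared on several kinds of entities. The entity kinds come from the paragraph above; the precedence order, the function name and the "unowned" default are purely assumptions for this example.

```python
# Hypothetical precedence: the most specific declaration wins.
PRECEDENCE = ("test_module", "test_suite", "job_group", "parent_group")


def resolve_owner(declared, default="unowned"):
    """Pick the owner from the most specific entity that declares one.

    declared maps an entity kind (see PRECEDENCE) to an owner name.
    """
    for kind in PRECEDENCE:
        owner = declared.get(kind)
        if owner:
            return owner
    return default


print(resolve_owner({"job_group": "QE Core", "test_module": "QE Kernel"}))
print(resolve_owner({"parent_group": "QE SAP"}))
print(resolve_owner({}))
```

Whatever storage is chosen (job settings, group properties or a separate table), some lookup of this shape is needed as soon as more than one kind of entity can carry an owner.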

I am against the special job settings, e.g. _SQUAD='QE Core', as a cheap solution. Job settings can easily get messy and they will require more maintenance in the long term IMO.

Well, job settings and job group ownership need to be maintained either way. And having another property also needs maintenance so I don't see that point.

@ybonatakis as you reopened the ticket: What do we need to do to resolve?

Actions #24

Updated by ybonatakis 2 months ago

okurz wrote in #note-23:

ybonatakis wrote in #note-22:

An idea to bring the concept of squads into OSD

Add another property to the job group properties, e.g. named owner, and maybe add another table in the DB to associate owners with job_groups.

We should be careful with such an approach: Most openQA instances do not need ownership in the job groups, so I don't see a widespread benefit in adding another table for that. Also, as noted in #131279-6, ownership can apply to more than just job groups, e.g. parent job groups (those are separate from job groups), test suites, test modules, maybe also based on product, machine, etc. So if we had a table for job group ownership, there would still be a need to declare ownership in other areas.

Most of the job groups already represent some sort of ownership in the YAML definitions.
In any case I would like to discuss it. Maybe send an email out to get a feeling for what people think.

I am against the special job settings, e.g. _SQUAD='QE Core', as a cheap solution. Job settings can easily get messy and they will require more maintenance in the long term IMO.

Well, job settings and job group ownership need to be maintained either way. And having another property also needs maintenance so I don't see that point.

We do not touch those properties often. And it is as simple as a single edit in the UI, as opposed to the various ways squads construct the YAML files.

@ybonatakis as you reopened the ticket: What do we need to do to resolve?

I didn't reopen it. It remains in Feedback. I consider it done, although I kept working on it yesterday as I hadn't noticed that you had changed the status.
I tried to adjust the PR according to the comments but I did not make much progress anyway.

Actions #25

Updated by okurz 2 months ago

ybonatakis wrote in #note-24:

okurz wrote in #note-23:

ybonatakis wrote in #note-22:

An idea to bring the concept of squads into OSD

Add another property to the job group properties, e.g. named owner, and maybe add another table in the DB to associate owners with job_groups.

We should be careful with such an approach: Most openQA instances do not need ownership in the job groups, so I don't see a widespread benefit in adding another table for that. Also, as noted in #131279-6, ownership can apply to more than just job groups, e.g. parent job groups (those are separate from job groups), test suites, test modules, maybe also based on product, machine, etc. So if we had a table for job group ownership, there would still be a need to declare ownership in other areas.

Most of the job groups already represent some sort of ownership in the YAML definitions.
In any case I would like to discuss it. Maybe send an email out to get a feeling for what people think.

Sure, we will follow-up. I created #156631 for a new concept which you might consider cleaner than _SQUAD=foo :)

@ybonatakis as you reopened the ticket: What do we need to do to resolve?

I didn't reopen it. It remains in Feedback. I consider it done, although I kept working on it yesterday as I hadn't noticed that you had changed the status.

You did reopen it, see screenshot. Maybe not intentionally.
Screenshot_20240305_092052.png
