action #122311

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver

Added by okurz 3 months ago. Updated about 1 month ago.

Status: Feedback
Priority: Normal
Assignee:
Target version:
Start date: 2022-12-21
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #97118#note-10. qem-dashboard seems to have sometimes inconsistent results. Maybe instead of relying on the content in the dashboard database we should change places like https://github.com/openSUSE/qem-bot/blob/a3701ce5b9874f3552cf6bd2c98ae5a52963ab49/openqabot/approver.py#L101 to look at the most recent results in openQA directly.


Related issues

Related to QA - action #123286: Bot and dashboard reference to wrong data and block update approval size:M (Resolved, 2022-12-21)

Copied from QA - action #122308: Handle invalid openQA job references in qem-dashboard size:M (Resolved, 2022-12-21)

History

#1 Updated by okurz 3 months ago

  • Copied from action #122308: Handle invalid openQA job references in qem-dashboard size:M added

#2 Updated by mgrifalconi 3 months ago

  • Assignee set to mgrifalconi

#3 Updated by mgrifalconi 2 months ago

  • Status changed from New to In Progress

A status update after trying to learn how the bot and dashboard work and playing around with the code a bit.

This is what I would consider a dangerous change/refactor, since this tool handles the approval of updates shipped to our customers and we currently don't have a staging environment.

Considering that, I propose to develop a module with the new feature, controlled by three settings:

  • dry run and log: test out the new way of handling things, compare performance, and make sure the same update requests are approved/rejected
  • emergency shutoff: quickly disable the new code without needing PR approvals/fixes, minimizing the impact on production
  • enabled: switch over to the new system; eventually this would become the default, the switch would go away, and the old method would be removed from the code
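The three settings above can be sketched as a small feature flag. This is only an illustration of the rollout strategy described in the bullets; the enum name, setting values, and `decide` helper are hypothetical and not qem-bot's actual configuration interface.

```python
from enum import Enum


class LiveApproverMode(Enum):
    """Hypothetical rollout flag for the new live-data approver path."""
    OFF = "off"          # emergency shutoff: old dashboard-based path only
    DRY_RUN = "dry_run"  # run both paths, log disagreements, act on the old one
    ENABLED = "enabled"  # act on the new live-openQA path

def decide(mode, old_verdict, new_verdict, log=print):
    """Return the approval verdict to act on, per the proposed settings."""
    if mode is LiveApproverMode.ENABLED:
        return new_verdict
    if mode is LiveApproverMode.DRY_RUN and old_verdict != new_verdict:
        # Disagreements are the interesting signal during the dry-run phase.
        log(f"live-data approver disagrees: old={old_verdict} new={new_verdict}")
    return old_verdict
```

During the dry-run phase, both code paths run but only the old verdict takes effect, so any divergence shows up in the logs before it can affect production approvals.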

I uploaded the WIP code at https://github.com/michaelgrifalconi/qem-bot/commit/dc2ca5e9b03ad5f06ace62092512be00fd99b7fe, but found an issue in the meantime.

It appears difficult to change the approver logic to look only at live data.

This is the current flow for release request approvals:
(only looking at aggregate now)

In short, it's not a straightforward change, since it's not just a matter of reusing the same functions that feed data to the dashboard and consuming that data directly.

I think it would be worth separating the component that collects/processes data and executes actions (the bot) from the one that visualizes that data (the dashboard). I agree with querying data in a different way for better visualization, but I personally don't like the bot relying strongly on the dashboard, with some data processing happening in the bot and some in the dashboard.

We could argue about the performance reasons for having the bot use cached dashboard data or not (or have one option with a failover to the other), but I would prefer that the dashboard not become another business-critical component if we can avoid it, and to keep that burden/risk only on the bot script.

#4 Updated by mgrifalconi 2 months ago

  • Copied to action #123286: Bot and dashboard reference to wrong data and block update approval size:M added

#5 Updated by mgrifalconi 2 months ago

  • Copied to deleted (action #123286: Bot and dashboard reference to wrong data and block update approval size:M)

#6 Updated by mgrifalconi 2 months ago

  • Related to action #123286: Bot and dashboard reference to wrong data and block update approval size:M added

#7 Updated by mgrifalconi about 1 month ago

  • Status changed from In Progress to Feedback

With the help of the discussion here, https://suse.slack.com/archives/C02CANHLANP/p1675927786104149?thread_ts=1674639850.741499&cid=C02CANHLANP,
I got a better understanding of the bot's architecture. It seems tricky to make the bot use only live data for approvals, since there are many steps where data is fetched, stored in the dashboard, and downloaded from the dashboard again.
There is also no interest from the dashboard/bot maintainers in switching to such logic.

What we can still do is rely on dashboard data up until just before taking the approve/not-approve decision, and no earlier. This would make the logic even more complicated IMHO, because we first look at dashboard data (download live data, upload it to the dashboard, read it back from the dashboard, all of this multiple times for different kinds of data), and then at the very end double-check against openQA data.
I am not a huge fan of this approach, but it could still help in situations like the linked issue, and I see no other solution, since a refactor to use only live data is not possible for the reasons mentioned before.
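A minimal sketch of this mixed approach: keep the existing dashboard-driven pipeline unchanged, and only at the final decision point veto an approval if a fresh look at the referenced openQA jobs no longer agrees. The function name and parameters here are illustrative, not qem-bot's actual API; the result strings "passed"/"softfailed" are the openQA values that count as green.

```python
def final_approval(dashboard_verdict, live_results):
    """Mixed approach: trust the dashboard pipeline up to the decision
    point, then re-check live openQA results just before approving.

    dashboard_verdict: bool, what the existing dashboard-based logic decided
    live_results: iterable of openQA result strings for the referenced jobs,
                  fetched directly from openQA at decision time
    """
    if not dashboard_verdict:
        return False  # dashboard already says no; nothing to double-check
    # Veto the approval if any job is no longer green when re-checked live.
    return all(r in ("passed", "softfailed") for r in live_results)
```

The double-check can only turn an approval into a rejection, never the other way around, which keeps the added risk of the new code path low.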

Before continuing with that, I would like some feedback from the tools team, to make sure this mixed approach is something we can try and is worth investing some time in.
I'm also available for a call if needed, just ping me :)
