action #122311
open coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release
Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver
Description
Motivation
See #97118#note-10. qem-dashboard sometimes seems to have inconsistent results. Instead of relying on the content of the dashboard database, maybe we should change places like https://github.com/openSUSE/qem-bot/blob/a3701ce5b9874f3552cf6bd2c98ae5a52963ab49/openqabot/approver.py#L101 to look at the most recent results in openQA directly.
Updated by okurz about 2 years ago
- Copied from action #122308: Handle invalid openQA job references in qem-dashboard size:M added
Updated by mgrifalconi almost 2 years ago
- Status changed from New to In Progress
Some status update after trying to learn how bot/dashboard work and playing around with the code a bit.
This is what I would consider a dangerous change/refactor, since this tool is handling approval of updates to our customers and we currently don't have a staging environment.
Considering that, I propose to develop a module with the new feature, with 3 settings:
- dry run and log: test out the new way of handling things, compare performance, and make sure the same update requests are approved/rejected
- emergency shutoff: quickly disables the new code without needing PR approvals/fixes, minimizing the impact on production
- enabled: switch over to the new system; eventually this would become the default, the switch would go away, and the old method would be removed from the code
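The three-mode rollout described above could be sketched as a small feature flag. A minimal sketch, assuming hypothetical names (ApproverMode, decide_approval, and the mode strings are illustrative, not from the actual qem-bot code):

```python
from enum import Enum


class ApproverMode(Enum):
    """Rollout modes for the new live-data approver (names are illustrative)."""
    DISABLED = "disabled"  # emergency shutoff: old dashboard-based logic only
    DRY_RUN = "dry_run"    # run new logic, log disagreements, act on old result
    ENABLED = "enabled"    # new live-data logic decides approvals


def decide_approval(mode: ApproverMode, old_result: bool, new_result: bool) -> bool:
    """Pick which result takes effect; in dry-run mode, only log mismatches."""
    if mode is ApproverMode.ENABLED:
        return new_result
    if mode is ApproverMode.DRY_RUN and old_result != new_result:
        print(f"dry-run mismatch: dashboard={old_result} live={new_result}")
    return old_result
```

In dry-run mode both code paths run against production data, so the two decisions can be compared for every update request before flipping the switch.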
I uploaded the WIP code at https://github.com/michaelgrifalconi/qem-bot/commit/dc2ca5e9b03ad5f06ace62092512be00fd99b7fe but found an issue in the meantime.
It appears difficult to change the approver logic to only look at live real data.
This is the current flow for release request approvals:
(only looking at aggregate now)
- _approvable calls get_incidents_approver to query the dashboard about the current update (incident and release request numbers)
  - easy to replicate with live data
- _approvable then calls get_aggregate_settings to query the dashboard for data that does not come directly from openQA/SMELT
  - not easy, since the raw data received from openQA and the data from the dashboard are different: the dashboard processes the data as soon as it arrives, see https://github.com/openSUSE/qem-dashboard/blob/ebfeada7f6198ffc109ff8eb34a90ad8f49bd572/lib/Dashboard/Model/Incidents.pm#L252-L287
- based on that data, get_incident_result queries the dashboard once again to get the test results
  - not yet looked into
In short, it's not a straightforward change, since it's not just about reusing the same functions that feed data to the dashboard and using that data directly.
I think it would be worth separating what collects/processes data and executes actions (the bot) from what visualizes that data (the dashboard). I agree with querying the data differently for better visualization, but I personally don't like the bot strongly relying on it, with some data processing happening in the bot and some in the dashboard.
We could argue about the performance reasons for having the bot use cached dashboard data or not (or having one option with a failover to the other), but I would prefer the dashboard not to become another business-critical component if we can avoid it, and to keep that burden/risk only on the bot script.
Updated by mgrifalconi almost 2 years ago
- Copied to action #123286: Bot and dashboard reference to wrong data and block update approval size:M added
Updated by mgrifalconi almost 2 years ago
- Copied to deleted (action #123286: Bot and dashboard reference to wrong data and block update approval size:M)
Updated by mgrifalconi almost 2 years ago
- Related to action #123286: Bot and dashboard reference to wrong data and block update approval size:M added
Updated by mgrifalconi almost 2 years ago
- Status changed from In Progress to Feedback
With the help of the discussion here, https://suse.slack.com/archives/C02CANHLANP/p1675927786104149?thread_ts=1674639850.741499&cid=C02CANHLANP
I got a better understanding of the architecture of the bot. It seems tricky to make the bot use only live data for approvals, since there are many steps where data is fetched, stored in the dashboard, and downloaded from the dashboard again.
There is also no interest from the dashboard/bot maintainers in switching to such logic.
What we can still do is rely on dashboard data up until just before taking the approve/not-approve decision, and only then double-check against live openQA data. IMHO this would make the logic even more complicated, because we first look at dashboard data (download live data, upload it to the dashboard, read it back from the dashboard; all of this multiple times for different kinds of data), and only at the very end double-check against openQA.
I am not a huge fan of this approach, but it could still help in situations like the linked issue, and I see no other solution, since a refactor to use only live data is not possible for the reasons mentioned above.
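The mixed approach above would keep the existing dashboard-driven pipeline and only bolt on a final verification step, roughly like this sketch (function names are hypothetical; fetch_live_result stands in for a direct openQA query):

```python
def approve(dashboard_jobs: dict, fetch_live_result) -> bool:
    """Approve only if the dashboard says all jobs passed AND a final
    live check against openQA confirms it (the "double check" step).

    dashboard_jobs maps openQA job IDs to the results cached in the
    qem-dashboard database; fetch_live_result(job_id) queries openQA.
    """
    # Step 1: existing logic, based on (possibly stale) dashboard data.
    if not all(result == "passed" for result in dashboard_jobs.values()):
        return False
    # Step 2: new logic, re-check each job's latest result live in openQA
    # just before the approve/not-approve decision.
    return all(fetch_live_result(job_id) == "passed" for job_id in dashboard_jobs)
```

The live check can only veto an approval, never grant one the dashboard would have rejected, which keeps the blast radius of the new code small.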
Before continuing with that, I would like some feedback from the Tools team, to make sure this mixed approach is something we can try and is worth investing some time in.
Also available for a call if needed, just ping me :)