action #180809: [spike][timeboxed:20h] qem-bot+qem-dashboard reading from/writing to gitea instead of smelt/IBS size:M - QA (public) - openSUSE Project Management Tool

action #180809

open

openQA Project (public) - coordination #180626: [saga][epic] Support switch to gitea for openSUSE/SUSE based products, e.g. SLE16

[spike][timeboxed:20h] qem-bot+qem-dashboard reading from/writing to gitea instead of smelt/IBS size:M

Added by okurz about 1 month ago. Updated 1 day ago.

Status:

Feedback

Priority:

High

Assignee:

Target version:

openQA Project (public) - Ready

Start date:

2025-04-10

Due date:

2025-05-10 (Due in 0 days)

% Done:

0%

Estimated time:

Description

Motivation¶

As explained in the parent #180806, new maintenance workflows are planned based on the Gitea instance at https://src.suse.de. SMELT is planned to be adapted as well but is not there yet, see https://gitlab.suse.de/tools/smash/-/issues/1458 and https://gitlab.suse.de/tools/smelt/-/issues/1147 . Still, the plan is to have a first "dry run" of the SLE 16 maintenance workflow ready by 2025-07-08, as decided as part of #180602. As QE engineers so far rely on qem-dashboard and qem-bot features like "approvable_for" comments, we should find out whether and how an adaptation of qem-bot+qem-dashboard to such a workflow is feasible.

Goals¶

G1: We know how it is feasible to use qem-bot+qem-dashboard to trigger+monitor+approve SLE 16 maintenance updates based on src.suse.de w/o SMELT
G2: Tickets with specific tasks for necessary adaptations exist
G3: We know the impact on other plans regarding qembot/dashboard features

Suggestions¶

Grab the code and prepare a proof of concept relying on gitea

Related issues 2 (2 open — 0 closed)

#1

Updated by okurz about 1 month ago

Copied to action #180812: qem-bot+qem-dashboard reading from/writing to gitea instead of IBS added

#2

Updated by okurz about 1 month ago

Target version changed from future to Tools - Next

#3

Updated by okurz 29 days ago

Priority changed from Normal to High
Target version changed from Tools - Next to Ready

Now more important so that we can proceed with https://jira.suse.com/browse/MSQA-960

#4

Updated by livdywan 25 days ago

This is expected to exceed our SLOs by tomorrow. But maybe we should really estimate it before someone starts working on it, to be sure what we want here. So ideally we can estimate it tomorrow and see that someone can also pick it up.

#5

Updated by livdywan 23 days ago

Subject changed from [spike][timeboxed:20h] qem-bot+qem-dashboard reading from/writing to gitea instead of smelt/IBS to [spike][timeboxed:20h] qem-bot+qem-dashboard reading from/writing to gitea instead of smelt/IBS size:S
Description updated (diff)
Status changed from New to Workable

#6

Updated by okurz 23 days ago

okurz wrote:

* Clarify with @okurz what is meant by G3, which previously mentioned "sibling upgrade tasks".

Sorry, I think this was a leftover from when I copied from a ticket about "Leap 16.0 upgrade"

#7

Updated by livdywan 16 days ago

Description updated (diff)

* Clarify with @okurz what is meant by G3, which previously mentioned "sibling upgrade tasks".
Sorry, I think this was a leftover from when I copied from a ticket about "Leap 16.0 upgrade"

Ack

#8

Updated by mkittler 15 days ago

Assignee set to mkittler

#9

Updated by mkittler 15 days ago

Subject changed from [spike][timeboxed:20h] qem-bot+qem-dashboard reading from/writing to gitea instead of smelt/IBS size:S to [spike][timeboxed:20h] qem-bot+qem-dashboard reading from/writing to gitea instead of smelt/IBS size:M
Status changed from Workable to In Progress

When we estimated #180632 we said the size must be M because of the timebox duration. I think the same counts for this ticket then.

#10

Updated by mkittler 15 days ago · Edited

The qem-dashboard uses:

rabbit.suse.de
- for messages from OSD so that doesn't need to change
build.opensuse.org (or probably rather build.suse.de in the actual production config), smelt.suse.de
- to display the following links to those pages on the web UI but not for any requests
  - ${this.appConfig.obsUrl}/request/show/${this.incident.rr_number}
  - ${this.appConfig.smeltUrl}/incident/${this.incident.number}

The incident number and "rr_number" are database fields.

So on the dashboard we probably need to extend the database schema and accompanying API routes to store links to Gitea PRs instead. Then we would add this.appConfig.gitUrl for src.suse.de and could render URLs to Gitea PRs similarly to how we render URLs to OBS and SMELT now.
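
The link rendering described above can be sketched as follows. This is only an illustration in Python (the dashboard frontend itself is not Python): the config key gitUrl is the name proposed above, while the gitea_pr field and its owner/repo/number keys are hypothetical and not part of the current schema. The PR URL shape follows existing links such as https://src.suse.de/TEST_SLFO/mini-SLFO/pulls/10.

```python
from urllib.parse import quote


def gitea_pr_url(git_url: str, owner: str, repo: str, pr_number: int) -> str:
    """Build the web URL of a Gitea pull request."""
    base = git_url.rstrip("/")
    return f"{base}/{quote(owner)}/{quote(repo)}/pulls/{pr_number}"


def incident_links(config: dict, incident: dict) -> dict:
    """Return the links to show for an incident.

    "gitea_pr" is a hypothetical field holding owner, repo and number of
    the PR; incidents without it keep the existing OBS/SMELT links.
    """
    links = {}
    pr = incident.get("gitea_pr")
    if pr is not None:
        links["gitea"] = gitea_pr_url(
            config["gitUrl"], pr["owner"], pr["repo"], pr["number"]
        )
        return links
    if incident.get("rr_number") is not None:
        links["obs"] = f"{config['obsUrl']}/request/show/{incident['rr_number']}"
    links["smelt"] = f"{config['smeltUrl']}/incident/{incident['number']}"
    return links
```

Keeping the OBS/SMELT branch untouched would let both kinds of incidents coexist on the dashboard while the old workflow is still in use.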

Maybe it would make sense to wait for #156553 until we invest more effort into the dashboard here. If #156553 shows that we can move the dashboard functionality into openQA we can simply remove the dashboard. Of course we still need to make sure cross-referencing in openQA to Gitea is as accessible as cross-referencing is right now on the dashboard to OBS/SMELT.

We need to keep the priority-based view in mind that is currently only visible on the qem-dashboard.

#11

Updated by mkittler 15 days ago · Edited

qem-bot uses everything defined here: https://github.com/openSUSE/qem-bot/blob/master/openqabot/__init__.py

We have the following sub-commands:

full-run Full schedule for Maintenance Incidents in openQA
- Combination of incidents-run and updates-run.
- Should be extended if we introduce more …-run sub-commands.
incidents-run Incidents only schedule for Maintenance Incidents in openQA
- Schedules openQA jobs for "incidents" on SMELT/IBS.
- Should stay as-is for compatibility but we probably need something similar for PRs on Gitea like this dummy PR.
- Related Jira ticket: https://jira.suse.com/browse/MSQA-949
- It would make sense to trigger tests via a Gitea webhook, see notes on inc-approve. That would have the advantage that we wouldn't need an incidents-run command anymore. However, the complexity of knowing what tests to schedule would then need to be covered by openQA itself.
- Otherwise we could keep this kind of command. This would have the advantage that we might be able to reuse the existing approach for knowing what tests need to be scheduled. We could still specify the Gitea PR when triggering tests so openQA can at least report results back directly.
  - For this we would need to extend openQA so it can report back results to a PR even though tests were not triggered by a webhook. We could probably still reuse existing code.
  - The bot cannot provide an API for webhooks. So we'd probably go through PRs on Gitea when incidents-run is invoked.
updates-run updates only schedule for Maintenance Incidents in openQA
- Schedules openQA jobs for "aggregates" on SMELT/IBS.
- Should stay as-is for compatibility. Is there an equivalent in the new workflow?
smelt-sync Sync data from SMELT into QEM Dashboard
- The upstream description is self-explanatory.
- We could skip this if we leave the qem-dashboard completely out in favor of #156553. Otherwise we probably need to sync the status from Gitea to the dashboard after implementing what I mentioned in #note-10.
inc-approve Approve incidents which passed tests
- Approves SR for incidents on OBS when openQA tests pass.
- We need something similar for PRs on Gitea when openQA tests pass.
- To represent the state of openQA tests on Gitea it would make more sense to show them as CI integration status. This way the openQA job status/result would be immediately and in any case visible on Gitea and a "Details" button would lead to some page on openQA with the relevant jobs. We might still need to add a formal review via some bot account.
  - We already have code on the openQA-side to report the test status back to GitHub using the GitHub statuses API. Gitea also supports statuses with an almost identical API endpoint. So we could extend this existing support within openQA with minimal effort and avoid extending the qem-bot.
    - In order to make use of this we just need to schedule tests in our version of incidents-run for Gitea using api/v1/webhooks and not api/v1/isos. This route would need to be extended to support Gitea webhooks/PRs in addition to GitHub webhooks/PRs.
    - openQA would need to be extended to leave a formal review on Gitea if we want that (instead of just giving test results as a CI status check).
    - openQA can't just start testing on PR creation, though. It needs to wait until OBS has finished building. So far OBS writes a comment on Gitea which we could react to. However, there are probably better alternatives:
      - OBS could use the Gitea statuses API. It can already do that as part of its scm integration. Then openQA could read this status and trigger tests if the OBS status has passed.
      - OBS could provide a webhook on its own that would tell openQA to start testing for a certain PR.
    - Of course this leads to the question of how openQA would know what tests to schedule and run for a certain PR. Here the scenario definitions would preferably come into play as an existing openQA upstream feature.
      - If scenario definitions are not dynamic enough we might need to extend them.
      - If we still want/need an additional bot to figure out what needs to be scheduled this bot could specify the Gitea PR when triggering tests so openQA can at least report results back directly (and we can still avoid an inc-approve command).
      - Considering https://jira.suse.com/browse/MSQA-955 is still unresolved this point still needs clarification.
        
        So far an extra "meta-data" repository is used with one YAML file per product, e.g. https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/sles15sp7.yml. This is packaged on IBS where also a container is built for use on GitLab.
        
        Maybe it would be possible to create a scenario definitions file for SLE 16 instead. This would have the advantage that it would be directly consumable by openQA and its webhook/PR feature. So when we know what we need/want to test I would recommend to use scenario definitions and see how far we'll come with that. This should be done together with the maintenance team which will supposedly be the team responsible for creating the "test plan".
      - Assets are so far mostly just repo assets provided by IBS directly. This is the best case for us as we would likely not need many additions on the openQA-side then.
        
        Example: LTSS_TEST_REPOS=http://%REPO_MIRROR_HOST%/ibs/SUSE:/Maintenance:/37799/SUSE_Updates_SLE-Product-SLES_15-SP4-LTSS_x86_64
        
        There is also https://gitlab.suse.de/openqa/openqa-trigger-from-ibs-plugin, though. If this will still be required we'll have to see how we can integrate it into the PR/webhook-based workflow. Maybe we can also replace this kind of synchronization with the asset download feature of openQA which is probably already usable via the PR/webhook-based workflow.
inc-comment Comment incidents in BuildService
- Comments on incidents on IBS.
- We probably don't need this for Gitea if we implement approving as mentioned under inc-approve.
- It is disabled on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules anyway.
inc-sync-results Sync results of openQA incidents jobs to Dashboard
- I think this is about the qem-dashboard.
- We could skip this if we leave the qem-dashboard completely out in favor of #156553. Otherwise we probably only need to change details as this involves only openQA and the qem-dashboard which both would not change.
aggr-sync-results Sync results of openQA aggregates jobs to Dashboard
- I think this is about the qem-dashboard.
- We could skip this if we leave the qem-dashboard completely out in favor of #156553. Otherwise we probably only need to change details as this involves only openQA and the qem-dashboard which both would not change.
amqp AMQP listener daemon
- We can live without it as it is disabled on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules anyway (and as far as I know also wasn't successfully deployed anywhere else).
- Reads suse.openqa.job.done AMQP messages which would not need to change.
- Approves incidents using the same code as inc-approve so it can be extended along with that sub-command.
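
As a rough sketch of the "CI status" idea from the inc-approve notes, the following builds a request against the Gitea commit status endpoint (POST /repos/{owner}/{repo}/statuses/{sha}). The mapping from openQA results to status states, the context name "openqa" and all parameter names are assumptions made for illustration; this is not existing qem-bot or openQA code.

```python
import json
import urllib.request

# Assumed mapping from openQA job results to Gitea commit status states.
STATE_BY_RESULT = {
    "passed": "success",
    "softfailed": "success",
    "failed": "failure",
    "incomplete": "error",
}


def build_status_request(api_base, owner, repo, sha, result, target_url, token):
    """Build (but do not send) a commit status request for Gitea.

    api_base is expected to include the API prefix, e.g. ".../api/v1".
    Results missing from the mapping are reported as "pending".
    """
    payload = {
        "state": STATE_BY_RESULT.get(result, "pending"),
        "target_url": target_url,  # the "Details" link, e.g. an openQA overview page
        "description": f"openQA: {result}",
        "context": "openqa",  # hypothetical context name
    }
    url = f"{api_base.rstrip('/')}/repos/{owner}/{repo}/statuses/{sha}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"token {token}",
        },
        method="POST",
    )
```

Sending the request would be a call to urllib.request.urlopen(request); it is left out here so the sketch can be tried without network access or credentials.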

The bot/dashboard also have the following useful features that need to be preserved:

The bot can ignore failures and then nevertheless approve:
1. Via @review:acceptable_for:incident_… comments in openQA; this kind of acceptance is also forwarded to qem-dashboard so these kinds of failures don't show up as such on the dashboard.
2. By checking for older eligible openQA jobs. In this case it'll log a message like Ignoring failed aggregate job %s for incident %s due to older eligible openQA job being ok.
The dashboard can display blocking openQA jobs by group/squad.
The blocked incidents are sorted by priority on the dashboard and incidents with a very high prio are highlighted specifically.
It is possible to filter the blocked page by group and store a link to that, e.g. https://dashboard.qam.suse.de/blocked?group_names=kernel. If I remember correctly that feature was valued a lot at some point.
The bot reads meta-data from https://gitlab.suse.de/qa-maintenance/metadata. I haven't seen any SLE 16 related files there yet so this might not be relevant at all, though.
The bot has Public Cloud helpers. They seem to be a collection of utility functions useful in the context of Public Cloud. Not sure whether this is still relevant and whether it'll be needed in SLE 16.
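
To make the first "approve despite failures" feature more concrete, here is a minimal sketch of how such comments could be evaluated. The exact comment format is abbreviated above; the pattern below assumes the shape "@review:acceptable_for:incident_<number>:<reason>", which is an assumption and should be checked against the bot's actual code. The function names are hypothetical.

```python
import re

# Assumed comment shape: "@review:acceptable_for:incident_<number>:<reason>".
ACCEPTABLE_FOR = re.compile(r"@review:acceptable_for:incident_(\d+):(\S+)")


def accepted_incidents(comments):
    """Collect the incident numbers marked as acceptable in job comments."""
    accepted = set()
    for text in comments:
        for match in ACCEPTABLE_FOR.finditer(text):
            accepted.add(int(match.group(1)))
    return accepted


def is_blocking(job_result, incident, comments):
    """A failed job blocks approval unless a comment accepts it for this incident."""
    if job_result in ("passed", "softfailed"):
        return False
    return incident not in accepted_incidents(comments)
```

If Gitea PRs are treated as "incidents", the same kind of check could be keyed on the PR number instead, provided the comment convention is kept.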

Things the bot/dashboard are missing:

It does not give negative feedback, at least not directly on OBS. I think we could do better here and update the review status in any case.
A primarily event-driven approach.

#12

Updated by openqa_review 15 days ago

Due date set to 2025-05-10

Setting due date based on mean cycle time of SUSE QE Tools

#13

Updated by mkittler 12 days ago

Related to action #180632: [spike][timeboxed:20h] Automated reviewer triggering openQA tests and feeding back openQA test results for https://src.suse.de/TEST_SLFO/mini-SLFO/pulls/10 size:M added

#14

Updated by mkittler 12 days ago · Edited

Clarification is needed on the following topics from the OBS side:
1. How OBS will be able to communicate that builds for a certain PR on Gitea are ready so openQA knows when to start testing.
  - In the worst case we have to read the comment OBS leaves on Gitea. Then openQA tests can be triggered by a webhook on PR comments (which is a possible webhook event in Gitea) but openQA would need to distinguish/parse those comments, which is probably error-prone.
  - Otherwise OBS could use the Gitea statuses API to communicate the build result like its scm integration already does. Then openQA tests can be triggered by a webhook on commit status update (which is a possible webhook event in Gitea).
  - OBS could provide a hook so we could call openQA directly from OBS to trigger tests.
  - OBS could provide events via RabbitMQ. Monitoring and filtering relevant RabbitMQ events on our side would be something completely new (as opposed to just extending our API to support certain webhooks). So I would avoid this solution.
  - In any case we need a fallback so the workflow doesn't get stuck in case openQA is unavailable when a hook would be triggered or an event received. (This of course also counts for hooks/events we would receive from Gitea itself.)
2. How the scope of a change (e.g. affected products and packages) is made available by OBS so openQA can make a decision what to test.
  - This was mentioned in the meeting on 2025-04-28 but I could not remember it in detail. In the best case this is documented somewhere or we have a link to examples.
3. Will there still be "incidents" and "aggregates"?
  - If yes, how are those made distinct on Gitea and does openQA need to distinguish them?
  - If no, is there another distinction to be made? Or can we just treat every PR on Gitea the same?
4. Is https://gitlab.suse.de/openqa/openqa-trigger-from-ibs-plugin still required?
  - I think in the meeting we leaned towards a no. So the creation of a PR on Gitea and OBS signaling somehow that builds are ready is the only trigger point for tests. We also do not need openqa-trigger-from-ibs-plugin to synchronize assets.
  - If it was still required it would probably need to be changed to handle the now dynamic project structure on OBS.
Clarification is needed on the following topics from the test maintainer side:
1. How do we decide what tests to run based on 1.2?
  - This will be discussed in a follow-up meeting.
  - In the best case scenario definitions are good enough. In the worst case we keep using the current meta-data approach as-is and therefore keep using qem-bot for scheduling. (See my previous comment for more ideas/details.)
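
Regarding the fallback mentioned under 1.: one possible way to keep hooks/events and a periodic poll from scheduling the same tests twice is to make the trigger decision idempotent. The sketch below is purely hypothetical; the class name, the use of PR number plus head commit as key and the "success" build state are assumptions and do not reflect existing code.

```python
from dataclasses import dataclass, field


@dataclass
class TriggerTracker:
    """Remember which (PR number, head commit) pairs were already scheduled."""

    scheduled: set = field(default_factory=set)

    def should_schedule(self, pr_number: int, head_sha: str, build_state: str) -> bool:
        """Return True exactly once per PR revision whose build has succeeded."""
        key = (pr_number, head_sha)
        if build_state != "success" or key in self.scheduled:
            return False
        self.scheduled.add(key)
        return True
```

Both a webhook handler and a periodic poll could call should_schedule with the same arguments. A missed hook is then picked up by the next poll and a duplicate delivery is ignored. In practice the set would have to be persisted, e.g. on the dashboard.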

#15

Updated by mkittler 11 days ago

I experimented with webhooks on src.suse.de and ran into the error context deadline exceeded (Client.Timeout exceeded while awaiting headers) even after we extended the timeout to 10 seconds. Most likely the traffic is blocked by a firewall so I created https://sd.suse.com/servicedesk/customer/portal/1/SD-186615.

#16

Updated by mkittler 10 days ago

As discussed in the unblock meeting, there are a few additional challenges:

qem-bot/qem-dashboard have some features on a detailed-level that we might want to keep (e.g. the dashboard has a priority-based view and the bot can approve incidents despite failures based on openQA comments).
Because of the previous point we still want to explore keeping the bot and dashboard and just extending them.

So I'll look into the code of the bot/dashboard in further detail to come up with a list of detailed features we need to take into account.

I'll create follow-up tickets for how to change bot/dashboard and for how to extend upstream openQA after we have clarification on points mentioned on #180809#note-14.

#17

Updated by jlausuch 10 days ago

Let me answer 1.4.

obs-sync can't be used any more for one big reason:

obs-sync relies on static project names from IBS which it monitors (based on RabbitMQ messages), and the staged updates in SLFO will have a dynamic project in IBS based on the Pull Request number (similar to incidents today in SLE15).

There are other drawbacks in that tool:

It does not support dynamic scheduling (based on update/package(s) under test) -> there was some effort started for this, but it ended up in the void because it would make that tool more complex than it already is and it's not designed for this.
There are very few people who really understand obs-sync in depth, and the one who knows most of it is not in our department.

Another remark or requirements from my side.

We need to keep job groups by squad for staging and increments. Each squad will be the owner of its job group.
We should allow squads to have flexibility when it comes to which tests to trigger for each update. I really like how it is done in the bot-ng with metadata yaml, e.g. https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/sles15sp7.yml?ref_type=heads where the owners of each yaml define which flavor they want to trigger depending on the package. We could keep this method or use a new one, but the outcome should be similar.

#18

Updated by mkittler 10 days ago · Edited

jlausuch wrote in #note-17:

Let me answer 1.4.

obs-sync can't be used any more for one big reason:

obs-sync relies on static project names from IBS which it monitors (based on RabbitMQ messages), and the staged updates in SLFO will have a dynamic project in IBS based on the Pull Request number (similar to incidents today in SLE15).

There are other drawbacks in that tool:

It does not support dynamic scheduling (based on update/package(s) under test) -> there was some effort started for this, but it ended up in the void because it would make that tool more complex than it already is and it's not designed for this.

There are very few people who really understand obs-sync in depth, and the one who knows most of it is not in our department.

Thanks for the info. This is in-line with what I gathered so far.

Another remark or requirements from my side.

We need to keep job groups by squad for staging and increments. Each squad will be the owner of its job group.

We should allow squads to have flexibility when it comes to which tests to trigger for each update. I really like how it is done in the bot-ng with metadata yaml, e.g. https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/sles15sp7.yml?ref_type=heads where the owners of each yaml define which flavor they want to trigger depending on the package. We could keep this method or use a new one, but the outcome should be similar.

Good to know. So if we ever replaced the meta-data concept, it would need to be replaced with something similar. Maybe it makes sense for me to see whether these meta-data files could be turned into scenario definitions which openQA would be able to understand out of the box. However, this speaks more for keeping qem-bot at least for scheduling for the time being.

#19

Updated by livdywan 9 days ago

Status changed from In Progress to Workable

#20

Updated by hrommel1 5 days ago

mkittler wrote in #note-16:

As discussed in the unblock meeting, there are a few additional challenges:

qem-bot/qem-dashboard have some features on a detailed-level that we might want to keep (e.g. the dashboard has a priority-based view and the bot can approve incidents despite failures based on openQA comments).

Because of the previous point we still want to explore keeping the bot and dashboard and just extending them.

So I'll look into the code of the bot/dashboard in further detail to come up with a list of detailed features we need to take into account.

I'll create follow-up tickets for how to change bot/dashboard and for how to extend upstream openQA after we have clarification on points mentioned on #180809#note-14.

Just to confirm: yes, reading openQA comments to approve despite failures is something that we must have as well. The priority-based view is important but does not have the same priority. Reading priorities might even require reading data from SMELT, and at this point in time it is unclear when that will be available/ready for SLE16.

#21

Updated by mkittler 4 days ago

Ok, good to know. That again speaks more for keeping qem-bot instead of using upstream openQA features.

Unfortunately none of the questions under "1." in #180809#note-14 have been answered yet.

After reading https://jira.suse.com/browse/MSQA-949 I can answer "Will there still be "incidents" and "aggregates"?" myself, though. There will be:

"staging triggered tests": only "tests which are relevant for the package updates in the stagings" - whatever that means exactly
"product increment triggered tests": focus on integration tests, but with the option to filter this down by package level

The other questions are still open.

#22

Updated by mkittler 3 days ago · Edited

Status changed from Workable to Blocked

https://sd.suse.com/servicedesk/customer/portal/1/SD-186615 has been resolved. So we could move on exploring a webhook-based solution.

At this point it seems more likely that we'll stick with qem-bot/qem-dashboard so we'd probably just poll for new PRs on Gitea within the bot (and potentially poll for a new status on OBS).
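
A minimal sketch of such polling, assuming the standard Gitea endpoint for listing pull requests (GET /repos/{owner}/{repo}/pulls) and token authentication. The function names and the way already known PRs are tracked are hypothetical and not taken from qem-bot.

```python
import json
import urllib.request
from urllib.parse import urlencode


def open_pulls_url(api_base, owner, repo, page=1, limit=50):
    """Build the URL listing open pull requests of a repository."""
    query = urlencode({"state": "open", "page": page, "limit": limit})
    return f"{api_base.rstrip('/')}/repos/{owner}/{repo}/pulls?{query}"


def fetch_open_pulls(api_base, owner, repo, token):
    """Fetch one page of open pull requests (performs a network request)."""
    request = urllib.request.Request(
        open_pulls_url(api_base, owner, repo),
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def new_pulls(pulls, known_numbers):
    """Keep only the pull requests the bot has not seen before."""
    return [pr for pr in pulls if pr["number"] not in known_numbers]
```

The set of known numbers could come from the dashboard, e.g. from the incidents it already stores, so the bot itself would not need to keep state between runs.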

Blocking this ticket as clarification is still needed on most points mentioned in #180809#note-14.

#23

Updated by okurz 3 days ago

Status changed from Blocked to Feedback

Only block on tickets or other open external references. You can reference the Slack thread where we are waiting for more

#24

Updated by jlausuch 3 days ago

Related Slack thread: https://suse.slack.com/archives/C08DC2SHABV/p1746015720858179

#25

Updated by mkittler 2 days ago

I added the findings we have so far to #180812 which seems to be the next best ticket to work on. I'm unassigning from #180632 as a full replacement of qem-bot/dashboard seems not feasible right now but I extended the suggestions in that ticket with my findings from here.

#26

Updated by mkittler 2 days ago

Just a few notes for myself as I'm trying to develop a sketch for how to change qem-bot/dashboard.

class OpenQABot is the high-level class for scheduling, it relies on data previously synced to the dashboard via /api/incidents (does not read SMELT directly)
- Incidents and aggregate scheduling only differs in how they read the meta-data (in load_metadata). Both just request /api/incidents from the dashboard. (The dashboard has not even a table called aggregates or updates. It has incident_in_update and update_openqa_settings, though. Not sure how those come into play but they probably refer to aggregates in some way.)
- Maybe we can treat Gitea PRs as "incidents" here and not adjust much.
class SMELTSync needs a correspondence for Gitea
On the dashboard we can treat Gitea PRs as "incidents".
- number could be the PR ID
- We could add a column to store the URL so the frontend can render that directly. Or we use the concept of incident channels adding a Gitea channel.
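
The mapping idea from these notes could look roughly like this. The input keys number, html_url and state are fields of a pull request object as returned by the Gitea API; the output keys url and isOpen are hypothetical additions and only number and rr_number are known dashboard fields.

```python
def pr_to_incident(pr: dict) -> dict:
    """Translate a Gitea pull request into an incident-like record.

    "url" and "isOpen" are hypothetical fields for illustration; the
    dashboard schema would need to be extended to store them.
    """
    return {
        "number": pr["number"],  # the PR number doubles as incident number
        "rr_number": None,  # there is no OBS release request in this workflow
        "url": pr["html_url"],  # lets the frontend render the link directly
        "isOpen": pr["state"] == "open",
    }
```

Note that PR numbers are only unique per repository, so using the number alone as identifier assumes a single repository or an additional distinguishing field.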

#27

Updated by jbaier_cz 1 day ago

The dashboard has not even a table called aggregates or updates. It has incident_in_update and update_openqa_settings, though. Not sure how those come into play but they probably refer to aggregates in some way.

I guess part of the problem is that bot and dashboard were initially written by different people with different nomenclature. In this context I think it is safe to assume updates and aggregates are synonyms.

Just a few notes for myself

I fully agree with your observation. It also looks to me that something like class GiteaSync should handle the translation from a Gitea PR to some data structure known to the dashboard (i.e. some incidents mapping). The PR id seems like a good unique identifier in this case.

#28

Updated by mkittler 1 day ago

jbaier_cz wrote in #note-27:

I guess part of the problem is that bot and dashboard were initially written by different people with different nomenclature. In this context I think it is safe to assume updates and aggregates are synonyms.

I know but that's not the point. Even when considering those synonyms the dashboard has no distinct list of aggregates/updates.

Here's a PR introducing "gitea-sync": https://github.com/openSUSE/qem-bot/pull/195

It of course breaks the existing workflow (as no distinction between incidents from Gitea and SMELT is possible so far).
