action #157741

open

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #97121: [epic] enable qem-bot comments on IBS (was: enable qa-maintenance/openQABot comments on smelt again)

Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M

Added by okurz 7 months ago. Updated 10 days ago.

Status: Blocked
Priority: Normal
Target version:
Start date: 2024-03-22
Due date: 2025-01-31 (Due in about 4 months)
% Done: 0%
Estimated time:

Description

Motivation

One of the most important responsibilities within SLE maintenance testing is to approve/reject SLE maintenance release requests based on openQA test results. So far qem-bot is sufficient to schedule openQA tests but does only a mediocre job of reporting back results: test results are polled asynchronously on a periodic schedule https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules, which causes unnecessary delays and inefficient polling, uses outdated results (#122311) and does not even report back on blocking test failures (#97121). Let's use a proper architecture with efficient event-based triggers that provide relevant information back to release requests on IBS, relying on core openQA features rather than on too much custom, lacking downstream tooling. After the PoC in #154498-14 we should fully implement this so that the corresponding release request is approved/rejected synchronously based on AMQP event listening.

Acceptance criteria

  • AC1: something synchronously approves based on AMQP events
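
One way to read AC1 is as a small decision step that runs each time an AMQP event reports a finished openQA job. The following is only a sketch under assumptions: the event fields (`BUILD`, `id`, `result`), the set of results treated as good, and the `BuildTracker` helper are hypothetical and not taken from qem-bot. The real payload and the actual approval call on IBS would need to be checked before relying on any of this.

```python
from collections import defaultdict

# Assumption: results counted as "good"; verify against openQA's real result values.
OK_RESULTS = {"passed", "softfailed"}


class BuildTracker:
    """Collects job results per build and decides once all expected jobs reported."""

    def __init__(self, expected_jobs):
        # expected_jobs: hypothetical mapping of build name -> number of scheduled jobs
        self.expected_jobs = dict(expected_jobs)
        self.results = defaultdict(dict)

    def on_event(self, event):
        """Record one 'job done' event.

        Returns 'approve' or 'reject' once the build is complete,
        or None while results are still missing.
        """
        build = event["BUILD"]
        self.results[build][event["id"]] = event["result"]
        seen = self.results[build]
        # Unknown builds never complete, so we keep waiting for them.
        if len(seen) < self.expected_jobs.get(build, float("inf")):
            return None
        all_good = all(result in OK_RESULTS for result in seen.values())
        return "approve" if all_good else "reject"
```

The decision is deliberately deferred until every expected job has reported, which matches the idea of acting when the "openQA product build" is finished rather than on each single job.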

Suggestions

  • Follow-on with the PoC of #154498-14
  • Set up qem-bot or an alternative on an existing or new server, but make sure the logs are accessible
  • Add it as part of qem-dashboard which already has AMQP support
  • Ensure that qem-bot runs near-continuously to be able to listen to all AMQP events accordingly, maybe as back-to-back GitLab CI jobs with limits to prevent parallel execution, which we already have?
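
The back-to-back idea from the last suggestion could be sketched as a listener that processes events for a bounded time and then returns, so that the next scheduled job can take over. This is a minimal sketch, not the qem-bot implementation: `listen_for` is a hypothetical helper, and it only assumes a connection object that offers `process_data_events(time_limit=...)` the way pika's `BlockingConnection` does.

```python
import time


def listen_for(connection, max_seconds, poll_seconds=1.0, clock=time.monotonic):
    """Process AMQP events until max_seconds have elapsed; return the number of polls.

    `connection` is expected to behave like pika.BlockingConnection, i.e. its
    process_data_events(time_limit=...) dispatches pending messages to the
    consumer callbacks registered on its channels.
    """
    deadline = clock() + max_seconds
    polls = 0
    while clock() < deadline:
        # Never wait past the deadline, and never pass a negative limit.
        remaining = max(0.0, deadline - clock())
        connection.process_data_events(time_limit=min(poll_seconds, remaining))
        polls += 1
    return polls
```

Choosing `max_seconds` slightly below the CI job timeout would let the job end cleanly instead of being killed, at the cost of a short gap between two jobs during which events can be missed.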

Further details

Also related to #122311, #123088, #97121, #99303, #152939, #131279, #117655


Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #165345: [spike][timeboxed:20h] Custom qa-tools team managed low-maintenance platform for hosting team-owned containerized workload | New | jbaier_cz | 2024-08-15

Copied from QA - action #154498: [spike][timeboxed:20h][integration] Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M | Resolved | jbaier_cz

Actions #1

Updated by okurz 7 months ago

  • Copied from action #154498: [spike][timeboxed:20h][integration] Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M added
Actions #2

Updated by szarate 7 months ago

Two questions I have: does a build also consider aggregates?

Consider e.g. Wicked: https://openqa.suse.de/tests/overview?distri=sle&&build=%3A32459%3Awicked&&build=:32458:wicked&build=:32460:wicked

Where this search is only showing single incidents, but doesn't show aggregate updates :D

Actions #3

Updated by okurz 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by okurz 7 months ago

szarate wrote in #note-2:

Two questions I have: does a build also consider aggregates?

Yes.

What's the second question?

Actions #5

Updated by okurz 7 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 7 months ago

  • Subject changed from Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished to Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by mkittler 6 months ago

  • Assignee set to mkittler
Actions #8

Updated by mkittler 6 months ago

  • Status changed from Workable to In Progress
Actions #9

Updated by openqa_review 6 months ago

  • Due date set to 2024-04-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz 6 months ago

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2502425#L57

ModuleNotFoundError: No module named 'pika'
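
A failure like this could be caught before the bot starts by a small preflight check in the CI job. The sketch below is an assumption about how such a check might look; `missing_modules` is a hypothetical helper and not part of bot-ng.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["pika"])` in the container would return `["pika"]` when the package is not installed and an empty list otherwise.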

Actions #12

Updated by mkittler 6 months ago · Edited

The PR was merged and I configured the pipeline under https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules.

I created https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 to allow the traffic because the pipeline currently runs into a connection error.

Maybe we also still need to take care that the TLS certificate is available within the container (like what was done for #158907). The TLS certificates are already installed in the container (see https://build.suse.de/projects/QA:Maintenance/packages/openSUSE-Leap-Container/files/Dockerfile?expand=1).

Actions #13

Updated by mkittler 6 months ago

  • Status changed from In Progress to Blocked
Actions #14

Updated by mkittler 6 months ago

  • Status changed from Blocked to In Progress

In https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 I was asked for the approval of the buildops team as it is the owner of amqps://rabbit.suse.de. They were not happy with us "abusing shared gitlab resources for this" so I suppose we better not go down that road. I'll set up the daemon on qam2.qe.prg2.suse.org instead. I suppose the only real disadvantage is that the AMQP "job" won't show up alongside the others on GitLab.

Actions #16

Updated by livdywan 6 months ago

Failed with pika.exceptions.AMQPConnectionError now, see https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2512747

Actions #17

Updated by mkittler 6 months ago

  • Status changed from In Progress to Blocked

I just tried it again to see whether DNS has changed now but it still fails.

I also stopped qem-bot-amqp-watcher.service on qam2.qe.prg2.suse.org again as we're going for OpenPlatform. If that turns out to work I'll completely remove the service from qam2.qe.prg2.suse.org.

For now I keep this blocked on #156214.

Actions #19

Updated by mkittler 6 months ago · Edited

Once we have access we'd probably need to build an RPM package for bot-ng and a container image installing it (according to https://itpe.io.suse.de/open-platform/docs/docs/getting_started/quickstart/#build-rpm-packages-and-container-images). We could maybe also skip the packaging step and clone the Git repo directly when building the container. That might simplify things and we don't need to build anything here anyway.


By the way, I tried to improve the error handling of the AMQP code so we get more than just the exception type AMQPConnectionError: https://github.com/Martchus/qem-bot/pull/new/amqp-2
This didn't work, though. It looks like the error message is actually shown also without such a change, e.g.:

…
  File "/usr/lib64/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

However, in case of a connection error (and not a DNS error) there simply seems to be no error message available because even with my change all I get is:

./bot-ng.py --configs ../metadata -t 1234 --dry amqp --url amqp://10.145.56.20
2024-04-22 17:33:31 ERROR    Establishing AMQP connection to 'amqp://10.145.56.20': 

So this change makes things even worse as we now don't even know that it is an AMQPConnectionError. Considering https://pika.readthedocs.io/en/stable/modules/exceptions.html#pika.exceptions.AMQPConnectionError the error class AMQPConnectionError is probably the best we can get in certain cases.
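
A logging helper that always keeps the exception class name, and appends the message only when there is one, would avoid losing information in the empty-message case described above. This is a minimal sketch; `describe_exception` is a hypothetical name and not part of qem-bot or pika.

```python
def describe_exception(exc):
    """Format an exception so that the class name survives an empty message."""
    name = type(exc).__name__
    message = str(exc)
    return f"{name}: {message}" if message else name
```

With this, an exception raised without arguments would be logged by its class name alone, while exceptions that do carry a message keep it.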

Actions #20

Updated by mkittler 6 months ago

Of course we could also just use https://build.suse.de/projects/QA:Maintenance/packages/openSUSE-Leap-Container/files/Dockerfile again and do the checkout manually like in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml#L47.

Otherwise I suppose https://build.suse.de/project/show/QA:Maintenance would be the right place to add a new container (based on the existing openSUSE-Leap-Container in the same project).

Actions #21

Updated by okurz 6 months ago

mkittler wrote in #note-20:

Of course we could also just use https://build.suse.de/projects/QA:Maintenance/packages/openSUSE-Leap-Container/files/Dockerfile again and do the checkout manually like in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml#L47.

Otherwise I suppose https://build.suse.de/project/show/QA:Maintenance would be the right place to add a new container (based on the existing openSUSE-Leap-Container in the same project).

I suggest to not use IBS unless we have to. Shouldn't be too hard to create our own variant in OBS.

Actions #22

Updated by okurz 5 months ago

  • Due date deleted (2024-04-30)

removing due-date due to block

Actions #23

Updated by mkittler 5 months ago · Edited

  • Status changed from Blocked to In Progress

Deploying this on OpenPlatform was rather simple. There was a little bit of clicking on the web UI involved (to assign resources and download the op-prg2-1-staging.yaml file) and then the following CLI commands did the trick:

cd /hdd/openqa-devel/openplatform
export KUBECONFIG=$PWD/op-prg2-1-staging.yaml
kubectl config view # to check whether the env variable is considered as expected
kubectl get nodes # to check whether the CLI client generally works
kubectl apply -f qem-bot.yaml -n qem-bot # to deploy the workload

For the configuration file qem-bot.yaml, see https://github.com/Martchus/qem-bot/pull/new/openplatform.

Unfortunately it runs into the same (probably firewall-related) issue we saw when trying to run it on GitLab:

$ kubectl logs -f -p qem-bot-7496cb6967-hmvrd -n qem-bot
…
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 7, in <module>
    main()
  File "/qem-bot/openqabot/main.py", line 32, in main
    sys.exit(cfg.func(cfg))
  File "/qem-bot/openqabot/args.py", line 77, in do_amqp
    amqp = AMQP(args)
  File "/qem-bot/openqabot/amqp.py", line 33, in __init__
    self.connection = pika.BlockingConnection(pika.URLParameters(args.url))
  File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 359, in __init__
    self._impl = self._create_connection(parameters, _impl_class)
  File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 450, in _create_connection
    raise self._reap_last_connection_workflow_error(error)
pika.exceptions.AMQPConnectionError

It also didn't help to specify --url amqp://… with the IP (instead of using the domain name and TLS). So maybe we need yet another SD-ticket but I first asked in the existing SD-ticket.
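
To tell a name resolution problem apart from a blocked TCP connection without involving pika at all, a small probe could be run from inside the container. The following is a sketch under assumptions: `probe` is a hypothetical helper, and the port has to be supplied by the caller (5671 is the default port for AMQPS, 5672 for plain AMQP).

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify reachability of host:port as 'dns-failure', 'tcp-failure' or 'ok'."""
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue  # try the next resolved address, if any
    return "tcp-failure"
```

A result of `tcp-failure` for both the host name and the IP address would point towards a firewall or routing issue rather than DNS, which is the distinction the SD ticket discussion needs.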

Actions #24

Updated by mkittler 5 months ago

  • Status changed from In Progress to Blocked
Actions #25

Updated by okurz 5 months ago

to be explicit as the ticket URL was some comments back: Blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-154403

Actions #26

Updated by mkittler 5 months ago

Since I don’t know how to answer your questions myself I asked about it on Slack: https://suse.slack.com/archives/C04S88VCHS7/p1714640151238429

Actions #27

Updated by mkittler 5 months ago

We got a subnet in the SD ticket but probably need help to configure it so I'm waiting for a response in the SD ticket.

Actions #28

Updated by livdywan 4 months ago

okurz wrote in #note-25:

to be explicit as the ticket URL was some comments back: Blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-154403

Conversation ongoing (checking it since our SLOs require an update within the next couple of days)

Actions #29

Updated by okurz 4 months ago

  • Status changed from Blocked to In Progress
  • Assignee changed from mkittler to okurz

The team is getting annoyed and frustrated by not-the-place, not-the-time arguments and people coming up with weird blockers in https://sd.suse.com/servicedesk/customer/portal/1/SD-154403. I will add an according comment and escalate so that we can use OpenPlatform rather than reverting to manually maintaining virtual machines on other less maintained hypervisors.

Actions #30

Updated by okurz 4 months ago

  • Status changed from In Progress to Blocked

I provided a suggestion in https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 and mentioned it, along with the potential need for escalation, in https://suse.slack.com/archives/C02CANHLANP/p1717755396214829 in case we don't get to clear decisions.

Actions #31

Updated by mkittler 4 months ago

Maybe we should nevertheless respond somehow in the SD ticket. (Even though we don't fully understand the options mentioned there.)

Actions #32

Updated by livdywan 4 months ago

mkittler wrote in #note-31:

Maybe we should nevertheless respond somehow in the SD ticket. (Even though we don't fully understand the options mentioned there.)

Honestly I don't know what the difference between the last two suggestions from Martin is. I would consider saying something like, "if we deploy qembot on op, which one is the easiest way forward for everyone involved?".

Actions #33

Updated by okurz 3 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)
  • Priority changed from Normal to High

As you guys stated and as the latest comment in the ticket was from Martin Piala we need to act on it to drive the implementation further, i.e. to find a way to allow OpenPlatform containers to access rabbit.suse.de. Setting this to High as mgriessmeier asked about this.

Actions #34

Updated by okurz 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

I think it's a better approach to make the work on https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 more explicit and to ensure we have the necessary requirements fulfilled to be able to run qem-bot. Creating separate dedicated infra ticket.

Actions #36

Updated by okurz 3 months ago

  • Status changed from In Progress to Blocked

#163331

Actions #37

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #38

Updated by okurz 3 months ago

I wonder why we decided to use containers on OpenPlatform rather than gitlab CI. In https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 Ruediger Oertel 2024-04-19 12:50 commented

looks like this was already discussed elsewhere and the approach with the GitLab runners was not really optimal for what was being done.

and later on Marius Kittler 2024-04-19 14:29 says

The pipeline is still running into the pika.exceptions.AMQPConnectionError. Not sure whether the DNS change is already effective.
Probably we’re now going with openplatform anyway - although it’ll be still great if the GitLab pipeline worked as short-term solution.

and later Marius Kittler 2024-06-10 12:00

Our first attempt of using GitLab runners was considered an abuse so we moved the container to OpenPlatform

Actions #39

Updated by okurz 3 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

Found the conversation https://suse.slack.com/archives/C02BXKBMXNV/p1713446598715539 which Ruediger Oertel and Marius Kittler referred to. Quoting from there:

(Marius Kittler) (3 months ago) Does anybody here know who the owner of https://rabbit.suse.de is? I was asked about it when asking to allow gitlab-runners to access amqps://rabbit.suse.de. It was initially setup by @Dominik Heidler and @coolo but they are not owning it now anymore. Maybe someone on the OBS-side is?
[…] (not quoting the complete discussion for the purpose of privacy)
(Marius Kittler) We are trying to connect to amqps://rabbit.suse.de from a GitLab CI job to get openQA-related events. And yes, this is normally not what one would use a CI job for but it'll get the job done. that is like really wasteful for the gitlab-runner side. a simple python script which then in async does the "reject/approve" stuff after parsing the amqp event is way more efficient
(Hendrik Vogelsang) @Marius Kittler an example would be https://github.com/openSUSE/kurren
(Marcus Rueckert) but abusing gitlab for this is imho not the best way
(Marius Kittler) We already have a service (written in Python) that listens for AMQP events and does the required actions when receiving them. Of course this is abusing GitLab instead of just setting up a server. I don't think that having a long-running GitLab job (I suppose 4 hours is the maximum) is that wasteful, though. (It would not be one job per event/action.) And it would be in-line with our existing GitLab-based setup for performing those and similar actions (on a regular schedule). (edited)
gitlab-runners are not your replacement for setting up a small VM for this […] you should not abuse shared gitlab resources for this. and if you set up a VM for your own gitlab-runner. then you can just run the script there and don't have to funnel it through gitlab
(Georg Pfützenreuter) sounds like a use case for OpenPlatform

However, reconsidering right now, mkittler and I think that GitLab CI provides the benefits of reporting, alerting and secrets management over containers in OpenPlatform. Also, by now, with https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 progressed, GitLab CI runners should have access to rabbit.suse.de so we can continue with our original plan.

Actions #40

Updated by okurz 3 months ago

  • Target version changed from Ready to Tools - Next
Actions #41

Updated by okurz 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #42

Updated by okurz about 2 months ago

  • Related to action #165345: [spike][timeboxed:20h] Custom qa-tools team managed low-maintenance platform for hosting team-owned containerized workload added
Actions #43

Updated by livdywan about 1 month ago

In line with our SLOs somebody needs to pick this up now, hence raising to High

Actions #44

Updated by livdywan about 1 month ago

  • Priority changed from Normal to High
Actions #45

Updated by livdywan about 1 month ago

  • Description updated (diff)
Actions #46

Updated by okurz 27 days ago

To give this ticket more priority I moved more tickets out of the current backlog and have reduced the priority of others which leaves this ticket as the most important unassigned workable ticket in the tools team dev backlog.

Actions #47

Updated by livdywan 20 days ago · Edited

However, reconsidering right now, mkittler and I think that GitLab CI provides the benefits of reporting, alerting and secrets management over containers in OpenPlatform. Also, by now, with https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 progressed, GitLab CI runners should have access to rabbit.suse.de so we can continue with our original plan.

The SD ticket is seemingly blocked on https://jira.suse.com/browse/PLAT-499 which has been "in progress" since May. Do we assume this is usable right now or should we block on this?

Or do we go ahead with GitLab despite the concerns from others? If so, I guess we need to check that it works.

Actions #48

Updated by okurz 20 days ago

livdywan wrote in #note-47:

However, reconsidering right now, mkittler and I think that GitLab CI provides the benefits of reporting, alerting and secrets management over containers in OpenPlatform. Also, by now, with https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 progressed, GitLab CI runners should have access to rabbit.suse.de so we can continue with our original plan.

The SD ticket is seemingly blocked on https://jira.suse.com/browse/PLAT-499 which has been "in progress" since May. Do we assume this is usable right now or should we block on this?

Or do we go ahead with GitLab despite the concerns from others? If so, I guess we need to check that it works.

we go ahead with gitlab and make it work.

Actions #49

Updated by robert.richardson 17 days ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #50

Updated by openqa_review 16 days ago

  • Due date set to 2024-10-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #51

Updated by robert.richardson 13 days ago · Edited

GitLab CI runners do not seem to have access yet, as I cannot reproduce the pika.exceptions.AMQPConnectionError issue locally.
For me everything works fine when I create the container and reproduce the commands of the pipeline on my machine (within the VPN):

$BOT_LAUNCHER ./bot-ng.py -c /etc/openqabot --token $BOT_TOKEN $BOT_PARAMS $BOT_CMD 2>&1 | tee bot_${BOT_CMD}_log.log
+ tee bot_amqp_log.log
+ timeout --preserve-status --signal=SIGINT 230m ./bot-ng.py -c /etc/openqabot --token op3nQAB0tT0k3n --dry --debug amqp
2024-09-27 11:04:27 INFO     AMQP listening started
2024-09-27 11:07:10 DEBUG    Received AMQP message: {'ARCH': 'x86_64',...
Actions #52

Updated by robert.richardson 10 days ago

  • Due date changed from 2024-10-08 to 2025-01-31
  • Status changed from In Progress to Blocked
  • Priority changed from High to Normal