action #157741
coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release
coordination #97121: [epic] enable qem-bot comments on IBS (was: enable qa-maintenance/openQABot comments on smelt again)
Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M
0%
Description
Motivation
One of the most important responsibilities within SLE maintenance testing is to approve/reject SLE maintenance release requests based on openQA test results. So far qem-bot is sufficient for scheduling openQA tests but does only a mediocre job of reporting results back: test results are polled asynchronously on a periodic schedule (https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules), causing unnecessary delays, inefficient polling, use of outdated results (#122311) and no reporting back on blocking test failures (#97121). Let's use a proper architecture with efficient event-based triggers that provide relevant information back to release requests on IBS using core openQA features, rather than relying on too much custom, lacking downstream tooling: after the PoC in #154498-14 we should fully implement this and approve/reject the according release request synchronously upon the corresponding AMQP event.
Acceptance criteria
- AC1: Some component synchronously approves release requests based on AMQP events
Suggestions
- Follow up on the PoC from #154498-14
- Set up qem-bot or an alternative on an existing or new server but provide access to the logs
- Add it as part of qem-dashboard which already has AMQP support
- Ensure that qem-bot runs near-continuously to be able to listen to all AMQP events, maybe via back-to-back GitLab CI jobs with the limits we already have to prevent parallel execution?
Further details
Also related to #122311, #123088, #97121, #99303, #152939, #131279, #117655
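To make the event-based flow described in the motivation concrete, below is a minimal sketch of such a listener using pika. This is an illustration only, not qem-bot's actual implementation: the AMQP URL/credentials, the "pubsub" topic exchange, the "suse.openqa.job.done" routing key and the approve_if_release_request_done() helper are assumptions.

# Minimal sketch of a synchronous, AMQP-driven approval flow (illustration only).
# Exchange, routing key and the approval helper are assumptions, not qem-bot code.
import json
import pika

AMQP_URL = "amqps://user:password@rabbit.suse.de"  # placeholder credentials
EXCHANGE = "pubsub"                                # assumed topic exchange
TOPIC = "suse.openqa.job.done"                     # assumed openQA "job done" events


def approve_if_release_request_done(event: dict) -> None:
    """Hypothetical helper: check whether all openQA jobs for the release
    request behind event['BUILD'] have finished and passed, then approve or
    reject the request on IBS accordingly."""
    ...


def on_message(channel, method, properties, body) -> None:
    event = json.loads(body)
    approve_if_release_request_done(event)


connection = pika.BlockingConnection(pika.URLParameters(AMQP_URL))
channel = connection.channel()
queue = channel.queue_declare("", exclusive=True).method.queue  # transient private queue
channel.queue_bind(queue=queue, exchange=EXCHANGE, routing_key=TOPIC)
channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=True)
channel.start_consuming()  # blocks; reacts to each event as it arrives instead of polling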
Updated by okurz 9 months ago
- Copied from action #154498: [spike][timeboxed:20h][integration] Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M added
Updated by szarate 9 months ago
Two questions I have: does a build also consider aggregates?
Consider e.g. Wicked: https://openqa.suse.de/tests/overview?distri=sle&&build=%3A32459%3Awicked&&build=:32458:wicked&build=:32460:wicked where this search only shows single incidents but doesn't show aggregate updates :D
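For context, the builds in that query are incident-specific builds of the form ":<incident>:<package>", whereas aggregate update runs use different (date-based) build names. A rough sketch of how a consumer could tell the two apart; the helper and the assumption that every non-incident build is an aggregate run are illustrative only, not qem-bot code:

# Illustrative helper: distinguish incident builds such as ":32459:wicked"
# from other builds (e.g. date-based aggregate builds).
from typing import Optional


def incident_from_build(build: str) -> Optional[int]:
    if build.startswith(":"):
        _, incident, _package = build.split(":", 2)
        return int(incident)
    return None  # not an incident build, e.g. an aggregate update run


assert incident_from_build(":32459:wicked") == 32459
assert incident_from_build("20240419-1") is None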
Updated by okurz 9 months ago
- Subject changed from Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished to Approve/reject SLE maintenance release requests on IBS synchronously listening to AMQP events when testing for one release request as "openQA product build" is finished size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review 9 months ago
- Due date set to 2024-04-30
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 9 months ago
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2502425#L57
ModuleNotFoundError: No module named 'pika'
Updated by mkittler 9 months ago · Edited
The PR was merged and I configured the pipeline under https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules.
I created https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 to allow the traffic because the pipeline currently runs into a connection error.
Maybe we also still need to take care that the TLS certificate is available within the container (like what was done for #158907). However, the TLS certificates are already installed in the container (see https://build.suse.de/projects/QA:Maintenance/packages/openSUSE-Leap-Container/files/Dockerfile?expand=1).
Updated by mkittler 9 months ago
- Status changed from Blocked to In Progress
In https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 I was asked for the approval of the buildops team as it is the owner of amqps://rabbit.suse.de. They were not happy with us "abusing shared gitlab resources for this", so I suppose we better not go down that road. I'll set up the daemon on qam2.qe.prg2.suse.org instead. I suppose the only real disadvantage is that the AMQP "job" won't show up alongside the others on GitLab.
Updated by mkittler 9 months ago
We decided to give https://itpe.io.suse.de/open-platform/docs/docs/category/getting-started a try instead.
Updated by livdywan 8 months ago
Failed with pika.exceptions.AMQPConnectionError now, see https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2512747
Updated by mkittler 8 months ago
- Status changed from In Progress to Blocked
I just tried it again to see whether DNS has changed now but it still fails.
I also stopped qem-bot-amqp-watcher.service on qam2.qe.prg2.suse.org again as we're going for OpenPlatform. If that turns out to work I'll completely remove the service from qam2.qe.prg2.suse.org.
For now I keep this blocked on #156214.
Updated by mkittler 8 months ago · Edited
Once we have access we'd probably need to build an RPM package for bot-ng and a container image installing it (according to https://itpe.io.suse.de/open-platform/docs/docs/getting_started/quickstart/#build-rpm-packages-and-container-images). We could maybe also skip the packaging step and clone the Git repo directly when building the container. That might simplify things as we don't need to build anything here anyway.
By the way, I tried to improve the error handling of the AMQP code so we get more than just the exception type AMQPConnectionError: https://github.com/Martchus/qem-bot/pull/new/amqp-2
This didn't work, though. It looks like the error message is actually shown even without such a change, e.g.:
…
File "/usr/lib64/python3.11/socket.py", line 962, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution
However, in case of a connection error (and not a DNS error) there simply seems to be no error message available because even with my change all I get is:
./bot-ng.py --configs ../metadata -t 1234 --dry amqp --url amqp://10.145.56.20
2024-04-22 17:33:31 ERROR Establishing AMQP connection to 'amqp://10.145.56.20':
So this change makes things even worse as we now don't even know that it is an AMQPConnectionError. Considering https://pika.readthedocs.io/en/stable/modules/exceptions.html#pika.exceptions.AMQPConnectionError, the error class AMQPConnectionError is probably the best we can get in certain cases.
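For reference, a minimal sketch of the kind of wrapper attempted (the actual attempt is in the PR linked above); logging the exception with %r at least preserves the exception class name even when pika attaches no message text:

# Sketch only: surface as much detail as pika attaches to connection failures.
import logging
import pika

log = logging.getLogger("bot.amqp")


def connect(url: str) -> pika.BlockingConnection:
    try:
        return pika.BlockingConnection(pika.URLParameters(url))
    except pika.exceptions.AMQPConnectionError as error:
        # %r keeps the class name (e.g. AMQPConnectionError()) even when the
        # message is empty, which is the case for plain connection errors as
        # opposed to DNS failures.
        log.error("Establishing AMQP connection to '%s' failed: %r", url, error)
        raise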
Updated by mkittler 8 months ago
Of course we could also just use https://build.suse.de/projects/QA:Maintenance/packages/openSUSE-Leap-Container/files/Dockerfile again and do the checkout manually like in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml#L47.
Otherwise I suppose https://build.suse.de/project/show/QA:Maintenance would be the right place to add a new container (based on the existing openSUSE-Leap-Container in the same project).
Updated by okurz 8 months ago
mkittler wrote in #note-20:
Of course we could also just use https://build.suse.de/projects/QA:Maintenance/packages/openSUSE-Leap-Container/files/Dockerfile again and do the checkout manually like in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml#L47.
Otherwise I suppose https://build.suse.de/project/show/QA:Maintenance would be the right place to add a new container (based on the existing openSUSE-Leap-Container in the same project).
I suggest not using IBS unless we have to. It shouldn't be too hard to create our own variant in OBS.
Updated by mkittler 8 months ago · Edited
- Status changed from Blocked to In Progress
Deploying this on OpenPlatform was rather simple. There was a little bit of clicking on the web UI involved (to assign resources and download the op-prg2-1-staging.yaml file) and then the following CLI commands did the trick:
cd /hdd/openqa-devel/openplatform
export KUBECONFIG=$PWD/op-prg2-1-staging.yaml
kubectl config view # to check whether the env variable is considered as expected
kubectl get nodes # to check whether the CLI client generally works
kubectl apply -f qem-bot.yaml -n qem-bot # to deploy the workload
For the configuration file qem-bot.yaml, see https://github.com/Martchus/qem-bot/pull/new/openplatform.
Unfortunately it runs into the same (probably firewall-related) issue we saw when trying to run it on GitLab:
$ kubectl logs -f -p qem-bot-7496cb6967-hmvrd -n qem-bot
…
Traceback (most recent call last):
File "./qem-bot/bot-ng.py", line 7, in <module>
main()
File "/qem-bot/openqabot/main.py", line 32, in main
sys.exit(cfg.func(cfg))
File "/qem-bot/openqabot/args.py", line 77, in do_amqp
amqp = AMQP(args)
File "/qem-bot/openqabot/amqp.py", line 33, in __init__
self.connection = pika.BlockingConnection(pika.URLParameters(args.url))
File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 359, in __init__
self._impl = self._create_connection(parameters, _impl_class)
File "/usr/lib/python3.6/site-packages/pika/adapters/blocking_connection.py", line 450, in _create_connection
raise self._reap_last_connection_workflow_error(error)
pika.exceptions.AMQPConnectionError
It also didn't help to specify --url amqp://… with the IP (instead of using the domain name and TLS). So maybe we need yet another SD ticket, but I first asked in the existing one.
Updated by okurz 8 months ago
to be explicit as the ticket URL was some comments back: Blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-154403
Updated by mkittler 8 months ago
Since I don’t know how to answer your questions myself I asked about it on Slack: https://suse.slack.com/archives/C04S88VCHS7/p1714640151238429
Updated by livdywan 7 months ago
okurz wrote in #note-25:
to be explicit as the ticket URL was some comments back: Blocked on https://sd.suse.com/servicedesk/customer/portal/1/SD-154403
Conversation ongoing (checking since our SLOs require an update within the next couple of days)
Updated by okurz 7 months ago
- Status changed from Blocked to In Progress
- Assignee changed from mkittler to okurz
The team is getting annoyed and frustrated by not-the-place, not-the-time arguments and people coming up with weird blockers in https://sd.suse.com/servicedesk/customer/portal/1/SD-154403. I will add a corresponding comment and escalate so that we can use OpenPlatform rather than reverting to manually maintaining virtual machines on other, less maintained hypervisors.
Updated by okurz 7 months ago
- Status changed from In Progress to Blocked
I provided a suggestion in https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 and mentioned the potential need for escalation in https://suse.slack.com/archives/C02CANHLANP/p1717755396214829 in case we don't reach clear decisions.
Updated by livdywan 6 months ago
mkittler wrote in #note-31:
Maybe we should nevertheless respond somehow in the SD ticket. (Even though we don't fully understand the options mentioned there.)
Honestly I don't know what the difference between the last two suggestions from Martin is. I would consider saying something like, "if we deploy qembot on op, which one is the easiest way forward for everyone involved?".
Updated by okurz 6 months ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Priority changed from Normal to High
As you guys stated, and as the latest comment in the ticket was from Martin Piala, we need to act on it to drive the implementation further, i.e. find a way to allow OpenPlatform containers to access rabbit.suse.de. Setting this to High as mgriessmeier asked about this.
Updated by okurz 6 months ago
- Status changed from Workable to In Progress
- Assignee set to okurz
I think it's a better approach to make the work on https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 more explicit and to ensure we have the necessary requirements fulfilled to be able to run qem-bot. Creating a separate dedicated infra ticket.
Updated by okurz 6 months ago
I wonder why we decided to use containers on OpenPlatform rather than gitlab CI. In https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 Ruediger Oertel 2024-04-19 12:50 commented
looks like this was already discussed elsewhere and the approach with the gitlap runners was not really optimal for what was being done.
and later on Marius Kittler 2024-04-19 14:29 says
The pipeline is still running into the pika.exceptions.AMQPConnectionError. Not sure whether the DNS change is already effective.
Probably we’re now going with openplatform anyway - although it’ll be still great if the GitLab pipeline worked as short-term solution.
and later Marius Kittler 2024-06-10 12:00
Our first attempt of using GitLab runners was considered an abuse so we moved the container to OpenPlatform
Updated by okurz 6 months ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
Found the conversation https://suse.slack.com/archives/C02BXKBMXNV/p1713446598715539 which Ruediger Oertel and Marius Kittler referred to. Quoting from there:
(Marius Kittler) (3 months ago) Does anybody here know who the owner of https://rabbit.suse.de is? I was asked about it when asking to allow gitlab-runners to access amqps://rabbit.suse.de. It was initially setup by @Dominik Heidler and @coolo but they are not owning it now anymore. Maybe someone on the OBS-side is?
[…] (not quoting the complete discussion for the purpose of privacy)
(Marius Kittler) We are trying to connect to amqps://rabbit.suse.de from a GitLab CI job to get openQA-related events. And yes, this is normally not what one would use a CI job for but it'll get the job done. … that is like really wasteful for the gitlab-runner side. a simple python script which then in async does the "reject/approve" stuff after parsing the amqp event is way more efficient
(Hendrik Vogelsang) @Marius Kittler an example would be https://github.com/openSUSE/kurren
(Marcus Rueckert) but abusing gitlab for this is imho not the best way
(Marius Kittler) We already have a service (written in Python) that listens for AMQP events and does the required actions when receiving them. Of course this is abusing GitLab instead of just setting up a server. I don't think that having a long-running GitLab job (I suppose 4 hours is the maximum) is that wasteful, though. (It would not be one job per event/action.) And it would be in-line with our existing GitLab-based setup for performing those and similar actions (on a regular schedule). (edited)
… gitlab-runners are not your replacemnt for setting up a small VM for this […] you should not abuse shared gitlab resources for this. and if you set up a VM for your own gitlab-runner. then you can just run the script there and dont have to funnel it through gitlab
(Georg Pfützenreuter) sounds like a use case for OpenPlatform
However, reconsidering right now, mkittler and I think that GitLab CI provides the benefits of reporting, alerting and secrets management over containers in OpenPlatform. Also, now that https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 has progressed, GitLab CI runners should have access to rabbit.suse.de, so we can continue with our original plan.
Updated by okurz 5 months ago
- Related to action #165345: [spike][timeboxed:20h] Custom qa-tools team managed low-maintenance platform for hosting team-owned containerized workload added
Updated by livdywan 3 months ago · Edited
However, reconsidering right now, mkittler and I think that GitLab CI provides the benefits of reporting, alerting and secrets management over containers in OpenPlatform. Also, now that https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 has progressed, GitLab CI runners should have access to rabbit.suse.de, so we can continue with our original plan.
The SD ticket is seemingly blocked on https://jira.suse.com/browse/PLAT-499 which has been "in progress" since May. Do we assume this is usable right now or should we block on this?
Or do we go ahead with GitLab despite the concerns from others? If so, I guess we need to check that it works.
Updated by okurz 3 months ago
livdywan wrote in #note-47:
However, reconsidering right now, mkittler and I think that GitLab CI provides the benefits of reporting, alerting and secrets management over containers in OpenPlatform. Also, now that https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 has progressed, GitLab CI runners should have access to rabbit.suse.de, so we can continue with our original plan.
The SD ticket is seemingly blocked on https://jira.suse.com/browse/PLAT-499 which has been "in progress" since May. Do we assume this is usable right now or should we block on this?
Or do we go ahead with GitLab despite the concerns from others? If so, I guess we need to check that it works.
we go ahead with gitlab and make it work.
Updated by robert.richardson 3 months ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review 3 months ago
- Due date set to 2024-10-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by robert.richardson 3 months ago · Edited
GitLab CI runners seem to not have access yet, as I cannot reproduce the pika.exceptions.AMQPConnectionError issue locally.
For me everything works fine when I create the container and reproduce the commands of the pipeline on my machine (within VPN):
$BOT_LAUNCHER ./bot-ng.py -c /etc/openqabot --token $BOT_TOKEN $BOT_PARAMS $BOT_CMD 2>&1 | tee bot_${BOT_CMD}_log.log
+ tee bot_amqp_log.log
+ timeout --preserve-status --signal=SIGINT 230m ./bot-ng.py -c /etc/openqabot --token op3nQAB0tT0k3n --dry --debug amqp
2024-09-27 11:04:27 INFO AMQP listening started
2024-09-27 11:07:10 DEBUG Received AMQP message: {'ARCH': 'x86_64',...
Updated by robert.richardson 3 months ago
- Due date changed from 2024-10-08 to 2025-01-31
- Status changed from In Progress to Blocked
- Priority changed from High to Normal
Updated by livdywan about 1 month ago
I opened SD-174039 to request a dedicated cluster.
Updated by okurz about 1 month ago
- Status changed from Blocked to Feedback
- Assignee changed from robert.richardson to okurz
https://sd.suse.com/servicedesk/customer/portal/1/SD-174039 was rejected because we do not want to run more infrastructure on our own. https://sd.suse.com/servicedesk/customer/portal/1/SD-154403 was completed as "Won't Do" because of miscommunication. szarate will try to get an answer via other communication channels as to why shared GitLab runners shouldn't be able to access rabbit.suse.de.
Updated by okurz about 1 month ago
- Due date deleted (2025-01-31)
- Priority changed from Normal to Low
- Target version changed from Ready to future