action #157135

open

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #155671: [epic] Better handling of SLE maintenance test review

http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M

Added by okurz 2 months ago. Updated about 2 months ago.

Status: Workable
Priority: Normal
Assignee: -
Target version:
Start date: 2024-03-13
Due date:
% Done: 0%
Estimated time:
Description

Observation

As far as I know, the representation of openQA job status on http://dashboard.qam.suse.de/blocked should auto-update based on AMQP events from https://openqa.suse.de. However, when a failed job is restarted, I don't see those updates on http://dashboard.qam.suse.de/blocked until the next forced status update happens in "sync aggregates", which is scheduled every 30m from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules.

EDIT: I just saw some minutes later at 0815Z that a red box turned to blue for a job that was restarted at 0802Z. That was possibly because https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2380594 "sync aggregates" just finished.

We looked into the problem together during the tools team unblock meeting and verified that, in general, AMQP events are sent by OSD and received by qam2. The dashboard handles AMQP events to update the database content. We have identified the specific case where inconsistencies appear: when a build has failed jobs and jobs are then retriggered, or new jobs are triggered in the same build, the dashboard still shows a red box for failed jobs and the incident details still show failed jobs. But following the "See openQA for details" link leads to an openQA /tests/overview page with no failed jobs, only scheduled ones (potentially not yet running), which is inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2

So what happens is:

Jobs are running. The box on the blocked page is blue as expected.

One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.

A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, which is probably not expected. It would probably make more sense if the box went back to blue (and the job were considered "waiting" instead of "failed").

Maybe the box is already intended to go back to blue and that is just not working because either:

  • openQA doesn't create the expected event for the newly scheduled job, or
  • the event isn't evaluated correctly by the dashboard's AMQP bot

So I guess we need to check those two points.

We are sure that https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Model/Jobs.pm#L99C2-L99C110 is called, meaning that qem-dashboard knows that a job is restarted. So the evaluation of the old failed job and the new waiting job likely does not meet our expectation. The cause could be in the controller code or in the JavaScript code where the job results are evaluated.
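To make the expected evaluation concrete, here is a minimal sketch in Python of how the box color could be derived so that a restarted or newly triggered job supersedes the old failed one. It is not the qem-dashboard implementation (which is Perl and JavaScript). The `Job` fields, the scenario names, the second and third job IDs and the "green" color for an all-passed build are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Job:
    job_id: int    # openQA job ID; assumed to grow for newer jobs
    scenario: str  # hypothetical key identifying the scenario within a build
    status: str    # assumed values: "waiting", "failed", "passed"


def latest_per_scenario(jobs: Iterable[Job]) -> List[Job]:
    """Keep only the newest job of each scenario (highest job ID wins)."""
    latest = {}
    for job in jobs:
        current = latest.get(job.scenario)
        if current is None or job.job_id > current.job_id:
            latest[job.scenario] = job
    return list(latest.values())


def box_color(jobs: Iterable[Job]) -> str:
    """Derive the box color from the newest job of each scenario only."""
    current = latest_per_scenario(jobs)
    if any(job.status == "failed" for job in current):
        return "red"
    if any(job.status == "waiting" for job in current):
        return "blue"
    return "green"  # assumed color when nothing is failed or waiting


# Step 2 of the scenario above: one job has failed, another is still waiting.
jobs = [
    Job(13826903, "scenario-a", "failed"),
    Job(13826904, "scenario-b", "waiting"),  # hypothetical second job
]
color_after_failure = box_color(jobs)

# Step 3: a restart (or a new job in the same scenario) replaces the failed job.
jobs.append(Job(13826999, "scenario-a", "waiting"))  # hypothetical new job ID
color_after_restart = box_color(jobs)
```

Under these assumptions `color_after_failure` is "red" and `color_after_restart` is "blue", because the old failed job no longer counts once a newer job exists for the same scenario. Selecting by scenario rather than by an explicit restart link is a deliberate choice in this sketch, so that both actual restarts and new jobs in the same scenario are covered.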

Acceptance criteria

Suggestions

  • Consider doing #157204 first to get more experience and ensure that we have better test coverage
  • Look into log messages from the corresponding "dashboard-amqp-watcher.service" systemd service running next to the dashboard, which should capture all related actions
  • Look at https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Command/amqp_watcher.pm
  • Verify that the code works as intended and that we actually see events from openQA, etc.
  • Verify that openQA job updates actually show up to users looking at the dashboard, both for actual "restarts" and for "new jobs in the same scenario"