Project

General

Profile

Actions

action #157135

open

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #155671: [epic] Better handling of SLE maintenance test review

http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M

Added by okurz about 2 months ago. Updated about 1 month ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Target version:
Start date:
2024-03-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

As far as I know the representation of openQA job status on http://dashboard.qam.suse.de/blocked should auto-update based on AMQP events from https://openqa.suse.de but when a failed job is restarted I don't see those updates happening on http://dashboard.qam.suse.de/blocked until a new forced status update happens in the "sync aggregates" as scheduled from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules every 30m.

EDIT: I just saw some minutes later at 0815Z that a red box turned to blue for a job that was restarted at 0802Z. That was possibly because https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2380594 "sync aggregates" just finished.

We looked into the problem together during the tools team unblock meeting and we verified that in general AMQP events are sent by OSD and are received by qam2. AMQP events are handled by the dashboard to update the database content. We have identified the specific case where inconsistencies appear: When a build has failed jobs but then jobs are retriggered or new jobs are triggered in the same build then the dashboard still shows a red box for failed jobs and the incident details shows still failed jobs. But when following the "See openQA for details" link one sees an openQA /tests/overview page with no failed but scheduled jobs (potentially not yet running) so inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2

So what happens is:

Jobs are running. The box on the blocked page is blue as expected.

One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.

A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").

Maybe it is already intended that the box goes back blue and that's just not working because:

openQA doesn't create the expected event for the newly scheduled job

the event isn't evaluated correctly by the dashboard's AMQP bot

So I guess we need to check those two points.

We are sure that https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Model/Jobs.pm#L99C2-L99C110 is called meaning that the qem-dashboard knows that a job is restarted. So likely the evaluation of the old failed job and the new waiting job is not meeting our expectation. Could be the controller code or javascript code where the job results are evaluated.

Acceptance criteria

Suggestions

  • Consider doing #157204 first to get more experience and ensure that we have better test coverage
  • Look into log messages from the according "dashboard-amqp-watcher.service" systemd service running next to the dashboard that should capture all according actions
  • Look at https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Command/amqp_watcher.pm
  • Verify that the code works as intended and that we actually see events from openQA, etc.
  • Verify that openQA job updates actually show up to users looking at the dashboard for both actual "restarts" as well as "new jobs in the same scenario"
Actions #1

Updated by okurz about 2 months ago

  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes.
  • Description updated (diff)
Actions #2

Updated by okurz about 2 months ago

  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes. to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by ybonatakis about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #4

Updated by ybonatakis about 2 months ago

What i have seen so far (after have restarted the dashboard-amqp-watcher.service)

a. amqp messages are populated to the queue once a job is restarted.

[x] suse.openqa.job.restart:{"ARCH":"x86_64","BUILD":":32885:kernel-ec2", "FLAVOR":"Server-DVD-Incidents-Kernel", ents-Kernel@64bit-with-ltp.qcow2,"MACHINE":"64bit","TEST":"ltp_controllers","auto":0, "bugref":"bsc#1220264","bugurl":"https:\/\/bugzilla.suse.com\/show_bug.cgi?id=1220264", "group_id":281,"id":"13782223","remaining":1,"result":{"13782223":13791364}}

b. another job is restarted and journal also logs once a job is restarted

qam2 dashboard[30577]: [30577] [i] restart 13793528 -> 13793577

c. https://dashboard.qam.suse.de/blocked on the Network tab shows constant blocked request in regular frequency

observation are after a couple of hours (~2h)
I will keep an eye on the queues for a while to see any setback

Actions #5

Updated by ybonatakis about 1 month ago

  • Status changed from In Progress to Feedback

after a whole day the dashboard and amqp still properly. Set it to feedback for now

Actions #6

Updated by okurz about 1 month ago

I don't understand. Your previous point "c." reads like a confirmation of the original issue, is it? And in the next comment you stated "after a whole day the dashboard and amqp still properly". I assume you wanted to say "still work properly"? But I have not seen any statement of applied changes. So what are your planned steps?

Actions #7

Updated by ybonatakis about 1 month ago

  • Status changed from Feedback to In Progress

Today the dashboard is again misbehaving.
the blocked request is still coming in but with the same data. all with the last_updated: 1710747916616

in the queue i found

[x] suse.obs.metrics:build,namespace=SUSE,worker=h01-ch3b:5,arch=x86_64,reason=meta_change,state=success,buildtype=spec duration=168,latency=35,total=203                                                                                    
Server connection <Connection: "amqps://suse:******@rabbit.suse.de:5671" at 0x7f750460ae40> was stuck. No frames were received in 183 seconds.
Actions #8

Updated by ybonatakis about 1 month ago

restart of the service has no impact

Actions #9

Updated by ybonatakis about 1 month ago

now dashboard works again for some reason and i cant reproduce the problem. no change from my side

Actions #10

Updated by okurz about 1 month ago

ybonatakis wrote in #note-9:

now dashboard works again

What exactly you do you mean by that?

Actions #11

Updated by ybonatakis about 1 month ago

okurz wrote in #note-10:

ybonatakis wrote in #note-9:

now dashboard works again

What exactly you do you mean by that?

The underline problem is not reproducible. Restarted jobs on openqa side, and checking the dashboard to verify that the state is changing as expected. When jobs fails again the dashboard shows the failure.

Liv and I tried to add debug logs but we couldnt make it work. More specifically, we edit the /etc/systemd/system/dashboard-amqp-watcher.service added Environment="MOJO_LOG_LEVEL=debug" and restart the service.
Another attempt was to set that in the environment MOJO_LOG_LEVEL=debug and start the service on this terminal. but no debug logs appeared.

personally i am running out of ideas.
I have confirmed that the openqa messages appear in the queue even when the dashboard misfunction.

Actions #12

Updated by ybonatakis about 1 month ago

I hardcode the log level in https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard.pm#L41 and i will leave it running, hoping that it is gonna give some useful data in case it starts misfunction again

Actions #13

Updated by ybonatakis about 1 month ago

In general there is not problem at all with amqp. All the data are provided in the requests.
I ended believing that the problem comes down to the UI and particular the https://github.com/openSUSE/qem-dashboard/blob/main/assets/vue/components/PageBlocked.vue. I think it might be something wrong with the script that match the accidents and updates the UI. this is the next step.

Actions #14

Updated by openqa_review about 1 month ago

  • Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by livdywan about 1 month ago

Apparently the patch to the log level causes the pipeline to fail https://gitlab.suse.de/qa-maintenance/qamops/-/jobs/2400856

Actions #16

Updated by ybonatakis about 1 month ago

I start looking to the scripts to make sense. At that point I am completely lost.
the data between /blocked request and openqa and dashboard are not synced. And they appear sync after long time in some cases. But it is hard to follow among all the data.

Sometimes /blocked+openqa looks same but dashboard
others /blocked+dashboard but openqa not

And sometimes i cant find entries in /blocked data

dashboard-amqp-watcher.service logs stopped since Mar 20 00:47:02 (today morning)
dashboard.service keeps doing request but cleanup reports (in cleanup) Can't remove file "/tmp/prefork.pid": Operation not permitted at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Server/Prefork.pm line 20. not sure if it is related but i thought to mention it.

I think this issue should have higher priority

Actions #17

Updated by ybonatakis about 1 month ago

livdywan wrote in #note-15:

Apparently the patch to the log level causes the pipeline to fail https://gitlab.suse.de/qa-maintenance/qamops/-/jobs/2400856

I reverted it and restart the service. And i also retriggered https://gitlab.suse.de/qa-maintenance/qamops/-/pipelines/1046526

Actions #18

Updated by ybonatakis about 1 month ago

I restarted ltp_syscalls from https://openqa.suse.de/tests/overview?build=%3A32948%3Akernel-ec2&groupid=193

journalctl -u dashboard-amqp-watcher.service | grep 13814830
Mar 18 21:07:13 qam2 dashboard[16674]:     "13778984" => 13814830
Mar 18 21:07:13 qam2 dashboard[16674]: [16674] [i] restart 13778984 -> 13814830
Mar 18 21:40:14 qam2 dashboard[16674]:   "id" => 13814830,
Mar 18 21:40:14 qam2 dashboard[16674]: [16674] [i] 13814830: failed (failed)
Mar 20 10:42:54 qam2 dashboard[27458]: [27458] [i] restart 13814830 -> 13827412```

but /blocked data doesnt contain anything after 13 requests. How i check: on dev panel>select /blocked XHR> find (select regex and switch to raw print) > type `:kernel-debug\S+12-SP3-TERADATA Kernel`
Also checking the queues of messages, i dont see anything for this job
Actions #19

Updated by ybonatakis about 1 month ago

Maybe is easier to debug the functionality using the https://dashboard.qam.suse.de/incident/ and checking the responses where
something like

"incident_summary": {
            "failed": 1,
            "passed": 49,
            "waiting": 1
        },

shows the inconsistent data

Actions #20

Updated by ybonatakis about 1 month ago

ybonatakis wrote in #note-19:

Maybe is easier to debug the functionality using the https://dashboard.qam.suse.de/incident/ and checking the responses where
something like

"incident_summary": {
            "failed": 1,
            "passed": 49,
            "waiting": 1
        },

shows the inconsistent data

My mistake. because data looks correct. but due to https://github.com/openSUSE/qem-dashboard/blob/61972b1b63f5f29c9ab17237cc933b0e243b9d2c/assets/vue/components/ResultSummary.vue#L14 the red box is shown as it has priority over the waiting kobs

Actions #21

Updated by okurz about 1 month ago

  • Tags changed from reactive work to qem-bot, qem-dashboard, maintenance
  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes
  • Description updated (diff)
  • Due date deleted (2024-04-02)
  • Status changed from In Progress to New
  • Assignee deleted (ybonatakis)
  • Parent task set to #155671

We looked into the problem together during the tools team unblock meeting and we verified that in general AMQP events are sent by OSD and are received by qam2. AMQP events are handled by the dashboard to update the database content. We have identified the specific case where inconsistencies appear: When a build has failed jobs but then jobs are retriggered or new jobs are triggered in the same build then the dashboard still shows a red box for failed jobs and the incident details shows still failed jobs. But when following the "See openQA for details" link one sees an openQA /tests/overview page with no failed but scheduled jobs (potentially not yet running) so inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2

Actions #22

Updated by mkittler about 1 month ago ยท Edited

So what happens is:

  1. Jobs are running. The box on the blocked page is blue as expected.
  2. One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.
  3. A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").

Maybe it is already intended that the box goes back blue and that's just not working because:

  • openQA doesn't create the expected event for the newly scheduled job
  • the event isn't evaluated correctly by the dashboard's AMQP bot

So I guess we need to check those two points.

Actions #23

Updated by okurz about 1 month ago

  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #24

Updated by okurz about 1 month ago

  • Target version changed from Ready to Tools - Next
Actions

Also available in: Atom PDF