Project

General

Profile

Actions

action #157135

closed

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #155671: [epic] Better handling of SLE maintenance test review

http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M

Added by okurz 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Start date:
2024-03-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

As far as I know the representation of openQA job status on http://dashboard.qam.suse.de/blocked should auto-update based on AMQP events from https://openqa.suse.de but when a failed job is restarted I don't see those updates happening on http://dashboard.qam.suse.de/blocked until a new forced status update happens in the "sync aggregates" as scheduled from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules every 30m.

EDIT: I just saw some minutes later at 0815Z that a red box turned to blue for a job that was restarted at 0802Z. That was possibly because https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2380594 "sync aggregates" just finished.

We looked into the problem together during the tools team unblock meeting and we verified that in general AMQP events are sent by OSD and are received by qam2. AMQP events are handled by the dashboard to update the database content. We have identified the specific case where inconsistencies appear: When a build has failed jobs but then jobs are retriggered or new jobs are triggered in the same build then the dashboard still shows a red box for failed jobs and the incident details shows still failed jobs. But when following the "See openQA for details" link one sees an openQA /tests/overview page with no failed but scheduled jobs (potentially not yet running) so inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2

So what happens is:

Jobs are running. The box on the blocked page is blue as expected.

One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.

A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").

Maybe it is already intended that the box goes back blue and that's just not working because:

openQA doesn't create the expected event for the newly scheduled job

the event isn't evaluated correctly by the dashboard's AMQP bot

So I guess we need to check those two points.

We are sure that https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Model/Jobs.pm#L99C2-L99C110 is called meaning that the qem-dashboard knows that a job is restarted. So likely the evaluation of the old failed job and the new waiting job is not meeting our expectation. Could be the controller code or javascript code where the job results are evaluated.

Acceptance criteria

Suggestions

  • Consider doing #157204 first to get more experience and ensure that we have better test coverage
  • Look into log messages from the according "dashboard-amqp-watcher.service" systemd service running next to the dashboard that should capture all according actions
  • Look at https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Command/amqp_watcher.pm
  • Verify that the code works as intended and that we actually see events from openQA, etc.
  • Verify that openQA job updates actually show up to users looking at the dashboard for both actual "restarts" as well as "new jobs in the same scenario"
Actions #1

Updated by okurz 9 months ago

  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes.
  • Description updated (diff)
Actions #2

Updated by okurz 9 months ago

  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes. to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by ybonatakis 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #4

Updated by ybonatakis 9 months ago

What i have seen so far (after have restarted the dashboard-amqp-watcher.service)

a. amqp messages are populated to the queue once a job is restarted.

[x] suse.openqa.job.restart:{"ARCH":"x86_64","BUILD":":32885:kernel-ec2", "FLAVOR":"Server-DVD-Incidents-Kernel", ents-Kernel@64bit-with-ltp.qcow2,"MACHINE":"64bit","TEST":"ltp_controllers","auto":0, "bugref":"bsc#1220264","bugurl":"https:\/\/bugzilla.suse.com\/show_bug.cgi?id=1220264", "group_id":281,"id":"13782223","remaining":1,"result":{"13782223":13791364}}

b. another job is restarted and journal also logs once a job is restarted

qam2 dashboard[30577]: [30577] [i] restart 13793528 -> 13793577

c. https://dashboard.qam.suse.de/blocked on the Network tab shows constant blocked request in regular frequency

observation are after a couple of hours (~2h)
I will keep an eye on the queues for a while to see any setback

Actions #5

Updated by ybonatakis 9 months ago

  • Status changed from In Progress to Feedback

after a whole day the dashboard and amqp still properly. Set it to feedback for now

Actions #6

Updated by okurz 9 months ago

I don't understand. Your previous point "c." reads like a confirmation of the original issue, is it? And in the next comment you stated "after a whole day the dashboard and amqp still properly". I assume you wanted to say "still work properly"? But I have not seen any statement of applied changes. So what are your planned steps?

Actions #7

Updated by ybonatakis 9 months ago

  • Status changed from Feedback to In Progress

Today the dashboard is again misbehaving.
the blocked request is still coming in but with the same data. all with the last_updated: 1710747916616

in the queue i found

[x] suse.obs.metrics:build,namespace=SUSE,worker=h01-ch3b:5,arch=x86_64,reason=meta_change,state=success,buildtype=spec duration=168,latency=35,total=203                                                                                    
Server connection <Connection: "amqps://suse:******@rabbit.suse.de:5671" at 0x7f750460ae40> was stuck. No frames were received in 183 seconds.
Actions #8

Updated by ybonatakis 9 months ago

restart of the service has no impact

Actions #9

Updated by ybonatakis 9 months ago

now dashboard works again for some reason and i cant reproduce the problem. no change from my side

Actions #10

Updated by okurz 9 months ago

ybonatakis wrote in #note-9:

now dashboard works again

What exactly you do you mean by that?

Actions #11

Updated by ybonatakis 9 months ago

okurz wrote in #note-10:

ybonatakis wrote in #note-9:

now dashboard works again

What exactly you do you mean by that?

The underline problem is not reproducible. Restarted jobs on openqa side, and checking the dashboard to verify that the state is changing as expected. When jobs fails again the dashboard shows the failure.

Liv and I tried to add debug logs but we couldnt make it work. More specifically, we edit the /etc/systemd/system/dashboard-amqp-watcher.service added Environment="MOJO_LOG_LEVEL=debug" and restart the service.
Another attempt was to set that in the environment MOJO_LOG_LEVEL=debug and start the service on this terminal. but no debug logs appeared.

personally i am running out of ideas.
I have confirmed that the openqa messages appear in the queue even when the dashboard misfunction.

Actions #12

Updated by ybonatakis 9 months ago

I hardcode the log level in https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard.pm#L41 and i will leave it running, hoping that it is gonna give some useful data in case it starts misfunction again

Actions #13

Updated by ybonatakis 9 months ago

In general there is not problem at all with amqp. All the data are provided in the requests.
I ended believing that the problem comes down to the UI and particular the https://github.com/openSUSE/qem-dashboard/blob/main/assets/vue/components/PageBlocked.vue. I think it might be something wrong with the script that match the accidents and updates the UI. this is the next step.

Actions #14

Updated by openqa_review 9 months ago

  • Due date set to 2024-04-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by livdywan 9 months ago

Apparently the patch to the log level causes the pipeline to fail https://gitlab.suse.de/qa-maintenance/qamops/-/jobs/2400856

Actions #16

Updated by ybonatakis 9 months ago

I start looking to the scripts to make sense. At that point I am completely lost.
the data between /blocked request and openqa and dashboard are not synced. And they appear sync after long time in some cases. But it is hard to follow among all the data.

Sometimes /blocked+openqa looks same but dashboard
others /blocked+dashboard but openqa not

And sometimes i cant find entries in /blocked data

dashboard-amqp-watcher.service logs stopped since Mar 20 00:47:02 (today morning)
dashboard.service keeps doing request but cleanup reports (in cleanup) Can't remove file "/tmp/prefork.pid": Operation not permitted at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Server/Prefork.pm line 20. not sure if it is related but i thought to mention it.

I think this issue should have higher priority

Actions #17

Updated by ybonatakis 9 months ago

livdywan wrote in #note-15:

Apparently the patch to the log level causes the pipeline to fail https://gitlab.suse.de/qa-maintenance/qamops/-/jobs/2400856

I reverted it and restart the service. And i also retriggered https://gitlab.suse.de/qa-maintenance/qamops/-/pipelines/1046526

Actions #18

Updated by ybonatakis 9 months ago

I restarted ltp_syscalls from https://openqa.suse.de/tests/overview?build=%3A32948%3Akernel-ec2&groupid=193

journalctl -u dashboard-amqp-watcher.service | grep 13814830
Mar 18 21:07:13 qam2 dashboard[16674]:     "13778984" => 13814830
Mar 18 21:07:13 qam2 dashboard[16674]: [16674] [i] restart 13778984 -> 13814830
Mar 18 21:40:14 qam2 dashboard[16674]:   "id" => 13814830,
Mar 18 21:40:14 qam2 dashboard[16674]: [16674] [i] 13814830: failed (failed)
Mar 20 10:42:54 qam2 dashboard[27458]: [27458] [i] restart 13814830 -> 13827412```

but /blocked data doesnt contain anything after 13 requests. How i check: on dev panel>select /blocked XHR> find (select regex and switch to raw print) > type `:kernel-debug\S+12-SP3-TERADATA Kernel`
Also checking the queues of messages, i dont see anything for this job
Actions #19

Updated by ybonatakis 9 months ago

Maybe is easier to debug the functionality using the https://dashboard.qam.suse.de/incident/ and checking the responses where
something like

"incident_summary": {
            "failed": 1,
            "passed": 49,
            "waiting": 1
        },

shows the inconsistent data

Actions #20

Updated by ybonatakis 9 months ago

ybonatakis wrote in #note-19:

Maybe is easier to debug the functionality using the https://dashboard.qam.suse.de/incident/ and checking the responses where
something like

"incident_summary": {
            "failed": 1,
            "passed": 49,
            "waiting": 1
        },

shows the inconsistent data

My mistake. because data looks correct. but due to https://github.com/openSUSE/qem-dashboard/blob/61972b1b63f5f29c9ab17237cc933b0e243b9d2c/assets/vue/components/ResultSummary.vue#L14 the red box is shown as it has priority over the waiting kobs

Actions #21

Updated by okurz 9 months ago

  • Tags changed from reactive work to qem-bot, qem-dashboard, maintenance
  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes
  • Description updated (diff)
  • Due date deleted (2024-04-02)
  • Status changed from In Progress to New
  • Assignee deleted (ybonatakis)
  • Parent task set to #155671

We looked into the problem together during the tools team unblock meeting and we verified that in general AMQP events are sent by OSD and are received by qam2. AMQP events are handled by the dashboard to update the database content. We have identified the specific case where inconsistencies appear: When a build has failed jobs but then jobs are retriggered or new jobs are triggered in the same build then the dashboard still shows a red box for failed jobs and the incident details shows still failed jobs. But when following the "See openQA for details" link one sees an openQA /tests/overview page with no failed but scheduled jobs (potentially not yet running) so inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2

Actions #22

Updated by mkittler 9 months ago ยท Edited

So what happens is:

  1. Jobs are running. The box on the blocked page is blue as expected.
  2. One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.
  3. A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").

Maybe it is already intended that the box goes back blue and that's just not working because:

  • openQA doesn't create the expected event for the newly scheduled job
  • the event isn't evaluated correctly by the dashboard's AMQP bot

So I guess we need to check those two points.

Actions #23

Updated by okurz 9 months ago

  • Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #24

Updated by okurz 9 months ago

  • Target version changed from Ready to Tools - Next
Actions #25

Updated by okurz 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #26

Updated by jbaier_cz 6 months ago

  • Assignee set to jbaier_cz
Actions #27

Updated by jbaier_cz 6 months ago

  • Status changed from Workable to In Progress

I was able to recreate a testing setup with my own dashboard and one incident with one running openqa job. Recreating own MQTT broker tends to be a bigger task so I manually invoke script/dashboard eval '$app->amqp->handle("suse.openqa.job.done",{"id" => 14616167, "result" => "failed"});' which should simulate receiving of a message (at least this method is used in tests). As expected, the box turned red. After that I tried script/dashboard eval '$app->amqp->handle("suse.openqa.job.restart",{"id" => 14616167, "result" => {"14616167" => 14616168}});' to "restart" the test. This actually did the correct thing and the box turned to blue again and the database was updated with the new openqa job id. That means, that the code indeed behaves as described in the tests. As a next step, I will setup a small AMQP monitoring to see what messages are transmitted and if the format corresponds to the tests.

Actions #28

Updated by jbaier_cz 6 months ago

I just found out a weird AMQP message:

[2024-06-20 13:38:12.19083] [22997] [debug] HANDLE suse.openqa.job.restart {
  "ARCH" => "ppc64le",
  "BUILD" => "144.8",
  "FLAVOR" => "Online-QR",
  "ISO" => "SLE-15-SP5-Online-ppc64le-Build144.8-Media1.iso",
  "MACHINE" => "ppc64le-hmc-single-disk",
  "TEST" => "create_hdd_textmode_hmc",
  "auto" => 0,
  "bugref" => undef,
  "group_id" => 517,
  "id" => 14674537,
  "remaining" => 14,
  "result" => {
    "14674537" => 14692218,
    "14674538" => undef,
    "14674539" => 14692219,
    "14674540" => 14692220,
    "14674541" => 14692221,
    "14674542" => 14692222,
    "14674543" => 14692223,
    "14674544" => 14692224,
    "14674545" => 14692225,
    "14674546" => 14692226,
    "14674547" => 14692227,
    "14674548" => 14692228,
    "14674549" => 14692229,
    "14674550" => 14692230,
    "14674551" => 14692231
  }
}

Note the undef.

Actions #29

Updated by openqa_review 6 months ago

  • Due date set to 2024-07-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #30

Updated by jbaier_cz 6 months ago

  • Status changed from In Progress to Feedback

A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").

Just to make sure I also tried the setup with multiple jobs to see the behavior. The important thing (which might be the source of confusion) is that each box corresponds to multiple tests. The box will turn from red to blue upon job restart only if all failed tests were restarted, if there is but one failed job inside the "box" it will stay red (because that correctly represents that this box will prevent qem-bot to approve the incident as there is a failed job in this group and none of the restarted job will change that).

However I saw some weird behavior on the system level where for some reason during the night the AMQP watcher stops to process the events, maybe there is an issue with the listening socket which is not correctly recognized by the listener and it will not restart the socket. I do not want to spend time with the low level debugging (yet) and will try to solve it with tweaking the RuntimeMaxSec which we are already using for the service anyway.

I am putting this into feedback for now and will monitor the impact.

Actions #31

Updated by jbaier_cz 6 months ago

  • Due date deleted (2024-07-05)
  • Status changed from Feedback to Resolved

I would guess this could count as a validation https://suse.slack.com/archives/C02CANHLANP/p1719403335705509

Actions #32

Updated by okurz 6 months ago

jbaier_cz wrote in #note-31:

I would guess this could count as a validation https://suse.slack.com/archives/C02CANHLANP/p1719403335705509

As proprietary solutions have a tendency to go away and be unaccessible eventually let me quote the complete conversation behind that URL:

(Felix Niederwanger) Hey QA-Tools team! I love the new feature where http://dashboard.qam.suse.de/ respons very quickly to restarted jobs! How is this done internally? RabbitMQ?
(Jan Baier) yes, dashboard app is listening to AMQP events emitted from openQA
(Felix Niederwanger) :wow_amaze: This is very cool! Kudos :thumbs-up:

Actions

Also available in: Atom PDF