action #157135
closedcoordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release
coordination #155671: [epic] Better handling of SLE maintenance test review
http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
0%
Description
Observation¶
As far as I know the representation of openQA job status on http://dashboard.qam.suse.de/blocked should auto-update based on AMQP events from https://openqa.suse.de but when a failed job is restarted I don't see those updates happening on http://dashboard.qam.suse.de/blocked until a new forced status update happens in the "sync aggregates" as scheduled from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules every 30m.
EDIT: I just saw some minutes later at 0815Z that a red box turned to blue for a job that was restarted at 0802Z. That was possibly because https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2380594 "sync aggregates" just finished.
We looked into the problem together during the tools team unblock meeting and we verified that in general AMQP events are sent by OSD and are received by qam2. AMQP events are handled by the dashboard to update the database content. We have identified the specific case where inconsistencies appear: When a build has failed jobs but then jobs are retriggered or new jobs are triggered in the same build then the dashboard still shows a red box for failed jobs and the incident details shows still failed jobs. But when following the "See openQA for details" link one sees an openQA /tests/overview page with no failed but scheduled jobs (potentially not yet running) so inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2
So what happens is:
Jobs are running. The box on the blocked page is blue as expected.
One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.
A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").
Maybe it is already intended that the box goes back blue and that's just not working because:
openQA doesn't create the expected event for the newly scheduled job
the event isn't evaluated correctly by the dashboard's AMQP bot
So I guess we need to check those two points.
We are sure that https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Model/Jobs.pm#L99C2-L99C110 is called meaning that the qem-dashboard knows that a job is restarted. So likely the evaluation of the old failed job and the new waiting job is not meeting our expectation. Could be the controller code or javascript code where the job results are evaluated.
Acceptance criteria¶
- AC1: We ensured that openQA job status events over AMQP update the accordingly display on http://dashboard.qam.suse.de/blocked
- AC2: Given existing openQA results for a SLE maintenance incident in https://openqa.suse.de and on https://dashboard.qam.suse.de/blocked with at least one failed openQA job When all failed openQA jobs for that incident results are scheduled, e.g. by restarting or triggering a new job in the same scenario, but not yet running, e.g. triggered for testing purposes with an invalid WORKER_CLASS, Then https://dashboard.qam.suse.de/blocked and http://dashboard.qam.suse.de/incident/$incident MUST NOT show any failed openQA jobs
Suggestions¶
- Consider doing #157204 first to get more experience and ensure that we have better test coverage
- Look into log messages from the according "dashboard-amqp-watcher.service" systemd service running next to the dashboard that should capture all according actions
- Look at https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Command/amqp_watcher.pm
- Verify that the code works as intended and that we actually see events from openQA, etc.
- Verify that openQA job updates actually show up to users looking at the dashboard for both actual "restarts" as well as "new jobs in the same scenario"
Updated by okurz 6 months ago
- Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes.
- Description updated (diff)
Updated by okurz 6 months ago
- Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes. to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by ybonatakis 6 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by ybonatakis 6 months ago
What i have seen so far (after have restarted the dashboard-amqp-watcher.service)
a. amqp messages are populated to the queue once a job is restarted.
[x] suse.openqa.job.restart:{"ARCH":"x86_64","BUILD":":32885:kernel-ec2", "FLAVOR":"Server-DVD-Incidents-Kernel", ents-Kernel@64bit-with-ltp.qcow2,"MACHINE":"64bit","TEST":"ltp_controllers","auto":0, "bugref":"bsc#1220264","bugurl":"https:\/\/bugzilla.suse.com\/show_bug.cgi?id=1220264", "group_id":281,"id":"13782223","remaining":1,"result":{"13782223":13791364}}
b. another job is restarted and journal also logs once a job is restarted
qam2 dashboard[30577]: [30577] [i] restart 13793528 -> 13793577
c. https://dashboard.qam.suse.de/blocked on the Network tab shows constant blocked request in regular frequency
observation are after a couple of hours (~2h)
I will keep an eye on the queues for a while to see any setback
Updated by ybonatakis 6 months ago
- Status changed from In Progress to Feedback
after a whole day the dashboard and amqp still properly. Set it to feedback for now
Updated by okurz 6 months ago
I don't understand. Your previous point "c." reads like a confirmation of the original issue, is it? And in the next comment you stated "after a whole day the dashboard and amqp still properly". I assume you wanted to say "still work properly"? But I have not seen any statement of applied changes. So what are your planned steps?
Updated by ybonatakis 6 months ago
- Status changed from Feedback to In Progress
Today the dashboard is again misbehaving.
the blocked request is still coming in but with the same data. all with the last_updated: 1710747916616
in the queue i found
[x] suse.obs.metrics:build,namespace=SUSE,worker=h01-ch3b:5,arch=x86_64,reason=meta_change,state=success,buildtype=spec duration=168,latency=35,total=203
Server connection <Connection: "amqps://suse:******@rabbit.suse.de:5671" at 0x7f750460ae40> was stuck. No frames were received in 183 seconds.
Updated by ybonatakis 6 months ago
now dashboard works again for some reason and i cant reproduce the problem. no change from my side
Updated by ybonatakis 6 months ago
okurz wrote in #note-10:
ybonatakis wrote in #note-9:
now dashboard works again
What exactly you do you mean by that?
The underline problem is not reproducible. Restarted jobs on openqa side, and checking the dashboard to verify that the state is changing as expected. When jobs fails again the dashboard shows the failure.
Liv and I tried to add debug logs but we couldnt make it work. More specifically, we edit the /etc/systemd/system/dashboard-amqp-watcher.service added Environment="MOJO_LOG_LEVEL=debug"
and restart the service.
Another attempt was to set that in the environment MOJO_LOG_LEVEL=debug and start the service on this terminal. but no debug logs appeared.
personally i am running out of ideas.
I have confirmed that the openqa messages appear in the queue even when the dashboard misfunction.
Updated by ybonatakis 6 months ago
I hardcode the log level in https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard.pm#L41 and i will leave it running, hoping that it is gonna give some useful data in case it starts misfunction again
Updated by ybonatakis 6 months ago
In general there is not problem at all with amqp. All the data are provided in the requests.
I ended believing that the problem comes down to the UI and particular the https://github.com/openSUSE/qem-dashboard/blob/main/assets/vue/components/PageBlocked.vue. I think it might be something wrong with the script that match the accidents and updates the UI. this is the next step.
Updated by openqa_review 6 months ago
- Due date set to 2024-04-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 6 months ago
Apparently the patch to the log level causes the pipeline to fail https://gitlab.suse.de/qa-maintenance/qamops/-/jobs/2400856
Updated by ybonatakis 6 months ago
I start looking to the scripts to make sense. At that point I am completely lost.
the data between /blocked request and openqa and dashboard are not synced. And they appear sync after long time in some cases. But it is hard to follow among all the data.
Sometimes /blocked+openqa looks same but dashboard
others /blocked+dashboard but openqa not
And sometimes i cant find entries in /blocked data
dashboard-amqp-watcher.service logs stopped since Mar 20 00:47:02 (today morning)
dashboard.service keeps doing request but cleanup reports (in cleanup) Can't remove file "/tmp/prefork.pid": Operation not permitted at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Server/Prefork.pm line 20.
not sure if it is related but i thought to mention it.
I think this issue should have higher priority
Updated by ybonatakis 6 months ago
livdywan wrote in #note-15:
Apparently the patch to the log level causes the pipeline to fail https://gitlab.suse.de/qa-maintenance/qamops/-/jobs/2400856
I reverted it and restart the service. And i also retriggered https://gitlab.suse.de/qa-maintenance/qamops/-/pipelines/1046526
Updated by ybonatakis 6 months ago
I restarted ltp_syscalls from https://openqa.suse.de/tests/overview?build=%3A32948%3Akernel-ec2&groupid=193
journalctl -u dashboard-amqp-watcher.service | grep 13814830
Mar 18 21:07:13 qam2 dashboard[16674]: "13778984" => 13814830
Mar 18 21:07:13 qam2 dashboard[16674]: [16674] [i] restart 13778984 -> 13814830
Mar 18 21:40:14 qam2 dashboard[16674]: "id" => 13814830,
Mar 18 21:40:14 qam2 dashboard[16674]: [16674] [i] 13814830: failed (failed)
Mar 20 10:42:54 qam2 dashboard[27458]: [27458] [i] restart 13814830 -> 13827412```
but /blocked data doesnt contain anything after 13 requests. How i check: on dev panel>select /blocked XHR> find (select regex and switch to raw print) > type `:kernel-debug\S+12-SP3-TERADATA Kernel`
Also checking the queues of messages, i dont see anything for this job
Updated by ybonatakis 6 months ago
Maybe is easier to debug the functionality using the https://dashboard.qam.suse.de/incident/ and checking the responses where
something like
"incident_summary": {
"failed": 1,
"passed": 49,
"waiting": 1
},
shows the inconsistent data
Updated by ybonatakis 6 months ago
ybonatakis wrote in #note-19:
Maybe is easier to debug the functionality using the https://dashboard.qam.suse.de/incident/ and checking the responses where
something like"incident_summary": { "failed": 1, "passed": 49, "waiting": 1 },
shows the inconsistent data
My mistake. because data looks correct. but due to https://github.com/openSUSE/qem-dashboard/blob/61972b1b63f5f29c9ab17237cc933b0e243b9d2c/assets/vue/components/ResultSummary.vue#L14 the red box is shown as it has priority over the waiting kobs
Updated by okurz 6 months ago
- Tags changed from reactive work to qem-bot, qem-dashboard, maintenance
- Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes
- Description updated (diff)
- Due date deleted (
2024-04-02) - Status changed from In Progress to New
- Assignee deleted (
ybonatakis) - Parent task set to #155671
We looked into the problem together during the tools team unblock meeting and we verified that in general AMQP events are sent by OSD and are received by qam2. AMQP events are handled by the dashboard to update the database content. We have identified the specific case where inconsistencies appear: When a build has failed jobs but then jobs are retriggered or new jobs are triggered in the same build then the dashboard still shows a red box for failed jobs and the incident details shows still failed jobs. But when following the "See openQA for details" link one sees an openQA /tests/overview page with no failed but scheduled jobs (potentially not yet running) so inconsistent with the display on the dashboard. To phrase the expected result following the common BDD template as AC2
Updated by mkittler 6 months ago ยท Edited
So what happens is:
- Jobs are running. The box on the blocked page is blue as expected.
- One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red as expected.
- A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").
Maybe it is already intended that the box goes back blue and that's just not working because:
- openQA doesn't create the expected event for the newly scheduled job
- the event isn't evaluated correctly by the dashboard's AMQP bot
So I guess we need to check those two points.
Updated by okurz 6 months ago
- Subject changed from http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes to http://dashboard.qam.suse.de/blocked should update job status by AMQP events, but seemingly doesn't or only after some minutes size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz 3 months ago
- Status changed from Workable to In Progress
I was able to recreate a testing setup with my own dashboard and one incident with one running openqa job. Recreating own MQTT broker tends to be a bigger task so I manually invoke script/dashboard eval '$app->amqp->handle("suse.openqa.job.done",{"id" => 14616167, "result" => "failed"});'
which should simulate receiving of a message (at least this method is used in tests). As expected, the box turned red. After that I tried script/dashboard eval '$app->amqp->handle("suse.openqa.job.restart",{"id" => 14616167, "result" => {"14616167" => 14616168}});'
to "restart" the test. This actually did the correct thing and the box turned to blue again and the database was updated with the new openqa job id. That means, that the code indeed behaves as described in the tests. As a next step, I will setup a small AMQP monitoring to see what messages are transmitted and if the format corresponds to the tests.
Updated by jbaier_cz 3 months ago
I just found out a weird AMQP message:
[2024-06-20 13:38:12.19083] [22997] [debug] HANDLE suse.openqa.job.restart {
"ARCH" => "ppc64le",
"BUILD" => "144.8",
"FLAVOR" => "Online-QR",
"ISO" => "SLE-15-SP5-Online-ppc64le-Build144.8-Media1.iso",
"MACHINE" => "ppc64le-hmc-single-disk",
"TEST" => "create_hdd_textmode_hmc",
"auto" => 0,
"bugref" => undef,
"group_id" => 517,
"id" => 14674537,
"remaining" => 14,
"result" => {
"14674537" => 14692218,
"14674538" => undef,
"14674539" => 14692219,
"14674540" => 14692220,
"14674541" => 14692221,
"14674542" => 14692222,
"14674543" => 14692223,
"14674544" => 14692224,
"14674545" => 14692225,
"14674546" => 14692226,
"14674547" => 14692227,
"14674548" => 14692228,
"14674549" => 14692229,
"14674550" => 14692230,
"14674551" => 14692231
}
}
Note the undef
.
Updated by openqa_review 3 months ago
- Due date set to 2024-07-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz 3 months ago
- Status changed from In Progress to Feedback
A new job is created (e.g. openQA restarts the job) and this new/scheduled job takes its place in openQA's test result overview. The box is still red, that's probably not expected. It would probably make more sense if the box would go back to blue (and the job was considered "waiting" instead of "failed").
Just to make sure I also tried the setup with multiple jobs to see the behavior. The important thing (which might be the source of confusion) is that each box corresponds to multiple tests. The box will turn from red to blue upon job restart only if all failed tests were restarted, if there is but one failed job inside the "box" it will stay red (because that correctly represents that this box will prevent qem-bot to approve the incident as there is a failed job in this group and none of the restarted job will change that).
However I saw some weird behavior on the system level where for some reason during the night the AMQP watcher stops to process the events, maybe there is an issue with the listening socket which is not correctly recognized by the listener and it will not restart the socket. I do not want to spend time with the low level debugging (yet) and will try to solve it with tweaking the RuntimeMaxSec
which we are already using for the service anyway.
I am putting this into feedback for now and will monitor the impact.
Updated by jbaier_cz 2 months ago
- Due date deleted (
2024-07-05) - Status changed from Feedback to Resolved
I would guess this could count as a validation https://suse.slack.com/archives/C02CANHLANP/p1719403335705509
Updated by okurz 2 months ago
jbaier_cz wrote in #note-31:
I would guess this could count as a validation https://suse.slack.com/archives/C02CANHLANP/p1719403335705509
As proprietary solutions have a tendency to go away and be unaccessible eventually let me quote the complete conversation behind that URL:
(Felix Niederwanger) Hey QA-Tools team! I love the new feature where http://dashboard.qam.suse.de/ respons very quickly to restarted jobs! How is this done internally? RabbitMQ?
(Jan Baier) yes, dashboard app is listening to AMQP events emitted from openQA
(Felix Niederwanger) :wow_amaze: This is very cool! Kudos :thumbs-up: