action #157135

Updated by okurz 9 months ago

## Observation 
 As far as I know, the representation of openQA job status on http://dashboard.qam.suse.de/blocked should auto-update based on AMQP events from https://openqa.suse.de. However, when a failed job is restarted, I don't see those updates on http://dashboard.qam.suse.de/blocked until the next forced status update happens in "sync aggregates", which is scheduled every 30m from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules. 

 EDIT: I just saw some minutes later at 0815Z that a red box turned to blue for a job that was restarted at 0802Z. That was possibly because https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2380594 "sync aggregates" just finished. 

 We looked into the problem together during the tools team unblock meeting and verified that, in general, AMQP events are sent by OSD and received by qam2. The dashboard handles AMQP events to update the database content. We have identified the specific case where inconsistencies appear: when a build has failed jobs and then jobs are retriggered, or new jobs are triggered in the same build, the dashboard still shows a red box for failed jobs and the incident details still show failed jobs. But the "See openQA for details" link leads to an openQA /tests/overview page that shows no failed jobs, only scheduled ones (potentially not yet running), which is inconsistent with the display on the dashboard. The expected result is phrased in AC2, following the common BDD template. 
 
 So what happens is: 

 1. Jobs are running. The box on the blocked page is blue, as expected. 
 2. One job fails (e.g. https://openqa.suse.de/tests/13826903). The box is red, as expected. 
 3. A new job is created (e.g. openQA restarts the job) and this new, scheduled job takes the failed job's place in openQA's test result overview. The box is still red, which is probably not expected. It would probably make more sense for the box to go back to blue, with the job considered "waiting" instead of "failed". 

 Maybe it is already intended that the box goes back to blue and that is just not working because of one of the following: 

 * openQA doesn't create the expected event for the newly scheduled job 
 * the event isn't evaluated correctly by the dashboard's AMQP bot 

 So I guess we need to check those two points. 

 We are sure that https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Model/Jobs.pm#L99C2-L99C110 is called, meaning that the qem-dashboard knows that a job is restarted. So the evaluation of the old failed job and the new waiting job is likely not meeting our expectation. The cause could be in the controller code or in the JavaScript code where the job results are evaluated. 
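 The expected evaluation can be sketched as follows. This is a minimal illustration in Python, not the dashboard's actual code (which is Perl and JavaScript). The `Job` record, the scenario strings, the status names, and the "green" color are assumptions made for the example, as is the idea that a higher job ID means a newer job. Only the red and blue box colors come from the observation above.

```python
from dataclasses import dataclass


# Hypothetical job record; field names are illustrative, not the dashboard's schema.
@dataclass(frozen=True)
class Job:
    job_id: int
    scenario: str  # assumed key identifying "the same scenario" within a build
    status: str    # assumed values: "passed", "failed" or "waiting"


def latest_per_scenario(jobs):
    """Keep only the newest job for each scenario (assumes higher ID = newer)."""
    latest = {}
    for job in jobs:
        current = latest.get(job.scenario)
        if current is None or job.job_id > current.job_id:
            latest[job.scenario] = job
    return list(latest.values())


def box_color(jobs):
    """Derive the box color from the newest job of each scenario only."""
    statuses = {job.status for job in latest_per_scenario(jobs)}
    if "failed" in statuses:
        return "red"
    if "waiting" in statuses:
        return "blue"
    return "green"  # assumed color for "everything passed"


jobs = [
    Job(13826903, "scenario-a", "failed"),   # the failed job from the example above
    Job(13826950, "scenario-a", "waiting"),  # hypothetical restarted job
    Job(13826904, "scenario-b", "passed"),   # hypothetical unrelated job
]
print(box_color(jobs))  # blue
```

 With this evaluation, the old failed job no longer counts once a newer job exists in the same scenario, which matches what the openQA /tests/overview page shows.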


 ## Acceptance criteria 
 * **AC1:** We ensured that openQA job status events over AMQP update the display on http://dashboard.qam.suse.de/blocked accordingly 
 * **AC2:** *Given* existing openQA results for a SLE maintenance incident in https://openqa.suse.de and on https://dashboard.qam.suse.de/blocked with at least one failed openQA job *When* all failed openQA jobs for that incident are replaced by scheduled jobs that are not yet running, e.g. by restarting them or triggering new jobs in the same scenario (for testing purposes this can be done with an invalid WORKER_CLASS), *Then* https://dashboard.qam.suse.de/blocked and http://dashboard.qam.suse.de/incident/$incident MUST NOT show any failed openQA jobs 

 ## Suggestions 
 * Consider doing #157204 first to get more experience and ensure that we have better test coverage 
 * Look into log messages from the corresponding "dashboard-amqp-watcher.service" systemd service running next to the dashboard, which should capture all related actions 
 * Look at https://github.com/openSUSE/qem-dashboard/blob/main/lib/Dashboard/Command/amqp_watcher.pm 
 * Verify that the code works as intended and that we actually see events from openQA, etc. 
 * Verify that openQA job updates actually show up to users looking at the dashboard, both for actual "restarts" and for "new jobs in the same scenario"
