action #164733
coordination #99303 (closed): [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release
coordination #155671: [epic] Better handling of SLE maintenance test review
qem-dashboard (and hence qem-bot) see a job as failed even though it's marked as softfailed since > 30 days in openQA size:M
Description
Observation
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2892821#L54 shows
2024-07-31 07:03:25 INFO Found failed, not-ignored job https://openqa.suse.de/t14779563 for incident 34532
even though in https://openqa.suse.de/tests/14779563 it's visible that the job was "force_result'd" as part of https://openqa.suse.de/tests/14779563#comment-1538601 already on 2024-07-02. The most recent "sync incidents" job https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2892922 does not mention 14779563.
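To collect such cases from a bot job log, a minimal sketch along the following lines could help. It assumes the log format matches the lines quoted in this ticket; the function name `blocking_jobs` and the regular expression are illustrative and not part of qem-bot.

```python
import re

# Matches the two qem-bot log messages quoted in this ticket (assumed format).
LOG_RE = re.compile(
    r"INFO (?P<kind>Found failed, not-ignored job|Ignoring failed job) "
    r"https://openqa\.suse\.de/t(?P<job>\d+) for incident (?P<incident>\d+)"
)


def blocking_jobs(log_text):
    """Return {incident: {job ids}} for jobs logged as failed and not ignored."""
    blocking = {}
    for line in log_text.splitlines():
        match = LOG_RE.search(line)
        # Skip unrelated lines and jobs that were ignored due to a comment.
        if match and match.group("kind").startswith("Found failed"):
            incident = int(match.group("incident"))
            blocking.setdefault(incident, set()).add(int(match.group("job")))
    return blocking


sample = """\
2024-07-31 07:03:25 INFO Found failed, not-ignored job https://openqa.suse.de/t14779563 for incident 34532
2024-11-25 15:05:51 INFO Ignoring failed job https://openqa.suse.de/t15996419 for incident 36467 due to openQA comment
"""
print(blocking_jobs(sample))
```

The resulting job IDs can then be compared by hand against the results shown in openQA.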
Expected result
- E1: http://dashboard.qam.suse.de/blocked?incident=34532&group_names=SP5 or the equivalent URL from the database should show no failed job
- E2: http://dashboard.qam.suse.de/incident/34532 should show no failed job
- E3: The latest "approve incidents" pipeline from https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules should not mention https://openqa.suse.de/tests/14779563 as failed
Suggestions
- Ask test reviewers about examples
- Look for jobs that are softfailed via a force_result label and at the same time still failed on the dashboard.
- Check how "sync incidents" handles already finished results. Maybe finished results are only revisited when an AMQP event for a new comment is received, and that event could have been missed so the existing result is never revisited?
- To reproduce, a qem-dashboard database dump is available on qam2.suse.de within the "postgresql" machine in /root/dashboard_db-2024-07-31T09:43:21+02:00.sql.xz
- Set up dashboard/bot/openQA locally and simulate how the dashboard/bot behave if an already finished job changes its result
- Look at the qem-bot code to check for any obvious problems with handling softfailed jobs
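The second suggestion (jobs softfailed in openQA but still failed on the dashboard) could be checked with a small comparison like the sketch below. It assumes both sides have already been exported into plain mappings from job ID to result string; how they are fetched is left out. The function name `stale_failures`, the dashboard status value "failed" and the sample data are assumptions for illustration, not taken from the qem-bot or qem-dashboard code.

```python
def stale_failures(openqa_results, dashboard_status):
    """Return job ids that the dashboard shows as failed although openQA
    reports them as passed or softfailed."""
    ok_in_openqa = {"passed", "softfailed"}
    return sorted(
        job_id
        for job_id, status in dashboard_status.items()
        if status == "failed" and openqa_results.get(job_id) in ok_in_openqa
    )


# Illustrative data only, modelled on the job ids mentioned in this ticket.
openqa_results = {14779563: "softfailed", 15939391: "softfailed", 15996419: "failed"}
dashboard_status = {14779563: "failed", 15939391: "passed", 15996419: "failed"}
print(stale_failures(openqa_results, dashboard_status))
```

Any job ID returned here would be a candidate for a result that was never revisited after a force_result comment.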
Updated by okurz 5 months ago
- Related to action #157204: Sync openQA job removal events to qem-dashboard listening to AMQP events size:M added
Updated by livdywan about 2 months ago
- Subject changed from qem-dashboard (and hence qem-bot) see a job as failed even though it's marked as softfailed since > 30 days in openQA to qem-dashboard (and hence qem-bot) see a job as failed even though it's marked as softfailed since > 30 days in openQA size:M
- Description updated
- Status changed from New to Workable
Updated by mkittler about 2 months ago
- Target version changed from Tools - Next to Ready
Updated by jbaier_cz about 2 months ago
Let's create a data point here: at this moment (from the latest bot approval pipeline https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/3322469) updates are blocked by 12 failing openQA tests. None of them is soft-failed or ignored with a comment, so everything behaves as expected so far.
Updated by jbaier_cz about 2 months ago
Another data point: right now we have a soft-failed test (via automatic force_result), https://openqa.suse.de/tests/15939391. The dashboard correctly shows no red results, and there is no trace of a halted approval due to a test failure in the bot pipeline either.
Updated by jbaier_cz about 1 month ago
- Status changed from Workable to Resolved
Again, no soft-failed job is blocking a release. I even managed to verify once more that the "acceptable_for" feature is working as intended:
2024-11-25 15:05:51 INFO Ignoring failed job https://openqa.suse.de/t15996419 for incident 36467 due to openQA comment
...
2024-11-25 15:05:52 INFO Incidents to approve:
2024-11-25 15:05:52 INFO * SUSE:Maintenance:36467:353830
2024-11-25 15:05:52 INFO Accepting review for SUSE:Maintenance:36467:353830
Considering the age of this ticket, I believe we have already improved the workflow in the meantime, so everything is working as expected. Hence I am marking this as resolved unless someone points out new examples with the current code base.