action #122308
closedcoordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release
Handle invalid openQA job references in qem-dashboard size:M
0%
Description
Motivation
See #97118#note-10. Looking into https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1301182 for the most recent run of "approve" we found more problems:
2022-12-21 13:34:16 INFO Job 1967173 not found
2022-12-21 13:34:16 INFO Job 1967169 not found
2022-12-21 13:34:16 INFO Found failed, not-ignored job 57268 for incident 27251
2022-12-21 13:34:16 INFO Inc 27251 has at least one failed job in aggregate tests
2022-12-21 13:34:16 INFO Found failed, not-ignored job 1967179 for incident 27252
So it looks like there are "jobs" 57268 and 1967179 which are not valid openQA jobs on openqa.suse.de, yet those "jobs" block the approval. So what are they? Regardless, they should be handled accordingly. If those are openQA job references in the database, then we should likely cross-check all openQA job IDs: whenever a job would block approval, check whether it actually exists in the live openQA database and delete (or at least ignore) it otherwise.

It looks like this kind of ID is either an incident_openqa_settings ID or an update_openqa_settings ID but not an openQA job ID. However, that leaves me quite confused about my understanding of the code base. In particular, it means the comment-lookup feature I once introduced cannot actually work because it isn't using an openQA job ID (the is_job_marked_acceptable_for_incident function is basically broken if that's correct).

The log message should also be improved to state what kind of ID is logged, because "job" is highly ambiguous. The code should also have a comment where JobAggr is defined explaining what the job_id is.
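The cross-check described above could look roughly like the following minimal sketch. It is not qem-bot code: `existing_jobs`, the `job_exists` callback and the job IDs are all hypothetical, and the set lookup only stands in for a real query against the live openQA instance.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("approver_sketch")


def existing_jobs(job_ids: Iterable[int], job_exists: Callable[[int], bool]) -> List[int]:
    """Return only the IDs that job_exists() confirms; log and skip the rest."""
    kept = []
    for job_id in job_ids:
        if job_exists(job_id):
            kept.append(job_id)
        else:
            # State explicitly which kind of ID is meant to avoid ambiguity
            log.info("openQA job %s not found, ignoring it for approval", job_id)
    return kept


# Made-up IDs; the set is a stand-in for a lookup on the live openQA instance
known_jobs = {1001}
blocking = existing_jobs([1001, 1002], lambda job_id: job_id in known_jobs)
```

Passing the existence check in as a callback keeps the filtering logic testable without network access.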
Acceptance criteria
- AC1: The message "Found failed, not-ignored job …" refers to actual openQA jobs
Suggestions
- See how the message is currently written in https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/openqabot/approver.py#L148
- Also see https://github.com/openSUSE/qem-dashboard/blob/main/migrations/dashboard.sql#L53
- Write a (failing) unit test that refers to actual openQA jobs from both incident and aggregate tests (both because the problem might be that the code already works for incident tests but for aggregate tests we might refer to the wrong ID so far)
- We assume that in https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L434 we could check for the log message for a failed job, but the ID is likely not 20005 from https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L427 but another number referring to a "real openQA job". In the case of https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L160 the message might already be correct if we added a test asserting that "job ID 100" is expected to show up, so the suggestion is:
- Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L181 to check for a message like
Found failed, not-ignored job 100
- Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L434 to check for a message like
Found failed, not-ignored job ?
with ? to be filled by a proper number but likely not 20005
- Change the code so that the necessary information is available where the message is logged, e.g. in the JobAggr class
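One way to carry the necessary information is sketched below, assuming a record that stores the dashboard settings ID and the openQA job ID separately. `JobRef`, its fields and `describe_failure` are hypothetical names for illustration, not the actual JobAggr definition in qem-bot, and the IDs in the example are made up.

```python
from typing import NamedTuple


class JobRef(NamedTuple):
    """Hypothetical record keeping both kinds of ID apart."""

    settings_id: int  # dashboard ID (incident_openqa_settings or update_openqa_settings)
    openqa_job_id: int  # ID of the job on the openQA instance
    aggregate: bool  # True for aggregate tests, False for incident tests


def describe_failure(ref: JobRef, incident: int) -> str:
    """Build the log message from the openQA job ID, never the settings ID."""
    kind = "aggregate" if ref.aggregate else "incident"
    return (
        f"Found failed, not-ignored job {ref.openqa_job_id} "
        f"for incident {incident} ({kind} settings {ref.settings_id})"
    )


# Made-up values for illustration only
message = describe_failure(JobRef(settings_id=20005, openqa_job_id=123456, aggregate=True), 42)
```

With such a record, a unit test can assert that the logged number is the openQA job ID rather than the settings ID.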