action #122308

Updated by okurz almost 2 years ago

## Motivation 
 See #97118#note-10. Looking into https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1301182 for the most recent run of "approve" we found more problems: 

 ``` 
 2022-12-21 13:34:16 INFO       Job 1967173 not found  
 2022-12-21 13:34:16 INFO       Job 1967169 not found  
 2022-12-21 13:34:16 INFO       Found failed, not-ignored job 57268 for incident 27251 
 2022-12-21 13:34:16 INFO       Inc 27251 has at least one failed job in aggregate tests 
 2022-12-21 13:34:16 INFO       Found failed, not-ignored job 1967179 for incident 27252 
 ``` 

 So it looks like there are "jobs" 57268 and 1967179 which are not valid openQA jobs on openqa.suse.de, yet those "jobs" block the approval. So what are they? Regardless of the answer, they should be handled properly. If they are openQA job references stored in the database, then we should likely cross-check all openQA job IDs: whenever a job would block approval, check whether it actually exists in the live openQA database and delete (or at least ignore) it otherwise.

 However, it looks like this kind of ID is either an incident_openqa_settings ID or an update_openqa_settings ID, not an openQA job ID, which makes me doubt my understanding of the code base. In particular, it would mean that the comment-lookup feature I once introduced cannot actually work because it is not using an openQA job ID (if that is correct, the is_job_marked_acceptable_for_incident function is basically broken). The log message should also be improved to state what kind of ID is logged, because "job" is highly ambiguous. The code should also have a comment at the definition of JobAggr explaining what the job_id is.
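 A minimal sketch of the cross-check idea, under stated assumptions: `blocking_jobs` and the `job_exists` callable are hypothetical names, not part of qem-bot. In the real bot `job_exists` would have to query the live openQA instance; here it is injected so the example stays self-contained.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("bot.approver")


def blocking_jobs(
    failed_job_ids: Iterable[int],
    job_exists: Callable[[int], bool],
    incident: int,
) -> List[int]:
    """Return only the failed job IDs that exist on the live openQA instance.

    `job_exists` is a hypothetical lookup supplied by the caller.
    """
    blocking = []
    for job_id in failed_job_ids:
        if not job_exists(job_id):
            # Unknown reference: report it, but do not let it block approval.
            log.info("Ignoring unknown openQA job ID %s for incident %s", job_id, incident)
            continue
        log.info("Found failed, not-ignored openQA job %s for incident %s", job_id, incident)
        blocking.append(job_id)
    return blocking
```

 With a set of known job IDs, `blocking_jobs([20005, 100], {100}.__contains__, 1)` would keep only `100`. The numbers are purely illustrative.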

 ## Acceptance criteria 
 * **AC1:** The message "Found failed, not-ignored job …" refers to actual openQA jobs 

 ## Suggestions 
 * See how the message is currently written in https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/openqabot/approver.py#L148 
 * Also see https://github.com/openSUSE/qem-dashboard/blob/main/migrations/dashboard.sql#L53 
 * Write a (failing) unit test that refers to actual openQA jobs from both incident and aggregate tests (both, because the code might already work for incident tests while for aggregate tests we might refer to the wrong ID so far) 
 * We assume that in https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L434 we could check for the log message for a failed job, but the number is likely *not* 20005 from https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L427 but another number referring to a "real openQA job". In the case of https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L160 the message might already be correct, which we could confirm by adding a test asserting that "job ID 100" is expected to show up. So the suggestion is: 
     * Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L181 to check for a message like `Found failed, not-ignored job 100` 
     * Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L434 to check for a message like `Found failed, not-ignored job ?` with ? to be filled by a proper number but likely not 20005 
 * Change the code so that the necessary information is available, for example in the JobAggr class or elsewhere
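
 The last suggestion could look roughly like the sketch below. It is only an illustration: the field layout of `JobAggr`, the helper `failed_job_message`, and the mapping of aggregate tests to `update_openqa_settings` are assumptions, and the real class in qem-bot may look different. The point is that the tuple carries both the dashboard settings ID and the openQA job ID, so the log message can name each explicitly.

```python
import logging
from typing import NamedTuple

log = logging.getLogger("bot.approver")


class JobAggr(NamedTuple):
    """Hypothetical layout; the real JobAggr in qem-bot may look different."""

    settings_id: int    # dashboard ID (incident_openqa_settings or update_openqa_settings)
    openqa_job_id: int  # ID of the job on the openQA instance itself
    aggregate: bool     # True for aggregate tests, False for incident tests


def failed_job_message(job: JobAggr, incident: int) -> str:
    """Build and log a message that names the openQA job ID and the settings ID."""
    # Assumption: aggregate tests are stored under update_openqa_settings.
    kind = "update" if job.aggregate else "incident"
    message = (
        f"Found failed, not-ignored job {job.openqa_job_id} for incident {incident} "
        f"({kind}_openqa_settings ID {job.settings_id})"
    )
    log.info(message)
    return message
```

 A test along the lines suggested above could then assert that the message contains `Found failed, not-ignored job 100` when the job is built with a settings ID of 20005 and an openQA job ID of 100 (illustrative values taken from the test file references above, not a confirmed mapping).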
