action #122308
closedcoordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release
Handle invalid openQA job references in qem-dashboard size:M
0%
Description
Motivation¶
See #97118#note-10. Looking into https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1301182 for the most recent run of "approve" we found more problems:
2022-12-21 13:34:16 INFO Job 1967173 not found
2022-12-21 13:34:16 INFO Job 1967169 not found
2022-12-21 13:34:16 INFO Found failed, not-ignored job 57268 for incident 27251
2022-12-21 13:34:16 INFO Inc 27251 has at least one failed job in aggregate tests
2022-12-21 13:34:16 INFO Found failed, not-ignored job 1967179 for incident 27252
so it looks like there are "jobs" 57268 and 1967179 which are not valid openQA jobs from openqa.suse.de. But those "jobs" block the approval. So what are those? Regardless they should be handled accordingly. If those are openQA job references in the database then we should likely crosscheck all openQA job ids and whenever blocking approval check if they actually exist in the live openQA database and delete (or at least ignore) otherwise. It looks like this kind of ID is either an incident_openqa_settings ID or an update_openqa_settings ID but not an openQA job ID. However, that makes me quite confused about my understanding of the code base. In particular, it means the comment-lookup feature I've once introduced cannot actually work because it isn't using an openQA job ID (the is_job_marked_acceptable_for_incident function is basically broken if that's correct). The log message should also be improved to state what kind of ID is logged there because "job" is highly ambiguous. The code should also have a comment where JobAggr is defined what the job_id is.
Acceptance criteria¶
- AC1: The message "Found failed, not-ignored job …" refers to actual openQA jobs
Suggestions¶
- See how the message is currently written in https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/openqabot/approver.py#L148
- Also see https://github.com/openSUSE/qem-dashboard/blob/main/migrations/dashboard.sql#L53
- Write a (failing) unit test that refers to actual openQA jobs from both incident and aggregate tests (both because the problem might be that the code already works for incident tests but for aggregate tests we might refer to the wrong so far)
- We assume that in https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L434 we could check for the log message for a failed job but it is likely not 20005 from https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L427 but another number refererring to a "real openQA job". In case of https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L160 the message might be already correct if we would add a test asserting that "job ID 100" is expected to show up so the suggestion is:
- Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L181 to check for a message like
Found failed, not-ignored job 100
- Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L434 to check for a message like
Found failed, not-ignored job ?
with ? to be filled by a proper number but likely not 20005
- Extend https://github.com/openSUSE/qem-bot/blob/2aac660ef36c9584ce56ab4e08c4705371d4dc02/tests/test_approve.py#L181 to check for a message like
- Somehow change the code so that we have the necessary information available in the JobAggr class or something
Updated by okurz almost 2 years ago
- Copied to action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver added
Updated by okurz almost 2 years ago
Apparently the "job_id" in case of 57292 is an id in the table "update_openqa_settings". So we can reference back to the job with
dashboard_db=# select job_id from openqa_jobs where update_settings=57292;
job_id
----------
10217371
10217368
10217365
…
10217373
so it looks like what we understand as "job_id" something that can either be an openQA job id or just a reference to a settings table that again references openQA jobs, weird design choice.
Updated by mkittler almost 2 years ago
Our starting point was the log message "Found failed …" so I've checked the bot's code base where it occurs. It looks like this kind of ID is either an incident_openqa_settings ID or an update_openqa_settings ID but not an openQA job ID. However, that makes me quite confused about my understanding of the code base. In particular, it means the comment-lookup feature I've once introduced cannot actually work because it isn't using an openQA job ID (the is_job_marked_acceptable_for_incident
function is basically broken if that's correct). The log message should also be improved to state what kind of ID is logged there because "job" is highly ambiguous. The code should also have a comment where JobAggr
is defined what the job_id
is.
Updated by okurz almost 2 years ago
- Related to action #107923: qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment size:M added
Updated by okurz almost 2 years ago
- Subject changed from Handle non-existant openQA job references in qem-dashboard to Handle invalid openQA job references in qem-dashboard size:M
- Description updated (diff)
- Priority changed from High to Normal
Updated by jbaier_cz almost 2 years ago
- Status changed from Workable to In Progress
Indeed there is a confusion in the naming. Apparently, we are overusing the term job. Actually all lines related to a "job" and outputting its id are in fact printing id for dashboard entity JobAggr, which is just a helper object for N:M mapping between maintenance incident and openQA job. In short, the logged number is openqa_jobs.id
, what we want is openqa_jobs.job_id
. In the current qem-bot code, this is not fetched at all (as it is practically not needed, the only important is the result of that job). This can be also seen by enhancing the tests: https://github.com/openSUSE/qem-bot/commit/817a92224c9ac934c40a3307b46996252b2549b5
As we actually need the openQA job id for #107923, I will proceed with modifying the current code to retain this information.
Updated by openqa_review almost 2 years ago
- Due date set to 2023-01-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 2 years ago
Unfortunately the test coverage doesn't seem to reflect what we need in production and it's now failing:
Traceback (most recent call last):
File "./qem-bot/bot-ng.py", line 7, in <module>
main()
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 43, in main
sys.exit(cfg.func(cfg))
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 49, in do_approve
return approve()
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/approver.py", line 69, in __call__
incidents_to_approve = [inc for inc in increqs if self._approvable(inc)]
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/approver.py", line 69, in <listcomp>
incidents_to_approve = [inc for inc in increqs if self._approvable(inc)]
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/approver.py", line 86, in _approvable
i_jobs = get_incident_settings(inc.inc, self.token, self.all_incidents)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 95, in get_incident_settings
return [JobAggr(i["id"], i["job_id"], False, i["withAggregate"]) for i in settings]
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 95, in <listcomp>
return [JobAggr(i["id"], i["job_id"], False, i["withAggregate"]) for i in settings]
KeyError: 'job_id'
Updated by jbaier_cz almost 2 years ago
Stopping the pipeline or temporary reverting the PR would be probably a good idea in this case; it seems that in real data, there are some entries without job_id
. I suspect that in this case, it is not the coverage what is wrong, we might have too ideal test data.
Updated by okurz almost 2 years ago
Proposing a revert https://github.com/openSUSE/qem-bot/pull/108 for now.
EDIT: Merged https://github.com/openSUSE/qem-bot/pull/108
Updated by jbaier_cz almost 2 years ago
Apparently the i
(object returned from dashboard API) has not all attributes from the database, I will need to look on the dashboard and maybe enhance the API (or maybe I just need to call another endpoint).
Updated by jbaier_cz almost 2 years ago
- Status changed from In Progress to Feedback
We have a bunch of new PR, after all of them are merged, the new version should list openQA job ids correctly. Where not possible, the log entry should explicitly tell the "job setting" id (which refers to incident/update setting entity in the dashboard).
Updated by livdywan almost 2 years ago
Please try and always mention the PR's here for clarity. That makes it easier to double-check that they're all being reviewed timely:
Updated by livdywan almost 2 years ago
All PR's have been merged. The pipeline from 30 minutes ago shows 2023-01-04 08:33:34 INFO Found failed, not-ignored job 10271442 for incident 26100
. It's not linked but https://openqa.suse.de/tests/10271442 seems to be a valid job.
Updated by okurz almost 2 years ago
- Due date deleted (
2023-01-07) - Status changed from Feedback to Resolved
We crosschecked again during the weekly SUSE QE Tools unblock 2023-01-04. We also looked at the next message:
Found failed, not-ignored job 10271608 for incident 27311
checking https://openqa.suse.de/tests/10271608 we find a valid unhandled openQA test failure. Also when following http://dashboard.qam.suse.de/incident/27311 or looking on http://dashboard.qam.suse.de/blocked we find exactly one failure blocking the approval which is the very same openQA job. So all good.