action #122308

closed

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

Handle invalid openQA job references in qem-dashboard size:M

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Start date:
2022-12-21
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See #97118#note-10. Looking into https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1301182 for the most recent run of "approve" we found more problems:

2022-12-21 13:34:16 INFO     Job 1967173 not found 
2022-12-21 13:34:16 INFO     Job 1967169 not found 
2022-12-21 13:34:16 INFO     Found failed, not-ignored job 57268 for incident 27251
2022-12-21 13:34:16 INFO     Inc 27251 has at least one failed job in aggregate tests
2022-12-21 13:34:16 INFO     Found failed, not-ignored job 1967179 for incident 27252

So it looks like there are "jobs" 57268 and 1967179 which are not valid openQA jobs from openqa.suse.de, yet those "jobs" block the approval. So what are they? Regardless, they should be handled accordingly. If those are openQA job references in the database, then we should likely crosscheck all openQA job ids and, whenever approval is blocked, check whether they actually exist in the live openQA database and delete (or at least ignore) them otherwise.

It looks like this kind of ID is either an incident_openqa_settings ID or an update_openqa_settings ID but not an openQA job ID. However, that makes me quite confused about my understanding of the code base. In particular, it means the comment-lookup feature I once introduced cannot actually work because it isn't using an openQA job ID (the is_job_marked_acceptable_for_incident function is basically broken if that's correct). The log message should also be improved to state what kind of ID is logged there because "job" is highly ambiguous. The code should also have a comment where JobAggr is defined explaining what the job_id is.

Acceptance criteria

  • AC1: The message "Found failed, not-ignored job …" refers to actual openQA jobs

Suggestions


Related issues 2 (1 open, 1 closed)

Related to QA (public) - action #107923: qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment size:M (Resolved, jbaier_cz)
Copied to QA (public) - action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver (Feedback, mgrifalconi, 2022-12-21)

Actions #1

Updated by okurz almost 2 years ago

  • Parent task set to #80194
Actions #2

Updated by okurz almost 2 years ago

  • Copied to action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver added
Actions #3

Updated by okurz almost 2 years ago

Apparently the "job_id" in case of 57292 is an id in the table "update_openqa_settings". So we can reference back to the job with

dashboard_db=# select job_id from openqa_jobs where update_settings=57292;
  job_id  
----------
 10217371
 10217368
 10217365
…
 10217373

So it looks like what we understand as "job_id" is something that can either be an openQA job id or just a reference to a settings table that in turn references openQA jobs, which is a weird design choice.
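
The lookup above can be reproduced in isolation. The following is a minimal sketch, assuming a simplified stand-in for the dashboard's openqa_jobs table with only the two columns used in the query; the in-memory SQLite database, the helper name and the extra row with update_settings=1 are made up for illustration and are not part of qem-dashboard.

```python
import sqlite3

# Hypothetical, simplified stand-in for the dashboard's openqa_jobs table;
# only the two columns used in the query above are modelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE openqa_jobs (job_id INTEGER, update_settings INTEGER)")
conn.executemany(
    "INSERT INTO openqa_jobs (job_id, update_settings) VALUES (?, ?)",
    [
        (10217371, 57292),
        (10217368, 57292),
        (10217365, 57292),
        (10200000, 1),  # made-up row belonging to a different settings entry
    ],
)


def openqa_job_ids_for_update_settings(conn, settings_id):
    """Resolve an update_openqa_settings id to the openQA job ids referencing it."""
    rows = conn.execute(
        "SELECT job_id FROM openqa_jobs WHERE update_settings = ? ORDER BY job_id",
        (settings_id,),
    )
    return [row[0] for row in rows]


job_ids = openqa_job_ids_for_update_settings(conn, 57292)
print(job_ids)
```

The point of the sketch is only that one settings id fans out to several openQA job ids, so a log line printing the settings id cannot be opened as a single test on openqa.suse.de.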

Actions #4

Updated by mkittler almost 2 years ago

Our starting point was the log message "Found failed …" so I've checked the bot's code base where it occurs. It looks like this kind of ID is either an incident_openqa_settings ID or an update_openqa_settings ID but not an openQA job ID. However, that makes me quite confused about my understanding of the code base. In particular, it means the comment-lookup feature I once introduced cannot actually work because it isn't using an openQA job ID (the is_job_marked_acceptable_for_incident function is basically broken if that's correct). The log message should also be improved to state what kind of ID is logged there because "job" is highly ambiguous. The code should also have a comment where JobAggr is defined explaining what the job_id is.

Actions #5

Updated by okurz almost 2 years ago

  • Related to action #107923: qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment size:M added
Actions #6

Updated by okurz almost 2 years ago

  • Subject changed from Handle non-existant openQA job references in qem-dashboard to Handle invalid openQA job references in qem-dashboard size:M
  • Description updated (diff)
  • Priority changed from High to Normal
Actions #7

Updated by okurz almost 2 years ago

  • Status changed from New to Workable
Actions #8

Updated by jbaier_cz almost 2 years ago

  • Assignee set to jbaier_cz
Actions #9

Updated by jbaier_cz almost 2 years ago

  • Status changed from Workable to In Progress

Indeed, there is confusion in the naming. Apparently, we are overusing the term "job". All lines that relate to a "job" and output its id are in fact printing the id of the dashboard entity JobAggr, which is just a helper object for the N:M mapping between maintenance incidents and openQA jobs. In short, the logged number is openqa_jobs.id, whereas what we want is openqa_jobs.job_id. The current qem-bot code does not fetch the latter at all (it is practically not needed, as the only thing that matters is the result of that job). This can also be seen by enhancing the tests: https://github.com/openSUSE/qem-bot/commit/817a92224c9ac934c40a3307b46996252b2549b5

As we actually need the openQA job id for #107923, I will proceed with modifying the current code to retain this information.
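
To illustrate the distinction, here is a minimal sketch, assuming a record that keeps both ids side by side. JobRef and describe_failure are hypothetical names, not the actual JobAggr definition or logging code in qem-bot, and the pairing of ids in the usage lines is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobRef:
    """Hypothetical record; NOT the actual JobAggr definition in qem-bot."""

    dashboard_id: int  # dashboard-internal id (openqa_jobs.id)
    openqa_job_id: Optional[int]  # id on the openQA instance (openqa_jobs.job_id), if known


def describe_failure(ref: JobRef, incident: int) -> str:
    """Build a log message that states which kind of id is being printed."""
    if ref.openqa_job_id is not None:
        return (
            f"Found failed, not-ignored openQA job {ref.openqa_job_id} "
            f"for incident {incident}"
        )
    return (
        f"Found failed, not-ignored dashboard entry {ref.dashboard_id} "
        f"(openQA job id unknown) for incident {incident}"
    )


# Illustrative values only; the pairing of these ids is an assumption.
print(describe_failure(JobRef(57268, None), 27251))
print(describe_failure(JobRef(57268, 10217371), 27251))
```

Keeping the two ids in separately named fields makes it harder to log or look up the wrong one by accident, which is the ambiguity described above.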

Actions #10

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-01-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan almost 2 years ago

Unfortunately the test coverage doesn't seem to reflect what we need in production and it's now failing:

Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 7, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 43, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 49, in do_approve
    return approve()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/approver.py", line 69, in __call__
    incidents_to_approve = [inc for inc in increqs if self._approvable(inc)]
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/approver.py", line 69, in <listcomp>
    incidents_to_approve = [inc for inc in increqs if self._approvable(inc)]
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/approver.py", line 86, in _approvable
    i_jobs = get_incident_settings(inc.inc, self.token, self.all_incidents)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 95, in get_incident_settings
    return [JobAggr(i["id"], i["job_id"], False, i["withAggregate"]) for i in settings]
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 95, in <listcomp>
    return [JobAggr(i["id"], i["job_id"], False, i["withAggregate"]) for i in settings]
KeyError: 'job_id'
Actions #12

Updated by jbaier_cz almost 2 years ago

Stopping the pipeline or temporarily reverting the PR would probably be a good idea in this case; it seems that in real data there are some entries without job_id. I suspect that in this case it is not the coverage that is wrong; our test data might be too ideal.

Actions #14

Updated by jbaier_cz almost 2 years ago

Apparently the i (the object returned from the dashboard API) does not have all attributes from the database. I will need to look at the dashboard and maybe enhance the API (or maybe I just need to call another endpoint).
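
Independent of how the API is extended, the crash itself comes from indexing a key that some entries lack. The following is a minimal sketch, assuming the settings payload is a list of dicts as in the traceback; split_by_job_id and the sample payload are hypothetical, and this is not the fix that was eventually merged.

```python
def split_by_job_id(settings):
    """Separate entries that carry a "job_id" key from those that do not."""
    with_job_id, without_job_id = [], []
    for entry in settings:
        if "job_id" in entry:
            with_job_id.append(entry)
        else:
            without_job_id.append(entry)
    return with_job_id, without_job_id


# Made-up payload: the second entry lacks "job_id", like the production
# data that raised KeyError in the traceback above.
settings = [
    {"id": 1, "job_id": 100, "withAggregate": True},
    {"id": 2, "withAggregate": False},
]

usable, incomplete = split_by_job_id(settings)
print([entry["id"] for entry in usable])
print([entry["id"] for entry in incomplete])
```

Test data that includes an entry like the second one would have caught the KeyError before it reached the pipeline.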

Actions #15

Updated by jbaier_cz almost 2 years ago

  • Status changed from In Progress to Feedback

We have a bunch of new PRs. After all of them are merged, the new version should list openQA job ids correctly. Where that is not possible, the log entry should explicitly state the "job setting" id (which refers to the incident/update setting entity in the dashboard).

Actions #16

Updated by livdywan almost 2 years ago

Please try to always mention the PRs here for clarity. That makes it easier to double-check that they're all being reviewed in a timely manner:

Actions #17

Updated by livdywan almost 2 years ago

All PRs have been merged. The pipeline from 30 minutes ago shows "2023-01-04 08:33:34 INFO Found failed, not-ignored job 10271442 for incident 26100". It's not linked, but https://openqa.suse.de/tests/10271442 seems to be a valid job.

Actions #18

Updated by okurz almost 2 years ago

  • Due date deleted (2023-01-07)
  • Status changed from Feedback to Resolved

We crosschecked again during the weekly SUSE QE Tools unblock on 2023-01-04. We also looked at the next message:

Found failed, not-ignored job 10271608 for incident 27311

Checking https://openqa.suse.de/tests/10271608 we find a valid unhandled openQA test failure. Also, when following http://dashboard.qam.suse.de/incident/27311 or looking at http://dashboard.qam.suse.de/blocked, we find exactly one failure blocking the approval, which is the very same openQA job. So all good.
