action #107923

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment size:M

Added by okurz about 1 year ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date:
Due date: 2023-03-18
% Done: 0%
Estimated time:
Description

Motivation

See the proposal in the parent epic #95479, e.g. about the specific format of an openQA label that is readable by qem-bot

Acceptance criteria

  • AC1: A not-ok openQA job with a comment following the format described in https://progress.opensuse.org/issues/95479#Suggestions does not block approval of the incident update named in the comment
  • AC2: A not-ok openQA job with such a comment still blocks approval of all other incident updates not named in the comment
  • AC3: A not-ok openQA job without such a comment still blocks all related incidents

Suggestions

  • DONE: Add a testing framework to github.com/openSUSE/qem-bot/, e.g. based on github.com/os-autoinst/openqa_review -> #109641
  • DONE: Add a simple automatic test exercising one of the existing happy path workflows of qem-bot -> #110167
  • DONE: Add automatic tests for the above acceptance criteria
  • DONE: As a first quick-and-dirty approach, read out openQA comments directly within the approval step of qem-bot (only for the failed jobs, which should not take too long)
  • DONE: Parse the mentioned special label string and for the parsed incident remove the according not-ok openQA job from the list of blocking results
  • Optional: Add openQA comment parsing over the openQA API together with consistent data in qem-dashboard, i.e.
    • As qem-dashboard is the "database for qem-bot" read out the according data from openQA that is pushed to qem-dashboard
    • make qem-dashboard store the related data
    • and then qem-bot should read it out from there
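
A minimal sketch of the "parse the label and remove the job from the blocking results" suggestion above. It assumes the label looks like the example comment quoted later in this ticket (`@review:acceptable_for:incident_<id>:<reason>`); the regular expression, the function names, the data shapes, and the job IDs 101 and 102 are illustrative assumptions, not qem-bot's actual code.

```python
import re

# Assumed label format, modelled on the example comment quoted in this ticket:
# "@review:acceptable_for:incident_26757:kgraft-patch-SLE12-SP4_Update_23:cleared_with_Marcus"
LABEL_RE = re.compile(r"@review:acceptable_for:incident_(\d+):(\S+)")


def acceptable_incidents(comments):
    """Return the set of incident numbers that the comment texts mark as acceptable."""
    incidents = set()
    for text in comments:
        for match in LABEL_RE.finditer(text):
            incidents.add(int(match.group(1)))
    return incidents


def blocking_jobs(incident, failed_jobs, comments_by_job):
    """Return the failed job IDs that still block approval of `incident`.

    `failed_jobs` is a list of not-ok job IDs and `comments_by_job` maps a job
    ID to its comment texts (both are hypothetical data shapes).
    """
    return [
        job_id
        for job_id in failed_jobs
        if incident not in acceptable_incidents(comments_by_job.get(job_id, []))
    ]


# Hypothetical job IDs; only the comment text is taken from this ticket.
comments_by_job = {
    101: ["@review:acceptable_for:incident_26757:kgraft-patch-SLE12-SP4_Update_23:cleared_with_Marcus"],
    102: ["some unrelated comment"],
}
print(blocking_jobs(26757, [101], comments_by_job))  # AC1: [] (job is ignored for 26757)
print(blocking_jobs(99999, [101], comments_by_job))  # AC2: [101] (other incident still blocked)
print(blocking_jobs(26757, [102], comments_by_job))  # AC3: [102] (no label, still blocked)
```

Keeping the parsing separate from the filtering means the three acceptance criteria can each be checked with a single call.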

Out of scope

Visualize such specially handled failed openQA jobs in dashboard


Related issues

Related to openQA Project - openqa-force-result #109857: Secure auto-review+force_result size:M auto_review:"Failed to download gobbledeegoop":force_result:softfailed (Resolved)
Related to QA - action #111078: Simple automatic test exercising one of the existing happy path workflows of qem-bot size:M (Resolved, 2022-05-13)
Related to openQA Project - action #119467: "Internal server error" on opening any job group front page at OSD (Resolved)
Related to QA - action #114415: [timeboxed:10h][spike solution] qem-bot comments on IBS size:S (Resolved, 2022-07-20)
Related to openQA Infrastructure - action #120939: [alert] Pipeline for scheduling incidents runs into timeout size:M (Resolved, 2022-11-24, 2022-12-13)
Related to QA - action #119161: Approval step of qem-bot says incident has failed job in incidents but it looks empty on the dashboard size:M (Resolved, 2022-10-21)
Related to QA - action #122308: Handle invalid openQA job references in qem-dashboard size:M (Resolved, 2022-12-21)

History

#1 Updated by okurz about 1 year ago

  • Related to openqa-force-result #109857: Secure auto-review+force_result size:M auto_review:"Failed to download gobbledeegoop":force_result:softfailed added

#2 Updated by okurz about 1 year ago

  • Target version changed from future to Ready

#3 Updated by okurz about 1 year ago

  • Description updated (diff)

#4 Updated by okurz about 1 year ago

  • Related to action #111078: Simple automatic test exercising one of the existing happy path workflows of qem-bot size:M added

#5 Updated by okurz about 1 year ago

  • Status changed from New to Blocked
  • Assignee set to okurz

#6 Updated by okurz 12 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

#7 Updated by okurz 12 months ago

  • Priority changed from Low to Normal

#8 Updated by okurz 11 months ago

  • Subject changed from qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment to qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment size:M
  • Description updated (diff)
  • Status changed from New to Workable

#9 Updated by mkittler 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

#10 Updated by openqa_review 8 months ago

  • Due date set to 2022-10-28

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by mkittler 8 months ago

PR to fix pipeline failures: https://github.com/openSUSE/qem-bot/pull/75

#13 Updated by jbaier_cz 8 months ago

It seems that the pipeline may be working, although we should still get rid of the stack traces. To actually see whether we already use the patched version, I propose a small update for the pipeline: https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/59

#15 Updated by jbaier_cz 7 months ago

The code still does a lot of retries on 404 from non-existent comments resulting in a very long run duration and job termination after 1h timeout.

#16 Updated by mkittler 7 months ago

Maybe it is simplest to just disable the retry for now: https://github.com/openSUSE/qem-bot/pull/79
(I am not sure how to make it retry only on connection errors and 500 responses, which would likely be the most reasonable choice and which I assumed would be the default retry behavior.)
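
The retry policy described here (retry on connection errors and 5xx responses, but not on a 404) could look roughly like the following standard-library sketch. `HttpError`, `is_retryable` and `request_with_retry` are hypothetical names for illustration and are not part of openqa_client or requests.

```python
import time


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(error):
    """Retry on connection problems and 5xx responses, never on 4xx such as 404."""
    if isinstance(error, ConnectionError):
        return True
    return isinstance(error, HttpError) and 500 <= error.status_code < 600


def request_with_retry(do_request, retries=3, wait=0.0):
    """Call `do_request()` and retry only when the failure is retryable."""
    attempt = 0
    while True:
        try:
            return do_request()
        except (ConnectionError, HttpError) as error:
            if attempt >= retries or not is_retryable(error):
                raise
            attempt += 1
            time.sleep(wait)


calls = []


def missing_comments():
    """Stand-in for a request to a job that has no comments endpoint result."""
    calls.append("GET")
    raise HttpError(404)


try:
    request_with_retry(missing_comments)
except HttpError:
    print(len(calls))  # the 404 request was attempted once and not retried
```

With a policy like this a non-existent resource fails fast, while transient server or network problems still get a few more attempts.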

#17 Updated by mkittler 7 months ago

  • Status changed from In Progress to Feedback

All PRs have been merged. The pipeline works again and doesn't take that long to execute (e.g. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1194317).

So putting a comment like https://progress.opensuse.org/issues/95479#Suggestions in an openQA job should now work.

#18 Updated by okurz 7 months ago

Nice. Also thank you for the announcement in #eng-testing. I suggest waiting for real-life usage. You could query the openQA comments for matching comments to see whether people have used it.

Before resolving, please make sure to put any not-done "optional" tasks into corresponding new ticket(s).

#19 Updated by tinita 7 months ago

mkittler wrote:

Maybe it is simplest to just disable the retry for now: https://github.com/openSUSE/qem-bot/pull/79
(I am not sure how to make it retry only on connection errors and 500 responses, which would likely be the most reasonable choice and which I assumed would be the default retry behavior.)

This should probably be done in the openqa_client.
There's even a related issue: https://github.com/os-autoinst/openQA-python-client/issues/16
I had a quick look at the code, but currently it is using requests, and if I understand it correctly, it would have to be rewritten with urllib3 to use the retry feature.

#20 Updated by mkittler 7 months ago

  • Description updated (diff)

#21 Updated by mkittler 7 months ago

There are no such comments yet:

openqa=# select job_id, text from comments where text like '%review%acceptable%';
 job_id | text 
--------+------
(0 rows)

It is likely also possible to tweak the retry while continuing to use requests. (I cannot imagine that requests makes this impossible. Likely they only prefer urllib3 because it already has a built-in feature for exactly what we want.)

#22 Updated by kraih 7 months ago

Still lots of error messages in the pipeline output https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1196209:

ERROR: ('GET', 'https://openqa.suse.de/api/v1/jobs/1944211/comments', 404)
Traceback (most recent call last):
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqa.py", line 73, in get_job_comments
    "GET", "jobs/%s/comments" % job_id, retries=0
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 184, in openqa_request
    return self.do_request(req, retries=retries, wait=wait)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 164, in do_request
    raise err
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 144, in do_request
    request.method, resp.url, resp.status_code
openqa_client.exceptions.RequestError: ('GET', 'https://openqa.suse.de/api/v1/jobs/1944211/comments', 404)

#25 Updated by okurz 7 months ago

merged

#26 Updated by mkittler 7 months ago

okurz wrote:

  • Optional: Add openQA comment parsing over the openQA API together with consistent data in qem-dashboard, i.e.
    • As qem-dashboard is the "database for qem-bot" read out the according data from openQA that is pushed to qem-dashboard
    • make qem-dashboard store the related data
    • and then qem-bot should read it out from there

That's likely not very useful. We'd need to query all comments when syncing the dashboard, so we'd likely end up with more requests on the openQA side. Besides, judging by the logs of recent approval pipeline runs, I suppose openQA will be able to handle the traffic (it doesn't look like too much, and we only query comments for failed jobs).

One could do a further optimization: As soon as we see a failing job blocking an incident, we'd avoid checking any further failing job for that incident (e.g. an early break in the loop introduced here: https://github.com/openSUSE/qem-bot/pull/73/files).
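
A rough sketch of that early-break idea, assuming a hypothetical `is_ignored(job_id)` callable that stands in for the per-job comment lookup (this is not the code from the linked PR):

```python
def incident_is_blocked(failed_job_ids, is_ignored):
    """Return True as soon as one failed job is not ignored for the incident.

    `is_ignored(job_id)` stands in for the (expensive) comment lookup; `any()`
    stops iterating at the first blocking job, so later jobs are never queried.
    """
    return any(not is_ignored(job_id) for job_id in failed_job_ids)


queried = []


def is_ignored(job_id):
    queried.append(job_id)  # record which jobs triggered a lookup
    return job_id != 2      # hypothetical: only job 2 lacks an ignore comment


print(incident_is_blocked([1, 2, 3, 4], is_ignored), queried)  # True [1, 2]
```

Jobs 3 and 4 are never looked up because job 2 already blocks the incident.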

#27 Updated by mkittler 7 months ago

PR for the optimization mentioned in the previous comment: https://github.com/openSUSE/qem-bot/pull/82

Note that the feature isn't used yet (select job_id, user_id, text from comments where text like '%review%acceptable%' and user_id != 126; returns no results).

#28 Updated by mkittler 7 months ago

PR to make this at least a little bit more discoverable: https://github.com/os-autoinst/openQA/pull/4867

#29 Updated by okurz 7 months ago

  • Related to action #119467: "Internal server error" on opening any job group front page at OSD added

#30 Updated by okurz 7 months ago

As we learned from #119467 we should have

  • No obvious related regression on OSD
  • Test coverage for the branding button code exists for both the job comments as well as job group comments

so I suggest introducing such a test in the next attempt.

#31 Updated by okurz 7 months ago

  • Due date changed from 2022-10-28 to 2022-11-11

We hit the mentioned regression and want to improve the tests when reintroducing the change. Giving more time for this, which should be an exception.

#32 Updated by mkittler 7 months ago

Fixed version of the original PR with test coverage for the SUSE branding template code: https://github.com/os-autoinst/openQA/pull/4881

#33 Updated by mkittler 7 months ago

Looks like the feature is still not used. The PR for adding the button back has been merged.

#34 Updated by okurz 7 months ago

  • Due date deleted (2022-11-11)
  • Status changed from Feedback to Resolved

I triggered an extraordinary deployment on OSD and verified that the comment template showed up on https://openqa.suse.de/tests/latest#comments . Documentation and informing people can be done in related tickets, e.g. #111066

#35 Updated by tinita 6 months ago

  • Status changed from Resolved to Feedback

We should be able to use retry now because 404 is now excluded: https://github.com/os-autoinst/openQA-python-client/pull/34

#36 Updated by mkittler 6 months ago

Ok, although I need to check first whether we already have v4.2.1 of that Python module in the CI environment.

#37 Updated by okurz 6 months ago

In https://suse.slack.com/archives/C02CBB35W5B/p1668779284852569?thread_ts=1668611066.502389&cid=C02CBB35W5B Veronika Svecova tried to use the special marking comment but it seems to have no effect. Do you know why incident 26757 can not be auto-approved? The only linked failed incident test according to http://dashboard.qam.suse.de/incident/26757 is https://openqa.suse.de/tests/9987265 with a comment "@review:acceptable_for:incident_26757:kgraft-patch-SLE12-SP4_Update_23:cleared_with_Marcus" but https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1246560#L171 states "INFO: Inc 26757 has failed job in incidents"

openqa-cli api --osd job_settings/jobs key=*_TEST_ISSUES list_value=26757 returns

{"jobs":[9987265,9977121]}

which seems quite broken as there are definitely more jobs than two. At least it includes the one failed job that was marked to be ignorable. The other job is ok though.

After adding some logging, qem-bot states that https://openqa.suse.de/tests/1953773 is blocking approval. This job does not even exist in OSD (anymore?) and judging by the number it would be years old. I assume that in this case even force-result won't help. Sounds like #119161

#38 Updated by kraih 6 months ago

okurz wrote:

openqa-cli api --osd job_settings/jobs key=*_TEST_ISSUES list_value=26757 returns

{"jobs":[9987265,9977121]}

which seems quite broken as there are definitely more jobs than two. At least it includes the one failed job that was marked to be ignorable. The other job is ok though.

That would be the job_id limit. For performance reasons we only list jobs whose ID is max(job_id) - 20000 or higher. (A trigram index would fix that and show 257 results, see #117655.)
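
To illustrate the effect of that limit, here is a small sketch. The job IDs are taken from this ticket, but the maximum job ID on the instance is not stated anywhere here, so 9990000 is purely an assumed value; `jobs_in_window` is a hypothetical helper, not openQA code.

```python
JOB_ID_WINDOW = 20000  # window size stated in the comment above


def jobs_in_window(job_ids, max_job_id, window=JOB_ID_WINDOW):
    """Keep only job IDs that are `max_job_id - window` or higher."""
    cutoff = max_job_id - window
    return [job_id for job_id in job_ids if job_id >= cutoff]


# Some job IDs for incident 26757 from this ticket; max_job_id is an ASSUMED value.
incident_jobs = [9987265, 9977121, 9962987, 9962985, 9962984]
print(jobs_in_window(incident_jobs, max_job_id=9990000))  # [9987265, 9977121]
```

Under that assumption only the two newest jobs fall inside the window, which matches the two-element result quoted above.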

#39 Updated by kraih 6 months ago

okurz wrote:

After adding some logging, qem-bot states that https://openqa.suse.de/tests/1953773 is blocking approval. This job does not even exist in OSD (anymore?) and judging by the number it would be years old. I assume that in this case even force-result won't help. Sounds like #119161

Here's what the dashboard knows about the incident:

dashboard_db=# select job_id, status from openqa_jobs where incident_settings = 1953780 order by job_id desc;
 job_id  | status
---------+--------
 9962984 | passed
 9962255 | passed
 9962253 | passed
 9936895 | passed
 9936894 | passed
 9936893 | passed
 9936892 | passed
 9936891 | passed
 9936890 | passed
 9936889 | passed
 9936888 | passed
 9936884 | passed
 9936883 | passed
 9936882 | passed
 9936881 | passed
 9936880 | passed
 9936879 | passed
 9936878 | passed
 9936877 | passed
 9936876 | passed
 9936875 | passed
 9936874 | passed
 9936873 | passed
 9936872 | passed
 9936871 | passed
 9936870 | passed
 9936869 | passed
 9936868 | passed
 9936867 | passed
 9936866 | passed
 9936865 | passed
 9936864 | passed
 9936863 | passed
 9936862 | passed
 9936861 | passed
 9936860 | passed
 9936859 | passed
 9936858 | passed
 9936857 | passed
 9936856 | passed
 9936855 | passed
 9936854 | passed
 9936853 | passed
 9936852 | passed
 9936851 | passed
 9936850 | passed
 9936849 | passed
 9936848 | passed
 9936847 | passed
 9936846 | passed
 9936845 | passed
 9936844 | passed
 9936843 | passed
 9936842 | passed
 9936841 | passed
 9936840 | passed
(56 rows)

Curious that the newest 4 jobs in OSD for the incident are unknown to the dashboard and the dashboard doesn't know about 9937076 etc.:

openqa=# select * from job_settings where key like '%TEST_ISSUES' and value like '%26757%' order by job_id desc;
    id     |       key        | value | job_id  |      t_created      |      t_updated
-----------+------------------+-------+---------+---------------------+---------------------
 455803578 | LIVE_TEST_ISSUES | 26757 | 9987265 | 2022-11-18 10:59:48 | 2022-11-18 10:59:48
 455315068 | LIVE_TEST_ISSUES | 26757 | 9977121 | 2022-11-16 19:56:54 | 2022-11-16 19:56:54
 454525516 | LIVE_TEST_ISSUES | 26757 | 9962987 | 2022-11-15 12:55:06 | 2022-11-15 12:55:06
 454525454 | LIVE_TEST_ISSUES | 26757 | 9962985 | 2022-11-15 12:54:26 | 2022-11-15 12:54:26
 454525399 | LIVE_TEST_ISSUES | 26757 | 9962984 | 2022-11-15 12:54:25 | 2022-11-15 12:54:25
 454494414 | LIVE_TEST_ISSUES | 26757 | 9962255 | 2022-11-15 10:52:09 | 2022-11-15 10:52:09
 454494309 | LIVE_TEST_ISSUES | 26757 | 9962253 | 2022-11-15 10:51:38 | 2022-11-15 10:51:38
 453330250 | LIVE_TEST_ISSUES | 26757 | 9937076 | 2022-11-11 15:16:23 | 2022-11-11 15:16:23
 453330203 | LIVE_TEST_ISSUES | 26757 | 9937075 | 2022-11-11 15:16:23 | 2022-11-11 15:16:23
 453330187 | LIVE_TEST_ISSUES | 26757 | 9937074 | 2022-11-11 15:16:23 | 2022-11-11 15:16:23
 453330149 | LIVE_TEST_ISSUES | 26757 | 9937073 | 2022-11-11 15:16:22 | 2022-11-11 15:16:22
 453330101 | LIVE_TEST_ISSUES | 26757 | 9937072 | 2022-11-11 15:16:22 | 2022-11-11 15:16:22
 453330077 | LIVE_TEST_ISSUES | 26757 | 9937071 | 2022-11-11 15:16:22 | 2022-11-11 15:16:22
...
 453323887 | LIVE_TEST_ISSUES | 26757 | 9936895 | 2022-11-11 15:15:09 | 2022-11-11 15:15:09
 453323853 | LIVE_TEST_ISSUES | 26757 | 9936894 | 2022-11-11 15:15:09 | 2022-11-11 15:15:09
...

Note that these are not aggregate jobs, so it's unrelated to previous issues with missing data.
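
The comparison between the two query results can be reproduced as a set difference. The sketch below only uses job IDs that are explicitly listed above (the elided rows are left out), and `missing_from_dashboard` is a hypothetical helper name.

```python
# Job IDs copied from the two query results above (only explicitly listed rows).
dashboard_jobs = {9962984, 9962255, 9962253, 9936895, 9936894}
openqa_jobs = {
    9987265, 9977121, 9962987, 9962985, 9962984, 9962255, 9962253,
    9937076, 9937075, 9937074, 9937073, 9937072, 9937071, 9936895, 9936894,
}


def missing_from_dashboard(openqa_ids, dashboard_ids):
    """Return openQA job IDs the dashboard does not know, newest first."""
    return sorted(openqa_ids - dashboard_ids, reverse=True)


for job_id in missing_from_dashboard(openqa_jobs, dashboard_jobs):
    print(job_id)
```

The first four IDs printed are the newest OSD jobs that the dashboard does not know about, followed by 9937076 and its neighbours.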

#40 Updated by mkittler 6 months ago

If the bot/dashboard doesn't know about or consider the job on which the comment was created, then this feature obviously cannot work.

It would be great if the failed jobs would be logged (instead of getting just "INFO: Inc 26757 has failed job in incidents"). But before improving the logging it would be best to wait until the currently pending mega-PR (https://github.com/openSUSE/qem-bot/pull/84) has been merged. And then I can also improve the error handling as mentioned in #107923#note-35.

#41 Updated by okurz 6 months ago

  • Related to action #114415: [timeboxed:10h][spike solution] qem-bot comments on IBS size:S added

#42 Updated by jbaier_cz 6 months ago

  • Related to action #120939: [alert] Pipeline for scheduling incidents runs into timeout size:M added

#43 Updated by cdywan 6 months ago

mkittler wrote:

If the bot/dashboard doesn't know about or consider the job on which the comment was created, then this feature obviously cannot work.

It would be great if the failed jobs would be logged (instead of getting just "INFO: Inc 26757 has failed job in incidents"). But before improving the logging it would be best to wait until the currently pending mega-PR (https://github.com/openSUSE/qem-bot/pull/84) has been merged. And then I can also improve the error handling as mentioned in #107923#note-35.

See also https://github.com/openSUSE/qem-bot/pull/83 which was merged as part of it

#44 Updated by jbaier_cz 6 months ago

  • Related to action #119161: Approval step of qem-bot says incident has failed job in incidents but it looks empty on the dashboard size:M added

#45 Updated by okurz 6 months ago

  • Due date set to 2022-12-16
  • Status changed from Feedback to In Progress
  • Assignee changed from mkittler to okurz

The mentioned PR https://github.com/openSUSE/qem-bot/pull/84 has been reverted in the meantime. I see that we are not progressing here so I am taking over trying to bring in my original changes one by one. This should help with debugging from logs.

#47 Updated by okurz 5 months ago

  • Due date changed from 2022-12-16 to 2022-12-23
  • Status changed from In Progress to Feedback

#48 Updated by mkittler 5 months ago

What would actually still be left is the improvement mentioned in #107923#note-35.

#49 Updated by okurz 5 months ago

  • Due date changed from 2022-12-23 to 2023-01-20

Opened a new PR https://github.com/openSUSE/qem-bot/pull/103

I would like to keep this open until more people return after Christmas absence.

#50 Updated by okurz 5 months ago

  • Related to action #122308: Handle invalid openQA job references in qem-dashboard size:M added

#52 Updated by okurz 5 months ago

  • Status changed from Feedback to Workable

#122308 was resolved. So next step can be for me to work with tinita on https://github.com/openSUSE/qem-bot/pull/103 and then to crosscheck again if there are any more useful refactorings pending.

#53 Updated by cdywan 4 months ago

  • Due date deleted (2023-01-20)

I think the Due Date should've been reset here but wasn't.

#54 Updated by okurz 3 months ago

  • Assignee deleted (okurz)

This needs to be picked up by someone else. I failed to accommodate it.

#55 Updated by okurz 3 months ago

  • Priority changed from Normal to High

This is the only thing we need to complete to finish off #91646 as well -> "High" prio now

#56 Updated by jbaier_cz 3 months ago

I am just wondering what we are missing here... The main functionality (AC1 and AC2) should be covered by the initial pull request https://github.com/openSUSE/qem-bot/pull/73 and was then refined in some follow-ups. In the end we clarified the difference between dashboard job id and openQA job id and fixed that feature in https://github.com/openSUSE/qem-bot/pull/109.
All ACs are covered by a corresponding test.

#57 Updated by okurz 3 months ago

Two things left to do:

  • Refactor according to #107923#note-35
  • Verify ACs in production, not just from unit tests, i.e. reference an example from logs where non-tools-team users could effectively ignore an aggregate test for one incident

#58 Updated by jbaier_cz 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz

#59 Updated by jbaier_cz 3 months ago

  • Tags deleted (mob)

Revert the retry disabling commit and see if the 404 retries are really gone: https://github.com/openSUSE/qem-bot/pull/113

#60 Updated by openqa_review 3 months ago

  • Due date set to 2023-03-18

Setting due date based on mean cycle time of SUSE QE Tools

#61 Updated by jbaier_cz 3 months ago

  • Status changed from In Progress to Feedback

I created a small PR https://github.com/openSUSE/qem-bot/pull/114 to slightly improve the output of the Approver. Now it should also output a link to the failed / ignored openQA job

#62 Updated by mkittler 3 months ago

The most recent usage of the feature is https://openqa.suse.de/tests/10478348#comments (found via select job_id, user_id, text from comments where text like '%review%acceptable%' and user_id != 126;). Not sure whether that's good enough to verify that the feature works.

#63 Updated by jbaier_cz 3 months ago

  • Status changed from Feedback to Resolved

We have a new confirmation today in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1444928

2023-03-09 11:04:15 INFO     Ignoring failed job https://openqa.suse.de/t10643070 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO     Ignoring failed job https://openqa.suse.de/t10643071 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO     Ignoring failed job https://openqa.suse.de/t10643068 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO     Ignoring failed job https://openqa.suse.de/t10643069 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO     Found failed, not-ignored job https://openqa.suse.de/t10646456 for incident 27492
2023-03-09 11:04:15 INFO     SUSE:Maintenance:27492:290872 has at least one failed job in aggregate tests
