action #107923
coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
qem-bot: Ignore not-ok openQA jobs for specific incident based on openQA job comment size:M
0%
Description
Motivation
See the proposal in the parent epic #95479, e.g. about the specific format of an openQA label that is readable by qem-bot
Acceptance criteria
- AC1: A not-ok openQA job with a comment following the format described in https://progress.opensuse.org/issues/95479#Suggestions does not block approval of the incident updates named in that comment
- AC2: A not-ok openQA job with such a comment still blocks approval of all other incident updates not named in the comment
- AC3: A not-ok openQA job without such a comment still blocks all related incidents
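The three criteria can be sketched as a small decision function. This is only an illustration, not the qem-bot implementation: the label pattern is inferred from the example comment quoted later in this ticket (`@review:acceptable_for:incident_26757:...`), the function names are hypothetical, and treating `passed` and `softfailed` as the ok results is an assumption.

```python
import re

# Assumed label format, inferred from the example comment quoted in this ticket:
# "@review:acceptable_for:incident_<number>:<reason>"
ACCEPTABLE_FOR = re.compile(r"@review:acceptable_for:incident_(\d+):(\S+)")

# Assumption: these are the job results that count as "ok".
OK_RESULTS = ("passed", "softfailed")


def acceptable_incidents(comments):
    """Return the incident numbers that the given comment texts mark as acceptable."""
    incidents = set()
    for text in comments:
        for match in ACCEPTABLE_FOR.finditer(text):
            incidents.add(int(match.group(1)))
    return incidents


def is_blocking(job_result, comments, incident):
    """Decide whether one openQA job blocks approval of one incident."""
    if job_result in OK_RESULTS:
        return False  # ok jobs never block
    # A not-ok job blocks unless a comment names exactly this incident.
    return incident not in acceptable_incidents(comments)
```

With the comment from the later example, `is_blocking("failed", [comment], 26757)` is false (AC1), while the same job still blocks any other incident (AC2) and a job without comments blocks everything (AC3).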
Suggestions
- DONE: Add a testing framework to github.com/openSUSE/qem-bot/, e.g. based on github.com/os-autoinst/openqa_review -> #109641
- DONE: Add a simple automatic test exercising one of the existing happy path workflows of qem-bot -> #110167
- DONE: Add automatic tests for the above acceptance criteria
- DONE: As a first quick-and-dirty approach, read out openQA comments directly within the approval step of qem-bot (only for the failed jobs, which should not take too long)
- DONE: Parse the mentioned special label string and, for the parsed incident, remove the corresponding not-ok openQA job from the list of blocking results
- Optional: Add openQA comment parsing over the openQA API together with consistent data in qem-dashboard, i.e.
- As qem-dashboard is the "database for qem-bot", read out the corresponding data from openQA that is pushed to qem-dashboard
- make qem-dashboard store the related data
- and then qem-bot should read it out from there
Out of scope
Visualize such specially handled failed openQA jobs in dashboard
Related issues
History
#1
Updated by okurz about 1 year ago
- Related to openqa-force-result #109857: Secure auto-review+force_result size:M auto_review:"Failed to download gobbledeegoop":force_result:softfailed added
#2
Updated by okurz about 1 year ago
- Target version changed from future to Ready
#3
Updated by okurz about 1 year ago
- Description updated (diff)
#4
Updated by okurz about 1 year ago
- Related to action #111078: Simple automatic test exercising one of the existing happy path workflows of qem-bot size:M added
#5
Updated by okurz about 1 year ago
- Status changed from New to Blocked
- Assignee set to okurz
#10
Updated by openqa_review 8 months ago
- Due date set to 2022-10-28
Setting due date based on mean cycle time of SUSE QE Tools
#11
Updated by jbaier_cz 8 months ago
Likely a regression: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1189679
#12
Updated by mkittler 8 months ago
PR to fix pipeline failures: https://github.com/openSUSE/qem-bot/pull/75
#13
Updated by jbaier_cz 8 months ago
It seems that the pipeline may be working, although we should still get rid of the stack traces. To actually see whether we already use the patched version, I propose a small update for the pipeline: https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/59
#14
Updated by mkittler 7 months ago
Let's see whether https://github.com/openSUSE/qem-bot/pull/76 fixes the pipeline (e.g. https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipelines/506277).
#16
Updated by mkittler 7 months ago
Maybe it is the simplest to just disable the retry for now: https://github.com/openSUSE/qem-bot/pull/79
(Not sure how to make it only retry on connection errors and 500 responses which would likely be the most reasonable choice and I assumed would be the default retry behavior.)
#17
Updated by mkittler 7 months ago
- Status changed from In Progress to Feedback
All PRs have been merged. The pipeline works again and doesn't take that long to execute (e.g. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1194317).
So putting a comment like https://progress.opensuse.org/issues/95479#Suggestions in an openQA job should now work.
#18
Updated by okurz 7 months ago
Nice. Also thank you for the announcement in #eng-testing. I suggest waiting for real-life usage. You could query the openQA comments for matching comments to see whether people have used it.
Before resolving, please make sure to move any not-done "optional" tasks into corresponding new ticket(s).
#19
Updated by tinita 7 months ago
mkittler wrote:
Maybe it is the simplest to just disable the retry for now: https://github.com/openSUSE/qem-bot/pull/79
(Not sure how to make it only retry on connection errors and 500 responses which would likely be the most reasonable choice and I assumed would be the default retry behavior.)
This should probably be done in the openqa_client.
There's even a related issue: https://github.com/os-autoinst/openQA-python-client/issues/16
I had a quick look at the code, but currently it is using requests, and if I understand it correctly, it would have to be rewritten with urllib3 to use the retry feature.
#21
Updated by mkittler 7 months ago
There are no such comments yet:
openqa=# select job_id, text from comments where text like '%review%acceptable%';
 job_id | text
--------+------
(0 rows)
Likely it is also possible to tweak the retry behavior while continuing to use requests. (I cannot imagine that requests makes this impossible. Likely they only prefer urllib3 because it already has a built-in feature for exactly what we want.)
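The retry policy discussed here (retry on connection errors and 5xx responses, fail immediately on client errors such as 404) can be sketched independently of the HTTP library. This is a minimal illustration under stated assumptions, not the openQA-python-client code: `HttpError`, `is_retryable` and `call_with_retries` are hypothetical names, and `HttpError` is a stand-in for whatever exception the real client raises.

```python
import time


class HttpError(Exception):
    """Hypothetical stand-in for a client error that carries an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(error):
    """Retry connection problems and 5xx responses; never retry 4xx such as 404."""
    if isinstance(error, ConnectionError):
        return True
    return isinstance(error, HttpError) and 500 <= error.status_code < 600


def call_with_retries(request, retries=3, wait=1.0, sleep=time.sleep):
    """Call request() and repeat it up to `retries` times on retryable errors."""
    attempt = 0
    while True:
        try:
            return request()
        except (ConnectionError, HttpError) as error:
            # Give up when the budget is used up or the error is permanent.
            if attempt >= retries or not is_retryable(error):
                raise
            attempt += 1
            sleep(wait)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. A 404 for a job that no longer exists is raised on the first attempt instead of being retried.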
#22
Updated by kraih 7 months ago
Still lots of error messages in the pipeline output https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1196209:
ERROR: ('GET', 'https://openqa.suse.de/api/v1/jobs/1944211/comments', 404)
Traceback (most recent call last):
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqa.py", line 73, in get_job_comments
    "GET", "jobs/%s/comments" % job_id, retries=0
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 184, in openqa_request
    return self.do_request(req, retries=retries, wait=wait)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 164, in do_request
    raise err
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 144, in do_request
    request.method, resp.url, resp.status_code
openqa_client.exceptions.RequestError: ('GET', 'https://openqa.suse.de/api/v1/jobs/1944211/comments', 404)
#26
Updated by mkittler 7 months ago
okurz wrote:
- Optional: Add openQA comment parsing over the openQA API together with consistent data in qem-dashboard, i.e.
- As qem-dashboard is the "database for qem-bot" read out the according data from openQA that is pushed to qem-dashboard
- make qem-dashboard store the related data
- and then qem-bot should read it out from there
That's likely not very useful. We'd need to query all comments when syncing the dashboard, so we'd likely end up with more requests on the openQA side. Besides, judging by the logs of recent approval pipeline runs, I suppose openQA will be able to handle the traffic (it doesn't look like too much, and we only query comments for failed jobs).
One could do a further optimization: As soon as we see a failing job blocking an incident, we'd avoid checking any further failing job for that incident (e.g. an early break in the loop introduced here: https://github.com/openSUSE/qem-bot/pull/73/files).
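The early break could look roughly like the following sketch. It is not the code from the linked PR; `first_blocking_job`, `fetch_comments` and `is_acceptable` are hypothetical names, and the two callables stand in for the openQA API request and the label check.

```python
def first_blocking_job(failed_job_ids, incident, fetch_comments, is_acceptable):
    """Return the first failed job that blocks the incident, or None.

    Comments are fetched lazily, so no further requests are made once a
    blocking job has been found.
    """
    for job_id in failed_job_ids:
        comments = fetch_comments(job_id)  # one API request per failed job
        if not any(is_acceptable(text, incident) for text in comments):
            return job_id  # early exit: the incident is blocked anyway
    return None
```

In the best case this costs a single comment request per blocked incident; only incidents whose failed jobs are all marked acceptable need every request.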
#27
Updated by mkittler 7 months ago
PR for the optimization mentioned in the previous comment: https://github.com/openSUSE/qem-bot/pull/82
Note that the feature isn't used yet (select job_id, user_id, text from comments where text like '%review%acceptable%' and user_id != 126; returns no results).
#28
Updated by mkittler 7 months ago
PR to make this at least a little bit more discoverable: https://github.com/os-autoinst/openQA/pull/4867
#29
Updated by okurz 7 months ago
- Related to action #119467: "Internal server error" on opening any job group front page at OSD added
#32
Updated by mkittler 7 months ago
Fixed version of the original PR with test coverage for the SUSE branding template code: https://github.com/os-autoinst/openQA/pull/4881
#34
Updated by okurz 7 months ago
- Due date deleted (2022-11-11)
- Status changed from Feedback to Resolved
I triggered an extraordinary deployment on OSD and verified that the comment template showed up on https://openqa.suse.de/tests/latest#comments . Documentation and informing people can be done in related tickets, e.g. #111066
#35
Updated by tinita 6 months ago
- Status changed from Resolved to Feedback
We should be able to use retry now because 404 is now excluded: https://github.com/os-autoinst/openQA-python-client/pull/34
#37
Updated by okurz 6 months ago
In https://suse.slack.com/archives/C02CBB35W5B/p1668779284852569?thread_ts=1668611066.502389&cid=C02CBB35W5B Veronika Svecova tried to use the special marking comment but it seems to have no effect. Do you know why incident 26757 can not be auto-approved? The only linked failed incident test according to http://dashboard.qam.suse.de/incident/26757 is https://openqa.suse.de/tests/9987265 with a comment "@review:acceptable_for:incident_26757:kgraft-patch-SLE12-SP4_Update_23:cleared_with_Marcus" but https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1246560#L171 states "INFO: Inc 26757 has failed job in incidents"
openqa-cli api --osd job_settings/jobs key=*_TEST_ISSUES list_value=26757
returns
{"jobs":[9987265,9977121]}
which seems quite broken as there are definitely more jobs than two. At least it includes the one failed job that was marked to be ignorable. The other job is ok though.
Adding some logging, qem-bot states that https://openqa.suse.de/tests/1953773 is blocking approval. This job does not even exist in OSD (anymore?) and judging by the number it would be years old. I assume that in this case even force-result won't help. Sounds like #119161
#38
Updated by kraih 6 months ago
okurz wrote:
openqa-cli api --osd job_settings/jobs key=*_TEST_ISSUES list_value=26757
returns {"jobs":[9987265,9977121]} which seems quite broken as there are definitely more jobs than two. At least it includes the one failed job that was marked to be ignorable. The other job is ok though.
That would be the job_id limit. We only list jobs that are max(job_id) - 20000 or higher for performance. (A trigram index would fix that and show 257 results.) #117655
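The effect of that limit can be reproduced with a few lines. This is only a sketch of the described rule, not the openQA query: `jobs_in_window` is a hypothetical helper, and the maximum job ID of 9990000 used in the usage note is an assumed value for illustration that does not come from this ticket.

```python
def jobs_in_window(job_ids, max_job_id, window=20000):
    """Keep only jobs whose ID is max_job_id - window or higher, newest first."""
    lower_bound = max_job_id - window
    return sorted((job_id for job_id in job_ids if job_id >= lower_bound), reverse=True)
```

With the assumed maximum of 9990000, `jobs_in_window([9987265, 9977121, 9962987], 9990000)` keeps 9987265 and 9977121 and drops 9962987, which matches the two-job answer quoted above.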
#39
Updated by kraih 6 months ago
okurz wrote:
Adding some logging, qem-bot states that https://openqa.suse.de/tests/1953773 is blocking approval. This job does not even exist in OSD (anymore?) and judging by the number it would be years old. I assume that in this case even force-result won't help. Sounds like #119161
Here's what the dashboard knows about the incident:
dashboard_db=# select job_id, status from openqa_jobs where incident_settings = 1953780 order by job_id desc;
 job_id  | status
---------+--------
 9962984 | passed
 9962255 | passed
 9962253 | passed
 9936895 | passed
 9936894 | passed
 9936893 | passed
 9936892 | passed
 9936891 | passed
 9936890 | passed
 9936889 | passed
 9936888 | passed
 9936884 | passed
 9936883 | passed
 9936882 | passed
 9936881 | passed
 9936880 | passed
 9936879 | passed
 9936878 | passed
 9936877 | passed
 9936876 | passed
 9936875 | passed
 9936874 | passed
 9936873 | passed
 9936872 | passed
 9936871 | passed
 9936870 | passed
 9936869 | passed
 9936868 | passed
 9936867 | passed
 9936866 | passed
 9936865 | passed
 9936864 | passed
 9936863 | passed
 9936862 | passed
 9936861 | passed
 9936860 | passed
 9936859 | passed
 9936858 | passed
 9936857 | passed
 9936856 | passed
 9936855 | passed
 9936854 | passed
 9936853 | passed
 9936852 | passed
 9936851 | passed
 9936850 | passed
 9936849 | passed
 9936848 | passed
 9936847 | passed
 9936846 | passed
 9936845 | passed
 9936844 | passed
 9936843 | passed
 9936842 | passed
 9936841 | passed
 9936840 | passed
(56 rows)
Curious that the newest 4 jobs in OSD for the incident are unknown to the dashboard and the dashboard doesn't know about 9937076 etc.:
openqa=# select * from job_settings where key like '%TEST_ISSUES' and value like '%26757%' order by job_id desc;
    id     |       key        | value | job_id  |      t_created      |      t_updated
-----------+------------------+-------+---------+---------------------+---------------------
 455803578 | LIVE_TEST_ISSUES | 26757 | 9987265 | 2022-11-18 10:59:48 | 2022-11-18 10:59:48
 455315068 | LIVE_TEST_ISSUES | 26757 | 9977121 | 2022-11-16 19:56:54 | 2022-11-16 19:56:54
 454525516 | LIVE_TEST_ISSUES | 26757 | 9962987 | 2022-11-15 12:55:06 | 2022-11-15 12:55:06
 454525454 | LIVE_TEST_ISSUES | 26757 | 9962985 | 2022-11-15 12:54:26 | 2022-11-15 12:54:26
 454525399 | LIVE_TEST_ISSUES | 26757 | 9962984 | 2022-11-15 12:54:25 | 2022-11-15 12:54:25
 454494414 | LIVE_TEST_ISSUES | 26757 | 9962255 | 2022-11-15 10:52:09 | 2022-11-15 10:52:09
 454494309 | LIVE_TEST_ISSUES | 26757 | 9962253 | 2022-11-15 10:51:38 | 2022-11-15 10:51:38
 453330250 | LIVE_TEST_ISSUES | 26757 | 9937076 | 2022-11-11 15:16:23 | 2022-11-11 15:16:23
 453330203 | LIVE_TEST_ISSUES | 26757 | 9937075 | 2022-11-11 15:16:23 | 2022-11-11 15:16:23
 453330187 | LIVE_TEST_ISSUES | 26757 | 9937074 | 2022-11-11 15:16:23 | 2022-11-11 15:16:23
 453330149 | LIVE_TEST_ISSUES | 26757 | 9937073 | 2022-11-11 15:16:22 | 2022-11-11 15:16:22
 453330101 | LIVE_TEST_ISSUES | 26757 | 9937072 | 2022-11-11 15:16:22 | 2022-11-11 15:16:22
 453330077 | LIVE_TEST_ISSUES | 26757 | 9937071 | 2022-11-11 15:16:22 | 2022-11-11 15:16:22
 ...
 453323887 | LIVE_TEST_ISSUES | 26757 | 9936895 | 2022-11-11 15:15:09 | 2022-11-11 15:15:09
 453323853 | LIVE_TEST_ISSUES | 26757 | 9936894 | 2022-11-11 15:15:09 | 2022-11-11 15:15:09
 ...
Note that these are not aggregate jobs, so it's unrelated to previous issues with missing data.
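The comparison between the two query results boils down to a set difference. The sketch below uses only an excerpt of the job IDs shown in the two outputs above; `missing_from_dashboard` is a hypothetical helper name, not part of qem-bot or qem-dashboard.

```python
def missing_from_dashboard(osd_job_ids, dashboard_job_ids):
    """Return the openQA job IDs that the dashboard does not know about, newest first."""
    return sorted(set(osd_job_ids) - set(dashboard_job_ids), reverse=True)


# Excerpt of the job IDs from the two query outputs above.
osd_jobs = [9987265, 9977121, 9962987, 9962985, 9962984, 9962255, 9962253, 9937076, 9936895]
dashboard_jobs = [9962984, 9962255, 9962253, 9936895]

print(missing_from_dashboard(osd_jobs, dashboard_jobs))
```

For this excerpt the result contains the four newest OSD jobs plus 9937076, which are the jobs called out as unknown to the dashboard.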
#40
Updated by mkittler 6 months ago
If the bot/dashboard doesn't know about or consider the job on which the comment was created, then this feature obviously cannot work.
It would be great if the failed jobs would be logged (instead of getting just "INFO: Inc 26757 has failed job in incidents"). But before improving the logging it would be best to wait until the currently pending mega-PR (https://github.com/openSUSE/qem-bot/pull/84) has been merged. And then I can also improve the error handling as mentioned in #107923#note-35.
#41
Updated by okurz 6 months ago
- Related to action #114415: [timeboxed:10h][spike solution] qem-bot comments on IBS size:S added
#42
Updated by jbaier_cz 6 months ago
- Related to action #120939: [alert] Pipeline for scheduling incidents runs into timeout size:M added
#43
Updated by cdywan 6 months ago
mkittler wrote:
If the bot/dashboard doesn't know about or consider the job on which the comment was created, then this feature obviously cannot work.
It would be great if the failed jobs would be logged (instead of getting just "INFO: Inc 26757 has failed job in incidents"). But before improving the logging it would be best to wait until the currently pending mega-PR (https://github.com/openSUSE/qem-bot/pull/84) has been merged. And then I can also improve the error handling as mentioned in #107923#note-35.
See also https://github.com/openSUSE/qem-bot/pull/83 which was merged as part of it
#44
Updated by jbaier_cz 6 months ago
- Related to action #119161: Approval step of qem-bot says incident has failed job in incidents but it looks empty on the dashboard size:M added
#45
Updated by okurz 6 months ago
- Due date set to 2022-12-16
- Status changed from Feedback to In Progress
- Assignee changed from mkittler to okurz
The mentioned PR https://github.com/openSUSE/qem-bot/pull/84 has been reverted in the meantime. I see that we are not progressing here so I am taking over trying to bring in my original changes one by one. This should help with debugging from logs.
#46
Updated by okurz 6 months ago
https://github.com/openSUSE/qem-bot/pull/93
https://github.com/openSUSE/qem-bot/pull/94
https://github.com/openSUSE/qem-bot/pull/95
https://github.com/openSUSE/qem-bot/pull/96
EDIT: All merged. New PR, one step at a time:
#48
Updated by mkittler 5 months ago
What would actually still be left is the improvement mentioned in #107923#note-35.
#49
Updated by okurz 5 months ago
- Due date changed from 2022-12-23 to 2023-01-20
Opened a new PR https://github.com/openSUSE/qem-bot/pull/103
I would like to keep this open until more people return after Christmas absence.
#50
Updated by okurz 5 months ago
- Related to action #122308: Handle invalid openQA job references in qem-dashboard size:M added
#52
Updated by okurz 5 months ago
- Status changed from Feedback to Workable
#122308 was resolved. So next step can be for me to work with tinita on https://github.com/openSUSE/qem-bot/pull/103 and then to crosscheck again if there are any more useful refactorings pending.
#56
Updated by jbaier_cz 3 months ago
I am just wondering what we are missing here. The main functionality (AC1 and AC2) should be covered by the initial pull request https://github.com/openSUSE/qem-bot/pull/73, which was then refined in some follow-ups. In the end we clarified the difference between the dashboard job ID and the openQA job ID and fixed that feature in https://github.com/openSUSE/qem-bot/pull/109.
All ACs are covered by a corresponding test.
#57
Updated by okurz 3 months ago
Two things left to do:
- Refactor according to #107923#note-35
- Verify ACs in production, not just from unit tests, i.e. reference an example from logs where non-tools-team users could effectively ignore an aggregate test for one incident
#59
Updated by jbaier_cz 3 months ago
- Tags deleted (mob)
Revert the retry disabling commit and see if the 404 retries are really gone: https://github.com/openSUSE/qem-bot/pull/113
#60
Updated by openqa_review 3 months ago
- Due date set to 2023-03-18
Setting due date based on mean cycle time of SUSE QE Tools
#61
Updated by jbaier_cz 3 months ago
- Status changed from In Progress to Feedback
I created a small PR https://github.com/openSUSE/qem-bot/pull/114 to slightly improve the output of the Approver. Now it should also output a link to the failed / ignored openQA job
#62
Updated by mkittler 3 months ago
The most recent usage of the feature is https://openqa.suse.de/tests/10478348#comments (found via select job_id, user_id, text from comments where text like '%review%acceptable%' and user_id != 126;). Not sure whether that's good enough to verify that the feature works.
#63
Updated by jbaier_cz 3 months ago
- Status changed from Feedback to Resolved
We have a new confirmation today in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1444928
2023-03-09 11:04:15 INFO Ignoring failed job https://openqa.suse.de/t10643070 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO Ignoring failed job https://openqa.suse.de/t10643071 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO Ignoring failed job https://openqa.suse.de/t10643068 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO Ignoring failed job https://openqa.suse.de/t10643069 for incident 27492 due to openQA comment
2023-03-09 11:04:15 INFO Found failed, not-ignored job https://openqa.suse.de/t10646456 for incident 27492
2023-03-09 11:04:15 INFO SUSE:Maintenance:27492:290872 has at least one failed job in aggregate tests