action #133583
closed
qem-bot approve incidents failed in gitlab CI, reason unknown size:M
0%
Description
Observation
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1724368 says:
2023-07-31 10:36:47 INFO * SUSE:Maintenance:30033:304153
2023-07-31 10:36:47 INFO Accepting review for SUSE:Maintenance:29819:304034
2023-07-31 10:36:47 INFO Accepting review for SUSE:Maintenance:29993:304113
2023-07-31 10:36:47 INFO Received 'Not Found'. Request 304113 removed or problem on OBS side, ignoring
2023-07-31 10:36:47 INFO Accepting review for SUSE:Maintenance:29994:304114
…
2023-07-31 10:36:48 INFO Received 'Not Found'. Request 304153 removed or problem on OBS side, ignoring
2023-07-31 10:36:48 INFO End of bot run
++ let 'sleep=BACKOFF_FACTOR*2**count'
++ let count+=1
++ (( count > MAX_RETRIES ))
++ exit 100
Uploading artifacts for failed job 00:01
Uploading artifacts...
bot_*.log: found 3 matching artifact files and directories
Uploading artifacts as "archive" to coordinator... 201 Created id=1724368 responseStatus=201 Created token=64_LuS46
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 100
but I could not identify the underlying cause.
Acceptance criteria
- AC1: Those CI jobs no longer run into the issue mentioned under observation, or at least retry a reasonable number of times, or ignore the error for good
- AC2: The logs make it clear whether an error is fatal or has been ignored or when retries happened
Suggestions
- It was suggested to log the response body for the 404 because a 404 can have multiple causes
- Cross-check in the code what should happen before/after "End of bot run" and why the retry is actually triggered
- Improve the error message. It says "ignoring" but the job fails
- Check what MAX_RETRIES is set to. Is it 3? Do we want more retries? -> in .gitlab-ci.yml it looks like it is set to 0 and the "exit 100" simply means that the retries are exhausted
- https://github.com/openSUSE/qem-bot/blob/cbef942434e03d1aa92776d27813488a0462f5c1/openqabot/approver.py#L86
- https://github.com/openSUSE/qem-bot/blob/cbef942434e03d1aa92776d27813488a0462f5c1/openqabot/approver.py#L42
- https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml#L48
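To illustrate why MAX_RETRIES=0 ends in "exit 100" after the very first failure, here is a minimal Python sketch of the retry logic as it appears in the shell trace under Observation. The function name `run_with_retries` and its parameters are hypothetical, and the point at which the real script sleeps is an assumption; only the three traced statements are taken from the log.

```python
def run_with_retries(job, max_retries, backoff_factor, sleep=lambda seconds: None):
    """Retry `job` with exponential backoff, mirroring the traced shell steps.

    `job` is any callable returning True on success. `sleep` is injectable so
    the sketch can be exercised without waiting (hypothetical parameter).
    """
    count = 0
    while True:
        if job():
            return 0
        delay = backoff_factor * 2 ** count  # let 'sleep=BACKOFF_FACTOR*2**count'
        count += 1                           # let count+=1
        if count > max_retries:              # (( count > MAX_RETRIES ))
            return 100                       # exit 100
        sleep(delay)  # assumption: the real script sleeps before the next attempt


# With MAX_RETRIES=0 a single failure already exhausts the retries.
print(run_with_retries(lambda: False, max_retries=0, backoff_factor=1))
```

So the exit code 100 only says that the retries are exhausted; it does not say what made the bot run fail in the first place.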
Rollback steps
- Re-enable email notifications in https://gitlab.suse.de/qa-maintenance/bot-ng/edit
Files
Updated by tinita over 1 year ago
I asked in #help-obs.
Looking at the qem-bot code, the API request for 304153 and others must have returned 404, although they are older and must have existed already.
Updated by mkittler over 1 year ago
- Subject changed from qem-bot approve incidents failed in gitlab CI, reason unknown to qem-bot approve incidents failed in gitlab CI, reason unknown size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Description updated (diff)
- Priority changed from Normal to Urgent
This seems to happen often enough to make this "Urgent" (if not "Immediate"). Latest incident: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1734072#L464
Disabled email notifications in https://gitlab.suse.de/qa-maintenance/bot-ng/edit
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Ack. I'm starting by improving the error handling and will then see which way to go.
Updated by tinita over 1 year ago
Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404, so it could really just be a "request is not in review state", which means it's likely already approved.
edit: but if I try it out for https://build.suse.de/request/show/304153 , I get a "403 The request is neither in state review nor new", so maybe that's just wrong documentation.
Updated by livdywan over 1 year ago
Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404
The "Not found" example in the docs says Couldn't find request with id '120', though?
We can at least get the headers from the HTTPError. Interestingly the unit test for 404 is using the wrong strings but it's expecting it to fail. So I'm for now assuming that is the behavior that was desired:
https://github.com/openSUSE/qem-bot/pull/129
Getting the response body would be even better I guess, but not sure how to get that.
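On the question of how to get the response body: assuming the exception is a standard `urllib.error.HTTPError` (an assumption about what the bot's HTTP layer raises, not something confirmed here), the error object is itself file-like, so the body can be read from it. A minimal sketch with a simulated error; the helper name `describe_http_error` and the example URL are hypothetical:

```python
import io
import urllib.error
from email.message import Message


def describe_http_error(err: urllib.error.HTTPError) -> str:
    """Build a log message from status, the OBS error-code header and the body."""
    body = err.read().decode("utf-8", errors="replace")
    errorcode = err.headers.get("X-Opensuse-Errorcode", "unknown")
    return f"HTTP {err.code} {err.reason} (errorcode={errorcode}): {body}"


# Simulated error for illustration only; no network access involved.
headers = Message()
headers["X-Opensuse-Errorcode"] = "review_change_state_no_permission"
body = (
    b'<status code="review_change_state_no_permission">'
    b"<summary>The request is neither in state review nor new</summary>"
    b"</status>"
)
err = urllib.error.HTTPError(
    "https://example.invalid/request/304153", 403, "Forbidden", headers, io.BytesIO(body)
)
message = describe_http_error(err)
err.close()
print(message)
```

Note that the body can only be read once, so it should be captured before the exception is passed on.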
Updated by tinita over 1 year ago
livdywan wrote:
Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404
The "Not found" example in the docs says Couldn't find request with id '120', though?
Are you looking at the actual documentation for POST /request/id? It looks like what you are quoting is for the GET request.
Updated by livdywan over 1 year ago
tinita wrote:
livdywan wrote:
Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404
The "Not found" example in the docs says Couldn't find request with id '120', though?
Are you looking at the actual documentation for POST /request/id? It looks like what you are quoting is for the GET request.
Yes. You can select the example in the combo box. GET only has one possible outcome.
Updated by tinita over 1 year ago
For GET /request/id we see this documentation:
404 Not Found
Media type: application/xml; charset=utf-8
Example value:
<?xml version="1.0" encoding="UTF-8"?>
<status code="not_found">
<summary>Couldn't find request with id '5'</summary>
</status>
For POST /request/id we see this documentation:
404 Not Found
Media type: application/xml; charset=utf-8
Example "Request Not Modifiable":
<?xml version="1.0" encoding="UTF-8"?>
<status code="request_not_modifiable">
<summary>request is not in review state</summary>
</status>
And since in qem-bot we are doing a POST request, this should be the relevant one.
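Since both documented 404 responses share the same `<status>` shape and differ only in the `code` attribute and the summary, a small parser is enough to tell them apart once the body is available. This is a sketch using only the Python standard library; the function name `parse_obs_status` is hypothetical and not part of qem-bot.

```python
import xml.etree.ElementTree as ET


def parse_obs_status(xml_text):
    """Return (code, summary) from an OBS <status> document given as str or bytes."""
    if isinstance(xml_text, str):
        # Parse as bytes so the declared UTF-8 encoding is honoured.
        xml_text = xml_text.encode("utf-8")
    root = ET.fromstring(xml_text)
    if root.tag != "status":
        raise ValueError(f"unexpected root element: {root.tag}")
    summary = (root.findtext("summary") or "").strip()
    return root.get("code", ""), summary


post_example = """<?xml version="1.0" encoding="UTF-8"?>
<status code="request_not_modifiable">
<summary>request is not in review state</summary>
</status>"""

print(parse_obs_status(post_example))
```

The GET example would yield the code `not_found` instead, which is the distinction the log currently cannot show.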
Updated by tinita over 1 year ago
- File obs-post-request-id.png added
To make sure we're on the same page:
https://build.suse.de/apidocs/index#/Requests/post_request__id_
Updated by tinita over 1 year ago
When I try it out for 304153, I get:
# Headers
status: 403 Forbidden
x-opensuse-errorcode: review_change_state_no_permission
...
# Body
<status code="review_change_state_no_permission">
<summary>The request is neither in state review nor new</summary>
</status>
Updated by tinita over 1 year ago
So I assume we're fine, because the request had already been accepted. The documentation describes a 404 for this case, and apparently that is what the bot gets, whereas I get a 403 with a similar error. To be sure, we could just log the body for now.
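Once the body is logged, the distinction asked for in AC2 could look roughly like the sketch below. It is only an illustration: the set of codes treated as "already handled" is an assumption drawn from the two responses quoted in this ticket, and the names `ALREADY_HANDLED`, `classify_approval_error` and `log_approval_error` are hypothetical, not qem-bot code.

```python
import logging

# Assumption: these OBS error codes mean the request left the review state
# before the bot tried to accept it, so nothing is left to do.
ALREADY_HANDLED = {"request_not_modifiable", "review_change_state_no_permission"}


def classify_approval_error(http_status, obs_code):
    """Return "ignored" for errors that look like an already handled request."""
    if http_status in (403, 404) and obs_code in ALREADY_HANDLED:
        return "ignored"
    return "fatal"


def log_approval_error(log, request_id, http_status, obs_code, summary):
    """Log the error so the verdict (ignored or fatal) is explicit in the output."""
    verdict = classify_approval_error(http_status, obs_code)
    level = logging.INFO if verdict == "ignored" else logging.ERROR
    log.log(level, "Request %s: HTTP %s, code %r (%s): %s",
            request_id, http_status, obs_code, summary, verdict)
    return verdict


logging.basicConfig(level=logging.INFO)
log = logging.getLogger("approver-sketch")
log_approval_error(log, 304153, 403, "review_change_state_no_permission",
                   "The request is neither in state review nor new")
```

Anything outside the assumed set, for example a 401, would stay fatal and be logged as an error.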
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
- Status changed from In Progress to Feedback
livdywan wrote:
https://github.com/openSUSE/qem-bot/pull/129
Getting the response body would be even better I guess, but not sure how to get that.
FYI we do log the body now, and from here on we can hopefully disambiguate the errors we're getting
Updated by livdywan over 1 year ago
- Priority changed from Urgent to High
2023-08-04 18:15:24 ERROR Received error 401, reason: 'Unauthorized' for Request 304387 - problem on OBS side
The most recent failures, from 4 days ago, look like the line above. Otherwise there are no failures at all right now. Maybe it's fair to say it's High but not Urgent. Unfortunately we don't know what changed in the meantime.
Updated by livdywan over 1 year ago
- Status changed from Feedback to Resolved
We can probably resolve this. There's a minimum feasible improvement which should help us if this or a similar issue happens again.