action #133583

closed

qem-bot approve incidents failed in gitlab CI, reason unknown size:M

Added by okurz 9 months ago. Updated 9 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2023-07-31
Due date:
2023-08-19
% Done:

0%

Estimated time:
Tags:

Description

Observation

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1724368 says

2023-07-31 10:36:47 INFO     * SUSE:Maintenance:30033:304153
2023-07-31 10:36:47 INFO     Accepting review for SUSE:Maintenance:29819:304034
2023-07-31 10:36:47 INFO     Accepting review for SUSE:Maintenance:29993:304113
2023-07-31 10:36:47 INFO     Received 'Not Found'. Request 304113 removed or problem on OBS side, ignoring
2023-07-31 10:36:47 INFO     Accepting review for SUSE:Maintenance:29994:304114
…
2023-07-31 10:36:48 INFO     Received 'Not Found'. Request 304153 removed or problem on OBS side, ignoring
2023-07-31 10:36:48 INFO     End of bot run
++ let 'sleep=BACKOFF_FACTOR*2**count'
++ let count+=1
++ ((  count > MAX_RETRIES  ))
++ exit 100
Uploading artifacts for failed job 00:01
Uploading artifacts...
bot_*.log: found 3 matching artifact files and directories 
Uploading artifacts as "archive" to coordinator... 201 Created  id=1724368 responseStatus=201 Created token=64_LuS46
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 100

but I could not identify the underlying cause

Acceptance criteria

  • AC1: Those CI jobs no longer run into the issue mentioned under Observation, or at least retry a reasonable number of times, or ignore the error for good
  • AC2: The logs make it clear whether an error is fatal or has been ignored or when retries happened
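
A minimal sketch of what AC1 and AC2 could look like together, assuming the retry logic were done in Python rather than in the CI shell wrapper. The function name `run_with_retries` and its parameters are hypothetical and not part of qem-bot; only the `BACKOFF_FACTOR*2**count` formula is taken from the job log above.

```python
import logging
import time

log = logging.getLogger("bot.retry")


def run_with_retries(action, max_retries=3, backoff_factor=1.0, sleep=time.sleep):
    """Call action() until it succeeds or max_retries retries are used up.

    Hypothetical helper: the delay mirrors BACKOFF_FACTOR*2**count from the
    CI log. Every retry and the final failure are logged explicitly so the
    log shows whether an error was retried or was fatal.
    """
    attempt = 0
    while True:
        try:
            return action()
        except Exception as error:
            if attempt >= max_retries:
                log.error("Fatal: giving up after %d retries: %s", attempt, error)
                raise
            delay = backoff_factor * 2 ** attempt
            attempt += 1
            log.warning(
                "Retry %d/%d in %.1fs after error: %s",
                attempt, max_retries, delay, error,
            )
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller would normally leave it at the default.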

Suggestions

Rollback steps


Files

obs-post-request-id.png (129 KB) tinita, 2023-08-04 13:14
Actions #1

Updated by tinita 9 months ago

I asked in #help-obs.
Looking at the qem-bot code, the API request for 304153 and others must have returned 404, although they are older and must have existed already.

Actions #2

Updated by mkittler 9 months ago

  • Subject changed from qem-bot approve incidents failed in gitlab CI, reason unknown to qem-bot approve incidents failed in gitlab CI, reason unknown size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by okurz 9 months ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent

This seems to happen often enough to make this "Urgent" (if not "Immediate"). Latest incident: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1734072#L464

Disabled email notifications in https://gitlab.suse.de/qa-maintenance/bot-ng/edit

Actions #4

Updated by livdywan 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

Ack. I'm starting by improving the error handling and will then see which way to go.

Actions #5

Updated by tinita 9 months ago

Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404, so it could really just be a "request is not in review state" error, which means the request is likely already approved.

edit: but if I try it out for https://build.suse.de/request/show/304153 , I get a "403 The request is neither in state review nor new", so maybe that's just wrong documentation.

Actions #6

Updated by livdywan 9 months ago

Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404

The "Not found" example in the docs says Couldn't find request with id '120', though?

We can at least get the headers from the HTTPError. Interestingly, the unit test for 404 is using the wrong strings, but it's expecting the call to fail. So for now I'm assuming that is the desired behavior:

https://github.com/openSUSE/qem-bot/pull/129

Getting the response body would be even better I guess, but not sure how to get that.
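
For reference, a sketch of one way to get at the body, assuming the exception is a standard `urllib.error.HTTPError` (which is a file-like object, so `read()` returns the response body). The helper name `describe_http_error` and the sample URL and header values are hypothetical, for illustration only, and this is not what the PR above does.

```python
import io
import logging
import urllib.error
from email.message import Message

log = logging.getLogger("bot.http")


def describe_http_error(error):
    """Collect status, reason, headers and decoded body of an HTTPError."""
    body = error.read().decode("utf-8", errors="replace")
    return {
        "status": error.code,
        "reason": error.reason,
        "headers": dict(error.headers.items()),
        "body": body,
    }


# Illustration with a hand-built error; a real one would come from the
# failed API call. URL and header value are made up for the example.
headers = Message()
headers["X-Example-Header"] = "example"
sample = urllib.error.HTTPError(
    "https://example.invalid/request/1",
    404,
    "Not Found",
    headers,
    io.BytesIO(b"<status code='not_found'><summary>example</summary></status>"),
)
details = describe_http_error(sample)
log.error("Received error %s (%s): %s", details["status"], details["reason"], details["body"])
```

Note that the body can only be read once from the underlying stream, so it is worth storing the result rather than calling `read()` again later.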

Actions #7

Updated by tinita 9 months ago

livdywan wrote:

Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404

The "Not found" example in the docs says Couldn't find request with id '120', though?

Are you looking at the actual documentation for POST /request/id? It looks like what you are quoting is for the GET request.

Actions #8

Updated by livdywan 9 months ago

tinita wrote:

livdywan wrote:

Looking at the docs of POST /request/id, it's possible that they changed the 403 to a 404

The "Not found" example in the docs says Couldn't find request with id '120', though?

Are you looking at the actual documentation for POST /request/id? It looks like what you are quoting is for the GET request.

Yes. You can select the example in the combo box. GET only has one possible outcome.

Actions #9

Updated by tinita 9 months ago

For GET /request/id we see this documentation:

404 
Not Found

Media type

application/xml; charset=utf-8
Example Value
Schema
<?xml version="1.0" encoding="UTF-8"?>
<status code="not_found">
    <summary>Couldn&apos;t find request with id &apos;5&apos;</summary>
</status>

For POST /request/id we see this documentation:

404 
Not Found

Media type

application/xml; charset=utf-8
Examples

Request Not Modifiable
Example Value
Schema
<?xml version="1.0" encoding="UTF-8"?>
<status code="request_not_modifiable">
    <summary>request is not in review state</summary>
</status>

And since in qem-bot we are doing a POST request, this should be the relevant one.
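
To tell these two cases apart programmatically, the `code` attribute and the `<summary>` text of the `<status>` document could be extracted. A small sketch using only the standard library; the function name `parse_obs_status` is hypothetical, and the inputs are the two documentation examples quoted above.

```python
import xml.etree.ElementTree as ET


def parse_obs_status(body):
    """Return (code, summary) from a <status> XML document.

    Accepts str or bytes; str is encoded first so that the XML
    declaration's encoding is handled uniformly.
    """
    if isinstance(body, str):
        body = body.encode("utf-8")
    root = ET.fromstring(body)
    if root.tag != "status":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    summary = root.findtext("summary", default="")
    return root.get("code"), summary.strip()


POST_EXAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<status code="request_not_modifiable">
    <summary>request is not in review state</summary>
</status>"""

GET_EXAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<status code="not_found">
    <summary>Couldn&apos;t find request with id &apos;5&apos;</summary>
</status>"""

post_code, post_summary = parse_obs_status(POST_EXAMPLE)
get_code, get_summary = parse_obs_status(GET_EXAMPLE)
```

With the code in hand, a caller could decide that `request_not_modifiable` is safe to ignore while `not_found` deserves a closer look; whether that policy is right for qem-bot is an open question here.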

Actions #11

Updated by tinita 9 months ago

When I try it out for 304153 I get:

# Headers
status: 403 Forbidden
x-opensuse-errorcode: review_change_state_no_permission
...
# Body
<status code="review_change_state_no_permission">
  <summary>The request is neither in state review nor new</summary>
</status>
Actions #12

Updated by tinita 9 months ago

So I'm assuming we're fine, because the request was accepted before. The documentation talks about a 404, and apparently that's what the bot is getting, whereas I get a 403 with a similar error. To be sure, we could just log the body for now.

Actions #13

Updated by openqa_review 9 months ago

  • Due date set to 2023-08-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by livdywan 9 months ago

  • Status changed from In Progress to Feedback

livdywan wrote:

https://github.com/openSUSE/qem-bot/pull/129

Getting the response body would be even better I guess, but not sure how to get that.

FYI we do log the body now, and from here on we can hopefully disambiguate the errors we're getting

Actions #15

Updated by livdywan 9 months ago

  • Priority changed from Urgent to High
2023-08-04 18:15:24 ERROR    Received error 401, reason: 'Unauthorized' for Request 304387 - problem on OBS side

The most recent failures, from 4 days ago, look like the line above. Otherwise there are no failures at all right now, so it seems fair to call this High rather than Urgent. Unfortunately we don't know what changed in the meantime.

Actions #16

Updated by livdywan 9 months ago

  • Status changed from Feedback to Resolved

We can probably resolve this. There's a minimum feasible improvement which should help us if this or a similar issue happens again.
