Project

General

Profile

Actions

action #156625

closed

[alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M

Added by okurz about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Observation

After #156052 we still have a case https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2344558 like this:

Job state of job ID 13715326: running, waiting …
{"blocked_by_id":null,"id":13715326,"result":"none","state":"running"}
Job state of job ID 13715326: running, waiting …
Request failed, hit error 503, retrying up to 60 more times after waiting …
…
Request failed, hit error 503, retrying up to 1 more times after waiting …
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.51 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>

that's possibly a retry over multiple minutes but still something is off here.

Acceptance criteria

  • AC1: (vague) openqa-cli waits sufficiently long to cover usual OSD outages
  • AC2: The retry-functionality in openqa-cli was double-verified and works as intended

Suggestions

  • Test the openqa-cli behaviour maybe together with an apache proxy on a local installation
  • Check if the retry actually properly sleeps in between
  • Consider adding exponential backup into openqa-cli, see https://github.com/okurz/retry/blob/main/retry#L49
  • Consider adding a timestamp to the gitlab CI pipeline output
  • Consider output the value of OPENQA_CLI_RETRY_SLEEP_TIME_S in the Request failed, hit error ..., retrying up to ... more times after waiting line

Related issues 1 (0 open1 closed)

Copied from openQA Project - action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:SResolvedmkittler2024-02-262024-03-13

Actions
Actions #1

Updated by okurz about 2 months ago

  • Copied from action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S added
Actions #2

Updated by livdywan about 2 months ago

Once more: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2353970

  • Two jobs were scheduled, and passed
  • Somehow OSD became unresponsive while --monitor was still waiting?
  • There were no other issues around that time?
  • It took 02:54 and 01:49 respectively to finish the tests.
  • The 503 causes the wait for the job state to be aborted.

Maybe a 503 should be treated like a temporary error?

Actions #3

Updated by tinita about 2 months ago · Edited

This is apparently a typical time when the webui is restarted (looks the same as yesterday):
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1709699912476&to=1709707387316

Mar 05 06:15:30 openqa systemd[1]: Started The openQA web UI.
Mar 06 06:15:58 openqa systemd[1]: Started The openQA web UI.

The 503 is already treated as a temporary error, that's why it's retrying.
We could set OPENQA_CLI_RETRY_SLEEP_TIME_S to a higher value. Currently it is set to 5, which means we wait 5 minutes in total.

Not sure why it was unavailable for 5 minutes though:

Mar 06 06:15:34 openqa systemd[1]: Stopping The openQA web UI...
...
Mar 06 06:15:34 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
Mar 06 06:15:34 openqa systemd[1]: Stopped The openQA web UI.
Mar 06 06:15:58 openqa systemd[1]: Started The openQA web UI.
Mar 06 06:15:59 openqa openqa[13845]: [info] Loading external plugin AMQP
Mar 06 06:15:59 openqa openqa[13845]: [info] Loading external plugin ObsRsync
Mar 06 06:16:00 openqa openqa[13845]: [info] Listening at "http://127.0.0.1:9526"
Mar 06 06:16:00 openqa openqa[13845]: Web application available at http://127.0.0.1:9526
Mar 06 06:16:00 openqa openqa[13845]: [info] Listening at "http://[::1]:9526"
Mar 06 06:16:00 openqa openqa[13845]: Web application available at http://[::1]:9526
Actions #4

Updated by tinita about 2 months ago

What I find very confusing is the 503 in the output of openqa-cli schedule:

+ openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.NoQyW7vXWm DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-03-05T05:12+00:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240304-2-Server-DVD-Updates-64bit.qcow2
503 Service Unavailable
{"count":2,"failed":[],"ids":[13715326,13715327],"scheduled_product_id":2070541}

It returns a 503, but also apparently did schedule two jobs successfully? I thought the 503 might come from the openqa-cli call finding out the settings.HDD_1 but that can't be because HDD_1=... looks ok in the line below.

Actions #5

Updated by okurz about 2 months ago

I suggest to check the apache logs as the 503 might come from the openQA WebUI or the apache proxy.

Actions #6

Updated by okurz about 2 months ago

  • Subject changed from [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 to [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to tinita
Actions #7

Updated by tinita about 2 months ago · Edited

% grep '" 503 ' access_log >/tmp/503
% wc -l /tmp/503
503 /tmp/503          # <------- that's a funny coincidence that we had exactly 503 503 errors :D
% head -1 /tmp/503
... - - [06/Mar/2024:06:15:34 +0100] "GET /api/v1/mm/children HTTP/1.1" 503 1013 "-" "Mojolicious (Perl)" 1
% tail -1 /tmp/503
... - - [06/Mar/2024:06:16:00 +0100] "POST /api/v1/barrier/HAWK_GUI_CPU_TEST_FINISH_hawk HTTP/1.1" 503 1013 "-" "Mojolicious (Perl)" 0

So apache only returned 503 only for that half minute from 06:15:34-06:16:00 that matches the stop and start of the webui

Actions #9

Updated by tinita about 2 months ago · Edited

https://github.com/os-autoinst/scripts/pull/297 Make scheduling of multimachine tests easier to debug (merged)
https://github.com/os-autoinst/openQA/pull/5502 Log waiting time in monitor script and useragent (merged)

Actions #10

Updated by tinita about 2 months ago

  • Status changed from Workable to Feedback

PRs merged. Will monitor the pipelines for now

Actions #11

Updated by tinita about 2 months ago

Error hasn't occurred again yet.
Only unrelated failure with the registry: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2367528

Actions #12

Updated by tinita about 1 month ago

  • Status changed from Feedback to Resolved

Still the error hasn't occurred again. Resolving for now.
We will get an alert if it happens again, and we have more helpful output now.

Actions

Also available in: Atom PDF