action #156625
closed[alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M
0%
Description
Observation¶
After #156052 we still have a case https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2344558 like this:
Job state of job ID 13715326: running, waiting …
{"blocked_by_id":null,"id":13715326,"result":"none","state":"running"}
Job state of job ID 13715326: running, waiting …
Request failed, hit error 503, retrying up to 60 more times after waiting …
…
Request failed, hit error 503, retrying up to 1 more times after waiting …
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.51 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
that's possibly a retry over multiple minutes but still something is off here.
Acceptance criteria¶
- AC1: (vague) openqa-cli waits sufficiently long to cover usual OSD outages
- AC2: The retry-functionality in openqa-cli was double-verified and works as intended
Suggestions¶
- Test the openqa-cli behaviour maybe together with an apache proxy on a local installation
- Check if the retry actually properly sleeps in between
- Consider adding exponential backup into openqa-cli, see https://github.com/okurz/retry/blob/main/retry#L49
- Consider adding a timestamp to the gitlab CI pipeline output
- Consider output the value of
OPENQA_CLI_RETRY_SLEEP_TIME_S
in theRequest failed, hit error ..., retrying up to ... more times after waiting
line
Updated by okurz about 1 year ago
- Copied from action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S added
Updated by livdywan about 1 year ago
Once more: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2353970
- Two jobs were scheduled, and passed
- Somehow OSD became unresponsive while --monitor was still waiting?
- There were no other issues around that time?
- It took 02:54 and 01:49 respectively to finish the tests.
- The 503 causes the wait for the job state to be aborted.
Maybe a 503 should be treated like a temporary error?
Updated by tinita about 1 year ago · Edited
This is apparently a typical time when the webui is restarted (looks the same as yesterday):
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1709699912476&to=1709707387316
Mar 05 06:15:30 openqa systemd[1]: Started The openQA web UI.
Mar 06 06:15:58 openqa systemd[1]: Started The openQA web UI.
The 503 is already treated as a temporary error, that's why it's retrying.
We could set OPENQA_CLI_RETRY_SLEEP_TIME_S
to a higher value. Currently it is set to 5, which means we wait 5 minutes in total.
Not sure why it was unavailable for 5 minutes though:
Mar 06 06:15:34 openqa systemd[1]: Stopping The openQA web UI...
...
Mar 06 06:15:34 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
Mar 06 06:15:34 openqa systemd[1]: Stopped The openQA web UI.
Mar 06 06:15:58 openqa systemd[1]: Started The openQA web UI.
Mar 06 06:15:59 openqa openqa[13845]: [info] Loading external plugin AMQP
Mar 06 06:15:59 openqa openqa[13845]: [info] Loading external plugin ObsRsync
Mar 06 06:16:00 openqa openqa[13845]: [info] Listening at "http://127.0.0.1:9526"
Mar 06 06:16:00 openqa openqa[13845]: Web application available at http://127.0.0.1:9526
Mar 06 06:16:00 openqa openqa[13845]: [info] Listening at "http://[::1]:9526"
Mar 06 06:16:00 openqa openqa[13845]: Web application available at http://[::1]:9526
Updated by tinita about 1 year ago
What I find very confusing is the 503 in the output of openqa-cli schedule
:
+ openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.NoQyW7vXWm DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-03-05T05:12+00:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240304-2-Server-DVD-Updates-64bit.qcow2
503 Service Unavailable
{"count":2,"failed":[],"ids":[13715326,13715327],"scheduled_product_id":2070541}
It returns a 503, but also apparently did schedule two jobs successfully? I thought the 503 might come from the openqa-cli call finding out the settings.HDD_1
but that can't be because HDD_1=...
looks ok in the line below.
Updated by okurz about 1 year ago
I suggest to check the apache logs as the 503 might come from the openQA WebUI or the apache proxy.
Updated by okurz about 1 year ago
- Subject changed from [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 to [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M
- Description updated (diff)
- Status changed from New to Workable
- Assignee set to tinita
Updated by tinita about 1 year ago · Edited
% grep '" 503 ' access_log >/tmp/503
% wc -l /tmp/503
503 /tmp/503 # <------- that's a funny coincidence that we had exactly 503 503 errors :D
% head -1 /tmp/503
... - - [06/Mar/2024:06:15:34 +0100] "GET /api/v1/mm/children HTTP/1.1" 503 1013 "-" "Mojolicious (Perl)" 1
% tail -1 /tmp/503
... - - [06/Mar/2024:06:16:00 +0100] "POST /api/v1/barrier/HAWK_GUI_CPU_TEST_FINISH_hawk HTTP/1.1" 503 1013 "-" "Mojolicious (Perl)" 0
So apache only returned 503 only for that half minute from 06:15:34-06:16:00
that matches the stop and start of the webui
Updated by tinita about 1 year ago · Edited
https://github.com/os-autoinst/scripts/pull/297 Make scheduling of multimachine tests easier to debug (merged)
https://github.com/os-autoinst/openQA/pull/5502 Log waiting time in monitor script and useragent (merged)
Updated by tinita about 1 year ago
- Status changed from Workable to Feedback
PRs merged. Will monitor the pipelines for now
Updated by tinita about 1 year ago
Error hasn't occurred again yet.
Only unrelated failure with the registry: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2367528
Updated by tinita about 1 year ago
- Status changed from Feedback to Resolved
Still the error hasn't occurred again. Resolving for now.
We will get an alert if it happens again, and we have more helpful output now.
Updated by jbaier_cz 11 months ago
- Copied to action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added