action #156625
closed[alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M
0%
Description
Observation¶
After #156052 we still have a case https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2344558 like this:
Job state of job ID 13715326: running, waiting …
{"blocked_by_id":null,"id":13715326,"result":"none","state":"running"}
Job state of job ID 13715326: running, waiting …
Request failed, hit error 503, retrying up to 60 more times after waiting …
…
Request failed, hit error 503, retrying up to 1 more times after waiting …
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<p>Additionally, a 503 Service Unavailable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.4.51 (Linux/SUSE) Server at openqa.suse.de Port 80</address>
</body></html>
that's possibly a retry over multiple minutes but still something is off here.
Acceptance criteria¶
- AC1: (vague) openqa-cli waits sufficiently long to cover usual OSD outages
- AC2: The retry-functionality in openqa-cli was double-verified and works as intended
Suggestions¶
- Test the openqa-cli behaviour maybe together with an apache proxy on a local installation
- Check if the retry actually properly sleeps in between
- Consider adding exponential backup into openqa-cli, see https://github.com/okurz/retry/blob/main/retry#L49
- Consider adding a timestamp to the gitlab CI pipeline output
- Consider output the value of
OPENQA_CLI_RETRY_SLEEP_TIME_S
in theRequest failed, hit error ..., retrying up to ... more times after waiting
line
Updated by okurz 7 months ago
- Copied from action #156052: [alert] Scripts CI pipeline failing after logging multiple Job state of job ID 13603796: running, waiting size:S added
Updated by livdywan 7 months ago
Once more: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2353970
- Two jobs were scheduled, and passed
- Somehow OSD became unresponsive while --monitor was still waiting?
- There were no other issues around that time?
- It took 02:54 and 01:49 respectively to finish the tests.
- The 503 causes the wait for the job state to be aborted.
Maybe a 503 should be treated like a temporary error?
Updated by tinita 7 months ago · Edited
This is apparently a typical time when the webui is restarted (looks the same as yesterday):
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1709699912476&to=1709707387316
Mar 05 06:15:30 openqa systemd[1]: Started The openQA web UI.
Mar 06 06:15:58 openqa systemd[1]: Started The openQA web UI.
The 503 is already treated as a temporary error, that's why it's retrying.
We could set OPENQA_CLI_RETRY_SLEEP_TIME_S
to a higher value. Currently it is set to 5, which means we wait 5 minutes in total.
Not sure why it was unavailable for 5 minutes though:
Mar 06 06:15:34 openqa systemd[1]: Stopping The openQA web UI...
...
Mar 06 06:15:34 openqa systemd[1]: openqa-webui.service: Deactivated successfully.
Mar 06 06:15:34 openqa systemd[1]: Stopped The openQA web UI.
Mar 06 06:15:58 openqa systemd[1]: Started The openQA web UI.
Mar 06 06:15:59 openqa openqa[13845]: [info] Loading external plugin AMQP
Mar 06 06:15:59 openqa openqa[13845]: [info] Loading external plugin ObsRsync
Mar 06 06:16:00 openqa openqa[13845]: [info] Listening at "http://127.0.0.1:9526"
Mar 06 06:16:00 openqa openqa[13845]: Web application available at http://127.0.0.1:9526
Mar 06 06:16:00 openqa openqa[13845]: [info] Listening at "http://[::1]:9526"
Mar 06 06:16:00 openqa openqa[13845]: Web application available at http://[::1]:9526
Updated by tinita 7 months ago
What I find very confusing is the 503 in the output of openqa-cli schedule
:
+ openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.NoQyW7vXWm DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-03-05T05:12+00:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240304-2-Server-DVD-Updates-64bit.qcow2
503 Service Unavailable
{"count":2,"failed":[],"ids":[13715326,13715327],"scheduled_product_id":2070541}
It returns a 503, but also apparently did schedule two jobs successfully? I thought the 503 might come from the openqa-cli call finding out the settings.HDD_1
but that can't be because HDD_1=...
looks ok in the line below.
Updated by tinita 7 months ago · Edited
% grep '" 503 ' access_log >/tmp/503
% wc -l /tmp/503
503 /tmp/503 # <------- that's a funny coincidence that we had exactly 503 503 errors :D
% head -1 /tmp/503
... - - [06/Mar/2024:06:15:34 +0100] "GET /api/v1/mm/children HTTP/1.1" 503 1013 "-" "Mojolicious (Perl)" 1
% tail -1 /tmp/503
... - - [06/Mar/2024:06:16:00 +0100] "POST /api/v1/barrier/HAWK_GUI_CPU_TEST_FINISH_hawk HTTP/1.1" 503 1013 "-" "Mojolicious (Perl)" 0
So apache only returned 503 only for that half minute from 06:15:34-06:16:00
that matches the stop and start of the webui
Updated by tinita 7 months ago · Edited
https://github.com/os-autoinst/scripts/pull/297 Make scheduling of multimachine tests easier to debug (merged)
https://github.com/os-autoinst/openQA/pull/5502 Log waiting time in monitor script and useragent (merged)
Updated by tinita 7 months ago
Error hasn't occurred again yet.
Only unrelated failure with the registry: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2367528
Updated by jbaier_cz 5 months ago
- Copied to action #160877: [alert] Scripts CI pipeline failing due to osd yielding 502 size:M added