action #160877
coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
[alert] Scripts CI pipeline failing due to osd yielding 502 size:M
Description
Observation
We have a case where https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2649768 fails due to:
Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 70s)
{"blocked_by_id":null,"id":14429107,"result":"none","state":"scheduled"}
Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 80s)
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
...
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.21.5</center>
</body>
</html>
This also happened again: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2655477
Did we manage to DoS the server? Do we need to tweak nginx even more?
Suggestions
- We're already retrying 60 times as is visible in the logs - more retries probably won't help
- Maybe this could be a bug in openqa-cli ... --monitor
- How come we didn't see issues elsewhere?
- Seems to happen roughly around the same time, e.g. around 8 in the morning
- Unsilence the "web UI: Too many 5xx HTTP responses" alert
Updated by jbaier_cz 7 months ago
- Copied from action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M added
Updated by mkittler 7 months ago
Normally I'd say it was probably just the service being restarted. However, the script was retrying for an hour if the log output can be trusted (at least that much time really elapsed, as there is also "real 66m30.787s" logged at the bottom). Note that the pending rate-limiting MR is not in place yet. We currently don't have monitoring for specific response codes, but I'm pretty sure we would have been told if OSD had really been down for that long.
The jobs themselves were passing, but they stayed scheduled for quite a long time (they only finished 3 hours ago with only ~1 minute of runtime, although this pipeline already ran 6 hours ago).
Updated by okurz 7 months ago
- Related to action #159654: high response times on osd - nginx properly monitored in grafana size:S added
Updated by okurz 7 months ago
- Status changed from Workable to Blocked
Nothing unusual on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716366921649&to=1717508559367 but we don't have HTTP response code monitoring yet. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716312636834&to=1717508669893 shows that during that time we had a high job queue, but nothing unusual nor problematic. I would like to take a look again after #159654
Updated by okurz 7 months ago
happened again this morning: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2688299
Updated by okurz 6 months ago
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
#159654 is resolved. https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-24h&to=now&viewPanel=80 shows nginx response codes, including a sometimes significant number of 502s. Also reproduced in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2732542 . We should look into preventing this. Maybe we need to tweak timeouts and buffers on the nginx side? I don't think it was that severe with Apache.
Updated by mkittler 6 months ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
The scripts CI pipeline is definitely still failing in the same way, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2733461. But I still doubt that the pipeline failures are showing a real problem - or did we really have an over 95-minute-long outage yesterday (where NGINX always returned 502)? Considering that multiple pipelines failed in the same way, the outage would probably have needed to be even longer. So I'll check whether the retry of openqa-cli is actually working.
Updated by mkittler 6 months ago
I can reproduce the problem locally:
OPENQA_CLI_RETRY_SLEEP_TIME_S=5 OPENQA_CLI_RETRIES=60 script/openqa-cli schedule --host http://localhost --monitor --param-file SCENARIO_DEFINITIONS_YAML=/hdd/openqa-devel/openqa/share/tests/example/scenario-definitions.yaml DISTRI=example VERSION=0 FLAVOR=DVD ARCH=x86_64 TEST=simple_boot BUILD=test-scheduling-and-monitor _GROUP_ID=0 CASEDIR=/hdd/openqa-devel/openqa/share/tests/example NEEDLES_DIR=%%CASEDIR%%/needles
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
Request failed, hit error 502, retrying up to 59 more times after waiting … (delay: 5; waited 65s)
Request failed, hit error 502, retrying up to 58 more times after waiting … (delay: 5; waited 131s)
Request failed, hit error 502, retrying up to 57 more times after waiting … (delay: 5; waited 196s)
Request failed, hit error 502, retrying up to 56 more times after waiting … (delay: 5; waited 261s)
Request failed, hit error 502, retrying up to 55 more times after waiting … (delay: 5; waited 326s)
…
I made it so that NGINX only returns a 502 error initially, but the CLI stays stuck in the error state. Only pressing Ctrl-C and invoking the same command again makes it schedule the job. Additionally, it only does one attempt roughly every 60 seconds (but it should be every 5 seconds).
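For illustration only - openqa-cli itself is implemented in Perl and this is not its actual code - here is a minimal Python sketch of the retry behaviour the log messages above suggest: a bounded number of attempts with a fixed sleep in between, and no lingering error state once a request succeeds. The function name and the use of ConnectionError as a stand-in for a 5xx response are assumptions made for the sketch.

import time

def request_with_retries(send_request, retries=60, sleep_s=5):
    """Call send_request() until it succeeds or the retries are exhausted."""
    waited = 0
    for attempts_left in range(retries, -1, -1):
        try:
            return send_request()         # success: leave the error state right away
        except ConnectionError as error:  # stand-in for an HTTP 5xx response
            if attempts_left == 0:
                raise                     # retries exhausted, give up for real
            print(f"Request failed ({error}), retrying up to {attempts_left} "
                  f"more times after waiting ... (delay: {sleep_s}; waited {waited}s)")
            time.sleep(sleep_s)           # expected cadence: one attempt every 5 seconds,
            waited += sleep_s             # not one roughly every 65 seconds as observed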
Updated by livdywan 6 months ago
mkittler wrote in #note-12:
The scripts CI pipeline is definitely still failing in the same way, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2733461. But I still doubt that the pipeline failures are showing a real problem - or did we really have an over 95-minute-long outage yesterday (where NGINX always returned 502)? Considering that multiple pipelines failed in the same way, the outage would probably have needed to be even longer. So I'll check whether the retry of openqa-cli is actually working.
How about bug_fetcher also running into 502 and giving up? Would that suggest the problem is not openqa-cli?
Traceback (most recent call last):
File "/usr/bin/fetch_openqa_bugs", line 39, in <module>
bugs = client.openqa_request("GET", "bugs", {"refreshable": 1, "delta": config["main"]["refresh_interval"]})["bugs"]
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 301, in openqa_request
return self.do_request(req, retries=retries, wait=wait, parse=True)
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
return self.do_request(request, retries=retries - 1, wait=newwait)
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
return self.do_request(request, retries=retries - 1, wait=newwait)
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
return self.do_request(request, retries=retries - 1, wait=newwait)
[Previous line repeated 2 more times]
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 241, in do_request
raise err
File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 216, in do_request
request.method, resp.url, resp.status_code, resp.text
openqa_client.exceptions.RequestError: ('GET', 'https://openqa.suse.de/api/v1/bugs?refreshable=1&delta=86400', 502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.21.5</center>\r\n</body>\r\n</html>\r\n')
Note I didn't file a new ticket for this one yet. This was today at 15.13 (Cron root@openqa-service (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log).
Updated by mkittler 6 months ago
Well, there is definitely something wrong with the CLI. I could easily reproduce it getting stuck after a 5xx error. I also created a fix for the problem: https://github.com/os-autoinst/openQA/pull/5714
Note that openqa-cli did not give up. It actually retried - but got stuck in an error state doing that.
I don't know about the bug fetcher. It is obviously expected that it runs into 502 errors at this point (especially since we switched to NGINX). So this shouldn't be a concern as long as we don't get alerts/e-mails about it and a retry is happening on some level (and I believe it does). (If you think that's not the case, you can file a new ticket about it.)
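For completeness, a hedged sketch of what a retry "on some level" around the bug fetcher's request could look like, so that a transient 502 during a web UI restart does not abort the whole cron run. It only assumes the openqa_client API visible in the traceback above (openqa_request() and openqa_client.exceptions.RequestError); the function name and the retry/sleep values are made up for illustration.

import time
from openqa_client.exceptions import RequestError

def fetch_bugs_with_outer_retry(client, params, attempts=5, sleep_s=30):
    """Retry the bugs query after the client's own internal retries gave up."""
    for attempt in range(1, attempts + 1):
        try:
            return client.openqa_request("GET", "bugs", params)["bugs"]
        except RequestError:     # raised once openqa_client's internal retries are exhausted
            if attempt == attempts:
                raise            # still failing, let the cron job report the error
            time.sleep(sleep_s)  # wait out the restart window before trying again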
Updated by mkittler 6 months ago
Also note that resolving this ticket does not mean we have fixed the "web UI: Too many 5xx HTTP responses" alert.
A problem in openqa-cli being revealed (which is what this ticket is about) and the 5xx alert firing are both just symptoms of the same cause - and I believe that cause is the switch to NGINX, which apparently behaves differently when we restart the web UI.
Updated by okurz 6 months ago
- Copied to action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S added
Updated by openqa_review 6 months ago
- Due date set to 2024-07-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 6 months ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
The container we use in the pipeline has already been rebuilt (and it contains commit 0611ef7ac81600980d476945ccc39bbb5f6671da).
I couldn't find a run where my change was put to the test, so I'm keeping the ticket in Feedback and lowering the prio.
Updated by okurz 6 months ago
- Due date deleted (2024-07-04)
- Status changed from Feedback to Resolved
With your fix applied, and after you verified that the corresponding change should be in use in production and that it did not immediately introduce regressions, we can just resolve this: we would be notified immediately if the solution is not enough, as we monitor these CI jobs anyway.
Updated by jbaier_cz 2 months ago
- Related to action #167833: openqa/scripts-ci pipeline fails - "jq: parse error: Invalid numeric literal at line 1, column 8 (rc: 5 Input: >>>Request failed, hit error 502" while running openqa-schedule-mm-ping-test added