action #160877

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

[alert] Scripts CI pipeline failing due to osd yielding 502 size:M

Added by jbaier_cz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee: (none)
Category: Regressions/Crashes
Target version: (none)
Start date: 2024-05-24
Due date: (none)
% Done: 0%
Estimated time: (none)

Description

Observation

We have a case where https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2649768 fails due to:

Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 70s)
{"blocked_by_id":null,"id":14429107,"result":"none","state":"scheduled"}
Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 80s)
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
...
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.21.5</center>
</body>
</html>

This also happened again: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2655477

Did we manage to DoS the server? Do we need to tweak nginx even more?

Suggestions

  • We're already retrying 60 times as is visible in the logs - more retries probably won't help
  • Maybe this could be a bug in openqa-cli ... --monitor
  • How come we didn't see issues elsewhere?
  • Seems to happen roughly around the same time, e.g. around 8 in the morning
  • Unsilence the "web UI: Too many 5xx HTTP responses" alert

Related issues 4 (0 open, 4 closed)

  • Related to openQA Project - action #159654: high response times on osd - nginx properly monitored in grafana size:S (Resolved, jbaier_cz, 2024-04-26)
  • Related to openQA Infrastructure - action #167833: openqa/scripts-ci pipeline fails - "jq: parse error: Invalid numeric literal at line 1, column 8 (rc: 5 Input: >>>Request failed, hit error 502" while running openqa-schedule-mm-ping-test (Resolved, tinita, 2024-10-07)
  • Copied from openQA Project - action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M (Resolved, tinita)
  • Copied to openQA Project - action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S (Resolved, mkittler, 2024-05-24)
Actions #1

Updated by jbaier_cz 6 months ago

  • Copied from action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M added
Actions #2

Updated by mkittler 6 months ago

Normally I'd say it was probably just the service being restarted. However, the script was trying for an hour if the log output can be trusted (at least the time really elapsed as there's also real 66m30.787s logged at the bottom). Note that the pending rate limiting MR is not in place yet. We currently don't have monitoring for specific response codes but I'm pretty sure we would have been told if OSD was really down for so long.

The jobs themselves were passing but they were scheduled for quite a long time (they only finished 3 hours ago with only ~1 minute of runtime, while this pipeline already ran 6 hours ago).

Actions #3

Updated by livdywan 6 months ago

  • Subject changed from [alert] Scripts CI pipeline failing due to osd yielding 502 to [alert] Scripts CI pipeline failing due to osd yielding 502 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 6 months ago

  • Assignee set to okurz
Actions #5

Updated by okurz 6 months ago

  • Related to action #159654: high response times on osd - nginx properly monitored in grafana size:S added
Actions #6

Updated by okurz 6 months ago

  • Status changed from Workable to Blocked

Nothing unusual on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716366921649&to=1717508559367, but we don't have HTTP response code monitoring yet. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716312636834&to=1717508669893 shows that we had a high job queue during that time, but nothing unusual or problematic. I would like to take another look after #159654.

Actions #7

Updated by okurz 6 months ago

  • Priority changed from High to Normal
Actions #8

Updated by okurz 6 months ago

Actions #9

Updated by okurz 5 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

#159654 is resolved. https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-24h&to=now&viewPanel=80 shows nginx response codes, including a sometimes significant number of 502s. Also reproduced in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2732542 . We should look into preventing this. Maybe we need to tweak timeouts and buffers on the nginx side? I don't think it was that severe with Apache.

Actions #10

Updated by okurz 5 months ago

  • Priority changed from Normal to Urgent

Recurring alerts are appearing related to this.

Actions #11

Updated by livdywan 5 months ago

  • Description updated (diff)
Actions #12

Updated by mkittler 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

The scripts CI pipeline is definitely still failing in the same way, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2733461. But I still doubt that the pipeline failures are showing a real problem - or did we really have an outage of over 95 minutes yesterday (where NGINX always returned 502)? Considering that multiple pipelines failed in the same way, the outage probably needed to be even longer. So I'll check whether the retry of openqa-cli is actually working.

Actions #13

Updated by mkittler 5 months ago

I can reproduce the problem locally:

OPENQA_CLI_RETRY_SLEEP_TIME_S=5 OPENQA_CLI_RETRIES=60 script/openqa-cli schedule --host http://localhost --monitor --param-file SCENARIO_DEFINITIONS_YAML=/hdd/openqa-devel/openqa/share/tests/example/scenario-definitions.yaml DISTRI=example VERSION=0 FLAVOR=DVD ARCH=x86_64 TEST=simple_boot BUILD=test-scheduling-and-monitor _GROUP_ID=0 CASEDIR=/hdd/openqa-devel/openqa/share/tests/example NEEDLES_DIR=%%CASEDIR%%/needles
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
Request failed, hit error 502, retrying up to 59 more times after waiting … (delay: 5; waited 65s)
Request failed, hit error 502, retrying up to 58 more times after waiting … (delay: 5; waited 131s)
Request failed, hit error 502, retrying up to 57 more times after waiting … (delay: 5; waited 196s)
Request failed, hit error 502, retrying up to 56 more times after waiting … (delay: 5; waited 261s)
Request failed, hit error 502, retrying up to 55 more times after waiting … (delay: 5; waited 326s)
…

I made it so that NGINX only returns a 502 error initially, but the CLI stays stuck in the error state. Only pressing Ctrl-C and invoking the same command again makes it schedule the job. Additionally, it also only does one attempt every 60 seconds (but it should be every 5 seconds).
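
For reference, what the retry loop is supposed to do here is issue a fresh request on every attempt and sleep only the configured delay between attempts, giving up once the retry budget is exhausted. Below is just a minimal Python sketch of that pattern using the environment variables from the command above; it is illustrative only and not the actual openqa-cli implementation (which is written in Perl), so the function name and defaults are assumptions:

# Minimal illustrative sketch only (the real openqa-cli is written in Perl);
# names and defaults here are assumptions, not taken from the actual code.
import os
import time
import urllib.error
import urllib.request

def request_with_retries(url):
    retries = int(os.environ.get("OPENQA_CLI_RETRIES", "60"))
    delay = int(os.environ.get("OPENQA_CLI_RETRY_SLEEP_TIME_S", "5"))
    for attempt in range(retries + 1):
        try:
            # build a fresh request on every attempt instead of staying stuck
            # in the error state of the first failed one
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            if error.code < 500 or attempt == retries:
                raise  # non-retryable error or retry budget exhausted
            print(f"Request failed, hit error {error.code}, retrying up to "
                  f"{retries - attempt} more times")
            time.sleep(delay)  # wait the configured 5 s, not 60 s

In the reproduction above the CLI instead kept reporting the initial 502 and only attempted once per minute, which is the stuck-error-state behaviour described here.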

Actions #14

Updated by livdywan 5 months ago

mkittler wrote in #note-12:

The scripts CI pipeline is definitely still failing in the same way, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2733461. But I still doubt that the pipeline failures are showing a real problem - or did we really have an outage of over 95 minutes yesterday (where NGINX always returned 502)? Considering that multiple pipelines failed in the same way, the outage probably needed to be even longer. So I'll check whether the retry of openqa-cli is actually working.

How about bug_fetcher also running into 502 and giving up? Would that suggest the problem is not openqa-cli?

Traceback (most recent call last):
  File "/usr/bin/fetch_openqa_bugs", line 39, in <module>
    bugs = client.openqa_request("GET", "bugs", {"refreshable": 1, "delta": config["main"]["refresh_interval"]})["bugs"]
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 301, in openqa_request
    return self.do_request(req, retries=retries, wait=wait, parse=True)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
    return self.do_request(request, retries=retries - 1, wait=newwait)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
    return self.do_request(request, retries=retries - 1, wait=newwait)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
    return self.do_request(request, retries=retries - 1, wait=newwait)
  [Previous line repeated 2 more times]
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 241, in do_request
    raise err
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 216, in do_request
    request.method, resp.url, resp.status_code, resp.text
openqa_client.exceptions.RequestError: ('GET', 'https://openqa.suse.de/api/v1/bugs?refreshable=1&delta=86400', 502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.21.5</center>\r\n</body>\r\n</html>\r\n')

Note I didn't file a new ticket for this one yet. This was today at 15:13 (Cron root@openqa-service (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log).
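
For context, the traceback reflects a bounded recursive retry: do_request calls itself with retries - 1 and an increased wait until the retries are used up, at which point the last error is re-raised and the cron job fails. A simplified Python sketch of that pattern follows; it is illustrative only, not the actual openqa_client code, and the default values are assumptions:

import time
import requests

# Simplified sketch of a bounded recursive retry as seen in the traceback;
# not the real openqa_client implementation, defaults chosen for illustration.
def do_request(url, retries=5, wait=10):
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    if retries > 0:
        time.sleep(wait)
        # recurse with one retry fewer and a longer wait, like the repeated
        # "do_request(request, retries=retries - 1, wait=newwait)" frames above
        return do_request(url, retries=retries - 1, wait=wait * 2)
    # retries exhausted: surface the last error to the caller (here: the cron job)
    raise RuntimeError(f"GET {url} failed with HTTP {response.status_code}")

So the bug fetcher did retry before giving up, which matches the point made in the next comment that a retry happens on some level.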

Actions #15

Updated by mkittler 5 months ago

Well, there is definitely something wrong with the CLI. I could easily reproduce it getting stuck after a 5xx error. I also created a fix for the problem: https://github.com/os-autoinst/openQA/pull/5714

Note that openqa-cli did not give up. It actually retried - but got stuck in an error state doing that.

I don't know about the bug fetcher. It is obviously expected that it runs into 502 errors at this point (and especially since we switched to NGINX). So this shouldn't be a concern as long as we don't get alerts/e-mails about it and a retry is happening on some level (and I believe it does happen on some level). (If you think that's not the case you can file a new ticket about it.)

Actions #16

Updated by mkittler 5 months ago

Also note that resolving this ticket does not mean we have fixed the "web UI: Too many 5xx HTTP responses" alert.

The problem in openqa-cli that was revealed (which is what this ticket is about) and the 5xx alert firing are both just symptoms of the same cause - and I believe that is the switch to NGINX, which apparently behaves differently when we restart the web UI.

Actions #17

Updated by okurz 5 months ago

  • Copied to action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S added
Actions #18

Updated by okurz 5 months ago

mkittler wrote in #note-16:

Also note that resolving this ticket does not mean we have fixed the "web UI: Too many 5xx HTTP responses" alert.

Fine. You can focus on openqa-cli here. I created #162533 for the server behavior itself.

Actions #19

Updated by openqa_review 5 months ago

  • Due date set to 2024-07-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

The container we use in the pipeline has already been rebuilt (and it has the commit 0611ef7ac81600980d476945ccc39bbb5f6671da).

I couldn't find a run where my change was put to the test, so I'm keeping the ticket in feedback and lowering the prio.

Actions #21

Updated by okurz 5 months ago

  • Due date deleted (2024-07-04)
  • Status changed from Feedback to Resolved

With your fix applied, and after you verified that the corresponding change should be used in production and that it did not immediately introduce regressions, we can just resolve this, as we would be notified immediately if the solution is not enough since we monitor these CI jobs anyway.

Actions #22

Updated by okurz 5 months ago

  • Parent task set to #108209
Actions #23

Updated by jbaier_cz about 1 month ago

  • Related to action #167833: openqa/scripts-ci pipeline fails - "jq: parse error: Invalid numeric literal at line 1, column 8 (rc: 5 Input: >>>Request failed, hit error 502" while running openqa-schedule-mm-ping-test added