action #160877

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

[alert] Scripts CI pipeline failing due to osd yielding 502 size:M

Added by jbaier_cz about 2 months ago. Updated 26 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-05-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

We have a case where https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2649768 fails due to:

Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 70s)
{"blocked_by_id":null,"id":14429107,"result":"none","state":"scheduled"}
Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 80s)
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
...
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.21.5</center>
</body>
</html>

This also happened again: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2655477

Did we manage to DoS the server? Do we need to tweak nginx even more?

Suggestions

  • We're already retrying 60 times as is visible in the logs - more retries probably won't help (see the retry sketch after this list)
  • Maybe this could be a bug in openqa-cli ... --monitor
  • How come we didn't see issues elsewhere?
  • Seems to happen roughly around the same time, e.g. around 8 in the morning
  • Unsilence web UI: Too many 5xx HTTP responses alert
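
For illustration, the retry behaviour the pipeline relies on here (60 retries with a fixed 5-second delay, treating 5xx responses as retryable) corresponds roughly to the following Python sketch. This is not the actual openqa-cli implementation; OSD_URL, poll_job and the use of the /api/v1/jobs/<id> endpoint are just assumptions for the example.

# Simplified sketch of the expected retry behaviour (not the actual openqa-cli
# code); OSD_URL and the /api/v1/jobs/<id> endpoint are assumptions here.
import time
import requests

OSD_URL = "https://openqa.suse.de"

def poll_job(job_id, retries=60, delay=5):
    """Fetch job info, retrying transient 5xx responses with a fixed delay."""
    for attempt in range(retries + 1):      # initial try plus up to `retries` retries
        resp = requests.get(f"{OSD_URL}/api/v1/jobs/{job_id}", timeout=30)
        if resp.status_code < 500:          # anything but 5xx: stop retrying
            resp.raise_for_status()
            return resp.json()
        if attempt == retries:
            break
        print(f"Request failed, hit error {resp.status_code}, retrying up to "
              f"{retries - attempt} more times after waiting {delay}s")
        time.sleep(delay)
    raise RuntimeError(f"Gave up on job {job_id} after {retries} retries "
                       f"(last status: {resp.status_code})")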

Related issues: 3 (1 open, 2 closed)

Related to openQA Project - action #159654: high response times on osd - nginx properly monitored in grafana size:S (Resolved, jbaier_cz, 2024-04-26)

Copied from openQA Project - action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M (Resolved, tinita)

Copied to openQA Project - action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S (Blocked, okurz, 2024-05-24)

Actions #1

Updated by jbaier_cz about 2 months ago

  • Copied from action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M added
Actions #2

Updated by mkittler about 2 months ago

Normally I'd say it was probably just the service being restarted. However, the script was trying for an hour if the log output can be trusted (at least the time really elapsed, as there is also "real 66m30.787s" logged at the bottom). Note that the pending rate limiting MR is not in place yet. We currently don't have monitoring for specific response codes, but I'm pretty sure we would have been told if OSD was really down for so long.

The jobs themselves were passing but they were scheduled for quite a long time (they only finished 3 hours ago with only ~1 minute of runtime, while this pipeline already ran 6 hours ago).

Actions #3

Updated by livdywan about 2 months ago

  • Subject changed from [alert] Scripts CI pipeline failing due to osd yielding 502 to [alert] Scripts CI pipeline failing due to osd yielding 502 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz about 1 month ago

  • Assignee set to okurz
Actions #5

Updated by okurz about 1 month ago

  • Related to action #159654: high response times on osd - nginx properly monitored in grafana size:S added
Actions #6

Updated by okurz about 1 month ago

  • Status changed from Workable to Blocked

Nothing unusual on https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1716366921649&to=1717508559367 but we don't have HTTP response code monitoring yet. https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1716312636834&to=1717508669893 shows that during that time we had a high job queue, but nothing unusual or problematic. I would like to take another look after #159654

Actions #7

Updated by okurz about 1 month ago

  • Priority changed from High to Normal
Actions #8

Updated by okurz about 1 month ago

Actions #9

Updated by okurz 30 days ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

#159654 is resolved. https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-24h&to=now&viewPanel=80 shows nginx response codes, including an at times significant number of 502s. Also reproduced in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2732542 . We should look into preventing this. Maybe we need to tweak timeouts and buffers on the nginx side? I don't think it was that severe with Apache.

Actions #10

Updated by okurz 28 days ago

  • Priority changed from Normal to Urgent

Recurring alerts related to this keep appearing

Actions #11

Updated by livdywan 28 days ago

  • Description updated (diff)
Actions #12

Updated by mkittler 28 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

The scripts CI pipeline is definitely still failing in the same way, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2733461. But I still doubt that the pipeline failures are showing a real problem - or did we really have an outage of over 95 minutes yesterday (where NGINX always returned 502)? Considering that multiple pipelines failed in the same way, the outage would probably have needed to be even longer. So I'll check whether the retry of openqa-cli is actually working.

Actions #13

Updated by mkittler 28 days ago

I can reproduce the problem locally:

OPENQA_CLI_RETRY_SLEEP_TIME_S=5 OPENQA_CLI_RETRIES=60 script/openqa-cli schedule --host http://localhost --monitor --param-file SCENARIO_DEFINITIONS_YAML=/hdd/openqa-devel/openqa/share/tests/example/scenario-definitions.yaml DISTRI=example VERSION=0 FLAVOR=DVD ARCH=x86_64 TEST=simple_boot BUILD=test-scheduling-and-monitor _GROUP_ID=0 CASEDIR=/hdd/openqa-devel/openqa/share/tests/example NEEDLES_DIR=%%CASEDIR%%/needles
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
Request failed, hit error 502, retrying up to 59 more times after waiting … (delay: 5; waited 65s)
Request failed, hit error 502, retrying up to 58 more times after waiting … (delay: 5; waited 131s)
Request failed, hit error 502, retrying up to 57 more times after waiting … (delay: 5; waited 196s)
Request failed, hit error 502, retrying up to 56 more times after waiting … (delay: 5; waited 261s)
Request failed, hit error 502, retrying up to 55 more times after waiting … (delay: 5; waited 326s)
…

I set it up so that NGINX only returns a 502 error initially, but the CLI stays stuck in the error state. Only pressing Ctrl-C and invoking the same command again makes it schedule the job. Additionally, it also only does one attempt every 60 seconds (but it should be every 5 seconds).
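
The behaviour described above (one attempt only every ~60 seconds and no recovery once NGINX serves requests again) would match a monitor loop along the lines of the following sketch. This is a hypothetical Python illustration of the symptom, not the actual Perl code of openqa-cli; fetch_states, POLL_INTERVAL and RETRY_DELAY are made-up names.

# Hypothetical illustration of the described misbehaviour, not openqa-cli itself.
import time

POLL_INTERVAL = 60   # monitor poll interval
RETRY_DELAY = 5      # configured retry delay (OPENQA_CLI_RETRY_SLEEP_TIME_S) that
                     # *should* be used between retries, but is not in this sketch

def monitor_jobs_buggy(fetch_states, retries=60):
    """fetch_states() returns (list_of_job_states, error_code_or_None)."""
    retries_left = retries
    error = None
    while True:
        states, new_error = fetch_states()
        error = error or new_error   # BUG: an old error is never cleared, so a
                                     # successful response after NGINX recovers is
                                     # ignored and the loop stays in the error state
        if error:
            if retries_left == 0:
                raise RuntimeError(f"giving up after error {error}")
            retries_left -= 1
            print(f"Request failed, hit error {error}, retrying up to "
                  f"{retries_left} more times")
            time.sleep(POLL_INTERVAL)   # BUG: waits the 60 s poll interval instead
                                        # of the 5 s RETRY_DELAY
            continue
        if states and all(s == "done" for s in states):
            return states
        time.sleep(POLL_INTERVAL)

Clearing the error state as soon as a request succeeds again and sleeping only RETRY_DELAY while in the error branch would avoid both symptoms.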

Actions #14

Updated by livdywan 28 days ago

mkittler wrote in #note-12:

The scripts CI pipeline is definitely still failing in the same way, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2733461. But I still doubt that the pipeline failures are showing a real problem - or did we really have an outage of over 95 minutes yesterday (where NGINX always returned 502)? Considering that multiple pipelines failed in the same way, the outage would probably have needed to be even longer. So I'll check whether the retry of openqa-cli is actually working.

How about bug_fetcher also running into 502 and giving up? Would that suggest the problem is not openqa-cli?

Traceback (most recent call last):
  File "/usr/bin/fetch_openqa_bugs", line 39, in <module>
    bugs = client.openqa_request("GET", "bugs", {"refreshable": 1, "delta": config["main"]["refresh_interval"]})["bugs"]
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 301, in openqa_request
    return self.do_request(req, retries=retries, wait=wait, parse=True)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
    return self.do_request(request, retries=retries - 1, wait=newwait)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
    return self.do_request(request, retries=retries - 1, wait=newwait)
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 239, in do_request
    return self.do_request(request, retries=retries - 1, wait=newwait)
  [Previous line repeated 2 more times]
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 241, in do_request
    raise err
  File "/usr/lib/python3.6/site-packages/openqa_client/client.py", line 216, in do_request
    request.method, resp.url, resp.status_code, resp.text
openqa_client.exceptions.RequestError: ('GET', 'https://openqa.suse.de/api/v1/bugs?refreshable=1&delta=86400', 502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.21.5</center>\r\n</body>\r\n</html>\r\n')

Note that I didn't file a new ticket for this one yet. This was today at 15:13 (Cron root@openqa-service (date; fetch_openqa_bugs)> /tmp/fetch_openqa_bugs_osd.log).
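
For reference, the call pattern visible in the traceback (do_request calling itself with retries - 1 and an increased wait, and re-raising once the retries are exhausted) amounts to a retry scheme roughly like the sketch below. This is a simplified illustration based only on what the traceback shows, not the actual openqa_client code; in particular the factor by which the wait grows is an assumption.

# Simplified sketch of the retry scheme suggested by the traceback; not the
# actual openqa_client implementation.
import time
import requests

class RequestError(Exception):
    pass

def do_request(method, url, retries=5, wait=10):
    resp = requests.request(method, url, timeout=60)
    if resp.ok:
        return resp.json()
    if retries > 0:
        time.sleep(wait)
        # recurse with one retry fewer and a longer wait (growth factor assumed)
        return do_request(method, url, retries=retries - 1, wait=wait * 2)
    # out of retries: raise, which is what produced the RequestError shown above
    raise RequestError(method, url, resp.status_code, resp.text)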

Actions #15

Updated by mkittler 28 days ago

Well, there is definitely something wrong with the CLI. I could easily reproduce it getting stuck after a 5xx error. I also created a fix for the problem: https://github.com/os-autoinst/openQA/pull/5714

Note that openqa-cli did not give up. It actually retried - but got stuck in an error state doing that.

I don't know about the bug fetcher. It is obviously expected that it runs into 502 errors at this point (and especially since we switched to NGINX). So this shouldn't be a concern as long as we don't get alerts/e-mails about it and a retry is happening on some level (and I believe it does happen on some level). (If you think that's not the case you can file a new ticket about it.)

Actions #16

Updated by mkittler 28 days ago

Also note that resolving this ticket does not mean we have fixed the web UI: Too many 5xx HTTP responses alert.

The problem revealed in openqa-cli (which is what this ticket is about) and the 5xx alert firing are both just symptoms of the same cause - and I believe that is the switch to NGINX, which apparently behaves differently when we restart the web UI.
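
One way to verify that assumption would be a small probe that polls an endpoint while openqa-webui is being restarted and records how long NGINX answers with 502. The script below is only a hypothetical helper; the host, endpoint and timings are examples, and it is not something that exists in our repos.

# Hypothetical probe: watch response codes around an openqa-webui restart to see
# how long NGINX keeps returning 502. Host/endpoint/timings are examples only.
import time
import requests

HOST = "https://openqa.suse.de"

def probe(duration=300, interval=1):
    start = time.time()
    while time.time() - start < duration:
        try:
            code = requests.get(f"{HOST}/api/v1/jobs/overview", timeout=10).status_code
        except requests.RequestException as exc:
            code = f"no response ({exc.__class__.__name__})"
        print(f"{time.time() - start:6.1f}s  {code}")
        time.sleep(interval)

if __name__ == "__main__":
    probe()

Comparing such a trace before and after any NGINX tuning (or against the old Apache setup) would show whether the 502 window actually shrinks.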

Actions #17

Updated by okurz 28 days ago

  • Copied to action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S added
Actions #18

Updated by okurz 28 days ago

mkittler wrote in #note-16:

Also note that resolving this ticket does not mean we have fixed the web UI: Too many 5xx HTTP responses alert.

Fine. You can focus on openqa-cli here. I created #162533 for the server behavior itself.

Actions #19

Updated by openqa_review 27 days ago

  • Due date set to 2024-07-04

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by mkittler 27 days ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

The container we use in the pipeline has already been rebuilt (and it contains commit 0611ef7ac81600980d476945ccc39bbb5f6671da).

I couldn't find a run where my change was put to the test, so I'm keeping the ticket in Feedback and lowering the priority.

Actions #21

Updated by okurz 27 days ago

  • Due date deleted (2024-07-04)
  • Status changed from Feedback to Resolved

With your fix applied, and after you verified that the corresponding change should be in use in production and that it did not immediately introduce regressions, we can just resolve: we would be notified immediately if the solution is not enough since we monitor these CI jobs anyway.

Actions #22

Updated by okurz 26 days ago

  • Parent task set to #108209