action #160877

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

[alert] Scripts CI pipeline failing due to osd yielding 502 size:M

Added by jbaier_cz 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-05-24
Due date:
% Done: 0%
Estimated time:

Description

Observation

We have a case where https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2649768 fails due to:

Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 70s)
{"blocked_by_id":null,"id":14429107,"result":"none","state":"scheduled"}
Job state of job ID 14429107: scheduled, waiting … (delay: 10; waited 80s)
Request failed, hit error 502, retrying up to 60 more times after waiting … (delay: 5; waited 0s)
...
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.21.5</center>
</body>
</html>

This also happened again: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2655477

Did we manage to DoS the server? Do we need to tweak nginx even more?

Suggestions

  • We're already retrying 60 times as is visible in the logs - more retries probably won't help
  • Maybe this could be a bug in openqa-cli ... --monitor
  • How come we didn't see issues elsewhere?
  • Seems to happen at roughly the same time of day, e.g. around 8 in the morning
  • Unsilence the "web UI: Too many 5xx HTTP responses" alert
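
To make the first suggestion concrete, here is a minimal Python sketch of a bounded retry loop like the one visible in the log. It is not the actual scripts-ci or openqa-cli code: the function name `fetch_with_retries`, the `RetriesExhausted` exception, and the choice to treat 502 and 503 as retryable are all assumptions for illustration. Assuming the delay stays fixed at 5 seconds, 60 retries cover roughly 300 seconds of outage, so adding retries only helps if the server is unavailable for longer than that.

```python
import time
from typing import Callable, Tuple

# Assumption: only gateway-style errors are worth retrying.
RETRYABLE_STATUSES = {502, 503}


class RetriesExhausted(Exception):
    """Raised when every attempt returned a retryable status (hypothetical name)."""


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    retries: int = 60,
    delay: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns a non-retryable status.

    fetch is any callable returning (http_status, body). The defaults mirror
    the numbers in the log above (60 retries, 5 s delay); everything else is
    a hypothetical sketch.
    """
    waited = 0
    # One initial attempt plus up to `retries` further attempts.
    for remaining in range(retries, -1, -1):
        status, body = fetch()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if remaining == 0:
            break
        print(
            f"Request failed with {status}, up to {remaining} retries left "
            f"(delay: {delay}; waited {waited}s)"
        )
        sleep(delay)
        waited += delay
    raise RetriesExhausted(
        f"still failing after {retries} retries and {waited}s of waiting"
    )
```

With this shape, the total time the client tolerates is simply `retries * delay`, which is the number to compare against the observed length of the nginx 502 window.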

Related issues 4 (0 open, 4 closed)

Related to openQA Project (public) - action #159654: high response times on osd - nginx properly monitored in grafana size:S (Resolved, jbaier_cz, 2024-04-26)

Related to openQA Infrastructure (public) - action #167833: openqa/scripts-ci pipeline fails - "jq: parse error: Invalid numeric literal at line 1, column 8 (rc: 5 Input: >>>Request failed, hit error 502" while running openqa-schedule-mm-ping-test (Resolved, tinita, 2024-10-07)

Copied from openQA Project (public) - action #156625: [alert] Scripts CI pipeline failing due to osd yielding 503 - take 2 size:M (Resolved, tinita)

Copied to openQA Project (public) - action #162533: [alert] OSD nginx yields 502 responses rather than being more resilient of e.g. openqa-webui restarts size:S (Resolved, mkittler, 2024-05-24)