action #130636

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - Try nginx on OSD size:S

Added by livdywan over 1 year ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Motivation

Apache in prefork mode uses a lot of resources to provide mediocre performance.

Acceptance criteria

  • AC1: Nginx has been deployed successfully on OSD
  • AC2: No alerts regarding "oh no, apache is down" ;)

Suggestions

  • Make sure there is an easy way to switch back to Apache in case something goes wrong
  • See #129490 for results from O3
  • Adapt OSD nginx config for HTTP + HTTPS (O3 only requires HTTP)
  • We can prepare the nginx deployment in parallel to Apache, have it deployed, and then decide at any time when to switch just by disabling/enabling the services accordingly. The deployment needs to consider dehydrated+nginx as well. We can switch OSD to nginx to gather real-time data before we suggest using nginx as the default in our openQA documentation and CI infrastructure.
  • Add changes to salt-states-openqa excluding monitoring
  • Ensure that we have no alerts regarding "oh no, apache is down" ;)
  • If there are any bigger issues observed then just revert and note down in follow-up tickets what needs to be solved first (to limit the ticket to size:S)
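
The "switch by just disabling/enabling services" idea from the suggestions above could be sketched roughly as follows. This is only an illustration, not the procedure actually used on OSD: the unit names apache2.service and nginx.service and the helper names switch_commands and switch_to are assumptions, and the dry-run default is there so nothing is changed by accident.

```python
import subprocess

# Assumed systemd unit names; adjust to the units actually present on the host.
UNITS = {"apache": "apache2.service", "nginx": "nginx.service"}


def switch_commands(target):
    """Return the systemctl commands that make `target` the only enabled web server."""
    if target not in UNITS:
        raise ValueError(f"unknown web server: {target!r}")
    other = next(name for name in UNITS if name != target)
    return [
        # Stop and disable the server we are switching away from first,
        # so that the listening ports are free for the other one.
        ["systemctl", "disable", "--now", UNITS[other]],
        ["systemctl", "enable", "--now", UNITS[target]],
    ]


def switch_to(target, dry_run=True):
    """Print the commands and, unless dry_run is set, execute them."""
    for cmd in switch_commands(target):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling switch_to("nginx") only prints the two commands; switch_to("apache", dry_run=False) would perform the rollback to Apache, assuming sufficient privileges.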

Out of scope

  • Determining whether Nginx rate limiting features work for our use cases
  • Full monitoring integration

Rollback steps

  • DONE: delete from workers where host = 'linux-9lzf'; to delete my test workers

Related issues 9 (0 open, 9 closed)

Related to openQA Infrastructure (public) - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z | Resolved | okurz | 2024-03-12

Related to openQA Infrastructure (public) - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z | Resolved | okurz

Related to openQA Infrastructure (public) - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M | Resolved | dheidler

Related to openQA Infrastructure (public) - action #160083: client gets a redirect and downloads an HTML page from microsoft instead of the proper windows .qcow2 image | Resolved | tinita | 2024-05-08

Related to openQA Infrastructure (public) - action #160171: [openQA][assets] Access to openQA assets forbidden size:S | Resolved | mkittler | 2024-05-10 | 2024-05-27

Related to openQA Infrastructure (public) - action #160239: [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx | Resolved | okurz | 2024-05-12

Copied from openQA Project (public) - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features | Resolved | kraih

Copied to openQA Project (public) - action #159651: high response times on osd - nginx with enabled rate limiting features size:S | Rejected | okurz | 2024-04-26

Copied to openQA Infrastructure (public) - action #160367: After switch to nginx on OSD let's investigate how system performance was impacted | Resolved | okurz | 2024-05-14
