action #130636

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

high response times on osd - Try nginx on OSD size:S

Added by livdywan 11 months ago. Updated 2 days ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Feature requests
Target version: Ready
Start date:
Due date: 2024-05-17
% Done: 0%
Estimated time:
Tags: infra

Description

Motivation

Apache in prefork mode uses a lot of resources to provide mediocre performance.

Acceptance criteria

  • AC1: Nginx has been deployed successfully on OSD
  • AC2: No alerts regarding "oh no, apache is down" ;)

Suggestions

  • Make sure there is an easy way to switch back to Apache in case something goes wrong
  • See #129490 for results from O3
  • Adapt OSD nginx config for HTTP + HTTPS (O3 only requires HTTP); a rough config sketch follows this list
  • We can prepare the deployment of nginx in parallel to apache, have it deployed and at any time decide when to switch by just disabling/enabling services accordingly. The deployment needs to consider dehydrated+nginx as well. We can switch OSD to nginx to gather realtime data before we suggest to use nginx as default in our openQA documentation and CI infrastructure.
  • Add changes to salt-states-openqa excluding monitoring
  • Ensure that we have no alerts regarding "oh no, apache is down" ;)
  • If there are any bigger issues observed then just revert and note down in follow-up tickets what needs to be solved first (to limit the ticket to size:S)
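
To illustrate the HTTP + HTTPS point above: a minimal sketch of what an nginx reverse-proxy vhost for OSD could look like. It assumes openQA's default service ports (9526 web UI, 9527 websockets, 9528 livehandler) and the default dehydrated certificate paths; the actual salt-managed configuration may differ.

    server {
        listen 80;
        listen 443 ssl;
        server_name openqa.suse.de;

        # assumption: certificates maintained by dehydrated in its default location
        ssl_certificate     /etc/dehydrated/certs/openqa.suse.de/fullchain.pem;
        ssl_certificate_key /etc/dehydrated/certs/openqa.suse.de/privkey.pem;

        # websocket server and livehandler need the Upgrade headers
        location /api/v1/ws/ {
            proxy_pass http://127.0.0.1:9527;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
        location /liveviewhandler/ {
            proxy_pass http://127.0.0.1:9528;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }

        # everything else goes to the main web UI service
        location / {
            proxy_pass http://127.0.0.1:9526;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }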

Out of scope

  • Whether Nginx's rate limiting features work for our use cases (not yet known; tracked separately in #159651)
  • Full monitoring integration

Rollback steps

  • DONE: delete from workers where host = 'linux-9lzf'; to delete my test workers

Related issues: 9 (4 open, 5 closed)

  • Related to openQA Infrastructure - action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z (Resolved, okurz, 2024-03-12)
  • Related to openQA Infrastructure - action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z (Resolved, okurz)
  • Related to openQA Infrastructure - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M (Feedback, okurz, 2024-06-09)
  • Related to openQA Infrastructure - action #160083: client gets a redirect and downloads an HTML page from microsoft instead of the proper windows .qcow2 image (Resolved, tinita, 2024-05-08)
  • Related to openQA Infrastructure - action #160171: [openQA][assets] Access to openQA assets forbidden auto_review:"Download.*curl.*error for.*http://openqa.suse.de/":retry size:S (Feedback, mkittler, 2024-05-10, due 2024-05-27)
  • Related to openQA Infrastructure - action #160239: [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx (Feedback, okurz, 2024-05-12, due 2024-05-29)
  • Copied from openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)
  • Copied to openQA Project - action #159651: high response times on osd - nginx with enabled rate limiting features size:S (Workable, 2024-04-26)
  • Copied to openQA Infrastructure - action #160367: After switch to nginx on OSD let's investigate how system performance was impacted (Resolved, okurz, 2024-05-14)
Actions #1

Updated by livdywan 11 months ago

  • Copied from action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Actions #2

Updated by okurz 11 months ago

  • Description updated (diff)
Actions #3

Updated by kraih 11 months ago

  • Description updated (diff)
Actions #4

Updated by kraih 11 months ago

During the openQA weekly we talked about this ticket and consider it a good candidate for a mob session. The main problems to solve are the Salt deployment and the SSL configuration, as well as a simple way to roll back the deployment and use Apache again in case something goes wrong.

Actions #5

Updated by kraih 11 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 11 months ago

We can prepare the deployment of nginx in parallel to apache, have it deployed and at any time decide when to switch by just disabling/enabling services accordingly. The deployment needs to consider dehydrated+nginx as well. We can switch OSD to nginx to gather realtime data before we suggest to use nginx as default in our openQA documentation and CI infrastructure.
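
As a rough sketch of that switching idea (assuming the stock openSUSE service names; on OSD the switch would of course be driven through salt-states-openqa rather than run by hand):

    # switch from Apache to nginx
    systemctl disable --now apache2
    systemctl enable --now nginx

    # roll back to Apache if something goes wrong
    systemctl disable --now nginx
    systemctl enable --now apache2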

Actions #7

Updated by okurz 2 months ago

  • Related to action #157081: OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z added
Actions #8

Updated by okurz about 2 months ago

  • Related to action #158059: OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z added
Actions #9

Updated by okurz 20 days ago

  • Tags set to infra
  • Target version changed from future to Ready

Due to repeated issues with unresponsiveness we should give this more focus and bring it onto the backlog now.

Actions #10

Updated by okurz 20 days ago

  • Copied to action #159651: high response times on osd - nginx with enabled rate limiting features size:S added
Actions #11

Updated by jbaier_cz 20 days ago · Edited

  • Subject changed from high response times on osd - Try nginx on osd with enabled load limiting or load balancing features to high response times on osd - Try nginx on OSD size:S
  • Status changed from New to Workable
Actions #12

Updated by jbaier_cz 20 days ago

  • Description updated (diff)
Actions #13

Updated by mkittler 14 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #14

Updated by openqa_review 14 days ago

  • Due date set to 2024-05-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #15

Updated by mkittler 13 days ago

It seems to generally work with the config I've already put on Slack: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1168

It uses different ports (the normal ports plus 1000). Therefore I also had to add service_port_delta = 0 to the config to make the live mode work (as it would otherwise assume a reverse-proxy-less development setup).
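
For reference, a hedged sketch of that setting (assuming it lives in the [global] section of /etc/openqa/openqa.ini; adjust to wherever the salt states render the config):

    [global]
    # with the web UI behind nginx on shifted ports, don't let openQA assume
    # a reverse-proxy-less development setup for the live view services
    service_port_delta = 0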

I'll test connecting a worker after lunch.

Actions #16

Updated by mkittler 13 days ago · Edited

I connected a worker via

HOST=https://openqa.suse.de:1443
BACKEND=qemu
WORKER_CLASS=qemu_x86_64_poo130636

and it worked (registration, picking up a job and concluding it).
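
A minimal sketch of how such a test worker can be set up, assuming the settings above go into the [global] section of /etc/openqa/workers.ini on the worker host:

    [global]
    HOST = https://openqa.suse.de:1443
    WORKER_CLASS = qemu_x86_64_poo130636

followed by starting a single instance, e.g. systemctl start openqa-worker@1.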

So I'll prepare an MR to switch the ports, which we can merge next week. EDIT: https://gitlab.suse.de/mkittler/salt-states-openqa/-/merge_requests/new?merge_request%5Bsource_branch%5D=nginx-for-real

Actions #17

Updated by livdywan 13 days ago

I also went through the web UI just to see if anything stands out and cloned a bunch of jobs; it seems fine: https://openqa.suse.de:1443/tests/overview?distri=sle&version=15-SP4&build=poo%23130636 - note that the port keeps being reset, and even the output of openqa-clone-job --repeat 100 --within-instance https://openqa.suse.de/tests/14196034 _GROUP=0 BUILD=poo#130636 gave me URLs without a port, so this may have made the manual testing less relevant.

Actions #18

Updated by mkittler 10 days ago

This test is in fact not really relevant as all of those tests probably just ran on workers that connected via Apache. But at least we know that there are no surprises with openqa-clone-job itself (if it actually honored the port).

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1169 to use NGINX for real.

I'll have to take a break in two hours, so currently there isn't a big enough window for me to merge it. I'll merge it when I get back or tomorrow.

Actions #19

Updated by livdywan 9 days ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1173 to get the same, faster asset handling as on o3

Actions #20

Updated by mkittler 9 days ago

  • Description updated (diff)

We tried to use nginx in production but it didn't work out: the openQA prefork workers quickly used a lot of CPU and everything became very slow.

Maybe this helps: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1174

I was able to connect 200 local workers simultaneously with that via nginx/HTTP and it didn't have any noticeable impact (which is good).
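
(The exact load-test setup isn't recorded here; a hypothetical way to spawn that many local worker instances would be a simple loop like this.)

    # hypothetical: spawn 200 worker instances on one host
    for i in $(seq 1 200); do
        systemctl start openqa-worker@$i
    done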

Actions #21

Updated by okurz 8 days ago

  • Related to action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M added
Actions #22

Updated by mkittler 8 days ago

  • Description updated (diff)

It looks good after https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1174 so I'll keep NGINX enabled. See my messages on Slack for details.

I also deleted my test workers from the OSD database.

Actions #23

Updated by mkittler 8 days ago

  • Status changed from In Progress to Feedback
Actions #24

Updated by mkittler 8 days ago · Edited

Even though implementing the monitoring is out of scope, we should probably at least get rid of

openqa.suse.de:
    2024-05-08T11:24:18Z E! [inputs.apache] Error in plugin: http://localhost/server-status?auto returned HTTP status 404 Not Found
    2024-05-08T11:24:21Z E! [telegraf] Error running agent: input plugins recorded 1 errors

as it makes our pipelines fail.


EDIT: MRs:
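
(The MR list above appears truncated. Conceptually the change boils down to replacing the Apache status input with Telegraf's nginx input pointed at an nginx stub_status endpoint; the paths and URLs below are assumptions, not the actual MR content.)

    # nginx side: expose a local status endpoint
    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }

and on the Telegraf side something like:

    # replaces [[inputs.apache]]
    [[inputs.nginx]]
      urls = ["http://localhost/nginx_status"]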

Actions #25

Updated by jbaier_cz 7 days ago

  • Related to action #160083: client gets a redirect and downloads an HTML page from microsoft instead of the proper windows .qcow2 image added
Actions #26

Updated by rainerkoenig 7 days ago

We also encounter strange failures when clicking on links to YAML schedules in the settings tab. Example:

https://openqa.suse.de/tests/14263283/settings/schedule/yast/maintenance/create_hdd_transactional_server_restapi.yaml

displays the following text instead of the YAML schedule:

File path: /var/lib/openqa/share/tests/sle/schedule/yast/maintenance/create_hdd_transactional_server_restapi.yaml
let mode; let path = document.getElementById('script').dataset.path; if (path && path.endsWith('.pm') || path.endsWith('.pl')) { mode = 'ace/mode/perl'; } var editor = ace.edit("script", { mode: mode, maxLines: Infinity, readOnly: true, }); editor.session.setUseWrapMode(true);
Actions #27

Updated by tinita 6 days ago

Oh, that seems to be a missing closing quote on the data-path attribute. Looking at the source:

<div class="code" id="script" data-path="/var/lib/openqa/share/tests/sle/schedule/yast/maintenance/create_hdd_transactional_server_restapi.yaml>---
...
</div>

  <script type="text/javascript">
let mode;
let path = document.getElementById('script').dataset.path;
...
</script>

</div>

Can't see how that's related to nginx, though...

Fix: https://github.com/os-autoinst/openQA/pull/5631
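
For comparison, with the missing closing quote added the rendered markup should look like this (a sketch of the expected output, not the literal diff from the PR):

    <div class="code" id="script" data-path="/var/lib/openqa/share/tests/sle/schedule/yast/maintenance/create_hdd_transactional_server_restapi.yaml">---
    ...
    </div>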

Actions #28

Updated by mkittler 4 days ago · Edited

Ok, so nothing problematic came up besides https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1180 which has already been merged.

The two monitoring MRs have been merged as well. Of course Grafana still needs to be adjusted (as the 2nd MR only covers Telegraf), but we declared this out of scope for this ticket.

EDIT: Looks like https://progress.opensuse.org/issues/160171 is related, too.

Actions #29

Updated by okurz 4 days ago

  • Related to action #160171: [openQA][assets] Access to openQA assets forbidden auto_review:"Download.*curl.*error for.*http://openqa.suse.de/":retry size:S added
Actions #30

Updated by okurz 3 days ago

  • Related to action #160239: [alert] External http responses Salt (https://openqa.suse.de/health) due to "Too many open files" after switch to nginx added
Actions #31

Updated by mkittler 2 days ago

  • Status changed from Feedback to Resolved

The switch to NGINX generally worked. We created follow-up tickets for some problems which came up.

For now it seems that NGINX provides good performance, but it may be too soon to tell whether it is an improvement.

Actions #32

Updated by okurz 2 days ago

  • Copied to action #160367: After switch to nginx on OSD let's investigate how system performance was impacted added