action #129619
coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #108209: [epic] Reduce load on OSD
high response times on osd - simple limit of jobs running concurrently in openQA size:M
Description
Motivation
OSD suffers from high response times or alerts about HTTP responses. As this is likely due to too many jobs trying to upload results concurrently, we should introduce limits. The easiest limit is probably on the number of jobs that the scheduler assigns to workers, to prevent too many from running in parallel.
Acceptance criteria
- AC1: openQA configuration options can limit the number of jobs that will be picked up at once
- AC2: By default there is no limit
Suggestions
- Look into the scheduler code, likely in lib/OpenQA/Scheduler/Model/Jobs.pm. Maybe it is possible to simply not assign any jobs to workers based on a config setting, if one is defined (a rough sketch follows after this list)
- Confirm in production, e.g. try it out on OSD
- Come up with a good limit for OSD
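
As a rough illustration of the first suggestion, here is a minimal sketch with invented names (not the actual openQA scheduler code): the limit would simply cap how many of the scheduled jobs get assigned per scheduler tick, with "no limit" as the default.

use strict;
use warnings;

# Invented illustration only: cap how many jobs get assigned per scheduler
# tick so that running + newly assigned never exceeds the configured limit.
sub jobs_to_assign {
    my ($limit, $running, @scheduled) = @_;
    return @scheduled unless $limit;    # no limit configured (the default)
    my $free = $limit - $running;
    return () if $free <= 0;            # already at or above the limit
    my $take = $free < @scheduled ? $free : scalar @scheduled;
    return @scheduled[0 .. $take - 1];
}

# Example: limit 250, 245 jobs running, 20 jobs scheduled -> only 5 assigned
my @assign = jobs_to_assign(250, 245, 1 .. 20);
print scalar(@assign), " job(s) would be assigned\n";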
Further details
- By default there should be "no limit", because otherwise admins and users might be surprised that jobs are limited even though they never configured anything
Out of scope
- The type of workers or type of jobs doesn't matter. Of course jobs with 10k job modules are heavier, but here we really focus on the number of jobs
Updated by okurz over 1 year ago
- Copied from action #129487: high response times on osd - Limit the number of concurrent job upload handling on webUI side. Can we use a semaphore or lock using the database? size:M added
Updated by okurz over 1 year ago
- Subject changed from high response times on osd - simple limit of jobs running concurrently to high response times on osd - simple limit of jobs running concurrently size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by kraih over 1 year ago
Jan found an Nginx feature that would probably work for limiting concurrent uploads, given the right location settings for the API endpoint. Log file upload retries should just work if rate limited; asset uploads are less clear, as there is a small possibility that they could get lost. So this needs to be tested.
Updated by kraih over 1 year ago
I'll do some local experiments to figure out the right nginx settings. For production deployment we'll have to consider other factors too, such as logging with rotation and TLS certificates. So there will probably be a follow-up ticket for proper nginx deployment, if this approach works out for the rate limiting.
Updated by openqa_review over 1 year ago
- Due date set to 2023-06-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
- Blocks action #129481: Try to *reduce* number of apache workers to limit concurrent requests causing high CPU usage added
Updated by kraih over 1 year ago
The nginx connection limits appear to work exactly as we expected and are pretty straightforward to configure:
# Here we create two connection counters, one for all incoming
# connections, and one per IP address
limit_conn_zone $server_name zone=all:10m;
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    listen 8080;
    server_name localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /tests {
        # Here we limit the actual number of connections
        limit_conn all 4;
        limit_conn addr 2;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526/tests";
    }

    location / {
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }
}
For testing purposes I've put limits only on /tests: 4 connections overall from all IP addresses, and 2 connections per individual IP address. And this config does exactly that.
Updated by kraih over 1 year ago
And this one should do the trick for specifically rate limiting artefact uploads:
limit_conn_zone $server_name zone=servers:10m;

server {
    listen 8080;
    server_name localhost;
    root /usr/share/openqa/public;
    client_max_body_size 0;

    location /api/v1/ws/ {
        proxy_pass http://[::1]:9527;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://[::1]:9528;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location ~ ^/jobs/[0-9]+/artefact$ {
        limit_conn servers 50;
        proxy_set_header X-Rate-Limit-Test "true";
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }

    location / {
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass "http://[::1]:9526";
    }
}
With the rate limiting out of the way, now it is time to figure out all the other settings we would need for live deployment on O3/OSD.
Updated by kraih over 1 year ago
For starters I'm setting up Nginx on O3 in parallel to the current Apache2. Then we can do a switchover on a not-so-busy day to see if everything works. There is a whole lot of small workarounds in the Apache config, so there will probably be some minor breakage, but nothing we can't fix quickly.
Updated by kraih over 1 year ago
Nginx is now running on O3 on port 8080 and should work as an Apache replacement. The only remaining question is how we go about testing it in production without causing too serious disruptions.
Updated by kraih over 1 year ago
I've set it up so that starting and stopping the apache2 and nginx services should be enough to switch between the two reverse proxies.
$ cat /etc/nginx/conf.d/openqa.conf
server {
    listen 80;
    server_name openqa.opensuse.org openqa.infra.opensuse.org;
    root /usr/share/openqa/public;
    client_max_body_size 0;
    client_body_buffer_size 64k;
    client_header_buffer_size 4k;

    location /nginx_status {
        stub_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }

    location /assets {
        alias /var/lib/openqa/share/factory;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /image {
        alias /var/lib/openqa/images;
        tcp_nopush on;
        sendfile on;
        sendfile_max_chunk 1m;
    }

    location /api/v1/ws/ {
        proxy_pass http://127.0.0.1:9527;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location /liveviewhandler/ {
        proxy_pass http://127.0.0.1:9528;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }

    location / {
        proxy_pass "http://127.0.0.1:9526";
        tcp_nodelay on;
        proxy_read_timeout 900;
        proxy_send_timeout 900;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host:$server_port;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto "https";
    }

    access_log /space/logs/nginx/openqa.access_log;
    error_log /space/logs/nginx/openqa.error_log;
}
Edit: Some more tuning was required on O3. Enabling sendfile for asset downloads turned out to be extremely important. Without it the whole web UI felt sluggish, because downloads were slowing everything else down.
Updated by kraih over 1 year ago
I've also set up logrotate based on the Apache config.
/space/logs/nginx/openqa.access_log /space/logs/nginx/access_log {
    compress
    dateext
    delaycompress
    maxage 365
    rotate 10
    size=+4096k
    notifempty
    missingok
    create 644 root root
    sharedscripts
    postrotate
        systemctl reload nginx.service
        sleep 60
    endscript
}

/space/logs/nginx/openqa.error_log /space/logs/nginx/error_log {
    compress
    dateext
    delaycompress
    maxage 365
    rotate 10
    size=+4096k
    notifempty
    missingok
    create 644 root root
    sharedscripts
    postrotate
        systemctl reload nginx.service
        sleep 60
    endscript
}
Updated by okurz over 1 year ago
- Related to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Updated by okurz over 1 year ago
- Related to action #130312: [tools] URL listing TW snapshots (and the changes therein), has stopped working added
Updated by okurz over 1 year ago
Sounds great! Though I suggest focusing on the nginx-related work in #129490 and focusing this ticket on a limit in pure openQA. In the end we might not want to continue this feature work in openQA if nginx already helps us achieve the same, as you have already found out.
Updated by livdywan over 1 year ago
okurz wrote:
Sounds great! Though I suggest focusing on the nginx-related work in #129490 and focusing this ticket on a limit in pure openQA. In the end we might not want to continue this feature work in openQA if nginx already helps us achieve the same, as you have already found out.
That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do it the other way around.
Updated by okurz over 1 year ago
cdywan wrote:
That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do it the other way around.
I don't see that it makes little sense to do what this ticket suggests. But in any case please make the work explicit and transparent by using #129490 for nginx-related work.
Updated by livdywan over 1 year ago
okurz wrote:
cdywan wrote:
That is exactly what we discussed and what Sebastian is doing here. It makes little sense to do it the other way around.
I don't see that it makes little sense to do what this ticket suggests. But in any case please make the work explicit and transparent by using #129490 for nginx-related work.
I was talking about looking into nginx first ;-) Not sure how the tickets got mixed up again. We also had some confusion due to #129487. It seems we need to take better care of tickets with overlapping goals.
Updated by livdywan over 1 year ago
- Subject changed from high response times on osd - simple limit of jobs running concurrently size:M to high response times on osd - simple limit of jobs running concurrently in openQA size:M
- Status changed from In Progress to Blocked
Let's swap the tickets accordingly. #129490 is the one to be evaluated first, hence blocking on that.
Updated by kraih over 1 year ago
okurz wrote:
In the end we might not want to continue this feature work in openQA if nginx already helps us achieve the same, as you have already found out.
To sum up the situation so far: Nginx performs very well in production on O3 (with the right tuning), and it does have all the rate/connection limiting features we need for fine-grained control over worker uploads. So it would make sense to rephrase the ACs of this ticket and to block it on #129490 as a prerequisite for connection limiting.
Updated by okurz over 1 year ago
kraih wrote:
So it would make sense to rephrase the ACs of this ticket and to block it on #129490 as a prerequisite for connection limiting.
I would rather put it like this: if we manage to use nginx to do everything we want to ensure that no openQA instance is overloaded, then we don't need this openQA feature at all. It might still be slightly beneficial in non-nginx scenarios, if there are any useful ones.
Updated by okurz over 1 year ago
- Due date deleted (2023-06-14)
Discussed in weekly unblock 2023-06-07 and we will just wait for the nginx results before re-evaluating.
Updated by livdywan over 1 year ago
- Subject changed from high response times on osd - simple limit of jobs running concurrently in openQA size:M to high response times on osd - simple limit of jobs running concurrently in openQA
- Status changed from Blocked to New
- Assignee deleted (kraih)
okurz wrote:
first #131024
The blocker was resolved. I'm resetting the ticket so we can re-evaluate what we want here.
Updated by okurz over 1 year ago
- Subject changed from high response times on osd - simple limit of jobs running concurrently in openQA to high response times on osd - simple limit of jobs running concurrently in openQA size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by tinita over 1 year ago
https://github.com/os-autoinst/openQA/pull/5276 Limit number of running jobs per webui instance
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita over 1 year ago
https://github.com/os-autoinst/openQA/pull/5276 merged.
Another suggestion was to add a notice to the "All tests" page if the number of running jobs is currently at the limit, so instead of the normal "56 jobs are running" users would see something like "250 jobs are running (limit of running jobs reached)"
Updated by okurz over 1 year ago
tinita wrote:
Another suggestion was to add a notice to the "All tests" page if the number of running jobs is currently at the limit, so instead of the normal "56 jobs are running" users would see something like "250 jobs are running (limit of running jobs reached)"
How about
"250 jobs are running (limited by server config)"
Updated by tinita over 1 year ago
https://github.com/os-autoinst/openQA/pull/5279 Show max running jobs on /tests page
Updated by tinita over 1 year ago
So I investigated the problem of max_running_jobs not always being set to the default value, causing warnings.
That is actually a problem with the tests, and it wouldn't happen in the production code.
The problem is that apparently OpenQA::Setup::read_config is not always called when we create app instances in tests.
I think that's unfortunate, because we have to compensate for that by ensuring a default value in several places in the code, although that should be done by OpenQA::Setup in just one place.
It would be good to make sure that tests read the config and fill in defaults.
I will create a ticket, because after a quick look it seems to me that it's not fixable very quickly: #134114
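
The "defaults in exactly one place" idea, sketched here without openQA specifics (all names below are invented for illustration and are not the actual OpenQA::Setup API): read the configuration once, merge in the defaults there, and let the rest of the code trust the merged hash instead of re-defaulting ad hoc.

use strict;
use warnings;

# Invented example of the "defaults in exactly one place" pattern;
# not the actual OpenQA::Setup code.
my %defaults = (max_running_jobs => 0);    # 0 meaning "no limit"

sub read_config_with_defaults {
    my ($raw) = @_;                        # e.g. the parsed ini file contents
    return { %defaults, %{ $raw // {} } }; # user values override defaults
}

my $config = read_config_with_defaults({});    # nothing configured -> default applies
print "max_running_jobs: $config->{max_running_jobs}\n";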
Updated by tinita over 1 year ago
- Status changed from In Progress to Feedback
I set max_running_jobs to 250 on OSD and will monitor the job queue and CPU load.
Updated by tinita over 1 year ago
- Status changed from Feedback to In Progress
Somehow the limit had no effect. Will test locally...
Updated by tinita over 1 year ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/5285 Fix scheduler getting max_running_jobs config
So we have to use OpenQA::App->singleton->config. I had used $self->{config} because I found that code in the same module already.
This also explains some of the problems with the defaults. I will look into that as part of #134114.
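
A simplified sketch of the difference behind PR #5285; only the two access patterns come from the comment above, the exact config key path is an assumption, and the snippet presumes it runs inside an openQA process where the application singleton is set.

use strict;
use warnings;
use OpenQA::App;

# Broken variant in the scheduler model: the model object's {config} slot is
# not populated in this code path, so the limit read this way is silently undef:
#   my $limit = $self->{config}->{scheduler}{max_running_jobs};   # key path assumed

# Working variant: the application singleton carries the config read by OpenQA::Setup:
my $limit = OpenQA::App->singleton->config->{scheduler}{max_running_jobs};   # key path assumed
print defined $limit ? "max_running_jobs: $limit\n" : "no limit configured\n";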
Updated by tinita over 1 year ago
https://github.com/os-autoinst/openQA/pull/5285 merged and deployed on OSD, so the limit should now be effective.
https://github.com/os-autoinst/openQA/pull/5287 Remove defaults, should be ensured by OpenQA::Setup already - merged and deployed on O3.
Now monitoring what's happening.
Updated by tinita about 1 year ago
- File max_running_jobs_grafana_job_queue.png max_running_jobs_grafana_job_queue.png added
- Due date changed from 2023-08-23 to 2023-08-30
As discussed, the current implementation was too naive; I was just assuming that the schedule function would only get one job (or a group of jobs depending on each other) as a parameter.
What is happening now is the following:
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1692746389833&to=1692749719627
As soon as $running (the number of jobs in an EXECUTION state) is below $limit, the schedule function will assign all currently scheduled jobs (up to a maximum of MAX_JOB_ALLOCATION, 80 by default), limited by the free workers of course.
Then it will not schedule anything else until $running drops below $limit again.
The current logic is not what was intended.
It is not that bad though, because the newly assigned jobs (let's say 80) are just starting, and then one job after the other will finish, while no new jobs are assigned. That means the number of jobs in the phase of uploading at the same time will not be that high.
We could say that together with the MAX_JOB_ALLOCATION limit we are probably fine; however, we might want max_running_jobs to be a hard limit (with the exception of parallel clusters, which should be assigned together).
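
Schematically, the behaviour described above looks like this; a toy model, not the real scheduler code, with names mirroring the comment:

use strict;
use warnings;

use constant MAX_JOB_ALLOCATION => 80;

# Toy model of the first implementation: the limit only gates whether a
# scheduler tick assigns anything at all, not how much it assigns.
sub naive_tick {
    my ($running, $limit, $scheduled) = @_;
    return 0 if $running >= $limit;    # the gate
    return $scheduled < MAX_JOB_ALLOCATION ? $scheduled : MAX_JOB_ALLOCATION;
}

# 249 running, limit 250, 500 scheduled: 80 more get assigned -> 329 running
print naive_tick(249, 250, 500), " jobs assigned this tick\n";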
Updated by tinita about 1 year ago
- Status changed from Feedback to In Progress
We have now decided that we want max_running_jobs to be a hard limit, even if the current combination of max_running_jobs and MAX_JOB_ALLOCATION might actually be the best usage of resources, because it's harder to explain to users what exactly is happening. (And MAX_JOB_ALLOCATION would have to become a configuration variable to be easily adjustable, so it would also require a change.)
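
For contrast, a toy model of the hard limit that was decided on: never assign more than the remaining headroom below max_running_jobs. Again, this is an illustration rather than the real scheduler code.

use strict;
use warnings;

use constant MAX_JOB_ALLOCATION => 80;

# Toy model of the hard limit: never assign more than the remaining headroom
# below max_running_jobs. Parallel clusters (assigned as a whole) are the
# stated exception and are not modelled here.
sub hard_limit_tick {
    my ($running, $limit, $scheduled) = @_;
    my $headroom = $limit - $running;
    $headroom = 0 if $headroom < 0;
    my $assign = $scheduled;
    $assign = MAX_JOB_ALLOCATION if $assign > MAX_JOB_ALLOCATION;
    $assign = $headroom          if $assign > $headroom;
    return $assign;
}

# Same situation as before: 249 running, limit 250 -> only 1 job assigned
print hard_limit_tick(249, 250, 500), " job(s) assigned this tick\n";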
Updated by tinita about 1 year ago
- Due date changed from 2023-08-30 to 2023-09-06
Due to putting out fires and other important tasks - bumping due date
Updated by tinita about 1 year ago
- Status changed from In Progress to Workable
I'm unassigning myself.
No programming has been possible for the last two weeks, only putting out fires, and it seems to go on.
Updated by tinita about 1 year ago
I was wondering why the schedule method takes two additional parameters, as I can't find any code under lib that passes these:
sub schedule ($self, $allocated_workers = {}, $allocated_jobs = {}) {
I found https://github.com/os-autoinst/openQA/pull/3741/files and it looks like the parameters were added to be able to check the contents of those hashes (or actually only $allocated_workers) in a unit test.
Since I'm extracting _allocate_jobs, this might not be necessary anymore, and I can remove those parameters.
Updated by tinita about 1 year ago
- Due date changed from 2023-09-06 to 2023-09-13
Updated by okurz about 1 year ago
- Related to action #134927: OSD throws 503, unresponsive for some minutes size:M added
Updated by tinita about 1 year ago
Still working on the test.
One problem is that 04-scheduler.t so far only had one worker, so there are no tests yet that deal with more than one job assignment.
Updated by tinita about 1 year ago
Ready for review: https://github.com/os-autoinst/openQA/pull/5289 Make max_running_jobs a hard limit
Updated by okurz about 1 year ago
PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.
Updated by livdywan about 1 year ago
okurz wrote in #note-51:
PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.
I would lean towards a follow-up ticket here as this has already been in the queue for a bit and probably won't be done today... of course we can discuss this in the Unblock.
Updated by okurz about 1 year ago
- Status changed from Feedback to In Progress
livdywan wrote in #note-52:
okurz wrote in #note-51:
PR approved. As mentioned in https://github.com/os-autoinst/openQA/pull/5289#issuecomment-1713488199 please look into the runtime improvement of the unit test and further refactoring of the code as discussed.
I would lean towards a follow-up ticket here as this has already been in the queue for a bit and probably won't be done today... of course we can discuss this in the Unblock.
No, refactoring and cleanup with no user-facing benefit must be part of the original work. There is no better time for refactoring than now. Of course as a team we should strive to support each other as much as possible to bring this to conclusion as soon as possible.
Updated by okurz about 1 year ago
- Due date changed from 2023-09-13 to 2023-09-29
- Priority changed from High to Normal
The feature is effective in production and we can follow up with the unit test improvements in the following days. Bumping the due date due to unforeseen distractions I had to put on the team's shoulders over the past days.
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5306 scheduler: Log statistics of rejected jobs (#135578)
Updated by okurz about 1 year ago
- Due date deleted (2023-09-29)
- Status changed from Feedback to Blocked
Blocked by #135632
Updated by okurz about 1 year ago
- Related to action #135632: "Mojo::File::spurt is deprecated in favor of Mojo::File::spew" breaking os-autoinst OBS build and osd-deployment size:M added
Updated by tinita about 1 year ago
- Status changed from Blocked to Feedback
Deployed yesterday. Excerpts from /var/log/openqa_scheduler:
OSD:
[2023-09-25T10:59:56.015244+02:00] [debug] [pid:9919] Skipping 74 jobs because of no free workers for requested worker classes (qemu_ppc64le,tap:22,hmc_ppc64le-1disk:17,qemu_ppc64le-large-mem,tap:13,qemu_x86_64,tap,worker31:8,virt-mm-64bit-ipmi:4,64bit-ipmi-large-mem:3,openqaworker16,qemu_x86_64,tap:3,64bit-ipmi-amd-zen3:1,generalhw_RPi3B:1,generalhw_RPi3B+:1,generalhw_RPi4:1)
O3:
[2023-09-25T09:00:39.060274Z] [debug] [pid:23000] Skipping 23 jobs because of no free workers for requested worker classes (qemu_ppc64le:15,qemu_aarch32,tap:6,s390x-zVM-vswitch-l2,tap:2)
Updated by tinita about 1 year ago
https://github.com/os-autoinst/openQA/pull/5315 Reduce runtime of t/04-scheduler.t
Updated by openqa_review about 1 year ago
- Due date set to 2023-10-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 year ago
https://github.com/os-autoinst/openQA/pull/5315 merged, what's next/missing?