Project

General

Profile

Actions

action #94111

closed

Optimize /api/v1/jobs

Added by tinita over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-06-16
Due date:
% Done:

0%

Estimated time:

Description

Currently that endpoint is accessible by everyone, and without parameters it returns up to 10k jobs.

Additionally it executes several SELECTS for each of the 10k jobs, resulting in about 40k.

This can take very long, for example I saw 150s on o3 today. So with only one HTTP request you can keep the webserver and DB occupied for 150s.

Ideas:

  • Test if the same optimization as in #93925 helps
  • Maybe choose a lower limit than 10k?
  • Restrict requests with a large limit to authenticated users

Related issues 2 (0 open2 closed)

Related to openQA Project (public) - action #93925: Optimize SQL on /tests/overviewResolvedtinita2021-06-12

Actions
Related to openQA Infrastructure (public) - action #107875: [alert][osd] Apache Response Time alert size:MResolvedtinita2022-03-04

Actions
Actions #1

Updated by tinita over 3 years ago

  • Related to action #93925: Optimize SQL on /tests/overview added
Actions #2

Updated by livdywan over 3 years ago

For comparison the Search route has a configurable rate limit based on a minion lock which is per user if logged in, or global otherwise. Additionally there's a configurable hard limit on the number of results. See lib/OpenQA/WebAPI/Controller/API/V1/Search.pm#L184.

Actions #3

Updated by tinita almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to tinita
  • Target version changed from future to Ready
Actions #4

Updated by okurz almost 3 years ago

I tried to limit all queries by default with https://github.com/os-autoinst/openQA/pull/4145 but haven't progressed far

Actions #5

Updated by tinita almost 3 years ago

In order to reduce the large number of SQL queries, I fetched the data per table for all jobs, like in #93925.
I tested with 2000 jobs and saved up to 43% of the (http) request time. For a lower number of jobs, the difference is smaller.
I pushed a branch with my code so far.

I also tried prefetching all data in the initial select from the jobs table, but that showed to be slower for a large number of jobs.

I will follow my first approach, and I personally would suggest to lower the default limit (I will look into typical reqests to see what number of jobs we typically have), and I would still vote to limit the maximum number for unauthenticated users.

Paging and rate limits would of course be even nicer.

Actions #6

Updated by tinita almost 3 years ago

I analyzed the Apache requests from March 3rd.

total:    293428
jobs api:  86050

How long did the Apache requests take?
10ms  - <100ms:     35949
100ms - <1s:        30791
1s    - <10s:       18254
10s   - <100s:       1054
100s  - <1000s:         2

To be fair, the very fast requests typically return no jobs at all, so they (and the requests with very few jobs) won't be affected.
I analyzed them (many requests were repeated), looked at the min, max and avg time and tested some of them locally.
For the longer taking ones, I was able to see an improvement of 50% or more.

PR: https://github.com/os-autoinst/openQA/pull/4545

Actions #7

Updated by tinita almost 3 years ago

  • Status changed from In Progress to Feedback

PR was merged and deployed on monday.
Will analyze logfiles again to see if there is a response time difference.

Actions #8

Updated by tinita almost 3 years ago

So far I can't find a comparable day since the deployment. Either it was too high or too low load.

March 8:
10ms  - <100ms:     44910
100ms - <1s:        26407
1s    - <10s:       12076
10s   - <100s:        146
100s  - <1000s:         0
sum:                83539

March 9:
10ms  - <100ms:     22765
100ms - <1s:        43971
1s    - <10s:       16863
10s   - <100s:        770
100s  - <1000s:         0
sum:                84369

Will check tomorrow.

Actions #9

Updated by tinita almost 3 years ago

  • Status changed from Feedback to Resolved

Ok, I looked at the average request time for jobs api requests (GET /api/v1/jobs($|\?.*)) and got those results:

2022-03-01     c: 96975 sum: 78704668 avg: 811
2022-03-02     c: 95898 sum: 80113600 avg: 835
2022-03-03     c: 86050 sum: 80575648 avg: 936
2022-03-04     c: 74823 sum: 57385550 avg: 766
2022-03-05     c: 73457 sum: 53986927 avg: 734
2022-03-06     c: 74160 sum: 48444445 avg: 653
2022-03-07     c: 79884 sum: 54675994 avg: 684 # deployed during the day
2022-03-08     c: 83539 sum: 45379482 avg: 543
#2022-03-09     c: 84369 sum: 76491174 avg: 906
#2022-03-10     c: 78570 sum: 68839436 avg: 876
2022-03-11     c: 79619 sum: 48973395 avg: 615
2022-03-12     c: 78793 sum: 51649855 avg: 655
2022-03-13     c: 77576 sum: 38979942 avg: 502

On March 9 and 10 we had a very high load in general, so I would ignore those.
So the average time is lower than before the deployment.
I also tested /api/v1/jobs without any arguments, and it was much faster (2.5 min compared to 10 min before).

Actions #10

Updated by okurz almost 3 years ago

  • Related to action #107875: [alert][osd] Apache Response Time alert size:M added
Actions

Also available in: Atom PDF