action #94111
closed
Optimize /api/v1/jobs
Description
Currently that endpoint is accessible by everyone, and without parameters it returns up to 10k jobs.
Additionally, it executes several SELECTs for each of the 10k jobs, resulting in about 40k queries.
This can take a very long time; for example, I saw 150s on o3 today. So a single HTTP request can keep the webserver and DB occupied for 150s.
Ideas:
- Test if the same optimization as in #93925 helps
- Maybe choose a lower limit than 10k?
- Restrict requests with a large limit to authenticated users
Updated by tinita over 3 years ago
- Related to action #93925: Optimize SQL on /tests/overview added
Updated by livdywan over 3 years ago
For comparison the Search route has a configurable rate limit based on a minion lock which is per user if logged in, or global otherwise. Additionally there's a configurable hard limit on the number of results. See lib/OpenQA/WebAPI/Controller/API/V1/Search.pm#L184.
Updated by tinita almost 3 years ago
- Status changed from New to In Progress
- Assignee set to tinita
- Target version changed from future to Ready
Updated by okurz almost 3 years ago
I tried to limit all queries by default with https://github.com/os-autoinst/openQA/pull/4145 but haven't progressed far
Updated by tinita almost 3 years ago
In order to reduce the large number of SQL queries, I fetched the data per table for all jobs, like in #93925.
I tested with 2000 jobs and saved up to 43% of the (http) request time. For a lower number of jobs, the difference is smaller.
I pushed a branch with my code so far.
I also tried prefetching all data in the initial select from the jobs table, but that turned out to be slower for a large number of jobs.
I will follow my first approach. Personally, I would suggest lowering the default limit (I will look into typical requests to see what number of jobs we typically have), and I would still vote to limit the maximum number for unauthenticated users.
Paging and rate limits would of course be even nicer.
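To illustrate the per-table approach described above, here is a minimal sketch in Python with an in-memory SQLite database. It is not the openQA code: the schema, the table and column names and the helper functions are hypothetical and only show the difference between one SELECT per job and one SELECT per table for all jobs.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, test TEXT);
        CREATE TABLE job_settings (job_id INTEGER, key TEXT, value TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?)", [(i, f"test{i}") for i in range(1, 6)]
    )
    conn.executemany(
        "INSERT INTO job_settings VALUES (?, ?, ?)",
        [(i, "ARCH", "x86_64") for i in range(1, 6)],
    )
    return conn


def settings_per_job(conn, job_ids):
    """Slow pattern: one SELECT per job, so N jobs cost N queries."""
    result = {}
    for job_id in job_ids:
        rows = conn.execute(
            "SELECT key, value FROM job_settings WHERE job_id = ?", (job_id,)
        ).fetchall()
        result[job_id] = dict(rows)
    return result


def settings_batched(conn, job_ids):
    """Batched pattern: a single SELECT for all jobs, grouped in memory."""
    result = {job_id: {} for job_id in job_ids}
    if not job_ids:
        return result
    # A real implementation would likely chunk very long id lists.
    placeholders = ",".join("?" * len(job_ids))
    rows = conn.execute(
        f"SELECT job_id, key, value FROM job_settings WHERE job_id IN ({placeholders})",
        list(job_ids),
    )
    for job_id, key, value in rows:
        result[job_id][key] = value
    return result
```

Both functions return the same data; the batched variant issues one query per table no matter how many jobs are requested, which is where the savings for large job counts would come from.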
Updated by tinita almost 3 years ago
I analyzed the Apache requests from March 3rd.
total: 293428
jobs api: 86050
How long did the Apache requests take?
10ms - <100ms: 35949
100ms - <1s: 30791
1s - <10s: 18254
10s - <100s: 1054
100s - <1000s: 2
To be fair, the very fast requests typically return no jobs at all, so they (and the requests with very few jobs) won't be affected.
I analyzed them (many requests were repeated), looked at the min, max and avg time and tested some of them locally.
For the longer-running ones, I was able to see an improvement of 50% or more.
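A breakdown like the one above can be produced by sorting request durations into power-of-ten buckets. The following is only a sketch: it assumes the durations are already extracted from the Apache log as milliseconds, and the function names are made up for this example.

```python
from collections import Counter

# Bucket boundaries in milliseconds, matching the labels used in this ticket.
BUCKETS = [
    (10, 100, "10ms - <100ms"),
    (100, 1_000, "100ms - <1s"),
    (1_000, 10_000, "1s - <10s"),
    (10_000, 100_000, "10s - <100s"),
    (100_000, 1_000_000, "100s - <1000s"),
]


def bucket_label(duration_ms):
    """Return the bucket label for a duration, or None if it is out of range."""
    for low, high, label in BUCKETS:
        if low <= duration_ms < high:
            return label
    return None


def histogram(durations_ms):
    """Count how many durations fall into each bucket."""
    counts = Counter()
    for duration in durations_ms:
        label = bucket_label(duration)
        if label is not None:
            counts[label] += 1
    return counts
```

Durations outside the listed ranges (below 10ms or 1000s and above) are skipped in this sketch.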
Updated by tinita almost 3 years ago
- Status changed from In Progress to Feedback
The PR was merged and deployed on Monday.
I will analyze the log files again to see if there is a difference in response time.
Updated by tinita almost 3 years ago
So far I can't find a comparable day since the deployment. The load was either too high or too low.
March 8:
10ms - <100ms: 44910
100ms - <1s: 26407
1s - <10s: 12076
10s - <100s: 146
100s - <1000s: 0
sum: 83539
March 9:
10ms - <100ms: 22765
100ms - <1s: 43971
1s - <10s: 16863
10s - <100s: 770
100s - <1000s: 0
sum: 84369
Will check tomorrow.
Updated by tinita almost 3 years ago
- Status changed from Feedback to Resolved
Ok, I looked at the average request time for jobs API requests (GET /api/v1/jobs($|\?.*)) and got these results:
2022-03-01 c: 96975 sum: 78704668 avg: 811
2022-03-02 c: 95898 sum: 80113600 avg: 835
2022-03-03 c: 86050 sum: 80575648 avg: 936
2022-03-04 c: 74823 sum: 57385550 avg: 766
2022-03-05 c: 73457 sum: 53986927 avg: 734
2022-03-06 c: 74160 sum: 48444445 avg: 653
2022-03-07 c: 79884 sum: 54675994 avg: 684 # deployed during the day
2022-03-08 c: 83539 sum: 45379482 avg: 543
#2022-03-09 c: 84369 sum: 76491174 avg: 906
#2022-03-10 c: 78570 sum: 68839436 avg: 876
2022-03-11 c: 79619 sum: 48973395 avg: 615
2022-03-12 c: 78793 sum: 51649855 avg: 655
2022-03-13 c: 77576 sum: 38979942 avg: 502
On March 9 and 10 we had a very high load in general, so I would ignore those.
So the average time is lower than before the deployment.
I also tested /api/v1/jobs without any arguments, and it was much faster (2.5 min compared to 10 min before).
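For reference, the avg column in the table above appears to be sum divided by c (the request count), rounded down. A small sketch that reproduces three of the rows; the unit of sum is not stated in this ticket, and the function name is just for illustration:

```python
# (date, c, sum) copied from three rows of the table above.
ROWS = [
    ("2022-03-03", 86050, 80575648),
    ("2022-03-04", 74823, 57385550),
    ("2022-03-08", 83539, 45379482),
]


def average(count, total):
    """Average request time per request, rounded down to an integer."""
    return total // count


for date, count, total in ROWS:
    print(date, "c:", count, "sum:", total, "avg:", average(count, total))
```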
Updated by okurz almost 3 years ago
- Related to action #107875: [alert][osd] Apache Response Time alert size:M added