action #55904
open/status updates are too heavy with external results
Description
As we generate a lot of modules, a single status update can take up to 30 seconds to answer. This needs to be improved - we should aim for smaller, manageable requests.
You can execute this on openqa-monitor:
epoch=$(date +%s); epoch=$((epoch-7000)); influx -database telegraf -execute "select response_time_us,request from apache_log where response_time_us > 19000000 and request =~ /status/ and time>$epoch*1000000000"
This will give you, e.g.:
1566573983000000000 30590665 /api/v1/jobs/3281950/status
1566573984000000000 29956535 /api/v1/jobs/3282402/status
1566573985000000000 28671861 /api/v1/jobs/3282393/status
1566573986000000000 26500427 /api/v1/jobs/3281940/status
1566573987000000000 26421768 /api/v1/jobs/3281944/status
1566573988000000000 25935256 /api/v1/jobs/3281938/status
1566573988001000000 24334786 /api/v1/jobs/3281918/status
1566573989000000000 25448723 /api/v1/jobs/3282232/status
1566573989001000000 23724981 /api/v1/jobs/3277369/status
1566573989002000000 22939661 /api/v1/jobs/3282169/status
1566573990000000000 21872689 /api/v1/jobs/3282156/status
1566573991000000000 21818731 /api/v1/jobs/3282400/status
1566573993000000000 19485082 /api/v1/jobs/3282281/status
1566574127000000000 33826524 /api/v1/jobs/3281841/status
1566574188000000000 30708568 /api/v1/jobs/3281892/status
1566575262000000000 32902187 /api/v1/jobs/3281998/status
1566575288000000000 21759388 /api/v1/jobs/3282002/status
1566575503000000000 33676646 /api/v1/jobs/3282028/status
1566575639000000000 25030512 /api/v1/jobs/3282185/status
1566575684000000000 24631893 /api/v1/jobs/3282039/status
1566575857000000000 40510643 /api/v1/jobs/3282207/status
1566575933000000000 51457756 /api/v1/jobs/3281879/status
1566576361000000000 24472420 /api/v1/jobs/3282284/status
1566576676000000000 30885742 /api/v1/jobs/3282306/status
I.e. tons of requests that are very slow - and if you check e.g. https://openqa.suse.de/tests/3282284 you can see that it inserts tons of job modules.
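For reference, the first column is an InfluxDB timestamp in nanoseconds and response_time_us is in microseconds, so these are 19-51 second requests. A quick way to turn one of those timestamps into a readable date (a minimal sketch, assuming GNU date is available on the monitor host):
ns=1566575933000000000
date -u -d "@$((ns / 1000000000))"
# should print something like: Fri Aug 23 15:58:53 UTC 2019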
Updated by kraih about 5 years ago
- Assignee set to kraih
This is related to a bug I'm investigating where records in the job_modules table are sometimes duplicated. So I'll take a look at this too.
Updated by okurz about 5 years ago
- Related to action #56591: Improve feedback for jobs in "Uploading" added
Updated by okurz about 5 years ago
- Related to action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added
Updated by kraih about 5 years ago
The new concurrency limit for the webui server should make this much less of a problem. https://github.com/os-autoinst/openQA/pull/2378
Updated by kraih almost 5 years ago
I've done a lot of profiling, and the JSON data structures we send for status updates really don't look too bad at all. The primary bottleneck we are fighting is inefficient DBIx::Class queries and abstraction. Nobody is really at fault here though; it's just the nature of ORMs. I've done some optimizations that should gain us about 30% better performance for the slowest queries (those including the initial test_order), but without custom SQL and batch inserts I don't think there is much more we can do: https://github.com/os-autoinst/openQA/pull/2390
Splitting up the test_order update into multiple requests would add quite a bit of complexity and many new opportunities for things to go wrong, especially regarding timing errors in the worker. So I don't think it's actually a good idea to go in that direction.
Given that those 30s+ requests are only performing a loop of inserts into a Postgres table (with some cascading queries triggered by DBIx::Class for module stats), there is quite a bit of performance to be gained from using custom SQL. The code would just get quite a bit uglier, but that is the tradeoff I would take if we really need more performance improvements.
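To illustrate the custom SQL / batch insert idea (a minimal sketch only: the column list, values and database name are simplified placeholders, not the actual openQA schema, worker payload or code), a whole set of modules could be written to Postgres as one multi-row statement instead of one ORM insert per module:
psql openqa -c "
  INSERT INTO job_modules (job_id, name, category, script)
  VALUES (3282284, 'boot_ltp',     'tests/kernel', 'tests/kernel/boot_ltp.pm'),
         (3282284, 'shutdown_ltp', 'tests/kernel', 'tests/kernel/shutdown_ltp.pm');
"
A single statement like this avoids the per-row round trips and the cascading queries mentioned above, at the cost of bypassing the ORM.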
Updated by livdywan almost 5 years ago
- Target version changed from Ready to Current Sprint
Updated by coolo almost 5 years ago
- Status changed from Feedback to New
- Assignee deleted (kraih)
- Target version deleted (Current Sprint)
We need to check if it's still a problem.
Updated by okurz over 4 years ago
- Category changed from Organisational to Feature requests
Still a problem. The command mentioned in the description still shows some long-running requests, e.g.
name: apache_log
time response_time_us request
---- ---------------- -------
1582619724000000000 24188469 /api/v1/jobs/3920257/status
1582621459000000000 19464431 /api/v1/jobs/3920605/status
1582621697000000000 25021240 /api/v1/jobs/3920257/status
1582623514000000000 19915238 /api/v1/jobs/3920277/status
1582623700000000000 21386320 /api/v1/jobs/3920277/status
1582626223000000000 25691448 /api/v1/jobs/3920277/status
where "3920277" is not even one of the infamous ltp test cases. These are migration test cases. https://openqa.suse.de/tests/3920605 is "ltp_openposix" so related to #55283
Updated by okurz over 4 years ago
- Priority changed from Normal to Low
- Target version set to future
With https://github.com/os-autoinst/openQA/pull/2932 this is less important, as the job details now show up regardless of heavy results. The root cause problem persists, though.