/status updates are too heavy with external results
As we generate a lot of modules it can take up to 30 seconds to answer. This needs to be improved - we should aim for smaller, more manageable requests.
You can execute this on openqa-monitor:
epoch=$(date +%s); epoch=$((epoch-7000)); influx -database telegraf -execute "select response_time_us,request from apache_log where response_time_us > 19000000 and request =~ /status/ and time>$epoch*1000000000"
Will give you e.g.
1566573983000000000 30590665 /api/v1/jobs/3281950/status
1566573984000000000 29956535 /api/v1/jobs/3282402/status
1566573985000000000 28671861 /api/v1/jobs/3282393/status
1566573986000000000 26500427 /api/v1/jobs/3281940/status
1566573987000000000 26421768 /api/v1/jobs/3281944/status
1566573988000000000 25935256 /api/v1/jobs/3281938/status
1566573988001000000 24334786 /api/v1/jobs/3281918/status
1566573989000000000 25448723 /api/v1/jobs/3282232/status
1566573989001000000 23724981 /api/v1/jobs/3277369/status
1566573989002000000 22939661 /api/v1/jobs/3282169/status
1566573990000000000 21872689 /api/v1/jobs/3282156/status
1566573991000000000 21818731 /api/v1/jobs/3282400/status
1566573993000000000 19485082 /api/v1/jobs/3282281/status
1566574127000000000 33826524 /api/v1/jobs/3281841/status
1566574188000000000 30708568 /api/v1/jobs/3281892/status
1566575262000000000 32902187 /api/v1/jobs/3281998/status
1566575288000000000 21759388 /api/v1/jobs/3282002/status
1566575503000000000 33676646 /api/v1/jobs/3282028/status
1566575639000000000 25030512 /api/v1/jobs/3282185/status
1566575684000000000 24631893 /api/v1/jobs/3282039/status
1566575857000000000 40510643 /api/v1/jobs/3282207/status
1566575933000000000 51457756 /api/v1/jobs/3281879/status
1566576361000000000 24472420 /api/v1/jobs/3282284/status
1566576676000000000 30885742 /api/v1/jobs/3282306/status
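InfluxDB prints times as nanosecond epochs and response_time_us is in microseconds, which makes this output hard to scan. A small sketch (plain Python, row values taken from the first line of the output above) to turn one row into something readable:

```python
from datetime import datetime, timezone

def readable(row):
    """Turn an influx row (nanosecond epoch, response time in microseconds,
    request path) into a human-readable summary line."""
    ts_ns, us, request = row
    when = datetime.fromtimestamp(ts_ns // 10**9, tz=timezone.utc)
    return f"{when:%Y-%m-%d %H:%M:%S} UTC  {us / 1e6:.1f}s  {request}"

# first row from the output above
print(readable((1566573983000000000, 30590665, "/api/v1/jobs/3281950/status")))
```

So the very first entry already shows a status update taking roughly 30 seconds.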
I.e. tons of requests that are very slow - and if you check e.g. https://openqa.suse.de/tests/3282284 you can see that it inserts tons of job modules.
#7 Updated by kraih over 1 year ago
I've done a lot of profiling and the JSON data structures we send for status updates really don't look too bad at all. The primary bottleneck we are fighting is inefficient DBIx::Class queries and abstraction. Nobody is really at fault here though; it's just the nature of ORMs. I've done some optimizations that should gain us about 30% better performance for the slowest queries (those including the initial test_order): https://github.com/os-autoinst/openQA/pull/2390. But without custom SQL and batch inserts I don't think there is much more we can do. Splitting up the test_order update into multiple requests would add quite a bit of complexity and many new opportunities for things to go wrong, especially regarding timing errors in the worker, so I don't think it's actually a good idea to go in that direction.
Given that those 30s+ requests are mostly performing a loop of inserts into a Postgres table (with some cascading queries triggered by DBIx::Class for module stats), there is quite a bit of performance to be gained from using custom SQL. The code would get quite a bit uglier, but that is the tradeoff I would take if we really need more performance improvements.
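The per-row-versus-batch tradeoff described above can be illustrated with a minimal sketch. This uses Python's stdlib sqlite3 module purely for illustration (openQA actually talks to PostgreSQL through DBIx::Class), and the job_modules table and its columns here are simplified placeholders, not the real schema:

```python
import sqlite3

# Hypothetical, simplified stand-in for the real job modules table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_modules (job_id INTEGER, name TEXT, result TEXT)")

modules = [(3281950, f"module_{i}", "none") for i in range(1000)]

# ORM-style: one INSERT statement per module, i.e. many round trips,
# which is roughly what the loop of DBIx::Class inserts does.
for row in modules[:10]:
    conn.execute("INSERT INTO job_modules VALUES (?, ?, ?)", row)

# Batch style: one prepared statement executed for all remaining rows.
conn.executemany("INSERT INTO job_modules VALUES (?, ?, ?)", modules[10:])
conn.commit()

count = conn.execute("SELECT count(*) FROM job_modules").fetchone()[0]
print(count)
```

Both approaches store the same 1000 rows; the batch variant just avoids paying per-statement overhead (and, on a real network-attached Postgres, per-round-trip latency) for every module.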
#12 Updated by okurz about 1 year ago
- Category changed from Organisational to Feature requests
Still a problem. The command mentioned in the description still shows some long-running requests, e.g.
name: apache_log
time                response_time_us request
----                ---------------- -------
1582619724000000000 24188469 /api/v1/jobs/3920257/status
1582621459000000000 19464431 /api/v1/jobs/3920605/status
1582621697000000000 25021240 /api/v1/jobs/3920257/status
1582623514000000000 19915238 /api/v1/jobs/3920277/status
1582623700000000000 21386320 /api/v1/jobs/3920277/status
1582626223000000000 25691448 /api/v1/jobs/3920277/status
where "3920277" is not even one of the infamous ltp test cases; those are migration test cases. https://openqa.suse.de/tests/3920605 is "ltp_openposix", so related to #55283