
action #55904

/status updates are too heavy with external results

Added by coolo over 1 year ago. Updated 12 months ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2019-08-23
Due date:
% Done: 0%
Estimated time:
Difficulty: hard

Description

As we generate a lot of modules, a /status update can take up to 30 seconds to answer. This needs to be improved; we should aim for smaller, manageable requests.

You can execute this on openqa-monitor:

epoch=$(date +%s); epoch=$((epoch-7000)); influx -database telegraf -execute "select response_time_us,request from apache_log where response_time_us > 19000000 and request =~ /status/ and time>$epoch*1000000000"

This will give you e.g.:

1566573983000000000 30590665         /api/v1/jobs/3281950/status
1566573984000000000 29956535         /api/v1/jobs/3282402/status
1566573985000000000 28671861         /api/v1/jobs/3282393/status
1566573986000000000 26500427         /api/v1/jobs/3281940/status
1566573987000000000 26421768         /api/v1/jobs/3281944/status
1566573988000000000 25935256         /api/v1/jobs/3281938/status
1566573988001000000 24334786         /api/v1/jobs/3281918/status
1566573989000000000 25448723         /api/v1/jobs/3282232/status
1566573989001000000 23724981         /api/v1/jobs/3277369/status
1566573989002000000 22939661         /api/v1/jobs/3282169/status
1566573990000000000 21872689         /api/v1/jobs/3282156/status
1566573991000000000 21818731         /api/v1/jobs/3282400/status
1566573993000000000 19485082         /api/v1/jobs/3282281/status
1566574127000000000 33826524         /api/v1/jobs/3281841/status
1566574188000000000 30708568         /api/v1/jobs/3281892/status
1566575262000000000 32902187         /api/v1/jobs/3281998/status
1566575288000000000 21759388         /api/v1/jobs/3282002/status
1566575503000000000 33676646         /api/v1/jobs/3282028/status
1566575639000000000 25030512         /api/v1/jobs/3282185/status
1566575684000000000 24631893         /api/v1/jobs/3282039/status
1566575857000000000 40510643         /api/v1/jobs/3282207/status
1566575933000000000 51457756         /api/v1/jobs/3281879/status
1566576361000000000 24472420         /api/v1/jobs/3282284/status
1566576676000000000 30885742         /api/v1/jobs/3282306/status

I.e. tons of requests that are very slow. And if you check e.g. https://openqa.suse.de/tests/3282284, you can see that it inserts tons of job modules.
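For reference when reading the influx output above: the first column is a Unix timestamp in nanoseconds and response_time_us is the response time in microseconds. A small sketch (a hypothetical helper, not part of openQA) that converts one row into a readable line:

```python
from datetime import datetime, timezone

def humanize(time_ns: int, response_time_us: int, request: str) -> str:
    """Convert one influx apache_log row into a human-readable line."""
    # time is nanoseconds since the epoch; response time is in microseconds
    when = datetime.fromtimestamp(time_ns / 1e9, tz=timezone.utc)
    seconds = response_time_us / 1e6
    return f"{when:%Y-%m-%d %H:%M:%S} UTC {seconds:5.1f}s  {request}"

# First row of the output above:
print(humanize(1566573983000000000, 30590665, "/api/v1/jobs/3281950/status"))
```

This makes it easy to see that the first listed request took about 30.6 seconds.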


Related issues

Related to openQA Project - action #56591: Improve feedback for jobs in "Uploading" (Resolved, 2019-09-08)

Related to openQA Project - action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts (New, 2019-07-31)

History

#1 Updated by kraih over 1 year ago

  • Assignee set to kraih

This is related to a bug with sometimes duplicated records in the job_modules table that I'm investigating. So I'll take a look at this too.

#2 Updated by kraih over 1 year ago

  • Status changed from New to In Progress

#3 Updated by okurz over 1 year ago

  • Related to action #56591: Improve feedback for jobs in "Uploading" added

#4 Updated by okurz over 1 year ago

  • Related to action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added

#5 Updated by kraih over 1 year ago

The new concurrency limit for the webui server should make this much less of a problem. https://github.com/os-autoinst/openQA/pull/2378

#6 Updated by kraih over 1 year ago

  • Priority changed from High to Normal

#7 Updated by kraih over 1 year ago

I've done a lot of profiling and the JSON data structures we send for status updates really don't look too bad at all. The primary bottleneck we are fighting with is inefficient DBIx::Class queries and abstraction. Nobody is really at fault here though, it's just the nature of ORMs. I've done some optimizations that should gain us about 30% better performance for the slowest queries (those including the initial test_order), but without custom SQL and batch inserts I don't think there is much we can do. https://github.com/os-autoinst/openQA/pull/2390

Splitting up the test_order update into multiple requests would add quite a bit of complexity and many new opportunities for things to go wrong, especially regarding timing errors in the worker. So I don't think it's actually a good idea to go in that direction.

Given that those 30s+ requests are only performing a loop of inserts into a Postgres table (with some cascading queries triggered by DBIx::Class for module stats), there is quite a bit of performance to be gained from using custom SQL. The code would just get quite a bit uglier. But it is the tradeoff I would take if we really need more performance improvements.
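The batch-insert tradeoff described above can be illustrated with a generic sketch. This is not openQA's actual Perl/DBIx::Class code; it uses Python and an in-memory SQLite table with an invented, simplified job_modules schema purely to contrast the two insert styles:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_modules (job_id INTEGER, name TEXT, result TEXT)")

# Hypothetical payload: one row per test module of a single job
modules = [(3281950, f"module_{i:04d}", "passed") for i in range(1000)]

# ORM-style: one INSERT statement per module (many round trips;
# this is roughly what a loop of DBIx::Class create() calls does)
for row in modules:
    conn.execute("INSERT INTO job_modules VALUES (?, ?, ?)", row)

# Custom-SQL style: one batched call for all modules
conn.executemany("INSERT INTO job_modules VALUES (?, ?, ?)", modules)

count = conn.execute("SELECT count(*) FROM job_modules").fetchone()[0]
print(count)  # 2000: both approaches inserted all 1000 modules
```

On a real Postgres server the per-row variant additionally pays network latency and any ORM-triggered cascading queries per row, which is why batching the inserts is where the performance headroom lies.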

#8 Updated by kraih over 1 year ago

  • Difficulty changed from medium to hard

#9 Updated by kraih over 1 year ago

  • Status changed from In Progress to Feedback

#10 Updated by cdywan over 1 year ago

  • Target version changed from Ready to Current Sprint

#11 Updated by coolo over 1 year ago

  • Status changed from Feedback to New
  • Assignee deleted (kraih)
  • Target version deleted (Current Sprint)

We need to check if it's still a problem.

#12 Updated by okurz about 1 year ago

  • Category changed from Organisational to Feature requests

Still a problem. The command mentioned in the description still shows some long-running requests, e.g.:

name: apache_log
time                response_time_us request
----                ---------------- -------
1582619724000000000 24188469         /api/v1/jobs/3920257/status
1582621459000000000 19464431         /api/v1/jobs/3920605/status
1582621697000000000 25021240         /api/v1/jobs/3920257/status
1582623514000000000 19915238         /api/v1/jobs/3920277/status
1582623700000000000 21386320         /api/v1/jobs/3920277/status
1582626223000000000 25691448         /api/v1/jobs/3920277/status

where "3920277" is not even one of the infamous LTP test cases; these are migration test cases. https://openqa.suse.de/tests/3920605 is "ltp_openposix", so related to #55283.

#13 Updated by okurz 12 months ago

  • Priority changed from Normal to Low
  • Target version set to future

With https://github.com/os-autoinst/openQA/pull/2932 this is less important, as the job details show up regardless of heavy results. The root-cause problem persists, though.
