coordination #92854


coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] limit overload of openQA webUI by heavy requests

Added by okurz almost 3 years ago. Updated 9 months ago.

Status: Blocked
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2021-06-12
Due date:
% Done: 73%
Estimated time: (Total: 2.00 h)

Description

Motivation

See #92770. Requests to certain routes can be quite heavy compared to other operations, e.g. /group_overview, which by default evaluates comments, including labels and issue links. This can cause high computational load, which should neither prevent openQA workers from executing tests nor shut out other users.

Ideas

  • Only allow certain routes or query parameters for logged-in users
  • Disable more costly query parameters by default, e.g. no comment, tag or label parsing
  • Use external tools to enforce rate limiting, e.g. https://www.nginx.com/blog/rate-limiting-nginx/?amp=1, or do it in the Apache proxy
  • Do more caching for heavy requests, e.g. the evaluation of comments for labels and such

Files

graph.png (563 KB), kraih, 2021-06-02 15:12

Subtasks 27 (10 open, 17 closed)

action #93925: Optimize SQL on /tests/overview (Resolved, tinita, 2021-06-12)
action #94354: Optimize /dashboard_build_results and /group_overview/* pages (Resolved, tinita, 2021-06-21)
action #94667: Optimize products/machines/test_suites API calls size:M (Resolved, kraih, 2021-06-24)
action #94705: Monitor number of SQL queries in grafana (New, 2021-06-25)
action #94753: Try out munin on o3 (Resolved, tinita, 2021-06-25)
action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API (New, 2022-05-30)
action #111770: Limit finished tests on /tests, but query configurable and show complete number of jobs size:S (Resolved, mkittler, 2022-05-30)
action #113189: Research where we need limits size:S (Resolved, mkittler, 2022-07-04)
action #114421: Add a limit where it makes sense after we have it for /tests, query configurable size:M (Resolved, robert.richardson)
action #119428: Ensure users can get all the data for limited queries, e.g. with pagination (New, 2022-11-22)
action #120841: Add pagination for GET /api/v1/assets size:M (Resolved, kraih, 2022-11-22)
action #121048: Add pagination for GET /api/v1/bugs (Resolved, kraih, 2022-11-22)
action #121099: Add pagination for GET /api/v1/jobs/<job_id:num>/comments (New)
action #121102: Add pagination for GET /api/v1/jobs size:M (Resolved, kraih)
action #121105: Add pagination for GET /api/v1/test_suites, GET /api/v1/test_suites/:id, GET /api/v1/machines, GET api/v1/machines/:id, GET /api/v1/products, GET /api/v1/products:id size:M (Resolved, kraih)
action #121108: Add pagination for GET /api/v1/workers (Resolved, kraih)
action #121111: Add pagination for GET /api/v1/job_templates (New)
action #121114: Add pagination for GET /api/v1/job_settings/jobs (New)
action #121117: Add pagination for GET /experimental/search (New)
action #119431: Inform users e.g. in the webUI if not all results are returned size:M (Workable, 2022-10-26)
action #124230: "Off-by-one" error when using the limit=1 openQA API parameter (Resolved, mkittler, 2023-02-09)
action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Resolved, livdywan, 2022-02-03)
action #110677: Investigation page shouldn't involve blocking long-running API routes size:M (Resolved, tinita, 2022-02-03)
action #110680: Overview page shouldn't allow long-running requests without limits size:M (Resolved, kraih, 2022-02-03)
action #126935: Promote /experimental/jobs/.../status as non-experimental API (New)
openQA Infrastructure - action #126941: [spike] Evaluate a move of the osd database to its own VM (New, 2023-03-30)
action #129068: Limit the number of uploadable test result steps size:M (Resolved, mkittler, 2023-04-01)

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #92770: openqa.opensuse.org down, o3 VM reachable, no failed service (Resolved, okurz, 2021-05-18 to 2021-06-02)

#1

Updated by okurz almost 3 years ago

  • Copied from action #92770: openqa.opensuse.org down, o3 VM reachable, no failed service added
#2

Updated by okurz almost 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from limit overload of openQA webUI by heavy requests to [epic] limit overload of openQA webUI by heavy requests
  • Description updated (diff)
  • Category set to Feature requests
  • Assignee deleted (okurz)
  • Parent task set to #80142
#3

Updated by okurz almost 3 years ago

  • Description updated (diff)
#4

Updated by ilausuch almost 3 years ago

Idea: Use a master/multi-slave (read-only) database setup

#5

Updated by ilausuch almost 3 years ago

Other ideas:

  • One problem with web apps is that the user has an F5 key on the keyboard (or a reload button in the browser). If a user sees that a query is taking too long, they may reload the page and relaunch the query, which makes the problem worse. This could be solved by storing a transaction hash in a cache and launching the query via RPC in a queue system (RabbitMQ). The transaction has certain attributes that can be hashed to create exactly one entry in the cache system. If the query does not yet exist in the cache, an entry is created at launch time and marked as in progress; when the process finishes, it updates the cache. That way every request checks the cache first before launching a new query against the DB. (A rough sketch of this follows after the list.)

  • Another problem is that we have historical data, and some of it is static from the moment we generate it, yet we use a relational DB that assembles the information from different tables. One strategy I have used with non-relational databases (MongoDB and Elasticsearch) is to add redundancy: store the whole query result as a single record in a "historical table" if we know it is not going to change. This improves query response times.
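
A rough sketch of the request-hash caching idea (a hypothetical Mojolicious helper using an in-process Mojo::Cache; the RabbitMQ/RPC queue part is left out and all names are illustrative, not existing openQA code):

    use Mojo::Cache;
    use Mojo::Util qw(sha1_sum);

    # Answer repeated heavy requests from a cache keyed by a hash of the request,
    # so reloading the page does not launch the same expensive query again.
    my $cache = Mojo::Cache->new(max_keys => 100);

    sub cached_heavy_query {
        my ($c, $run_query) = @_;    # $run_query: code ref doing the expensive DB work
        my $key    = sha1_sum($c->req->url->path_query);    # the "transaction hash"
        my $cached = $cache->get($key);
        return $cached if defined $cached;
        my $result = $run_query->();    # computed only once per distinct request
        $cache->set($key => $result);
        return $result;
    }

This is only per worker process and does not mark entries as "in progress", but it illustrates the cache-first lookup described above.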

#6

Updated by mkittler almost 3 years ago

  • Caching sounds great in general but obviously raises the problem of cache invalidation. If we had a real attacker (the last "DDoS attack" looked more like an accident) it would likely not help much, because the attacker would simply modify each query slightly so that for a generic cache these requests would all be different (unless we make the criteria for what counts as the same query very coarse, possibly disabling certain distinctions).
  • We don't really have historical data which is set in stone. Nobody prevents you from scheduling new jobs for an old build. Of course we could store the computed figures for a build somewhere (e.g. within a JSON file on disk) and load them directly on subsequent queries. Adding/changing/deleting a job (or a job comment with bugrefs) within that build would invalidate the figures again. That would be similar to how we display asset statistics. By the way, I wouldn't involve MongoDB. From my experience its performance is quite poor for large datasets.
  • We could also enforce a rate limit via Minion locks (as we already do for the search); see the sketch below. It would have the advantage of being easy to implement, and we could also easily differentiate between anonymous and logged-in users. However, there are likely more generic solutions which would perform better.
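
A minimal sketch of the Minion lock idea (assuming a Mojolicious controller action and a Minion instance available as $c->app->minion; route, lock name and limit are made up for illustration, this is not openQA's actual code):

    # Allow at most 5 concurrent "heavy" requests per route by (ab)using Minion locks,
    # similar to how concurrent searches are already limited.
    sub group_overview {
        my $c = shift;

        # up to 5 locks with this name may be held at once; each expires after 30 seconds
        my $guard = $c->app->minion->guard('heavy_' . $c->req->url->path, 30, {limit => 5});
        return $c->render(text => 'Too many concurrent requests, please retry later', status => 503)
            unless $guard;

        # ... the expensive build result computation runs here while $guard holds the lock ...
        $c->render(template => 'main/group_overview');
    }
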
#7

Updated by kraih almost 3 years ago

Since we don't actually know why /group_overview is so slow, I'll run some profiling on it and post the results.

#8

Updated by mkittler almost 3 years ago

I did some profiling in the past. If I remember correctly, the heavy part is the querying and number crunching for the build results (as displaying comments is now reduced to a limited number of comments). By the way, on the index page that "slow part" has been moved to an extra AJAX query.

#9

Updated by kraih almost 3 years ago

Actually, the flame graph I've attached to this comment looks quite interesting. On the circled plateau we spend a lot of time inflating DBIx::Class columns with DateTime. That is something that could definitely be optimised if we wanted to.

#10

Updated by kraih almost 3 years ago

The top 15 subroutine calls also reflect that.

Calls    P   F   Exclusive Time  Inclusive Time  Subroutine
90       1   1   34.2s           34.2s           IO::Poll::_poll (xsub)
309690   11  11  885ms           1.86s           DBIx::Class::FilterColumn::get_column
26980    1   1   689ms           1.04s           DateTime::_check_new_params
2455     1   1   678ms           678ms           DBI::st::execute (xsub)
10       1   1   640ms           19.0s           OpenQA::BuildResults::compute_build_results
11552    3   1   628ms           2.84s           DBIx::Class::ResultSet::_construct_results
27082    2   1   534ms           1.37s           DateTime::_new
356697   18  10  517ms           822ms           next::method
26920    1   1   443ms           4.08s           DateTime::Format::Builder::Parser::generic::__ANON__[DateTime/Format/Builder/Parser/generic.pm:82]
26860    1   1   419ms           884ms           DateTime::_compare
31405    4   1   414ms           414ms           DBIx::Class::Carp::CORE:regcomp (opcode)
2431     1   1   375ms           478ms           DBIx::Class::Storage::DBIHacks::_resolve_column_info
35230    32  21  359ms           7.64s           Try::Tiny::try (recurses: max depth 3, inclusive time 119ms)
330516   18  18  306ms           306ms           DBIx::Class::Row::get_column
26922    2   1   305ms           5.99s           DBIx::Class::InflateColumn::DateTime::_flate_or_fallback

Same for files ordered by exclusive time. The 689248 1.36s DBIx/Class/Storage/DBI.pm entry should be the one actually waiting for data from PostgreSQL.

Stmts  Exclusive Time  Reports  Source File
3277    34.3s   line    Mojo/Reactor/Poll.pm
2305356 1.95s   line    DateTime.pm
1121465 1.46s   line    DBIx/Class/ResultSet.pm (including 2 string evals)
689248  1.36s   line    DBIx/Class/Storage/DBI.pm
1306578 1.25s   line    SQL/Abstract/Classic.pm
713408  1.18s   line    mro.pm
654392  1.11s   line    Class/Accessor/Grouped.pm (including 182 string evals)
1519828 906ms   line    DBIx/Class/Row.pm
824492  884ms   line    Eval/Closure.pm (including 1 string eval)
460053  831ms   line    DBIx/Class/InflateColumn/DateTime.pm
96289   725ms   line    DBIx/Class/Carp.pm
355647  666ms   line    /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/BuildResults.pm
811799  664ms   line    Try/Tiny.pm
350115  632ms   line    DateTime/Format/Builder/Parser/Regex.pm
929102  597ms   line    DBIx/Class/FilterColumn.pm
704047  593ms   line    DBIx/Class/Storage/DBIHacks.pm
509024  544ms   line    DBIx/Class/ResultSource.pm
701021  469ms   line    DateTime/Format/Builder/Parser.pm
430897  424ms   line    DateTime/Format/Builder/Parser/generic.pm
429768  421ms   line    DateTime/Helpers.pm
316899  369ms   line    DBIx/Class/_Util.pm
350020  359ms   line    DateTime/Format/Pg.pm (including 1 string eval)
138333  356ms   line    DBIx/Class/Storage/DBI/Cursor.pm
400973  355ms   line    DBIx/Class/SQLMaker/ClassicExtensions.pm
292454  337ms   line    DBIx/Class/InflateColumn.pm
473769  317ms   line    Archive/Extract.pm
234161  271ms   line    Mojo/Util.pm (including 5 string evals)
208552  257ms   line    Mojo/Base.pm
161717  209ms   line    DateTime/TimeZone/Floating.pm
202072  193ms   line    Specio/Constraint/Role/Interface.pm
83257   175ms   line    /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/Utils.pm
217182  170ms   line    /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/Log.pm
208892  167ms   line    Mojo/Path.pm
184932  155ms   line    Mojo/URL.pm

You can ignore IO::Poll and Mojo::Reactor::Poll, that's just the web server mainloop waiting for requests.

#11

Updated by tinita almost 3 years ago

Regarding DateTime: I remember that I significantly optimized the board software behind perl-community.de by turning the datetime columns into integer columns with epoch seconds, and then only rendering it when finally displaying it in HTML.

That's probably not an option here.
But we could remove the datetime columns (and other columns) from the SELECT list where we don't need them.
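
A small sketch of what dropping unneeded columns from the SELECT list could look like with DBIx::Class (resultset and column names are only illustrative, not necessarily openQA's actual schema):

    # Fetch only the columns the overview needs, so the t_* datetime columns are
    # neither selected nor inflated into DateTime objects.
    my $jobs = $schema->resultset('Jobs')->search(
        { state => 'done' },
        {
            columns  => [qw(id state result group_id)],    # no t_created/t_updated here
            order_by => { -desc => 'id' },
        },
    );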

#12

Updated by ilausuch almost 3 years ago

Regarding what Tina said: the DB is certainly not the problem here, but we could also add a new column with the epoch timestamp as an int/bigint and use it for conditions, and (as Tina said) use the datetime only for queries that require it.

#13

Updated by mkittler almost 3 years ago

Before adding yet another redundant column or changing the data type, let's just check whether we can disable the automatic conversion for the columns we have, e.g. by removing InflateColumn::DateTime from load_components in the relevant OpenQA::Schema::Result::* packages. This of course means we would need to create DateTime objects manually where they are needed and possibly fix many places in the code where the t_* columns are used.
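
A rough sketch of both variants (a hypothetical result class, not openQA's real one):

    package My::Schema::Result::Jobs;    # illustrative only
    use strict;
    use warnings;
    use base 'DBIx::Class::Core';

    # Variant 1: don't load the inflation component at all, so t_* columns stay raw strings
    # (with inflation one would have: __PACKAGE__->load_components('InflateColumn::DateTime');)
    __PACKAGE__->table('jobs');
    __PACKAGE__->add_columns(qw(id state t_created t_finished));
    __PACKAGE__->set_primary_key('id');

    # Variant 2: keep inflation enabled but bypass it on hot paths via get_column()
    sub t_finished_raw {
        my ($self) = @_;
        return $self->get_column('t_finished');    # raw value, no DateTime object constructed
    }

    1;

Where a real DateTime object is still needed, something like DateTime::Format::Pg->parse_datetime($raw) can inflate it on demand.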

However, I'm wondering whether saving a few seconds here is already enough to prevent a DDoS attack/accident. (The optimization is likely a good idea regardless but wouldn't a more generic measure make more sense to tackle this issue? I suppose we'll always have slow routes.)

#14

Updated by mkittler almost 3 years ago

Here's another example of a slow route (which would be perfect to cause this issue): #93246#note-6

The point here is again that a more generic measure would make sense (and not just optimize one route and call it done). @kraih That's actually the reason why I wanted your feedback on the ticket. Maybe you know something more generic?

#15

Updated by okurz almost 3 years ago

mkittler wrote:

However, I'm wondering whether saving a few seconds here is already enough to prevent a DDoS attack/accident.

I see the "optimization" as a nice side-task that we can do, but the purpose of the ticket is, well, as the subject says, to "limit overload of openQA webUI by heavy requests", which can happen in other cases of "heavy requests" as well.

#16

Updated by tinita almost 3 years ago

We should add %D or %T (the time taken to serve the request) to our Apache access_logs.

I found out today that on o3 we don't have an access_log at all, and on osd there are two, but only one of them has %D.

If one wants to solve performance problems, gathering performance data is the first step.
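
For reference, a minimal sketch of what that could look like in the Apache configuration (format name and log path are made up; %D logs the time taken to serve the request in microseconds):

    # extend the usual combined format with %D and use it for the openQA vhost
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_duration
    CustomLog /var/log/apache2/openqa.access_log combined_duration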

#17

Updated by okurz almost 3 years ago

As I stated, any optimization is only secondary to preventing an overload.

#18

Updated by tinita almost 3 years ago

okurz wrote:

As I stated, any optimization is only secondary to preventing an overload.

Does that mean access_logs are not necessary?

How can one analyze an overload if we have no data about requests (and request times) at all?

#20

Updated by okurz almost 3 years ago

tinita wrote:

okurz wrote:

As I stated, any optimization is only secondary to preventing an overload.

Does that mean access_logs are not necessary?

How can one analyze an overload if we have no data about requests (and request times) at all?

Think about the following scenario: An attacker launches a DDoS attack on any publicly accessible route. How do we prevent that in such a situation even openQA jobs start failing with weird errors because the openQA worker communication (which also goes over HTTP) is impacted?

tinita wrote:

How about reducing MaxRequestWorkers? https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers

This will likely kill the communication to openQA workers as well

#21

Updated by tinita almost 3 years ago

okurz wrote:

tinita wrote:

How about reducing MaxRequestWorkers? https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers

This will likely kill the communication to openQA workers as well

Then the workers should get their own instance they can talk to.

#22

Updated by okurz over 2 years ago

  • Target version changed from Ready to future
#23

Updated by okurz almost 2 years ago

  • Target version changed from future to Ready
#24

Updated by okurz almost 2 years ago

  • Status changed from New to Blocked
  • Assignee set to okurz

We can block on the current, updated set of subtasks.

#25

Updated by okurz 12 months ago

  • Description updated (diff)
#26

Updated by okurz 12 months ago

  • Target version changed from Ready to future