coordination #92854


coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] limit overload of openQA webUI by heavy requests

Added by okurz almost 3 years ago. Updated 9 months ago.

Status: Blocked
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2021-06-12
Due date:
% Done: 73%
Estimated time: (Total: 2.00 h)

Description

Motivation

See #92770. Requests to certain routes can be quite heavy compared to other operations, e.g. /group_overview, which by default evaluates comments, including labels and issue links. This can cause high computational load, which should neither prevent openQA workers from executing tests nor shut out other users.

Ideas

  • Only allow certain routes or query parameters for logged-in users
  • Disable more costly query parameters by default, e.g. no comment, tag or label parsing
  • Use external tools to enforce rate limiting, e.g. https://www.nginx.com/blog/rate-limiting-nginx/?amp=1, or do it in the Apache proxy
  • Do more caching for heavy requests, e.g. the evaluation of comments for labels and such

Files

graph.png (563 KB), kraih, 2021-06-02 15:12

Subtasks 27 (10 open, 17 closed)

action #93925: Optimize SQL on /tests/overview (Resolved, tinita, 2021-06-12)
action #94354: Optimize /dashboard_build_results and /group_overview/* pages (Resolved, tinita, 2021-06-21)
action #94667: Optimize products/machines/test_suites API calls size:M (Resolved, kraih, 2021-06-24)
action #94705: Monitor number of SQL queries in grafana (New, 2021-06-25)
action #94753: Try out munin on o3 (Resolved, tinita, 2021-06-25)
action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API (New, 2022-05-30)
action #111770: Limit finished tests on /tests, but query configurable and show complete number of jobs size:S (Resolved, mkittler, 2022-05-30)
action #113189: Research where we need limits size:S (Resolved, mkittler, 2022-07-04)
action #114421: Add a limit where it makes sense after we have it for /tests, query configurable size:M (Resolved, robert.richardson)
action #119428: Ensure users can get all the data for limited queries, e.g. with pagination (New, 2022-11-22)
action #120841: Add pagination for GET /api/v1/assets size:M (Resolved, kraih, 2022-11-22)
action #121048: Add pagination for GET /api/v1/bugs (Resolved, kraih, 2022-11-22)
action #121099: Add pagination for GET /api/v1/jobs/<job_id:num>/comments (New)
action #121102: Add pagination for GET /api/v1/jobs size:M (Resolved, kraih)
action #121105: Add pagination for GET /api/v1/test_suites, GET /api/v1/test_suites/:id, GET /api/v1/machines, GET api/v1/machines/:id, GET /api/v1/products, GET /api/v1/products:id size:M (Resolved, kraih)
action #121108: Add pagination for GET /api/v1/workers (Resolved, kraih)
action #121111: Add pagination for GET /api/v1/job_templates (New)
action #121114: Add pagination for GET /api/v1/job_settings/jobs (New)
action #121117: Add pagination for GET /experimental/search (New)
action #119431: Inform users e.g. in the webUI if not all results are returned size:M (Workable, 2022-10-26)
action #124230: "Off-by-one" error when using the limit=1 openQA API parameter (Resolved, mkittler, 2023-02-09)
action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Resolved, livdywan, 2022-02-03)
action #110677: Investigation page shouldn't involve blocking long-running API routes size:M (Resolved, tinita, 2022-02-03)
action #110680: Overview page shouldn't allow long-running requests without limits size:M (Resolved, kraih, 2022-02-03)
action #126935: Promote /experimental/jobs/.../status as non-experimental API (New)
openQA Infrastructure - action #126941: [spike] Evaluate a move of the osd database to its own VM (New, 2023-03-30)
action #129068: Limit the number of uploadable test result steps size:M (Resolved, mkittler, 2023-04-01)

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure - action #92770: openqa.opensuse.org down, o3 VM reachable, no failed service (Resolved, okurz, 2021-05-18 to 2021-06-02)

#1

Updated by okurz almost 3 years ago

  • Copied from action #92770: openqa.opensuse.org down, o3 VM reachable, no failed service added
#2

Updated by okurz almost 3 years ago

  • Tracker changed from action to coordination
  • Subject changed from limit overload of openQA webUI by heavy requests to [epic] limit overload of openQA webUI by heavy requests
  • Description updated (diff)
  • Category set to Feature requests
  • Assignee deleted (okurz)
  • Parent task set to #80142
#3

Updated by okurz almost 3 years ago

  • Description updated (diff)
#4

Updated by ilausuch almost 3 years ago

Idea: Use a master/multi-slave (read-only) database setup

#5

Updated by ilausuch almost 3 years ago

Other ideas:

  • One problem with web apps is that the user has an F5 key on the keyboard (or a reload button in the browser). If a user sees that a query is taking too long, they may reload the page and relaunch the query, which makes the problem worse. This could be solved by storing a transaction hash in a cache and launching the query via RPC in a queue system (RabbitMQ). The transaction has certain attributes that can be hashed to create exactly one entry in the cache system. If the query does not yet exist in the cache, an entry is created at launch time and marked as in progress; when the process finishes, it updates the cache. That way every request checks the cache first before launching a new query against the DB. (A rough sketch of this follows after the list.)

  • Another problem is that we have historical data, and some of it is static from the moment we generate it, yet we use a relational DB that assembles the information from different tables. One strategy I have used with non-relational databases (MongoDB and Elasticsearch) is to add redundancy: store the whole query result as a single record in a "historical table" if we know it is not going to change. This improves query response times.
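
A rough sketch of the request-hash caching idea (a hypothetical Mojolicious helper using an in-process Mojo::Cache; the RabbitMQ/RPC queue part is left out and all names are illustrative, not existing openQA code):

    use Mojo::Cache;
    use Mojo::Util qw(sha1_sum);

    # Answer repeated heavy requests from a cache keyed by a hash of the request,
    # so reloading the page does not launch the same expensive query again.
    my $cache = Mojo::Cache->new(max_keys => 100);

    sub cached_heavy_query {
        my ($c, $run_query) = @_;    # $run_query: code ref doing the expensive DB work
        my $key    = sha1_sum($c->req->url->path_query);    # the "transaction hash"
        my $cached = $cache->get($key);
        return $cached if defined $cached;
        my $result = $run_query->();    # computed only once per distinct request
        $cache->set($key => $result);
        return $result;
    }

This is only per worker process and does not mark entries as "in progress", but it illustrates the cache-first lookup described above.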

#6

Updated by mkittler almost 3 years ago

  • Caching sounds great in general but obviously raises the problem of cache invalidation. If we had a real attacker (the last "DDoS attack" looked more like an accident) it would likely not help much, because the attacker would simply modify each query slightly so that for a generic cache these requests would all be different (unless we make the criteria for what counts as the same query very coarse, possibly disabling certain distinctions).
  • We don't really have historical data which is set in stone. Nobody prevents you from scheduling new jobs for an old build. Of course we could store the computed figures for a build somewhere (e.g. within a JSON file on disk) and load them directly on subsequent queries. Adding/changing/deleting a job (or a job comment with bugrefs) within that build would invalidate the figures again. That would be similar to how we display asset statistics. By the way, I wouldn't involve MongoDB. From my experience its performance is quite poor for large datasets.
  • We could also enforce a rate limit via Minion locks (as we already do for the search); see the sketch below. It would have the advantage of being easy to implement, and we could also easily differentiate between anonymous and logged-in users. However, there are likely more generic solutions which would perform better.
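
A minimal sketch of the Minion lock idea (assuming a Mojolicious controller action and a Minion instance available as $c->app->minion; route, lock name and limit are made up for illustration, this is not openQA's actual code):

    # Allow at most 5 concurrent "heavy" requests per route by (ab)using Minion locks,
    # similar to how concurrent searches are already limited.
    sub group_overview {
        my $c = shift;

        # up to 5 locks with this name may be held at once; each expires after 30 seconds
        my $guard = $c->app->minion->guard('heavy_' . $c->req->url->path, 30, {limit => 5});
        return $c->render(text => 'Too many concurrent requests, please retry later', status => 503)
            unless $guard;

        # ... the expensive build result computation runs here while $guard holds the lock ...
        $c->render(template => 'main/group_overview');
    }
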
#7

Updated by kraih almost 3 years ago

Since we don't actually know why /group_overview is so slow, I'll run some profiling on it and post the results.

#8

Updated by mkittler almost 3 years ago

I did some profiling in the past. If I remember correctly, the heavy part is the querying and number crunching for the build results (as displaying comments is now reduced to a limited number of comments). By the way, on the index page that "slow part" has been moved to an extra AJAX query.

#9

Updated by kraih almost 3 years ago

Actually, the flame graph I've attached to this comment looks quite interesting. On the circled plateau we spend a lot of time inflating DBIx::Class columns with DateTime. That is something that could definitely be optimised if we wanted to.

#10

Updated by kraih almost 3 years ago

The top 15 subroutine calls also reflect that.

Calls    P   F   Exclusive Time  Inclusive Time  Subroutine
90       1   1   34.2s           34.2s           IO::Poll::_poll (xsub)
309690   11  11  885ms           1.86s           DBIx::Class::FilterColumn::get_column
26980    1   1   689ms           1.04s           DateTime::_check_new_params
2455     1   1   678ms           678ms           DBI::st::execute (xsub)
10       1   1   640ms           19.0s           OpenQA::BuildResults::compute_build_results
11552    3   1   628ms           2.84s           DBIx::Class::ResultSet::_construct_results
27082    2   1   534ms           1.37s           DateTime::_new
356697   18  10  517ms           822ms           next::method
26920    1   1   443ms           4.08s           DateTime::Format::Builder::Parser::generic::__ANON__[DateTime/Format/Builder/Parser/generic.pm:82]
26860    1   1   419ms           884ms           DateTime::_compare
31405    4   1   414ms           414ms           DBIx::Class::Carp::CORE:regcomp (opcode)
2431     1   1   375ms           478ms           DBIx::Class::Storage::DBIHacks::_resolve_column_info
35230    32  21  359ms           7.64s           Try::Tiny::try (recurses: max depth 3, inclusive time 119ms)
330516   18  18  306ms           306ms           DBIx::Class::Row::get_column
26922    2   1   305ms           5.99s           DBIx::Class::InflateColumn::DateTime::_flate_or_fallback

Same for files ordered by exclusive time. The 689248 1.36s DBIx/Class/Storage/DBI.pm entry should be the one actually waiting for data from PostgreSQL.

Stmts  Exclusive Time  Reports  Source File
3277    34.3s   line    Mojo/Reactor/Poll.pm
2305356 1.95s   line    DateTime.pm
1121465 1.46s   line    DBIx/Class/ResultSet.pm (including 2 string evals)
689248  1.36s   line    DBIx/Class/Storage/DBI.pm
1306578 1.25s   line    SQL/Abstract/Classic.pm
713408  1.18s   line    mro.pm
654392  1.11s   line    Class/Accessor/Grouped.pm (including 182 string evals)
1519828 906ms   line    DBIx/Class/Row.pm
824492  884ms   line    Eval/Closure.pm (including 1 string eval)
460053  831ms   line    DBIx/Class/InflateColumn/DateTime.pm
96289   725ms   line    DBIx/Class/Carp.pm
355647  666ms   line    /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/BuildResults.pm
811799  664ms   line    Try/Tiny.pm
350115  632ms   line    DateTime/Format/Builder/Parser/Regex.pm
929102  597ms   line    DBIx/Class/FilterColumn.pm
704047  593ms   line    DBIx/Class/Storage/DBIHacks.pm
509024  544ms   line    DBIx/Class/ResultSource.pm
701021  469ms   line    DateTime/Format/Builder/Parser.pm
430897  424ms   line    DateTime/Format/Builder/Parser/generic.pm
429768  421ms   line    DateTime/Helpers.pm
316899  369ms   line    DBIx/Class/_Util.pm
350020  359ms   line    DateTime/Format/Pg.pm (including 1 string eval)
138333  356ms   line    DBIx/Class/Storage/DBI/Cursor.pm
400973  355ms   line    DBIx/Class/SQLMaker/ClassicExtensions.pm
292454  337ms   line    DBIx/Class/InflateColumn.pm
473769  317ms   line    Archive/Extract.pm
234161  271ms   line    Mojo/Util.pm (including 5 string evals)
208552  257ms   line    Mojo/Base.pm
161717  209ms   line    DateTime/TimeZone/Floating.pm
202072  193ms   line    Specio/Constraint/Role/Interface.pm
83257   175ms   line    /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/Utils.pm
217182  170ms   line    /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/Log.pm
208892  167ms   line    Mojo/Path.pm
184932  155ms   line    Mojo/URL.pm

You can ignore IO::Poll and Mojo::Reactor::Poll, that's just the web server mainloop waiting for requests.

#11

Updated by tinita almost 3 years ago

Regarding DateTime: I remember that I significantly optimized the board software behind perl-community.de by turning the datetime columns into integer columns with epoch seconds, and then only rendering it when finally displaying it in HTML.

That's probably not an option here.
But we could remove the datetime columns (and other columns) from the SELECT list where we don't need them.
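
A small sketch of what dropping unneeded columns from the SELECT list could look like with DBIx::Class (resultset and column names are only illustrative, not necessarily openQA's actual schema):

    # Fetch only the columns the overview needs, so the t_* datetime columns are
    # neither selected nor inflated into DateTime objects.
    my $jobs = $schema->resultset('Jobs')->search(
        { state => 'done' },
        {
            columns  => [qw(id state result group_id)],    # no t_created/t_updated here
            order_by => { -desc => 'id' },
        },
    );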

#12

Updated by ilausuch almost 3 years ago

Regarding what Tina said: the DB is certainly not the problem here, but we could also add a new column with the epoch timestamp as an int/bigint and use it for conditions, and (as Tina said) use the datetime only for queries that require it.

#13

Updated by mkittler almost 3 years ago

Before adding yet another redundant column or changing the data type, let's just check whether we can disable the automatic conversion for the columns we have, e.g. by removing InflateColumn::DateTime from load_components in the relevant OpenQA::Schema::Result::* packages. This of course means we would need to create DateTime objects manually where they are needed and possibly fix many places in the code where the t_* columns are used.
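
A rough sketch of both variants (a hypothetical result class, not openQA's real one):

    package My::Schema::Result::Jobs;    # illustrative only
    use strict;
    use warnings;
    use base 'DBIx::Class::Core';

    # Variant 1: don't load the inflation component at all, so t_* columns stay raw strings
    # (with inflation one would have: __PACKAGE__->load_components('InflateColumn::DateTime');)
    __PACKAGE__->table('jobs');
    __PACKAGE__->add_columns(qw(id state t_created t_finished));
    __PACKAGE__->set_primary_key('id');

    # Variant 2: keep inflation enabled but bypass it on hot paths via get_column()
    sub t_finished_raw {
        my ($self) = @_;
        return $self->get_column('t_finished');    # raw value, no DateTime object constructed
    }

    1;

Where a real DateTime object is still needed, something like DateTime::Format::Pg->parse_datetime($raw) can inflate it on demand.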

However, I'm wondering whether saving a few seconds here is already enough to prevent a DDoS attack/accident. (The optimization is likely a good idea regardless but wouldn't a more generic measure make more sense to tackle this issue? I suppose we'll always have slow routes.)

#14

Updated by mkittler almost 3 years ago

Here's another example of a slow route (which would be perfect to cause this issue): #93246#note-6

The point here is again that a more generic measure would make sense (and not just optimize one route and call it done). @kraih That's actually the reason why I wanted your feedback on the ticket. Maybe you know something more generic?

#15

Updated by okurz almost 3 years ago

mkittler wrote:

However, I'm wondering whether saving a few seconds here is already enough to prevent a DDoS attack/accident.

I see the "optimization" as a nice side-task that we can do, but the purpose of the ticket is, well, as the subject says, to "limit overload of openQA webUI by heavy requests", which can happen in other cases of "heavy requests" as well.

#16

Updated by tinita almost 3 years ago

We should add %D or %T (the time taken to serve the request) to our Apache access_logs.

I found out today that on o3 we don't have an access_log at all, and on osd there are two, but only one of them has %D.

If one wants to solve performance problems, gathering performance data is the first step.
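
For reference, a minimal sketch of what that could look like in the Apache configuration (format name and log path are made up; %D logs the time taken to serve the request in microseconds):

    # extend the usual combined format with %D and use it for the openQA vhost
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_duration
    CustomLog /var/log/apache2/openqa.access_log combined_duration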

#17

Updated by okurz almost 3 years ago

As I stated, any optimization is only secondary to preventing an overload.

#18

Updated by tinita almost 3 years ago

okurz wrote:

As I stated, any optimization is only secondary to preventing an overload.

Does that mean access_logs are not necessary?

How can one analyze an overload if we have no data about requests (and request times) at all?

#20

Updated by okurz almost 3 years ago

tinita wrote:

okurz wrote:

As I stated, any optimization is only secondary to preventing an overload.

Does that mean access_logs are not necessary?

How can one analyze an overload if we have no data about requests (and request times) at all?

Think about the following scenario: An attacker launches a DDoS attack on any publicly accessible route. How do we prevent that in such a situation even openQA jobs start failing with weird errors because the openQA worker communication (which also goes over HTTP) is impacted?

tinita wrote:

How about reducing MaxRequestWorkers? https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers

This will likely kill the communication to openQA workers as well

#21

Updated by tinita almost 3 years ago

okurz wrote:

tinita wrote:

How about reducing MaxRequestWorkers? https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers

This will likely kill the communication to openQA workers as well

Then the workers should get their own instance they can talk to.

#22

Updated by okurz over 2 years ago

  • Target version changed from Ready to future
#23

Updated by okurz almost 2 years ago

  • Target version changed from future to Ready
#24

Updated by okurz almost 2 years ago

  • Status changed from New to Blocked
  • Assignee set to okurz

We can block on the current, updated set of subtasks.

#25

Updated by okurz 12 months ago

  • Description updated (diff)
#26

Updated by okurz 12 months ago

  • Target version changed from Ready to future