https://progress.opensuse.org/https://progress.opensuse.org/themes/openSUSE/favicon/favicon.ico?15829177842021-05-19T13:31:04ZopenSUSE Project Management ToolopenQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4088922021-05-19T13:31:04Zokurzokurz@suse.com
<ul><li><strong>Copied from</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" href="/issues/92770">action #92770</a>: openqa.opensuse.org down, o3 VM reachable, no failed service</i> added</li></ul> openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4089042021-05-19T13:33:07Zokurzokurz@suse.com
<ul><li><strong>Tracker</strong> changed from <i>action</i> to <i>coordination</i></li><li><strong>Subject</strong> changed from <i>limit overload of openQA webUI by heavy requests</i> to <i>[epic] limit overload of openQA webUI by heavy requests</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/408904/diff?detail_id=388468">diff</a>)</li><li><strong>Category</strong> set to <i>Feature requests</i></li><li><strong>Assignee</strong> deleted (<del><i>okurz</i></del>)</li><li><strong>Parent task</strong> set to <i>#80142</i></li></ul> openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4112972021-05-28T09:33:02Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/411297/diff?detail_id=390672">diff</a>)</li></ul> openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4113002021-05-28T09:39:41Zilausuchilausuch@suse.com
<ul></ul><p>Idea: Use a master/multi-slave database scheme, where the slaves are read-only replicas.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4113212021-05-28T10:37:29Zilausuchilausuch@suse.com
<ul></ul><p>Other ideas:</p>
<ul>
<li><p>One problem with web apps is that the user has an F5 key on the keyboard (or a reload button in the browser). If users see that a query is taking too long, they may relaunch it by reloading the page, which makes the problem worse. This could be solved by storing a transaction hash in a cache and launching the query via RPC through a queue system (RabbitMQ). The transaction has certain attributes that can be hashed to create exactly one entry in the cache. If the query does not exist in the cache yet, an entry is created and the query is resolved at some later time. The entry has to be stored in the cache at launch time and marked as in progress. When the processing finishes, it updates the cache. That way every request checks the cache first before launching a new query against the DB.</p></li>
<li><p>Another problem is that we have historical data, and some of this data is static once we have generated it. But we are using a relational DB that collects the information from different tables. One strategy I have used with non-relational databases (MongoDB and Elastic) is to introduce data redundancy by storing the complete query result as a single record in a "historical table" if we know that it is not going to change. This improves the response time of the queries.</p></li>
</ul>
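<p>The first idea above (hash the request, mark it as in progress, let duplicates wait for the same result) can be sketched roughly as follows. This is only an in-process illustration in Python: it uses no RabbitMQ or external cache, and the names <code>CoalescingCache</code>, <code>key_for</code>, <code>get_or_compute</code> and <code>invalidate</code> are hypothetical, not part of openQA.</p>

```python
import hashlib
import json
import threading
from concurrent.futures import Future


class CoalescingCache:
    """Share one in-flight computation among identical requests (sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> Future; a pending Future means "in progress"

    @staticmethod
    def key_for(route, params):
        # Hash the attributes that define "the same query".
        canonical = json.dumps([route, sorted(params.items())])
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_compute(self, route, params, compute):
        key = self.key_for(route, params)
        with self._lock:
            future = self._entries.get(key)
            owner = future is None
            if owner:
                # Store the entry at launch time, marked as in progress.
                future = Future()
                self._entries[key] = future
        if owner:
            try:
                future.set_result(compute())
            except Exception as exc:
                # Do not cache failures; a later request may retry.
                with self._lock:
                    self._entries.pop(key, None)
                future.set_exception(exc)
        # Duplicate requests block here until the owner has finished.
        return future.result()

    def invalidate(self, route, params):
        # Hypothetical hook for whatever event makes the cached result stale.
        with self._lock:
            self._entries.pop(self.key_for(route, params), None)
```

<p>A reload of the same page while the first request is still running would then wait for the existing entry instead of starting a second database query. The sketch leaves cache invalidation to the caller.</p>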
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4127912021-06-02T14:06:10Zmkittlermarius.kittler@suse.com
<ul></ul><ul>
<li>Caching sounds great in general but obviously raises the problem of cache invalidation. If we had a real attacker (the last "DDoS attack" looked more like an accident) it would likely not help much because the attacker would simply modify each query slightly so for a generic cache these requests would all be different ones (unless we make the criteria what counts as the same query very coarse, possibly disabling certain distinctions).</li>
<li>We don't really have historical data which is set in stone. Nobody prevents you from scheduling new jobs for an old build. Of course we could store the computed figures for a build somewhere (e.g. within a JSON file on disk) and load it directly on subsequent queries. Adding/changing/deleting a job (or a job comment with bugrefs) within that build would invalidate the figures again. That would be similar to how we display asset statistics. By the way, I wouldn't involve MongoDB. From my experience its performance is quite poor for large datasets.</li>
<li>We could also enforce a rate limit via Minion locks (as we already do for the search). It would have the advantage that it is easy to implement and we could also easily differentiate between anonymous users and logged-in ones. However, there are likely more generic solutions which would perform better.</li>
</ul>
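<p>To illustrate the rate-limit idea from the last bullet, here is a minimal Python sketch of named, expiring locks with a limit. It is only loosely modelled on that idea and is not the Minion API; <code>ExpiringLocks</code>, <code>allow_request</code> and the numbers in <code>LIMITS</code> are assumptions made up for the example.</p>

```python
import time


class ExpiringLocks:
    """Named locks that expire on their own, with an optional limit (sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiries = {}  # lock name -> list of expiry timestamps

    def acquire(self, name, duration, limit=1):
        now = self._clock()
        # Drop locks that have already expired.
        active = [t for t in self._expiries.get(name, []) if t > now]
        if len(active) >= limit:
            self._expiries[name] = active
            return False
        active.append(now + duration)
        self._expiries[name] = active
        return True


# Hypothetical limits: heavy requests allowed per window for each user class.
LIMITS = {"anonymous": 2, "logged_in": 10}


def allow_request(locks, route, user=None, window=60.0):
    user_class = "logged_in" if user else "anonymous"
    # All anonymous users share one bucket here; a real setup might key by IP.
    name = f"heavy:{route}:{user or 'anonymous'}"
    return locks.acquire(name, window, LIMITS[user_class])
```

<p>A controller for a slow route could call <code>allow_request</code> first and answer with an error status such as 429 when it returns false.</p>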
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4127972021-06-02T14:13:57Zkraihsebastian.riedel@suse.com
<ul></ul><p>Since we don't actually know why <code>/group_overview</code> is so slow, I'll run some profiling on it and post the results.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4128032021-06-02T14:23:36Zmkittlermarius.kittler@suse.com
<ul></ul><p>I did some profiling in the past. If I remember correctly, the heavy part is the querying and number crunching for the build results (as displaying comments is now reduced to a limited number of comments). Btw, on the index page that "slow part" has been moved to an extra AJAX query.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4128212021-06-02T15:14:35Zkraihsebastian.riedel@suse.com
<ul><li><strong>File</strong> <a href="/attachments/11445">graph.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/11445/graph.png">graph.png</a> added</li></ul><p>Actually, the flame graph I've attached to this comment looks quite interesting. On the circled plateau we spend a lot of time inflating <code>DBIx::Class</code> columns with <code>DateTime</code>. That is something that could definitely be optimised if we wanted to.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4128242021-06-02T15:17:30Zkraihsebastian.riedel@suse.com
<ul></ul><p>The top 15 subroutine calls also reflect that.</p>
<pre><code> Calls   P   F  Exclusive Time  Inclusive Time  Subroutine
    90   1   1           34.2s           34.2s  IO::Poll::_poll (xsub)
309690  11  11           885ms           1.86s  DBIx::Class::FilterColumn::get_column
 26980   1   1           689ms           1.04s  DateTime::_check_new_params
  2455   1   1           678ms           678ms  DBI::st::execute (xsub)
    10   1   1           640ms           19.0s  OpenQA::BuildResults::compute_build_results
 11552   3   1           628ms           2.84s  DBIx::Class::ResultSet::_construct_results
 27082   2   1           534ms           1.37s  DateTime::_new
356697  18  10           517ms           822ms  next::method
 26920   1   1           443ms           4.08s  DateTime::Format::Builder::Parser::generic::__ANON__[DateTime/Format/Builder/Parser/generic.pm:82]
 26860   1   1           419ms           884ms  DateTime::_compare
 31405   4   1           414ms           414ms  DBIx::Class::Carp::CORE:regcomp (opcode)
  2431   1   1           375ms           478ms  DBIx::Class::Storage::DBIHacks::_resolve_column_info
 35230  32  21           359ms           7.64s  Try::Tiny::try (recurses: max depth 3, inclusive time 119ms)
330516  18  18           306ms           306ms  DBIx::Class::Row::get_column
 26922   2   1           305ms           5.99s  DBIx::Class::InflateColumn::DateTime::_flate_or_fallback
</code></pre>
<p>Same for files ordered by exclusive time. The <code>689248 1.36s DBIx/Class/Storage/DBI.pm</code> entry should be the one actually waiting for data from PostgreSQL.</p>
<pre><code>  Stmts  Exclusive Time  Reports  Source File
   3277           34.3s  line     Mojo/Reactor/Poll.pm
2305356           1.95s  line     DateTime.pm
1121465           1.46s  line     DBIx/Class/ResultSet.pm (including 2 string evals)
 689248           1.36s  line     DBIx/Class/Storage/DBI.pm
1306578           1.25s  line     SQL/Abstract/Classic.pm
 713408           1.18s  line     mro.pm
 654392           1.11s  line     Class/Accessor/Grouped.pm (including 182 string evals)
1519828           906ms  line     DBIx/Class/Row.pm
 824492           884ms  line     Eval/Closure.pm (including 1 string eval)
 460053           831ms  line     DBIx/Class/InflateColumn/DateTime.pm
  96289           725ms  line     DBIx/Class/Carp.pm
 355647           666ms  line     /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/BuildResults.pm
 811799           664ms  line     Try/Tiny.pm
 350115           632ms  line     DateTime/Format/Builder/Parser/Regex.pm
 929102           597ms  line     DBIx/Class/FilterColumn.pm
 704047           593ms  line     DBIx/Class/Storage/DBIHacks.pm
 509024           544ms  line     DBIx/Class/ResultSource.pm
 701021           469ms  line     DateTime/Format/Builder/Parser.pm
 430897           424ms  line     DateTime/Format/Builder/Parser/generic.pm
 429768           421ms  line     DateTime/Helpers.pm
 316899           369ms  line     DBIx/Class/_Util.pm
 350020           359ms  line     DateTime/Format/Pg.pm (including 1 string eval)
 138333           356ms  line     DBIx/Class/Storage/DBI/Cursor.pm
 400973           355ms  line     DBIx/Class/SQLMaker/ClassicExtensions.pm
 292454           337ms  line     DBIx/Class/InflateColumn.pm
 473769           317ms  line     Archive/Extract.pm
 234161           271ms  line     Mojo/Util.pm (including 5 string evals)
 208552           257ms  line     Mojo/Base.pm
 161717           209ms  line     DateTime/TimeZone/Floating.pm
 202072           193ms  line     Specio/Constraint/Role/Interface.pm
  83257           175ms  line     /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/Utils.pm
 217182           170ms  line     /home/sri/work/openQA/repos/openQA/script/../lib/OpenQA/Log.pm
 208892           167ms  line     Mojo/Path.pm
 184932           155ms  line     Mojo/URL.pm
</code></pre>
<p>You can ignore <code>IO::Poll</code> and <code>Mojo::Reactor::Poll</code>, that's just the web server mainloop waiting for requests.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4128302021-06-02T15:28:57Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>Regarding DateTime: I remember that I significantly optimized the board software behind perl-community.de by turning the datetime columns into integer columns with epoch seconds, and then only rendering it when finally displaying it in HTML.</p>
<p>That's probably not an option here.<br>
But we could remove the datetime columns (and other columns) from the SELECT list where we don't need them.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4128362021-06-02T16:41:57Zilausuchilausuch@suse.com
<ul></ul><p>Regarding what Tina said: the DB is certainly not the problem here, but we could also add a new column holding the epoch timestamp as an int/bigint and use it for conditions, and (as Tina said) use the datetime column only for queries that require it.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4139602021-06-08T09:09:29Zmkittlermarius.kittler@suse.com
<ul></ul><p>Before adding yet another redundant column or changing the data type, let's just check whether we can disable the automatic conversion for the column we have, e.g. by removing <code>InflateColumn::DateTime</code> from <code>load_components</code> in the relevant <code>OpenQA::Schema::Result::*</code> packages. This of course means we need to create <code>DateTime</code> objects manually where they are needed and possibly fix many places in the code where the <code>t_*</code> columns are used.</p>
<p>However, I'm wondering whether saving a few seconds here is already enough to prevent a DDoS attack/accident. (The optimization is likely a good idea regardless but wouldn't a more generic measure make more sense to tackle this issue? I suppose we'll always have slow routes.)</p>
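<p>The general idea of skipping automatic conversion and creating date objects only where they are needed can be illustrated with a small Python sketch. It is not openQA or <code>DBIx::Class</code> code; the <code>JobRow</code> class, the <code>t_created</code> column name and the timestamp format are assumptions chosen for the example.</p>

```python
from datetime import datetime


class JobRow:
    """Row wrapper that keeps timestamp columns as raw strings (sketch)."""

    def __init__(self, raw):
        self._raw = raw  # column values as returned by the DB driver

    def get_column(self, name):
        # Cheap access: no parsing, suitable for number crunching.
        return self._raw[name]

    def get_datetime(self, name):
        # Explicit, on-demand conversion where a real datetime is needed.
        # Assumes timestamps look like "2021-06-02 15:14:35".
        return datetime.strptime(self._raw[name], "%Y-%m-%d %H:%M:%S")


def newest_first(rows, column="t_created"):
    # Zero-padded "YYYY-MM-DD HH:MM:SS" strings sort chronologically,
    # so ordering does not require any datetime objects.
    return sorted(rows, key=lambda row: row.get_column(column), reverse=True)
```

<p>With this split, code that only compares or orders rows never pays for parsing, while rendering code calls <code>get_datetime</code> explicitly.</p>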
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4139932021-06-08T10:24:29Zmkittlermarius.kittler@suse.com
<ul></ul><p>Here's another example of a slow route (which would be perfect to cause this issue): <a class="issue tracker-6 status-3 priority-4 priority-default closed child parent" title="coordination: [epic] List all unreviewed failed (or incomplete) jobs on /tests on request size:M (Resolved)" href="https://progress.opensuse.org/issues/93246#note-6">#93246#note-6</a></p>
<p>The point here is again that a more generic measure would make sense (and not just optimize one route and call it done). <a class="user active user-mention" href="https://progress.opensuse.org/users/23018">@kraih</a> That's actually the reason why I wanted your feedback on the ticket. Maybe you know something more generic?</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4142332021-06-08T14:20:30Zokurzokurz@suse.com
<ul></ul><p>mkittler wrote:</p>
<blockquote>
<p>However, I'm wondering whether saving a few seconds here is already enough to prevent a DDoS attack/accident.</p>
</blockquote>
<p>I see the "optimization" as a nice side-task that we can do but the purpose of the ticket is, well, as the subject says "limit overload of openQA webUI by heavy requests" which can happen in other cases of "heavy requests".</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4163382021-06-15T12:31:23Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>We should add <code>%D</code> or <code>%T</code> (The time taken to serve the request) to our Apache access_logs.</p>
<p>I found out today that on o3 we don't have an access_log at all, and on osd there are two, and only in one we have <code>%D</code>.</p>
<p>If one wants to solve performance problems, gathering performance data is the first step.</p>
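<p>Once <code>%D</code> (request time in microseconds) is logged, a small script can rank routes by total time spent. The following Python sketch assumes a log format of <code>%h %l %u %t "%r" %&gt;s %b %D</code>, which is an assumption and not necessarily what o3 or osd use; the function name <code>slowest_routes</code> is hypothetical.</p>

```python
import re
from collections import defaultdict

# Assumed LogFormat: %h %l %u %t "%r" %>s %b %D  (with %D in microseconds)
LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<usec>\d+)$'
)


def slowest_routes(lines, top=5):
    """Return (path, request count, total seconds) sorted by total time."""
    totals = defaultdict(lambda: [0, 0])  # path -> [requests, microseconds]
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        path = match["path"].split("?", 1)[0]  # group by path, ignore query
        totals[path][0] += 1
        totals[path][1] += int(match["usec"])
    ranked = sorted(totals.items(), key=lambda item: item[1][1], reverse=True)
    return [(path, count, usec / 1e6) for path, (count, usec) in ranked[:top]]
```

<p>Feeding it the lines of an access log, e.g. <code>slowest_routes(open(path_to_log))</code>, would list the routes that consume the most server time.</p>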
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4163442021-06-15T12:44:28Zokurzokurz@suse.com
<ul></ul><p>As I stated, any optimization is only secondary to preventing an overload.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4163562021-06-15T12:58:53Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>As I stated any optimization is only secondary to prevent an overload</p>
</blockquote>
<p>Does that mean access_logs are not necessary?</p>
<p>How can one analyze an overload if we have no data about requests (and request times) at all?</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4163622021-06-15T13:38:41Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>How about reducing MaxRequestWorkers? <a href="https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers" class="external">https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers</a></p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4164162021-06-15T14:47:20Zokurzokurz@suse.com
<ul></ul><p>tinita wrote:</p>
<blockquote>
<p>okurz wrote:</p>
<blockquote>
<p>As I stated any optimization is only secondary to prevent an overload</p>
</blockquote>
<p>Does that mean access_logs are not necessary?</p>
<p>How can one analyze an overload if we have no data about requests (and request times) at all?</p>
</blockquote>
<p>Think about the following scenario: An attacker launches a DDoS attack on <em>any</em> publicly accessible route. How do we prevent that in this situation even openQA jobs fail with weird errors because openQA worker communication (which also goes over HTTP) is impacted?</p>
<p>tinita wrote:</p>
<blockquote>
<p>How about reducing MaxRequestWorkers? <a href="https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers" class="external">https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers</a></p>
</blockquote>
<p>This will likely kill the communication to openQA workers as well</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4164312021-06-15T15:54:37Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>tinita wrote:</p>
<blockquote>
<p>How about reducing MaxRequestWorkers? <a href="https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers" class="external">https://httpd.apache.org/docs/2.4/mod/mpm_common.html#maxrequestworkers</a></p>
</blockquote>
<p>This will likely kill the communication to openQA workers as well</p>
</blockquote>
<p>Then the workers should get their own instance they can talk to.</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=4249242021-07-09T08:16:21Zokurzokurz@suse.com
<ul><li><strong>Target version</strong> changed from <i>Ready</i> to <i>future</i></li></ul> openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=5162952022-05-05T13:26:40Zokurzokurz@suse.com
<ul><li><strong>Target version</strong> changed from <i>future</i> to <i>Ready</i></li></ul> openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=5187312022-05-12T10:09:54Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Blocked</i></li><li><strong>Assignee</strong> set to <i>okurz</i></li></ul><p>We can block on the current updated set of subtasks</p>
openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=6185122023-03-30T05:39:34Zokurzokurz@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/618512/diff?detail_id=580898">diff</a>)</li></ul> openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requestshttps://progress.opensuse.org/issues/92854?journal_id=6193102023-03-31T13:34:51Zokurzokurz@suse.com
<ul><li><strong>Target version</strong> changed from <i>Ready</i> to <i>future</i></li></ul>