action #129068
closedcoordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #92854: [epic] limit overload of openQA webUI by heavy requests
Limit the number of uploadable test result steps size:M
Description
Motivation¶
In https://suse.slack.com/archives/C02CANHLANP/p1683723956965209 and #129065 we discussed what could cause huge OSD load and fvogt identified a potential candidate: A job with a very high number of test steps:
From OSD:
openqa:/var/lib/openqa/testresults/11085/11085729-sle-15-SP5-Online-aarch64-Buildlemon-suse_os-autoinst-distri-opensuse_fix-nfs-server-exports-timeout-issue-yast2_nfs_v3_server_@aarch64 # ls -l | wc -l
63884
So an opportunity to use that as an awesome scalability test case :) First thing we should do as a stop-gap is to limit to N test steps uploadable (configurable value).
Acceptance criteria¶
- AC1: openQA by default refuses to accept test step results that exceed a configurable limit with sensible default
- AC2: A user can see a clear error message, e.g. incomplete openQA job with explanatory "reason" field
Suggestions¶
- Look into our history of tickets: We already had at least one ticket about huge jobs killing openQA. Why did we not continue there?
- Implement a limit within os-autoinst or openQA worker or maybe also within the webUI refusing to accept if a configurable limit is exceeded
- Test with a test instruction like
while (1) { assert_script_run('true'); }
- Also prevent displaying test steps that exceed a configurable limit, can be same limit value
- For os-autoinst look into https://github.com/os-autoinst/os-autoinst/blob/master/basetest.pm#L404
- For the openQA web UI relevant code is likely in that controller: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/WebAPI/Controller/API/V1/Job.pm
- If possible consider multiple levels, e.g. simple safe guard in os-autoinst, openQA worker and webUI
Updated by okurz over 1 year ago
- Copied from action #129065: [alert] HTTP Response alert fired, OSD loads slow size:M added
Updated by favogt over 1 year ago
It's not about the test modules but the individual results (screenshots, serial output).
To reproduce https://bugzilla.suse.com/show_bug.cgi?id=1210075, I asked for runs of
while (1) {
clear_console();
assert_script_run("true");
}
(FTR, that was a success: The jobs eventually failed)
Updated by okurz over 1 year ago
- Subject changed from Limit the number of uploadable test result modules to Limit the number of uploadable test result steps
- Description updated (diff)
Updated by mkittler over 1 year ago
- Subject changed from Limit the number of uploadable test result steps to Limit the number of uploadable test result steps size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
a small idea for os-autoinst: https://github.com/os-autoinst/os-autoinst/pull/2313
Updated by mkittler over 1 year ago
Most effective would be a limit enforced by the web UI controller. This way nobody can upload too many results using a patched version of os-autoinst or the worker. The most straight forward way to implement this is adding a result_count
column (in addition to the existing result_size
column), increment it via UPDATE … WHERE … result_count < $limit
and return an error if the row could not be updated. The only annoyance is the additional column itself. Alternatively one could just use a text file in the results directory to store the counter but making this work race-free is a bit more difficult. We could also just do a directory listing and count the relevant files but distinguishing relevant and irrelevant files in the listing is also not that nice.
I guess the approach from https://github.com/os-autoinst/os-autoinst/pull/2313 could be followed-up upon as well. Of course it is only a client-side check and thus not really secure. As it has been implemented so far it would also only affect the number of steps within a single job module (but likely we want just a global limit).
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
PR for adding a limit on os-autoinst level: https://github.com/os-autoinst/os-autoinst/pull/2322
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
Maybe we can discuss in the unblock whether we want a server-side limit for this (as there are some caveats, see #129068#note-8).
Updated by okurz over 1 year ago
I think just limiting in os-autoinst is ok as there is no other backend. And if an instance suffers from old os-autoinst then obviously the solution is to upgrade os-autoinst :) and if an admin identifies a malicious worker they can simply remove the API key
Updated by mkittler over 1 year ago
I was more thinking about malicious users trying to break things intentionally. However, considering https://github.com/os-autoinst/os-autoinst/pull/2322#issuecomment-1580378712 and the effort a server-side limit would be it is likely ok to just limit it client-side.
So what's now still missing is to mark the job as incomplete with a meaningful reason.
Updated by mkittler over 1 year ago
The PR now covers marking the job as incomplete with a meaningful reason together with https://github.com/os-autoinst/openQA/pull/5193.
Updated by okurz over 1 year ago
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/os-autoinst/pull/2322 merged. That should suffice and also cover the case mentioned in the ticket's motivation section. If we see a problem again we can also lower the limit per instance with the "MAX_TEST_STEPS" variable.