Project

General

Profile

Actions

action #129068

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Limit the number of uploadable test result steps size:M

Added by okurz 5 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-04-01
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

In https://suse.slack.com/archives/C02CANHLANP/p1683723956965209 and #129065 we discussed what could cause huge OSD load and fvogt identified a potential candidate: A job with a very high number of test steps:

From OSD:

openqa:/var/lib/openqa/testresults/11085/11085729-sle-15-SP5-Online-aarch64-Buildlemon-suse_os-autoinst-distri-opensuse_fix-nfs-server-exports-timeout-issue-yast2_nfs_v3_server_@aarch64 # ls -l | wc -l
63884

So an opportunity to use that as an awesome scalability test case :) First thing we should do as a stop-gap is to limit to N test steps uploadable (configurable value).

Acceptance criteria

  • AC1: openQA by default refuses to accept test step results that exceed a configurable limit with sensible default
  • AC2: A user can see a clear error message, e.g. incomplete openQA job with explanatory "reason" field

Suggestions


Related issues 1 (0 open1 closed)

Copied from openQA Infrastructure - action #129065: [alert] HTTP Response alert fired, OSD loads slow size:MResolvedmkittler2023-04-01

Actions
Actions #1

Updated by okurz 5 months ago

  • Copied from action #129065: [alert] HTTP Response alert fired, OSD loads slow size:M added
Actions #2

Updated by favogt 5 months ago

It's not about the test modules but the individual results (screenshots, serial output).

To reproduce https://bugzilla.suse.com/show_bug.cgi?id=1210075, I asked for runs of

    while (1) {
        clear_console();
        assert_script_run("true");
    }

(FTR, that was a success: The jobs eventually failed)

Actions #3

Updated by okurz 5 months ago

  • Subject changed from Limit the number of uploadable test result modules to Limit the number of uploadable test result steps
  • Description updated (diff)
Actions #4

Updated by okurz 5 months ago

Sorry, I meant test steps, not modules.

Actions #5

Updated by mkittler 5 months ago

  • Subject changed from Limit the number of uploadable test result steps to Limit the number of uploadable test result steps size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz 5 months ago

Actions #7

Updated by mkittler 4 months ago

  • Assignee set to mkittler
Actions #8

Updated by mkittler 4 months ago

Most effective would be a limit enforced by the web UI controller. This way nobody can upload too many results using a patched version of os-autoinst or the worker. The most straight forward way to implement this is adding a result_count column (in addition to the existing result_size column), increment it via UPDATE … WHERE … result_count < $limit and return an error if the row could not be updated. The only annoyance is the additional column itself. Alternatively one could just use a text file in the results directory to store the counter but making this work race-free is a bit more difficult. We could also just do a directory listing and count the relevant files but distinguishing relevant and irrelevant files in the listing is also not that nice.

I guess the approach from https://github.com/os-autoinst/os-autoinst/pull/2313 could be followed-up upon as well. Of course it is only a client-side check and thus not really secure. As it has been implemented so far it would also only affect the number of steps within a single job module (but likely we want just a global limit).

Actions #9

Updated by mkittler 4 months ago

  • Status changed from Workable to In Progress

PR for adding a limit on os-autoinst level: https://github.com/os-autoinst/os-autoinst/pull/2322

Actions #10

Updated by mkittler 4 months ago

  • Status changed from In Progress to Feedback

Maybe we can discuss in the unblock whether we want a server-side limit for this (as there are some caveats, see #129068#note-8).

Actions #11

Updated by okurz 4 months ago

I think just limiting in os-autoinst is ok as there is no other backend. And if an instance suffers from old os-autoinst then obviously the solution is to upgrade os-autoinst :) and if an admin identifies a malicious worker they can simply remove the API key

Actions #12

Updated by mkittler 4 months ago

I was more thinking about malicious users trying to break things intentionally. However, considering https://github.com/os-autoinst/os-autoinst/pull/2322#issuecomment-1580378712 and the effort a server-side limit would be it is likely ok to just limit it client-side.

So what's now still missing is to mark the job as incomplete with a meaningful reason.

Actions #13

Updated by mkittler 4 months ago

The PR now covers marking the job as incomplete with a meaningful reason together with https://github.com/os-autoinst/openQA/pull/5193.

Actions #14

Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/os-autoinst/pull/2322 merged. That should suffice and also cover the case mentioned in the ticket's motivation section. If we see a problem again we can also lower the limit per instance with the "MAX_TEST_STEPS" variable.

Actions #15

Updated by okurz 4 months ago

  • Parent task set to #92854
Actions

Also available in: Atom PDF