
action #110677

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Investigation page shouldn't involve blocking long-running API routes size:M

Added by cdywan about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-02-03
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

The Ajax request can be very slow, e.g. https://openqa.opensuse.org/tests/2326528/investigation_ajax can take several minutes to respond.
See #106759#note-36

Acceptance criteria

  • AC1: /tests/*/investigation_ajax requests can no longer block a prefork worker process for more than 400 seconds
  • AC2: The user does not see any timeouts, even if loading of the "Investigation" tab takes a while

Suggestions

  • Move the investigation code into a Minion job and add a route for polling its result
    • Very easy to store results as job results, but performance regressions need to be considered
  • As a further optimization, the investigation results could be cached explicitly and/or for longer
    • Generate results even before the job finishes:
    • compute the git analysis once and save it to a file
    • save to an asset
    • possibly a bit wasteful (as we run a lot of jobs but don't need the investigation for every job)
    • Trigger caching of data upon first request
    • return "no data yet" on the first attempt -> trigger a minion job
    • check for an existing minion job for the same job (e.g. user refreshing the page)
    • next step: save to an asset internally
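The "trigger caching upon first request" idea could be sketched client-side as a polling loop. This is a minimal sketch only: the route name `/tests/<id>/investigation_status`, the JSON shape, and the intervals are assumptions for illustration, not the actual openQA API.

```javascript
// Sketch of the "no data yet" polling idea from the suggestions above.
// The route and the {status: 'pending'} / {status: 'done', html: ...}
// response shape are hypothetical. fetchFn is injectable so the logic
// can be exercised without a server.
async function pollInvestigation(jobId, fetchFn, {interval = 2000, maxTries = 150} = {}) {
  const url = `/tests/${jobId}/investigation_status`; // hypothetical route
  for (let i = 0; i < maxTries; i++) {
    const res = await fetchFn(url);
    const data = await res.json();
    if (data.status === 'done') return data.html;
    // 'pending' means the Minion job is still running; wait and retry
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
  throw new Error('investigation data not ready in time');
}
```

On the server side, a repeated request for the same job would find the already-enqueued Minion job instead of starting a new one, so a user refreshing the page does not multiply the work.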

Related issues

Related to openQA Project - action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc. (New, 2022-05-30)

Blocks openQA Project - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Feedback, 2022-02-03)

History

#1 Updated by cdywan about 2 months ago

  • Copied from action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

#2 Updated by okurz about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Feature requests
  • Target version set to Ready

#3 Updated by okurz about 2 months ago

  • Due date deleted (2022-05-13)

#4 Updated by okurz about 2 months ago

  • Parent task set to #92854

#5 Updated by tinita about 2 months ago

I asked myself the questions:

  • Do we have a general problem with investigation_ajax? Most of the time it seems to be quite fast.
  • In the cases where it's slow, the cause seems to be a very long log, and there I question the value of such a long log. In the case of hundreds of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

#6 Updated by okurz about 2 months ago

tinita wrote:

  • In the cases where it's slow, the cause seems to be a very long log, and there I question the value of such a long log. In the case of hundreds of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link at the top of the investigation page pointing to the first bad if the current job is not already the first bad. But that means that for a failure which has been reproducible for a long time, the greater the distance between first bad and current job, the longer the log and the longer the investigation takes. I do not want to prescribe to the user that looking at "first bad vs. last good" is the only sensible view in every case. But we should at least limit the time the investigation takes, potentially just pointing to the "first bad vs. last good" link if we can't ensure we finish sooner, or try to limit the time it takes to evaluate the individual steps, e.g. limit the log content as you suggested.

#7 Updated by tinita about 2 months ago

okurz wrote:

tinita wrote:

  • In the cases where it's slow, the cause seems to be a very long log, and there I question the value of such a long log. In the case of hundreds of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link at the top of the investigation page pointing to the first bad if the current job is not already the first bad.

Of course I know that. I implemented a part of that feature.
What I mean is that in many cases that is the interesting link. Not in all cases, of course.
But the longer the log, the less probable it is that people will find the cause in there.

I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case.

Of course not. I did not suggest that, did I?

We are talking about a couple of thousand commits in the case that brought us here. We can probably limit it to 200 and be fine.

#8 Updated by cdywan about 2 months ago

  • Copied from deleted (action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M)

#9 Updated by cdywan about 2 months ago

  • Blocks action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

#10 Updated by tinita about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#11 Updated by openqa_review about 2 months ago

  • Due date set to 2022-05-26

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz about 2 months ago

  • Related to action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc. added

#13 Updated by tinita about 1 month ago

I've been trying to add a limit and was wondering why the last commit wasn't showing up and found a bug in the githashToLink function:
https://github.com/os-autoinst/openQA/blob/master/assets/javascripts/test_result.js#L602

It matches the git log --stat text output, replaces the text with HTML links, and then matches that again with a /(\<a.+?)(?=(\<a))/gms regex which excludes the last commit because of the lookahead.
Instead of fixing that regex, I think we can do better than matching the text a second time after manipulating the whole string from the first match (and even save the client a few milliseconds): I would like that function to build a ready-to-use data structure directly, instead of doing multiple matches and splitting later.
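The effect of that lookahead can be reproduced in isolation. This is a minimal sketch of the bug and one possible fix, not the actual test_result.js code; the sample string merely stands in for the linkified git log output.

```javascript
// Minimal reproduction of the lookahead problem described above.
// Each '<a ...>...</a> ...' chunk stands for one linkified commit line.
const text = '<a href="#1">c1</a> stat1 <a href="#2">c2</a> stat2 <a href="#3">c3</a> stat3';

// The original pattern: the (?=<a) lookahead requires another <a AFTER
// each match, so the final commit never matches and is silently dropped.
const buggy = text.match(/(<a.+?)(?=(<a))/gms);
// buggy.length === 2

// Splitting at each <a keeps every chunk, including the last one,
// without re-matching the already-manipulated string.
const fixed = text.split(/(?=<a)/);
// fixed.length === 3
```

The split-on-lookahead variant is one way to avoid the off-by-one; building a structured list of commits in `githashToLink` itself, as proposed above, avoids the string round-trip entirely.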

#14 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4654 Fix displaying of investigation git log

#16 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4656 Limit git log in investigation tab

#17 Updated by okurz about 1 month ago

https://github.com/os-autoinst/openQA/pull/4656 merged. Due to continuous deployment on o3 you could resolve this today, after verification, as soon as the update has been rolled out.

#19 Updated by tinita about 1 month ago

  • Status changed from In Progress to Feedback

#20 Updated by tinita about 1 month ago

It takes about 25 seconds to load the investigation tab now for the job in question.

(In a local test the needles_diff_stat git call took 17s, so that's the call that takes most of the time.)

If we also wanted to avoid this, then doing a count of the commits first (and showing nothing if it's > 200) would be a possibility. A git rev-list --count a..b takes about 2-3s for the job in question.

So, can we live with the 25s for now or should I do the commit-count?

#21 Updated by okurz about 1 month ago

tinita wrote:

So, can we live with the 25s for now or should I do the commit-count?

I would say 25s is still bearable as people only trigger the investigation on purpose and selectively. I think it can be expected that the "deep investigation" does some demanding number crunching and then takes longer.

#22 Updated by tinita about 1 month ago

Another thing which should probably be changed is that the output of the git command is logged via log_debug.
I mean, debugging is useful, but I would say this should be log_trace instead of log_debug.

https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Utils.pm#L319 run_cmd_with_log_return_error

#23 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4669 Don't log_debug git stdout in all cases

#25 Updated by tinita about 1 month ago

  • Status changed from Feedback to Resolved

I checked the logs/journal on o3 & osd and the log/diff output is now gone. Investigation tab still works as expected.

#26 Updated by okurz about 1 month ago

  • Due date deleted (2022-05-26)
