
action #110677

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Investigation page shouldn't involve blocking long-running API routes size:M

Added by cdywan about 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-02-03
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

The Ajax request can be very slow, e.g. https://openqa.opensuse.org/tests/2326528/investigation_ajax can take several minutes to respond.
See #106759#note-36

Acceptance criteria

  • AC1: /tests/*/investigation_ajax requests can no longer block a prefork worker process for more than 400 seconds
  • AC2: The user does not see any timeouts, even if loading of the "Investigation" tab takes a while

Suggestions

  • Move the investigation code into a Minion job and add a route for polling its result
    • Very easy to store results as job results, but performance regressions need to be considered
  • As a further optimization, the investigation results could be cached explicitly and/or for longer
    • Generate results even before the job finishes:
    • compute the git analysis once and save it to a file
    • save to an asset
    • possibly a bit wasteful (as we run a lot of jobs but don't need the investigation for every job)
    • Trigger caching of data upon first request
    • return "no data yet" on the first attempt -> trigger a minion job
    • check for an existing minion job for the same job (e.g. user refreshing the page)
    • next step: save to an asset internally
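The "trigger caching upon first request" idea could be sketched client-side as a polling loop. This is a minimal sketch only: the route name `/tests/<id>/investigation_status`, the JSON shape, and the intervals are assumptions for illustration, not the actual openQA API.

```javascript
// Sketch of the "no data yet" polling idea from the suggestions above.
// The route and the {status: 'pending'} / {status: 'done', html: ...}
// response shape are hypothetical. fetchFn is injectable so the logic
// can be exercised without a server.
async function pollInvestigation(jobId, fetchFn, {interval = 2000, maxTries = 150} = {}) {
  const url = `/tests/${jobId}/investigation_status`; // hypothetical route
  for (let i = 0; i < maxTries; i++) {
    const res = await fetchFn(url);
    const data = await res.json();
    if (data.status === 'done') return data.html;
    // 'pending' means the Minion job is still running; wait and retry
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
  throw new Error('investigation data not ready in time');
}
```

On the server side, a repeated request for the same job would find the already-enqueued Minion job instead of starting a new one, so a user refreshing the page does not multiply the work.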

Related issues

Related to openQA Project - action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc. (New, 2022-05-30)

Blocks openQA Project - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M (Feedback, 2022-02-03)

History

#1 Updated by cdywan about 2 months ago

  • Copied from action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

#2 Updated by okurz about 2 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Feature requests
  • Target version set to Ready

#3 Updated by okurz about 2 months ago

  • Due date deleted (2022-05-13)

#4 Updated by okurz about 2 months ago

  • Parent task set to #92854

#5 Updated by tinita about 2 months ago

I asked myself the questions:

  • Do we have a general problem with investigation_ajax? Most of the time it seems to be quite fast.
  • In the cases where it's slow, the cause seems to be a very long log, and there I question the value of such a long log. In the case of hundreds of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

#6 Updated by okurz about 2 months ago

tinita wrote:

  • In the cases where it's slow, the cause seems to be a very long log, and there I question the value of such a long log. In the case of hundreds of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link at the top of the investigation page pointing to the first bad if the current job is not already the first bad. But that means that for a failure which has been reproducible for a long time, the greater the distance between first bad and current job, the longer the log and the longer the investigation takes. I do not want to prescribe to the user that looking at "first bad vs. last good" is the only sensible view in every case. But we should at least limit the time the investigation takes, potentially just pointing to the "first bad vs. last good" link if we can't ensure we finish sooner, or try to limit the time it takes to evaluate the individual steps, e.g. limit the log content as you suggested.

#7 Updated by tinita about 2 months ago

okurz wrote:

tinita wrote:

  • In the cases where it's slow, the cause seems to be a very long log, and there I question the value of such a long log. In the case of hundreds of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link at the top of the investigation page pointing to the first bad if the current job is not already the first bad.

Of course I know that. I implemented a part of that feature.
What I mean is that in many cases that is the interesting link. Not in all cases, of course.
But the longer the log, the less probable it is that people will find the cause in there.

I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case.

Of course not. I did not suggest that, did I?

We are talking about a couple of thousand commits in the case that brought us here. We can probably limit it to 200 and be fine.

#8 Updated by cdywan about 2 months ago

  • Copied from deleted (action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M)

#9 Updated by cdywan about 2 months ago

  • Blocks action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

#10 Updated by tinita about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#11 Updated by openqa_review about 2 months ago

  • Due date set to 2022-05-26

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz about 2 months ago

  • Related to action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc. added

#13 Updated by tinita about 1 month ago

I've been trying to add a limit and was wondering why the last commit wasn't showing up and found a bug in the githashToLink function:
https://github.com/os-autoinst/openQA/blob/master/assets/javascripts/test_result.js#L602

It matches the git log --stat text output, replaces the text with HTML links, and then matches that again with a /(\<a.+?)(?=(\<a))/gms regex which excludes the last commit because of the lookahead.
Instead of fixing that regex, I think we can do better than matching the text a second time after manipulating the whole string from the first match (and even save the client a few milliseconds): I would like that function to build a ready-to-use data structure directly, instead of doing multiple matches and splitting later.
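The effect of that lookahead can be reproduced in isolation. This is a minimal sketch of the bug and one possible fix, not the actual test_result.js code; the sample string merely stands in for the linkified git log output.

```javascript
// Minimal reproduction of the lookahead problem described above.
// Each '<a ...>...</a> ...' chunk stands for one linkified commit line.
const text = '<a href="#1">c1</a> stat1 <a href="#2">c2</a> stat2 <a href="#3">c3</a> stat3';

// The original pattern: the (?=<a) lookahead requires another <a AFTER
// each match, so the final commit never matches and is silently dropped.
const buggy = text.match(/(<a.+?)(?=(<a))/gms);
// buggy.length === 2

// Splitting at each <a keeps every chunk, including the last one,
// without re-matching the already-manipulated string.
const fixed = text.split(/(?=<a)/);
// fixed.length === 3
```

The split-on-lookahead variant is one way to avoid the off-by-one; building a structured list of commits in `githashToLink` itself, as proposed above, avoids the string round-trip entirely.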

#14 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4654 Fix displaying of investigation git log

#16 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4656 Limit git log in investigation tab

#17 Updated by okurz about 1 month ago

https://github.com/os-autoinst/openQA/pull/4656 merged. Due to continuous deployment on o3 you could resolve this today, after verification, as soon as the update has been rolled out.

#19 Updated by tinita about 1 month ago

  • Status changed from In Progress to Feedback

#20 Updated by tinita about 1 month ago

It takes about 25 seconds to load the investigation tab now for the job in question.

(In a local test the needles_diff_stat git call took 17s, so that's the call that takes most of the time.)

If we also wanted to avoid this, then doing a count of the commits first (and showing nothing if it's > 200) would be a possibility. A git rev-list --count a..b takes about 2-3s for the job in question.

So, can we live with the 25s for now or should I do the commit-count?

#21 Updated by okurz about 1 month ago

tinita wrote:

So, can we live with the 25s for now or should I do the commit-count?

I would say 25s is still bearable as people only trigger the investigation on purpose and selectively. I think it can be expected that the "deep investigation" does some demanding number crunching and then takes longer.

#22 Updated by tinita about 1 month ago

Another thing which should probably be changed is that the output of the git command is logged via log_debug.
I mean, debugging is useful, but I would say this should be log_trace instead of log_debug.

https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Utils.pm#L319 run_cmd_with_log_return_error

#23 Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/4669 Don't log_debug git stdout in all cases

#25 Updated by tinita about 1 month ago

  • Status changed from Feedback to Resolved

I checked the logs/journal on o3 & osd and the log/diff output is now gone. Investigation tab still works as expected.

#26 Updated by okurz about 1 month ago

  • Due date deleted (2022-05-26)
