action #110677: Investigation page shouldn't involve blocking long-running API routes size:M - openQA Project (public) - openSUSE Project Management Tool

Custom queries

All 'new' issues w/o assignee, sorted by version/priority
All auto_review tickets
All auto_review+force_result tickets
openQA Infrastructure Project
openqa-review - Closed tickets last updated by openqa-review, last 30 days
QA roadmap long-term
QA SLE functional
QA SLE Functional - closed in last 14 days
QA SLE Functional - High, need to be refined
QA SLE Functional - over cycle time median
QA SLE u
QA SLE y
QA tools (tag not necessary in openQA and subprojects)
QA tools tag (tag not necessary in openQA and subprojects; excluding tickets in "Ready" version as they are already on the backlog)
QAC - Backlog
QE tools team - backlog (dev)
QE tools team - backlog (ready issues)
QE tools team - backlog SLA high
QE tools team - backlog SLA immediate
QE tools team - backlog SLA no immediate/urgent in feedback/blocked
QE tools team - backlog SLA normal
QE tools team - backlog SLA urgent
QE tools team - backlog SLO high
QE tools team - backlog SLO normal
QE tools team - backlog SLO urgent
QE tools team - backlog, high-level view (epics and higher)
QE tools team - backlog, non-reactive work, needs parent
QE tools team - backlog, top-level view (all sagas)
QE tools team - closed within last 14 days
QE tools team - closed within last 60 days
QE tools team - closed yesterday
QE Tools Team - Collaborative Session
QE tools team - due date forecast
QE Tools team - due soon
QE tools team - exceeding due-date
QE tools team - infrastructure backlog
QE tools team - next - sorted by update time
QE tools team - next issues
QE tools team - non-estimated (unblocked) issues (dev)
QE tools team - non-estimated (unblocked) issues (infra)
QE tools team - ready issues - Workable
QE tools team - ready, not assigned/blocked/low
QE tools team - SLO high forecast
QE tools team - update forecast
QE tools team - updated by priority
QE tools team - what members of the team are working on - Feedback (not-low)
QE Tools Team Backlog By Assignee
Tools Team Retrospective
Tools Team Retrospective (not estimated or assigned)

Actions

action #110677

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #92854: [epic] limit overload of openQA webUI by heavy requests

Investigation page shouldn't involve blocking long-running API routes size:M

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

tinita

Category:

Feature requests

Target version:

Ready

Start date:

2022-02-03

Due date:

% Done:

Estimated time:

Description

Observation¶

The ajax request can be very slow e.g. https://openqa.opensuse.org/tests/2326528/investigation_ajax will take several minutes to respond.
See #106759#note-36

Acceptance criteria¶

AC1: /tests/*/investigation_ajax requests are no longer able to block the prefork worker process for more than 400 seconds
AC2: The user does not see any timeouts, even if loading of the "Investigation" tab takes a while

Suggestions¶

Move the investigation code into a Minion job and add a route for polling its result
- Very easy to store results as job results, but performance regressions need to be considered
As a further optimization the investigation results could be cached explicitely/longer
- Generate results even before the job finishes:
- compute git analysys once and save it to a file
- save to an asset
- possibly a bit wasteful (as we run a lot of jobs but don't need the investigation for every job)
- Trigger caching of data upon first request
- return "no data yet" on the first attempt -> trigger a minion job
- check for an existing minion job for the same job (e.g. user refreshing the page)
- next step: save to an asset internally

Related issues 3 (1 open — 2 closed)

Related to openQA Project (public) - action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API

New

2022-05-30

Actions

Related to openQA Project (public) - action #128591: [openqa_logwarn] logwarn reports the same entry over and over size:M

Resolved

tinita

2023-05-03

Actions

Blocks openQA Project (public) - action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M

Resolved

livdywan

2022-02-03

Actions

Issue # Delay: days Cancel

History
Notes
Property changes

Actions

Copy link

Updated by livdywan over 2 years ago

Copied from action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

Actions

Copy link

Updated by okurz over 2 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Category set to Feature requests
Target version set to Ready

Actions

Copy link

Updated by okurz over 2 years ago

Due date deleted (~~2022-05-13~~)

Actions

Copy link

Updated by okurz over 2 years ago

Parent task set to #92854

Actions

Copy link

Updated by tinita over 2 years ago

I asked myself the questions:

Do we have a general problem with investigation_ajax? Most of the times this seems to be quite fast
In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

Actions

Copy link

Updated by okurz over 2 years ago

tinita wrote:

In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link on the top of the investigation page pointing to the first bad if the current job is not the first bad already. But that means that for a reproducible failure failing for very long the further the distance between first bad and current job, the longer the log and the longer the investigation takes. I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case. But we should limit at least the time the investigation takes, potentially just pointing to the link of the "first bad vs. last good" if we can't ensure to finish sooner, or try to limit the time it takes to evaluate the individual steps, e.g. limit the log content as you suggested.

Actions

Copy link

Updated by tinita over 2 years ago

okurz wrote:

tinita wrote:

In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?

The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link on the top of the investigation page pointing to the first bad if the current job is not the first bad already.

Of course I know that. I implemented a part of that feature.
What I mean is that in many cases that is the interesting link. Not in all cases, of course.
But the longer the log, the less probable it is that people will find the cause in there.

I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case.

Of course not. I did not suggest that, did I?

We are talking about a couple of thousand commits in the case that brought us here. We can probably limit it to 200 and be fine.

Actions

Copy link

Updated by livdywan over 2 years ago

Copied from deleted (action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M)

Actions

Copy link

Updated by livdywan over 2 years ago

Blocks action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added

Actions

Copy link

#10

Updated by tinita over 2 years ago

Status changed from Workable to In Progress
Assignee set to tinita

Actions

Copy link

#11

Updated by openqa_review over 2 years ago

Due date set to 2022-05-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#12

Updated by okurz over 2 years ago

Related to action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API added

Actions

Copy link

#13

Updated by tinita over 2 years ago

I've been trying to add a limit and was wondering why the last commit wasn't showing up and found a bug in the githashToLink function:
https://github.com/os-autoinst/openQA/blob/master/assets/javascripts/test_result.js#L602

It matches the git log --stat text output, replaces the text with HTML links, and then matches that again with a /(\<a.+?)(?=(\<a))/gms regex which excludes the last commit because of the lookahead.
Instead of fixing that regex, I think matching the text again after manipulating the whole string after the first match, we can do better (and even save the client a few milliseconds), and I would like to create a ready-to-use data structure in that function instead of multiple matches and later spliting.

Actions

Copy link

#14

Updated by tinita over 2 years ago

https://github.com/os-autoinst/openQA/pull/4654 Fix displaying of investigation git log

Actions

Copy link

#15

Updated by tinita over 2 years ago

https://github.com/os-autoinst/openQA/pull/4654 was merged

Actions

Copy link

#16

Updated by tinita over 2 years ago

https://github.com/os-autoinst/openQA/pull/4656 Limit git log in investigation tab

Actions

Copy link

#17

Updated by okurz over 2 years ago

https://github.com/os-autoinst/openQA/pull/4656 merged. Due to continuous deployment on o3 you could resolve this today after verification as soon as the update was rolled out.

Actions

Copy link

#18

Updated by tinita over 2 years ago

https://github.com/os-autoinst/openQA/pull/4656 merged

Actions

Copy link

#19

Updated by tinita over 2 years ago

Status changed from In Progress to Feedback

Actions

Copy link

#20

Updated by tinita over 2 years ago

It takes about 25 seconds to load the investigation tab now for the job in question.

(In a local test the needles_diff_stat git call took 17s, so that's the call that takes most of the time.)

If we also wanted to avoid this, then a doing a count of the commits first (and show nothing if it's > 200) would be a possibility. A git rev-list --count a..b takes about 2-3s for the job in question.

So, can we live with the 25s for now or should I do the commit-count?

Actions

Copy link

#21

Updated by okurz over 2 years ago

So, can we live with the 25s for now or should I do the commit-count?

I would say 25s is still bearable as people only trigger the investigation on purpose and selectively. I think it can be expected that the "deep investigation" does some demanding number crunching and then takes longer.

Actions

Copy link

#22

Updated by tinita over 2 years ago

Another thing which should probably be changed is that the output of the git command is logged via log_debug.
I mean, debugging is useful, but I would say this should be log_trace instead of log_debug.

https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Utils.pm#L319 run_cmd_with_log_return_error

Actions

Copy link

#23