action #110677
closed
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #92854: [epic] limit overload of openQA webUI by heavy requests
Investigation page shouldn't involve blocking long-running API routes size:M
Added by livdywan over 2 years ago.
Updated over 2 years ago.
Category:
Feature requests
Description
Observation¶
The ajax request can be very slow e.g. https://openqa.opensuse.org/tests/2326528/investigation_ajax will take several minutes to respond.
See #106759#note-36
Acceptance criteria¶
- AC1:
/tests/*/investigation_ajax
requests are no longer able to block the prefork worker process for more than 400 seconds
- AC2: The user does not see any timeouts, even if loading of the "Investigation" tab takes a while
Suggestions¶
- Move the investigation code into a Minion job and add a route for polling its result
- Very easy to store results as job results, but performance regressions need to be considered
- As a further optimization the investigation results could be cached explicitely/longer
- Generate results even before the job finishes:
- compute git analysys once and save it to a file
- save to an asset
- possibly a bit wasteful (as we run a lot of jobs but don't need the investigation for every job)
- Trigger caching of data upon first request
- return "no data yet" on the first attempt -> trigger a minion job
- check for an existing minion job for the same job (e.g. user refreshing the page)
- next step: save to an asset internally
- Copied from action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category set to Feature requests
- Target version set to Ready
- Due date deleted (
2022-05-13)
- Parent task set to #92854
I asked myself the questions:
- Do we have a general problem with investigation_ajax? Most of the times this seems to be quite fast
- In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?
tinita wrote:
- In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?
The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link on the top of the investigation page pointing to the first bad if the current job is not the first bad already. But that means that for a reproducible failure failing for very long the further the distance between first bad and current job, the longer the log and the longer the investigation takes. I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case. But we should limit at least the time the investigation takes, potentially just pointing to the link of the "first bad vs. last good" if we can't ensure to finish sooner, or try to limit the time it takes to evaluate the individual steps, e.g. limit the log content as you suggested.
okurz wrote:
tinita wrote:
- In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?
The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link on the top of the investigation page pointing to the first bad if the current job is not the first bad already.
Of course I know that. I implemented a part of that feature.
What I mean is that in many cases that is the interesting link. Not in all cases, of course.
But the longer the log, the less probable it is that people will find the cause in there.
I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case.
Of course not. I did not suggest that, did I?
We are talking about a couple of thousand commits in the case that brought us here. We can probably limit it to 200 and be fine.
- Copied from deleted (action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M)
- Blocks action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
- Status changed from Workable to In Progress
- Assignee set to tinita
- Due date set to 2022-05-26
Setting due date based on mean cycle time of SUSE QE Tools
- Related to action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API added
I've been trying to add a limit and was wondering why the last commit wasn't showing up and found a bug in the githashToLink function:
https://github.com/os-autoinst/openQA/blob/master/assets/javascripts/test_result.js#L602
It matches the git log --stat
text output, replaces the text with HTML links, and then matches that again with a /(\<a.+?)(?=(\<a))/gms
regex which excludes the last commit because of the lookahead.
Instead of fixing that regex, I think matching the text again after manipulating the whole string after the first match, we can do better (and even save the client a few milliseconds), and I would like to create a ready-to-use data structure in that function instead of multiple matches and later split
ing.
- Status changed from In Progress to Feedback
It takes about 25 seconds to load the investigation tab now for the job in question.
(In a local test the needles_diff_stat git call took 17s, so that's the call that takes most of the time.)
If we also wanted to avoid this, then a doing a count of the commits first (and show nothing if it's > 200) would be a possibility. A git rev-list --count a..b
takes about 2-3s for the job in question.
So, can we live with the 25s for now or should I do the commit-count?
So, can we live with the 25s for now or should I do the commit-count?
I would say 25s is still bearable as people only trigger the investigation on purpose and selectively. I think it can be expected that the "deep investigation" does some demanding number crunching and then takes longer.
- Status changed from Feedback to Resolved
I checked the logs/journal on o3 & osd and the log/diff output is now gone. Investigation tab still works as expected.
- Due date deleted (
2022-05-26)
- Related to action #128591: [openqa_logwarn] logwarn reports the same entry over and over size:M added
Also available in: Atom
PDF