action #110677
closedcoordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #92854: [epic] limit overload of openQA webUI by heavy requests
Investigation page shouldn't involve blocking long-running API routes size:M
Description
Observation¶
The ajax request can be very slow e.g. https://openqa.opensuse.org/tests/2326528/investigation_ajax will take several minutes to respond.
See #106759#note-36
Acceptance criteria¶
- AC1:
/tests/*/investigation_ajax
requests are no longer able to block the prefork worker process for more than 400 seconds - AC2: The user does not see any timeouts, even if loading of the "Investigation" tab takes a while
Suggestions¶
- Move the investigation code into a Minion job and add a route for polling its result
- Very easy to store results as job results, but performance regressions need to be considered
- As a further optimization the investigation results could be cached explicitely/longer
- Generate results even before the job finishes:
- compute git analysys once and save it to a file
- save to an asset
- possibly a bit wasteful (as we run a lot of jobs but don't need the investigation for every job)
- Trigger caching of data upon first request
- return "no data yet" on the first attempt -> trigger a minion job
- check for an existing minion job for the same job (e.g. user refreshing the page)
- next step: save to an asset internally
Updated by livdywan over 2 years ago
- Copied from action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Updated by okurz over 2 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category set to Feature requests
- Target version set to Ready
Updated by tinita over 2 years ago
I asked myself the questions:
- Do we have a general problem with investigation_ajax? Most of the times this seems to be quite fast
- In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?
Updated by okurz over 2 years ago
tinita wrote:
- In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?
The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link on the top of the investigation page pointing to the first bad if the current job is not the first bad already. But that means that for a reproducible failure failing for very long the further the distance between first bad and current job, the longer the log and the longer the investigation takes. I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case. But we should limit at least the time the investigation takes, potentially just pointing to the link of the "first bad vs. last good" if we can't ensure to finish sooner, or try to limit the time it takes to evaluate the individual steps, e.g. limit the log content as you suggested.
Updated by tinita over 2 years ago
okurz wrote:
tinita wrote:
- In the cases where it's slow this seems to be caused by a very long log, and there I question the value of a long log. In case of hundresd of commits the error is probably elsewhere. We could limit the log to a certain number of commits. It's usually the last good vs. the first bad anyway, right, and then the log should be quite small?
The investigation information is not "first bad vs. last good" but "current job vs. last good". There is a link on the top of the investigation page pointing to the first bad if the current job is not the first bad already.
Of course I know that. I implemented a part of that feature.
What I mean is that in many cases that is the interesting link. Not in all cases, of course.
But the longer the log, the less probable it is that people will find the cause in there.
I do not want to prescribe to the user that it only makes sense to look at the "first bad vs. last good" in every case.
Of course not. I did not suggest that, did I?
We are talking about a couple of thousand commits in the case that brought us here. We can probably limit it to 200 and be fine.
Updated by livdywan over 2 years ago
- Copied from deleted (action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M)
Updated by livdywan over 2 years ago
- Blocks action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M added
Updated by tinita over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
- Related to action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API added
Updated by tinita over 2 years ago
I've been trying to add a limit and was wondering why the last commit wasn't showing up and found a bug in the githashToLink function:
https://github.com/os-autoinst/openQA/blob/master/assets/javascripts/test_result.js#L602
It matches the git log --stat
text output, replaces the text with HTML links, and then matches that again with a /(\<a.+?)(?=(\<a))/gms
regex which excludes the last commit because of the lookahead.
Instead of fixing that regex, I think matching the text again after manipulating the whole string after the first match, we can do better (and even save the client a few milliseconds), and I would like to create a ready-to-use data structure in that function instead of multiple matches and later split
ing.
Updated by tinita over 2 years ago
https://github.com/os-autoinst/openQA/pull/4654 Fix displaying of investigation git log
Updated by tinita over 2 years ago
Updated by tinita over 2 years ago
https://github.com/os-autoinst/openQA/pull/4656 Limit git log in investigation tab
Updated by okurz over 2 years ago
https://github.com/os-autoinst/openQA/pull/4656 merged. Due to continuous deployment on o3 you could resolve this today after verification as soon as the update was rolled out.
Updated by tinita over 2 years ago
Updated by tinita over 2 years ago
It takes about 25 seconds to load the investigation tab now for the job in question.
(In a local test the needles_diff_stat git call took 17s, so that's the call that takes most of the time.)
If we also wanted to avoid this, then a doing a count of the commits first (and show nothing if it's > 200) would be a possibility. A git rev-list --count a..b
takes about 2-3s for the job in question.
So, can we live with the 25s for now or should I do the commit-count?
Updated by okurz over 2 years ago
So, can we live with the 25s for now or should I do the commit-count?
I would say 25s is still bearable as people only trigger the investigation on purpose and selectively. I think it can be expected that the "deep investigation" does some demanding number crunching and then takes longer.
Updated by tinita over 2 years ago
Another thing which should probably be changed is that the output of the git command is logged via log_debug
.
I mean, debugging is useful, but I would say this should be log_trace
instead of log_debug
.
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Utils.pm#L319 run_cmd_with_log_return_error
Updated by tinita over 2 years ago
https://github.com/os-autoinst/openQA/pull/4669 Don't log_debug git stdout in all cases
Updated by tinita over 2 years ago
Updated by tinita over 2 years ago
- Status changed from Feedback to Resolved
I checked the logs/journal on o3 & osd and the log/diff output is now gone. Investigation tab still works as expected.
Updated by tinita over 1 year ago
- Related to action #128591: [openqa_logwarn] logwarn reports the same entry over and over size:M added