action #96713
closed
Slow grep in openqa-label-known-issues leads to high CPU usage
Added by tinita over 2 years ago.
Updated over 2 years ago.
Description
Observation
On 2021-08-10 we experienced problems on OSD, such as high load, and one of the reasons seemed to be slow grep commands called by openqa-label-known-issues.
For example this regex seemed to be problematic:
grep -qPzo '(?s)2021-.*T.*Error connecting to <root@s390p.*.suse.de>: No route to host' /tmp/tmp.gy3iDKQDrV
It sometimes ran for over 4 minutes. (Removing the (?s) made it run in only a few milliseconds on a file of about 500 kB.)
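The effect can be illustrated with a small, hypothetical example (file name, timestamp, and hostname invented). With -z the whole file becomes a single record, and (?s) lets '.' match newlines, so each '.*' can backtrack across the entire file; without (?s), '.' stops at the end of a line and matching stays cheap:

```shell
# Create a tiny stand-in log file (contents invented for illustration)
printf '2021-08-10T12:00:00 starting test\n%s\n' \
  'Error connecting to <root@host>: No route to host' > /tmp/demo.log

# Slow form on large files: with -z and (?s), each '.*' may span the
# whole file and backtrack heavily
grep -qPzo '(?s)2021-.*T.*No route to host' /tmp/demo.log && echo "matched (multiline)"

# Cheap alternative: check the parts line by line instead
grep -qP '^2021-.*T' /tmp/demo.log && grep -qP 'No route to host' /tmp/demo.log \
  && echo "matched (per line)"
```

On a 1 KB file both forms return instantly; the difference only shows on logs in the hundreds of kilobytes, as observed above.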
We are now lowering the timeout for the post fail hook:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/538
But that means the whole process can time out because of one bad regex.
Suggestions
- Study current grep logic in openqa-label-known-issues
- The mentioned regex comes from this issue: #93119
- Maybe the regex can be improved
- Add a timeout to the grep call
- Make sure we are informed in some way of timed out grep calls so we can improve the regexes
- Reduce limit=200 (the maximum number of jobs to check)
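The timeout suggestion could look roughly like the sketch below; the function name, the 5-second limit, and the log file name are assumptions for illustration, not the script's actual code:

```shell
# Hypothetical wrapper: bound each grep call and report patterns that
# hit the limit so the offending regex can be improved later.
search_log() {
    local pattern=$1 logfile=$2
    # GNU timeout sends TERM after 5 seconds and exits with status 124
    timeout 5 grep -qPzo "$pattern" "$logfile"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "grep timed out for pattern: $pattern" >&2
    fi
    return "$rc"
}

printf 'Error: No route to host\n' > /tmp/label-demo.log
search_log 'No route to host' /tmp/label-demo.log && echo "pattern found"
```

Logging to stderr on status 124 also addresses the suggestion that timed-out grep calls should be surfaced somewhere visible.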
- Copied from action #96710: Error `Can't call method "write" on an undefined value` shows up in worker log leading to incompletes added
- Description updated (diff)
Add a timeout to the grep call
Thinking out loud here: the script has some hard-coded regexes for which grep is called individually, but the results from Progress are passed in one go. So I assume that to allow a timeout per title we'd have to add the overhead of calling grep every single time.
It might be more beneficial to join the other grep calls and speed up the whole thing instead?
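The joining idea could be sketched like this (patterns and file name invented for illustration): collect all the titles' patterns and scan the log in a single grep invocation instead of spawning one process per title:

```shell
# Hypothetical: combine per-title patterns into one grep call
patterns=(
  'No route to host'
  'Connection refused'
)
args=()
for p in "${patterns[@]}"; do
  args+=(-e "$p")
done
printf 'Error: No route to host\n' > /tmp/join-demo.log
# -o prints the matching text, so hits can still be attributed
# back to the pattern (and hence the issue title) that produced them
grep -oE "${args[@]}" /tmp/join-demo.log
```

One pass over the log replaces N passes, at the cost of having to map each printed match back to its title afterwards.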
cdywan wrote:
Thinking out loud here. The script has some hard-coded regexes where grep is called individually, but results from progress are passed in one go.
Not sure what you mean.
We currently call grep for every title.
- Status changed from New to In Progress
- Assignee set to tinita
- Status changed from In Progress to Feedback
The PR was merged and the scripts on OSD were updated.
I'll watch the logs.
- Target version set to Ready
- Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added
- Due date set to 2021-09-03
- Priority changed from Urgent to High
- Assignee changed from tinita to okurz
- Priority changed from High to Normal
- Due date changed from 2021-09-03 to 2021-09-10
- Status changed from Feedback to In Progress
Since the MR was merged I don't see problems. On OSD I could confirm that sometimes grep processes run with nice level 19, as expected. Monitoring in htop I see that telegraf pops up often, but only with a nice level of 10. Mostly it's "openqa prefork" processes, so that's fine. I don't think my change had a negative impact. However, we often see all 10 cores booked, so we should ask for an increase in the number of cores. We have also seen CPU load alerts lately; we can consider bumping the CPU load alert limits a bit as well.
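For reference, the renicing described above can be reproduced with a one-liner (file name invented; whether the script uses exactly this form is an assumption):

```shell
printf 'Error connecting\n' > /tmp/nice-demo.log
# Run the search at the lowest scheduling priority so it yields the CPU
# to the openQA web UI and worker processes under load
nice -n 19 grep -qP 'Error connecting' /tmp/nice-demo.log && echo "found at nice 19"
```

Unprivileged users may always lower their own priority (raise the nice value), so no root is needed for this.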
- Copied to action #97943: Increase number of CPU cores on OSD VM due to high usage size:S added
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved