action #73246
[osd-admins][alert] openqaworker8: Memory usage alert
Status: closed
Description
Observation

[Alerting] openqaworker8: Memory usage alert

Metric name | Value
available | 1927805952.000
Suddenly our memory usage seems to have grown excessively on that host. Checking the processes on the worker I found multiple openQA tests running qemu VMs with 32 GB of RAM each. Taking a look into one pool directory I found the job https://openqa.suse.de/tests/4810245, which is a QAM SAP HANA multi-machine test.
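For reference, a minimal sketch of how such a check can be reproduced on the worker; the exact ps invocation and grep pattern below are assumptions, not a record of the commands that were actually used:

ps -eo pid,rss,args --sort=-rss | grep '[q]emu-system' | head   # largest qemu processes first, RSS in KiB
ps -eo rss,comm | awk '/qemu-system/ {sum+=$1} END {printf "%.1f GiB\n", sum/1024/1024}'   # rough total resident memory of all qemu VMs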
Updated by okurz about 4 years ago
- Status changed from New to In Progress
- Assignee set to okurz
- Priority changed from Urgent to High
- Target version set to Ready
I paused the memory usage alert and will try to find who might help us here.
Updated by okurz about 4 years ago
- Due date set to 2020-10-23
- Status changed from In Progress to Blocked
wrote an email to openqa@suse.de and qa-maintenance@suse.de, waiting for a corresponding response on the mailing lists.
Updated by okurz about 4 years ago
we also see this problem in auto-review now, e.g. https://openqa.suse.de/tests/4810157#comments
Updated by okurz about 4 years ago
- Related to action #73405: job incompletes with "(?s)openqaworker8.*terminated prematurely.*OpenCV Error: Insufficient memory" added
Updated by okurz about 4 years ago
- Status changed from Blocked to In Progress
have created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/266 to reduce the number of worker instances.
EDIT: Looked into the audit log to see if I could identify who added or extended the test schedule or significantly increased the memory usage. Also asked a person in https://chat.suse.de/channel/testing?msg=kA8jfTPzocpRrbN2B who might have done that, as I could not easily identify who changed job templates and/or who actually changed something about the scenario "qam-sles4sap_hana".
Merged the above MR. As the CI pipeline currently does not trigger after merge, I need to apply the changes manually:
ssh osd
cd /srv/pillar/
git pull --rebase origin master
salt -l error --state-output=changes 'openqaworker[89]*' state.apply test=True
salt -l error --state-output=changes 'openqaworker[89]*' state.apply
salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl disable --now openqa-worker@{21..24}; systemctl mask openqa-worker@{21..24}'
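To double-check the result one could additionally run something along these lines; this is a suggested verification step, not part of the original procedure:

salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl is-enabled openqa-worker@{21..24}'   # should report "masked" for each extra instance
salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl list-units "openqa-worker@*" --no-legend'   # show the worker instances that remain active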
EDIT: Thankfully jadamek has responded and pulled me into https://chat.suse.de/channel/asg-qe.maintenance?msg=dinTixBAZSKfMhrz6 where we are also trying to address the question.
Updated by okurz about 4 years ago
- Status changed from In Progress to Resolved
Memory usage has looked ok over the past days, so I unpaused the alert https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&panelId=12054&fullscreen&edit&tab=alert&refresh=1m again.
jadamek stated that they improved the mentioned test cluster to use less RAM for the individual machines and to not use the 32 GB RAM machine for the "supportserver" but the default "64bit" one with less RAM. This should also help.
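For illustration, such a reduction amounts to a change of the scenario settings roughly like the following; the concrete values and the machine name "64bit-32GBRAM" are made-up placeholders, only QEMURAM (the openQA/os-autoinst setting for qemu RAM in MB) and the default "64bit" machine come from the context above:

# hypothetical before: supportserver scheduled on a dedicated 32 GB machine
MACHINE=64bit-32GBRAM
QEMURAM=32768
# hypothetical after: default machine with much less RAM assigned
MACHINE=64bit
QEMURAM=2048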
Updated by jadamek about 4 years ago
This new configuration should help to improve the situation: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/277
Updated by okurz over 3 years ago
- Related to action #90857: Add redundancy for SAP multi machines tests - Extend openQA worker config to accomodate for upgraded RAM added