action #73246

closed

[osd-admins][alert] openqaworker8: Memory usage alert

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2020-10-12
Due date: 2020-10-23
% Done: 0%
Estimated time:
Description

Observation

*/[Alerting] openqaworker8: Memory usage alert/* 

*Metric name*: available
*Value*: 1927805952.000

http://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?fullscreen&edit&tab=alert&panelId=12054&orgId=1

Suddenly the memory usage on that host seems to have grown excessively. Checking processes on the worker, I found multiple openQA tests running qemu VMs with 32GB of RAM. Looking into one pool dir, I found the job https://openqa.suse.de/tests/4810245, which is a qam sap hana multi-machine test.
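
To put the alert value into perspective, here is a minimal sketch that converts the reported "available" byte count to GiB and lists qemu processes by resident memory. The function names `bytes_to_gib` and `list_qemu_by_rss` are hypothetical helpers, not part of openQA, and the process listing assumes procps-ng `ps` and that the VM processes have "qemu" in their command name.

```shell
#!/bin/sh
# Convert a byte count (as reported by the alert) to GiB with two decimals.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# List qemu processes sorted by resident memory (RSS, in KiB), largest first.
# Keeps the ps header line; the "qemu" name match is an assumption.
list_qemu_by_rss() {
    ps -eo pid,rss,args --sort=-rss | awk 'NR == 1 || $3 ~ /qemu/'
}

# The "available" value from the alert above.
bytes_to_gib 1927805952   # prints 1.80
```

Running `list_qemu_by_rss` on the worker would show which VMs hold the most memory. The full command line in the `args` column includes the `-m` option, so the configured RAM size is visible there as well.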


Related issues: 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #73405: job incompletes with "(?s)openqaworker8.*terminated prematurely.*OpenCV Error: Insufficient memory" (Resolved, okurz, 2020-10-15)

Related to openQA Infrastructure - action #90857: Add redundancy for SAP multi machines tests - Extend openQA worker config to accomodate for upgraded RAM (Resolved, okurz, 2021-08-03)

Actions #1

Updated by okurz over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Urgent to High
  • Target version set to Ready

I paused the memory usage alert and will try to find out who might be able to help us here.

Actions #2

Updated by okurz over 3 years ago

  • Due date set to 2020-10-23
  • Status changed from In Progress to Blocked

Wrote an email to openqa@suse.de and qa-maintenance@suse.de. Waiting for a response on the mailing list.

Actions #3

Updated by okurz over 3 years ago

We now also see this problem in auto-review, e.g. https://openqa.suse.de/tests/4810157#comments

Actions #4

Updated by okurz over 3 years ago

  • Related to action #73405: job incompletes with "(?s)openqaworker8.*terminated prematurely.*OpenCV Error: Insufficient memory" added

Actions #5

Updated by okurz over 3 years ago

  • Status changed from Blocked to In Progress

Created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/266 to reduce the number of worker instances.

EDIT: I am looking through the audit log to see whether I can identify who might have added or extended the test schedule or increased the memory usage significantly. I also asked a person in https://chat.suse.de/channel/testing?msg=kA8jfTPzocpRrbN2B who might have done that, as I could not easily identify who changed the job templates or anything else about the scenario "qam-sles4sap_hana".

Merged the above MR. As the CI pipeline currently does not trigger after merge, I need to do that manually:

ssh osd
cd /srv/pillar/
git pull --rebase origin master 
salt -l error --state-output=changes 'openqaworker[89]*' state.apply test=True
salt -l error --state-output=changes 'openqaworker[89]*' state.apply
salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl disable --now openqa-worker@{21..24}; systemctl mask openqa-worker@{21..24}'
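
As a follow-up check, the last command above can be verified per unit. This is a minimal sketch, assuming it runs on the worker host itself (or is wrapped in `salt ... cmd.run`). `worker_units` and `check_units` are hypothetical helper names. The loop spells out in plain POSIX shell the same unit names that bash brace expansion produces from `openqa-worker@{21..24}`.

```shell
#!/bin/sh
# Print the systemd unit names for a range of worker instances,
# one per line, e.g. openqa-worker@21 ... openqa-worker@24.
worker_units() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf 'openqa-worker@%s\n' "$i"
        i=$((i + 1))
    done
}

# Report the enablement state of each unit as seen by systemd.
# After "systemctl mask", "systemctl is-enabled" reports "masked".
check_units() {
    for unit in $(worker_units "$1" "$2"); do
        printf '%s: %s\n' "$unit" "$(systemctl is-enabled "$unit" 2>/dev/null)"
    done
}

# Show which units the disable/mask command above targets.
worker_units 21 24
```

Calling `check_units 21 24` on openqaworker8 or openqaworker9 would then be expected to list all four instances as masked if the change was applied.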

EDIT: Thankfully, jadamek has responded and pulled me into https://chat.suse.de/channel/asg-qe.maintenance?msg=dinTixBAZSKfMhrz6 where we are also trying to address the question.

Actions #6

Updated by okurz over 3 years ago

  • Status changed from In Progress to Resolved

Memory usage has looked OK over the past days, so I unpaused the alert https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&panelId=12054&fullscreen&edit&tab=alert&refresh=1m again.

jadamek stated that they improved the mentioned test cluster to use less RAM for the individual machines and to no longer use the 32GB RAM machine for the "supportserver" but the default "64bit" one with less RAM. This should also help.

Actions #7

Updated by jadamek over 3 years ago

This new configuration should help to improve the situation:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/277

Actions #8

Updated by okurz almost 3 years ago

  • Related to action #90857: Add redundancy for SAP multi machines tests - Extend openQA worker config to accomodate for upgraded RAM added