action #73246
[osd-admins][alert] openqaworker8: Memory usage alert
Status: closed
Description
Observation

[Alerting] openqaworker8: Memory usage alert

Metric name | Value
available | 1927805952.000
Suddenly our memory usage seems to have grown excessively on that host. Checking the processes on the worker I found multiple openQA tests running qemu VMs with 32 GB of RAM each. Taking a look into one pool directory I found the job https://openqa.suse.de/tests/4810245, which is a QAM SAP HANA multi-machine test.
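For reference, a minimal sketch of how such a check can be reproduced on the worker; the exact ps invocation and grep pattern below are assumptions, not a record of the commands that were actually used:

ps -eo pid,rss,args --sort=-rss | grep '[q]emu-system' | head   # largest qemu processes first, RSS in KiB
ps -eo rss,comm | awk '/qemu-system/ {sum+=$1} END {printf "%.1f GiB\n", sum/1024/1024}'   # rough total resident memory of all qemu VMs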
Updated by okurz about 4 years ago
- Status changed from New to In Progress
- Assignee set to okurz
- Priority changed from Urgent to High
- Target version set to Ready
I paused the memory usage alert and will try to find who might help us here.
Updated by okurz about 4 years ago
- Due date set to 2020-10-23
- Status changed from In Progress to Blocked
wrote an email to openqa@suse.de and qa-maintenance@suse.de, waiting for a corresponding response on the mailing lists.
Updated by okurz about 4 years ago
we also see this problem in auto-review now, e.g. https://openqa.suse.de/tests/4810157#comments
Updated by okurz about 4 years ago
- Related to action #73405: job incompletes with "(?s)openqaworker8.*terminated prematurely.*OpenCV Error: Insufficient memory" added
Updated by okurz about 4 years ago
- Status changed from Blocked to In Progress
have created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/266 to reduce the number of worker instances.
EDIT: Looked into the audit log to see if I could identify who added or extended the test schedule or significantly increased the memory usage. Also asked a person in https://chat.suse.de/channel/testing?msg=kA8jfTPzocpRrbN2B who might have done that, as I could not easily identify who changed job templates and/or who actually changed something about the scenario "qam-sles4sap_hana".
Merged the above MR. As the CI pipeline currently does not trigger after merge, I need to apply the changes manually:
ssh osd
cd /srv/pillar/
git pull --rebase origin master
salt -l error --state-output=changes 'openqaworker[89]*' state.apply test=True
salt -l error --state-output=changes 'openqaworker[89]*' state.apply
salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl disable --now openqa-worker@{21..24}; systemctl mask openqa-worker@{21..24}'
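To double-check the result one could additionally run something along these lines; this is a suggested verification step, not part of the original procedure:

salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl is-enabled openqa-worker@{21..24}'   # should report "masked" for each extra instance
salt -l error --state-output=changes 'openqaworker[89]*' cmd.run 'systemctl list-units "openqa-worker@*" --no-legend'   # show the worker instances that remain active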
EDIT: Thankfully jadamek has responded and pulled me into https://chat.suse.de/channel/asg-qe.maintenance?msg=dinTixBAZSKfMhrz6 where we are also trying to address the question.
Updated by okurz about 4 years ago
- Status changed from In Progress to Resolved
Memory usage has looked ok over the past days, so I unpaused the alert https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?orgId=1&panelId=12054&fullscreen&edit&tab=alert&refresh=1m again.
jadamek stated that they improved the mentioned test cluster to use less RAM for the individual machines and to not use the 32 GB RAM machine for the "supportserver" but the default "64bit" one with less RAM. This should also help.
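For illustration, such a reduction amounts to a change of the scenario settings roughly like the following; the concrete values and the machine name "64bit-32GBRAM" are made-up placeholders, only QEMURAM (the openQA/os-autoinst setting for qemu RAM in MB) and the default "64bit" machine come from the context above:

# hypothetical before: supportserver scheduled on a dedicated 32 GB machine
MACHINE=64bit-32GBRAM
QEMURAM=32768
# hypothetical after: default machine with much less RAM assigned
MACHINE=64bit
QEMURAM=2048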
Updated by jadamek about 4 years ago
This new configuration should help to improve the situation: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/277
Updated by okurz over 3 years ago
- Related to action #90857: Add redundancy for SAP multi machines tests - Extend openQA worker config to accomodate for upgraded RAM added