action #153958
closed: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic
Added by tinita 12 months ago. Updated 10 months ago.
Description
Observation
Date: Fri, 19 Jan 2024 11:55:37 +0100
1 firing alert instance
[IMAGE]
GROUPED BY
hostname=s390zl12
1 firing instances
Firing [stats.openqa-monitor.qa.suse.de]
s390zl12: Memory usage alert
View alert [stats.openqa-monitor.qa.suse.de]
Values
A0=0.06117900738663373
Labels
alertname
s390zl12: Memory usage alert
grafana_folder
Generic
hostname
s390zl12
rule_uid
memory_usage_alert_s390zl12
http://stats.openqa-monitor.qa.suse.de/alerting/grafana/memory_usage_alert_s390zl12/view?orgId=1
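A rough reading of that value, as an assumption since the alert expression is not shown here: A0 looks like the fraction of memory still available on the host, so with the roughly 48GB assigned at the time (see the analysis below)

    0.0612 * 48 GB ≈ 2.9 GB

of memory would have been left available when the alert fired.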
Rollback steps
Remove silence "alertname=s390zl12: Memory usage alert" from https://stats.openqa-monitor.qa.suse.de/alerting/silences
Updated by okurz 12 months ago
@tinita when you ask about such an alert and receive a response over other channels, in this case Slack, please also include the relevant answers in the ticket.
So the memory usage alert is likely due to too many or too big KVM instances on it.
Looking at https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?viewPanel=12054&orgId=1&from=1705654256531&to=1705665689914
from 11:30 to 11:50 the available memory dropped steadily and significantly, and after that recovered more slowly until about 12:30 when it was ok again. I assume some openQA-spawned VMs caused this.
I queried the database with
openqa=> select j.id, test, t_started, result, js.value
         from jobs j
         join worker_properties wp on j.assigned_worker_id = wp.worker_id
         join job_settings js on js.job_id = j.id
         where arch = 's390x' and t_finished >= '2024-01-19' and result != 'passed'
           and wp.key = 'WORKER_CLASS' and wp.value ~ 's390zl12' and js.key = 'QEMURAM'
         order by js.value desc;
and found
id | test | t_started | result | value
----------+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------------+-------
13289263 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 10:26:27 | failed | 8192
13289262 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 10:26:22 | failed | 8192
13289374 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 11:12:25 | failed | 8192
13289271 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 10:27:03 | failed | 8192
13289344 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 10:55:40 | failed | 8192
13289382 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 11:20:50 | failed | 8192
13289283 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 10:33:03 | failed | 8192
13289560 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863 | 2024-01-19 12:00:47 | failed | 8192
13288222 | python_3.6_on_SLES_15-SP5_podman | 2024-01-19 04:26:49 | softfailed | 6144
13284951 | golang_oldstable_on_SLES_15-SP5_podman | 2024-01-19 00:26:31 | softfailed | 6144
13285156 | php-fpm_8_on_SLES_15-SP5_docker | 2024-01-19 00:37:17 | softfailed | 6144
13285171 | python_3.11_on_SLES_15-SP5_podman | 2024-01-19 00:41:49 | softfailed | 6144
13285069 | openjdk_17_on_SLES_15-SP5_podman | 2024-01-19 00:32:33 | softfailed | 6144
13286351 | rmt_on_SLES_15-SP5_docker:investigate:last_good_build:13.1_rmt-server-image | 2024-01-19 05:03:18 | failed | 4096
13288559 | rmt_on_SLES_15-SP5_docker:investigate:retry | 2024-01-19 05:08:37 | failed | 4096
13288769 | cc_atsec:investigate:bisect_without_32126 | 2024-01-19 06:20:09 | failed | 4096
13288697 | cc_atsec:investigate:bisect_without_32124 | 2024-01-19 05:50:16 | failed | 4096
13286388 | cc_atsec:investigate:bisect_without_32126 | 2024-01-19 05:23:35 | failed | 4096
13285311 | git_on_SLES_15-SP5_docker | 2024-01-19 00:46:37 | softfailed | 4096
13282452 | sle_autoyast_support_image_gnome_12sp5 | 2024-01-19 01:33:48 | incomplete | 4096
13288338 | pcp_5_on_SLES_15-SP5_docker | 2024-01-19 04:36:18 | softfailed | 4096
13289123 | ltp_syscalls | 2024-01-19 09:26:27 | softfailed | 4096
13286194 | ltp_syscalls | 2024-01-19 04:05:19 | failed | 4096
13286349 | rmt_on_SLES_15-SP5_docker:investigate:retry | 2024-01-19 05:03:18 | failed | 4096
13286350 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053 | 2024-01-19 05:03:18 | failed | 4096
13286344 | rmt_on_SLES_15-SP5_podman:investigate:retry | 2024-01-19 04:54:18 | failed | 4096
13286387 | cc_atsec:investigate:bisect_without_32124 | 2024-01-19 05:20:46 | failed | 4096
13286391 | cc_atsec:investigate:bisect_without_32144 | 2024-01-19 05:23:35 | failed | 4096
13288698 | cc_atsec:investigate:bisect_without_32126 | 2024-01-19 05:50:16 | failed | 4096
13288382 | registry_2.8_on_SLES_15-SP5_docker | 2024-01-19 04:37:48 | softfailed | 4096
13288557 | rmt_on_SLES_15-SP5_podman:investigate:last_good_tests_and_build:efad569fffc7a34751ef76ad7f651b496793d053+13.1_rmt-server-image | 2024-01-19 05:07:48 | failed | 4096
13288555 | rmt_on_SLES_15-SP5_podman:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053 | 2024-01-19 05:06:18 | failed | 4096
13288575 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests_and_build:efad569fffc7a34751ef76ad7f651b496793d053+13.1_rmt-server-image | 2024-01-19 05:14:20 | failed | 4096
13286345 | rmt_on_SLES_15-SP5_podman:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053 | 2024-01-19 04:57:18 | failed | 4096
13286352 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests_and_build:efad569fffc7a34751ef76ad7f651b496793d053+13.1_rmt-server-image | 2024-01-19 05:04:48 | failed | 4096
13288570 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053 | 2024-01-19 05:10:36 | failed | 4096
13285793 | ltp_syscalls | 2024-01-19 03:45:19 | softfailed | 4096
13288558 | rmt_on_SLES_15-SP5_docker:investigate:last_good_build:13.1_rmt-server-image | 2024-01-19 05:08:37 | failed | 4096
13284404 | slem_containers_selinux:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1 | 2024-01-19 05:54:46 | softfailed | 2048
13285482 | slem_containers:investigate:bisect_without_32152 | 2024-01-19 06:41:21 | failed | 2048
13286307 | slem_containers:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1 | 2024-01-19 07:09:24 | softfailed | 2048
13281819 | slem_containers | 2024-01-19 00:05:22 | failed | 2048
13284272 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce | 2024-01-19 05:14:20 | failed | 2048
13283454 | fips_ker_mode_tests_crypt_tool | 2024-01-19 02:52:23 | failed | 2048
13286195 | ltp_syscalls_debug_pagealloc | 2024-01-19 04:05:19 | softfailed | 2048
13284266 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce | 2024-01-19 05:09:10 | failed | 2048
13288588 | fips_ker_mode_tests_crypt_tool | 2024-01-19 05:23:36 | failed | 2048
13288701 | slem_containers | 2024-01-19 06:07:18 | failed | 2048
13285481 | slem_containers:investigate:bisect_without_32126 | 2024-01-19 06:35:16 | failed | 2048
13286209 | slem_containers_selinux:investigate:bisect_without_32152 | 2024-01-19 07:04:09 | failed | 2048
13288850 | slem_containers_selinux:investigate:retry | 2024-01-19 08:17:54 | failed | 2048
13289132 | ltp_syscalls_debug_pagealloc | 2024-01-19 09:28:51 | softfailed | 2048
13281906 | slem_containers | 2024-01-19 01:20:18 | failed | 2048
13285635 | ltp_syscalls_debug_pagealloc | 2024-01-19 03:36:09 | failed | 2048
13285471 | slem_containers_selinux:investigate:last_good_build:20240117-1 | 2024-01-19 06:22:24 | failed | 2048
13285523 | slem_containers:investigate:last_good_build:20240117-1 | 2024-01-19 06:50:38 | failed | 2048
13286304 | slem_containers:investigate:last_good_build:20240117-1 | 2024-01-19 07:09:24 | failed | 2048
13288838 | slem_containers:investigate:bisect_without_32124 | 2024-01-19 08:14:51 | failed | 2048
13288888 | slem_containers:investigate:bisect_without_32124 | 2024-01-19 08:34:24 | failed | 2048
13282455 | sle_autoyast_support_image_gnome_12sp5_sdk_lp_asmm_contm_lgm_pcm_tcm_wsm_all_patterns | 2024-01-19 02:13:03 | failed | 2048
13284347 | mau-bootloader:investigate:retry | 2024-01-19 05:16:16 | failed | 2048
13285525 | slem_containers:investigate:bisect_without_32124 | 2024-01-19 06:53:38 | failed | 2048
13288836 | slem_containers:investigate:last_good_build:20240117-1 | 2024-01-19 08:13:38 | failed | 2048
13289023 | create_hdd_autoyast_containers | 2024-01-19 08:50:53 | failed | 2048
13286189 | ltp_cve_git | 2024-01-19 03:21:13 | failed | 2048
13284402 | slem_containers_selinux:investigate:last_good_tests:595e0dea8db29d8305370398904f071d5fd69687 | 2024-01-19 05:53:19 | softfailed | 2048
13285521 | slem_containers:investigate:retry | 2024-01-19 06:41:22 | failed | 2048
13286208 | slem_containers_selinux:investigate:bisect_without_32126 | 2024-01-19 07:01:08 | failed | 2048
13288855 | slem_containers_selinux:investigate:bisect_without_32126 | 2024-01-19 08:19:23 | failed | 2048
13289126 | ltp_syscalls_debug_pagealloc | 2024-01-19 09:27:28 | softfailed | 2048
13285473 | slem_containers_selinux:investigate:bisect_without_32124 | 2024-01-19 06:24:18 | failed | 2048
13287960 | fips_ker_mode_tests_crypt_tool | 2024-01-19 04:51:18 | failed | 2048
13288025 | slem_containers:investigate:bisect_without_32126 | 2024-01-19 07:32:21 | failed | 2048
13286313 | slem_containers:investigate:bisect_without_32152 | 2024-01-19 07:11:16 | failed | 2048
13288835 | slem_containers:investigate:last_good_tests:595e0dea8db29d8305370398904f071d5fd69687 | 2024-01-19 08:12:08 | softfailed | 2048
13286439 | sle_autoyast_support_image_gnome_12sp5_sdk_lp_asmm_contm_lgm_pcm_tcm_wsm_all_patterns | 2024-01-19 03:26:29 | failed | 2048
13285950 | ltp_net_nfs | 2024-01-19 03:54:48 | softfailed | 2048
13288579 | mau-bootloader:investigate:retry | 2024-01-19 05:42:47 | failed | 2048
13288582 | mau-bootloader:investigate:last_good_build:20240117-1 | 2024-01-19 05:47:16 | failed | 2048
13285472 | slem_containers_selinux:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1 | 2024-01-19 06:23:46 | softfailed | 2048
13286205 | slem_containers_selinux:investigate:last_good_build:20240117-1 | 2024-01-19 06:59:38 | failed | 2048
13288856 | slem_containers_selinux:investigate:bisect_without_32152 | 2024-01-19 08:20:54 | failed | 2048
13286358 | slem_containers | 2024-01-19 03:22:48 | failed | 2048
13286442 | sle_autoyast_create_hdd_gnome_12sp5_sdk_all_patterns | 2024-01-19 04:48:18 | failed | 2048
13288584 | mau-bootloader:investigate:last_good_tests_and_build:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce+20240117-1 | 2024-01-19 05:48:46 | failed | 2048
13284379 | slem_containers:investigate:last_good_build:20240117-1 | 2024-01-19 05:53:16 | failed | 2048
13284406 | slem_containers_selinux:investigate:bisect_without_32152 | 2024-01-19 06:22:21 | failed | 2048
13286206 | slem_containers_selinux:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1 | 2024-01-19 07:01:08 | softfailed | 2048
13288852 | slem_containers_selinux:investigate:last_good_build:20240117-1 | 2024-01-19 08:17:53 | failed | 2048
13281772 | slem_containers | 2024-01-18 23:50:50 | failed | 2048
13284273 | mau-bootloader:investigate:last_good_build:20240117-1 | 2024-01-19 05:16:16 | failed | 2048
13285785 | ltp_syscalls_debug_pagealloc | 2024-01-19 03:39:41 | failed | 2048
13288585 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce | 2024-01-19 05:48:46 | failed | 2048
13284401 | slem_containers_selinux:investigate:retry | 2024-01-19 05:53:16 | failed | 2048
13285478 | slem_containers:investigate:last_good_build:20240117-1 | 2024-01-19 06:33:24 | failed | 2048
13286207 | slem_containers_selinux:investigate:bisect_without_32124 | 2024-01-19 07:01:08 | failed | 2048
13288837 | slem_containers:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1 | 2024-01-19 08:13:38 | softfailed | 2048
13284274 | mau-bootloader:investigate:last_good_tests_and_build:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce+20240117-1 | 2024-01-19 05:16:16 | failed | 2048
13288586 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce | 2024-01-19 05:48:46 | failed | 2048
13284382 | slem_containers:investigate:bisect_without_32152 | 2024-01-19 05:53:16 | failed | 2048
13288839 | slem_containers:investigate:bisect_without_32126 | 2024-01-19 08:16:23 | failed | 2048
13285469 | slem_containers_selinux:investigate:last_good_tests:595e0dea8db29d8305370398904f071d5fd69687 | 2024-01-19 06:22:21 | softfailed | 2048
13288020 | slem_containers:investigate:retry | 2024-01-19 07:32:21 | failed | 2048
13286203 | slem_containers_selinux:investigate:retry | 2024-01-19 06:58:08 | failed | 2048
13288889 | slem_containers:investigate:last_good_build:20240117-1 | 2024-01-19 08:35:32 | failed | 2048
13289127 | ltp_cve_git | 2024-01-19 09:28:06 | failed | 2048
13284349 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce | 2024-01-19 05:17:46 | failed | 2048
13288590 | mau-bootloader:investigate:last_good_build:20240117-1 | 2024-01-19 05:50:16 | failed | 2048
(108 rows)
Considering that the machine has only 48GB of memory assigned, that multiple jobs are quite heavy, and that we currently run 10 openQA worker instances, it's quite likely that the machine runs out of memory. We could increase the memory assigned to the machine or reduce the number of instances. I will ask domain experts.
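A possible follow-up query, only a sketch that was not run for this ticket: summing QEMURAM over the jobs whose runtime overlapped the alert window would estimate how much guest RAM was requested at once. The time window, its timezone handling and the ::int cast of the setting value are assumptions on top of the schema used in the query above:

    -- sketch: total QEMURAM of s390zl12 jobs overlapping the window in which available memory dropped
    select count(*) as jobs, sum(js.value::int) as total_qemuram_mb
    from jobs j
    join worker_properties wp on j.assigned_worker_id = wp.worker_id
    join job_settings js on js.job_id = j.id
    where arch = 's390x' and wp.key = 'WORKER_CLASS' and wp.value ~ 's390zl12'
      and js.key = 'QEMURAM'
      -- assumed window: roughly 11:30-11:55 on 2024-01-19 (see the dashboard link above);
      -- the timezone of the DB timestamps was not checked for this sketch
      and j.t_started <= '2024-01-19 11:55' and j.t_finished >= '2024-01-19 11:30';

As a back-of-the-envelope upper bound, 10 worker instances each running an 8192MB VM would request 80GB of guest RAM alone, well above the 48GB assigned, before counting QEMU overhead and the host itself.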
Updated by okurz 12 months ago
- Due date set to 2024-02-02
- Status changed from In Progress to Feedback
https://suse.slack.com/archives/C02CANHLANP/p1705668822851679
(Oliver Kurz) @ihno Krumreich @Matthias Griessmeier Do you think we can have more memory (or RAM or whatever IBM calls it in this context) assigned to s390zl12.oqa.prg2.suse.org as we currently have (only) 48GB and with 10 openQA instances that was not enough at least once today
Updated by okurz 11 months ago · Edited
- Description updated (diff)
- Status changed from Resolved to In Progress
Alerting again, added silence.
From
https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=12054&from=1705814470454&to=1706353946183
it looks like RAM decreased on s390zl12.oqa.prg2.suse.org from 80GB to 62GB at 2024-01-22 12:30 (CET?). Does anyone know about that?
(Oliver Kurz) @Matthias Griessmeier @Gerhard Schlotter From https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=12054&from=1705814470454&to=1706353946183
it looks like RAM decreased on s390zl12.oqa.prg2.suse.org from 80GB to 62GB at 2024-01-22 12:30 (CET?). Does anyone know about that? I logged into https://zhmc2.suse.de and can confirm that it says 64GB, but I don't have permissions to change that, I assume, or would need to power down the machine.
(Matthias Griessmeier) Yes. That’s what we’ve talked about over lunch on Thursday. Ihno was playing around with swap, so 64 should suffice, but can be dynamically extended.
That’s why it is only on zl12 so far.
Are we hitting issues with that?
(Oliver Kurz) oh, I see. Well, our alert triggered because the system was out of RAM though the OS was still operational. I will take a look if that's something we can adapt.
(Matthias Griessmeier) With out of ram you mean it exceeded 64gb? Or the alert just said it’s not 80 anymore?
(Ihno Krumreich) I created about 20 GB of swap space with 10 disks. So the overall amount of RAM is now about 82 GB which can be actively used.
(Oliver Kurz) the alert triggered because more than 80-90% of the available 64GB was in use. This only happened now after the change from 80GB->64GB, as it depends on how many test VMs come together and how big they are
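To put the threshold quoted above into numbers, assuming the alert compares available memory against a 10-20% floor (the inverse of the 80-90% usage figure): with 64GB of RAM the alert fires once less than roughly

    0.10 * 64 GB = 6.4 GB   (or, at 20%, 0.20 * 64 GB = 12.8 GB)

is still available, which is consistent with the A0 values of about 0.06-0.07 seen in the firing instances.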
Updated by livdywan 10 months ago
- Status changed from Resolved to Workable
It seems like this has come back? Since it looks exactly the same I'm re-opening rather than creating a new ticket:
A0=0.06855787906121152
The alert fired and resolved itself several times after ~5 minutes over a period of multiple days.
Updated by okurz 10 months ago
- Status changed from Workable to Feedback
https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?viewPanel=12054&orgId=1&from=1708815643546&to=1708840378184 shows that s390zl12 was significantly below the low-available-memory threshold for a period of 4h with intermittent phases of more memory available. I don't think we should further increase the alerting threshold.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/727
Updated by okurz 10 months ago
- Status changed from Feedback to Resolved
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/727 merged and deployed. https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?viewPanel=12054&orgId=1&from=now-3h&to=now looks good. Resolving directly as we are monitoring that well enough.
Updated by okurz 9 months ago
- Related to action #158170: Increase resources for s390x kvm size:M added
Updated by jbaier_cz 7 months ago
- Copied to action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S added