action #153958

closed

[alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic

Added by tinita 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2024-01-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

Date: Fri, 19 Jan 2024 11:55:37 +0100
1 firing alert instance
[IMAGE]
Grouped by: hostname=s390zl12
1 firing instance

Firing [stats.openqa-monitor.qa.suse.de]
s390zl12: Memory usage alert
View alert [stats.openqa-monitor.qa.suse.de]
Values: A0=0.06117900738663373
Labels:
  alertname: s390zl12: Memory usage alert
  grafana_folder: Generic
  hostname: s390zl12
  rule_uid: memory_usage_alert_s390zl12

http://stats.openqa-monitor.qa.suse.de/alerting/grafana/memory_usage_alert_s390zl12/view?orgId=1

Rollback steps

Remove silence "alertname=s390zl12: Memory usage alert" from https://stats.openqa-monitor.qa.suse.de/alerting/silences


Related issues (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #158170: Increase resources for s390x kvm size:M (Resolved, nicksinger, 2024-03-27)

Copied to openQA Infrastructure (public) - action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S (Resolved, jbaier_cz)

Actions #1

Updated by okurz 11 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #2

Updated by okurz 11 months ago

@tinita when you ask about such alert and receive a response over other channels, in this case Slack, please also include the relevant answers in the ticket.

So the memory usage alert is likely due to too many or too big kvm instances on it. Looking at https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?viewPanel=12054&orgId=1&from=1705654256531&to=1705665689914
the available memory dropped steadily and significantly from 11:30 to 11:50, then recovered a bit more slowly until about 12:30 when it was ok again. I assume some openQA-spawned VMs here caused this.

I queried the database with

openqa=> select j.id,test,t_started,result,js.value from jobs j join worker_properties wp on j.assigned_worker_id = wp.worker_id join job_settings js on js.job_id = j.id where arch='s390x' and t_finished >= '2024-01-19' and result!='passed' and wp.key='WORKER_CLASS' and wp.value~'s390zl12' and js.key='QEMURAM' order by js.value desc;

and found

    id    |                                                              test                                                              |      t_started      |   result   | value 
----------+--------------------------------------------------------------------------------------------------------------------------------+---------------------+------------+-------
 13289263 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 10:26:27 | failed     | 8192
 13289262 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 10:26:22 | failed     | 8192
 13289374 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 11:12:25 | failed     | 8192
 13289271 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 10:27:03 | failed     | 8192
 13289344 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 10:55:40 | failed     | 8192
 13289382 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 11:20:50 | failed     | 8192
 13289283 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 10:33:03 | failed     | 8192
 13289560 | stig@vtrubovics/os-autoinst-distri-opensuse#POO150863                                                                          | 2024-01-19 12:00:47 | failed     | 8192
 13288222 | python_3.6_on_SLES_15-SP5_podman                                                                                               | 2024-01-19 04:26:49 | softfailed | 6144
 13284951 | golang_oldstable_on_SLES_15-SP5_podman                                                                                         | 2024-01-19 00:26:31 | softfailed | 6144
 13285156 | php-fpm_8_on_SLES_15-SP5_docker                                                                                                | 2024-01-19 00:37:17 | softfailed | 6144
 13285171 | python_3.11_on_SLES_15-SP5_podman                                                                                              | 2024-01-19 00:41:49 | softfailed | 6144
 13285069 | openjdk_17_on_SLES_15-SP5_podman                                                                                               | 2024-01-19 00:32:33 | softfailed | 6144
 13286351 | rmt_on_SLES_15-SP5_docker:investigate:last_good_build:13.1_rmt-server-image                                                    | 2024-01-19 05:03:18 | failed     | 4096
 13288559 | rmt_on_SLES_15-SP5_docker:investigate:retry                                                                                    | 2024-01-19 05:08:37 | failed     | 4096
 13288769 | cc_atsec:investigate:bisect_without_32126                                                                                      | 2024-01-19 06:20:09 | failed     | 4096
 13288697 | cc_atsec:investigate:bisect_without_32124                                                                                      | 2024-01-19 05:50:16 | failed     | 4096
 13286388 | cc_atsec:investigate:bisect_without_32126                                                                                      | 2024-01-19 05:23:35 | failed     | 4096
 13285311 | git_on_SLES_15-SP5_docker                                                                                                      | 2024-01-19 00:46:37 | softfailed | 4096
 13282452 | sle_autoyast_support_image_gnome_12sp5                                                                                         | 2024-01-19 01:33:48 | incomplete | 4096
 13288338 | pcp_5_on_SLES_15-SP5_docker                                                                                                    | 2024-01-19 04:36:18 | softfailed | 4096
 13289123 | ltp_syscalls                                                                                                                   | 2024-01-19 09:26:27 | softfailed | 4096
 13286194 | ltp_syscalls                                                                                                                   | 2024-01-19 04:05:19 | failed     | 4096
 13286349 | rmt_on_SLES_15-SP5_docker:investigate:retry                                                                                    | 2024-01-19 05:03:18 | failed     | 4096
 13286350 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053                                 | 2024-01-19 05:03:18 | failed     | 4096
 13286344 | rmt_on_SLES_15-SP5_podman:investigate:retry                                                                                    | 2024-01-19 04:54:18 | failed     | 4096
 13286387 | cc_atsec:investigate:bisect_without_32124                                                                                      | 2024-01-19 05:20:46 | failed     | 4096
 13286391 | cc_atsec:investigate:bisect_without_32144                                                                                      | 2024-01-19 05:23:35 | failed     | 4096
 13288698 | cc_atsec:investigate:bisect_without_32126                                                                                      | 2024-01-19 05:50:16 | failed     | 4096
 13288382 | registry_2.8_on_SLES_15-SP5_docker                                                                                             | 2024-01-19 04:37:48 | softfailed | 4096
 13288557 | rmt_on_SLES_15-SP5_podman:investigate:last_good_tests_and_build:efad569fffc7a34751ef76ad7f651b496793d053+13.1_rmt-server-image | 2024-01-19 05:07:48 | failed     | 4096
 13288555 | rmt_on_SLES_15-SP5_podman:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053                                 | 2024-01-19 05:06:18 | failed     | 4096
 13288575 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests_and_build:efad569fffc7a34751ef76ad7f651b496793d053+13.1_rmt-server-image | 2024-01-19 05:14:20 | failed     | 4096
 13286345 | rmt_on_SLES_15-SP5_podman:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053                                 | 2024-01-19 04:57:18 | failed     | 4096
 13286352 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests_and_build:efad569fffc7a34751ef76ad7f651b496793d053+13.1_rmt-server-image | 2024-01-19 05:04:48 | failed     | 4096
 13288570 | rmt_on_SLES_15-SP5_docker:investigate:last_good_tests:efad569fffc7a34751ef76ad7f651b496793d053                                 | 2024-01-19 05:10:36 | failed     | 4096
 13285793 | ltp_syscalls                                                                                                                   | 2024-01-19 03:45:19 | softfailed | 4096
 13288558 | rmt_on_SLES_15-SP5_docker:investigate:last_good_build:13.1_rmt-server-image                                                    | 2024-01-19 05:08:37 | failed     | 4096
 13284404 | slem_containers_selinux:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1              | 2024-01-19 05:54:46 | softfailed | 2048
 13285482 | slem_containers:investigate:bisect_without_32152                                                                               | 2024-01-19 06:41:21 | failed     | 2048
 13286307 | slem_containers:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1                      | 2024-01-19 07:09:24 | softfailed | 2048
 13281819 | slem_containers                                                                                                                | 2024-01-19 00:05:22 | failed     | 2048
 13284272 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce                                            | 2024-01-19 05:14:20 | failed     | 2048
 13283454 | fips_ker_mode_tests_crypt_tool                                                                                                 | 2024-01-19 02:52:23 | failed     | 2048
 13286195 | ltp_syscalls_debug_pagealloc                                                                                                   | 2024-01-19 04:05:19 | softfailed | 2048
 13284266 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce                                            | 2024-01-19 05:09:10 | failed     | 2048
 13288588 | fips_ker_mode_tests_crypt_tool                                                                                                 | 2024-01-19 05:23:36 | failed     | 2048
 13288701 | slem_containers                                                                                                                | 2024-01-19 06:07:18 | failed     | 2048
 13285481 | slem_containers:investigate:bisect_without_32126                                                                               | 2024-01-19 06:35:16 | failed     | 2048
 13286209 | slem_containers_selinux:investigate:bisect_without_32152                                                                       | 2024-01-19 07:04:09 | failed     | 2048
 13288850 | slem_containers_selinux:investigate:retry                                                                                      | 2024-01-19 08:17:54 | failed     | 2048
 13289132 | ltp_syscalls_debug_pagealloc                                                                                                   | 2024-01-19 09:28:51 | softfailed | 2048
 13281906 | slem_containers                                                                                                                | 2024-01-19 01:20:18 | failed     | 2048
 13285635 | ltp_syscalls_debug_pagealloc                                                                                                   | 2024-01-19 03:36:09 | failed     | 2048
 13285471 | slem_containers_selinux:investigate:last_good_build:20240117-1                                                                 | 2024-01-19 06:22:24 | failed     | 2048
 13285523 | slem_containers:investigate:last_good_build:20240117-1                                                                         | 2024-01-19 06:50:38 | failed     | 2048
 13286304 | slem_containers:investigate:last_good_build:20240117-1                                                                         | 2024-01-19 07:09:24 | failed     | 2048
 13288838 | slem_containers:investigate:bisect_without_32124                                                                               | 2024-01-19 08:14:51 | failed     | 2048
 13288888 | slem_containers:investigate:bisect_without_32124                                                                               | 2024-01-19 08:34:24 | failed     | 2048
 13282455 | sle_autoyast_support_image_gnome_12sp5_sdk_lp_asmm_contm_lgm_pcm_tcm_wsm_all_patterns                                          | 2024-01-19 02:13:03 | failed     | 2048
 13284347 | mau-bootloader:investigate:retry                                                                                               | 2024-01-19 05:16:16 | failed     | 2048
 13285525 | slem_containers:investigate:bisect_without_32124                                                                               | 2024-01-19 06:53:38 | failed     | 2048
 13288836 | slem_containers:investigate:last_good_build:20240117-1                                                                         | 2024-01-19 08:13:38 | failed     | 2048
 13289023 | create_hdd_autoyast_containers                                                                                                 | 2024-01-19 08:50:53 | failed     | 2048
 13286189 | ltp_cve_git                                                                                                                    | 2024-01-19 03:21:13 | failed     | 2048
 13284402 | slem_containers_selinux:investigate:last_good_tests:595e0dea8db29d8305370398904f071d5fd69687                                   | 2024-01-19 05:53:19 | softfailed | 2048
 13285521 | slem_containers:investigate:retry                                                                                              | 2024-01-19 06:41:22 | failed     | 2048
 13286208 | slem_containers_selinux:investigate:bisect_without_32126                                                                       | 2024-01-19 07:01:08 | failed     | 2048
 13288855 | slem_containers_selinux:investigate:bisect_without_32126                                                                       | 2024-01-19 08:19:23 | failed     | 2048
 13289126 | ltp_syscalls_debug_pagealloc                                                                                                   | 2024-01-19 09:27:28 | softfailed | 2048
 13285473 | slem_containers_selinux:investigate:bisect_without_32124                                                                       | 2024-01-19 06:24:18 | failed     | 2048
 13287960 | fips_ker_mode_tests_crypt_tool                                                                                                 | 2024-01-19 04:51:18 | failed     | 2048
 13288025 | slem_containers:investigate:bisect_without_32126                                                                               | 2024-01-19 07:32:21 | failed     | 2048
 13286313 | slem_containers:investigate:bisect_without_32152                                                                               | 2024-01-19 07:11:16 | failed     | 2048
 13288835 | slem_containers:investigate:last_good_tests:595e0dea8db29d8305370398904f071d5fd69687                                           | 2024-01-19 08:12:08 | softfailed | 2048
 13286439 | sle_autoyast_support_image_gnome_12sp5_sdk_lp_asmm_contm_lgm_pcm_tcm_wsm_all_patterns                                          | 2024-01-19 03:26:29 | failed     | 2048
 13285950 | ltp_net_nfs                                                                                                                    | 2024-01-19 03:54:48 | softfailed | 2048
 13288579 | mau-bootloader:investigate:retry                                                                                               | 2024-01-19 05:42:47 | failed     | 2048
 13288582 | mau-bootloader:investigate:last_good_build:20240117-1                                                                          | 2024-01-19 05:47:16 | failed     | 2048
 13285472 | slem_containers_selinux:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1              | 2024-01-19 06:23:46 | softfailed | 2048
 13286205 | slem_containers_selinux:investigate:last_good_build:20240117-1                                                                 | 2024-01-19 06:59:38 | failed     | 2048
 13288856 | slem_containers_selinux:investigate:bisect_without_32152                                                                       | 2024-01-19 08:20:54 | failed     | 2048
 13286358 | slem_containers                                                                                                                | 2024-01-19 03:22:48 | failed     | 2048
 13286442 | sle_autoyast_create_hdd_gnome_12sp5_sdk_all_patterns                                                                           | 2024-01-19 04:48:18 | failed     | 2048
 13288584 | mau-bootloader:investigate:last_good_tests_and_build:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce+20240117-1                       | 2024-01-19 05:48:46 | failed     | 2048
 13284379 | slem_containers:investigate:last_good_build:20240117-1                                                                         | 2024-01-19 05:53:16 | failed     | 2048
 13284406 | slem_containers_selinux:investigate:bisect_without_32152                                                                       | 2024-01-19 06:22:21 | failed     | 2048
 13286206 | slem_containers_selinux:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1              | 2024-01-19 07:01:08 | softfailed | 2048
 13288852 | slem_containers_selinux:investigate:last_good_build:20240117-1                                                                 | 2024-01-19 08:17:53 | failed     | 2048
 13281772 | slem_containers                                                                                                                | 2024-01-18 23:50:50 | failed     | 2048
 13284273 | mau-bootloader:investigate:last_good_build:20240117-1                                                                          | 2024-01-19 05:16:16 | failed     | 2048
 13285785 | ltp_syscalls_debug_pagealloc                                                                                                   | 2024-01-19 03:39:41 | failed     | 2048
 13288585 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce                                            | 2024-01-19 05:48:46 | failed     | 2048
 13284401 | slem_containers_selinux:investigate:retry                                                                                      | 2024-01-19 05:53:16 | failed     | 2048
 13285478 | slem_containers:investigate:last_good_build:20240117-1                                                                         | 2024-01-19 06:33:24 | failed     | 2048
 13286207 | slem_containers_selinux:investigate:bisect_without_32124                                                                       | 2024-01-19 07:01:08 | failed     | 2048
 13288837 | slem_containers:investigate:last_good_tests_and_build:595e0dea8db29d8305370398904f071d5fd69687+20240117-1                      | 2024-01-19 08:13:38 | softfailed | 2048
 13284274 | mau-bootloader:investigate:last_good_tests_and_build:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce+20240117-1                       | 2024-01-19 05:16:16 | failed     | 2048
 13288586 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce                                            | 2024-01-19 05:48:46 | failed     | 2048
 13284382 | slem_containers:investigate:bisect_without_32152                                                                               | 2024-01-19 05:53:16 | failed     | 2048
 13288839 | slem_containers:investigate:bisect_without_32126                                                                               | 2024-01-19 08:16:23 | failed     | 2048
 13285469 | slem_containers_selinux:investigate:last_good_tests:595e0dea8db29d8305370398904f071d5fd69687                                   | 2024-01-19 06:22:21 | softfailed | 2048
 13288020 | slem_containers:investigate:retry                                                                                              | 2024-01-19 07:32:21 | failed     | 2048
 13286203 | slem_containers_selinux:investigate:retry                                                                                      | 2024-01-19 06:58:08 | failed     | 2048
 13288889 | slem_containers:investigate:last_good_build:20240117-1                                                                         | 2024-01-19 08:35:32 | failed     | 2048
 13289127 | ltp_cve_git                                                                                                                    | 2024-01-19 09:28:06 | failed     | 2048
 13284349 | mau-bootloader:investigate:last_good_tests:ea3dbf5193d32cc6cac5ba91615a9bff47a110ce                                            | 2024-01-19 05:17:46 | failed     | 2048
 13288590 | mau-bootloader:investigate:last_good_build:20240117-1                                                                          | 2024-01-19 05:50:16 | failed     | 2048
(108 rows)

Considering that the machine has only 48GB of memory assigned, multiple jobs are quite heavy, and as we currently have 10 openQA worker instances it's quite likely that the machine runs out of memory. We could increase the memory assigned to the machine or reduce the number of instances. I will ask domain experts.
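A back-of-the-envelope check of the above (a sketch: the instance count and 48GB come from this comment, the QEMURAM values from the query results; the worst-case assumption that every slot runs the largest VM is illustrative):

```python
# Rough memory-demand estimate for s390zl12 (illustrative sketch, not actual
# scheduler logic). QEMURAM is given in MiB in openQA job settings.
GIB_IN_MIB = 1024

host_memory_mib = 48 * GIB_IN_MIB   # 48 GB assigned to the machine
worker_instances = 10               # openQA worker instances on the host

# QEMURAM values (MiB) observed in the query results above
qemuram_seen = [8192, 6144, 4096, 2048]

# Worst case: every worker slot runs a job with the largest QEMURAM
worst_case_mib = worker_instances * max(qemuram_seen)

print(worst_case_mib // GIB_IN_MIB)      # 80 (GiB demanded)
print(worst_case_mib > host_memory_mib)  # True: host can run out of memory
```

Even a mix of 8GiB and 4GiB jobs across 10 instances easily exceeds 48GB, which matches the observed dip in available memory.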

Actions #3

Updated by okurz 11 months ago

  • Due date set to 2024-02-02
  • Status changed from In Progress to Feedback

https://suse.slack.com/archives/C02CANHLANP/p1705668822851679

(Oliver Kurz) @ihno Krumreich @Matthias Griessmeier Do you think we can have more memory (or RAM or whatever IBM calls it in this context) assigned to s390zl12.oqa.prg2.suse.org as we currently have (only) 48GB and with 10 openQA instances that was not enough at least once today

Actions #4

Updated by okurz 11 months ago

  • Due date deleted (2024-02-02)
  • Status changed from Feedback to Resolved

Memory was increased to 80GB now, which should give better headroom for the 10 worker instances.

Actions #5

Updated by okurz 11 months ago · Edited

  • Description updated (diff)
  • Status changed from Resolved to In Progress

Alerting again, added silence.

From
https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=12054&from=1705814470454&to=1706353946183
it looks like RAM decreased on s390zl12.oqa.prg2.suse.org from 80GB to 62GB at 2024-01-22 12:30 (CET?). Does anyone know about that?

https://suse.slack.com/archives/C02CANHLANP/p1706354555432789?thread_ts=1705668822.851679&cid=C02CANHLANP

(Oliver Kurz) @Matthias Griessmeier @Gerhard Schlotter From https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=12054&from=1705814470454&to=1706353946183
it looks like RAM decreased on s390zl12.oqa.prg2.suse.org from 80GB to 62GB at 2024-01-22 12:30 (CET?). Does anyone know about that? I logged into https://zhmc2.suse.de and can confirm that it says 64GB, but I don't have permissions to change that, I assume, or would need to power down the machine.
(Matthias Griessmeier) Yes. That’s what we’ve talked about over lunch on Thursday. Ihno was playing around with swap, so 64 should suffice, but can be dynamically extended.
That’s why it is only on zl12 so far.
Are we hitting issues with that?
(Oliver Kurz) Oh, I see. Well, our alert triggered because the system was out of RAM though the OS was still operational. I will take a look if that's something we can adapt.
(Matthias Griessmeier) With out of ram you mean it exceeded 64gb? Or the alert just said it’s not 80 anymore?
(Ihno Krumreich) I created about 20 GB of swap space with 10 disks. So the overall amount of RAM is now about 82 GB which can be actively used.
(Oliver Kurz) the alert triggered because more than 80-90% of the available 64GB was used. This only happened now after the change from 80GB->64GB as it depends on how many and how big test VMs come together
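For context, the A0 value in the alert payload appears to be the fraction of memory still available; a minimal sketch of that reading (the 10% threshold is an assumption for illustration, not the actual Grafana rule):

```python
# Sketch of the assumed alert condition: fire when the available-memory
# fraction drops below a threshold (the threshold value here is a guess).
def memory_alert_fires(mem_available, mem_total, threshold=0.10):
    """Return (fires, fraction_available) for a low-available-memory check."""
    fraction = mem_available / mem_total
    return fraction < threshold, fraction

total_bytes = 64 * 1024**3                           # 64 GiB after the change
available_bytes = total_bytes * 0.06117900738663373  # the A0 from the alert
fires, frac = memory_alert_fires(available_bytes, total_bytes)
print(fires)  # True: only ~6% available, below the assumed 10% threshold
```

Under this reading, shrinking the host from 80GB to 64GB makes the same absolute VM load cross the relative threshold much sooner.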

Actions #7

Updated by okurz 11 months ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by okurz 11 months ago

  • Due date deleted (2024-02-10)
  • Status changed from Feedback to Resolved

Merged, deployed and confirmed to be effective. Currently no alert. Silence removed.

Actions #9

Updated by livdywan 10 months ago

  • Status changed from Resolved to Workable

It seems this has come back. Since it looks exactly the same I'm re-opening rather than creating a new ticket:

A0=0.06855787906121152  

The alert fired and resolved itself several times after ~5 minutes over a period of multiple days.
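One common way to reduce this kind of flapping without raising the threshold is a pending period: the condition must hold for several consecutive evaluations before the alert fires. A minimal sketch of the idea (illustrative only; not the actual Grafana/salt configuration):

```python
# Debounce sketch: only fire after the threshold is breached for N consecutive
# evaluation intervals, so brief ~5-minute dips do not page anyone.
class DebouncedAlert:
    def __init__(self, threshold, pending_evals):
        self.threshold = threshold      # available-memory fraction limit
        self.pending = pending_evals    # consecutive breaches required
        self.breaches = 0

    def evaluate(self, available_fraction):
        if available_fraction < self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0           # any recovery resets the counter
        return self.breaches >= self.pending

alert = DebouncedAlert(threshold=0.10, pending_evals=3)
samples = [0.06, 0.12, 0.06, 0.06, 0.06]  # brief dip recovers, then sustained
print([alert.evaluate(s) for s in samples])
# [False, False, False, False, True]
```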

Actions #10

Updated by okurz 10 months ago

  • Status changed from Workable to Feedback

https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?viewPanel=12054&orgId=1&from=1708815643546&to=1708840378184 shows that s390zl12 was significantly below the low-available-memory threshold for a period of 4h with intermittent phases of more memory available. I don't think we should further increase the alerting threshold.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/727

Actions #11

Updated by okurz 10 months ago

  • Status changed from Feedback to Resolved
Actions #12

Updated by okurz 9 months ago

  • Related to action #158170: Increase resources for s390x kvm size:M added
Actions #13

Updated by jbaier_cz 7 months ago

  • Copied to action #160598: [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S added
