action #164853

closed

[alert][FIRING:1] s390zl13 (s390zl13: Memory usage alert Generic memory_usage_alert_s390zl13 generic) size:S

Added by okurz 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-08-02
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/GDs390zl13/dashboard-for-s390zl13?orgId=1&viewPanel=12054&from=1722568647371&to=1722580533085
shows high memory usage over the course of 50 minutes. As we have had repeated memory alerts, especially on those s390 hosts, and no reports from users that tests are affected, I would like to loosen the alert conditions a bit.

Suggestions

  • Adjust memory thresholds
  • Combine an absolute and a relative threshold
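
A minimal sketch of how an absolute and a relative threshold could be combined, assuming the alert should fire only when both conditions are breached. The function name, the 90% usage limit and the 200 MiB floor are hypothetical placeholders, not values taken from the actual alert definition:

```python
def should_alert(
    used_bytes: int,
    total_bytes: int,
    max_used_fraction: float = 0.9,              # hypothetical relative limit
    min_available_bytes: int = 200 * 1024**2,    # hypothetical absolute floor
) -> bool:
    """Return True only if both the relative and the absolute limit are breached."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    available_bytes = total_bytes - used_bytes
    relative_breach = used_bytes / total_bytes > max_used_fraction
    absolute_breach = available_bytes < min_available_bytes
    # Requiring both conditions loosens the alert: a large host at 94% usage
    # with several GiB still available stays quiet.
    return relative_breach and absolute_breach


GIB = 1024**3
MIB = 1024**2
print(should_alert(used_bytes=60 * GIB, total_bytes=64 * GIB))   # large host, plenty left
print(should_alert(used_bytes=950 * MIB, total_bytes=1 * GIB))   # small VM, nearly exhausted
```

Combining the conditions with "and" makes the alert strictly less sensitive than either threshold alone. Using "or" instead would tighten it, which is the opposite of what is asked for here.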

Rollback actions


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M (Resolved, okurz, 2023-11-15)

Actions #1

Updated by okurz 5 months ago

  • Due date set to 2024-08-16
  • Status changed from New to Feedback
Actions #2

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 4 months ago

  • Related to action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M added
Actions #4

Updated by livdywan 4 months ago

  • Subject changed from [alert][FIRING:1] s390zl13 (s390zl13: Memory usage alert Generic memory_usage_alert_s390zl13 generic) to [alert][FIRING:1] s390zl13 (s390zl13: Memory usage alert Generic memory_usage_alert_s390zl13 generic) size:S
  • Description updated (diff)
Actions #5

Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress

As we decided we want to consider absolute values as well, I ran sudo salt \* cmd.run 'free -g | grep Mem:' to find lower limits. The lowest available values are:

schort-server.qe.nue2.suse.org:
    Mem:            …         901
tumblesle.qe.nue2.suse.org:
    Mem:            …         266
jenkins.qe.nue2.suse.org:
    Mem:            …        1767

so we could say anything below 200M is probably problematic.

Actions #7

Updated by nicksinger 4 months ago

okurz wrote in #note-5:

As we decided we want to consider absolute values as well, I ran sudo salt \* cmd.run 'free -g | grep Mem:' to find lower limits. The lowest available values are:

schort-server.qe.nue2.suse.org:
    Mem:            …         901
tumblesle.qe.nue2.suse.org:
    Mem:            …         266
jenkins.qe.nue2.suse.org:
    Mem:            …        1767

so we could say anything below 200M is probably problematic.

I don't follow the conclusion here, but I now better understand why you previously questioned this approach. Let me try to explain: I would expect that we have no system running with so little memory available, because a modern system commonly assumes it has a bit more. One thing I did not take into consideration is swap, which gives us headroom without needlessly increasing RAM for a very small VM (like the ones above). So could we extend the alert to "free mem + free swap < 1G"?
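
For illustration, the proposed "free mem + free swap < 1G" condition could be sketched as below. This assumes the values are read from text in /proc/meminfo format and that "free mem" means the MemAvailable field (pass mem_field="MemFree" for the literal reading). The function names and the sample numbers are hypothetical and not taken from the hosts above:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into a mapping of field name to bytes."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip lines that are not "Name: value [unit]"
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024  # /proc/meminfo reports sizes in KiB
        fields[name.strip()] = value
    return fields


def low_memory(text: str, limit_bytes: int = 1024**3,
               mem_field: str = "MemAvailable") -> bool:
    """Return True if available memory plus free swap is below the limit."""
    fields = parse_meminfo(text)
    return fields[mem_field] + fields["SwapFree"] < limit_bytes


# Hypothetical sample: 266 MiB available memory, 2 GiB free swap.
sample_with_swap = "MemAvailable:   272384 kB\nSwapFree:      2097152 kB\n"
# Hypothetical sample: same memory, but only 512 MiB free swap.
sample_little_swap = "MemAvailable:   272384 kB\nSwapFree:       524288 kB\n"

print(low_memory(sample_with_swap))    # memory alone is low, swap gives headroom
print(low_memory(sample_little_swap))  # sum is below 1 GiB
```

On a live Linux host the same check could be run against the contents of /proc/meminfo, for example low_memory(open("/proc/meminfo").read()).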

Actions #8

Updated by okurz 4 months ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by okurz 4 months ago

  • Description updated (diff)
  • Due date deleted (2024-08-16)
  • Status changed from Feedback to Resolved

So far no problems. Completed rollback action(s). Will have to see in long-term evaluation if this proves to be stable.

Actions #10

Updated by okurz 4 months ago

  • Status changed from Resolved to In Progress
Actions #11

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #12

Updated by okurz 4 months ago

  • Description updated (diff)
  • Status changed from In Progress to Resolved