action #164853
closed [alert][FIRING:1] s390zl13 (s390zl13: Memory usage alert Generic memory_usage_alert_s390zl13 generic) size:S
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/GDs390zl13/dashboard-for-s390zl13?orgId=1&viewPanel=12054&from=1722568647371&to=1722580533085
shows high memory usage over the course of 50 minutes. As we have had repeated memory alerts, especially on those s390 hosts, and no reports from users that tests are affected, I would like to loosen the alert conditions a bit.
Suggestions
- Adjust memory thresholds
- Combine an absolute and a relative threshold
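One possible reading of the second suggestion, sketched in Python rather than in the actual alert rule syntax: fire only when the relative usage is high and the absolute amount of available memory is low at the same time. The function name, the 90% ratio and the 1 GiB floor are hypothetical placeholders, not values taken from this ticket or from salt-states-openqa.

```python
def memory_alert(total_bytes: int, available_bytes: int,
                 max_used_ratio: float = 0.9,
                 min_available_bytes: int = 1024**3) -> bool:
    """Return True only when both thresholds are violated.

    max_used_ratio and min_available_bytes are illustrative defaults,
    not the thresholds used by the real alert.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_ratio = 1 - available_bytes / total_bytes
    relative_breach = used_ratio > max_used_ratio
    absolute_breach = available_bytes < min_available_bytes
    # Requiring both conditions keeps big hosts quiet when they are
    # "96% used" but still have many GiB available.
    return relative_breach and absolute_breach
```

With these assumed defaults a host with 256 GiB total and 10 GiB available would not alert, while a host with 4 GiB total and about 200 MiB available would.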
Rollback actions
- DONE Remove alert from https://monitor.qa.suse.de/alerting/silences called rule_uid=~memory_usage_.*s390
Updated by okurz 4 months ago
- Related to action #150887: [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M added
Updated by okurz 4 months ago
- Status changed from Feedback to In Progress
As we decided to consider absolute values as well, I ran
sudo salt \* cmd.run 'free -g | grep Mem:'
to find lower limits. The lowest available values are:
schort-server.qe.nue2.suse.org:
Mem: … 901
tumblesle.qe.nue2.suse.org:
Mem: … 266
jenkins.qe.nue2.suse.org:
Mem: … 1767
so we could say anything below 200M is probably problematic.
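For repeating this survey, a minimal sketch of picking the lowest values out of such salt output could look like the following. It assumes the usual salt layout of a host line ending in a colon followed by the matching Mem: line, and that the last column of the Mem: line is the "available" value, as in current procps versions of free. The function name and the sample hosts and numbers are made up for illustration.

```python
def lowest_available(salt_output: str, count: int = 3) -> list[tuple[str, int]]:
    """Return the `count` hosts with the smallest last Mem: column."""
    results = []
    host = None
    for line in salt_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("Mem:"):
            if host is not None:
                # Assumption: the last column is "available".
                results.append((host, int(stripped.split()[-1])))
                host = None
        elif stripped.endswith(":"):
            host = stripped[:-1]
    return sorted(results, key=lambda item: item[1])[:count]


# Hypothetical sample, not real output from the hosts in this ticket.
SAMPLE = """\
host-a.example.org:
    Mem:  3900  1200  300  10  2400  2500
host-b.example.org:
    Mem:  1900  1500  100  5  300  250
host-c.example.org:
    Mem:  7800  2000  4000  20  1800  5600
"""

if __name__ == "__main__":
    for name, available in lowest_available(SAMPLE, count=2):
        print(name, available)
```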
Updated by nicksinger 4 months ago
okurz wrote in #note-5:
As we decided we want to consider absolute values as well I did
sudo salt \* cmd.run 'free -g | grep Mem:'
to find lower limits. Lowest available values are:
schort-server.qe.nue2.suse.org:
Mem: … 901
tumblesle.qe.nue2.suse.org:
Mem: … 266
jenkins.qe.nue2.suse.org:
Mem: … 1767
so we could say anything below 200M is probably problematic.
I don't follow the conclusion here, but I now better understand why you previously questioned this approach. Let me try to explain: I would expect that none of our systems runs with this little memory available, because a modern system commonly assumes a bit more headroom. One thing I did not take into consideration is swap, which gives us headroom without needlessly blowing up RAM for a very small VM (like the ones above). So could we extend the alert to "free mem + free swap < 1G"?
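To make the proposed condition concrete, here is a rough Python sketch that evaluates it against the contents of /proc/meminfo. It is not the alert rule itself. It assumes "free mem" is read as the kernel's MemAvailable field and "1G" as 1 GiB, and the function names are hypothetical.

```python
GIB_IN_KIB = 1024 * 1024  # /proc/meminfo reports values in KiB ("kB")


def parse_meminfo(text: str) -> dict[str, int]:
    """Map /proc/meminfo field names to their numeric values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def low_memory(meminfo_text: str, limit_kib: int = GIB_IN_KIB) -> bool:
    """True if MemAvailable + SwapFree is below the limit (default 1 GiB)."""
    fields = parse_meminfo(meminfo_text)
    headroom = fields["MemAvailable"] + fields["SwapFree"]
    return headroom < limit_kib


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(low_memory(handle.read()))
```

Counting free swap means a small VM with little RAM but ample swap stays quiet, which matches the headroom argument above.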
Updated by okurz 4 months ago
- Status changed from In Progress to Feedback
Updated https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1244 to include available+swap_free
Updated by okurz 4 months ago
- Status changed from Resolved to In Progress
https://stats.openqa-monitor.qa.suse.de/d/GDs390zl13/dashboard-for-s390zl13?viewPanel=12054&orgId=1&from=1723610465583&to=1723611644209 just fired again. Will need to reconsider both alerts.