tickets #155992

open

Physical machines reporting odd oom_kill value

Added by crameleon 4 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Physical infrastructure / Hardware
Target version: -
Start date: 2024-02-24
Due date:
% Done: 0%
Estimated time:
Description

I want to start monitoring OOM kills. The Prometheus node exporter reads oom_kill from /proc/vmstat for this. While setting this up, I found three machines that consistently report an oddly high value there:

node_vmstat_oom_kill{instance="falkor20.infra.opensuse.org", job="nodes"} 12601268945
node_vmstat_oom_kill{instance="falkor21.infra.opensuse.org", job="nodes"} 313409559560
node_vmstat_oom_kill{instance="falkor22.infra.opensuse.org", job="nodes"} 144713341803

This matches what is reported on the machines themselves:

falkor20 (Hypervisor):~ # grep oom_kill /proc/vmstat 
oom_kill 12603800634
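
The value read locally is slightly higher than the one Prometheus scraped, which suggests the counter is still climbing rather than stuck at a fixed bogus number. A minimal sketch for measuring how fast it grows is shown here; the function names `read_oom_kill` and `oom_kill_delta` and the 60 second default interval are assumptions made for illustration, not something already deployed on these hosts:

```shell
#!/bin/sh
# Print the oom_kill counter from a vmstat-formatted file.
# Defaults to /proc/vmstat; another path can be passed for testing.
read_oom_kill() {
    awk '$1 == "oom_kill" { print $2 }' "${1:-/proc/vmstat}"
}

# Sample the counter twice, <interval> seconds apart, and print the difference.
# Arguments: interval in seconds (default 60), file to read (default /proc/vmstat).
oom_kill_delta() {
    interval="${1:-60}"
    file="${2:-/proc/vmstat}"
    before=$(read_oom_kill "$file")
    sleep "$interval"
    after=$(read_oom_kill "$file")
    echo "$((after - before))"
}
```

Running `oom_kill_delta 60` on an affected and on an unaffected machine would show whether the counter increases while no process is being killed.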

The odd part is that there are no recent OOM kills on these machines:

falkor20 (Hypervisor):~ # dmesg|grep -i oom\|kill ; echo $?
1
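
One caveat with this check: because the pattern is not quoted, the shell removes the backslash and grep receives `oom|kill`, which in a basic regular expression is a literal string and not an alternation. An exit status of 1 is therefore expected even if OOM messages were present. A sketch of the same check with a quoted extended regular expression follows; the function name `filter_oom_lines` is an assumption made for illustration:

```shell
#!/bin/sh
# Keep only log lines that mention "oom" or "kill" (case-insensitive).
# Reads log text on stdin, so it works with dmesg or journalctl -k output.
# Exit status is 0 if at least one line matched, 1 otherwise.
filter_oom_lines() {
    grep -iE 'oom|kill'
}

# Usage (needs permission to read the kernel log):
#   dmesg | filter_oom_lines; echo $?
```

Since the dmesg ring buffer only holds recent messages, feeding the function from `journalctl -k` may cover a longer period if the journal is persistent.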

And I would be surprised if there were, since there is plenty of free memory:

falkor20 (Hypervisor):~ # free -h
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti        14Gi       979Gi       132Mi        19Gi       993Gi
Swap:           31Gi          0B        31Gi

Since this is reported only for the three Falkor nodes, which all share the same hardware, and not for any other machines, I wonder whether something specific to these machines makes the kernel report a bogus oom_kill value. Could the >1TB of memory need some additional tuning?

I would appreciate any ideas!
I would rather not simply exclude these three machines from OOM monitoring without knowing why this happens.

Actions #1

Updated by crameleon 4 months ago

  • Category set to Physical infrastructure / Hardware
  • Private changed from Yes to No
Actions #2

Updated by crameleon 4 months ago

  • Subject changed from Fakor reporting odd oom_kill value to Falkor reporting odd oom_kill value
Actions #3

Updated by crameleon about 2 months ago

  • Subject changed from Falkor reporting odd oom_kill value to Physical machines reporting odd oom_kill value

The Falkor machines no longer show these odd values, but the Orbit ones now do.

Actions #4

Updated by crameleon about 2 months ago

Configured OOM alerting to cover only virtual machines for now, since I have not yet figured out this peculiarity.

https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1854
