tickets #155992
Physical machines reporting odd oom_kill value
Status: open
Description
I want to start monitoring OOM kills. Prometheus uses the oom_kill counter from /proc/vmstat for this (exposed by the node exporter as node_vmstat_oom_kill). While setting this up, I found three machines with a consistently and oddly high value there:
node_vmstat_oom_kill{instance="falkor20.infra.opensuse.org", job="nodes"} 12601268945
node_vmstat_oom_kill{instance="falkor21.infra.opensuse.org", job="nodes"} 313409559560
node_vmstat_oom_kill{instance="falkor22.infra.opensuse.org", job="nodes"} 144713341803
These values match what is reported in /proc/vmstat on the machines themselves:
falkor20 (Hypervisor):~ # grep oom_kill /proc/vmstat
oom_kill 12603800634
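As a possible cross-check (not part of the original report): on hosts using cgroup v2, memory.events exposes per-cgroup oom_kill counters, which should stay near zero if the global counter is bogus:

falkor20 (Hypervisor):~ # grep -r '^oom_kill ' /sys/fs/cgroup --include=memory.events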
The odd part is that there are no recent OOM kills on these machines:
falkor20 (Hypervisor):~ # dmesg | grep -iE 'oom|kill'; echo $?
1
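Note that dmesg only covers the current kernel ring buffer, so older OOM messages could have rotated out. Assuming persistent journald storage, the full journal can be searched as well (journalctl --grep matches case-insensitively when the pattern is all lowercase):

falkor20 (Hypervisor):~ # journalctl -g 'out of memory|oom_reaper'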
And I would be surprised if there were; there is plenty of free memory:
falkor20 (Hypervisor):~ # free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        14Gi       979Gi       132Mi        19Gi       993Gi
Swap:          31Gi          0B        31Gi
Since this is reported only for the three Falkor nodes, which all share the same hardware, and not for any other machine, I wonder if there is something specific to these machines making the kernel report a bogus oom_kill value. A counter of more than 12 billion OOM kills is clearly implausible. Is it possible the >1 TiB of memory needs some additional tuning?
Would appreciate any ideas!
I would rather not simply exclude these three machines from OOM monitoring without understanding why this happens.
Updated by crameleon about 2 months ago
- Subject changed from Falkor reporting odd oom_kill value to Physical machines reporting odd oom_kill value
The Falkor machines no longer show these odd values, but now the Orbit ones do.
Updated by crameleon about 2 months ago
For now, configured the OOM alerting to be filtered to virtual machines, since I could not yet figure out this peculiarity:
https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1854
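Purely as an illustration of the approach (the actual change is in the merge request above), a filtered alert rule could look roughly like this, assuming a hypothetical machine_type label that identifies virtual machines:

groups:
  - name: oom
    rules:
      - alert: OomKill
        # Recent increase of the cumulative counter, restricted to VMs via
        # the hypothetical machine_type label; the window is an arbitrary choice.
        expr: increase(node_vmstat_oom_kill{machine_type="vm"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "OOM kill detected on {{ $labels.instance }}"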