tickets #155992
Physical machines reporting odd oom_kill value
Status: open
Description
I want to start monitoring OOM kills. Prometheus uses the oom_kill counter from /proc/vmstat for this (exposed by the node exporter as node_vmstat_oom_kill). While setting this up, I found three machines with a consistently and oddly high value there:
node_vmstat_oom_kill{instance="falkor20.infra.opensuse.org", job="nodes"} 12601268945
node_vmstat_oom_kill{instance="falkor21.infra.opensuse.org", job="nodes"} 313409559560
node_vmstat_oom_kill{instance="falkor22.infra.opensuse.org", job="nodes"} 144713341803
These values match what is reported in /proc/vmstat on the machines themselves:
falkor20 (Hypervisor):~ # grep oom_kill /proc/vmstat
oom_kill 12603800634
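As a possible cross-check (not part of the original report): on hosts using cgroup v2, memory.events exposes per-cgroup oom_kill counters, which should stay near zero if the global counter is bogus:

falkor20 (Hypervisor):~ # grep -r '^oom_kill ' /sys/fs/cgroup --include=memory.events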
The odd part is that there are no recent OOM kills on these machines:
falkor20 (Hypervisor):~ # dmesg | grep -iE 'oom|kill'; echo $?
1
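Note that dmesg only covers the current kernel ring buffer, so older OOM messages could have rotated out. Assuming persistent journald storage, the full journal can be searched as well (journalctl --grep matches case-insensitively when the pattern is all lowercase):

falkor20 (Hypervisor):~ # journalctl -g 'out of memory|oom_reaper'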
And I would be surprised if there were; there is plenty of free memory:
falkor20 (Hypervisor):~ # free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        14Gi       979Gi       132Mi        19Gi       993Gi
Swap:          31Gi          0B        31Gi
Since this is reported only for the three Falkor nodes, which all share the same hardware, and not for any other machine, I wonder if there is something specific to these machines making the kernel report a bogus oom_kill value. A counter of more than 12 billion OOM kills is clearly implausible. Is it possible the >1 TiB of memory needs some additional tuning?
Would appreciate any ideas!
I would rather not simply exclude these three machines from OOM monitoring without understanding why this happens.
Updated by crameleon about 2 months ago
- Subject changed from Falkor reporting odd oom_kill value to Physical machines reporting odd oom_kill value
The Falkor machines no longer show these odd values, but now the Orbit ones do.
Updated by crameleon about 2 months ago
For now, configured the OOM alerting to be filtered to virtual machines, since I could not yet figure out this peculiarity:
https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1854
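Purely as an illustration of the approach (the actual change is in the merge request above), a filtered alert rule could look roughly like this, assuming a hypothetical machine_type label that identifies virtual machines:

groups:
  - name: oom
    rules:
      - alert: OomKill
        # Recent increase of the cumulative counter, restricted to VMs via
        # the hypothetical machine_type label; the window is an arbitrary choice.
        expr: increase(node_vmstat_oom_kill{machine_type="vm"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "OOM kill detected on {{ $labels.instance }}"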