action #97139
[alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M (closed)
Added by okurz over 3 years ago. Updated over 3 years ago.
Description
Observation
Acceptance criteria
- AC1: We have understood the reason why the alert triggered
- AC2: Actions have been applied to prevent a similar situation in the future
Suggestion
- Check history of monitoring data
- Check logs if there is something that we still need to fix (see the sketch after this list)
- Check if it was maybe a one-off experiment by a person, talk to the person and tell them "don't do it again!"
- Consider adapting alerting thresholds
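For the log check, a minimal sketch of what could be run on the affected host (the time window is only an example based on when the alert fired; adjust as needed):
ssh malbec.arch.suse.de
# look for OOM and memory-related messages around the alert window
sudo journalctl --since "2021-08-17 18:00" --until "2021-08-17 20:00" | grep -iE 'oom|out of memory|killed process'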
Updated by okurz over 3 years ago
- Copied from action #97136: [alert] multiple unhandled alerts about "broken workers" size:M added
Updated by okurz over 3 years ago
- Subject changed from [alert] multiple unhandled alerts about "malbec: Memory usage alert" to [alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Looks like something used up almost the entire memory, up to the point where only 477 MiB (of 127 GiB) were free. Then the memory was suddenly freed up again. Maybe it was just a single process which eventually got killed.
Judging by the alert's state history, the problem is not happening very often, but I'll have a look at the worker logs and check what special processes could possibly be running on that worker.
Updated by mkittler over 3 years ago
There's not much going on. The worker only has 4 slots. When idling, the top process is ovs-vswitchd which uses 8149 MiB of memory (virtual), but it is likely not the culprit.
The free memory rapidly declined at 18:53 (and started to decline slowly at 18:45). The memory was freed again at 19:30. Interesting log messages, starting shortly before the last alert triggered:
Aug 17 18:39:07 malbec wickedd-dhcp4[4224]: eth4: Committed DHCPv4 lease with address 10.161.24.54 (lease time 1800 sec, renew in 900 sec, rebind in 1575 sec)
Aug 17 18:39:07 malbec wickedd[4443]: route ipv4 0.0.0.0/0 via 10.161.0.1 dev eth4#6 type unicast table main scope universe protocol dhcp covered by a ipv4:dhcp lease
Aug 17 18:39:07 malbec wickedd[4443]: ni_process_reap: process 36869 has not exited yet; now doing a blocking waitpid()
Aug 17 18:54:07 malbec wickedd-dhcp4[4224]: eth4: Committed DHCPv4 lease with address 10.161.24.54 (lease time 1800 sec, renew in 900 sec, rebind in 1575 sec)
Aug 17 18:54:07 malbec wickedd[4443]: route ipv4 0.0.0.0/0 via 10.161.0.1 dev eth4#6 type unicast table main scope universe protocol dhcp covered by a ipv4:dhcp lease
Aug 17 18:54:07 malbec wickedd[4443]: ni_process_reap: process 39315 has not exited yet; now doing a blocking waitpid()
Aug 17 18:59:37 malbec wickedd-dhcp6[4222]: eth4: Committing DHCPv6 lease with:
Aug 17 18:59:37 malbec wickedd-dhcp6[4222]: eth4 +ia-na.address 2620:113:80c0:8000:10:161:24:54/0, pref-lft 9000, valid-lft 14400
Aug 17 18:59:40 malbec wickedd[4443]: ni_process_reap: process 40024 has not exited yet; now doing a blocking waitpid()
Aug 17 19:00:00 malbec systemd[1]: Started Timeline of Snapper Snapshots.
Aug 17 19:00:00 malbec dbus-daemon[3188]: [system] Activating service name='org.opensuse.Snapper' requested by ':1.324' (uid=0 pid=40267 comm="/usr/lib/snapper/systemd-helper --timeline ") (using servicehelper)
Aug 17 19:00:00 malbec dbus-daemon[3188]: [system] Successfully activated service 'org.opensuse.Snapper'
Aug 17 19:01:30 malbec systemd[1]: dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device: Job dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device/start timed out.
Aug 17 19:01:30 malbec systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device.
Aug 17 19:01:30 malbec systemd[1]: Dependency failed for /dev/disk/by-uuid/b4df213d-22b0-4264-9fa0-777474872c8a.
Aug 17 19:01:30 malbec systemd[1]: dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.swap: Job dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.swap/start failed with result 'dependency'.
Aug 17 19:01:30 malbec systemd[1]: dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device: Job dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device/start failed with result 'timeout'.
Aug 17 19:07:39 malbec ovs-vsctl[41023]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap3 tag 9
…
Aug 17 19:30:06 malbec kernel: salt-minion invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Aug 17 19:30:08 malbec kernel: CPU: 40 PID: 9188 Comm: salt-minion Kdump: loaded Not tainted 5.3.18-lp152.87-default #1 openSUSE Leap 15.2
Aug 17 19:30:08 malbec kernel: Call Trace:
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb5a0] [c000000000d0f048] dump_stack+0xbc/0x104 (unreliable)
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb5e0] [c000000000383940] dump_header+0x60/0x2e0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb670] [c00000000038416c] oom_kill_process+0x19c/0x2c0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb6b0] [c000000000385414] out_of_memory+0x114/0x720
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb750] [c0000000004028e8] __alloc_pages_slowpath+0xa78/0xe20
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb900] [c000000000402fa8] __alloc_pages_nodemask+0x318/0x3e0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb980] [c0000000004280e0] alloc_pages_current+0xa0/0x140
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb9c0] [c000000000377888] __page_cache_alloc+0xb8/0x120
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb9f0] [c00000000037af78] pagecache_get_page+0x128/0x490
Aug 17 19:30:09 malbec kernel: [c000000fe9bdba50] [c00000000037c25c] filemap_fault+0x51c/0xd50
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbb70] [c0000000003d1f8c] __do_fault+0x5c/0x200
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbbb0] [c0000000003d9254] __handle_mm_fault+0x1174/0x1b10
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbca0] [c0000000003d9d1c] handle_mm_fault+0x12c/0x220
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbce0] [c00000000007a618] __do_page_fault+0x338/0xe80
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbde0] [c00000000007b198] do_page_fault+0x38/0xc0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbe20] [c00000000000a908] handle_page_fault+0x10/0x30
Aug 17 19:30:09 malbec kernel: Mem-Info:
Aug 17 19:30:09 malbec kernel: active_anon:1236907 inactive_anon:794527 isolated_anon:0
active_file:66 inactive_file:143 isolated_file:0
unevictable:5202 dirty:3 writeback:0 unstable:0
slab_reclaimable:2156 slab_unreclaimable:10551
mapped:444 shmem:1146897 pagetables:396 bounce:0
free:11120 free_pcp:1 free_cma:4
Aug 17 19:30:09 malbec kernel: Node 0 active_anon:22985600kB inactive_anon:8939904kB active_file:3904kB inactive_file:576kB unevictable:277952kB isolated(anon):0kB isolated(file):0kB mapped:23936kB dirty:0kB writeback:0kB shmem:12452992kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 17367040kB writeback_tmp:0kB u>
Aug 17 19:30:09 malbec kernel: Node 1 active_anon:24274944kB inactive_anon:8716864kB active_file:0kB inactive_file:1408kB unevictable:7552kB isolated(anon):0kB isolated(file):0kB mapped:512kB dirty:0kB writeback:0kB shmem:10899776kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 20529152kB writeback_tmp:0kB unstabl>
Aug 17 19:30:09 malbec kernel: Node 16 active_anon:16248896kB inactive_anon:16740224kB active_file:320kB inactive_file:5440kB unevictable:40640kB isolated(anon):0kB isolated(file):0kB mapped:2176kB dirty:192kB writeback:0kB shmem:24688000kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 5718016kB writeback_tmp:0kB >
Aug 17 19:30:09 malbec kernel: Node 17 active_anon:15652608kB inactive_anon:16452736kB active_file:128kB inactive_file:1728kB unevictable:6784kB isolated(anon):0kB isolated(file):0kB mapped:1792kB dirty:0kB writeback:0kB shmem:25360640kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 5701632kB writeback_tmp:0kB uns>
Aug 17 19:30:09 malbec kernel: Node 0 Normal free:178176kB min:178944kB low:223680kB high:268416kB active_anon:22985600kB inactive_anon:8939904kB active_file:2048kB inactive_file:0kB unevictable:277952kB writepending:0kB present:33554432kB managed:33145408kB mlocked:277952kB kernel_stack:10800kB pagetables:9024kB >
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 1 Normal free:178624kB min:180864kB low:226048kB high:271232kB active_anon:24274944kB inactive_anon:8716864kB active_file:0kB inactive_file:1408kB unevictable:7552kB writepending:0kB present:33554432kB managed:33475072kB mlocked:7552kB kernel_stack:7920kB pagetables:6848kB bounc>
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 16 Normal free:177792kB min:180864kB low:226048kB high:271232kB active_anon:16248896kB inactive_anon:16740224kB active_file:320kB inactive_file:2944kB unevictable:40640kB writepending:192kB present:33554432kB managed:33475840kB mlocked:40640kB kernel_stack:6816kB pagetables:6336>
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 17 Normal free:177088kB min:180096kB low:225088kB high:270080kB active_anon:15652608kB inactive_anon:16452736kB active_file:128kB inactive_file:832kB unevictable:6784kB writepending:0kB present:33554432kB managed:33331776kB mlocked:6784kB kernel_stack:5840kB pagetables:3136kB bo>
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 0 Normal: 674*64kB (UME) 338*128kB (UME) 144*256kB (UME) 82*512kB (UME) 15*1024kB (UM) 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 180608kB
Aug 17 19:30:09 malbec kernel: Node 1 Normal: 424*64kB (UME) 239*128kB (UME) 143*256kB (UME) 99*512kB (UE) 34*1024kB (UME) 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 179840kB
Aug 17 19:30:09 malbec kernel: Node 16 Normal: 529*64kB (UME) 288*128kB (UME) 135*256kB (UME) 83*512kB (UME) 26*1024kB (UME) 3*2048kB (UME) 0*4096kB 0*8192kB 0*16384kB = 180544kB
Aug 17 19:30:09 malbec kernel: Node 17 Normal: 316*64kB (UMEC) 195*128kB (UMEC) 101*256kB (UME) 69*512kB (UME) 39*1024kB (UME) 11*2048kB (UME) 3*4096kB (UME) 0*8192kB 0*16384kB = 181120kB
Aug 17 19:30:09 malbec kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: Node 16 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 16 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: Node 17 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 17 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: 1147690 total pagecache pages
Aug 17 19:30:09 malbec kernel: 0 pages in swap cache
Aug 17 19:30:09 malbec kernel: Swap cache stats: add 0, delete 0, find 0/0
Aug 17 19:30:09 malbec kernel: Free swap = 0kB
Aug 17 19:30:09 malbec kernel: Total swap = 0kB
Aug 17 19:30:09 malbec kernel: 2097152 pages RAM
Aug 17 19:30:09 malbec kernel: 0 pages HighMem/MovableOnly
Aug 17 19:30:09 malbec kernel: 12338 pages reserved
Aug 17 19:30:09 malbec kernel: 104960 pages cma reserved
Aug 17 19:30:09 malbec kernel: 0 pages hwpoisoned
Aug 17 19:30:09 malbec kernel: Tasks state (memory values in pages):
Aug 17 19:30:09 malbec kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Aug 17 19:30:09 malbec kernel: [ 2038] 0 2038 3877 122 79872 0 0 systemd-journal
Aug 17 19:30:09 malbec kernel: [ 2180] 0 2180 315 110 51200 0 -1000 systemd-udevd
Aug 17 19:30:09 malbec kernel: [ 2181] 0 2181 145 90 49152 0 0 haveged
Aug 17 19:30:09 malbec kernel: [ 2974] 0 2974 6688 524 65536 0 -1000 multipathd
Aug 17 19:30:09 malbec kernel: [ 3157] 0 3157 202 19 51200 0 -1000 auditd
Aug 17 19:30:09 malbec kernel: [ 3182] 0 3182 1278 62 51200 0 0 irqbalance
Aug 17 19:30:09 malbec kernel: [ 3188] 499 3188 178 89 51200 0 -900 dbus-daemon
Aug 17 19:30:09 malbec kernel: [ 3197] 496 3197 13269 68 81920 0 0 nscd
Aug 17 19:30:09 malbec kernel: [ 3253] 0 3253 2019 600 59392 0 0 firewalld
Aug 17 19:30:09 malbec kernel: [ 3254] 0 3254 248 108 51200 0 0 systemd-logind
Aug 17 19:30:09 malbec kernel: [ 3255] 0 3255 1348 41 51200 0 0 iprdump
Aug 17 19:30:09 malbec kernel: [ 3258] 0 3258 90 72 51200 0 0 iprinit
Aug 17 19:30:09 malbec kernel: [ 3260] 0 3260 97 76 49152 0 0 iprupdate
Aug 17 19:30:09 malbec kernel: [ 4222] 0 4222 221 74 49152 0 0 wickedd-dhcp6
Aug 17 19:30:09 malbec kernel: [ 4223] 0 4223 217 86 51200 0 0 wickedd-auto4
Aug 17 19:30:09 malbec kernel: [ 4224] 0 4224 225 75 53248 0 0 wickedd-dhcp4
Aug 17 19:30:09 malbec kernel: [ 4289] 468 4289 351 165 51200 0 0 ovsdb-server
Aug 17 19:30:09 malbec kernel: [ 4424] 468 4424 130389 4714 337920 0 0 ovs-vswitchd
Aug 17 19:30:09 malbec kernel: [ 4443] 0 4443 241 138 51200 0 0 wickedd
Aug 17 19:30:09 malbec kernel: [ 4508] 0 4508 221 85 51200 0 0 wickedd-nanny
Aug 17 19:30:09 malbec kernel: [ 5483] 497 5483 30592 178 86016 0 0 polkitd
Aug 17 19:30:09 malbec kernel: [ 8319] 0 8319 69625 546 258048 0 0 telegraf
Aug 17 19:30:09 malbec kernel: [ 8322] 0 8322 794 570 55296 0 0 salt-minion
Aug 17 19:30:09 malbec kernel: [ 8325] 0 8325 149 80 49152 0 0 xinetd
Aug 17 19:30:09 malbec kernel: [ 8328] 0 8328 475 131 53248 0 0 rebootmgrd
Aug 17 19:30:09 malbec kernel: [ 8330] 478 8330 143 22 51200 0 0 nrpe
Aug 17 19:30:09 malbec kernel: [ 8334] 0 8334 2137 56 65536 0 -1000 tgtd
Aug 17 19:30:09 malbec kernel: [ 8335] 0 8335 69 14 49152 0 0 tgtd
Aug 17 19:30:09 malbec kernel: [ 8341] 0 8341 566 420 53248 0 0 os-autoinst-ope
Aug 17 19:30:09 malbec kernel: [ 8354] 467 8354 1248 19 53248 0 0 chronyd
Aug 17 19:30:09 malbec kernel: [ 8420] 0 8420 280 156 51200 0 -1000 sshd
Aug 17 19:30:09 malbec kernel: [ 8640] 0 8640 310 88 51200 0 0 master
Aug 17 19:30:09 malbec kernel: [ 8642] 51 8642 306 83 51200 0 0 qmgr
Aug 17 19:30:09 malbec kernel: [ 8671] 0 8671 105 65 49152 0 0 cron
Aug 17 19:30:09 malbec kernel: [ 9188] 0 9188 15886 1296 102400 0 0 salt-minion
Aug 17 19:30:09 malbec kernel: [ 9691] 0 9691 2059 594 59392 0 0 salt-minion
Aug 17 19:30:09 malbec kernel: [ 13850] 0 13850 62 16 49152 0 0 agetty
Aug 17 19:30:09 malbec kernel: [ 13851] 0 13851 62 17 51200 0 0 agetty
Aug 17 19:30:09 malbec kernel: [ 21287] 495 21287 192 105 51200 0 0 rpcbind
Aug 17 19:30:09 malbec kernel: [ 21288] 486 21288 160 40 49152 0 0 rpc.statd
Aug 17 19:30:09 malbec kernel: [ 12051] 479 12051 1407 1080 59392 0 0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [ 12053] 479 12053 1276 1074 59392 0 0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [ 12542] 479 12542 1294 1050 57344 0 0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [ 12543] 479 12543 1294 1050 57344 0 0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [ 12544] 479 12544 1294 1048 57344 0 0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [ 12545] 479 12545 1294 1031 57344 0 0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [ 31346] 479 31346 1273 1121 59392 0 0 worker
Aug 17 19:30:09 malbec kernel: [ 33998] 479 33998 1225 1076 59392 0 0 worker
Aug 17 19:30:09 malbec kernel: [ 34406] 479 34406 1252 1102 57344 0 0 worker
Aug 17 19:30:09 malbec kernel: [ 34515] 479 34515 2926 2648 73728 0 0 perl
Aug 17 19:30:09 malbec kernel: [ 34520] 479 34520 3578 3294 77824 0 0 perl
Aug 17 19:30:09 malbec kernel: [ 34521] 479 34521 1780 1422 63488 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34529] 479 34529 2038 1671 65536 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34533] 479 34533 42600 2578 126976 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34552] 479 34552 50627 10599 190464 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34571] 479 34571 21086 214 63488 0 0 videoencoder
Aug 17 19:30:09 malbec kernel: [ 34590] 479 34590 484556 391935 3457024 0 0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: [ 34607] 479 34607 42778 2745 129024 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34626] 479 34626 46511 6469 159744 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34645] 479 34645 21086 165 63488 0 0 videoencoder
Aug 17 19:30:09 malbec kernel: [ 34664] 479 34664 76856 19866 348160 0 0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: [ 34700] 479 34700 2927 2641 75776 0 0 perl
Aug 17 19:30:09 malbec kernel: [ 34709] 479 34709 1855 1467 65536 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34715] 479 34715 42692 2629 129024 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34734] 479 34734 47646 7538 167936 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 34753] 479 34753 21086 213 63488 0 0 videoencoder
Aug 17 19:30:09 malbec kernel: [ 34772] 479 34772 434586 391811 3317760 0 0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: [ 40474] 51 40474 301 103 51200 0 0 pickup
Aug 17 19:30:09 malbec kernel: [ 41035] 479 41035 1186 1035 59392 0 0 worker
Aug 17 19:30:09 malbec kernel: [ 43037] 479 43037 2885 2599 73728 0 0 perl
Aug 17 19:30:09 malbec kernel: [ 43050] 479 43050 2032 1675 65536 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 43207] 479 43207 42584 2585 126976 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 43231] 479 43231 46594 6423 157696 0 0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [ 43250] 479 43250 21086 164 65536 0 0 videoencoder
Aug 17 19:30:09 malbec kernel: [ 43269] 479 43269 96241 15497 413696 0 0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,16-17,global_oom,task_memcg=/,task=qemu-system-ppc,pid=34590,uid=479
Aug 17 19:30:09 malbec kernel: Out of memory: Killed process 34590 (qemu-system-ppc) total-vm:31011584kB, anon-rss:25077184kB, file-rss:6592kB, shmem-rss:64kB
Aug 17 19:30:09 malbec kernel: oom_reaper: reaped process 34590 (qemu-system-ppc), now anon-rss:0kB, file-rss:576kB, shmem-rss:0kB
So the situation resolved itself when the OOM killer killed qemu-system-ppc. Apparently salt-minion invoked the OOM killer, but considering that qemu-system-ppc was actually killed (and that this helped), it is unlikely to be the problem.
We also see that the corresponding job ended up incomplete with the reason "backend died: QEMU was killed due to the system being out of memory": https://openqa.suse.de/tests/6872185
So at least the error message the user ends up with is very clear. Relevant log from the job:
[2021-08-17T18:23:14.842 CEST] [debug] starting: /usr/bin/qemu-system-ppc64 -g 1024x768 -vga std -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 24576 -machine usb=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken -cpu host -netdev tap,id=qanet0,ifname=tap1,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:03:92 -boot order=c -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 8,threads=1 -enable-kvm -no-shutdown -vnc :92,share=force-shared -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=hd1-file,filename=/var/lib/openqa/pool/2/raid/hd1,cache.no-flush=on -blockdev driver=qcow2,node-name=hd1,file=hd1-file,cache.no-flush=on -device virtio-blk,id=hd1-device,drive=hd1,serial=hd1 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-ovelay0,serial=cd0
[2021-08-17T18:23:16.969 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-08-17T18:23:16.969 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-08-17T18:23:16.969 CEST] [debug] QEMU: KVM: Failed to create TCE64 table for liobn 0x80000000
The job has the setting QEMURAM=24576 leading to the -m 24576 parameter which sets the VM's RAM to 24 GiB. Considering we have only 4 worker slots on the machine, which has 127 GiB of RAM, that actually seems reasonable.
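As a quick sanity check of that claim (a rough sketch; it assumes all 4 slots run such a VM at the same time):
# all 4 slots with 24 GiB guests vs. the 127 GiB of host RAM
echo "$((4 * 24)) GiB of guest RAM at most"   # 96 GiB, leaving roughly 31 GiB for the host and QEMU overhead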
Updated by okurz over 3 years ago
The job has the setting QEMURAM=24576 leading to the -m 24576 parameter which sets the VM's RAM to 24 GiB. Considering we have only 4 worker slots on the machine, which has 127 GiB of RAM, that actually seems reasonable.
What do you see as reasonable? Do you know what memory other jobs at the same time had?
Updated by openqa_review over 3 years ago
- Due date set to 2021-09-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 3 years ago
What do you see as reasonable?
I said that having 24 GiB per job in that setup actually is reasonable.
Do you know what memory other jobs at the same time had?
I don't know the IDs of the other jobs. However, the table shows that there were 4 QEMU processes running, using 29.5750, 26.5250, 5.8741 and 4.6909 GiB of virtual memory. (The table counts in number of pages so I multiplied by the page size.) That means spawning a VM with 24 GiB can result in QEMU using 30 GiB of virtual memory. However, I'm not sure whether my computation is correct. I've checked that the page size on the machine is 65536 bytes but I'm not sure how huge pages come into play here.
Here's a sorted version of the table logged by the OOM killer: https://w3.suse.de/~mkittler/malbec-oom.ods
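A minimal sketch of that conversion (assuming a 64 KiB page size as checked above and ignoring huge pages; the page counts are the total_vm values of the four qemu-system-ppc rows in the table):
# total_vm is given in pages; multiply by the 64 KiB page size to get bytes
for pages in 484556 434586 96241 76856; do
    awk -v p="$pages" 'BEGIN { printf "%8d pages -> %7.4f GiB\n", p, p * 65536 / 1024^3 }'
done
# e.g. 484556 pages -> 29.5750 GiB, matching the "total-vm:31011584kB" reported for PID 34590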
Updated by mkittler over 3 years ago
We actually have many jobs with such a huge VM size, e.g. the query
select jobs.id as job_id, scheduled_product_id,
       (select host from workers where id = assigned_worker_id) as worker,
       t_finished, pg_size_pretty(value::bigint * 1024 * 1024) as vm_ram, result, reason
  from jobs join job_settings on jobs.id = job_settings.job_id
  where key = 'QEMURAM' and cast(value as bigint) >= 24576
    and t_finished >= '2021-08-20T00:00:00' and jobs.id > 6930000
  order by t_finished;
returns 390 jobs since 2021-08-24 15:03:05. Only two of those jobs have been executed on malbec. On the other hand, we're seeing the same problem on other workers as well:
openqa=> select jobs.id as job_id, scheduled_product_id, (select host from workers where id = assigned_worker_id) as worker, t_finished, pg_size_pretty(value::bigint * 1024 * 1024) as vm_ram, result, reason from jobs join job_settings on jobs.id = job_settings.job_id where key = 'QEMURAM' and cast(value as bigint) >= 24576 and t_finished >= '2021-08-01T00:00:00' and result = 'incomplete' and reason like '%QEMU%' order by t_finished;
job_id | scheduled_product_id | worker | t_finished | vm_ram | result | reason
---------+----------------------+---------------------+---------------------+--------+------------+---------------------------------------------------------------------
6872185 | 778785 | malbec | 2021-08-17 17:30:11 | 24 GB | incomplete | backend died: QEMU was killed due to the system being out of memory
6906616 | | powerqaworker-qam-1 | 2021-08-20 15:46:41 | 32 GB | incomplete | backend died: QEMU exited unexpectedly, see log for details
6908043 | | powerqaworker-qam-1 | 2021-08-20 16:09:27 | 32 GB | incomplete | backend died: QEMU exited unexpectedly, see log for details
6908135 | | malbec | 2021-08-20 17:20:39 | 32 GB | incomplete | backend died: QEMU exited unexpectedly, see log for details
6949729 | 785086 | powerqaworker-qam-1 | 2021-08-27 11:01:22 | 32 GB | incomplete | backend died: QEMU was killed due to the system being out of memory
(The jobs where QEMU exited unexpectedly have the following line in their logs, so they're basically about the same problem:
[2021-08-20T17:46:40.157 CEST] [warn] !!! : qemu-system-ppc64: Failed to allocate KVM HPT of order 28 (try smaller maxmem?): Cannot allocate memory
)
The first job with a big QEMURAM was scheduled in 2019:
openqa=> select jobs.id as job_id, scheduled_product_id, (select host from workers where id = assigned_worker_id) as worker, t_finished, pg_size_pretty(value::bigint * 1024 * 1024) as vm_ram, result, reason from jobs join job_settings on jobs.id = job_settings.job_id where key = 'QEMURAM' and cast(value as bigint) >= 24576 order by job_id limit 1;
job_id | scheduled_product_id | worker | t_finished | vm_ram | result | reason
---------+----------------------+-----------------+---------------------+--------+--------+--------
2973985 | 434592 | QA-Power8-4-kvm | 2019-06-13 17:21:18 | 24 GB | failed |
(1 row)
Updated by mkittler over 3 years ago
Looks like QEMURAM is set on all levels (test suites, machines, mediums, job templates). However, QEMURAM=24576 is only set on several test suites and the machines 64bit-sap-qam and ppc64le-sap-qam. QEMURAM=32768 is only set on the machines 64bit-sap, ppc64le-sap, and uefi-sap. Luckily job templates don't contain very high QEMURAM settings so far.
To summarize: Problematic test suites are sles4sap_scc_* and problematic machines are *-sap and *-sap-qam.
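This can also be checked directly in the database; a sketch of the kind of query one could use (run e.g. via sudo -u geekotest psql openqa; the settings table and column names are assumptions based on the usual openQA schema, and products/job templates could be queried the same way):
sudo -u geekotest psql openqa <<'SQL'
-- machines and test suites that configure a large QEMURAM
select 'machine' as origin, m.name, s.value as qemuram
  from machines m join machine_settings s on s.machine_id = m.id
  where s.key = 'QEMURAM' and cast(s.value as bigint) >= 24576
union all
select 'test suite', t.name, s.value
  from test_suites t join test_suite_settings s on s.test_suite_id = t.id
  where s.key = 'QEMURAM' and cast(s.value as bigint) >= 24576;
SQL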
I see multiple ways to improve:
1. Remove high QEMURAM settings from jobs which don't really need it by removing it from the corresponding test suites and machines.
2. Schedule jobs which really need a high QEMURAM setting with a special WORKER_CLASS of which we would only provide a limited number of slots per worker. It doesn't look like a special worker class is used so far (see the sketch after this list).
3. Decide whether the setting should go into machine settings or test suites instead of spreading it over multiple places.
It would make sense for test maintainers to try to identify what falls under 1. I could then try to convert the remaining occurrences to use a special worker class and configure a limited number of worker slots on worker hosts which currently provide the normal worker class.
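For 2., a rough sketch of how a worker host could offer only a limited number of large-memory slots; the class name and the 2-out-of-4 split are placeholders, and on OSD this would of course be configured via the salt pillars rather than by editing workers.ini by hand:
# excerpt of /etc/openqa/workers.ini on the worker host (sketch)
[global]
WORKER_CLASS = qemu_ppc64le

# only the first two of the four slots additionally offer the large-memory class
[1]
WORKER_CLASS = qemu_ppc64le,qemu_ppc64le-large-mem
[2]
WORKER_CLASS = qemu_ppc64le,qemu_ppc64le-large-mem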
Updated by mkittler over 3 years ago
When it comes to memory there's also another special configuration on malbec: it uses a tmpfs for the pool directory. According to salt -C 'G@roles:worker' cmd.run 'df -h /var/lib/openqa/pool' that's only the case for malbec.arch.suse.de and powerqaworker-qam-1.qa.suse.de. That matches the worker hosts shown in the table of my previous comment #97139#note-8. Maybe it is also sufficient to revert that configuration. I'm wondering why it was configured this way, though. At least it looks like one cannot have both at the same time (pool as tmpfs and running jobs with high VM memory).
By the way, this was brought to my attention because one job on powerqaworker-qam-1.qa.suse.de just incompleted after running out of disk space on the tmpfs: https://openqa.suse.de/tests/6953619/logfile?filename=autoinst-log.txt
Updated by mkittler over 3 years ago
- Status changed from In Progress to Feedback
I was told that 24 GiB is really the minimum RAM required here. So I've created an SR to add a special worker class like the one we already have for x86_64: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/347
Updated by mkittler over 3 years ago
The SR has been merged and I've set the worker class on the mentioned machines. I suppose that should cover all jobs so no modifications to the test suites are required.
I have also updated 64bit-sap to use WORKER_CLASS=qemu_x86_64-large-mem for consistency with ppc64le-sap (which I updated as part of this ticket) and 64bit-sap-qam (which already used …-large-mem) because it also has a high QEMURAM setting.
Updated by mkittler over 3 years ago
The new config is effective on OSD so I'm resolving the ticket. If the problem nevertheless occurs again, we will get an alert and can reopen the ticket.
Updated by okurz over 3 years ago
- Related to action #98682: jobs run powerqaworker-qam-1 fail with auto_review:"(?s)powerqaworker-qam-1.*Can't write to file (.*): No space left on device at .*":retry size:M added