action #97139 (closed)

[alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: -
Target version: -
Start date: 2021-08-18
Due date: 2021-09-09
% Done: 0%
Estimated time: -

Description

Observation

See http://stats.openqa-monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?tab=alert&viewPanel=12054&orgId=1

Acceptance criteria

  • AC1: We have understood the reason why the alert triggered
  • AC2: Actions have been applied to prevent a similar situation in the future

Suggestion

  • Check history of monitoring data
  • Check logs for anything that we still need to fix (see the sketch after this list)
  • Check if it was maybe a one-off experiment by a person, talk to the person and tell them "don't do it again!"
  • Consider adapting alerting thresholds
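
A minimal sketch of the log check, run on the worker itself (the concrete time window is an assumption taken from the investigation below; adjust as needed):

# show everything the journal logged around the alert window
journalctl --since "2021-08-17 18:30" --until "2021-08-17 19:35"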

Related issues: 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #98682: jobs run powerqaworker-qam-1 fail with auto_review:"(?s)powerqaworker-qam-1.*Can't write to file (.*): No space left on device at .*":retry size:M (Workable, 2021-09-15)

Copied from openQA Infrastructure - action #97136: [alert] multiple unhandled alerts about "broken workers" size:M (Resolved, dheidler, 2021-08-18)

Actions #1

Updated by okurz over 2 years ago

  • Copied from action #97136: [alert] multiple unhandled alerts about "broken workers" size:M added
Actions #2

Updated by okurz over 2 years ago

  • Subject changed from [alert] multiple unhandled alerts about "malbec: Memory usage alert" to [alert] multiple unhandled alerts about "malbec: Memory usage alert" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

Looks like something really used up almost the entire memory, up to the point where only 477 MiB (of 127 GiB) were free. Then the memory was suddenly freed up again. Maybe it was just a single process which eventually got killed.

Judging by the alert's state history, the problem is not happening very often, but I'll have a look at the worker logs and check what special processes could possibly run on that worker.

Actions #4

Updated by mkittler over 2 years ago

There's not much going on. The worker only has 4 slots. When idling, the top process is ovs-vswitchd, which uses 8149 MiB of (virtual) memory, but it is likely not the culprit.

The free memory rapidly declined at 18:53 (and had started to decline slowly at 18:45). The memory was freed again at 19:30. Interesting log messages, starting shortly before the last alert triggered:

Aug 17 18:39:07 malbec wickedd-dhcp4[4224]: eth4: Committed DHCPv4 lease with address 10.161.24.54 (lease time 1800 sec, renew in 900 sec, rebind in 1575 sec)
Aug 17 18:39:07 malbec wickedd[4443]: route ipv4 0.0.0.0/0 via 10.161.0.1 dev eth4#6 type unicast table main scope universe protocol dhcp covered by a ipv4:dhcp lease
Aug 17 18:39:07 malbec wickedd[4443]: ni_process_reap: process 36869 has not exited yet; now doing a blocking waitpid()
Aug 17 18:54:07 malbec wickedd-dhcp4[4224]: eth4: Committed DHCPv4 lease with address 10.161.24.54 (lease time 1800 sec, renew in 900 sec, rebind in 1575 sec)
Aug 17 18:54:07 malbec wickedd[4443]: route ipv4 0.0.0.0/0 via 10.161.0.1 dev eth4#6 type unicast table main scope universe protocol dhcp covered by a ipv4:dhcp lease
Aug 17 18:54:07 malbec wickedd[4443]: ni_process_reap: process 39315 has not exited yet; now doing a blocking waitpid()
Aug 17 18:59:37 malbec wickedd-dhcp6[4222]: eth4: Committing DHCPv6 lease with:
Aug 17 18:59:37 malbec wickedd-dhcp6[4222]: eth4    +ia-na.address 2620:113:80c0:8000:10:161:24:54/0, pref-lft 9000, valid-lft 14400
Aug 17 18:59:40 malbec wickedd[4443]: ni_process_reap: process 40024 has not exited yet; now doing a blocking waitpid()
Aug 17 19:00:00 malbec systemd[1]: Started Timeline of Snapper Snapshots.
Aug 17 19:00:00 malbec dbus-daemon[3188]: [system] Activating service name='org.opensuse.Snapper' requested by ':1.324' (uid=0 pid=40267 comm="/usr/lib/snapper/systemd-helper --timeline ") (using servicehelper)
Aug 17 19:00:00 malbec dbus-daemon[3188]: [system] Successfully activated service 'org.opensuse.Snapper'
Aug 17 19:01:30 malbec systemd[1]: dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device: Job dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device/start timed out.
Aug 17 19:01:30 malbec systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device.
Aug 17 19:01:30 malbec systemd[1]: Dependency failed for /dev/disk/by-uuid/b4df213d-22b0-4264-9fa0-777474872c8a.
Aug 17 19:01:30 malbec systemd[1]: dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.swap: Job dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.swap/start failed with result 'dependency'.
Aug 17 19:01:30 malbec systemd[1]: dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device: Job dev-disk-by\x2duuid-b4df213d\x2d22b0\x2d4264\x2d9fa0\x2d777474872c8a.device/start failed with result 'timeout'.
Aug 17 19:07:39 malbec ovs-vsctl[41023]: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove port tap3 tag 9
…
Aug 17 19:30:06 malbec kernel: salt-minion invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Aug 17 19:30:08 malbec kernel: CPU: 40 PID: 9188 Comm: salt-minion Kdump: loaded Not tainted 5.3.18-lp152.87-default #1 openSUSE Leap 15.2
Aug 17 19:30:08 malbec kernel: Call Trace:
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb5a0] [c000000000d0f048] dump_stack+0xbc/0x104 (unreliable)
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb5e0] [c000000000383940] dump_header+0x60/0x2e0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb670] [c00000000038416c] oom_kill_process+0x19c/0x2c0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb6b0] [c000000000385414] out_of_memory+0x114/0x720
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb750] [c0000000004028e8] __alloc_pages_slowpath+0xa78/0xe20
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb900] [c000000000402fa8] __alloc_pages_nodemask+0x318/0x3e0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb980] [c0000000004280e0] alloc_pages_current+0xa0/0x140
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb9c0] [c000000000377888] __page_cache_alloc+0xb8/0x120
Aug 17 19:30:09 malbec kernel: [c000000fe9bdb9f0] [c00000000037af78] pagecache_get_page+0x128/0x490
Aug 17 19:30:09 malbec kernel: [c000000fe9bdba50] [c00000000037c25c] filemap_fault+0x51c/0xd50
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbb70] [c0000000003d1f8c] __do_fault+0x5c/0x200
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbbb0] [c0000000003d9254] __handle_mm_fault+0x1174/0x1b10
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbca0] [c0000000003d9d1c] handle_mm_fault+0x12c/0x220
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbce0] [c00000000007a618] __do_page_fault+0x338/0xe80
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbde0] [c00000000007b198] do_page_fault+0x38/0xc0
Aug 17 19:30:09 malbec kernel: [c000000fe9bdbe20] [c00000000000a908] handle_page_fault+0x10/0x30
Aug 17 19:30:09 malbec kernel: Mem-Info:
Aug 17 19:30:09 malbec kernel: active_anon:1236907 inactive_anon:794527 isolated_anon:0
                                active_file:66 inactive_file:143 isolated_file:0
                                unevictable:5202 dirty:3 writeback:0 unstable:0
                                slab_reclaimable:2156 slab_unreclaimable:10551
                                mapped:444 shmem:1146897 pagetables:396 bounce:0
                                free:11120 free_pcp:1 free_cma:4
Aug 17 19:30:09 malbec kernel: Node 0 active_anon:22985600kB inactive_anon:8939904kB active_file:3904kB inactive_file:576kB unevictable:277952kB isolated(anon):0kB isolated(file):0kB mapped:23936kB dirty:0kB writeback:0kB shmem:12452992kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 17367040kB writeback_tmp:0kB u>
Aug 17 19:30:09 malbec kernel: Node 1 active_anon:24274944kB inactive_anon:8716864kB active_file:0kB inactive_file:1408kB unevictable:7552kB isolated(anon):0kB isolated(file):0kB mapped:512kB dirty:0kB writeback:0kB shmem:10899776kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 20529152kB writeback_tmp:0kB unstabl>
Aug 17 19:30:09 malbec kernel: Node 16 active_anon:16248896kB inactive_anon:16740224kB active_file:320kB inactive_file:5440kB unevictable:40640kB isolated(anon):0kB isolated(file):0kB mapped:2176kB dirty:192kB writeback:0kB shmem:24688000kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 5718016kB writeback_tmp:0kB >
Aug 17 19:30:09 malbec kernel: Node 17 active_anon:15652608kB inactive_anon:16452736kB active_file:128kB inactive_file:1728kB unevictable:6784kB isolated(anon):0kB isolated(file):0kB mapped:1792kB dirty:0kB writeback:0kB shmem:25360640kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 5701632kB writeback_tmp:0kB uns>
Aug 17 19:30:09 malbec kernel: Node 0 Normal free:178176kB min:178944kB low:223680kB high:268416kB active_anon:22985600kB inactive_anon:8939904kB active_file:2048kB inactive_file:0kB unevictable:277952kB writepending:0kB present:33554432kB managed:33145408kB mlocked:277952kB kernel_stack:10800kB pagetables:9024kB >
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 1 Normal free:178624kB min:180864kB low:226048kB high:271232kB active_anon:24274944kB inactive_anon:8716864kB active_file:0kB inactive_file:1408kB unevictable:7552kB writepending:0kB present:33554432kB managed:33475072kB mlocked:7552kB kernel_stack:7920kB pagetables:6848kB bounc>
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 16 Normal free:177792kB min:180864kB low:226048kB high:271232kB active_anon:16248896kB inactive_anon:16740224kB active_file:320kB inactive_file:2944kB unevictable:40640kB writepending:192kB present:33554432kB managed:33475840kB mlocked:40640kB kernel_stack:6816kB pagetables:6336>
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 17 Normal free:177088kB min:180096kB low:225088kB high:270080kB active_anon:15652608kB inactive_anon:16452736kB active_file:128kB inactive_file:832kB unevictable:6784kB writepending:0kB present:33554432kB managed:33331776kB mlocked:6784kB kernel_stack:5840kB pagetables:3136kB bo>
Aug 17 19:30:09 malbec kernel: lowmem_reserve[]: 0 0 0
Aug 17 19:30:09 malbec kernel: Node 0 Normal: 674*64kB (UME) 338*128kB (UME) 144*256kB (UME) 82*512kB (UME) 15*1024kB (UM) 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 180608kB
Aug 17 19:30:09 malbec kernel: Node 1 Normal: 424*64kB (UME) 239*128kB (UME) 143*256kB (UME) 99*512kB (UE) 34*1024kB (UME) 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 179840kB
Aug 17 19:30:09 malbec kernel: Node 16 Normal: 529*64kB (UME) 288*128kB (UME) 135*256kB (UME) 83*512kB (UME) 26*1024kB (UME) 3*2048kB (UME) 0*4096kB 0*8192kB 0*16384kB = 180544kB
Aug 17 19:30:09 malbec kernel: Node 17 Normal: 316*64kB (UMEC) 195*128kB (UMEC) 101*256kB (UME) 69*512kB (UME) 39*1024kB (UME) 11*2048kB (UME) 3*4096kB (UME) 0*8192kB 0*16384kB = 181120kB
Aug 17 19:30:09 malbec kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: Node 16 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 16 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: Node 17 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
Aug 17 19:30:09 malbec kernel: Node 17 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
Aug 17 19:30:09 malbec kernel: 1147690 total pagecache pages
Aug 17 19:30:09 malbec kernel: 0 pages in swap cache
Aug 17 19:30:09 malbec kernel: Swap cache stats: add 0, delete 0, find 0/0
Aug 17 19:30:09 malbec kernel: Free swap  = 0kB
Aug 17 19:30:09 malbec kernel: Total swap = 0kB
Aug 17 19:30:09 malbec kernel: 2097152 pages RAM
Aug 17 19:30:09 malbec kernel: 0 pages HighMem/MovableOnly
Aug 17 19:30:09 malbec kernel: 12338 pages reserved
Aug 17 19:30:09 malbec kernel: 104960 pages cma reserved
Aug 17 19:30:09 malbec kernel: 0 pages hwpoisoned
Aug 17 19:30:09 malbec kernel: Tasks state (memory values in pages):
Aug 17 19:30:09 malbec kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Aug 17 19:30:09 malbec kernel: [   2038]     0  2038     3877      122    79872        0             0 systemd-journal
Aug 17 19:30:09 malbec kernel: [   2180]     0  2180      315      110    51200        0         -1000 systemd-udevd
Aug 17 19:30:09 malbec kernel: [   2181]     0  2181      145       90    49152        0             0 haveged
Aug 17 19:30:09 malbec kernel: [   2974]     0  2974     6688      524    65536        0         -1000 multipathd
Aug 17 19:30:09 malbec kernel: [   3157]     0  3157      202       19    51200        0         -1000 auditd
Aug 17 19:30:09 malbec kernel: [   3182]     0  3182     1278       62    51200        0             0 irqbalance
Aug 17 19:30:09 malbec kernel: [   3188]   499  3188      178       89    51200        0          -900 dbus-daemon
Aug 17 19:30:09 malbec kernel: [   3197]   496  3197    13269       68    81920        0             0 nscd
Aug 17 19:30:09 malbec kernel: [   3253]     0  3253     2019      600    59392        0             0 firewalld
Aug 17 19:30:09 malbec kernel: [   3254]     0  3254      248      108    51200        0             0 systemd-logind
Aug 17 19:30:09 malbec kernel: [   3255]     0  3255     1348       41    51200        0             0 iprdump
Aug 17 19:30:09 malbec kernel: [   3258]     0  3258       90       72    51200        0             0 iprinit
Aug 17 19:30:09 malbec kernel: [   3260]     0  3260       97       76    49152        0             0 iprupdate
Aug 17 19:30:09 malbec kernel: [   4222]     0  4222      221       74    49152        0             0 wickedd-dhcp6
Aug 17 19:30:09 malbec kernel: [   4223]     0  4223      217       86    51200        0             0 wickedd-auto4
Aug 17 19:30:09 malbec kernel: [   4224]     0  4224      225       75    53248        0             0 wickedd-dhcp4
Aug 17 19:30:09 malbec kernel: [   4289]   468  4289      351      165    51200        0             0 ovsdb-server
Aug 17 19:30:09 malbec kernel: [   4424]   468  4424   130389     4714   337920        0             0 ovs-vswitchd
Aug 17 19:30:09 malbec kernel: [   4443]     0  4443      241      138    51200        0             0 wickedd
Aug 17 19:30:09 malbec kernel: [   4508]     0  4508      221       85    51200        0             0 wickedd-nanny
Aug 17 19:30:09 malbec kernel: [   5483]   497  5483    30592      178    86016        0             0 polkitd
Aug 17 19:30:09 malbec kernel: [   8319]     0  8319    69625      546   258048        0             0 telegraf
Aug 17 19:30:09 malbec kernel: [   8322]     0  8322      794      570    55296        0             0 salt-minion
Aug 17 19:30:09 malbec kernel: [   8325]     0  8325      149       80    49152        0             0 xinetd
Aug 17 19:30:09 malbec kernel: [   8328]     0  8328      475      131    53248        0             0 rebootmgrd
Aug 17 19:30:09 malbec kernel: [   8330]   478  8330      143       22    51200        0             0 nrpe
Aug 17 19:30:09 malbec kernel: [   8334]     0  8334     2137       56    65536        0         -1000 tgtd
Aug 17 19:30:09 malbec kernel: [   8335]     0  8335       69       14    49152        0             0 tgtd
Aug 17 19:30:09 malbec kernel: [   8341]     0  8341      566      420    53248        0             0 os-autoinst-ope
Aug 17 19:30:09 malbec kernel: [   8354]   467  8354     1248       19    53248        0             0 chronyd
Aug 17 19:30:09 malbec kernel: [   8420]     0  8420      280      156    51200        0         -1000 sshd
Aug 17 19:30:09 malbec kernel: [   8640]     0  8640      310       88    51200        0             0 master
Aug 17 19:30:09 malbec kernel: [   8642]    51  8642      306       83    51200        0             0 qmgr
Aug 17 19:30:09 malbec kernel: [   8671]     0  8671      105       65    49152        0             0 cron
Aug 17 19:30:09 malbec kernel: [   9188]     0  9188    15886     1296   102400        0             0 salt-minion
Aug 17 19:30:09 malbec kernel: [   9691]     0  9691     2059      594    59392        0             0 salt-minion
Aug 17 19:30:09 malbec kernel: [  13850]     0 13850       62       16    49152        0             0 agetty
Aug 17 19:30:09 malbec kernel: [  13851]     0 13851       62       17    51200        0             0 agetty
Aug 17 19:30:09 malbec kernel: [  21287]   495 21287      192      105    51200        0             0 rpcbind
Aug 17 19:30:09 malbec kernel: [  21288]   486 21288      160       40    49152        0             0 rpc.statd
Aug 17 19:30:09 malbec kernel: [  12051]   479 12051     1407     1080    59392        0             0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [  12053]   479 12053     1276     1074    59392        0             0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [  12542]   479 12542     1294     1050    57344        0             0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [  12543]   479 12543     1294     1050    57344        0             0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [  12544]   479 12544     1294     1048    57344        0             0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [  12545]   479 12545     1294     1031    57344        0             0 openqa-workerca
Aug 17 19:30:09 malbec kernel: [  31346]   479 31346     1273     1121    59392        0             0 worker
Aug 17 19:30:09 malbec kernel: [  33998]   479 33998     1225     1076    59392        0             0 worker
Aug 17 19:30:09 malbec kernel: [  34406]   479 34406     1252     1102    57344        0             0 worker
Aug 17 19:30:09 malbec kernel: [  34515]   479 34515     2926     2648    73728        0             0 perl
Aug 17 19:30:09 malbec kernel: [  34520]   479 34520     3578     3294    77824        0             0 perl
Aug 17 19:30:09 malbec kernel: [  34521]   479 34521     1780     1422    63488        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34529]   479 34529     2038     1671    65536        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34533]   479 34533    42600     2578   126976        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34552]   479 34552    50627    10599   190464        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34571]   479 34571    21086      214    63488        0             0 videoencoder
Aug 17 19:30:09 malbec kernel: [  34590]   479 34590   484556   391935  3457024        0             0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: [  34607]   479 34607    42778     2745   129024        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34626]   479 34626    46511     6469   159744        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34645]   479 34645    21086      165    63488        0             0 videoencoder
Aug 17 19:30:09 malbec kernel: [  34664]   479 34664    76856    19866   348160        0             0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: [  34700]   479 34700     2927     2641    75776        0             0 perl
Aug 17 19:30:09 malbec kernel: [  34709]   479 34709     1855     1467    65536        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34715]   479 34715    42692     2629   129024        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34734]   479 34734    47646     7538   167936        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  34753]   479 34753    21086      213    63488        0             0 videoencoder
Aug 17 19:30:09 malbec kernel: [  34772]   479 34772   434586   391811  3317760        0             0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: [  40474]    51 40474      301      103    51200        0             0 pickup
Aug 17 19:30:09 malbec kernel: [  41035]   479 41035     1186     1035    59392        0             0 worker
Aug 17 19:30:09 malbec kernel: [  43037]   479 43037     2885     2599    73728        0             0 perl
Aug 17 19:30:09 malbec kernel: [  43050]   479 43050     2032     1675    65536        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  43207]   479 43207    42584     2585   126976        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  43231]   479 43231    46594     6423   157696        0             0 /usr/bin/isotov
Aug 17 19:30:09 malbec kernel: [  43250]   479 43250    21086      164    65536        0             0 videoencoder
Aug 17 19:30:09 malbec kernel: [  43269]   479 43269    96241    15497   413696        0             0 qemu-system-ppc
Aug 17 19:30:09 malbec kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,16-17,global_oom,task_memcg=/,task=qemu-system-ppc,pid=34590,uid=479
Aug 17 19:30:09 malbec kernel: Out of memory: Killed process 34590 (qemu-system-ppc) total-vm:31011584kB, anon-rss:25077184kB, file-rss:6592kB, shmem-rss:64kB
Aug 17 19:30:09 malbec kernel: oom_reaper: reaped process 34590 (qemu-system-ppc), now anon-rss:0kB, file-rss:576kB, shmem-rss:0kB

So the situation resolved itself when the OOM killer killed qemu-system-ppc. Apparently salt-minion invoked the OOM killer, but considering that qemu-system-ppc was actually killed (and that this helped), salt-minion is unlikely to be the problem: the invoking process is merely whichever one happens to hit a failing allocation, while the kernel chooses the victim by its OOM score.
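
The kernel exposes the score it uses for victim selection per process, so a suspect can be checked while the memory pressure is ongoing (a sketch; the PID is the qemu process from the log above and is only valid while it runs):

# badness score the OOM killer would consider for this process (higher = killed first)
cat /proc/34590/oom_score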

We also see that the corresponding job ended up incomplete with the reason "backend died: QEMU was killed due to the system being out of memory": https://openqa.suse.de/tests/6872185
So at least the error message the user ends up with is very clear. Relevant log from the job:

[2021-08-17T18:23:14.842 CEST] [debug] starting: /usr/bin/qemu-system-ppc64 -g 1024x768 -vga std -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 24576 -machine usb=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken -cpu host -netdev tap,id=qanet0,ifname=tap1,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:03:92 -boot order=c -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 8,threads=1 -enable-kvm -no-shutdown -vnc :92,share=force-shared -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=hd1-file,filename=/var/lib/openqa/pool/2/raid/hd1,cache.no-flush=on -blockdev driver=qcow2,node-name=hd1,file=hd1-file,cache.no-flush=on -device virtio-blk,id=hd1-device,drive=hd1,serial=hd1 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/2/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-ovelay0,serial=cd0
[2021-08-17T18:23:16.969 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-08-17T18:23:16.969 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-08-17T18:23:16.969 CEST] [debug] QEMU: KVM: Failed to create TCE64 table for liobn 0x80000000

The job has the setting QEMURAM=24576, leading to the -m 24576 parameter which sets the VM's RAM to 24 GiB. Considering we have only 4 worker slots on this machine, which has 127 GiB of RAM, that actually seems reasonable.
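
As a quick sanity check of that claim (a sketch; the slot count and RAM sizes are the figures from this ticket):

# guest RAM alone if all 4 slots run such a job at once
echo "$(( 4 * 24576 / 1024 )) GiB"   # -> 96 GiB of 127 GiB, leaving ~31 GiB for QEMU overhead and the host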

Actions #5

Updated by okurz over 2 years ago

The job has the setting QEMURAM=24576, leading to the -m 24576 parameter which sets the VM's RAM to 24 GiB. Considering we have only 4 worker slots on this machine, which has 127 GiB of RAM, that actually seems reasonable.

What do you see as reasonable? Do you know what memory other jobs at the same time had?

Actions #6

Updated by openqa_review over 2 years ago

  • Due date set to 2021-09-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by mkittler over 2 years ago

What do you see as reasonable?

I said that having 24 GiB per job in that setup is actually reasonable.

Do you know what memory other jobs at the same time had?

I don't know the IDs of the other jobs. However, the table shows that there were 4 QEMU processes running, using 29.5750, 26.5250, 5.8741 and 4.6909 GiB of virtual memory. (The table counts in pages, so I multiplied by the page size.) That means spawning a VM with 24 GiB can result in QEMU using 30 GiB of virtual memory. However, I'm not sure whether my computation is correct. I've checked that the page size on the machine is 65536 bytes, but I'm not sure how huge pages come into play here.

Here's a sorted version of the table logged by the OOM killer: https://w3.suse.de/~mkittler/malbec-oom.ods
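
The conversion can be reproduced from the total_vm column of the OOM table above (a sketch; 65536 bytes is the page size as reported by getconf PAGESIZE on this machine):

# total_vm of the four qemu-system-ppc processes, counted in 64 KiB pages
for pages in 484556 434586 96241 76856; do
  awk -v p="$pages" 'BEGIN { printf "%s pages * 64 KiB = %.4f GiB\n", p, p * 65536 / (1024 ^ 3) }'
done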

Actions #8

Updated by mkittler over 2 years ago

We actually have many jobs with such a huge VM size, e.g. the query

select jobs.id as job_id, scheduled_product_id,
       (select host from workers where id = assigned_worker_id) as worker,
       t_finished, pg_size_pretty(value::bigint * 1024 * 1024) as vm_ram, result, reason
  from jobs join job_settings on jobs.id = job_settings.job_id
 where key = 'QEMURAM' and cast(value as bigint) >= 24576
   and t_finished >= '2021-08-20T00:00:00' and jobs.id > 6930000
 order by t_finished;

returns 390 jobs since 2021-08-24 15:03:05. Only two of those jobs have been executed on malbec. On the other hand, we're seeing the same problem on other workers as well:

openqa=> select jobs.id as job_id, scheduled_product_id, (select host from workers where id = assigned_worker_id) as worker, t_finished, pg_size_pretty(value::bigint * 1024 * 1024) as vm_ram, result, reason from jobs join job_settings on jobs.id = job_settings.job_id where key = 'QEMURAM' and cast(value as bigint) >= 24576 and t_finished >= '2021-08-01T00:00:00' and result = 'incomplete' and reason like '%QEMU%' order by t_finished;
 job_id  | scheduled_product_id |       worker        |     t_finished      | vm_ram |   result   |                               reason                                
---------+----------------------+---------------------+---------------------+--------+------------+---------------------------------------------------------------------
 6872185 |               778785 | malbec              | 2021-08-17 17:30:11 | 24 GB  | incomplete | backend died: QEMU was killed due to the system being out of memory
 6906616 |                      | powerqaworker-qam-1 | 2021-08-20 15:46:41 | 32 GB  | incomplete | backend died: QEMU exited unexpectedly, see log for details
 6908043 |                      | powerqaworker-qam-1 | 2021-08-20 16:09:27 | 32 GB  | incomplete | backend died: QEMU exited unexpectedly, see log for details
 6908135 |                      | malbec              | 2021-08-20 17:20:39 | 32 GB  | incomplete | backend died: QEMU exited unexpectedly, see log for details
 6949729 |               785086 | powerqaworker-qam-1 | 2021-08-27 11:01:22 | 32 GB  | incomplete | backend died: QEMU was killed due to the system being out of memory

(The jobs where QEMU exits unexpectedly have the following line in the log, so they're basically about the same problem:
[2021-08-20T17:46:40.157 CEST] [warn] !!! : qemu-system-ppc64: Failed to allocate KVM HPT of order 28 (try smaller maxmem?): Cannot allocate memory)

The first job with a big QEMURAM was scheduled in 2019:

openqa=> select jobs.id as job_id, scheduled_product_id, (select host from workers where id = assigned_worker_id) as worker, t_finished, pg_size_pretty(value::bigint * 1024 * 1024) as vm_ram, result, reason from jobs join job_settings on jobs.id = job_settings.job_id where key = 'QEMURAM' and cast(value as bigint) >= 24576 order by job_id limit 1;
 job_id  | scheduled_product_id |     worker      |     t_finished      | vm_ram | result | reason 
---------+----------------------+-----------------+---------------------+--------+--------+--------
 2973985 |               434592 | QA-Power8-4-kvm | 2019-06-13 17:21:18 | 24 GB  | failed | 
(1 row)
Actions #9

Updated by mkittler over 2 years ago

Looks like QEMURAM is set on all levels (test suites, machines, mediums, job templates). However, QEMURAM=24576 is only set on several test suites and the machines 64bit-sap-qam and ppc64le-sap-qam. QEMURAM=32768 is only set on the machines 64bit-sap, ppc64le-sap, and uefi-sap. Luckily, job templates don't contain very high QEMURAM settings so far.
To summarize: Problematic test suites are sles4sap_scc_* and problematic machines are *-sap and *-sap-qam.

I see multiple ways to improve:

  1. Remove high QEMURAM settings from jobs which don't really need it by removing it from the corresponding test suites and machines.
  2. Schedule jobs which really need a high QEMURAM setting with a special WORKER_CLASS of which we would only provide a limited number of slots per worker. It doesn't look like a special worker class is used so far (see the sketch after this list).
  3. Decide whether the setting should go into machine settings or test suites instead of spreading it over multiple places.
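
For illustration, option 2 could look roughly like this in a worker host's /etc/openqa/workers.ini (a hypothetical sketch; the class name follows the -large-mem naming used later in this ticket, and the slot split is made up):

[global]
WORKER_CLASS = qemu_ppc64le

# only slot 1 additionally advertises the large-memory class, so at most
# one big-RAM VM runs on this host at a time
[1]
WORKER_CLASS = qemu_ppc64le,qemu_ppc64le-large-mem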

It would make sense for test maintainers to try to identify what falls under 1. I could then try to convert the remaining occurrences to use a special worker class and configure a limited number of worker slots on worker hosts which currently provide the normal worker class.

Actions #10

Updated by mkittler over 2 years ago

When it comes to memory there's also another special configuration on malbec: it uses a tmpfs for the pool directory. According to salt -C 'G@roles:worker' cmd.run 'df -h /var/lib/openqa/pool' that's only the case for malbec.arch.suse.de and powerqaworker-qam-1.qa.suse.de. That matches the worker hosts shown in the table in my previous comment #97139#note-8 very well. Maybe it is also sufficient to revert that configuration. I'm wondering why it was configured this way, though. At least it looks like one cannot have both at the same time (a tmpfs pool and jobs with high VM memory).
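
For context, such a tmpfs pool might be set up with a mount along these lines (purely hypothetical; the actual size and provisioning mechanism on these hosts are not shown in this ticket):

# /etc/fstab: the pool contents then live in RAM, competing with VM memory
tmpfs  /var/lib/openqa/pool  tmpfs  size=60G,mode=0755  0  0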

By the way, this was brought to my attention because one job on powerqaworker-qam-1.qa.suse.de just incompleted after running out of disk space on the tmpfs: https://openqa.suse.de/tests/6953619/logfile?filename=autoinst-log.txt

Actions #11

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

I was told that 24 GiB really is the minimum RAM required here. So I've created an SR to add a special worker class like the one we already have for x86_64: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/347

Actions #12

Updated by mkittler over 2 years ago

The SR has been merged and I've set the worker class in the mentioned machines. I suppose that should cover all jobs so no modifications to the test suites are required.


I have also updated 64bit-sap to use WORKER_CLASS=qemu_x86_64-large-mem, for consistency with ppc64le-sap (which I updated as part of this ticket) and 64bit-sap-qam (which already used …-large-mem), because it also has a high QEMURAM setting.

Actions #13

Updated by mkittler over 2 years ago

The new config is effective on OSD, so I'm resolving the ticket. If the problem nevertheless occurs again, we will get an alert and can reopen the ticket.

Actions #14

Updated by mkittler over 2 years ago

  • Status changed from Feedback to Resolved
Actions #15

Updated by okurz over 2 years ago

  • Related to action #98682: jobs run powerqaworker-qam-1 fail with auto_review:"(?s)powerqaworker-qam-1.*Can't write to file (.*): No space left on device at .*":retry size:M added