action #90161

closed

[Alerting] malbec: Memory usage alert triggered briefly and turned OK within the next minute

Added by livdywan almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2021-03-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

[Alerting] malbec: Memory usage alert

Metric name    Value
available      1848351129.600

The alert turned back to OK within the next minute.
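
To put the reported value in perspective, it can be converted to a more readable unit. This is a minimal sketch that assumes the "available" metric is reported in bytes; the ticket does not state the unit.

```python
# Assumption: the alert's "available" metric is in bytes (not stated in the ticket).
available_bytes = 1848351129.600  # value copied from the alert

GIB = 1024 ** 3  # bytes per GiB
available_gib = available_bytes / GIB

print(f"available memory at alert time: {available_gib:.2f} GiB")
```

Under that assumption the alert fired with roughly 1.72 GiB of memory still reported as available.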

Suggestion

  • See if this comes up again (was it flaky, was it a one-off)
  • Find a related open issue (the alert was not mentioned in any open tickets as of this ticket being filed)

Related issues 1 (0 open, 1 closed)

Copied to openQA Project (public) - action #90974: Make it obvious if qemu gets terminated unexpectedly due to out-of-memory (Resolved, Xiaojing_liu)

Actions #1

Updated by okurz almost 4 years ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Status changed from New to Workable
  • Priority changed from Normal to High
  • Target version set to Ready

The Definition of Done for alert tickets should be clear, so setting this to "Workable" without further explicit acceptance criteria (ACs).

Actions #2

Updated by mkittler over 3 years ago

It hasn't happened again yet.

Actions #3

Updated by okurz over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Actions #4

Updated by okurz over 3 years ago

https://monitor.qa.suse.de/d/WDmalbec/worker-dashboard-malbec?viewPanel=12054&orgId=1&from=1615877632079&to=1615880700634 shows that the available memory went down to 0 at 2021-03-16 08:30

Found OOM in system journal:

Mar 16 08:30:12 malbec kernel: ntpd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Mar 16 08:30:12 malbec kernel: CPU: 0 PID: 11378 Comm: ntpd Kdump: loaded Not tainted 5.3.18-lp152.57-default #1 openSUSE Leap 15.2
Mar 16 08:30:12 malbec kernel: Call Trace:
Mar 16 08:30:12 malbec kernel: [c000000fe9be75b0] [c000000000d0cd28] dump_stack+0xbc/0x104 (unreliable)
Mar 16 08:30:12 malbec kernel: [c000000fe9be75f0] [c000000000383c20] dump_header+0x60/0x2e0
Mar 16 08:30:12 malbec kernel: [c000000fe9be7680] [c00000000038444c] oom_kill_process+0x19c/0x2c0
Mar 16 08:30:12 malbec kernel: [c000000fe9be76c0] [c0000000003856c4] out_of_memory+0x114/0x720
Mar 16 08:30:12 malbec kernel: [c000000fe9be7760] [c000000000402a08] __alloc_pages_slowpath+0xa78/0xe20
Mar 16 08:30:12 malbec kernel: [c000000fe9be7910] [c0000000004030c8] __alloc_pages_nodemask+0x318/0x3e0
Mar 16 08:30:12 malbec kernel: [c000000fe9be7990] [c000000000428300] alloc_pages_current+0xa0/0x140
Mar 16 08:30:12 malbec kernel: [c000000fe9be79d0] [c000000000377b68] __page_cache_alloc+0xb8/0x120
Mar 16 08:30:12 malbec kernel: [c000000fe9be7a00] [c00000000037b258] pagecache_get_page+0x128/0x490
Mar 16 08:30:12 malbec kernel: [c000000fe9be7a60] [c00000000037c53c] filemap_fault+0x51c/0xd50
Mar 16 08:30:12 malbec kernel: [c000000fe9be7b80] [c0000000003d21fc] __do_fault+0x5c/0x200
Mar 16 08:30:12 malbec kernel: [c000000fe9be7bc0] [c0000000003d9494] __handle_mm_fault+0x1144/0x1af0
Mar 16 08:30:12 malbec kernel: [c000000fe9be7cb0] [c0000000003d9f6c] handle_mm_fault+0x12c/0x220
Mar 16 08:30:12 malbec kernel: [c000000fe9be7cf0] [c00000000007a408] __do_page_fault+0x298/0xf10
Mar 16 08:30:12 malbec kernel: [c000000fe9be7de0] [c00000000007b0b8] do_page_fault+0x38/0xc0
Mar 16 08:30:12 malbec kernel: [c000000fe9be7e20] [c00000000000a908] handle_page_fault+0x10/0x30
Mar 16 08:30:12 malbec kernel: Mem-Info:
Mar 16 08:30:12 malbec kernel: active_anon:1401138 inactive_anon:628449 isolated_anon:0
                                active_file:214 inactive_file:0 isolated_file:0
                                unevictable:4480 dirty:7 writeback:0 unstable:0
                                slab_reclaimable:2945 slab_unreclaimable:12285
                                mapped:427 shmem:844311 pagetables:486 bounce:0
                                free:10996 free_pcp:0 free_cma:39
…
Mar 16 08:30:13 malbec kernel: [  49233]   479 49233   616370   525869  4550656        0             0 qemu-system-ppc
Mar 16 08:30:13 malbec kernel: [  49298]   479 49298     4884     4523    90112        0             0 perl
Mar 16 08:30:13 malbec kernel: [  49304]   479 49304     1846     1429    63488        0             0 /usr/bin/isotov
Mar 16 08:30:13 malbec kernel: [  49312]   479 49312    43424     3329   133120        0             0 /usr/bin/isotov
Mar 16 08:30:13 malbec kernel: [  49331]   479 49331    51625    11504   200704        0             0 /usr/bin/isotov
Mar 16 08:30:13 malbec kernel: [  49350]   479 49350    21086      171    63488        0             0 videoencoder
Mar 16 08:30:13 malbec kernel: [  49366]   479 49366   610272   525752  4487168        0             0 qemu-system-ppc
Mar 16 08:30:13 malbec kernel: [  52958]   479 52958     1218     1026    57344        0             0 worker
Mar 16 08:30:13 malbec kernel: [  53324]   479 53324     2834     2484    73728        0             0 perl
Mar 16 08:30:13 malbec kernel: [  53334]   479 53334     1784     1411    65536        0             0 /usr/bin/isotov
Mar 16 08:30:13 malbec kernel: [  53574]   479 53574    42295     2241   124928        0             0 /usr/bin/isotov
Mar 16 08:30:13 malbec kernel: [  53575]   479 53575    47838     7658   169984        0             0 /usr/bin/isotov
Mar 16 08:30:13 malbec kernel: [  53612]   479 53612    21086      124    65536        0             0 videoencoder
Mar 16 08:30:13 malbec kernel: [  53628]   479 53628    93345    31529   468992        0             0 qemu-system-ppc
Mar 16 08:30:13 malbec kernel: [  55187]    51 55187      302       64    51200        0             0 pickup
Mar 16 08:30:13 malbec kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,16-17,global_oom,task_memcg=/,task=qemu-system->
Mar 16 08:30:13 malbec kernel: Out of memory: Killed process 49233 (qemu-system-ppc) total-vm:39447680kB, anon-rss:33653376kB, file-rss:2176kB, shmem-r
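
The process table in the excerpt above lists memory in pages, while the final "Killed process" line uses kB, so the two can be cross-checked. The following is a rough sketch, not part of the original analysis: it assumes the usual column order of the kernel's task dump (pid, uid, tgid, total_vm, rss, pgtables_bytes, swapents, oom_score_adj, name), since the header row is elided in the excerpt, and the helper name `parse` is purely illustrative.

```python
import re

# Sample lines copied from the journal excerpt (truncation kept as-is).
LOG = [
    "Mar 16 08:30:13 malbec kernel: [  49233]   479 49233   616370   525869  4550656        0             0 qemu-system-ppc",
    "Mar 16 08:30:13 malbec kernel: [  49366]   479 49366   610272   525752  4487168        0             0 qemu-system-ppc",
    "Mar 16 08:30:13 malbec kernel: [  53628]   479 53628    93345    31529   468992        0             0 qemu-system-ppc",
    "Mar 16 08:30:13 malbec kernel: Out of memory: Killed process 49233 (qemu-system-ppc) total-vm:39447680kB, anon-rss:33653376kB, file-rss:2176kB, shmem-r",
]

# Assumed column order: [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
ROW = re.compile(
    r"kernel: \[\s*(?P<pid>\d+)\]\s+\d+\s+\d+\s+(?P<total_vm>\d+)\s+(?P<rss>\d+)"
    r"\s+\d+\s+\d+\s+-?\d+\s+(?P<name>\S.*)$"
)
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) total-vm:(?P<total_vm_kb>\d+)kB"
)


def parse(lines):
    """Return (per-process rows keyed by pid, info about the killed process)."""
    rows, killed = {}, None
    for line in lines:
        m = ROW.search(line)
        if m:
            rows[int(m["pid"])] = {
                "name": m["name"].strip(),
                "total_vm_pages": int(m["total_vm"]),
                "rss_pages": int(m["rss"]),
            }
            continue
        m = KILLED.search(line)
        if m:
            killed = {"pid": int(m["pid"]), "total_vm_kb": int(m["total_vm_kb"])}
    return rows, killed


rows, killed = parse(LOG)
victim = rows[killed["pid"]]

# Infer the page size by comparing total-vm in kB with total_vm in pages.
page_kib = killed["total_vm_kb"] // victim["total_vm_pages"]
victim_rss_gib = victim["rss_pages"] * page_kib / 1024 ** 2

print(f"inferred page size: {page_kib} KiB")
print(f"killed pid {killed['pid']} ({victim['name']}): RSS {victim_rss_gib:.2f} GiB")
```

With these sample lines the two figures for the killed process agree on a 64 KiB page size, which puts its resident memory at about 32 GiB. The second qemu-system-ppc process (pid 49366) has a nearly identical RSS in pages.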

The corresponding openQA test is https://openqa.suse.de/tests/5674784, which unsurprisingly ended up incomplete with "Reason: backend died: QEMU exited unexpectedly, see log for details" and was labeled by auto-review with #71188.

Actions #5

Updated by okurz over 3 years ago

  • Copied to action #90974: Make it obvious if qemu gets terminated unexpectedly due to out-of-memory added
Actions #6

Updated by okurz over 3 years ago

  • Status changed from In Progress to Resolved

Created new feature request #90974 as follow-up. With this and the alert resolved we can resolve this ticket.
