Project

General

Profile

action #98841

qemu randomly fails to start on QA-Power8-5-kvm auto_review:"(?s)QA-Power8.*Failed to allocate KVM HPT of order 25 (try smaller maxmem?): Cannot allocate memory":retry size:M

Added by MDoucha 29 days ago. Updated 3 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2021-09-17
Due date:
2021-10-28
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

A few LTP jobs have failed to start today due to qemu error on QA-Power8-5-kvm worker:
https://openqa.suse.de/tests/7138972
https://openqa.suse.de/tests/7149857
https://openqa.suse.de/tests/7153989

All of them have the following output in autoinst-log.txt:

[2021-09-17T16:24:29.803 CEST] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  QEMU terminated before QMP connection could be established. Check for errors below
[2021-09-17T16:24:29.804 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-09-17T16:24:29.805 CEST] [debug] Passing remaining frames to the video encoder
[2021-09-17T16:24:29.805 CEST] [debug] Waiting for video encoder to finalize the video
[2021-09-17T16:24:29.805 CEST] [debug] The built-in video encoder (pid 110385) terminated
[2021-09-17T16:24:29.807 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-09-17T16:24:29.807 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-09-17T16:24:29.807 CEST] [warn] !!! : qemu-system-ppc64: Failed to allocate KVM HPT of order 25 (try smaller maxmem?): Cannot allocate memory

Problem

QA-Power8-5-kvm has 256GB RAM. https://monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?viewPanel=12054&orgId=1&from=1631765162464&to=1632085860553 shows that some memory was used during the period when the test failed but nothing that should explain the inability to allocate the memory for the qemu VM. In the system journal there is

Sep 17 16:24:28 QA-Power8-5-kvm worker[88148]: [debug] [pid:88148] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7125263/status
Sep 17 16:24:29 QA-Power8-5-kvm worker[104911]: [info] [pid:108741] sle-15-SP4-ppc64le-Build36.1-HA-BV.qcow2: Processing chunk 501/5812, avg. speed ~976.562 KiB/s
Sep 17 16:24:29 QA-Power8-5-kvm worker[101413]: [debug] [pid:102598] Uploading artefact mq_timedreceive_15-1-2.txt
Sep 17 16:24:29 QA-Power8-5-kvm worker[96737]: [debug] [pid:96737] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7125265/status
Sep 17 16:24:29 QA-Power8-5-kvm worker[109458]: [debug] [pid:110336] Uploading artefact bootloader_start-15.txt
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: 23 callbacks suppressed
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf00, 3cb100) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf04, 3cb104) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf08, 3cb108) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf0c, 3cb10c) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf10, 3cb110) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf14, 3cb114) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf18, 3cb118) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf1c, 3cb11c) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf20, 3cb120) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf24, 3cb124) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -16
Sep 17 16:24:30 QA-Power8-5-kvm worker[88148]: [debug] [pid:88148] Upload concluded (at wait_children)
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [info] [pid:109557] Isotovideo exit status: 1
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [debug] [pid:109557] Stopping job 7153989 from openqa.suse.de: 07153989-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-ppc64le-Build5.3.18-302.1.g316993b-ltp_crashme@ppc64le-virtio - reason: died
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [debug] [pid:109557] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7153989/status
Sep 17 16:24:30 QA-Power8-5-kvm worker[101413]: [debug] [pid:102598] Uploading artefact mq_timedreceive_7-1-2.txt

in particular the messages

Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf20, 3cb120) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf24, 3cb124) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -16

so an allocation failure. We could report a bug about this but because KVM on SUSE with Power8 is unsupported so I don't expect any success.

We likely need to accept such issues and trigger a restart automatically by openQA.

Acceptance criteria

  • AC1: qemu ppc64le allocation errors cause automatic job retriggers by openQA

Suggestions

History

#1 Updated by okurz 27 days ago

  • Target version set to Ready

#2 Updated by okurz 26 days ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from qemu randomly fails to start on QA-Power8-5-kvm to qemu randomly fails to start on QA-Power8-5-kvm auto_review:"(?s)QA-Power8.*Failed to allocate KVM HPT of order 25 (try smaller maxmem?): Cannot allocate memory":retry
  • Description updated (diff)
  • Category set to Feature requests

QA-Power8-5-kvm has 256GB RAM. https://monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?viewPanel=12054&orgId=1&from=1631765162464&to=1632085860553 shows that some memory was used during the period when the test failed but nothing that should explain the inability to allocate the memory for the qemu VM. In the system journal there is

Sep 17 16:24:28 QA-Power8-5-kvm worker[88148]: [debug] [pid:88148] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7125263/status
Sep 17 16:24:29 QA-Power8-5-kvm worker[104911]: [info] [pid:108741] sle-15-SP4-ppc64le-Build36.1-HA-BV.qcow2: Processing chunk 501/5812, avg. speed ~976.562 KiB/s
Sep 17 16:24:29 QA-Power8-5-kvm worker[101413]: [debug] [pid:102598] Uploading artefact mq_timedreceive_15-1-2.txt
Sep 17 16:24:29 QA-Power8-5-kvm worker[96737]: [debug] [pid:96737] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7125265/status
Sep 17 16:24:29 QA-Power8-5-kvm worker[109458]: [debug] [pid:110336] Uploading artefact bootloader_start-15.txt
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: 23 callbacks suppressed
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf00, 3cb100) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf04, 3cb104) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf08, 3cb108) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf0c, 3cb10c) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf10, 3cb110) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf14, 3cb114) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf18, 3cb118) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf1c, 3cb11c) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf20, 3cb120) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf24, 3cb124) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -16
Sep 17 16:24:30 QA-Power8-5-kvm worker[88148]: [debug] [pid:88148] Upload concluded (at wait_children)
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [info] [pid:109557] Isotovideo exit status: 1
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [debug] [pid:109557] Stopping job 7153989 from openqa.suse.de: 07153989-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-ppc64le-Build5.3.18-302.1.g316993b-ltp_crashme@ppc64le-virtio - reason: died
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [debug] [pid:109557] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7153989/status
Sep 17 16:24:30 QA-Power8-5-kvm worker[101413]: [debug] [pid:102598] Uploading artefact mq_timedreceive_7-1-2.txt

in particular the messages

Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf20, 3cb120) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf24, 3cb124) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -16

so an allocation failure. We could report a bug about this but because KVM on SUSE with Power8 is unsupported so I don't expect any success.

We likely need to accept such issues and trigger a restart automatically by openQA.

#3 Updated by cdywan 10 days ago

  • Subject changed from qemu randomly fails to start on QA-Power8-5-kvm auto_review:"(?s)QA-Power8.*Failed to allocate KVM HPT of order 25 (try smaller maxmem?): Cannot allocate memory":retry to qemu randomly fails to start on QA-Power8-5-kvm auto_review:"(?s)QA-Power8.*Failed to allocate KVM HPT of order 25 (try smaller maxmem?): Cannot allocate memory":retry size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by Xiaojing_liu 6 days ago

  • Assignee set to Xiaojing_liu

#5 Updated by cdywan 6 days ago

Discussed briefly in the daily. The job is already Incomplete, so we only need to handle the restart

#6 Updated by Xiaojing_liu 4 days ago

  • Status changed from Workable to In Progress

#7 Updated by openqa_review 3 days ago

  • Due date set to 2021-10-28

Setting due date based on mean cycle time of SUSE QE Tools

Also available in: Atom PDF