action #19390: [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h - openQA Project - openSUSE Project Management Tool


action #19390

closed

[tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

Added by okurz about 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee: szarate
Category: Regressions/Crashes
Target version: QA - future
Start date: 2018-01-12
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

observation¶

https://openqa.opensuse.org/tests/409170/file/autoinst-log.txt shows

15:52:21.6105 9944 <<< testapi::save_memory_dump(filename='first_boot')
15:52:21.6106 9944 Trying to save machine state
15:52:21.6114 9945 Migrating the machine.
15:52:21.6125 9945 EVENT {"timestamp":
[...]
15:52:22.1338 9945 Migrating total bytes:       1082990592
15:52:22.1339 9945 Migrating remaining bytes:       1068404736
[...]
17:28:21.7226 9945 Migrating total bytes:       1082990592
17:28:21.7226 9945 Migrating remaining bytes:       17428480
17:28:22.2244 9945 Migrating total bytes:       1082990592
17:28:22.2245 9945 Migrating remaining bytes:       6242304
17:28:22.5250 9940 signalhandler got TERM - loop 1

So there is actual progress, but it is very slow.

problem¶

Nearly 2 h to save about 1 GB of memory is too long, right?
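To put a number on "very slow", here is a minimal sketch that takes the first and last "remaining bytes" samples from the log excerpt above and computes the average net progress. The constant and function names are made up for illustration. Because the VM may keep dirtying pages during a live migration, this is the net reduction of remaining data, not necessarily the raw transfer rate. For these two samples it works out to roughly 180 KiB/s.

```python
from datetime import datetime

# Two samples copied from the log excerpt: (timestamp, remaining bytes).
FIRST = ("15:52:22.1339", 1068404736)
LAST = ("17:28:22.2245", 6242304)


def seconds_between(start, end):
    """Elapsed seconds between two same-day HH:MM:SS.ffff timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def average_rate(first, last):
    """Average net reduction of 'remaining bytes' per second."""
    elapsed = seconds_between(first[0], last[0])
    return (first[1] - last[1]) / elapsed


if __name__ == "__main__":
    rate = average_rate(FIRST, LAST)
    print(f"elapsed: {seconds_between(FIRST[0], LAST[0]):.0f} s")
    print(f"average net progress: {rate / 1024:.0f} KiB/s")
```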

Subtasks 1 (0 open — 1 closed)

action #30267: [tools][Sprint 201711.2] Workaround memory dumps taking too long (Resolved, szarate, 2018-01-12)

Related issues 2 (0 open — 2 closed)

Related to openQA Project - action #30649: [tools][openqa] Improve performance by using migrations and external snapshots (Resolved, rpalethorpe, 2018-04-24)

Copied to openQA Tests - action #42683: [functional][u] Make save_memory_dump work again and re-enable save_memory_dump call in tests/installation/first_boot and other boot modules (Resolved, szarate, 2020-06-23)

History

Updated by okurz about 7 years ago

As this has happened more often recently, I disabled the call to save_memory_dump in our test code with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2986 . Please re-add it once this is done.


Updated by okurz about 7 years ago

Subject changed from qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h
Priority changed from Normal to High

This is blocking us from better investigating boot problems, e.g. see https://openqa.suse.de/tests/1054715#step/first_boot/3


Updated by coolo over 6 years ago

Target version set to Ready

I'm afraid I was never a believer in this feature. So if qemu can't dump within 2 hours, we have to get rid of this API call altogether.


Updated by szarate over 6 years ago

I have some love for this feature; however, qemu upstream has several bugs related to live migration. (I can't find the bug ID in the RH tracker.)

@okurz: I was never a fan of enabling the memory dumps for all tests :)


Updated by szarate over 6 years ago

Subject changed from [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

This is more about investigating and finding out whom to poke (most likely agraf) and which bug numbers are related to this. I doubt that extra work will be needed on the backend side of things.


Updated by szarate over 6 years ago

Status changed from New to In Progress

So I created a very small PR for this: https://github.com/os-autoinst/os-autoinst/pull/882 but it only covers for qemu's fault here.


Updated by szarate over 6 years ago

Assignee set to szarate


Updated by szarate over 6 years ago

Target version changed from Ready to Current Sprint

Carry over to sprint 201712.1, assigning to "Current Sprint".


Updated by szarate over 6 years ago

So, boo#1072000 and boo#1072008 say two things:

1: For now we can't timeout the memory dumps.
2: Memory dumps cannot be called at any point in time.

PR https://github.com/os-autoinst/os-autoinst/pull/892 has been created as a workaround for the problem caused by boo#1072008, but boo#1072000 needs a real fix.


#10

Updated by szarate over 6 years ago

Status changed from In Progress to Blocked
Target version changed from Current Sprint to future

The PR has been merged, so I am setting this to Blocked per boo#1072000 and boo#1072008.


#11

Updated by rpalethorpe over 6 years ago

Related to action #30649: [tools][openqa] Improve performance by using migrations and external snapshots added


#12

Updated by rpalethorpe over 6 years ago

As part of the other ticket I linked to, I have converted the migration to dump to an fd instead. Once I have applied some other fixes, I will make a PR.
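For readers unfamiliar with migrating to a file descriptor, here is a rough sketch of the QMP message sequence this typically involves. It only builds the JSON messages and does not talk to QEMU. The label `dumpfd` and both function names are hypothetical, and this is not taken from the os-autoinst change itself. `getfd`, `migrate` with an `fd:` URI, and `query-migrate` are standard QMP commands.

```python
import json


def build_dump_commands(fd_name="dumpfd"):
    """Return QMP commands for migrating to a passed-in file descriptor.

    fd_name is a hypothetical label. The descriptor itself must be sent with
    the getfd command as SCM_RIGHTS ancillary data on the QMP UNIX socket,
    which this sketch does not do.
    """
    return [
        {"execute": "qmp_capabilities"},  # leave capability negotiation mode
        {"execute": "getfd", "arguments": {"fdname": fd_name}},
        {"execute": "migrate", "arguments": {"uri": f"fd:{fd_name}"}},
        {"execute": "query-migrate"},  # poll this until the status is final
    ]


def encode_qmp(command):
    """Serialise one command the way it would be written to the QMP socket."""
    return json.dumps(command).encode("utf-8") + b"\n"


if __name__ == "__main__":
    for cmd in build_dump_commands():
        print(encode_qmp(cmd).decode().rstrip())
```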


#13

Updated by rpalethorpe over 6 years ago

The following pull request should help: https://github.com/os-autoinst/os-autoinst/pull/918. I have made a number of potential fixes, but I suspect the main issue was that the VM was still running. Doing a live migration probably requires tweaking the CPU throttling and other settings; otherwise QEMU will get stuck trying to reach the low water mark before freezing the VM.
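Why a running VM can stall the dump is easier to see with a toy model of iterative pre-copy. This is a simplified sketch, not QEMU's actual algorithm. Each round sends whatever is outstanding, while the guest dirties memory at a constant rate. The loop stops once the remainder could be sent within an allowed downtime, which stands in for the "low water mark" here. All parameter names and values are assumptions for illustration.

```python
def precopy_rounds(ram_bytes, bandwidth, dirty_rate, max_downtime, max_rounds=30):
    """Simulate pre-copy rounds; return remaining bytes after each round.

    bandwidth and dirty_rate are in bytes/s, max_downtime in seconds.
    Stops when the remainder fits into max_downtime at the given bandwidth
    (the point where the VM would be frozen for the final copy), or after
    max_rounds rounds if that point is never reached.
    """
    threshold = bandwidth * max_downtime
    remaining = float(ram_bytes)
    history = [remaining]
    while remaining > threshold and len(history) <= max_rounds:
        round_time = remaining / bandwidth  # time to send what is outstanding
        # Pages dirtied during this round have to be sent again next round.
        remaining = min(float(ram_bytes), dirty_rate * round_time)
        history.append(remaining)
    return history


if __name__ == "__main__":
    ram = 1 << 30  # 1 GiB guest
    for label, dirty in [("stopped VM", 0), ("light load", 50e6), ("busy VM", 100e6)]:
        rounds = precopy_rounds(ram, bandwidth=100e6, dirty_rate=dirty, max_downtime=0.3)
        print(f"{label}: {len(rounds) - 1} rounds, {rounds[-1]:.0f} bytes left")
```

With a dirty rate of zero (a stopped VM) the model finishes after a single round. Once the dirty rate reaches the bandwidth, it never converges.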

I did some profiling of a full test run using the new memory dumper:
CPU cycles:
80% used by xz
11% used by qemu
6% used by OpenQA

Page faults:
40-60% caused by OpenQA, at least 20% of which came from libopencv
20% caused by xz
10% caused by qemu

Disk I/O:
Swapper wrote 600-800MB
xz wrote 11-12MB
qemu wrote 4-5MB

In all cases QEMU is not using many resources. xz uses quite a lot, but this is expected and on CPU limited workers it can be replaced by bzip2. Using internal QEMU migration compression would probably save some disk I/O, but it makes the dumps difficult to read and xz/bzip2 archives are 10-40% smaller. QEMU can also migrate to a socket, so another option would be to read from a socket and compress the data using a library, then write the result to disk. That would be best done in C/C++ because Perl's library support is not so good. Alternatively we could go back to using the exec URI in migrate, but it is not clear if QEMU will report errors from bzip and sh accurately and there are too many pipes involved for my liking. I'm guessing that most of the disk I/O is attributed to swapper because the file system is deferring or combining writes which confuses the reporting.
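The socket option mentioned above could look roughly like the following. This is a sketch only: it reads a byte stream from an already connected socket, compresses it incrementally with the standard-library `lzma` module, and writes an `.xz` file. The function name `compress_stream` is hypothetical, the demo feeds the socket from a thread instead of from QEMU, and whether Python would be fast enough for real dumps is untested here.

```python
import lzma
import os
import socket
import tempfile
import threading


def compress_stream(sock, out_path, chunk_size=65536):
    """Read sock until EOF, xz-compress the data, write it to out_path.

    Returns the number of uncompressed bytes received.
    """
    compressor = lzma.LZMACompressor()  # default container format is .xz
    total = 0
    with open(out_path, "wb") as out:
        while True:
            chunk = sock.recv(chunk_size)
            if not chunk:  # the peer closed the connection
                break
            total += len(chunk)
            out.write(compressor.compress(chunk))
        out.write(compressor.flush())
    return total


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    payload = b"\x00" * 1_000_000  # stand-in for a migration stream

    def feed():
        sender.sendall(payload)
        sender.close()

    thread = threading.Thread(target=feed)
    thread.start()
    path = os.path.join(tempfile.mkdtemp(), "dump.xz")
    received = compress_stream(receiver, path)
    thread.join()
    receiver.close()
    print(received, "bytes in,", os.path.getsize(path), "bytes on disk")
```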

Interestingly opencv related code is page faulting a lot even though needles were only used during boot and the rest of the test was using serial terminal.

The profiling was done with:
$ sudo perf record -e cycles,faults sudo -u _openqa-worker --preserve-env=QEMU ~/qa/openQA/script/worker --isotovideo ~/qa/os-autoinst/isotovideo
$ perf report --hierarchy

and

$ sudo blktrace -d /dev/sdX
$ blkparse -shi sdX

Where sdX is the drive where /var/lib/openqa/pool is mounted.

It would be nice to attach 'perf record' to QEMU every time it does a snapshot, memory dump, etc., capture full stack traces, and then detach straight after. This could be coded into os-autoinst, but it might be better to set up some monitoring software which can insert a probe to achieve the same thing without hard-coding it.
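As a rough illustration of the "attach, then detach" idea, the sketch below wraps `perf record -p PID` in a context manager. It is not os-autoinst code. The function names are hypothetical, the QEMU PID has to be supplied by the caller, and perf needs sufficient privileges to attach to the process.

```python
import signal
import subprocess
from contextlib import contextmanager


def perf_record_argv(pid, output):
    """Build a perf command that samples an existing process with call graphs."""
    return ["perf", "record", "-g", "-e", "cycles,faults",
            "-p", str(pid), "-o", output]


@contextmanager
def perf_attached(pid, output="perf.data"):
    """Run perf against pid for the duration of the with-block."""
    proc = subprocess.Popen(perf_record_argv(pid, output))
    try:
        yield proc
    finally:
        # perf writes out its data file and exits when it receives SIGINT.
        proc.send_signal(signal.SIGINT)
        proc.wait()


if __name__ == "__main__":
    # Only prints the command; a real use would be:
    #   with perf_attached(qemu_pid, "dump.perf.data"):
    #       ...  # trigger the memory dump here
    print(" ".join(perf_record_argv(12345, "dump.perf.data")))
```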


#14

Updated by rpalethorpe over 6 years ago

Pull request is now merged. When os-autoinst is deployed we need to look out for compilation failures (because I touched the C part) and increased CPU and drive usage.


#15

Updated by rpalethorpe over 6 years ago

Status changed from Blocked to Resolved

I hope this is fixed now. If anyone sees another failure, please reopen and let me know.


#16

Updated by okurz about 6 years ago

Target version changed from future to future


#17

Updated by okurz over 5 years ago

Copied to action #42683: [functional][u] Make save_memory_dump work again and re-enable save_memory_dump call in tests/installation/first_boot and other boot modules added
