action #19390: [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #19390

closed

[tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

Added by okurz about 8 years ago. Updated almost 7 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

szarate

Category:

Regressions/Crashes

Target version:

QA (public) - future

Start date:

2018-01-12

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Description

observation¶

https://openqa.opensuse.org/tests/409170/file/autoinst-log.txt shows

15:52:21.6105 9944 <<< testapi::save_memory_dump(filename='first_boot')
15:52:21.6106 9944 Trying to save machine state
15:52:21.6114 9945 Migrating the machine.
15:52:21.6125 9945 EVENT {"timestamp":
[...]
15:52:22.1338 9945 Migrating total bytes:     	1082990592
15:52:22.1339 9945 Migrating remaining bytes:   	1068404736
[...]
17:28:21.7226 9945 Migrating total bytes:     	1082990592
17:28:21.7226 9945 Migrating remaining bytes:   	17428480
17:28:22.2244 9945 Migrating total bytes:     	1082990592
17:28:22.2245 9945 Migrating remaining bytes:   	6242304
17:28:22.5250 9940 signalhandler got TERM - loop 1

so there is actually progress but veeery slow

problem¶

nearly 2h for saving 1G is too long, right?

Subtasks 1 (0 open — 1 closed)

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by okurz about 8 years ago

As this happened more often recently I disabled the call to save_memory_dump in our test code with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2986 . Please readd when done here.

Actions

Copy link

Updated by okurz almost 8 years ago

Subject changed from qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h
Priority changed from Normal to High

blocking us from better investigation in bootup problems, e.g. see https://openqa.suse.de/tests/1054715#step/first_boot/3

Actions

Copy link

Updated by coolo over 7 years ago

Target version set to Ready

I'm afraid I was never a believer in this feature. So if qemu can't dump within 2 hours we have to get rid of this api call alltogether

Actions

Copy link

Updated by szarate over 7 years ago

I have some love for this feature, however qemu upstream has several bugs related to live migration. (Can't find the bug id in RH)

@okurz: I was never a fan of enabling the memory dumps for all tests :)

Actions

Copy link

Updated by szarate over 7 years ago

Subject changed from [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

This is more about investigating and finding out who to poke, which will be most likely agraf and finding out which are the bug # related to this. I doubt that extra work will need to be done on the backend side of things.

Actions

Copy link

Updated by szarate over 7 years ago

Status changed from New to In Progress

So i created a very small PR with this: https://github.com/os-autoinst/os-autoinst/pull/882 but it's only covering for qemu's fault here.

Actions

Copy link

Updated by szarate over 7 years ago

Assignee set to szarate

Actions

Copy link

Updated by szarate over 7 years ago

Target version changed from Ready to Current Sprint

Carry over to sprint 201712.1, assigning to "Current Sprint".

Actions

Copy link

Updated by szarate over 7 years ago

So, boo#1072000 and boo#1072008 say two things:

1: For now we can't timeout the memory dumps.
2: Memory dumps cannot be called at any point in time.

PR: https://github.com/os-autoinst/os-autoinst/pull/892 has been created to solve the problem generated by boo#1072008 as a workaround but boo#1072000 needs a real fix.

Actions

Copy link

#10

Updated by szarate over 7 years ago

Status changed from In Progress to Blocked
Target version changed from Current Sprint to future

So PR has been merged setting this to Blocked per boo#1072000 and boo#1072008

Actions

Copy link

#11

Updated by rpalethorpe over 7 years ago

Related to action #30649: [tools][openqa] Improve performance by using migrations and external snapshots added

Actions

Copy link

#12

Updated by rpalethorpe over 7 years ago

As part of the other ticket I have linked to, I have converted the migration to dump to an fd instead. Once I have applied some other fixes then I will make a PR.

Actions

Copy link

#13

Updated by rpalethorpe over 7 years ago

The following pull request should help: https://github.com/os-autoinst/os-autoinst/pull/918. I have made a number of potential fixes, but I suspect the main issue was that the VM was still running. Doing a live migration probably requires tweaking the CPU throttling and other settings otherwise QEMU will get stuck trying to reach the low water mark before freezing the VM.

I did some profiling of a full test run using the new memory dumper:
CPU cycles:
80% used by xz
11% used by qemu
6% used by OpenQA

Page faults:
40-60% caused by OpenQA, at least 20% of which came from libopencv
20% caused by xz
10% caused by qemu

Disk I/O:
Swapper wrote 600-800MB
xz wrote 11-12MB
qemu wrote 4-5MB

In all cases QEMU is not using many resources. xz uses quite a lot, but this is expected and on CPU limited workers it can be replaced by bzip2. Using internal QEMU migration compression would probably save some disk I/O, but it makes the dumps difficult to read and xz/bzip2 archives are 10-40% smaller. QEMU can also migrate to a socket, so another option would be to read from a socket and compress the data using a library, then write the result to disk. That would be best done in C/C++ because Perl's library support is not so good. Alternatively we could go back to using the exec URI in migrate, but it is not clear if QEMU will report errors from bzip and sh accurately and there are too many pipes involved for my liking. I'm guessing that most of the disk I/O is attributed to swapper because the file system is deferring or combining writes which confuses the reporting.

Interestingly opencv related code is page faulting a lot even though needles were only used during boot and the rest of the test was using serial terminal.

The profiling was done with:
$ sudo perf record -e cycles,faults sudo -u _openqa-worker --preserve-env=QEMU ~/qa/openQA/script/worker --isotovideo ~/qa/os-autoinst/isotovideo
$ perf report --hierarchy

and

$ sudo blktrace -d /dev/sdX
$ blkparse -shi sdX

Where sdX is the drive where /var/lib/openqa/pool is mounted.

It would be nice to attach 'perf record' to QEMU every time it does a snapshot, memory dump, etc. with full stack trace and whatever then detach straight after. This could be coded into os-autoinst, but it might be better to set up some monitoring software which can insert a probe to achieve the same thing without hard coding it.

Actions

Copy link

#14

Updated by rpalethorpe over 7 years ago

Pull request is now merged. When os-autoinst is deployed we need to look out for compilation failures (because I touched the C part) and increased CPU and drive usage.

Actions

Copy link

#15

Updated by rpalethorpe about 7 years ago

Status changed from Blocked to Resolved

I hope this is fixed now, if anyone has seen another failure please reopen and let me know.

Actions

Copy link

#16

Updated by okurz almost 7 years ago

Target version changed from future to future

Actions

Copy link

#17

Updated by okurz over 6 years ago

Copied to action #42683: [functional][u] Make save_memory_dump work again and re-enable save_memory_dump call in tests/installation/first_boot and other boot modules added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #19390

[tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

observation¶

problem¶

Updated by okurz about 8 years ago

Updated by okurz almost 8 years ago

Updated by coolo over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by szarate over 7 years ago

Updated by rpalethorpe over 7 years ago

Updated by rpalethorpe over 7 years ago

Updated by rpalethorpe over 7 years ago

Updated by rpalethorpe over 7 years ago

Updated by rpalethorpe about 7 years ago

Updated by okurz almost 7 years ago

Updated by okurz over 6 years ago