action #19390
closed[tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h
100%
Description
observation¶
https://openqa.opensuse.org/tests/409170/file/autoinst-log.txt shows
15:52:21.6105 9944 <<< testapi::save_memory_dump(filename='first_boot')
15:52:21.6106 9944 Trying to save machine state
15:52:21.6114 9945 Migrating the machine.
15:52:21.6125 9945 EVENT {"timestamp":
[...]
15:52:22.1338 9945 Migrating total bytes: 1082990592
15:52:22.1339 9945 Migrating remaining bytes: 1068404736
[...]
17:28:21.7226 9945 Migrating total bytes: 1082990592
17:28:21.7226 9945 Migrating remaining bytes: 17428480
17:28:22.2244 9945 Migrating total bytes: 1082990592
17:28:22.2245 9945 Migrating remaining bytes: 6242304
17:28:22.5250 9940 signalhandler got TERM - loop 1
so there is actually progress but veeery slow
problem¶
nearly 2h for saving 1G is too long, right?
Updated by okurz over 7 years ago
As this happened more often recently I disabled the call to save_memory_dump in our test code with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2986 . Please readd when done here.
Updated by okurz over 7 years ago
- Subject changed from qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h
- Priority changed from Normal to High
blocking us from better investigation in bootup problems, e.g. see https://openqa.suse.de/tests/1054715#step/first_boot/3
Updated by coolo about 7 years ago
- Target version set to Ready
I'm afraid I was never a believer in this feature. So if qemu can't dump within 2 hours we have to get rid of this api call alltogether
Updated by szarate about 7 years ago
I have some love for this feature, however qemu upstream has several bugs related to live migration. (Can't find the bug id in RH)
@okurz: I was never a fan of enabling the memory dumps for all tests :)
Updated by szarate about 7 years ago
- Subject changed from [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h
This is more about investigating and finding out who to poke, which will be most likely agraf and finding out which are the bug # related to this. I doubt that extra work will need to be done on the backend side of things.
Updated by szarate about 7 years ago
- Status changed from New to In Progress
So i created a very small PR with this: https://github.com/os-autoinst/os-autoinst/pull/882 but it's only covering for qemu's fault here.
Updated by szarate about 7 years ago
- Target version changed from Ready to Current Sprint
Carry over to sprint 201712.1, assigning to "Current Sprint".
Updated by szarate almost 7 years ago
So, boo#1072000 and boo#1072008 say two things:
1: For now we can't timeout the memory dumps.
2: Memory dumps cannot be called at any point in time.
PR: https://github.com/os-autoinst/os-autoinst/pull/892 has been created to solve the problem generated by boo#1072008 as a workaround but boo#1072000 needs a real fix.
Updated by szarate almost 7 years ago
- Status changed from In Progress to Blocked
- Target version changed from Current Sprint to future
So PR has been merged setting this to Blocked per boo#1072000 and boo#1072008
Updated by rpalethorpe almost 7 years ago
- Related to action #30649: [tools][openqa] Improve performance by using migrations and external snapshots added
Updated by rpalethorpe almost 7 years ago
As part of the other ticket I have linked to, I have converted the migration to dump to an fd instead. Once I have applied some other fixes then I will make a PR.
Updated by rpalethorpe almost 7 years ago
The following pull request should help: https://github.com/os-autoinst/os-autoinst/pull/918. I have made a number of potential fixes, but I suspect the main issue was that the VM was still running. Doing a live migration probably requires tweaking the CPU throttling and other settings otherwise QEMU will get stuck trying to reach the low water mark before freezing the VM.
I did some profiling of a full test run using the new memory dumper:
CPU cycles:
80% used by xz
11% used by qemu
6% used by OpenQA
Page faults:
40-60% caused by OpenQA, at least 20% of which came from libopencv
20% caused by xz
10% caused by qemu
Disk I/O:
Swapper wrote 600-800MB
xz wrote 11-12MB
qemu wrote 4-5MB
In all cases QEMU is not using many resources. xz uses quite a lot, but this is expected and on CPU limited workers it can be replaced by bzip2. Using internal QEMU migration compression would probably save some disk I/O, but it makes the dumps difficult to read and xz/bzip2 archives are 10-40% smaller. QEMU can also migrate to a socket, so another option would be to read from a socket and compress the data using a library, then write the result to disk. That would be best done in C/C++ because Perl's library support is not so good. Alternatively we could go back to using the exec URI in migrate, but it is not clear if QEMU will report errors from bzip and sh accurately and there are too many pipes involved for my liking. I'm guessing that most of the disk I/O is attributed to swapper because the file system is deferring or combining writes which confuses the reporting.
Interestingly opencv related code is page faulting a lot even though needles were only used during boot and the rest of the test was using serial terminal.
The profiling was done with:
$ sudo perf record -e cycles,faults sudo -u _openqa-worker --preserve-env=QEMU ~/qa/openQA/script/worker --isotovideo ~/qa/os-autoinst/isotovideo
$ perf report --hierarchy
and
$ sudo blktrace -d /dev/sdX
$ blkparse -shi sdX
Where sdX is the drive where /var/lib/openqa/pool is mounted.
It would be nice to attach 'perf record' to QEMU every time it does a snapshot, memory dump, etc. with full stack trace and whatever then detach straight after. This could be coded into os-autoinst, but it might be better to set up some monitoring software which can insert a probe to achieve the same thing without hard coding it.
Updated by rpalethorpe almost 7 years ago
Pull request is now merged. When os-autoinst is deployed we need to look out for compilation failures (because I touched the C part) and increased CPU and drive usage.
Updated by rpalethorpe over 6 years ago
- Status changed from Blocked to Resolved
I hope this is fixed now, if anyone has seen another failure please reopen and let me know.
Updated by okurz about 6 years ago
- Copied to action #42683: [functional][u] Make save_memory_dump work again and re-enable save_memory_dump call in tests/installation/first_boot and other boot modules added