action #19390

[tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

Added by okurz over 2 years ago. Updated over 1 year ago.

Status:ResolvedStart date:12/01/2018
Priority:NormalDue date:
Assignee:szarate% Done:

100%

Category:Concrete Bugs
Target version:QA - future
Difficulty:
Duration:

Description

observation

https://openqa.opensuse.org/tests/409170/file/autoinst-log.txt shows

15:52:21.6105 9944 <<< testapi::save_memory_dump(filename='first_boot')
15:52:21.6106 9944 Trying to save machine state
15:52:21.6114 9945 Migrating the machine.
15:52:21.6125 9945 EVENT {"timestamp":
[...]
15:52:22.1338 9945 Migrating total bytes:       1082990592
15:52:22.1339 9945 Migrating remaining bytes:       1068404736
[...]
17:28:21.7226 9945 Migrating total bytes:       1082990592
17:28:21.7226 9945 Migrating remaining bytes:       17428480
17:28:22.2244 9945 Migrating total bytes:       1082990592
17:28:22.2245 9945 Migrating remaining bytes:       6242304
17:28:22.5250 9940 signalhandler got TERM - loop 1

so there is actually progress but veeery slow

problem

nearly 2h for saving 1G is too long, right?


Subtasks

action #30267: [tools][Sprint 201711.2] Workaround memory dumps taking t...Resolvedszarate


Related issues

Related to openQA Project - action #30649: [tools][openqa] Improve performance by using migrations a... Resolved 24/04/2018
Copied to openQA Tests - action #42683: [functional][u] Make save_memory_dump work again and re-e... Feedback 12/01/2018

History

#1 Updated by okurz over 2 years ago

As this happened more often recently I disabled the call to save_memory_dump in our test code with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/2986 . Please readd when done here.

#2 Updated by okurz over 2 years ago

  • Subject changed from qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h
  • Priority changed from Normal to High

blocking us from better investigation in bootup problems, e.g. see https://openqa.suse.de/tests/1054715#step/first_boot/3

#3 Updated by coolo over 2 years ago

  • Target version set to Ready

I'm afraid I was never a believer in this feature. So if qemu can't dump within 2 hours we have to get rid of this api call alltogether

#4 Updated by szarate over 2 years ago

I have some love for this feature, however qemu upstream has several bugs related to live migration. (Can't find the bug id in RH)

@okurz: I was never a fan of enabling the memory dumps for all tests :)

#5 Updated by szarate over 2 years ago

  • Subject changed from [tools]qemu "migrate" within testapi::save_memory_dump command never finishes within 2h to [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h

This is more about investigating and finding out who to poke, which will be most likely agraf and finding out which are the bug # related to this. I doubt that extra work will need to be done on the backend side of things.

#6 Updated by szarate over 2 years ago

  • Status changed from New to In Progress

So i created a very small PR with this: https://github.com/os-autoinst/os-autoinst/pull/882 but it's only covering for qemu's fault here.

#7 Updated by szarate about 2 years ago

  • Assignee set to szarate

#8 Updated by szarate about 2 years ago

  • Target version changed from Ready to Current Sprint

Carry over to sprint 201712.1, assigning to "Current Sprint".

#9 Updated by szarate about 2 years ago

So, boo#1072000 and boo#1072008 say two things:

1: For now we can't timeout the memory dumps.
2: Memory dumps cannot be called at any point in time.

PR: https://github.com/os-autoinst/os-autoinst/pull/892 has been created to solve the problem generated by boo#1072008 as a workaround but boo#1072000 needs a real fix.

#10 Updated by szarate about 2 years ago

  • Status changed from In Progress to Blocked
  • Target version changed from Current Sprint to future

So PR has been merged setting this to Blocked per boo#1072000 and boo#1072008

#11 Updated by rpalethorpe about 2 years ago

  • Related to action #30649: [tools][openqa] Improve performance by using migrations and external snapshots added

#12 Updated by rpalethorpe about 2 years ago

As part of the other ticket I have linked to, I have converted the migration to dump to an fd instead. Once I have applied some other fixes then I will make a PR.

#13 Updated by rpalethorpe about 2 years ago

The following pull request should help: https://github.com/os-autoinst/os-autoinst/pull/918. I have made a number of potential fixes, but I suspect the main issue was that the VM was still running. Doing a live migration probably requires tweaking the CPU throttling and other settings otherwise QEMU will get stuck trying to reach the low water mark before freezing the VM.

I did some profiling of a full test run using the new memory dumper:
CPU cycles:
80% used by xz
11% used by qemu
6% used by OpenQA

Page faults:
40-60% caused by OpenQA, at least 20% of which came from libopencv
20% caused by xz
10% caused by qemu

Disk I/O:
Swapper wrote 600-800MB
xz wrote 11-12MB
qemu wrote 4-5MB

In all cases QEMU is not using many resources. xz uses quite a lot, but this is expected and on CPU limited workers it can be replaced by bzip2. Using internal QEMU migration compression would probably save some disk I/O, but it makes the dumps difficult to read and xz/bzip2 archives are 10-40% smaller. QEMU can also migrate to a socket, so another option would be to read from a socket and compress the data using a library, then write the result to disk. That would be best done in C/C++ because Perl's library support is not so good. Alternatively we could go back to using the exec URI in migrate, but it is not clear if QEMU will report errors from bzip and sh accurately and there are too many pipes involved for my liking. I'm guessing that most of the disk I/O is attributed to swapper because the file system is deferring or combining writes which confuses the reporting.

Interestingly opencv related code is page faulting a lot even though needles were only used during boot and the rest of the test was using serial terminal.

The profiling was done with:
$ sudo perf record -e cycles,faults sudo -u _openqa-worker --preserve-env=QEMU ~/qa/openQA/script/worker --isotovideo ~/qa/os-autoinst/isotovideo
$ perf report --hierarchy

and

$ sudo blktrace -d /dev/sdX
$ blkparse -shi sdX

Where sdX is the drive where /var/lib/openqa/pool is mounted.

It would be nice to attach 'perf record' to QEMU every time it does a snapshot, memory dump, etc. with full stack trace and whatever then detach straight after. This could be coded into os-autoinst, but it might be better to set up some monitoring software which can insert a probe to achieve the same thing without hard coding it.

#14 Updated by rpalethorpe about 2 years ago

Pull request is now merged. When os-autoinst is deployed we need to look out for compilation failures (because I touched the C part) and increased CPU and drive usage.

#15 Updated by rpalethorpe almost 2 years ago

  • Status changed from Blocked to Resolved

I hope this is fixed now, if anyone has seen another failure please reopen and let me know.

#16 Updated by okurz over 1 year ago

  • Target version changed from future to future

#17 Updated by okurz over 1 year ago

  • Copied to action #42683: [functional][u] Make save_memory_dump work again and re-enable save_memory_dump call in tests/installation/first_boot and other boot modules added

Also available in: Atom PDF