action #30649: [tools][openqa] Improve performance by using migrations and external snapshots - openQA Project (public) - openSUSE Project Management Tool

#1

Updated by coolo about 7 years ago

As we have so much fun with the migrate command: https://progress.opensuse.org/issues/19390

#2

Updated by rpalethorpe about 7 years ago

Currently we use the migrate command to pipe an entire memory dump into bzip and then redirect the output to a file through the shell. Although this is the documented way of doing it, I don't think it is the most common way. Libvirt passes a file descriptor using SCM rights (with the option of doing an incremental dump, which is useful for snapshots), so if we do it the same way we may avoid some bottleneck in the exec code, although I don't see what could be wrong with the exec code in QEMU.

Libvirt also calls migrate_set_speed with the largest possible value before taking a snapshot. QEMU may automatically throttle migrations to prevent them from using all the bandwidth, so we should probably do the same when migrating to a file.
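
To make the idea concrete, here is a minimal sketch of the QMP messages involved in an unthrottled dump to a compressed file. It is not the os-autoinst implementation. It assumes the QMP command `migrate-set-parameters` with its `max-bandwidth` parameter as a stand-in for migrate_set_speed, and `bzip2 -c` as the compressor. The function name, the bandwidth constant and the dump path are hypothetical.

```python
import json
import shlex

# Placeholder for "the largest value possible"; the accepted upper bound
# depends on the QEMU version, so treat this number as an assumption.
ASSUMED_MAX_BANDWIDTH = 2**31 - 1


def build_file_migration_commands(dump_path, max_bandwidth=ASSUMED_MAX_BANDWIDTH):
    """Return QMP messages that lift the bandwidth cap, then dump to a file."""
    # Quote the path because QEMU hands the exec: command line to a shell.
    target = "exec:bzip2 -c > " + shlex.quote(dump_path)
    return [
        {"execute": "migrate-set-parameters",
         "arguments": {"max-bandwidth": max_bandwidth}},
        {"execute": "migrate", "arguments": {"uri": target}},
    ]


# Hypothetical dump path, for illustration only.
for message in build_file_migration_commands("/tmp/vm-memory.bz2"):
    print(json.dumps(message))
```

Sending the bandwidth message first matters, because a cap that is still in place when the migration starts would throttle the dump.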

#3

Updated by rpalethorpe about 7 years ago

Related to action #19390: [tools][sprint 201711.2] qemu "migrate" within testapi::save_memory_dump command never finishes within 2h added

#4

Updated by rpalethorpe about 7 years ago

OK, so migrating to and from a file works; however, it requires restarting QEMU. That probably means refactoring the start_qemu os-autoinst code quite a lot. Alternatively, it should be possible to patch QEMU to allow incoming migrations without a restart.

Before doing that it is probably a good idea to try making the memory dump reliable. If it suffers from the same problems as savevm (after applying all the fixes I can think of) then there are probably better things to be doing.
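
The restart requirement comes from the dump being loaded through QEMU's `-incoming` start-up option. A rough sketch of how a restart with a saved dump could be assembled, assuming the dump was compressed with bzip2; the helper name, base command line and path are hypothetical and do not reflect the actual start_qemu code:

```python
import shlex


def with_incoming_dump(base_args, dump_path):
    """Return a new argv that makes QEMU load dump_path at startup."""
    # -incoming exec:<command> reads the migration stream from the
    # command's standard output; bzip2 -dc decompresses to stdout.
    uri = "exec:bzip2 -dc " + shlex.quote(dump_path)
    return list(base_args) + ["-incoming", uri]


# Hypothetical base command line, for illustration only.
base = ["qemu-system-x86_64", "-m", "2048"]
print(with_incoming_dump(base, "/tmp/vm-memory.bz2"))
```

The restarted process has to be given the same machine and device configuration as the one that produced the dump, which is why the start-up code would need refactoring to be re-entrant.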

#6

Updated by rpalethorpe about 7 years ago

It should also be possible to enable performance monitoring during savevm and memory dump which might help reveal where the problem is.

#7

Updated by rpalethorpe about 7 years ago

Priority changed from Normal to High

#8

Updated by rpalethorpe about 7 years ago

On a semi-related note, it looks like snapshots are adding 100GB to the size of the image published by install_ltp:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4283

#9

Updated by rpalethorpe about 7 years ago

From ongoing discussions on the QEMU mailing list it appears various people are working on related problems and solutions for snapshots. I'm not sure I understand all of the problems and solutions, but I think that we should be able to patch QEMU so that it allows an incoming migration without being restarted. This might turn up some bugs or problems which were previously hidden in QEMU (maybe not a bad thing), but it is probably the least intrusive change. It seems to be worth further investigation. Some people appear to be working on solutions which look better than this, but they may take a long time.

#10

Updated by rpalethorpe almost 7 years ago

It appears that incoming migrations do work in a non-pristine QEMU with a few minor changes. The patch is fairly simple so far: https://github.com/richiejp/qemu/commit/788156d104079ef0deb9e48048a7995f1995dc22. However, I wonder what kind of edge cases there are and which run states should be considered a valid starting point.

#11

Updated by coolo almost 7 years ago

Well, find out what upstream tells you about the patch. In the past, this turned into a dead end often enough :(

#12

Updated by rpalethorpe almost 7 years ago

Status changed from New to Blocked

I don't think they will disagree in principle as I got the idea from them, but I have sent an RFC patch and will wait to see what happens. It will probably require a lot of testing, but learning about the QEMU testing framework may be useful regardless.

#13

Updated by rpalethorpe almost 7 years ago

Target version set to Milestone 15

#14

Updated by rpalethorpe almost 7 years ago

The mail archive link: http://lists.nongnu.org/archive/html/qemu-devel/2018-02/msg06782.html

#15

Updated by rpalethorpe almost 7 years ago

Status changed from Blocked to In Progress

So far from the discussion upstream there are two things which need to be investigated:

1) Deleting the current "active layer" and making a new overlay based on the backing node where the snapshot was taken
    a) Creating a new command which drops the current active layer
    b) Recreating the block devices instead of adding a new command

2) Checking whether devices are still reading or even writing to RAM

Note that (2) is also a problem with the current method.
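
As a rough illustration of the "new overlay based on the backing node" part of (1), the following sketch builds a `qemu-img create` command for a fresh qcow2 overlay on top of an existing image. It only covers creating the overlay file offline, not dropping the active layer inside a running QEMU. The function name and file names are hypothetical.

```python
def new_overlay_command(backing_file, overlay_file, fmt="qcow2"):
    """Return the argv for creating an empty overlay on top of backing_file."""
    return [
        "qemu-img", "create",
        "-f", fmt,            # format of the new overlay
        "-b", backing_file,   # image the overlay is based on
        "-F", fmt,            # format of the backing image
        overlay_file,
    ]


# Hypothetical file names, for illustration only.
print(" ".join(new_overlay_command("snapshot-base.qcow2", "active.qcow2")))
```

Writes made after the snapshot land only in the overlay, so discarding it and creating a new one returns the disk to the state at the snapshot point.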

#16

Updated by rpalethorpe almost 7 years ago

Loading a VM snapshot/migration into a QEMU instance which has already run a VM appears to work, but I am concerned about the following:

1) There is nothing stopping devices from reading or writing memory during the migration if they do not follow the memory or bus APIs.
2) Many devices hold state outside of guest RAM and it is down to the device to correctly load new state. It is not clear if all devices overwrite existing state correctly.
3) It is not clear that the vCPUs are all guaranteed to have stopped by the time the migration starts.

I have not found any instances of (1) or (2), but QEMU has a large number of complex devices, so auditing it would be difficult. It seems that 'loadvm' usually works, but we cannot be certain that all devices have a consistent state. If an error in the guest kernel is caused by a device with inconsistent state, it may be almost impossible to determine what caused the bug. So while I am sure QEMU can be patched to allow incoming migrations without '-incoming defer' and it would work most of the time (and be quite useful for some people), we really should restart QEMU before loading a VM.

#17

Updated by rpalethorpe almost 7 years ago

Status changed from In Progress to Workable

#18

Updated by rpalethorpe almost 7 years ago

Due date set to 2018-04-24
Start date changed from 2018-03-09 to 2018-04-24

due to changes in a related task

#19

Updated by rpalethorpe almost 7 years ago

Target version deleted (~~Milestone 15~~)

#20

Updated by rpalethorpe over 6 years ago

Status changed from Workable to In Progress

The good news is that after the deployment of the QEMU rewrite and various bug fixes for integration issues with the rest of the openQA framework, snapshots on ARM now appear to be reliable. The bad news is that they are still slow in comparison with x86 and ppc64le. The snapshot timeout of 240 seconds is too short for machines with 4GB of RAM, and frankly 240 seconds is way too long to begin with. For now, I think we should either disable snapshots on ARM machines with >2GB of RAM, or disable them for all machines except the virtio variant used mostly with console tests.

The reason for the slowness is not clear. Possibly ARM struggles with the compression, although it is very mild compression. Another possibility is a bottleneck in a bus or other transport. As far as QEMU and openQA are concerned, we are now on the happy path, so I doubt it is anything specific to openQA.
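
For a sense of scale, a back-of-the-envelope calculation shows the sustained rate a snapshot needs in order to finish within the timeout. It assumes, as a simplification, that the whole of guest RAM has to be written and ignores compression and any skipping of empty pages. The function name is hypothetical.

```python
def min_throughput_mib_per_s(ram_gib, timeout_s):
    """Rate needed to write ram_gib GiB of guest RAM within timeout_s seconds."""
    return ram_gib * 1024 / timeout_s


for ram_gib in (2, 4):
    rate = min_throughput_mib_per_s(ram_gib, 240)
    print(f"{ram_gib} GiB in 240 s needs at least {rate:.1f} MiB/s")
```

Under that assumption a 4GB machine that misses the 240 second timeout is writing its snapshot at less than roughly 17 MiB/s.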

#21

Updated by rpalethorpe over 6 years ago

Status changed from In Progress to Resolved
