action #38813

Qemu backend rewrite fallout

Added by szarate over 1 year ago. Updated over 1 year ago.

Status: Resolved
Start date: 25/07/2018
Priority: Urgent
Due date:
Assignee: rpalethorpe
% Done: 0%
Category: -
Target version: Done
Difficulty:
Duration:

Description

Since we have deployed the new version of the QEMU backend, we expect some fallout on tests that use this backend. I'm creating this ticket to make it easy for reviewers (and us) to tag affected jobs.


Related issues

Related to openQA Tests - action #38840: BIOS variable still present in test suites for aarch64 Resolved 25/07/2018
Related to openQA Project - action #38849: [tools] openqaworker - DIE can't open qmp Resolved 26/07/2018
Related to openQA Tests - action #39002: test fails in bootloader Resolved 02/08/2018
Related to openQA Project - action #39035: PFLASH files handling Resolved 01/08/2018
Related to openQA Tests - action #38996: [sle][migration][sle12sp4] test fails in boot_to_desktop ... Resolved 01/08/2018
Related to openQA Tests - action #38963: [functional][y][fast] qemu backend rewrite: upgrade not p... Resolved 09/11/2018
Related to openQA Project - action #32968: [kernel][tools] Refactor QEMU backend - Create QEMU proce... Resolved 24/04/2018
Related to openQA Project - action #39101: Publishing of assets failed when extracting pflash-vars q... Resolved 02/08/2018
Copied to openQA Project - action #38822: Qemu: Could not open backing file: Cannot reference an ex... Resolved 25/07/2018

History

#1 Updated by szarate over 1 year ago

  • Subject changed from qemu rewrite fallout to Qemu backend rewrite/refactor fallout

#2 Updated by rpalethorpe over 1 year ago

https://openqa.suse.de/tests/1857912# (ipa_ec2)

Looks like the milestone console variable is not set, possibly because the post fail hook dies.

#3 Updated by szarate over 1 year ago

There's https://openqa.suse.de/tests/1857909#step/keymap_or_locale/12 that tried to load the snapshot and somehow failed afterwards:

[2018-07-25T10:28:57.0893 CEST] [debug] QEMU: qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay1,file=hd0-overlay1-file,cache.no-flush=on,backing=hd0: Could not open backing file: Cannot reference an existing block device with additional options or a new filename

#4 Updated by rpalethorpe over 1 year ago

szarate wrote:

There's https://openqa.suse.de/tests/1857909#step/keymap_or_locale/12 that tried to load the snapshot and somehow failed afterwards:


[2018-07-25T10:28:57.0893 CEST] [debug] QEMU: qemu-system-x86_64: -blockdev driver=qcow2,node-name=hd0-overlay1,file=hd0-overlay1-file,cache.no-flush=on,backing=hd0: Could not open backing file: Cannot reference an existing block device with additional options or a new filename

This seems like the more common one so I will start working on that.

#5 Updated by rpalethorpe over 1 year ago

  • Copied to action #38822: Qemu: Could not open backing file: Cannot reference an existing block device with additional options or a new filename added

#6 Updated by okurz over 1 year ago

Please avoid the word "refactoring", as this was not refactoring in the more common meaning of the term.

#7 Updated by rpalethorpe over 1 year ago

  • Subject changed from Qemu backend rewrite/refactor fallout to Qemu backend rewrite fallout

#8 Updated by rpalethorpe over 1 year ago

-bios flag needs to be avoided on ARM: https://openqa.suse.de/tests/1858395/file/autoinst-log.txt

#9 Updated by szarate over 1 year ago

  • Related to action #38840: BIOS variable still present in test suites for aarch64 added

#10 Updated by szarate over 1 year ago

Ah! I just saw that richie already noted the failure in aarch64

#11 Updated by rpalethorpe over 1 year ago

I removed the BIOS var from the ARM machines.

Also hopefully fixed the issue with selecting an undefined console: https://github.com/os-autoinst/os-autoinst/pull/996

#12 Updated by rpalethorpe over 1 year ago

I think the following happens: https://openqa.suse.de/tests/1858462#step/force_scheduled_tasks/1 because it is trying to 'activate' the root console. It is probably doing this because the console is reset when reverting to the snapshot. However the root console is logged in before the snapshot is taken.

The problem here is that we are confusing resetting/activating consoles in the backend with logging them in on the SUT (the two can be different because of snapshots). Possibly the distribution code could be changed to handle the situation where a console is already logged in during activation.

#13 Updated by rpalethorpe over 1 year ago

I don't think my changes introduced the problem with reactivating the console as the code to reset them has always been there. However only consoles activated since the last snapshot should be reset, but it seems the root console, which was active before the snapshot, is also being reset.

This is a tricky bit of code though so my confidence is low.
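
A minimal sketch of the intended behaviour described above, assuming a simple set-based model. The class and method names are hypothetical and are not taken from os-autoinst: only consoles activated after a snapshot was saved are reported for reset when that snapshot is loaded.

```python
class ConsoleTracker:
    """Hypothetical model of console state; not the os-autoinst implementation."""

    def __init__(self):
        self.active = set()    # consoles currently activated
        self.snapshots = {}    # snapshot name -> consoles active when it was saved

    def activate(self, name):
        self.active.add(name)

    def save_snapshot(self, snapshot):
        # Remember which consoles were already active (and logged in) at save time.
        self.snapshots[snapshot] = set(self.active)

    def load_snapshot(self, snapshot):
        """Return the consoles to reset and restore the saved active set."""
        saved = self.snapshots[snapshot]
        to_reset = self.active - saved   # activated since the snapshot
        self.active = set(saved)
        return to_reset
```

With this model, a root console activated before the snapshot would stay active after a revert, and only later consoles would be reset.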

#14 Updated by rpalethorpe over 1 year ago

We should probably add some "self tests" within the suse test suite for testing reverting to snapshots under a number of circumstances.

#15 Updated by rpalethorpe over 1 year ago

Boot failure on USBInstall is possibly caused by changes to how the USB storage device is handled.
https://openqa.suse.de/tests/1859849#

#16 Updated by rpalethorpe over 1 year ago

The USB storage device appears in different positions in the boot order depending on the exact command line options used. It is not clear if this is deterministic or not. However the bootindex device parameter can hopefully be used to solve this, so we don't even need to go into the boot menu: https://github.com/os-autoinst/os-autoinst/pull/997
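
A rough sketch of how the bootindex idea could look when building the QEMU command line. The helper name, drive id and raw image format are assumptions for illustration, and a USB controller is assumed to be configured elsewhere; this is not the code from the pull request.

```python
def usb_boot_args(image_path, drive_id="usbstick", bootindex=0):
    """Build QEMU arguments for a USB storage device with an explicit boot priority.

    Hypothetical helper: drive_id and format=raw are illustrative defaults.
    """
    return [
        "-drive", f"if=none,id={drive_id},file={image_path},format=raw",
        # A lower bootindex means the firmware tries the device earlier.
        "-device", f"usb-storage,drive={drive_id},bootindex={bootindex}",
    ]
```

Giving the USB device the lowest bootindex should make its position in the boot order independent of where it appears on the command line.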

#17 Updated by rpalethorpe over 1 year ago

The version of QEMU on at least some PowerPC workers is much too old (2.6.2).

#19 Updated by rpalethorpe over 1 year ago

FFS, OpenQA has two different functions which decide if a setting is an asset. One here which I updated and one here which I did not update.

The pflash assets are being stored in the correct place because of the first, but the variable is not being substituted with the absolute path due to the second.
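
One way to avoid this kind of mismatch is to have both code paths share a single predicate. The sketch below is only an illustration in Python: the function names, the pattern and the setting name UEFI_PFLASH_VARS are assumptions, not the actual openQA code.

```python
import re

# Illustrative pattern only; the real list of asset settings lives in openQA.
ASSET_SETTING = re.compile(r"^(ISO|HDD_\d+|UEFI_PFLASH_VARS)$")


def is_asset_setting(name):
    """Single source of truth for 'is this setting an asset?'."""
    return ASSET_SETTING.match(name) is not None


def assets_to_store(settings):
    # First consumer: decide which settings refer to files that must be stored.
    return {k: v for k, v in settings.items() if is_asset_setting(k)}


def substitute_paths(settings, asset_dir):
    # Second consumer: replace asset names with absolute paths.
    return {
        k: (f"{asset_dir}/{v}" if is_asset_setting(k) else v)
        for k, v in settings.items()
    }
```

Because both functions call is_asset_setting, adding a new asset type in one place updates storing and path substitution together.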

#20 Updated by szarate over 1 year ago

  • Related to action #38849: [tools] openqaworker - DIE can't open qmp added

#21 Updated by coolo over 1 year ago

That obviously did not help, so revert that again

Update the machines to SP3 - powerqaworker-qam-1 runs on SP3 without major problems:
https://openqa.suse.de/admin/workers/1049

#22 Updated by szarate over 1 year ago

@coolo ok, upgrading the workers then

#23 Updated by szarate over 1 year ago

All Power workers have been upgraded; only malbec still needs to be tested. I will probably use the chance to reduce the number of workers there to 3, since there have been reports that it's causing some trouble.

#24 Updated by rpalethorpe over 1 year ago

This test is dying because it is trying to download the pflash drive, which doesn't exist for non-UEFI tests: https://openqa.opensuse.org/tests/715933.

The variable needs to be ignored when UEFI is not set or we somehow need to add the variable when UEFI is set.
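
The first option (ignoring the variable when UEFI is not set) could be sketched roughly as follows. The helper names and the setting name UEFI_PFLASH_VARS are assumptions for illustration, not the actual openQA or os-autoinst code.

```python
# Assumed name of the setting that triggers the pflash download.
PFLASH_DOWNLOAD_VARS = ("UEFI_PFLASH_VARS",)


def uefi_enabled(settings):
    """Treat a missing, empty or "0" UEFI setting as 'not set'."""
    return str(settings.get("UEFI", "")).strip() not in ("", "0")


def drop_pflash_without_uefi(settings):
    """Return a copy of settings without pflash variables unless UEFI is set."""
    if uefi_enabled(settings):
        return dict(settings)
    return {k: v for k, v in settings.items() if k not in PFLASH_DOWNLOAD_VARS}
```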

#25 Updated by rpalethorpe over 1 year ago

From the discussion on IRC: We could add a code hook before isotovideo is run which allows assets to be cached/downloaded by the test suite and/or backend.

So we could revert the changes which detect pflash vars as a hdd asset and add a code hook to the backend instead which downloads the asset and changes the variable to include its full path.

#26 Updated by rpalethorpe over 1 year ago

As a temporary workaround the backend could publish an empty vars file if PUBLISH_PFLASH_VARS is present, but UEFI is not set.

#28 Updated by okurz over 1 year ago

Pretty sure that https://openqa.suse.de/tests/1873194#step/yast2_bootloader/6 is related as well. The disk used for installing the bootloader changed. Please make sure to address this as well. If you think it's acceptable as is then just create a new corresponding needle to cover this.

#29 Updated by okurz over 1 year ago

https://openqa.suse.de/tests/1874310#step/grub_test/4 in the USBinstall test suite also seems to be related.

#30 Updated by rpalethorpe over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to rpalethorpe

Another potential problem is that I am using '-boot order=' and bootindex at the same time. However I have not seen any problems yet. IIRC it depends on the VM's bios firmware if it explodes or not.

#31 Updated by szarate over 1 year ago

I have applied the following patch: https://github.com/os-autoinst/openQA/pull/1732 which seems to work, and should allow us to carry on testing while an appropriate solution is found.

#32 Updated by rpalethorpe over 1 year ago

Doh. The serial file is probably being truncated on reverting to a snapshot. Need to change the serial log to append.
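
A small illustration of the difference, using plain Python file modes rather than the actual backend code. The file name and helper functions are made up for the example: re-opening the log in write mode after a revert discards earlier output, while append mode keeps it.

```python
import os
import tempfile


def write_serial(path, data, append):
    # "a" keeps existing content; "w" truncates the file on open.
    mode = "a" if append else "w"
    with open(path, mode) as fh:
        fh.write(data)


def demo(append):
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "serial0.txt")
        write_serial(path, "output before snapshot\n", append)
        # Simulate the log being re-opened after reverting to a snapshot.
        write_serial(path, "output after revert\n", append)
        with open(path) as fh:
            return fh.read()
```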

#33 Updated by szarate over 1 year ago

Looks like for some tests, the worker is not even downloading the pflash https://openqa.suse.de/tests/1876680

#35 Updated by szarate over 1 year ago

#37 Updated by szarate over 1 year ago

#38 Updated by szarate over 1 year ago

  • Related to action #38996: [sle][migration][sle12sp4] test fails in boot_to_desktop -- need wait to assert needle inst-slof added

#39 Updated by rpalethorpe over 1 year ago

  • Related to action #38963: [functional][y][fast] qemu backend rewrite: upgrade not possible anymore in many scenarios added

#40 Updated by rpalethorpe over 1 year ago

  • Related to action #32968: [kernel][tools] Refactor QEMU backend - Create QEMU process manager and save configuration state added

#41 Updated by szarate over 1 year ago

looks like another one more in the saga: https://openqa.suse.de/tests/1882747

#42 Updated by rpalethorpe over 1 year ago

szarate wrote:

looks like another one more in the saga: https://openqa.suse.de/tests/1882747

https://github.com/os-autoinst/os-autoinst/pull/1005

Also ARM snapshots are timing out on opensuse. The aarch64 machines have been given 4GB of RAM which on ARM takes slightly more than 4 minutes to compress and save to disk. If the RAM size needs to be this large then snapshots probably need to be disabled again. However we really want snapshots enabled on SLE, but I think the RAM size is less than 2GB there which should be OK.

Ideally we want snapshots enabled everywhere, but I will follow that up in the main ticket.
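
As a back-of-the-envelope check of the timing above, assuming the save time scales linearly with RAM size (an assumption, not something measured here) and taking 4 GB in roughly 4 minutes as the reference point:

```python
def estimate_save_seconds(ram_mib, ref_ram_mib=4096, ref_seconds=240):
    """Linear extrapolation from the 4 GB / ~4 minute observation.

    The reference values are rounded from the comment above; real timings
    depend on the host, memory contents and compression.
    """
    return ram_mib * ref_seconds / ref_ram_mib
```

Under that assumption, estimate_save_seconds(2048) gives about 120 seconds, which is consistent with the expectation that machines with 2 GB or less should be OK.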

#43 Updated by szarate over 1 year ago

  • Related to action #39101: Publishing of assets failed when extracting pflash-vars qcow image. added

#44 Updated by rpalethorpe over 1 year ago

  • Status changed from In Progress to Resolved

This still takes up some time, but the vast majority of it is over.

#46 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done

#47 Updated by zluo over 1 year ago

sorry, wrong entry here, pls ignore...
