action #19012
closed [tools] test gic-version=3 and its=off on the cavium thunderx
Description
Observation
openQA test in scenario sle-12-SP3-Server-DVD-aarch64-cryptlvm+activate_existing+import_users@aarch64 ends as incomplete after
boot_encrypt while trying to save a memory dump. Log file content:
21:45:35.5310 8746 >>> testapi::wait_screen_change: timed out
21:45:35.5313 8746 Save memory dump to debug bootup problems, e.g. for bsc#1005313
21:45:35.5315 Debug: /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/first_boot.pm:72 called testapi::save_memory_dump
21:45:35.5319 8746 <<< testapi::save_memory_dump(filename='first_boot')
21:45:35.5320 8746 Trying to save machine state
21:45:35.5335 8748 Migrating the machine.
21:45:35.5342 8748 EVENT {"event":"NIC_RX_FILTER_CHANGED","data":{"path":"/machine/peripheral-anon/device[1]/virtio-backend"},"timestamp":{"microseconds":321504,"seconds":1494019343}}
21:45:35.5346 8748 EVENT {"data":{"server":{"host":"0.0.0.0","service":"5991","websocket":false,"auth":"none","family":"ipv4"},"client":{"host":"127.0.0.1","service":"37174","websocket":false,"family":"ipv4"}},"timestamp":{"microseconds":644277,"seconds":1494019422},"event":"VNC_DISCONNECTED"}
21:45:35.5348 8748 EVENT {"timestamp":{"microseconds":645124,"seconds":1494019422},"data":{"server":{"host":"0.0.0.0","service":"5991","websocket":false,"family":"ipv4","auth":"none"},"client":{"family":"ipv4","websocket":false,"service":"37968","host":"127.0.0.1"}},"event":"VNC_CONNECTED"}
21:45:35.5350 8748 EVENT {"timestamp":{"seconds":1494019422,"microseconds":656703},"data":{"server":{"family":"ipv4","auth":"none","websocket":false,"service":"5991","host":"0.0.0.0"},"client":{"service":"37968","host":"127.0.0.1","family":"ipv4","websocket":false}},"event":"VNC_INITIALIZED"}
21:45:35.5353 8748 EVENT {"timestamp":{"microseconds":958315,"seconds":1494020298},"data":{"server":{"websocket":false,"family":"ipv4","auth":"none","host":"0.0.0.0","service":"5991"},"client":{"family":"ipv4","websocket":false,"service":"37968","host":"127.0.0.1"}},"event":"VNC_DISCONNECTED"}
21:45:35.5355 8748 EVENT {"data":{"client":{"host":"127.0.0.1","service":"41462","websocket":false,"family":"ipv4"},"server":{"host":"0.0.0.0","service":"5991","websocket":false,"family":"ipv4","auth":"none"}},"timestamp":{"seconds":1494020298,"microseconds":958863},"event":"VNC_CONNECTED"}
21:45:35.5357 8748 EVENT {"event":"VNC_INITIALIZED","timestamp":{"seconds":1494020298,"microseconds":965817},"data":{"server":{"auth":"none","family":"ipv4","websocket":false,"service":"5991","host":"0.0.0.0"},"client":{"host":"127.0.0.1","service":"41462","websocket":false,"family":"ipv4"}}}
21:45:35.5360 8748 EVENT {"event":"VNC_DISCONNECTED","timestamp":{"microseconds":326005,"seconds":1494020345},"data":{"server":{"service":"5991","host":"0.0.0.0","auth":"none","family":"ipv4","websocket":false},"client":{"family":"ipv4","websocket":false,"service":"41462","host":"127.0.0.1"}}}
21:45:35.5362 8748 EVENT {"data":{"client":{"websocket":false,"family":"ipv4","host":"127.0.0.1","service":"41772"},"server":{"family":"ipv4","auth":"none","websocket":false,"service":"5991","host":"0.0.0.0"}},"timestamp":{"microseconds":326765,"seconds":1494020345},"event":"VNC_CONNECTED"}
21:45:35.5364 8748 EVENT {"data":{"client":{"websocket":false,"family":"ipv4","host":"127.0.0.1","service":"41772"},"server":{"family":"ipv4","auth":"none","websocket":false,"service":"5991","host":"0.0.0.0"}},"timestamp":{"microseconds":342342,"seconds":1494020345},"event":"VNC_INITIALIZED"}
21:45:35.5367 8748 EVENT {"event":"VNC_DISCONNECTED","data":{"client":{"host":"127.0.0.1","service":"41772","websocket":false,"family":"ipv4"},"server":{"auth":"none","family":"ipv4","websocket":false,"service":"5991","host":"0.0.0.0"}},"timestamp":{"seconds":1494020493,"microseconds":811602}}
21:45:35.5369 8748 EVENT {"event":"VNC_CONNECTED","data":{"client":{"family":"ipv4","websocket":false,"service":"42626","host":"127.0.0.1"},"server":{"websocket":false,"auth":"none","family":"ipv4","host":"0.0.0.0","service":"5991"}},"timestamp":{"microseconds":812345,"seconds":1494020493}}
21:45:35.5371 8748 EVENT {"event":"VNC_INITIALIZED","timestamp":{"microseconds":823717,"seconds":1494020493},"data":{"server":{"service":"5991","host":"0.0.0.0","auth":"none","family":"ipv4","websocket":false},"client":{"service":"42626","host":"127.0.0.1","family":"ipv4","websocket":false}}}
21:45:35.5373 8748 EVENT {"timestamp":{"microseconds":762036,"seconds":1494020636},"event":"RESET"}
DIE Migration failed: desc: State blocked by non-migratable device 'arm_gicv3_its', class: GenericError, stopped at /usr/lib/os-autoinst/backend/qemu.pm line 174.
at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
backend::baseclass::die_handler('Migration failed: desc: State blocked by non-migratable devic...') called at /usr/lib/os-autoinst/backend/qemu.pm line 174
backend::qemu::save_memory_dump('backend::qemu=HASH(0x323730a8)', 'HASH(0x328a90e8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 68
backend::baseclass::handle_command('backend::qemu=HASH(0x323730a8)', 'HASH(0x328ad208)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 422
backend::baseclass::check_socket('backend::qemu=HASH(0x323730a8)', 'IO::Handle=GLOB(0x322dcdb0)') called at /usr/lib/os-autoinst/backend/qemu.pm line 1018
backend::qemu::check_socket('backend::qemu=HASH(0x323730a8)', 'IO::Handle=GLOB(0x322dcdb0)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 203
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 151
backend::baseclass::run_capture_loop('backend::qemu=HASH(0x323730a8)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 122
backend::baseclass::run('backend::qemu=HASH(0x323730a8)', 6, 9) called at /usr/lib/os-autoinst/backend/driver.pm line 85
backend::driver::start('backend::driver=HASH(0x2e6a5cc8)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
backend::driver::new('backend::driver', 'qemu') called at /usr/bin/isotovideo line 206
main::init_backend() called at /usr/bin/isotovideo line 271
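For triaging logs like the one above, the blocking device can be pulled out of the error text. This is only a minimal sketch: the function name `blocking_device` is hypothetical and not part of os-autoinst, and the pattern is taken from the single message quoted in this ticket, so other QEMU versions may word it differently.

```python
import re

# Pattern based on the error text quoted in this ticket (assumption:
# other QEMU versions may phrase the message differently).
BLOCKED_RE = re.compile(r"State blocked by non-migratable device '([^']+)'")


def blocking_device(message):
    """Return the name of the non-migratable device, or None if no match."""
    match = BLOCKED_RE.search(message)
    return match.group(1) if match else None


msg = ("Migration failed: desc: State blocked by non-migratable device "
       "'arm_gicv3_its', class: GenericError")
print(blocking_device(msg))
```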
Reproducible
- Make test fail when trying to boot using "first_boot"
- openQA should try to save the memory dump
- Observe the error
Expected result
Last good: probably some months ago on overdrive2, which still had the old qemu version
Further details
Always latest result in this scenario: latest
Updated by okurz almost 7 years ago
- Project changed from openQA Tests to openQA Project
- Subject changed from aarch64 fails to save memory dump because of "non-migratable device 'arm_gicv3_its'" to [tools][aarch64] fails to save memory dump because of "non-migratable device 'arm_gicv3_its'"
- Category changed from Infrastructure to 132
Updated by szarate almost 7 years ago
- Subject changed from [tools][aarch64] fails to save memory dump because of "non-migratable device 'arm_gicv3_its'" to [tools][aarch64] test gic-version=3 and its=off on the cavium thunderx
- Target version set to Milestone 7
overdrive2 should not run jobs for sp3.
This job ran on openqaworker-arm-2. Jobs on that worker need to run with its=off, which has the new patches that should remove the RCU stalls that prevent us from using the ThunderX machines as workers.
Updated by RBrownSUSE almost 7 years ago
- Assignee set to nicksinger
Nick's looking at it as part of his aarch64 investigations
Updated by nicksinger almost 7 years ago
- Subject changed from [tools][aarch64] test gic-version=3 and its=off on the cavium thunderx to [tools] save_memory_dump fails if some qemu-devices are not migratable
I'm somewhat confused by this ticket.
The title says to test gic-version=3 and its=off, but the whole description is about a non-migratable device which only exists because of the parameter "gic-version=3" (the device in question is called "arm_gicv3_its"…).
Testing gic-version=3 and its=off is now covered by poo#17740 (new bullet point).
I'll change this ticket to be much more general because I think it covers an interesting point (which is related neither to aarch64 nor to the Caviums): openQA's memory dump routine fails if some qemu device is not migratable, which can happen from time to time (as in this case, for example).
Updated by okurz almost 7 years ago
If you read #19012#note-2, that should resolve your confusion. szarate updated the subject line. I am wondering whether there is a ticket that is blocking this one?
Updated by nicksinger almost 7 years ago
Reading note-2 just enables me to point the finger at santi but does not explain his weighty reasons ;)
Which ticket exactly should block this one? poo#17740 (and its changes) only causes this problem on aarch64 (so one could argue that this is raised and blocked by poo#17740). But as I said in my previous comment, I think this issue is much more generic and should be caught regardless of the architecture.
Updated by EDiGiacinto almost 7 years ago
Which version of qemu was used? The QMP answer from the API looks like migration on aarch64 with VGICv3 is not yet supported by the qemu/kernel version in use, and if qemu is 2.8, that is probably the case.
In the current implementation [1] if qmp returns an error we let the test die. As a mitigation we could decide to not let the test stop and continue without snapshotting capabilities.
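The mitigation could look roughly like the following. This is only an illustrative sketch in Python, not the os-autoinst code (which is Perl): the function `handle_migrate_reply` and its `die_on_error` switch are hypothetical. The only thing assumed about QMP is its usual reply shape, with an "error" object carrying "class" and "desc" on failure.

```python
import json


def handle_migrate_reply(raw_reply, die_on_error=True):
    """Interpret a QMP reply given as a JSON string.

    Returns True if snapshotting can go ahead and False if it should be
    skipped. With die_on_error=True the error is raised, which mirrors
    the behaviour described above; with False the test would continue
    without snapshotting capabilities.
    """
    reply = json.loads(raw_reply)
    error = reply.get("error")
    if error is None:
        return True
    text = "Migration failed: desc: {}, class: {}".format(
        error.get("desc"), error.get("class"))
    if die_on_error:
        raise RuntimeError(text)
    # Hypothetical mitigation: warn and carry on without snapshots.
    print("warning: " + text + "; continuing without snapshots")
    return False


failed = json.dumps({"error": {
    "class": "GenericError",
    "desc": "State blocked by non-migratable device 'arm_gicv3_its'"}})
print(handle_migrate_reply(failed, die_on_error=False))
```

Whether continuing is actually desirable is a separate question; the sketch only shows that the decision can be isolated in one place.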
VGICv3 aarch64 migration support in the Linux kernel seems to have been updated lately with a new round of patches [2], but qemu VGICv3 save/restore support should only be there in versions >=2.9 [3], and in the Linux kernel from 4.8+ [4]
[1] https://github.com/os-autoinst/os-autoinst/blob/master/backend/qemu.pm#L177
[2] https://www.spinics.net/lists/arm-kernel/msg558046.html
[3] https://github.com/qemu/qemu/commit/b28f9db1a7ce4d537ce2fae6fbce5e5e37dc265b
[4] https://lkml.org/lkml/2016/10/6/450
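The version thresholds mentioned above (qemu >=2.9, kernel 4.8+) could be turned into a quick pre-check before attempting a dump. This is a rough sketch only: the helper names are hypothetical, the thresholds are simply the ones quoted in this comment, and distribution kernels with backported patches would not be judged correctly by a plain version comparison.

```python
import re


def major_minor(version):
    """Extract (major, minor) from strings like '2.9.0' or '4.4.73-5-default'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: " + version)
    return int(match.group(1)), int(match.group(2))


def vgicv3_migration_expected(qemu_version, kernel_version):
    """Heuristic using the thresholds quoted in this ticket:
    qemu >= 2.9 and Linux kernel >= 4.8."""
    return (major_minor(qemu_version) >= (2, 9)
            and major_minor(kernel_version) >= (4, 8))


print(vgicv3_migration_expected("2.8.0", "4.4.73-5-default"))
print(vgicv3_migration_expected("2.9.0", "4.11.0"))
```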
Updated by szarate almost 7 years ago
- Subject changed from [tools] save_memory_dump fails if some qemu-devices are not migratable to [tools] test gic-version=3 and its=off on the cavium thunderx
- Status changed from In Progress to Resolved
This issue was specifically about the problems with gicv3 and its migration support on qemu not being available, so we tested with qemu 2.9 (from SP3) on the Caviums, which are the only systems in our openQA infrastructure that actually require this because of the RCU stalls.
As Ettore points out, we already die when the snapshot can't be created, and I don't see any reason to actually try to recover from that situation.
The tests were specifically meant to ensure that the changed environment would still allow the ThunderX to run tests, which it did: we could run tests and take snapshots with ITS disabled, regardless of the status of the tests.
So for now I'm marking this as solved :), since running tests with GICv3 and ITS off allows us to create snapshots on qemu 2.9.
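For reference, the tested combination can be written as a qemu machine option along these lines. This is a sketch under assumptions: the ticket only names gic-version=3 and its=off, so the `virt` machine type, the `qemu-system-aarch64` binary name and the helper `aarch64_machine_args` are illustrative and not taken from the worker configuration.

```python
def aarch64_machine_args(gic_version=3, its=False):
    """Build the machine part of a qemu command line.

    Assumption: gic-version and its are passed as properties of the
    'virt' machine type; the real worker settings may differ.
    """
    machine = "virt,gic-version={},its={}".format(
        gic_version, "on" if its else "off")
    return ["qemu-system-aarch64", "-machine", machine]


print(" ".join(aarch64_machine_args()))
```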