action #51743
closed [openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]
Added by pvorel over 5 years ago. Updated about 4 years ago.
0%
Description
Reproducible
Fails on SLE12 SP5 and openSUSE since:
- Build 20190516 (openSUSE)
The install_ltp tests run, but then boot_ltp does not see the qcow2 image, so it does not see GRUB and tries to boot from PXE:
https://openqa.opensuse.org/tests/936921#step/boot_ltp/2
Tests on archs other than x86_64 are ok:
https://openqa.opensuse.org/tests/936570#step/boot_ltp/1
Expected result
Last good:
- Build 20190514 (openSUSE)
Hint
QEMU complains "Unknown host!":
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: QEMU emulator version 2.9.1(openSUSE Leap 42.3)
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
NOTE: it cannot be reproduced outside of o3 (on osd or private worker).
Files
autoinst-log.936570.txt (730 KB), pvorel, 2019-05-21 12:36
Updated by pvorel over 5 years ago
14:46 < nsinger> pvorel: the previous (working) runs were running with qemu-system-x86_64 -cpu qemu64 and now it runs with "-cpu host".
Updated by SLindoMansilla over 5 years ago
- Category set to Infrastructure
As a result of backlog triaging (see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging for more information).
Please feel free to adjust the category or the "[label]" if you think differently.
Updated by pvorel over 5 years ago
Broken build https://openqa.opensuse.org/tests/940933/file/autoinst-log.txt
Use of uninitialized value in string eq at /usr/lib/os-autoinst/OpenQA/Qemu/DriveDevice.pm line 115.
Use of uninitialized value in string eq at /usr/lib/os-autoinst/OpenQA/Qemu/DriveDevice.pm line 115.
[2019-05-23T20:46:39.683 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :104,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/14/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/14/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0
Attempt 0 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 1 at /usr/lib/os-autoinst/osutils.pm line 130.
Updated by pvorel over 5 years ago
- Subject changed from [ltp][kernel][opensuse] All LTP tests are failing on openSUSE on x86_64 to [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on x86_64
- Description updated (diff)
Updated by pvorel over 5 years ago
asmorodskyi found that the QCOW2 image on osd was broken.
Its size looks ok, but qemu-img check complains:
$ qemu-img check sle-12-SP5-x86_64-0187-Server-DVD@64bit-with-ltp.qcow2
qemu-img: This image format does not support checks
The QEMU version that causes problems is 2.9.1 (openSUSE Leap 42.3), but this version was probably also used for previous builds, which didn't fail. On quasar.suse.cz, where I ran install_ltp and it produced a correct image, the version is 2.11.2 (SUSE Linux Enterprise 15). But I really have no clue what causes the wrong qcow images to be produced.
Updated by pvorel over 5 years ago
Retriggering install_ltp helps on both osd and o3. Not sure what creates the wrong image.
Updated by pcervinka over 5 years ago
I rechecked the recent failed job https://openqa.suse.de/tests/2992911, downloaded the qcow2 image and tried to boot from it in a local virtual machine. Unfortunately it ended with a "no bootable device" error. This means that the image really is corrupted and we need to find where the corruption starts. The image may already be generated incorrectly on the worker, or it may be corrupted during upload from the worker.
Both things can be difficult to troubleshoot.
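One way to tell these two cases apart would be to compare a checksum of the image as it exists on the worker with a checksum of the copy downloaded from the server. The following is only a minimal sketch; the function names and file paths are hypothetical and nothing here is part of openQA or os-autoinst.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so multi-GB images do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical content (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If the digest taken on the worker right after publishing differs from the digest of the downloaded asset, the damage happened after generation (upload, storage or caching); if they match, the image was already wrong on the worker.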
Updated by pcervinka over 5 years ago
@pvorel could you try setting the variable QEMU_DISABLE_SNAPSHOTS in the install_ltp test suite?
Updated by rpalethorpe over 5 years ago
Just a thought, but if this is related to the problems I have seen before, then this is due to how the qcow images are stored and distributed. A solution might be to put them in a Ceph object store and allow it to handle as much of the distribution and caching as possible.
Updated by pvorel over 5 years ago
- Related to action #53294: [kernel][ltp] test fails in boot_ltp - incorrect kernel name provided added
Updated by okurz over 5 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: kernel-live-patching
https://openqa.opensuse.org/tests/973499
Updated by rpalethorpe over 5 years ago
Yeah, seems like the image on x86_64 o3 is always corrupt:
[2019-07-04T08:13:11.519 CEST] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/opensuse-Tumbleweed-x86_64-20190703-DVD@64bit-with-ltp.qcow2
[2019-07-04T08:13:11.534 CEST] [debug] {
"virtual-size": 2322137088,
"filename": "/var/lib/openqa/pool/2/opensuse-Tumbleweed-x86_64-20190703-DVD@64bit-with-ltp.qcow2",
"format": "raw",
"actual-size": 2324410368,
"dirty-flag": false
}
Format should be qcow2.
[2019-07-04T09:19:53.726 UTC] [debug] {
"virtual-size": 32212254720,
"filename": "/var/lib/openqa/pool/6/opensuse-Tumbleweed-ppc64le-20190703-DVD@ppc64le-with-ltp.qcow2",
"cluster-size": 65536,
"format": "qcow2",
"actual-size": 2299990016,
"format-specific": {
"type": "qcow2",
"data": {
"compat": "1.1",
"lazy-refcounts": false,
"refcount-bits": 16,
"corrupt": false
}
},
"dirty-flag": false
}
The publishing job doesn't take any snapshots.
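The format comparison above could be scripted. A minimal sketch, assuming qemu-img is installed and using the same `qemu-img info --output=json` call that appears in the log; the helper names are hypothetical and this is not how os-autoinst itself does the check.

```python
import json
import subprocess


def qemu_img_info(path):
    """Run `qemu-img info --output=json PATH` and return the parsed JSON."""
    result = subprocess.run(
        ["qemu-img", "info", "--output=json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def looks_like_valid_qcow2(info):
    """Check the fields seen in the logs above: format and corrupt flag."""
    if info.get("format") != "qcow2":
        return False
    data = info.get("format-specific", {}).get("data", {})
    # A missing "corrupt" key is treated as "not corrupt" here (assumption).
    return data.get("corrupt", False) is False
```

With the two JSON outputs above, the x86_64 image (format "raw") would be rejected and the ppc64le image (format "qcow2", corrupt false) would be accepted.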
Updated by okurz over 5 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ltp_net_sctp
https://openqa.suse.de/tests/3074428
Updated by pcervinka over 5 years ago
If we combine system installation with LTP installation, we will be able to save one job and one image. Maybe we could merge them into a single create_hdd_ltp job which would do the system installation and the LTP installation at once. I see that there already were some experiments (create_hdd_kotd_ltp), but it is not used.
What do you think?
Updated by pvorel over 5 years ago
For VM-based testing it worked ok, as we reused an already installed image. So of these 3 steps (1) install OS, 2) install LTP, 3) run test), I'd prefer to either join 2) + 3) and/or (only for IPMI) have all 3 steps in a single test suite.
For IPMI I'm planning to use Michie's way (iPXE based installation), where he's going to detect which SLES version has been installed (so we avoid installing it if not needed).
Updated by okurz over 5 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: kernel-live-patching
https://openqa.opensuse.org/tests/1001688
Updated by pvorel over 5 years ago
Fixed on osd some time ago, but still broken on o3.
Updated by okurz over 5 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: ltp_net_ipv6_lib
https://openqa.suse.de/tests/3308930
Updated by rpalethorpe over 5 years ago
- Subject changed from [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on x86_64 to [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on [x86_64]
Add arch tag for JDP.
Updated by pvorel over 5 years ago
- Subject changed from [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on [x86_64] to [ltp][kernel] All LTP tests are failing on openSUSE (o3) on [x86_64]
Tests are failing just on Tumbleweed.
Updated by jlausuch about 5 years ago
- Subject changed from [ltp][kernel] All LTP tests are failing on openSUSE (o3) on [x86_64] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]
Updated by pvorel almost 5 years ago
- Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64][ppc64le]
Hm, ppc64le got broken now as well:
last good is build 20191206 (https://openqa.opensuse.org/tests/overview?version=Tumbleweed&build=20191206&groupid=32&distri=opensuse)
first bad is build 20191216 (https://openqa.opensuse.org/tests/overview?distri=opensuse&groupid=32&build=20191216&version=Tumbleweed)
Updated by pvorel almost 5 years ago
There are other failures on ppc, maybe related https://openqa.suse.de/tests/3723347#next_previous (first fail: 4.12.14-146.1.ge31b461 https://openqa.suse.de/tests/3696777)
Updated by pvorel almost 5 years ago
- Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64][ppc64le] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]
ppc failures are caused by https://bugzilla.suse.com/show_bug.cgi?id=1159096 (on both osd and o3; checked by inspecting qcow2 image).
Updated by pvorel almost 5 years ago
- Related to coordination #61203: [kernel][ltp][epic][grub] General solution for handling kernel parameters (debug_pagealloc=on) added
Updated by pvorel almost 5 years ago
Hm, it's hard to debug this problem on o3 :(. I tried several times to restart a job and watch things on o3. Mostly I waited several hours and the job got restarted while I was away. Today the job got restarted, but I have no SSH access. According to okurz: ssh access to openqa.opensuse.org is down, see https://progress.opensuse.org/issues/61218.
Updated by pvorel almost 5 years ago
Still haven't found the root cause of the problem.
Updated by pvorel almost 5 years ago
jlausuch wrote:
can you reproduce it locally?
No, that's the hardest problem on this ticket (I thought I reported it, but I didn't), together with busyness of o3 (it's hard to reschedule the job).
At least my PR which could help debugging a bit has been merged https://github.com/os-autoinst/os-autoinst/pull/1327.
Updated by pvorel almost 5 years ago
Hm, but looking at the log (https://openqa.opensuse.org/tests/1142577/file/autoinst-log.txt) of the failing Intel build (https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200112&groupid=32), the published qcow2 image has qcow2 format, so the problem is somewhere else :(
[2020-01-14T02:18:29.943 CET] [debug] running nice ionice qemu-img convert -c -O qcow2 /var/lib/openqa/pool/4/raid/hd0-overlay0 assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2
[2020-01-14T02:20:56.864 CET] [debug] running qemu-img info --output=json assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2
[2020-01-14T02:20:56.876 CET] [debug] {
"virtual-size": 32212254720,
"filename": "assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2",
"cluster-size": 65536,
"format": "qcow2",
"actual-size": 1103364096,
"format-specific": {
"type": "qcow2",
"data": {
"compat": "1.1",
"lazy-refcounts": false,
"refcount-bits": 16,
"corrupt": false
}
},
"dirty-flag": false
}
Updated by rpalethorpe almost 5 years ago
To clarify: qemu-img shows that the qcow2 is valid after install-ltp completes, but when the dependent test starts the image is invalid.
So the image is valid before it is uploaded as an asset, but is invalid by the time we download it.
Using hexdump -n 1M -C:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00040000
So the file is probably just all zeroes, but has the correct file length. For reference, it should look more like:
hexdump -C -n 256K qa/runltp-support/ldisc-syzkaller.qcow2
00000000 51 46 49 fb 00 00 00 03 00 00 00 00 00 00 01 38 |QFI............8|
00000010 00 00 00 0b 00 00 00 10 00 00 00 0c 80 00 00 00 |................|
00000020 00 00 00 00 00 00 00 64 00 00 00 00 00 03 00 00 |.......d........|
00000030 00 00 00 00 00 01 00 00 00 00 00 01 00 00 00 00 |................|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
...
There do not appear to be any errors during uploading. However, it is possible that openQA is creating the file, but not writing to it correctly. Alternatively, something could incorrectly copy the file after it has been uploaded as part of the asset caching.
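The hexdump comparison can also be automated: a qcow2 file starts with the four magic bytes 51 46 49 fb ("QFI" followed by 0xfb), as in the reference dump above, while the broken asset starts with zeroes. A minimal sketch; the function name, the return labels and the 256K probe size are illustrative choices, not anything openQA provides.

```python
QCOW2_MAGIC = b"QFI\xfb"  # first four bytes of every qcow2 image


def classify_header(path, probe_size=256 * 1024):
    """Classify a file by its first bytes: 'qcow2', 'zeroed', 'empty' or 'unknown'."""
    with open(path, "rb") as f:
        head = f.read(probe_size)
    if not head:
        return "empty"
    if head.startswith(QCOW2_MAGIC):
        return "qcow2"
    if not any(head):
        # Every probed byte is 0x00, like the broken asset shown above.
        return "zeroed"
    return "unknown"
```

Running such a check on the published asset, on the server copy and on the worker cache copy would show at which hop the file turns into zeroes.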
Updated by rpalethorpe almost 5 years ago
- Related to action #45836: [tools] qcow images mismatch in size added
Updated by pvorel almost 5 years ago
rpalethorpe wrote:
To clarify: qemu-img shows that the qcow2 is valid after install-ltp completes, but when the dependent test starts the image is invalid.
So the image is valid before it is uploaded as an asset, but is invalid by the time we download it.
Yep, I found that as well before, but just didn't believe it could be possible (so I planned to investigate it more). mdoucha found that both install_ltp+opensuse+DVD and install_ltp+opensuse+DVD-m32 used the same PUBLISH_HDD_1 and PUBLISH_PFLASH_VARS variables (thanks Martin!). While PUBLISH_PFLASH_VARS might not be a problem, PUBLISH_HDD_1 certainly is. I restarted the jobs with correct variables, let's see.
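A collision like this, where two jobs publish to the same asset name and can overwrite each other's image, could be detected mechanically. A minimal sketch, assuming the job settings are available as plain dictionaries; the function name and the input layout are hypothetical and do not correspond to an openQA API.

```python
from collections import defaultdict


def find_publish_collisions(jobs, keys=("PUBLISH_HDD_1", "PUBLISH_PFLASH_VARS")):
    """Return {(key, value): [job names]} for values set by more than one job.

    `jobs` maps a job name to its settings dict (hypothetical layout).
    """
    seen = defaultdict(list)
    for name, settings in jobs.items():
        for key in keys:
            value = settings.get(key)
            if value:
                seen[(key, value)].append(name)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Example with made-up asset names:
jobs = {
    "install_ltp+opensuse+DVD": {"PUBLISH_HDD_1": "example-with-ltp.qcow2"},
    "install_ltp+opensuse+DVD-m32": {"PUBLISH_HDD_1": "example-with-ltp.qcow2"},
}
collisions = find_publish_collisions(jobs)
```

Here `collisions` would contain one entry for `("PUBLISH_HDD_1", "example-with-ltp.qcow2")` listing both jobs.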
Updated by rpalethorpe almost 5 years ago
- Related to action #34597: Race condition causing problems with the worker cache added
Updated by rpalethorpe almost 5 years ago
- Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] to [openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]
Updated by rpalethorpe almost 5 years ago
- Related to action #13646: Ensuring asset files integrity (was: "An error occurred during the installation" on images) added
Updated by pvorel almost 5 years ago
- Status changed from In Progress to Feedback
Build 20200113 is ok :).
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200113&groupid=32
Looks like the wrong setup of HDD_1 really was causing this problem.
Let's wait a few more builds to be sure.
Updated by pvorel almost 5 years ago
- Status changed from Feedback to Resolved
Builds 20200114 and 20200115 are also ok. Also, the problem really was just on Intel, which was the only arch affected by the wrong setup => fixed.
Updated by pvorel almost 5 years ago
Tests fail again: https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200119&groupid=32
But this time it's something else: https://openqa.opensuse.org/tests/1150521#
[2020-01-21T16:12:06.0882 CET] [info] +++ setup notes +++
[2020-01-21T16:12:06.0882 CET] [info] Start time: 2020-01-21 15:12:06
[2020-01-21T16:12:06.0882 CET] [info] Running on openqaworker1:7 (Linux 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64)
[2020-01-21T16:12:06.0890 CET] [info] Downloading opensuse-Tumbleweed-x86_64-20200119-DVD@64bit-with-ltp.qcow2, request #203793 sent to Cache Service
[2020-01-21T16:12:11.0950 CET] [info] Download of opensuse-Tumbleweed-x86_64-20200119-DVD@64bit-with-ltp.qcow2 processed
[2020-01-21T16:12:11.0985 CET] [info] +++ worker notes +++
[2020-01-21T16:12:11.0985 CET] [info] End time: 2020-01-21 15:12:11
[2020-01-21T16:12:11.0985 CET] [info] Result: setup failure
[2020-01-21T16:12:11.0998 CET] [info] Uploading autoinst-log.txt
Updated by pvorel almost 5 years ago
- Related to action #63373: [o3][kernel][scheduler][x86_64] Dependent (child) jobs should start after uploading all of parent assets added
Updated by pcervinka about 4 years ago
- Target version changed from 457 to QE Kernel Done