action #51743

[openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]

Added by pvorel 9 months ago. Updated about 1 month ago.

Status: Resolved
Start date: 21/05/2019
Priority: High
Due date:
Assignee: pvorel
% Done: 0%
Category: Infrastructure
Target version: SUSE QA tests - Current Sprint - kernel
Difficulty:
Duration:

Description

Reproducible

Fails since on SLE12 SP5 and openSUSE

The install_ltp tests run, but then boot_ltp does not see the qcow2 image, so it does not find GRUB and tries to boot from PXE:
https://openqa.opensuse.org/tests/936921#step/boot_ltp/2
Tests on architectures other than x86_64 are ok:
https://openqa.opensuse.org/tests/936570#step/boot_ltp/1

Expected result

Last good:

Hint

Qemu complains "Unknown host!"

[2019-05-17T14:10:26.428 UTC] [debug] QEMU: QEMU emulator version 2.9.1(openSUSE Leap 42.3)
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!

NOTE: it cannot be reproduced outside of o3 (on osd or private worker).

autoinst-log.936570.txt (730 KB) pvorel, 21/05/2019 12:36 pm


Related issues

Related to openQA Tests - action #53294: [kernel][ltp] test fails in boot_ltp - incorrect kernel n... Closed 19/06/2019
Related to openQA Tests - action #61203: [kernel][ltp][epic][grub] General solution for handling k... Resolved 08/01/2020
Related to openQA Project - action #45836: [tools] qcow images mismatch in size Rejected 08/01/2019
Related to openQA Project - action #34597: Race condition causing problems with the worker cache Resolved 11/05/2018
Related to openQA Project - action #13646: Ensuring asset files integrity (was: "An error occurred d... New 09/09/2016
Related to openQA Tests - action #63373: [o3][kernel][scheduler][x86_64] Dependent (child) jobs sh... New 11/02/2020

History

#1 Updated by pvorel 9 months ago

14:46 < nsinger> pvorel: the previous (working) runs were running with qemu-system-x86_64 -cpu qemu64 and now it runs with "-cpu host".

#2 Updated by SLindoMansilla 9 months ago

  • Category set to Infrastructure

As a result of backlog triaging (see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging for more information).

Please feel free to adjust the category or the "[label]" if you think differently.

#3 Updated by pvorel 9 months ago

  • Priority changed from Normal to High

#4 Updated by pvorel 9 months ago

Broken build https://openqa.opensuse.org/tests/940933/file/autoinst-log.txt

Use of uninitialized value in string eq at /usr/lib/os-autoinst/OpenQA/Qemu/DriveDevice.pm line 115.
Use of uninitialized value in string eq at /usr/lib/os-autoinst/OpenQA/Qemu/DriveDevice.pm line 115.
[2019-05-23T20:46:39.683 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :104,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/14/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/14/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0
Attempt 0 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 1 at /usr/lib/os-autoinst/osutils.pm line 130.

#5 Updated by pvorel 9 months ago

  • Subject changed from [ltp][kernel][opensuse] All LTP tests are failing on openSUSE on x86_64 to [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on x86_64
  • Description updated (diff)

#6 Updated by pvorel 9 months ago

  • Description updated (diff)

#7 Updated by pvorel 9 months ago

asmorodskyi found that the QCOW2 image on osd was broken.
Its size looks ok, but qemu-img check complains:

$ qemu-img check sle-12-SP5-x86_64-0187-Server-DVD@64bit-with-ltp.qcow2
qemu-img: This image format does not support checks

The QEMU version that creates the problem is 2.9.1 (openSUSE Leap 42.3), but the same version was probably also used on previous builds which didn't fail. On quasar.suse.cz, where I ran install_ltp and got a correct image, the version is 2.11.2 (SUSE Linux Enterprise 15). I really have no clue what produces the wrong qcow images.

#8 Updated by pvorel 9 months ago

On osd it helped to retrigger install_ltp.

#9 Updated by pvorel 9 months ago

Retriggering install_ltp helps on both osd and o3. Not sure what creates wrong image.

#10 Updated by pcervinka 8 months ago

Rechecked the recent failed job https://openqa.suse.de/tests/2992911, downloaded the qcow2 image and tried to boot from it in a local virtual machine. Unfortunately it ended with a "no bootable device" error. This means the image really is corrupted and we need to find where the corruption starts. The image may already be generated incorrectly on the worker, or it may be corrupted during upload from the worker.
Both things can be difficult to troubleshoot.
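
One way to tell these two cases apart would be to compare a checksum of the image on the worker with a checksum of the copy downloaded later. The following is only a minimal sketch of that idea; the function names `sha256_of` and `same_content` are made up for illustration and are not part of openQA or os-autoinst.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-GB images do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(worker_copy, downloaded_copy):
    """True if both files have byte-identical content."""
    return sha256_of(worker_copy) == sha256_of(downloaded_copy)
```

If the digest taken on the worker right after publishing already differs from a known good image, the generation step is suspect; if it only differs after download, the upload or caching path is.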

#11 Updated by pcervinka 8 months ago

@pvorel could you try to set the variable QEMU_DISABLE_SNAPSHOTS in the install_ltp test suite?

#12 Updated by rpalethorpe 8 months ago

Just a thought, but if this is related to the problems I have seen before, then this is due to how the qcow images are stored and distributed. A solution might be to put them in a Ceph object store and allow it to handle as much of the distribution and caching as possible.

#13 Updated by pvorel 8 months ago

  • Related to action #53294: [kernel][ltp] test fails in boot_ltp - incorrect kernel name provided added

#14 Updated by okurz 8 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: kernel-live-patching
https://openqa.opensuse.org/tests/973499

#15 Updated by rpalethorpe 8 months ago

Yeah, seems like the image on x86_64 o3 is always corrupt:

[2019-07-04T08:13:11.519 CEST] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/opensuse-Tumbleweed-x86_64-20190703-DVD@64bit-with-ltp.qcow2
[2019-07-04T08:13:11.534 CEST] [debug] {
    "virtual-size": 2322137088,
    "filename": "/var/lib/openqa/pool/2/opensuse-Tumbleweed-x86_64-20190703-DVD@64bit-with-ltp.qcow2",
    "format": "raw",
    "actual-size": 2324410368,
    "dirty-flag": false
}   

Format should be qcow2.

[2019-07-04T09:19:53.726 UTC] [debug] {
    "virtual-size": 32212254720,
    "filename": "/var/lib/openqa/pool/6/opensuse-Tumbleweed-ppc64le-20190703-DVD@ppc64le-with-ltp.qcow2",
    "cluster-size": 65536,
    "format": "qcow2",
    "actual-size": 2299990016,
    "format-specific": {
        "type": "qcow2",
        "data": {
            "compat": "1.1",
            "lazy-refcounts": false,
            "refcount-bits": 16,
            "corrupt": false
        }
    },
    "dirty-flag": false
}

The publishing job doesn't take any snapshots.
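
The format check on the two qemu-img outputs above can be automated. This is a small sketch, not something openQA does today as far as this ticket shows; it only assumes the JSON layout visible in the logs above, and the helper names are invented. The sample strings are trimmed copies of the two outputs.

```python
import json


def image_format(qemu_img_info_json):
    """Return the "format" field of `qemu-img info --output=json` output."""
    return json.loads(qemu_img_info_json)["format"]


def is_qcow2(qemu_img_info_json):
    """True if qemu-img detected the image as qcow2."""
    return image_format(qemu_img_info_json) == "qcow2"


# Trimmed copies of the x86_64 (broken) and ppc64le (good) outputs above.
BROKEN_X86_64 = (
    '{"virtual-size": 2322137088, "format": "raw", '
    '"actual-size": 2324410368, "dirty-flag": false}'
)
GOOD_PPC64LE = (
    '{"virtual-size": 32212254720, "cluster-size": 65536, "format": "qcow2", '
    '"actual-size": 2299990016, "dirty-flag": false}'
)

print(is_qcow2(BROKEN_X86_64))  # False
print(is_qcow2(GOOD_PPC64LE))   # True
```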

#16 Updated by okurz 7 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_net_sctp
https://openqa.suse.de/tests/3074428

#17 Updated by pcervinka 7 months ago

If we combine system installation with LTP installation, we will be able to save one job and one image. Maybe we could merge them into a single create_hdd_ltp job which would install the system and LTP at once. I see that there were already some experiments with create_hdd_kotd_ltp, but it is not used.

What do you think?

#18 Updated by pvorel 7 months ago

For VM based testing it worked ok, as we reused an already installed image. So of these 3 steps (1. install OS, 2. install LTP, 3. run test) I'd prefer to either join 2. and 3. and/or (only for IPMI) have all 3 steps in a single test suite.

For IPMI I'm planning to use Michie's way (iPXE based installation), where he's going to detect which SLES version has been installed (so we avoid installing it if not needed).

#19 Updated by okurz 7 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: kernel-live-patching
https://openqa.opensuse.org/tests/1001688

#20 Updated by pvorel 7 months ago

Fixed on osd some time ago, but still broken on o3.

#21 Updated by okurz 6 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_net_ipv6_lib
https://openqa.suse.de/tests/3308930

#22 Updated by rpalethorpe 6 months ago

  • Subject changed from [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on x86_64 to [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on [x86_64]

Add arch tag for JDP.

#23 Updated by pvorel 6 months ago

  • Subject changed from [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on [x86_64] to [ltp][kernel] All LTP tests are failing on openSUSE (o3) on [x86_64]

Tests are failing just on Tumbleweed.

#24 Updated by jlausuch 4 months ago

  • Parent task set to #58685

#25 Updated by jlausuch 4 months ago

  • Subject changed from [ltp][kernel] All LTP tests are failing on openSUSE (o3) on [x86_64] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]

#26 Updated by pvorel 3 months ago

  • Assignee set to pvorel

#27 Updated by pvorel 3 months ago

  • Target version set to Current Sprint - kernel

#28 Updated by pvorel 3 months ago

  • Status changed from New to In Progress

#29 Updated by pvorel 2 months ago

  • Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64][ppc64le]

#30 Updated by pvorel 2 months ago

There are other failures on ppc, maybe related https://openqa.suse.de/tests/3723347#next_previous (first fail: 4.12.14-146.1.ge31b461 https://openqa.suse.de/tests/3696777)

#31 Updated by pvorel 2 months ago

  • Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64][ppc64le] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]

ppc failures are caused by https://bugzilla.suse.com/show_bug.cgi?id=1159096 (on both osd and o3; checked by inspecting qcow2 image).

#32 Updated by pvorel 2 months ago

  • Related to action #61203: [kernel][ltp][epic][grub] General solution for handling kernel parameters (debug_pagealloc=on) added

#33 Updated by pvorel 2 months ago

Hm, it's hard to debug this problem on o3 :(. I tried several times to restart a job and watch things on o3. Mostly I waited several hours and the job got restarted while I was away. Today the job got restarted, but I have no SSH access. According to okurz: ssh access to openqa.opensuse.org is down, see https://progress.opensuse.org/issues/61218.

#34 Updated by pvorel about 1 month ago

Still haven't found the root cause of the problem.

#35 Updated by jlausuch about 1 month ago

can you reproduce it locally?

#36 Updated by pvorel about 1 month ago

  • Description updated (diff)

#37 Updated by pvorel about 1 month ago

jlausuch wrote:

can you reproduce it locally?

No, that's the hardest problem with this ticket (I thought I reported it, but I didn't), together with how busy o3 is (it's hard to reschedule the job).
At least my PR which could help debugging a bit has been merged: https://github.com/os-autoinst/os-autoinst/pull/1327.

#38 Updated by pvorel about 1 month ago

Hm, but looking at the log (https://openqa.opensuse.org/tests/1142577/file/autoinst-log.txt) of the failing intel build (https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200112&groupid=32), the published qcow2 image has qcow2 format, so the problem is somewhere else :(

[2020-01-14T02:18:29.943 CET] [debug] running nice ionice qemu-img convert -c -O qcow2 /var/lib/openqa/pool/4/raid/hd0-overlay0 assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2
[2020-01-14T02:20:56.864 CET] [debug] running qemu-img info --output=json assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2
[2020-01-14T02:20:56.876 CET] [debug] {
    "virtual-size": 32212254720,
    "filename": "assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2",
    "cluster-size": 65536,
    "format": "qcow2",
    "actual-size": 1103364096,
    "format-specific": {
        "type": "qcow2",
        "data": {
            "compat": "1.1",
            "lazy-refcounts": false,
            "refcount-bits": 16,
            "corrupt": false
        }
    },
    "dirty-flag": false
}

#39 Updated by rpalethorpe about 1 month ago

To clarify: qemu-img shows that the qcow2 is valid after install-ltp completes, but when dependent test starts the image is invalid.

So the image is valid before it is uploaded as an asset, but is invalid by the time we download it.

using hexdump -n 1M -C :

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00040000

So the file is probably just all zeroes, but has the correct file length. For reference, it should look more like:

hexdump -C -n 256K qa/runltp-support/ldisc-syzkaller.qcow2
00000000  51 46 49 fb 00 00 00 03  00 00 00 00 00 00 01 38  |QFI............8|
00000010  00 00 00 0b 00 00 00 10  00 00 00 0c 80 00 00 00  |................|
00000020  00 00 00 00 00 00 00 64  00 00 00 00 00 03 00 00  |.......d........|
00000030  00 00 00 00 00 01 00 00  00 00 00 01 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
...

There do not appear to be any errors during uploading. However it is possible that OpenQA is creating the file, but not writing to it correctly. Alternatively something could incorrectly copy the file after it has been uploaded as part of the asset caching.
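
The two hexdumps above suggest a cheap sanity check that does not need qemu-img at all: a qcow2 file starts with the magic bytes 51 46 49 fb ("QFI\xfb"), while the broken download starts with zeroes. A minimal sketch of such a check follows; `classify_image` and its return values are hypothetical names chosen for this example.

```python
QCOW2_MAGIC = b"QFI\xfb"  # bytes 51 46 49 fb, as in the good hexdump above


def classify_image(path, probe_size=256 * 1024):
    """Classify a file by looking only at its first probe_size bytes.

    Returns "qcow2" if the qcow2 magic is present, "zeroed" if the probed
    region is non-empty and contains only zero bytes, "unknown" otherwise.
    """
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    if head.startswith(QCOW2_MAGIC):
        return "qcow2"
    if head and not any(head):
        return "zeroed"
    return "unknown"
```

Running this on the asset right after upload and again on the cached copy on the worker would show at which stage the file content turns into zeroes.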

#40 Updated by rpalethorpe about 1 month ago

  • Related to action #45836: [tools] qcow images mismatch in size added

#41 Updated by pvorel about 1 month ago

rpalethorpe wrote:

To clarify: qemu-img shows that the qcow2 is valid after install-ltp completes, but when dependent test starts the image is invalid.


So the image is valid before it is uploaded as an asset, but is invalid by the time we download it.

Yep, I found that as well before, but just didn't believe it could be possible (so I planned to investigate it more). mdoucha found that both install_ltp+opensuse+DVD and install_ltp+opensuse+DVD-m32 used the same PUBLISH_HDD_1 and PUBLISH_PFLASH_VARS variables (thanks Martin!). While PUBLISH_PFLASH_VARS might not be a problem, PUBLISH_HDD_1 certainly is. I restarted the jobs with correct variables, let's see.
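
This kind of misconfiguration (two test suites publishing to the same asset name) can be spotted mechanically. The sketch below assumes the test suite settings are available as a plain dictionary; the asset names and the third suite are invented for illustration, and `find_publish_collisions` is not an existing openQA function.

```python
from collections import defaultdict


def find_publish_collisions(suites, key="PUBLISH_HDD_1"):
    """Return {asset name: [suite names]} for assets published by more than one suite."""
    by_asset = defaultdict(list)
    for suite_name, settings in suites.items():
        asset = settings.get(key)
        if asset:
            by_asset[asset].append(suite_name)
    return {asset: names for asset, names in by_asset.items() if len(names) > 1}


# Hypothetical settings: the asset names and the "other" suite are made up.
suites = {
    "install_ltp+opensuse+DVD": {"PUBLISH_HDD_1": "example-with-ltp.qcow2"},
    "install_ltp+opensuse+DVD-m32": {"PUBLISH_HDD_1": "example-with-ltp.qcow2"},
    "install_ltp+opensuse+other": {"PUBLISH_HDD_1": "example-other-with-ltp.qcow2"},
}
print(find_publish_collisions(suites))
```

The same function can be called with key="PUBLISH_PFLASH_VARS" to check the second variable mentioned above.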

#42 Updated by rpalethorpe about 1 month ago

  • Related to action #34597: Race condition causing problems with the worker cache added

#43 Updated by rpalethorpe about 1 month ago

  • Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] to [openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]

#44 Updated by rpalethorpe about 1 month ago

  • Related to action #13646: Ensuring asset files integrity (was: "An error occurred during the installation" on images) added

#45 Updated by pvorel about 1 month ago

  • Status changed from In Progress to Feedback

Build 20200113 is ok :).
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200113&groupid=32
Looks like the wrong setup of HDD_1 really was causing this problem.
Let's wait a few more builds to be sure.

#46 Updated by pvorel about 1 month ago

  • Status changed from Feedback to Resolved

Builds 20200114 and 20200115 are also ok. The problem really was just on intel, which was the only arch affected by the wrong setup => fixed.

#47 Updated by jlausuch about 1 month ago

Good to hear! Thanks a lot

#48 Updated by pvorel about 1 month ago

Tests fail again: https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200119&groupid=32

But this time it's something else https://openqa.opensuse.org/tests/1150521#

[2020-01-21T16:12:06.0882 CET] [info] +++ setup notes +++
[2020-01-21T16:12:06.0882 CET] [info] Start time: 2020-01-21 15:12:06
[2020-01-21T16:12:06.0882 CET] [info] Running on openqaworker1:7 (Linux 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64)
[2020-01-21T16:12:06.0890 CET] [info] Downloading opensuse-Tumbleweed-x86_64-20200119-DVD@64bit-with-ltp.qcow2, request #203793 sent to Cache Service
[2020-01-21T16:12:11.0950 CET] [info] Download of opensuse-Tumbleweed-x86_64-20200119-DVD@64bit-with-ltp.qcow2 processed
[2020-01-21T16:12:11.0985 CET] [info] +++ worker notes +++
[2020-01-21T16:12:11.0985 CET] [info] End time: 2020-01-21 15:12:11
[2020-01-21T16:12:11.0985 CET] [info] Result: setup failure
[2020-01-21T16:12:11.0998 CET] [info] Uploading autoinst-log.txt

#49 Updated by jlausuch about 1 month ago

Looks fine in latest build...

#50 Updated by pvorel 14 days ago

  • Related to action #63373: [o3][kernel][scheduler][x86_64] Dependent (child) jobs should start after uploading all of parent assets added
