action #51743

closed

[openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]

Added by pvorel over 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Infrastructure
Target version: QE Kernel - QE Kernel Done
Start date: 2019-05-21
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Reproducible

Failing on SLE12 SP5 and openSUSE.

The install_ltp tests run, but afterwards boot_ltp does not see the qcow2 image, so it does not find GRUB and tries to boot from PXE:
https://openqa.opensuse.org/tests/936921#step/boot_ltp/2
Tests on architectures other than x86_64 are OK:
https://openqa.opensuse.org/tests/936570#step/boot_ltp/1

Expected result

Last good:

Hint

QEMU complains "Unknown host!":

[2019-05-17T14:10:26.428 UTC] [debug] QEMU: QEMU emulator version 2.9.1(openSUSE Leap 42.3)
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!
[2019-05-17T14:10:26.428 UTC] [debug] QEMU: Unknown host!

NOTE: it cannot be reproduced outside of o3 (on osd or private worker).


Files

autoinst-log.936570.txt (730 KB) pvorel, 2019-05-21 12:36

Related issues 6 (1 open, 5 closed)

Related to openQA Tests - action #53294: [kernel][ltp] test fails in boot_ltp - incorrect kernel name provided (Closed, pcervinka, 2019-06-19)
Related to openQA Tests - coordination #61203: [kernel][ltp][epic][grub] General solution for handling kernel parameters (debug_pagealloc=on) (Resolved, pvorel, 2020-01-08)
Related to openQA Project - action #45836: [tools] qcow images mismatch in size (Rejected, Xiaojing_liu, 2019-01-08)
Related to openQA Project - action #34597: Race condition causing problems with the worker cache (Resolved, EDiGiacinto, 2018-05-11)
Related to openQA Project - action #13646: Ensuring asset files integrity (was: "An error occurred during the installation" on images) (Workable, 2016-09-09)
Related to openQA Tests - action #63373: [o3][kernel][scheduler][x86_64] Dependent (child) jobs should start after uploading all of parent assets (Resolved, pvorel, 2020-02-11)

Actions #1

Updated by pvorel over 5 years ago

14:46 < nsinger> pvorel: the previous (working) runs were running with qemu-system-x86_64 -cpu qemu64 and now it runs with "-cpu host".

Actions #2

Updated by SLindoMansilla over 5 years ago

  • Category set to Infrastructure

As a result of backlog triaging (see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging for more information).

Please feel free to adjust the category or the "[label]" if you think differently.

Actions #3

Updated by pvorel over 5 years ago

  • Priority changed from Normal to High
Actions #4

Updated by pvorel over 5 years ago

Broken build https://openqa.opensuse.org/tests/940933/file/autoinst-log.txt

Use of uninitialized value in string eq at /usr/lib/os-autoinst/OpenQA/Qemu/DriveDevice.pm line 115.
Use of uninitialized value in string eq at /usr/lib/os-autoinst/OpenQA/Qemu/DriveDevice.pm line 115.
[2019-05-23T20:46:39.683 CEST] [debug] starting: /usr/bin/qemu-system-x86_64 -vga cirrus -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw ac97 -global isa-fdc.driveA= -m 1536 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :104,share=force-shared -device virtio-serial -chardev socket,path=virtio_console,server,nowait,id=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/14/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/14/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0
Attempt 0 at /usr/lib/os-autoinst/osutils.pm line 130.
Attempt 1 at /usr/lib/os-autoinst/osutils.pm line 130.
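
To compare the CPU model between a working and a failing run (see the "-cpu qemu64" vs. "-cpu host" remark in the first comment), the option can be pulled out of the "starting:" line of autoinst-log.txt. This is only a minimal sketch: the helper name qemu_option is made up for illustration, and the sample command line is a shortened copy of the one logged above.

```python
import shlex


def qemu_option(command_line, option):
    """Return the value following `option` (e.g. "-cpu") in a QEMU command line, or None."""
    args = shlex.split(command_line)
    # Stop one argument early so that args[index + 1] always exists.
    for index, arg in enumerate(args[:-1]):
        if arg == option:
            return args[index + 1]
    return None


# Shortened copy of the logged command line; the full line has many more options.
logged = "/usr/bin/qemu-system-x86_64 -vga cirrus -m 1536 -cpu qemu64 -smp 1 -enable-kvm"
cpu_model = qemu_option(logged, "-cpu")
```

Running the same helper on the log of a good and a bad job makes it easy to see whether the two runs really used different CPU models.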
Actions #5

Updated by pvorel over 5 years ago

  • Subject changed from [ltp][kernel][opensuse] All LTP tests are failing on openSUSE on x86_64 to [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on x86_64
  • Description updated (diff)
Actions #6

Updated by pvorel over 5 years ago

  • Description updated (diff)
Actions #7

Updated by pvorel over 5 years ago

asmorodskyi found that the QCOW2 image on osd was broken.
Its size looks OK; only qemu-img check complains:

$ qemu-img check sle-12-SP5-x86_64-0187-Server-DVD@64bit-with-ltp.qcow2
qemu-img: This image format does not support checks

The QEMU version on the failing worker is 2.9.1 (openSUSE Leap 42.3), but the same version was probably also used by earlier builds that did not fail. On quasar.suse.cz, where I ran install_ltp and got a correct image, QEMU is 2.11.2 (SUSE Linux Enterprise 15). I really have no clue what produces the wrong qcow images.

Actions #8

Updated by pvorel over 5 years ago

On osd it helped to retrigger install_ltp.

Actions #9

Updated by pvorel over 5 years ago

Retriggering install_ltp helps on both osd and o3. I'm not sure what creates the wrong image.

Actions #10

Updated by pcervinka over 5 years ago

I rechecked the recent failed job https://openqa.suse.de/tests/2992911, downloaded the qcow2 image and tried to boot from it in a local virtual machine. Unfortunately it ended with a "no bootable device" error. This means the image really is corrupted, and we need to find where the corruption starts. The image may already be generated incorrectly on the worker, or it may be corrupted during upload from the worker.
Both cases can be difficult to troubleshoot.
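
One way to tell the two cases apart would be to compare a checksum of the image taken on the worker before upload with a checksum of the file that the dependent job later downloads. The following is only a sketch under that assumption: the function names are hypothetical, and nothing in this ticket says openQA computes such checksums itself.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a binary stream, reading it in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def same_content(stream_a, stream_b):
    """True if both binary streams contain exactly the same bytes."""
    return sha256_of_stream(stream_a) == sha256_of_stream(stream_b)


# In-memory stand-ins for the image on the worker and the downloaded copy.
on_worker = io.BytesIO(b"QFI\xfb" + b"\x00" * 100)
downloaded = io.BytesIO(b"\x00" * 104)
corrupted_in_transit = not same_content(on_worker, downloaded)
```

For real images the streams would be files opened with open(path, "rb"). Matching digests on the worker and a mismatch after download would point at the upload or caching path rather than at image generation.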

Actions #11

Updated by pcervinka over 5 years ago

@pvorel could you try setting the variable QEMU_DISABLE_SNAPSHOTS in the install_ltp test suite?

Actions #12

Updated by rpalethorpe over 5 years ago

Just a thought, but if this is related to the problems I have seen before, then this is due to how the qcow images are stored and distributed. A solution might be to put them in a Ceph object store and allow it to handle as much of the distribution and caching as possible.

Actions #13

Updated by pvorel over 5 years ago

  • Related to action #53294: [kernel][ltp] test fails in boot_ltp - incorrect kernel name provided added
Actions #14

Updated by okurz over 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: kernel-live-patching
https://openqa.opensuse.org/tests/973499

Actions #15

Updated by rpalethorpe over 5 years ago

Yeah, seems like the image on x86_64 o3 is always corrupt:

[2019-07-04T08:13:11.519 CEST] [debug] running /usr/bin/qemu-img info --output=json /var/lib/openqa/pool/2/opensuse-Tumbleweed-x86_64-20190703-DVD@64bit-with-ltp.qcow2
[2019-07-04T08:13:11.534 CEST] [debug] {
"virtual-size": 2322137088,
"filename": "/var/lib/openqa/pool/2/opensuse-Tumbleweed-x86_64-20190703-DVD@64bit-with-ltp.qcow2",
"format": "raw",
"actual-size": 2324410368,
"dirty-flag": false
}

Format should be qcow2.

[2019-07-04T09:19:53.726 UTC] [debug] {
"virtual-size": 32212254720,
"filename": "/var/lib/openqa/pool/6/opensuse-Tumbleweed-ppc64le-20190703-DVD@ppc64le-with-ltp.qcow2",
"cluster-size": 65536,
"format": "qcow2",
"actual-size": 2299990016,
"format-specific": {
"type": "qcow2",
"data": {
"compat": "1.1",
"lazy-refcounts": false,
"refcount-bits": 16,
"corrupt": false
}
},
"dirty-flag": false
}

The publishing job doesn't take any snapshots.
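
The format comparison above can also be done mechanically on the JSON that qemu-img info --output=json prints. A minimal sketch, assuming the JSON text has already been captured from the log; the function name is made up, and only keys visible in the two outputs above are used.

```python
import json


def image_format_problems(info_json, expected_format="qcow2"):
    """Return a list of problems found in `qemu-img info --output=json` output."""
    info = json.loads(info_json)
    problems = []
    actual = info.get("format")
    if actual != expected_format:
        problems.append(f"format is {actual!r}, expected {expected_format!r}")
    # "format-specific" is absent for raw images, so default to empty dicts.
    data = info.get("format-specific", {}).get("data", {})
    if data.get("corrupt"):
        problems.append("image is flagged corrupt")
    if info.get("dirty-flag"):
        problems.append("dirty flag is set")
    return problems


# Abridged copy of the x86_64 output quoted above.
broken = '{"virtual-size": 2322137088, "format": "raw", "actual-size": 2324410368, "dirty-flag": false}'
problems = image_format_problems(broken)
```

An empty list means that none of the checked fields looks suspicious; it does not prove that the image boots.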

Actions #16

Updated by okurz over 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_net_sctp
https://openqa.suse.de/tests/3074428

Actions #17

Updated by pcervinka over 5 years ago

If we combine the system installation with the LTP installation, we save one job and one image. Maybe we could create a combined test suite, create_hdd_ltp, which installs the system and LTP in one go. I see there was already an experiment along these lines, create_hdd_kotd_ltp, but it is not used.

What do you think?

Actions #18

Updated by pvorel over 5 years ago

For VM-based testing this worked OK, as we reused an already installed image. So of these three steps (1. install the OS, 2. install LTP, 3. run the test) I'd prefer to either join 2 and 3, and/or (only for IPMI) have all three steps in a single test suite.

For IPMI I'm planning to use Michie's approach (iPXE-based installation), where he's going to detect which SLES version has been installed (so we avoid installing it if not needed).

Actions #19

Updated by okurz over 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: kernel-live-patching
https://openqa.opensuse.org/tests/1001688

Actions #20

Updated by pvorel over 5 years ago

This was fixed on osd some time ago, but it is still broken on o3.

Actions #21

Updated by okurz about 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_net_ipv6_lib
https://openqa.suse.de/tests/3308930

Actions #22

Updated by rpalethorpe about 5 years ago

  • Subject changed from [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on x86_64 to [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on [x86_64]

Add arch tag for JDP.

Actions #23

Updated by pvorel about 5 years ago

  • Subject changed from [ltp][kernel] All LTP tests are failing on SLE12 SP5 (osd) and openSUSE (o3) on [x86_64] to [ltp][kernel] All LTP tests are failing on openSUSE (o3) on [x86_64]

Tests are failing just on Tumbleweed.

Actions #24

Updated by jlausuch about 5 years ago

  • Parent task set to #58685
Actions #25

Updated by jlausuch about 5 years ago

  • Subject changed from [ltp][kernel] All LTP tests are failing on openSUSE (o3) on [x86_64] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]
Actions #26

Updated by pvorel almost 5 years ago

  • Assignee set to pvorel
Actions #27

Updated by pvorel almost 5 years ago

  • Target version set to 445
Actions #28

Updated by pvorel almost 5 years ago

  • Status changed from New to In Progress
Actions #29

Updated by pvorel almost 5 years ago

  • Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64][ppc64le]
Actions #30

Updated by pvorel almost 5 years ago

There are other failures on ppc, maybe related https://openqa.suse.de/tests/3723347#next_previous (first fail: 4.12.14-146.1.ge31b461 https://openqa.suse.de/tests/3696777)

Actions #31

Updated by pvorel almost 5 years ago

  • Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64][ppc64le] to [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]

ppc failures are caused by https://bugzilla.suse.com/show_bug.cgi?id=1159096 (on both osd and o3; checked by inspecting qcow2 image).

Actions #32

Updated by pvorel almost 5 years ago

  • Related to coordination #61203: [kernel][ltp][epic][grub] General solution for handling kernel parameters (debug_pagealloc=on) added
Actions #33

Updated by pvorel almost 5 years ago

Hm, it's hard to debug this problem on o3 :(. I tried several times to restart a job and watch it on o3. Mostly I waited several hours and the job got restarted while I was away. Today the job got restarted, but I have no SSH access. According to okurz, SSH access to openqa.opensuse.org is down, see https://progress.opensuse.org/issues/61218.

Actions #34

Updated by pvorel almost 5 years ago

Still haven't found the root cause of the problem.

Actions #35

Updated by jlausuch almost 5 years ago

can you reproduce it locally?

Actions #36

Updated by pvorel almost 5 years ago

  • Description updated (diff)
Actions #37

Updated by pvorel almost 5 years ago

jlausuch wrote:

can you reproduce it locally?

No, that's the hardest problem with this ticket (I thought I had reported it, but I hadn't), together with how busy o3 is (it's hard to reschedule the job).
At least my PR, which could help with debugging a bit, has been merged: https://github.com/os-autoinst/os-autoinst/pull/1327.

Actions #38

Updated by pvorel almost 5 years ago

Hm, but looking at the log (https://openqa.opensuse.org/tests/1142577/file/autoinst-log.txt) of a failing Intel build (https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200112&groupid=32), the published qcow2 image has the qcow2 format, so the problem is somewhere else :(

[2020-01-14T02:18:29.943 CET] [debug] running nice ionice qemu-img convert -c -O qcow2 /var/lib/openqa/pool/4/raid/hd0-overlay0 assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2
[2020-01-14T02:20:56.864 CET] [debug] running qemu-img info --output=json assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2
[2020-01-14T02:20:56.876 CET] [debug] {
"virtual-size": 32212254720,
"filename": "assets_public/opensuse-Tumbleweed-x86_64-20200112-DVD@64bit-with-ltp.qcow2",
"cluster-size": 65536,
"format": "qcow2",
"actual-size": 1103364096,
"format-specific": {
"type": "qcow2",
"data": {
"compat": "1.1",
"lazy-refcounts": false,
"refcount-bits": 16,
"corrupt": false
}
},
"dirty-flag": false
}
Actions #39

Updated by rpalethorpe almost 5 years ago

To clarify: qemu-img shows that the qcow2 is valid after install-ltp completes, but when dependent test starts the image is invalid.

So the image is valid before it is uploaded as an asset, but is invalid by the time we download it.

Using hexdump -n 1M -C:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00040000

So the file is probably just all zeroes, but has the correct file length. For reference, it should look more like:

hexdump -C -n 256K qa/runltp-support/ldisc-syzkaller.qcow2
00000000  51 46 49 fb 00 00 00 03  00 00 00 00 00 00 01 38  |QFI............8|
00000010  00 00 00 0b 00 00 00 10  00 00 00 0c 80 00 00 00  |................|
00000020  00 00 00 00 00 00 00 64  00 00 00 00 00 03 00 00  |.......d........|
00000030  00 00 00 00 00 01 00 00  00 00 00 01 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
...

There do not appear to be any errors during uploading. However, it is possible that openQA is creating the file but not writing to it correctly. Alternatively, something could incorrectly copy the file after it has been uploaded, as part of the asset caching.
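
The hexdump comparison above boils down to two checks on the first bytes of the file: does it start with the magic bytes 51 46 49 fb seen in the valid image, and is the head nothing but zeroes? A small sketch of that check follows; the function name and the returned labels are made up for illustration.

```python
QCOW_MAGIC = b"QFI\xfb"  # first four bytes of the valid image dumped above (51 46 49 fb)


def classify_image_head(head):
    """Classify the leading bytes of an image file as 'qcow', 'all-zero' or 'unknown'."""
    if head.startswith(QCOW_MAGIC):
        return "qcow"
    # any() over bytes is True as soon as one byte is non-zero.
    if head and not any(head):
        return "all-zero"
    return "unknown"


# First 16 bytes of the valid image above, and a zero-filled head like the broken one.
valid_head = bytes.fromhex("514649fb000000030000000000000138")
broken_head = b"\x00" * 0x40000
```

For a real file the head would come from open(path, "rb").read(0x40000). An "all-zero" result for a file with the right length matches the symptom described here: the asset exists, but its content was never written.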

Actions #40

Updated by rpalethorpe almost 5 years ago

  • Related to action #45836: [tools] qcow images mismatch in size added
Actions #41

Updated by pvorel almost 5 years ago

rpalethorpe wrote:

To clarify: qemu-img shows that the qcow2 is valid after install-ltp completes, but when dependent test starts the image is invalid.

So the image is valid before it is uploaded as an asset, but is invalid by the time we download it.

Yep, I found that as well before, but just didn't believe it could be possible (so I planned to investigate it more). mdoucha found that both install_ltp+opensuse+DVD and install_ltp+opensuse+DVD-m32 used the same PUBLISH_HDD_1 and PUBLISH_PFLASH_VARS variables (thanks Martin!). While PUBLISH_PFLASH_VARS might not be a problem, PUBLISH_HDD_1 certainly is. I restarted the jobs with correct variables, let's see.
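
A clash like this can be spotted before jobs run by grouping test suites by the value of their publish variables. The sketch below assumes the settings are available as plain dictionaries; the function name and the file names in the example are invented, and only the suite and variable names come from this ticket.

```python
from collections import defaultdict


def find_shared_publish_targets(suites, keys=("PUBLISH_HDD_1", "PUBLISH_PFLASH_VARS")):
    """Map (setting, value) to the suite names sharing it, for values used more than once."""
    users = defaultdict(list)
    for name, settings in suites.items():
        for key in keys:
            value = settings.get(key)
            if value:
                users[(key, value)].append(name)
    return {target: names for target, names in users.items() if len(names) > 1}


# Hypothetical settings: the file names are placeholders, not the real o3 values.
suites = {
    "install_ltp+opensuse+DVD": {"PUBLISH_HDD_1": "example-with-ltp.qcow2"},
    "install_ltp+opensuse+DVD-m32": {"PUBLISH_HDD_1": "example-with-ltp.qcow2"},
    "some_other_suite": {"PUBLISH_HDD_1": "example-other.qcow2"},
}
clashes = find_shared_publish_targets(suites)
```

Any entry in the result means two jobs may write the same asset name, so whichever upload finishes last (or is still in progress) decides what the dependent job downloads.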

Actions #42

Updated by rpalethorpe almost 5 years ago

  • Related to action #34597: Race condition causing problems with the worker cache added
Actions #43

Updated by rpalethorpe almost 5 years ago

  • Subject changed from [kernel][ltp] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64] to [openqa] All LTP tests are failing on boot_ltp for openSUSE (o3) on [x86_64]
Actions #44

Updated by rpalethorpe almost 5 years ago

  • Related to action #13646: Ensuring asset files integrity (was: "An error occurred during the installation" on images) added
Actions #45

Updated by pvorel almost 5 years ago

  • Status changed from In Progress to Feedback

Build 20200113 is ok :).
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200113&groupid=32
It looks like the wrong setup of HDD_1 really was causing this problem.
Let's wait a few more builds to be sure.

Actions #46

Updated by pvorel almost 5 years ago

  • Status changed from Feedback to Resolved

Builds 20200114 and 20200115 are also OK. The problem really was just on Intel, which was the only architecture affected by the wrong setup => fixed.

Actions #47

Updated by jlausuch almost 5 years ago

Good to hear! Thanks a lot

Actions #48

Updated by pvorel almost 5 years ago

Tests fail again: https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200119&groupid=32

But this time it's something else https://openqa.opensuse.org/tests/1150521#

[2020-01-21T16:12:06.0882 CET] [info] +++ setup notes +++
[2020-01-21T16:12:06.0882 CET] [info] Start time: 2020-01-21 15:12:06
[2020-01-21T16:12:06.0882 CET] [info] Running on openqaworker1:7 (Linux 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64)
[2020-01-21T16:12:06.0890 CET] [info] Downloading opensuse-Tumbleweed-x86_64-20200119-DVD@64bit-with-ltp.qcow2, request #203793 sent to Cache Service
[2020-01-21T16:12:11.0950 CET] [info] Download of opensuse-Tumbleweed-x86_64-20200119-DVD@64bit-with-ltp.qcow2 processed
[2020-01-21T16:12:11.0985 CET] [info] +++ worker notes +++
[2020-01-21T16:12:11.0985 CET] [info] End time: 2020-01-21 15:12:11
[2020-01-21T16:12:11.0985 CET] [info] Result: setup failure
[2020-01-21T16:12:11.0998 CET] [info] Uploading autoinst-log.txt

Actions #49

Updated by jlausuch almost 5 years ago

Looks fine in latest build...

Actions #50

Updated by pvorel almost 5 years ago

  • Related to action #63373: [o3][kernel][scheduler][x86_64] Dependent (child) jobs should start after uploading all of parent assets added
Actions #51

Updated by metan over 4 years ago

  • Target version changed from 445 to 457
Actions #52

Updated by pcervinka about 4 years ago

  • Target version changed from 457 to QE Kernel Done