Project

General

Profile

action #52451

test incompletes reproducibly in first_boot of krypton-live-install with "Seems like os-autoinst has produced a result which openQA can not display."

Added by okurz over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
-
Start date:
2019-05-31
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-5.12.80-Krypton-Live-x86_64-krypton-live-installation@64bit-2G fails in
first_boot

Test suite description

Boots the openSUSE Krypton/Argon Live DVD and uses the installer to install with default settings, then reboots into the installed system.

Reproducible

Fails since (at least) Build 5.145

Expected result

Last good: 5.144 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Project - action #52673: os-autoinst: Do not save "lastgood" snapshot on last module unless img is preserved with snapshot (e.g. --no-cleanup)Feedback2019-06-06

Related to openQA Infrastructure - action #53525: Upgrade o3 workers to Leap 15.1Resolved2019-06-262019-07-20

Related to openQA Project - action #77728: Test incompletes with auto_review:"(?s)Error in .*isotovideo: backen.* corrupted size vs. prev_size":retryResolved2020-11-11

History

#2 Updated by okurz over 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Category changed from Bugs in existing tests to Concrete Bugs

#3 Updated by okurz over 2 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz

trying to reproduce myself locally … -> http://lord.arch/tests/2062

#4 Updated by okurz over 2 years ago

all good now: https://openqa.opensuse.org/tests/948461

let's see if this holds, e.g. waiting for further builds

#5 Updated by okurz over 2 years ago

latest run does not exactly look better: https://openqa.opensuse.org/tests/950017 incompleted due to timeout and broken await_install, #52664 created for the specific problem.

#6 Updated by okurz over 2 years ago

ok, seems we are back to either the original problem or something very much related, https://openqa.opensuse.org/tests/950417/file/autoinst-log.txt shows:

[2019-06-05T21:52:44.014 CEST] [debug] Migrating total bytes:       2152538112
[2019-06-05T21:52:44.014 CEST] [debug] Migrating remaining bytes:       1837256704
[2019-06-05T21:52:44.525 CEST] [debug] sysread failed: Connection reset by peer
Use of uninitialized value in string eq at /usr/lib/os-autoinst/backend/qemu.pm line 249.
Use of uninitialized value in concatenation (.) or string at /usr/lib/os-autoinst/backend/qemu.pm line 253.
[2019-06-05T21:52:44.526 CEST] [debug] Migrating total bytes:       
Use of uninitialized value in concatenation (.) or string at /usr/lib/os-autoinst/backend/qemu.pm line 254.
[2019-06-05T21:52:44.526 CEST] [debug] Migrating remaining bytes:       
Use of uninitialized value in string eq at /usr/lib/os-autoinst/backend/qemu.pm line 263.
[2019-06-05T21:52:44.546 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
Can't syswrite(IO::Socket::UNIX=GLOB(0x564bfac93478), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 963

[2019-06-05T21:52:44.547 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-06-05T21:52:44.548 CEST] [debug] flushing frames
last frame
[2019-06-05T21:52:44.578 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-06-05T21:52:44.579 CEST] [debug] QEMU: Formatting '/var/lib/openqa/pool/2/raid/hd0-overlay1', fmt=qcow2 size=21474836480 backing_file=/var/lib/openqa/pool/2/raid/hd0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-06-05T21:52:44.579 CEST] [debug] QEMU: Formatting '/var/lib/openqa/pool/2/raid/cd0-overlay1', fmt=qcow2 size=1020264448 backing_file=/var/lib/openqa/pool/2/raid/cd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-06-05T21:52:44.579 CEST] [debug] QEMU: *** Error in `/usr/bin/qemu-system-x86_64': corrupted size vs. prev_size: 0x000056322a96f680 ***
[2019-06-05T21:52:44.580 CEST] [debug] sending magic and exit
[2019-06-05T21:52:44.581 CEST] [debug] received magic close
[2019-06-05T21:52:44.581 CEST] [debug] THERE IS NOTHING TO READ 15 4 3
[2019-06-05T21:52:44.582 CEST] [debug] killing command server 9237 because test execution ended
[2019-06-05T21:52:44.582 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20023/hj52iGQNg7DC7w6o/broadcast
[2019-06-05T21:52:44.601 CEST] [debug] backend process exited: 0
[2019-06-05T21:52:59.600 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 172.

[2019-06-05T21:53:00.601 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.

[2019-06-05T21:53:00.606 CEST] [debug] commands process exited: 0
[2019-06-05T21:53:01.607 CEST] [debug] done with command server
[2019-06-05T21:53:01.607 CEST] [debug] killing autotest process 9242
[2019-06-05T21:53:01.607 CEST] [debug] sysread failed: 
[2019-06-05T21:53:01.615 CEST] [debug] [autotest] process exited: 0
[2019-06-05T21:53:02.615 CEST] [debug] done with autotest process
[2019-06-05T21:53:02.615 CEST] [debug] isotovideo failed
[2019-06-05T21:53:02.616 CEST] [debug] killing backend process 9310
[2019-06-05T21:53:02.616 CEST] [debug] done with backend process
9231: EXIT 1

That message

QEMU: *** Error in `/usr/bin/qemu-system-x86_64': corrupted size vs. prev_size: 0x000056322a96f680 ***

very much sounds like a qemu bug however the "corrupted size vs. prev_size" itself comes from glibc.

A job from yesterday had

QEMU: *** Error in `/usr/bin/qemu-system-x86_64': malloc(): smallbin double linked list corrupted: 0x0000557b4ea753d0 ***

this one seems to have hit it not during any kind of snapshot action but just somewhere in between.

#7 Updated by okurz over 2 years ago

  • Related to action #52673: os-autoinst: Do not save "lastgood" snapshot on last module unless img is preserved with snapshot (e.g. --no-cleanup) added

#8 Updated by okurz over 2 years ago

https://patchwork.kernel.org/patch/10721403/ sounds like it could be related. Have not found more.

By the way, it feels a bit wasteful to record snapshots just before successfully finishing a test that prunes the worker pool directory just afterwards:

[2019-06-05T21:52:34.455 CEST] [debug] ||| finished first_boot installation at 2019-06-05 19:52:34 (37 s)
[2019-06-05T21:52:34.456 CEST] [debug] Creating a VM snapshot lastgood
…
[2019-06-05T21:52:35.000 CEST] [debug] Migrating total bytes:       2152538112
[2019-06-05T21:52:35.000 CEST] [debug] Migrating remaining bytes:       2130817024

-> #52673

Also we did not seem to have this problem about 3 months ago, e.g. see https://openqa.opensuse.org/tests/876852 and further tests in the history that did not fail.

As we are suspecting qemu the question is if the packages differ. however both the old job as well as new ones show the qemu version string as "QEMU: QEMU emulator version 2.11.2(openSUSE Leap 15.0)" so no difference there.

openqaworker4:~ # grep -i '|qemu|' /var/log/zypp/history 
2018-10-01 13:49:07|install|qemu|2.11.2-lp150.7.9.1|x86_64||repo-update|feef165faae3f66a39d5339db9994a2620160ac930172adee9f3a529265dd9da|
2018-12-07 00:33:21|install|qemu|2.11.2-lp150.7.12.1|x86_64||repo-update|eef9ee9e02c90d20b97e7d7520e3748565f77e42b651acc2307664f45c217b7d|
2018-12-16 00:50:35|install|qemu|2.11.2-lp150.7.15.1|x86_64||repo-update|1ec34cf5f0df53881fa7050ac1dc805112b52a73b6fe68283a2bdff153336d8a|
2019-02-28 00:11:59|install|qemu|2.11.2-lp150.7.18.1|x86_64||repo-update|4f8871d87248bba24314cb1b4f8bcb93bb0ffc73ee5c752bf83ffc4efa48b3d3|
2019-05-20 16:52:47|install|qemu|2.11.2-lp150.7.22.1|x86_64||repo-update|419e55de487213462e848818ef975db252b671a4327c4e515f079eae287ef478|

tells me that we had maintenance updates for qemu.

#9 Updated by okurz over 2 years ago

I created https://bugzilla.opensuse.org/show_bug.cgi?id=1137520

We could try to revert the qemu version on one or multiple of the workers to crosscheck, or test with Leap 15.1, etc.

I triggered 100 jobs on http://lord.arch/tests to check how it behaves there. On lord saving qcow snapshots was partially taking too long, probably due to kubernetes in the background acting up. I terminated kubernetes but also temporarily increased the timeout in backend/qemu.pm from 240s to 600s (did I restart the worker with this isotovideo? Dunno).

So far 80/100 passed or failed or timed out (incomplete) in another step but no backend crashes. I wonder if I can run a worker in an opensuse/leap:15.0 container and reproduce?

On lord.arch:

docker run -it --rm -l test_openqa_worker_boo1137520 --cap-add=SYS_ADMIN --security-opt apparmor:unconfined -v /etc/openqa:/mnt/etc/openqa:ro -v /var/lib/openqa/share:/var/lib/openqa/share:ro -v /dev/kvm:/dev/kvm opensuse/leap:15.0

with the options to allow mounting an overlayfs for /etc/openqa.

Then within the container:

mkdir -p /etc/openqa && mount -t overlay overlay -o lowerdir=/mnt/etc/openqa,upperdir=/etc/openqa,workdir=/tmp /etc/openqa && \
zypper ar -f -p 95 https://download.opensuse.org/repositories/devel:openQA/openSUSE_Leap_15.0/devel:openQA.repo && \
zypper ar -f -p 95 https://download.opensuse.org/repositories/devel:/openQA:/Leap:/15.0/openSUSE_Leap_15.0/devel:openQA:Leap:15.0.repo && \
zypper --gpg-auto-import-keys ref && zypper -n in openQA-worker qemu-x86 sudo
sed -i -e 's/localhost/lord.arch.suse.de/' /etc/openqa/{client.conf,workers.ini}
chown _openqa-worker /etc/openqa/client.conf
echo -e '\n[74]\nWORKER_CLASS = leap150' >> /etc/openqa/workers.ini
mkdir -p /var/lib/openqa/pool/74
chown -R _openqa-worker /var/lib/openqa/pool/74/
sudo -u _openqa-worker /usr/share/openqa/script/worker --host lord.arch.suse.de --instance 74

More failures on o3, e.g. https://openqa.opensuse.org/tests/954639/file/autoinst-log.txt shows "corrupted size vs. prev_size: 0x0000563c79678390"

lord.arch can be used again (after SRV2 ext climate control is fixed, no one told me though, I had to ask).

Without using the --privileged option on docker I could not run kvm acceleration, let's try with

docker run -it --rm -l test_openqa_worker_boo1137520 --cap-add=SYS_ADMIN --security-opt apparmor:unconfined -v /etc/openqa:/mnt/etc/openqa:ro -v /var/lib/openqa/share:/var/lib/openqa/share:ro -v /dev/kvm:/dev/kvm --privileged opensuse/leap:15.0

Changed the approach to build from a Dockerfile instead to skip the lengthy installation procedure:

docker build -t test_openqa_worker_boo1137520 . && docker run -it --rm -l test_openqa_worker_boo1137520 --cap-add=SYS_ADMIN --security-opt apparmor:unconfined -v /etc/openqa:/mnt/etc/openqa:ro -v /var/lib/openqa/share:/var/lib/openqa/share:ro -v /dev/kvm:/dev/kvm --privileged test_openqa_worker_boo1137520 --instance 1 --host http://lord.arch --apikey XXX --apisecret XXX --verbose

but I need to supply key and secret

I also pushed that image to the docker hub:

docker tag 42bfa9fbb658 okurz/test_openqa_worker_boo1137520 && docker push okurz/test_openqa_worker_boo1137520

http://lord.arch/tests/2287# is the first successful job, finished "soft-failed".

Found http://lord.arch/tests/2288/file/autoinst-log.txt ended up incomplete which could be for the same reason we are looking for in this ticket.

Triggered 100 jobs with

for i in {001..100}; do openqa-clone-job --skip-chained-deps --host http://lord.arch --from https://openqa.opensuse.org 966733 TIMEOUT_SCALE=10 WORKER_CLASS=leap150 TEST=krypton-live-installation_$i; done

Definitely I could reproduce problems which at least on the top level look like similar symptoms, e.g.:

all incompleting in "first_boot" with the question mark result. The logs like

[2019-06-25T19:04:43.252 UTC] [debug] >>> testapi::_handle_found_needle: found generic-desktop-kde-plasma-20190601, similarity 1.00 @ 2/733
[2019-06-25T19:05:03.068 UTC] [debug] Backend process died, backend errors are reported below in the following lines:
unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 186.

inform that there was a sudden loss of data transfer but not much more. No "glibc memory corruption error" at least.

Another interesting candidate is http://lord.arch/tests/2302# which is running for 11h already.

I guess I could upgrade the Leap 15.0 workers to 15.1 and see if the problem still appears and give up digging in the direction of Leap 15.1

#10 Updated by okurz over 2 years ago

  • Related to action #53525: Upgrade o3 workers to Leap 15.1 added

#11 Updated by okurz about 2 years ago

After upgrading e.g. openqaworker1 to Leap 15.1 but openqaworker is still on Leap 15.0 I can conduct some statistical investigation: I added the worker class of the current hostname for each worker, e.g. "WORKER_CLASS=openqaworker1" for openqaworker1 so that we can pin jobs to each version. Now we can trigger some tests on each:

for i in 1 4; do WORKER_CLASS=openqaworker$i; env end=020 openqa-clone-set https://openqa.opensuse.org/tests/970560 WORKER_CLASS=$WORKER_CLASS NAME=okurz_poo52451_$WORKER_CLASS; done

-> https://openqa.opensuse.org/tests/overview?version=5.12.80&build=WORKER_CLASS%3Dopenqaworker1&distri=opensuse and https://openqa.opensuse.org/tests/overview?version=5.12.80&build=WORKER_CLASS%3Dopenqaworker4&distri=opensuse

All failed on boo#1137047, created yet another workaround needle and retriggered.

#12 Updated by okurz about 2 years ago

The worker pinning did not work however I can still check all incompletes if openqaworker1 appears anywhere: 18/40 incompletes, all on openqaworker4 or imagetester, none on openqaworker1 which is openSUSE Leap 15.1 so my hypothesis is confirmed that this only happens with the older packages. I will reference this in the bug report now and go ahead to upgrade imagetester and openqaworker4 to openSUSE Leap 15.1 as a workaround for us :)

#13 Updated by okurz about 2 years ago

Reported in https://bugzilla.opensuse.org/show_bug.cgi?id=1137520#c2. Now I can continue with the upgrade in #53525

#14 Updated by okurz about 2 years ago

  • Status changed from Feedback to Resolved

krypton-live-installation does not incomplete anymore (because the test never reaches first_boot anymore), the bug is not needing any work from me (because no one cares to fix the potential qemu security flaw) and the workers are upgraded to Leap 15.1 or moved to a different WORKER_CLASS. I guess this is good enough.

#15 Updated by okurz 11 months ago

  • Related to action #77728: Test incompletes with auto_review:"(?s)Error in .*isotovideo: backen.* corrupted size vs. prev_size":retry added

Also available in: Atom PDF