action #104304
closedopenQA Project (public) - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M
0%
Description
Acceptance criteria¶
- AC1: Reliable results exist from all arm-1/2/3 and arm-4/5 from https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment with both KVM and without
Suggestions¶
- Run https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment, e.g. using
for i in 1 2 3 4 5; do ssh openqaworker-arm-$i "…"; :one
and compare results - Try to extend the command with access to KVM, e.g.
podman -v /dev/kvm …
and patch out https://github.com/os-autoinst/os-autoinst/blob/master/t/99-full-stack.t#L38"QEMU_NO_KVM" : "1",
on the fly, e.g. usingsed -i '/QEMU_NO_KVM/d' /opt/t/99-full-stack.t
on the podman command line. Compare results to the above w/o KVM - As necessary run multiple times, e.g. use https://github.com/okurz/scripts/blob/master/count_fail_ratio in combination with the above commands
- Make sense of all the results what this means regarding typing stability of arm-4+5 :)
Updated by okurz almost 3 years ago
- Project changed from openQA Project (public) to openQA Infrastructure (public)
- Target version set to Ready
Updated by okurz almost 3 years ago
- Subject changed from Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" to Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 3 years ago
- Status changed from Workable to In Progress
I've not come very far:
martchus@openqa:~> sudo salt 'openqaworker-arm*' cmd.run 'bash -c "ok=0; for i in {1..5}; do echo \"i: \$i\" && podman run --pull=always --rm -it --entrypoint \"\" registry.opensuse.org/devel/openqa/containers/os-autoinst_dev:latest /bin/sh -c \"git -C /opt clone --depth 1 https://github.com/os-autoinst/os-autoinst && make -C /opt/os-autoinst/ test-perl-testsuite TESTS=t/99-full-stack.t\" && ok=$((ok + 1)); done; echo \"ok: \$ok of 5\" "'
openqaworker-arm-2.suse.de:
i: 1
Trying to pull registry.opensuse.org/devel/openqa/containers/os-autoinst_dev:latest...
Getting image source signatures
Copying blob sha256:32d063c8d4f5f084c9410a80820cbdbcea5072ad6960ccf906d74eb4c910b485
Copying blob sha256:0c9b09d8f0b99b7a3c2d3f9eb86e60098ec27e3f3a9f76ee12cfe66f8569db6a
Copying config sha256:1dc3c70f60a100e83d2ae29414f4ade5f268cf4771ec2fe9307b55ac477979c9
Writing manifest to image destination
Storing signatures
standard_init_linux.go:228: exec user process caused: exec format error
It works locally with Docker. I suppose the podman setup is broken on those workers.
Updated by openqa_review almost 3 years ago
- Due date set to 2022-01-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler almost 3 years ago
The problem was that we ceased to build the container image for aarch64. I re-enabled the build and with the current image it generally runs. However, I've ran into the next issue:
# [2022-01-11T11:42:55.496453Z] [debug] <<< testapi::assert_script_run(cmd="wget -q http://10.0.2.2:15223/nOlEDGIpPA/files/modified.xml", fail_message="", quiet=undef, timeout=90)
# [2022-01-11T11:42:55.496999Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
# [2022-01-11T11:42:55.497679Z] [debug] <<< testapi::type_string(string="wget -q http://10.0.2.2:15223/nOlEDGIpPA/files/modified.xml", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-01-11T11:42:57.796942Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
# [2022-01-11T11:42:57.797923Z] [debug] <<< testapi::type_string(string="; echo 8I~_t-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-01-11T11:42:59.034733Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
# [2022-01-11T11:42:59.035608Z] [debug] <<< testapi::wait_serial(regexp=qr/8I~_t-\d+-/u, record_output=undef, buffer_size=undef, timeout=90, quiet=undef, no_regex=0, expect_not_found=0)
# [2022-01-11T11:43:01.330864Z] [debug] >>> testapi::wait_serial: (?^u:8I~_t-\d+-): ok
# [2022-01-11T11:43:01.620512Z] [info] ::: basetest::runtest: # Test died: command 'wget -q http://10.0.2.2:15223/nOlEDGIpPA/files/modified.xml' failed at /opt/os-autoinst/t/data//tests/tests/modify_and_upload_file.pm line 33.
This exact same error is happening consistently on all ARM workers I've tested it so far on (but does not happen locally or on an x86_64 worker¹).
¹ In the good case the test log looks like this:
[2022-01-11T11:26:37.058618Z] [debug] ||| starting modify_and_upload_file tests/modify_and_upload_file.pm
[2022-01-11T11:26:37.221007Z] [debug] tests/modify_and_upload_file.pm:24 called testapi::get_test_data
[2022-01-11T11:26:37.221396Z] [debug] <<< testapi::get_test_data(path="/opt/os-autoinst/t/data//tests/data/autoinst.xml")
[2022-01-11T11:26:37.227357Z] [debug] tests/modify_and_upload_file.pm:26 called testapi::type_string
[2022-01-11T11:26:37.228021Z] [debug] <<< testapi::type_string(string="echo get_test_data returned expected file\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:38.707440Z] [debug] tests/modify_and_upload_file.pm:31 called testapi::save_tmp_file
[2022-01-11T11:26:38.707822Z] [debug] <<< testapi::save_tmp_file(path="modified.xml")
[2022-01-11T11:26:38.708649Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:38.709230Z] [debug] <<< testapi::assert_script_run(cmd="wget -q http://10.0.2.2:15223/brRAHMeVWk/files/modified.xml", quiet=undef, timeout=90, fail_message="")
[2022-01-11T11:26:38.709738Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:38.710503Z] [debug] <<< testapi::type_string(string="wget -q http://10.0.2.2:15223/brRAHMeVWk/files/modified.xml", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:40.976618Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:40.977477Z] [debug] <<< testapi::type_string(string="; echo mzNDB-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:42.173453Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:42.173889Z] [debug] <<< testapi::wait_serial(regexp=qr/mzNDB-\d+-/u, no_regex=0, timeout=90, quiet=undef, record_output=undef, buffer_size=undef, expect_not_found=0)
[2022-01-11T11:26:44.275461Z] [debug] >>> testapi::wait_serial: (?^u:mzNDB-\d+-): ok
[2022-01-11T11:26:44.275689Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:44.275916Z] [debug] <<< testapi::script_run(cmd="echo '72d2c15cb10535f36862d7d2eecc8a79 modified.xml' > modified.md5", timeout=30, quiet=undef, output="", die_on_timeout=-1)
[2022-01-11T11:26:44.276071Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:44.276295Z] [debug] <<< testapi::type_string(string="echo '72d2c15cb10535f36862d7d2eecc8a79 modified.xml' > modified.md5", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:46.705473Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:46.706233Z] [debug] <<< testapi::type_string(string="; echo ~liyM-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:47.868134Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:47.868980Z] [debug] <<< testapi::wait_serial(no_regex=0, regexp=qr/~liyM-\d+-/u, quiet=undef, timeout=30, record_output=undef, expect_not_found=0, buffer_size=undef)
[2022-01-11T11:26:48.961609Z] [debug] >>> testapi::wait_serial: (?^u:~liyM-\d+-): ok
[2022-01-11T11:26:48.961772Z] [warn] !!! testapi::script_run: DEPRECATED call of script_run() in tests/modify_and_upload_file.pm:34 add `die_on_timeout => ?` to avoid this warning
[2022-01-11T11:26:48.961938Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:48.962134Z] [debug] <<< testapi::assert_script_run(cmd="md5sum -c modified.md5", quiet=undef, timeout=90, fail_message="")
[2022-01-11T11:26:48.962285Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:48.962461Z] [debug] <<< testapi::type_string(string="md5sum -c modified.md5", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:49.725770Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:49.726529Z] [debug] <<< testapi::type_string(string="; echo ib~GQ-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:50.912502Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:50.912916Z] [debug] <<< testapi::wait_serial(record_output=undef, expect_not_found=0, buffer_size=undef, regexp=qr/ib~GQ-\d+-/u, no_regex=0, quiet=undef, timeout=90)
[2022-01-11T11:26:52.013581Z] [debug] >>> testapi::wait_serial: (?^u:ib~GQ-\d+-): ok
[2022-01-11T11:26:52.013763Z] [debug] tests/modify_and_upload_file.pm:37 called testapi::type_string
[2022-01-11T11:26:52.013987Z] [debug] <<< testapi::type_string(string="echo save_tmp_file returned expected file\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:53.521354Z] [debug] ||| finished modify_and_upload_file tests (runtime: 16 s)
The errors
[2022-01-11T11:26:21.791426Z] [warn] !!! consoles::VNC::login: Error connecting to VNC server <novnc.nowhere:5901>: IO::Socket::INET: Bad hostname 'novnc.nowhere'
in the previous test module are also happening in the good case.
Updated by mkittler almost 3 years ago
The test actually passes on arm 4 and 5 (and only fails on arm 1, 2 and 3). The results are at least consistent within each host (I've run the test at least 5 times on each host).
Updated by mkittler almost 3 years ago
Try to extend the command with access to KVM…
I don't think that will lead anywhere because the image we're using in the fullstack test is i368 which we really need to emulate on aarch64. Also note that "QEMU_NO_KVM" : "1"
is no longer present in the test settings so there's nothing to patch out.
Updated by mkittler almost 3 years ago
Out of curiosity I've just tried adding -v /dev/kvm
to podman
args which in turn leads to -enable-kvm
being added to QEMU args which leads to the error qemu-system-i386: invalid accelerator kvm
.
Updated by mkittler almost 3 years ago
I give up on figuring out why it isn't working on arm-1, 2 and 3. We mainly want to find out whether the typing problems we see on arm-4 and 5 in production can be reproduced with the os-autoinst fullstack test. So far I couldn't reproduce a single occurrence. I'll let it run in a loop 10 more times on each host¹. As stated before, using KVM isn't possible using this approach. (There would be arm ports of tinycore but using them is likely out of scope for this issue.)
¹ Still couldn't reproduce it.
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Feedback
Still not reproducible after 60 more runs. I suppose running the fullstack test is not a way to reproduce typing issues.
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
I'd like to conclude this ticket. I cannot exclude that the typing issues are happening more often when KVM is enabled. That actually seems likely considering the typing issues could not be reproduced without KVM. Maybe it isn't actually KVM but simply the faster execution is provides.
Updated by okurz almost 3 years ago
Please provide an update in the parent epic what this means and what we should do as next step