Project

General

Profile

Actions

action #104304

closed

openQA Project (public) - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2021-12-22
Due date:
% Done:

0%

Estimated time:

Description

Acceptance criteria

Suggestions

Actions #1

Updated by okurz almost 3 years ago

  • Project changed from openQA Project (public) to openQA Infrastructure (public)
  • Target version set to Ready
Actions #2

Updated by okurz almost 3 years ago

  • Subject changed from Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" to Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by mkittler almost 3 years ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler almost 3 years ago

  • Status changed from Workable to In Progress

I've not come very far:

martchus@openqa:~> sudo salt 'openqaworker-arm*' cmd.run 'bash -c "ok=0; for i in {1..5}; do echo \"i: \$i\" && podman run --pull=always --rm -it --entrypoint \"\" registry.opensuse.org/devel/openqa/containers/os-autoinst_dev:latest /bin/sh -c \"git -C /opt clone --depth 1 https://github.com/os-autoinst/os-autoinst && make -C /opt/os-autoinst/ test-perl-testsuite TESTS=t/99-full-stack.t\" && ok=$((ok + 1)); done; echo \"ok: \$ok of 5\" "'
openqaworker-arm-2.suse.de:
    i: 1
    Trying to pull registry.opensuse.org/devel/openqa/containers/os-autoinst_dev:latest...
    Getting image source signatures
    Copying blob sha256:32d063c8d4f5f084c9410a80820cbdbcea5072ad6960ccf906d74eb4c910b485
    Copying blob sha256:0c9b09d8f0b99b7a3c2d3f9eb86e60098ec27e3f3a9f76ee12cfe66f8569db6a
    Copying config sha256:1dc3c70f60a100e83d2ae29414f4ade5f268cf4771ec2fe9307b55ac477979c9
    Writing manifest to image destination
    Storing signatures
    standard_init_linux.go:228: exec user process caused: exec format error

It works locally with Docker. I suppose the podman setup is broken on those workers.

Actions #5

Updated by openqa_review almost 3 years ago

  • Due date set to 2022-01-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler almost 3 years ago

The problem was that we ceased to build the container image for aarch64. I re-enabled the build and with the current image it generally runs. However, I've ran into the next issue:

# [2022-01-11T11:42:55.496453Z] [debug] <<< testapi::assert_script_run(cmd="wget -q http://10.0.2.2:15223/nOlEDGIpPA/files/modified.xml", fail_message="", quiet=undef, timeout=90)
# [2022-01-11T11:42:55.496999Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
# [2022-01-11T11:42:55.497679Z] [debug] <<< testapi::type_string(string="wget -q http://10.0.2.2:15223/nOlEDGIpPA/files/modified.xml", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-01-11T11:42:57.796942Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
# [2022-01-11T11:42:57.797923Z] [debug] <<< testapi::type_string(string="; echo 8I~_t-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
# [2022-01-11T11:42:59.034733Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
# [2022-01-11T11:42:59.035608Z] [debug] <<< testapi::wait_serial(regexp=qr/8I~_t-\d+-/u, record_output=undef, buffer_size=undef, timeout=90, quiet=undef, no_regex=0, expect_not_found=0)
# [2022-01-11T11:43:01.330864Z] [debug] >>> testapi::wait_serial: (?^u:8I~_t-\d+-): ok
# [2022-01-11T11:43:01.620512Z] [info] ::: basetest::runtest: # Test died: command 'wget -q http://10.0.2.2:15223/nOlEDGIpPA/files/modified.xml' failed at /opt/os-autoinst/t/data//tests/tests/modify_and_upload_file.pm line 33.

This exact same error is happening consistently on all ARM workers I've tested it so far on (but does not happen locally or on an x86_64 worker¹).


¹ In the good case the test log looks like this:

[2022-01-11T11:26:37.058618Z] [debug] ||| starting modify_and_upload_file tests/modify_and_upload_file.pm
[2022-01-11T11:26:37.221007Z] [debug] tests/modify_and_upload_file.pm:24 called testapi::get_test_data
[2022-01-11T11:26:37.221396Z] [debug] <<< testapi::get_test_data(path="/opt/os-autoinst/t/data//tests/data/autoinst.xml")
[2022-01-11T11:26:37.227357Z] [debug] tests/modify_and_upload_file.pm:26 called testapi::type_string
[2022-01-11T11:26:37.228021Z] [debug] <<< testapi::type_string(string="echo get_test_data returned expected file\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:38.707440Z] [debug] tests/modify_and_upload_file.pm:31 called testapi::save_tmp_file
[2022-01-11T11:26:38.707822Z] [debug] <<< testapi::save_tmp_file(path="modified.xml")
[2022-01-11T11:26:38.708649Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:38.709230Z] [debug] <<< testapi::assert_script_run(cmd="wget -q http://10.0.2.2:15223/brRAHMeVWk/files/modified.xml", quiet=undef, timeout=90, fail_message="")
[2022-01-11T11:26:38.709738Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:38.710503Z] [debug] <<< testapi::type_string(string="wget -q http://10.0.2.2:15223/brRAHMeVWk/files/modified.xml", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:40.976618Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:40.977477Z] [debug] <<< testapi::type_string(string="; echo mzNDB-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:42.173453Z] [debug] tests/modify_and_upload_file.pm:33 called testapi::assert_script_run
[2022-01-11T11:26:42.173889Z] [debug] <<< testapi::wait_serial(regexp=qr/mzNDB-\d+-/u, no_regex=0, timeout=90, quiet=undef, record_output=undef, buffer_size=undef, expect_not_found=0)
[2022-01-11T11:26:44.275461Z] [debug] >>> testapi::wait_serial: (?^u:mzNDB-\d+-): ok
[2022-01-11T11:26:44.275689Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:44.275916Z] [debug] <<< testapi::script_run(cmd="echo '72d2c15cb10535f36862d7d2eecc8a79  modified.xml' > modified.md5", timeout=30, quiet=undef, output="", die_on_timeout=-1)
[2022-01-11T11:26:44.276071Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:44.276295Z] [debug] <<< testapi::type_string(string="echo '72d2c15cb10535f36862d7d2eecc8a79  modified.xml' > modified.md5", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:46.705473Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:46.706233Z] [debug] <<< testapi::type_string(string="; echo ~liyM-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:47.868134Z] [debug] tests/modify_and_upload_file.pm:34 called testapi::script_run
[2022-01-11T11:26:47.868980Z] [debug] <<< testapi::wait_serial(no_regex=0, regexp=qr/~liyM-\d+-/u, quiet=undef, timeout=30, record_output=undef, expect_not_found=0, buffer_size=undef)
[2022-01-11T11:26:48.961609Z] [debug] >>> testapi::wait_serial: (?^u:~liyM-\d+-): ok
[2022-01-11T11:26:48.961772Z] [warn] !!! testapi::script_run: DEPRECATED call of script_run() in tests/modify_and_upload_file.pm:34 add `die_on_timeout => ?` to avoid this warning
[2022-01-11T11:26:48.961938Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:48.962134Z] [debug] <<< testapi::assert_script_run(cmd="md5sum -c modified.md5", quiet=undef, timeout=90, fail_message="")
[2022-01-11T11:26:48.962285Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:48.962461Z] [debug] <<< testapi::type_string(string="md5sum -c modified.md5", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:49.725770Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:49.726529Z] [debug] <<< testapi::type_string(string="; echo ib~GQ-\$?- > /dev/ttyS0\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:50.912502Z] [debug] tests/modify_and_upload_file.pm:35 called testapi::assert_script_run
[2022-01-11T11:26:50.912916Z] [debug] <<< testapi::wait_serial(record_output=undef, expect_not_found=0, buffer_size=undef, regexp=qr/ib~GQ-\d+-/u, no_regex=0, quiet=undef, timeout=90)
[2022-01-11T11:26:52.013581Z] [debug] >>> testapi::wait_serial: (?^u:ib~GQ-\d+-): ok
[2022-01-11T11:26:52.013763Z] [debug] tests/modify_and_upload_file.pm:37 called testapi::type_string
[2022-01-11T11:26:52.013987Z] [debug] <<< testapi::type_string(string="echo save_tmp_file returned expected file\n", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2022-01-11T11:26:53.521354Z] [debug] ||| finished modify_and_upload_file tests (runtime: 16 s)

The errors

[2022-01-11T11:26:21.791426Z] [warn] !!! consoles::VNC::login: Error connecting to VNC server <novnc.nowhere:5901>: IO::Socket::INET: Bad hostname 'novnc.nowhere'

in the previous test module are also happening in the good case.

Actions #7

Updated by mkittler almost 3 years ago

The test actually passes on arm 4 and 5 (and only fails on arm 1, 2 and 3). The results are at least consistent within each host (I've run the test at least 5 times on each host).

Actions #8

Updated by mkittler almost 3 years ago

Try to extend the command with access to KVM…

I don't think that will lead anywhere because the image we're using in the fullstack test is i368 which we really need to emulate on aarch64. Also note that "QEMU_NO_KVM" : "1" is no longer present in the test settings so there's nothing to patch out.

Actions #9

Updated by mkittler almost 3 years ago

Out of curiosity I've just tried adding -v /dev/kvm to podman args which in turn leads to -enable-kvm being added to QEMU args which leads to the error qemu-system-i386: invalid accelerator kvm.

Actions #10

Updated by mkittler almost 3 years ago

I give up on figuring out why it isn't working on arm-1, 2 and 3. We mainly want to find out whether the typing problems we see on arm-4 and 5 in production can be reproduced with the os-autoinst fullstack test. So far I couldn't reproduce a single occurrence. I'll let it run in a loop 10 more times on each host¹. As stated before, using KVM isn't possible using this approach. (There would be arm ports of tinycore but using them is likely out of scope for this issue.)

¹ Still couldn't reproduce it.

Actions #11

Updated by mkittler almost 3 years ago

  • Status changed from In Progress to Feedback

Still not reproducible after 60 more runs. I suppose running the fullstack test is not a way to reproduce typing issues.

Actions #12

Updated by mkittler almost 3 years ago

  • Status changed from Feedback to Resolved

I'd like to conclude this ticket. I cannot exclude that the typing issues are happening more often when KVM is enabled. That actually seems likely considering the typing issues could not be reproduced without KVM. Maybe it isn't actually KVM but simply the faster execution is provides.

Actions #13

Updated by okurz almost 3 years ago

Please provide an update in the parent epic what this means and what we should do as next step

Actions #14

Updated by okurz almost 3 years ago

  • Due date deleted (2022-01-25)
Actions #15

Updated by okurz almost 3 years ago

  • Priority changed from Normal to High
Actions

Also available in: Atom PDF