action #113396
closed[qe-core]test fails in logs_from_installation_system due to 'wait_countdown_stop' function doesn't work fine, performance issue?
100%
Description
Observation¶
openQA test in scenario sle-15-Server-DVD-Updates-aarch64-qam-minimal+base@aarch64-virtio fails in
logs_from_installation_system
Test suite description¶
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
Reproducible¶
Fails since (at least) Build 20220707-1
Expected result¶
Last good: 20220706-1 (or more recent)
Further details¶
Always latest result in this scenario: latest
ACTION¶
From the autoinst-log, I can see in test 'await_install',
'wait_countdown_stop' function has sent key "alt-s" to stop the reboot process there, however, it seems not work:
[2022-07-08T01:02:43.025678+02:00] [debug] tests/installation/await_install.pm:197 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:45 called testapi::send_key
[2022-07-08T01:02:43.026211+02:00] [debug] <<< testapi::send_key(key="alt-s", do_wait=0, wait_screen_change=0)
[2022-07-08T01:02:43.369296+02:00] [debug] tests/installation/await_install.pm:197 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:46 called testapi::wait_screen_change
[2022-07-08T01:02:43.370001+02:00] [debug] <<< testapi::wait_screen_change(timeout=3, similarity_level=50, similarity=55)
[2022-07-08T01:02:43.386041+02:00] [debug] waiting for screen change: 0 1000000
[2022-07-08T01:02:43.898563+02:00] [debug] waiting for screen change: 0 1000000
[2022-07-08T01:02:44.410907+02:00] [debug] waiting for screen change: 1 1000000
[2022-07-08T01:02:44.928192+02:00] [debug] waiting for screen change: 1 25.5385308432191
[2022-07-08T01:02:44.928592+02:00] [debug] >>> testapi::wait_screen_change: screen change seen at 1
[2022-07-08T01:02:44.929165+02:00] [debug] tests/installation/await_install.pm:198 called testapi::record_info
[2022-07-08T01:02:44.929691+02:00] [debug] <<< testapi::record_info(title="workaround", output="While trying to stop countdown we saw a screen change, retrying up to 9 times more", result="ok")
[2022-07-08T01:02:44.930634+02:00] [debug] tests/installation/await_install.pm:198 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:45 called testapi::send_key
[2022-07-08T01:02:44.931027+02:00] [debug] <<< testapi::send_key(key="alt-s", do_wait=0, wait_screen_change=0)
[2022-07-08T01:02:45.270135+02:00] [debug] tests/installation/await_install.pm:198 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:46 called testapi::wait_screen_change
[2022-07-08T01:02:45.270842+02:00] [debug] <<< testapi::wait_screen_change(timeout=3, similarity_level=50, similarity=55)
[2022-07-08T01:02:45.286134+02:00] [debug] waiting for screen change: 0 1000000
[2022-07-08T01:02:45.803449+02:00] [debug] waiting for screen change: 0 43.5021660349225
[2022-07-08T01:02:45.803865+02:00] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2022-07-08T01:02:45.804458+02:00] [debug] tests/installation/await_install.pm:198 called testapi::record_info
[2022-07-08T01:02:45.804969+02:00] [debug] <<< testapi::record_info(title="workaround", output="While trying to stop countdown we saw a screen change, retrying up to 8 times more", result="ok")
[2022-07-08T01:02:45.805881+02:00] [debug] tests/installation/await_install.pm:198 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:45 called testapi::send_key
[2022-07-08T01:02:45.806285+02:00] [debug] <<< testapi::send_key(key="alt-s", do_wait=0, wait_screen_change=0)
[2022-07-08T01:02:46.144722+02:00] [debug] tests/installation/await_install.pm:198 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:46 called testapi::wait_screen_change
[2022-07-08T01:02:46.145353+02:00] [debug] <<< testapi::wait_screen_change(timeout=3, similarity_level=50, similarity=55)
[2022-07-08T01:02:46.158213+02:00] [debug] waiting for screen change: 0 1000000
[2022-07-08T01:02:46.675674+02:00] [debug] waiting for screen change: 0 25.0089828516758
[2022-07-08T01:02:46.676107+02:00] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2022-07-08T01:02:46.676703+02:00] [debug] tests/installation/await_install.pm:198 called testapi::record_info
[2022-07-08T01:02:46.677217+02:00] [debug] <<< testapi::record_info(title="workaround", output="While trying to stop countdown we saw a screen change, retrying up to 7 times more", result="ok")
[2022-07-08T01:02:46.678186+02:00] [debug] tests/installation/await_install.pm:198 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:45 called testapi::send_key
[2022-07-08T01:02:46.678636+02:00] [debug] <<< testapi::send_key(key="alt-s", wait_screen_change=0, do_wait=0)
[2022-07-08T01:02:47.016808+02:00] [debug] tests/installation/await_install.pm:198 called await_install::wait_countdown_stop -> tests/installation/await_install.pm:46 called testapi::wait_screen_change
[2022-07-08T01:02:47.017395+02:00] [debug] <<< testapi::wait_screen_change(timeout=3, similarity_level=50, similarity=55)
[2022-07-08T01:02:47.032559+02:00] [debug] waiting for screen change: 0 1000000
[2022-07-08T01:02:47.546980+02:00] [debug] waiting for screen change: 0 1000000
[2022-07-08T01:02:48.061593+02:00] [debug] waiting for screen change: 1 1000000
[2022-07-08T01:02:48.576185+02:00] [debug] waiting for screen change: 1 1000000
[2022-07-08T01:02:49.090664+02:00] [debug] waiting for screen change: 2 1000000
[2022-07-08T01:02:49.605189+02:00] [debug] waiting for screen change: 2 1000000
[2022-07-08T01:02:50.356651+02:00] [debug] >>> testapi::wait_screen_change: timed out
[2022-07-08T01:02:50.663053+02:00] [debug] ||| finished await_install installation (runtime: 804 s)
The issue can be seen in past 3 drops and rerun 1-2 times can be the workaround. however, I am wondering that the issue should have something to do with worker performance on aarch64 platform.
AC1: Increase the QEMURAM
AC2: Enhance the logic for function 'wait_countdown_stop' to make sure we can stop auto rebooting within 10s.
Updated by rfan1 over 2 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 20
Rerun the tests with 4GB memory
# for i in 1 2 3 4 5; do openqa-clone-job --from http://openqa.suse.de --host http://openqa.suse.de 9100352 --skip-download --skip-chained-deps _GROUP=0 QEMURAM=4096; sleep 5;done
Created job #9101616: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101616
Created job #9101617: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101617
Created job #9101618: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101618
Created job #9101619: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101619
Created job #9101620: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101620
So far, I can see that the random issue can rarely be seen on the platforms other than SLES15 + aarch64, so the current plan is only adjusting the memory size on this platform only.
Updated by rfan1 over 2 years ago
Bad news, Increase the memory size seems can't fix the performance issue, let me try to enhance the 'wait_countdown_stop' function (send key every 2s )
# for i in 1 2 3 4 5; do openqa-clone-job --from http://openqa.suse.de --host http://openqa.suse.de 9100352 --skip-download --skip-chained-deps _GROUP=0 QEMURAM=4096 CASEDIR=https://github.com/rfan1/os-autoinst-distri-opensuse.git#wait_countdown_stop; sleep 5;done
Created job #9101633: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101633
Created job #9101634: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101634
Created job #9101635: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101635
Created job #9101636: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101636
Created job #9101637: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9101637
Updated by rfan1 over 2 years ago
- Copied to action #113474: [qe-core] [maintenance] move security_389ds_server/client to security job group added
Updated by rfan1 over 2 years ago
- Copied to deleted (action #113474: [qe-core] [maintenance] move security_389ds_server/client to security job group)
Updated by rfan1 over 2 years ago
As Santiago's kindly suggestion, we can use the below workaround:
Case setting-> EXCLUDE_MODULES=logs_from_installation_system
Then even we fail to stop the auto-reboot process within 10s, we don't need to check the grub menu mandatory.
However, we may miss some function points [save and upload y2logs etc..] which I think we can skip them safely :)
Please see more comments at https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15198
Updated by rfan1 over 2 years ago
Bad news again, excluding the test module can lead to another fail...
https://openqa.suse.de/tests/9109054#step/reboot_after_installation/4
[2022-07-12T04:04:40.021447+02:00] [debug] ||| starting reboot_after_installation tests/installation/reboot_after_installation.pm
[2022-07-12T04:04:40.024008+02:00] [debug] tests/installation/reboot_after_installation.pm:25 called testapi::select_console
[2022-07-12T04:04:40.024344+02:00] [debug] <<< testapi::select_console(testapi_console="installation")
[2022-07-12T04:04:40.448102+02:00] [debug] activate_console, console: installation, type:
[2022-07-12T04:04:40.448501+02:00] [debug] activate_console called with generic type, no action
[2022-07-12T04:04:40.449359+02:00] [debug] tests/installation/reboot_after_installation.pm:25 called testapi::select_console -> lib/susedistribution.pm:925 called testapi::assert_screen
[2022-07-12T04:04:40.449842+02:00] [debug] <<< testapi::assert_screen(mustmatch="installation", no_wait=1, timeout=30)
[2022-07-12T04:04:43.779804+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 3.33 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:04:43.780190+02:00] [debug] no match: 29.9s, best candidate: after-install-11sp4-s390-20191223 (0.00)
[2022-07-12T04:04:43.882472+02:00] [debug] no change: 26.5s
[2022-07-12T04:04:46.173257+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.29 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:04:46.173703+02:00] [debug] no match: 26.5s, best candidate: after-install-11sp4-s390-20191223 (0.00)
[2022-07-12T04:04:48.766897+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.49 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:04:48.767310+02:00] [debug] no match: 24.1s, best candidate: after-install-11sp4-s390-20191223 (0.00)
[2022-07-12T04:04:50.731116+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 1.86 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:04:50.731557+02:00] [debug] no match: 21.5s, best candidate: after-install-11sp4-s390-20191223 (0.00)
[2022-07-12T04:04:52.698924+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 1.86 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:04:52.699369+02:00] [debug] no match: 19.6s, best candidate: after-install-11sp4-s390-20191223 (0.00)
[2022-07-12T04:04:54.642662+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 1.84 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:04:54.643082+02:00] [debug] no match: 17.6s, best candidate: after-install-11sp4-s390-20191223 (0.00)
[2022-07-12T04:05:22.339131+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 27.59 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:05:22.339569+02:00] [debug] no match: 15.7s, best candidate: rebootnow-20191219 (0.67)
[2022-07-12T04:05:22.340946+02:00] [warn] !!! backend::baseclass::do_capture: There is some problem with your environment, we detected a stall for 27.6963760852814 seconds
[2022-07-12T04:05:48.568915+02:00] [warn] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 26.13 seconds for 61 candidate needles - make your needles more specific
[2022-07-12T04:05:48.569374+02:00] [debug] no match: -12.0s, best candidate: rebootnow-20191219 (0.67)
[2022-07-12T04:05:49.188829+02:00] [debug] >>> testapi::_check_backend_response: match=installation timed out after 30 (assert_screen)
[2022-07-12T04:05:49.345729+02:00] [debug] tests/installation/reboot_after_installation.pm:25 called testapi::select_console -> lib/susedistribution.pm:925 called testapi::assert_screen
[2022-07-12T04:05:49.346370+02:00] [debug] <<< testapi::record_info(title="Stall detected", output="Stall was detected during assert_screen fail", result="fail")
[2022-07-12T04:05:49.354895+02:00] [info] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'installation' matched
Updated by rfan1 over 2 years ago
So seems I have trouble to fix this issue now:
- https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15198 [this code change may introduce new issue since we type the key too fast]
- EXCLUDE_MODULES=logs_from_installation_system [we can still hit new issue with this workaround EXCLUDE_MODULES=logs_from_installation_system]
The last option in my mind is:
RETRY=3 [or even bigger value], I think we can pass this test with more retries, any comments, experts?
Updated by rfan1 over 2 years ago
The flowing workaround can be used to minimise the impact by lower performance on arm workers.
Settings:
REBOOT_TIMEOUT=0
EXCLUDE_MODULES=logs_from_installation_system,reboot_after_installation
#for i in 1 2 3 4 5; do openqa-clone-job --from http://openqa.suse.de --host http://openqa.suse.de 9100352 --skip-download --skip-chained-deps _GROUP=0 EXCLUDE_MODULES=logs_from_installation_system,reboot_after_installation _GROUP_ID=0 BUILD=rfan_sle15 REBOOT_TIMEOUT=0;sleep 5;done
Created job #9112500: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9112500
Created job #9112501: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9112501
Created job #9112502: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9112502
Created job #9112503: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9112503
Created job #9112504: sle-15-Server-DVD-Updates-aarch64-Build20220710-1-qam-minimal+base@aarch64-virtio -> http://openqa.suse.de/t9112504
Updated by rfan1 over 2 years ago
- % Done changed from 20 to 60
- Estimated time changed from 4.00 h to 8.00 h
Updated by rfan1 over 2 years ago
- Status changed from In Progress to Resolved
- % Done changed from 60 to 100
Updated by szarate over 2 years ago
- Sprint set to QE-Core: July Sprint (Jul 06 - Aug 03)
Updated by rfan1 over 2 years ago
- Related to action #114959: [qem][qe-core]test fails in logs_from_installation_system, "wait_countdown_stop" function can't stop auto reboot process added
Updated by rfan1 over 2 years ago
- Related to action #115079: [qe-core][qem&functional] Many test failures due to low performance on arm workers added