action #45938
[functional][u] disk_boot test in Kubic textmode scenarios fail after 09th Jan
Description
Observation
Since 09th Jan the disk_boot test on Kubic has been failing, but only in its textmode scenario
https://openqa.opensuse.org/tests/828121
This test should be rebooting the SUT, but that reboot does not appear to occur
Other Kubic scenarios that call the disk_boot test but do not use Textmode are ok
It has been confirmed manually that the product boots perfectly fine, even when textmode is used during the installation.
This is strongly suspected to be an openQA issue because tests with earlier TW builds that passed before 0109 (eg https://openqa.opensuse.org/tests/826720 ) now fail when run after the deploy that occurred on 0109 (eg https://openqa.opensuse.org/tests/828139)
I honestly have NO idea how the deploy can cause these symptoms, but in the absence of any recent commit that affects the code path in use (https://github.com/os-autoinst/os-autoinst-distri-opensuse/commits/master) I think the changes introduced in the deploy to o3 are the most likely cause.
Problem
H1 REJECT The product has changed
- H1.1 REJECT Specified assets have changed -> see #45938#note-1, failure reproduced with the ISO from "last good"
- H1.2 REJECT Downloaded, dynamic content is differing and impacting the test
H2 ACCEPT Fails because of changes in test setup
- H2.1 REJECT Our test hardware equipment behaves differently
- H2.2 REJECT The network behaves differently
- H2.3 ACCEPT The worker has an influence
H3 REJECT Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
H4 REJECT Fails because of changes in test management configuration, e.g. openQA database settings
H5 REJECT Fails because of changes in the test software itself (the test plan in source code as well as needles) -> E2.3-1
H6 ACCEPT Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
- H6.1 ACCEPT The test code is not resilient enough to slow reboots because of the 30s timeout waiting for the grub screen in kubic/disk_boot.pm:20
History
#1
Updated by RBrownSUSE about 4 years ago
- Description updated (diff)
#2
Updated by okurz about 4 years ago
- Status changed from New to In Progress
- Assignee set to okurz
#3
Updated by okurz about 4 years ago
- Subject changed from disk_boot test in Kubic textmode scenarios fail after 09th Jan to [functional][u] disk_boot test in Kubic textmode scenarios fail after 09th Jan
- Description updated (diff)
- Priority changed from Normal to High
- Target version set to Milestone 22
Let me see.
- last good: build 20190105: https://openqa.opensuse.org/tests/826720/ running on imagetester:1, os-autoinst 4.5.1546602946.a7be7efa, test git hash f56b364adcd8154fc88bdb8bcbecc5de43195fbf
- bad: same build: https://openqa.opensuse.org/tests/828139/ running on openqaworker1:10, os-autoinst 4.5.1546949268.cb3fa727, test git hash cf10b19af359c26b61e1b8f50e67c45d78e774df
last good on openqaworker1: https://openqa.opensuse.org/tests/820899 from 16 days ago.
- E4-1 Compare job settings: O4-1:
$ diff <(openqa_client_o3 --json-output jobs/826720 | jq '.job | .settings' | sort) <(openqa_client_o3 --json-output jobs/828139 | jq '.job | .settings' | sort)
18,19c18,19
< "MULTI_STEP_KUBIC_FLOW": "1",
< "NAME": "00826720-kubic-Tumbleweed-DVD-x86_64-Build20190105-microos_textmode@64bit-4G-HD40G",
---
> "MULTI_STEP_KUBIC_FLOW": "1"
> "NAME": "00828139-kubic-Tumbleweed-DVD-x86_64-Build20190105-microos_textmode@64bit-4G-HD40G",
31c31
< "TEST": "microos_textmode"
---
> "TEST": "microos_textmode",
no significant difference -> REJECT H4
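The hunks above differ only in trailing commas (an artifact of sorting pretty-printed JSON lines) and in the job-specific NAME. A minimal sketch of comparing the parsed settings instead, which avoids that noise; the function name is illustrative and the two dictionaries are reduced excerpts of the settings shown above, not the full job settings:

```python
def settings_diff(a, b, ignore=("NAME",)):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = (set(a) | set(b)) - set(ignore)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


# Reduced excerpts of the two jobs' settings, taken from the diff output above.
good = {
    "MULTI_STEP_KUBIC_FLOW": "1",
    "NAME": "00826720-kubic-Tumbleweed-DVD-x86_64-Build20190105-microos_textmode@64bit-4G-HD40G",
    "TEST": "microos_textmode",
}
bad = {
    "MULTI_STEP_KUBIC_FLOW": "1",
    "NAME": "00828139-kubic-Tumbleweed-DVD-x86_64-Build20190105-microos_textmode@64bit-4G-HD40G",
    "TEST": "microos_textmode",
}

# NAME always differs because it embeds the job id, so it is ignored by default.
print(settings_diff(good, bad))
```

With NAME ignored the result is empty, which matches the conclusion that the settings do not differ in any meaningful way.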
- E3-1 Check difference of os-autoinst: O3-1
$ git log1 --no-merges a7be7efa..cb3fa727
036ab540 Add missing network_console.pm to Makefile
631d0f7a Do not incomplete on connection error with ssh based consoles
so an os-autoinst change is unlikely to be the cause
- E2.3-1 Retrigger tests on production until the same worker as in "last good" picks the job: https://openqa.opensuse.org/tests/828316#live -> passed disk_boot, supporting H2.3 and H6, REJECT H5, REJECT H1.2 and H1
- E2.3-2 Check average load on our workers: O2.3-2
$ for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo -n "$i: " && ssh root@$i "cat /proc/loadavg"; done
power8: 8.19 8.37 7.64 8/1424 152970
aarch64: 1.22 1.35 1.27 2/866 24624
imagetester: 6.64 4.74 3.06 5/442 24515
openqaworker1: 10.80 12.80 10.53 17/1458 5511
openqaworker4: 8.19 10.86 9.71 6/1126 24461
$ for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo -n "$i: " && ssh root@$i "cat /proc/loadavg"; done
power8: 8.35 8.63 7.96 8/1405 153270
aarch64: 0.45 1.00 1.15 1/806 24713
imagetester: 1.26 2.85 2.72 1/419 24721
openqaworker1: 6.49 9.38 9.72 8/1218 6257
openqaworker4: 10.08 9.82 9.50 11/1192 25258
The long-term (15-minute) load on imagetester is 2.72-3.06, whereas on openqaworker1 and openqaworker4 it is 9.50-10.53. So the load is significantly higher on openqaworker1 and openqaworker4 than on imagetester. REJECT H2.1 and H2.2
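The 15-minute figures quoted here are the third field of each /proc/loadavg line. A small sketch that extracts them from the two samples recorded in this ticket; the parsing helper is illustrative and not part of any openQA tooling:

```python
# The two rounds of "host: <contents of /proc/loadavg>" output recorded in this ticket.
SAMPLES = """\
power8: 8.19 8.37 7.64 8/1424 152970
aarch64: 1.22 1.35 1.27 2/866 24624
imagetester: 6.64 4.74 3.06 5/442 24515
openqaworker1: 10.80 12.80 10.53 17/1458 5511
openqaworker4: 8.19 10.86 9.71 6/1126 24461
power8: 8.35 8.63 7.96 8/1405 153270
aarch64: 0.45 1.00 1.15 1/806 24713
imagetester: 1.26 2.85 2.72 1/419 24721
openqaworker1: 6.49 9.38 9.72 8/1218 6257
openqaworker4: 10.08 9.82 9.50 11/1192 25258
"""


def fifteen_minute_loads(text):
    """Map each host to the list of its 15-minute load averages (third field)."""
    loads = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        host, fields = line.split(":", 1)
        loads.setdefault(host.strip(), []).append(float(fields.split()[2]))
    return loads


loads = fifteen_minute_loads(SAMPLES)
for host, values in loads.items():
    print(f"{host}: min={min(values):.2f} max={max(values):.2f}")
```

Two samples per host are of course only a snapshot; a longer series would be needed for a firm statement about typical load.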
E2-1 Local clone: Running on lord.arch: http://lord.arch/tests/1966 -> failed in disk_boot, supporting H2 and H6
E6-1 Gather statistics and check workers in particular:
for i in {001..020}; do openqa-clone-job --from https://openqa.opensuse.org --host https://openqa.opensuse.org --skip-download --skip-chained-deps 828316 TEST=okurz_poo45938_$i _GROUP="Development Tumbleweed" BUILD="20190105:poo45938" EXCLUDE_MODULES=networking,repositories,create_autoyast,libzypp_config,one_line_checks,services_enabled,filesystem_ro,transactional_update,rebootmgr,journal_check,shutdown; done
17/20 failed. The three jobs that passed were all executed on imagetester, with total execution times of 8:30m, 8:32m and 10:06m (the last one got the winter grub theme, which takes longer in "bootloader"). All failed jobs took between 8:57m and 10:37m, so longer. ACCEPT H2.3 and H2. As all three workers have the same package state, ensured by transactional updates, REJECT H3
E6.1-1 lord.arch is currently loaded and conducting tests slowly. Let's see if bumping the timeout in kubic/disk_boot can help -> http://lord.arch/tests/1967 using wait_boot instead of the custom assert_screen, which effectively bumps the timeout waiting for "grub2" from 30s to 90s. Failed. Setting TIMEOUT_SCALE=5 -> http://lord.arch/tests/1969
Additionally triggered https://openqa.opensuse.org/tests/overview?distri=kubic&groupid=38&build=20190105%3Apoo45938_scaled_3&version=Tumbleweed with TIMEOUT_SCALE=3, which showed that jobs still fail except on imagetester, so going for a higher timeout in the fix.
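For orientation, a rough sketch of the waiting times these experiments imply, assuming TIMEOUT_SCALE acts as a plain multiplier on the timeout passed to the test API. The ticket does not state which base value (30s or 90s) was in effect for each scaled run, so both are listed, and the helper name is hypothetical:

```python
def effective_timeout(base_seconds, timeout_scale=1):
    """Effective wait in seconds, assuming the scale is a simple multiplier."""
    return base_seconds * timeout_scale


# Base timeouts from this ticket: 30s (custom assert_screen) and 90s (via wait_boot).
for base in (30, 90):
    for scale in (1, 3, 5):
        print(f"base={base}s TIMEOUT_SCALE={scale}: {effective_timeout(base, scale)}s")
```

Under that assumption, even a 90s wait was not enough on the loaded workers, which is consistent with choosing a higher timeout in the fix.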
Fix in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6521
#4
Updated by okurz about 4 years ago
- Description updated (diff)
#5
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from In Progress to Feedback
#6
Updated by okurz about 4 years ago
PR merged.
Crosschecking:
$ for i in {001..020}; do openqa-clone-job --from https://openqa.opensuse.org --host https://openqa.opensuse.org --skip-download --skip-chained-deps 828316 TEST=okurz_poo45938_scaled_3_$i _GROUP="Development Tumbleweed" BUILD="20190105:poo45938_with_fix" EXCLUDE_MODULES=networking,repositories,create_autoyast,libzypp_config,one_line_checks,services_enabled,filesystem_ro,transactional_update,rebootmgr,journal_check,shutdown; done