action #20240
closed[sles][functional] test fails in install_and_reboot - The test times out too early, when the installation is not yet finished
0%
Description
Observation¶
openQA test in scenario sle-12-SP3-Server-DVD-aarch64-lvm+RAID1@aarch64 fails in
install_and_reboot
The test times out too early, when the installation is not yet finished
Reproducible¶
Fails since (at least) Build 0458 (current job)
At least seen twice in the last 50 prev jobs (https://openqa.suse.de/tests/874292#step/install_and_reboot/21)
Can be reproduced (https://openqa.suse.de/tests/874292#)
Expected result¶
Last good: 0345 (or more recent)
Further details¶
Always latest result in this scenario: latest
Updated by jorauch over 7 years ago
Further appearances:
https://openqa.suse.de/tests/1043773
https://openqa.suse.de/tests/1039660
Updated by SLindoMansilla over 7 years ago
- Status changed from New to In Progress
- Assignee set to jorauch
Updated by jorauch over 7 years ago
- Status changed from In Progress to New
- Assignee deleted (
jorauch)
Error seems to have disappeared in build 466
Updated by jorauch over 7 years ago
- Status changed from New to In Progress
- Assignee set to jorauch
Updated by Anonymous over 7 years ago
- Assignee changed from jorauch to Anonymous
Updated by Anonymous over 7 years ago
The log file https://openqa.suse.de/tests/1043773/file/autoinst-log.txt shows it checks if installation-details-view is still there, and if not, it thinks installation might be finished, does a screen check "installation-details-view-remaining-time-gt2h" (here the needle is no more there, so the test code should also be modified), and if needle not found, continues to next module install_and_reboot.
So the question is why it failed to detect the installation process, it could be the issue of needle or other reasons. One way to improve is to increase the intervals/counter for checking. I'll test it and see if the failure can be reproduced.
Updated by Anonymous over 7 years ago
I can't reproduce it, but I think we can try to avoid it by increasing the counter and waiting time for check_screen 'installation-details-view' in the start_install.pm, in case the system is too busy and takes longer than 5 seconds to switch screen.
Updated by mkravec over 7 years ago
I think installation-details-view-remaining-time-gt2h is expected to not match anything (bsc#982138 & poo#12162)
From autoinst.log:
- installation started
- check_screen 'installation-details-view' passed fine
- check_screen 'installation-details-view-remaining-time-gt2h' failes to match as expected
- assert_screen 'rebootnow' fails after 2000 seconds
I don't think problem is in installation-details-view, I think you should increase timeout for 'rebootnow' in install_and_reboot module.
Updated by Anonymous over 7 years ago
Can you please send me the log into my mailbox: yxu@suse.de
I still think the problem is within start_install module, the following steps happened within this module, and it finishes too early:
- check_screen 'installation-details-view' passed fine
- check_screen 'installation-details-view-remaining-time-gt2h' failes to match as expected
"assert_screen 'rebootnow' fails after 2000 seconds" because it is not yet ready to reboot, but you don't know how long it still takes to install, or which number would you suggest here?
Updated by okurz over 7 years ago
I suggest take a look into https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/installation/install_and_reboot.pm#L52 as an example.
The default timeout for installation, covering the whole downloading from remote repositories, installing packages to the disk, configuring files, etc., is 2000 seconds as this was shown to be a good empiric threshold for "normal" installations. As full disk encryption is involving some I/O overhead the overall installation is slower and therefore scaled accordingly with the factor of two. LVM and RAID can also introduce an overhead and therefore a slowdown unless it is RAID0 which should actually decrease the necessary time but that is most likely not the case when we put both RAID volumes on the same disk so simply stated no slowdown or speedup expected.
TL;DR: Just extend https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/installation/install_and_reboot.pm#L52 to $timeout *= 2 if get_var('ENCRYPT') or get_var('LVM') or get_var('RAID');
and adjust the comment
Updated by Anonymous over 7 years ago
Verification run: https://openqa.suse.de/tests/1116688
Updated by Anonymous over 7 years ago
- Status changed from Feedback to In Progress
Updated by Anonymous over 7 years ago
Verification testrun succeeded:
https://openqa.suse.de/tests/1122672
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Resolved
Updated by okurz over 7 years ago
ok, good. Thank you for your contribution. Please make sure to not retrigger tests of a GM build on production as this will alter test results we want to stay unaltered. Better clone these tests from osd to osd while renaming the test name and also the group assignment, e.g.: clone_job.pl --from https://openqa.suse.de --host https://openqa.suse.de 1234 _GROUP=0 TEST=verify_poo#20240
Updated by Anonymous over 7 years ago
- Status changed from Resolved to In Progress
Updated by Anonymous over 7 years ago
Recent failure:
https://openqa.suse.de/tests/1126389
Updated by Anonymous over 7 years ago
The log file shows after 2000 seconds there is still no matching needle rebootnow, thus 2000 is not enough for installation of the above scenario. A solution would be to increase the timeout.
Updated by okurz over 7 years ago
As also stated in the PR: We already double the time for RAID. But the failing test you referenced in the progress ticket https://openqa.suse.de/tests/1126389 is stuck at 38% and also is unable to switch to the text terminal in time for gathering logs. First, increasing from 2*2000 seconds to 2*3600 might not even cover that gap to reach 100% and IMHO that is not a behaviour we should accept lightly, especially considering that we are only installing LeanOS, i.e. way less packages then default SLES which we did all the time in before. IMHO there is something wrong with either the product or the worker which should be investigated but not by bumping the timeout. I suggest you close this PR and instead check other jobs for same symptoms or how it should work like, trigger more tests for statistical analysis, etc. E.g. do the following
for i in {1..100} ; do clone_job.pl --from https://openqa.suse.de 1126389 TEST=yxu_test_timeout_$i _GROUP=96 ; done
which is especially a good idea to do on a friday with a quite quiet osd. Give openQA some work for the weekend! :-)
Updated by Anonymous over 7 years ago
The 100 jobs are almost done now, results can be seen from https://openqa.suse.de/tests/1129763 to https://openqa.suse.de/tests/1129854 (simply increase the number). Grouping didn't work well, and I manually checked test results. One can assume 90% of the tests failed. It almost always stuck around 30% and 40% of installation process. How can it be explained? Something wrong with the product? I will try to reproduce it with new build.
Updated by Anonymous over 7 years ago
An overview of testrun:
https://openqa.suse.de/tests/overview?distri=sle&version=15&build=162.2&groupid=96
2 of 32 testruns fail at first_boot, all others stuck at installation.
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Rejected
Product bug:
https://bugzilla.suse.com/show_bug.cgi?id=1055682