Project

General

Profile

Actions

action #20240

closed

[sles][functional] test fails in install_and_reboot - The test times out too early, when the installation is not yet finished

Added by jorauch about 7 years ago. Updated almost 7 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Start date:
2017-07-04
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-aarch64-lvm+RAID1@aarch64 fails in
install_and_reboot
The test times out too early, when the installation is not yet finished

Reproducible

Fails since (at least) Build 0458 (current job)
At least seen twice in the last 50 prev jobs (https://openqa.suse.de/tests/874292#step/install_and_reboot/21)
Can be reproduced (https://openqa.suse.de/tests/874292#)

Expected result

Last good: 0345 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by jorauch about 7 years ago

  • Description updated (diff)
Actions #2

Updated by jorauch about 7 years ago

  • Description updated (diff)
Actions #4

Updated by SLindoMansilla about 7 years ago

  • Status changed from New to In Progress
  • Assignee set to jorauch
Actions #5

Updated by jorauch about 7 years ago

  • Status changed from In Progress to New
  • Assignee deleted (jorauch)

Error seems to have disappeared in build 466

Actions #6

Updated by jorauch about 7 years ago

  • Status changed from New to In Progress
  • Assignee set to jorauch
Actions #7

Updated by jorauch about 7 years ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by Anonymous almost 7 years ago

  • Assignee changed from jorauch to Anonymous
Actions #9

Updated by Anonymous almost 7 years ago

  • Target version set to Milestone 11
Actions #10

Updated by Anonymous almost 7 years ago

The log file https://openqa.suse.de/tests/1043773/file/autoinst-log.txt shows it checks if installation-details-view is still there, and if not, it thinks installation might be finished, does a screen check "installation-details-view-remaining-time-gt2h" (here the needle is no more there, so the test code should also be modified), and if needle not found, continues to next module install_and_reboot.
So the question is why it failed to detect the installation process, it could be the issue of needle or other reasons. One way to improve is to increase the intervals/counter for checking. I'll test it and see if the failure can be reproduced.

Actions #11

Updated by Anonymous almost 7 years ago

I can't reproduce it, but I think we can try to avoid it by increasing the counter and waiting time for check_screen 'installation-details-view' in the start_install.pm, in case the system is too busy and takes longer than 5 seconds to switch screen.

Actions #12

Updated by mkravec almost 7 years ago

I think installation-details-view-remaining-time-gt2h is expected to not match anything (bsc#982138 & poo#12162)

From autoinst.log:

  • installation started
  • check_screen 'installation-details-view' passed fine
  • check_screen 'installation-details-view-remaining-time-gt2h' failes to match as expected
  • assert_screen 'rebootnow' fails after 2000 seconds

I don't think problem is in installation-details-view, I think you should increase timeout for 'rebootnow' in install_and_reboot module.

Actions #13

Updated by Anonymous almost 7 years ago

Can you please send me the log into my mailbox: yxu@suse.de

I still think the problem is within start_install module, the following steps happened within this module, and it finishes too early:

  • check_screen 'installation-details-view' passed fine
  • check_screen 'installation-details-view-remaining-time-gt2h' failes to match as expected

"assert_screen 'rebootnow' fails after 2000 seconds" because it is not yet ready to reboot, but you don't know how long it still takes to install, or which number would you suggest here?

Actions #14

Updated by okurz almost 7 years ago

I suggest take a look into https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/installation/install_and_reboot.pm#L52 as an example.

The default timeout for installation, covering the whole downloading from remote repositories, installing packages to the disk, configuring files, etc., is 2000 seconds as this was shown to be a good empiric threshold for "normal" installations. As full disk encryption is involving some I/O overhead the overall installation is slower and therefore scaled accordingly with the factor of two. LVM and RAID can also introduce an overhead and therefore a slowdown unless it is RAID0 which should actually decrease the necessary time but that is most likely not the case when we put both RAID volumes on the same disk so simply stated no slowdown or speedup expected.

TL;DR: Just extend https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/installation/install_and_reboot.pm#L52 to $timeout *= 2 if get_var('ENCRYPT') or get_var('LVM') or get_var('RAID'); and adjust the comment

Actions #15

Updated by Anonymous almost 7 years ago

Thanks Olli. PR updated.

Actions #16

Updated by Anonymous almost 7 years ago

Actions #17

Updated by Anonymous almost 7 years ago

  • Status changed from Feedback to In Progress
Actions #18

Updated by Anonymous almost 7 years ago

Verification testrun succeeded:
https://openqa.suse.de/tests/1122672

Actions #19

Updated by Anonymous almost 7 years ago

  • Status changed from In Progress to Resolved
Actions #20

Updated by okurz almost 7 years ago

ok, good. Thank you for your contribution. Please make sure to not retrigger tests of a GM build on production as this will alter test results we want to stay unaltered. Better clone these tests from osd to osd while renaming the test name and also the group assignment, e.g.: clone_job.pl --from https://openqa.suse.de --host https://openqa.suse.de 1234 _GROUP=0 TEST=verify_poo#20240

Actions #21

Updated by Anonymous almost 7 years ago

  • Status changed from Resolved to In Progress
Actions #22

Updated by Anonymous almost 7 years ago

Actions #23

Updated by Anonymous almost 7 years ago

The log file shows after 2000 seconds there is still no matching needle rebootnow, thus 2000 is not enough for installation of the above scenario. A solution would be to increase the timeout.

Actions #24

Updated by okurz almost 7 years ago

As also stated in the PR: We already double the time for RAID. But the failing test you referenced in the progress ticket https://openqa.suse.de/tests/1126389 is stuck at 38% and also is unable to switch to the text terminal in time for gathering logs. First, increasing from 2*2000 seconds to 2*3600 might not even cover that gap to reach 100% and IMHO that is not a behaviour we should accept lightly, especially considering that we are only installing LeanOS, i.e. way less packages then default SLES which we did all the time in before. IMHO there is something wrong with either the product or the worker which should be investigated but not by bumping the timeout. I suggest you close this PR and instead check other jobs for same symptoms or how it should work like, trigger more tests for statistical analysis, etc. E.g. do the following

for i in {1..100} ; do clone_job.pl --from https://openqa.suse.de 1126389 TEST=yxu_test_timeout_$i _GROUP=96 ; done

which is especially a good idea to do on a friday with a quite quiet osd. Give openQA some work for the weekend! :-)

Actions #25

Updated by Anonymous almost 7 years ago

The 100 jobs are almost done now, results can be seen from https://openqa.suse.de/tests/1129763 to https://openqa.suse.de/tests/1129854 (simply increase the number). Grouping didn't work well, and I manually checked test results. One can assume 90% of the tests failed. It almost always stuck around 30% and 40% of installation process. How can it be explained? Something wrong with the product? I will try to reproduce it with new build.

Actions #26

Updated by Anonymous almost 7 years ago

An overview of testrun:
https://openqa.suse.de/tests/overview?distri=sle&version=15&build=162.2&groupid=96
2 of 32 testruns fail at first_boot, all others stuck at installation.

Actions #27

Updated by Anonymous almost 7 years ago

  • Status changed from In Progress to Rejected
Actions

Also available in: Atom PDF