action #50615

[functional][y] test fails in await_install - does not catch rebootnow

Added by mlin7442 over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2019-04-22
Due date:
% Done: 0%
Estimated time: 5.00 h
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-KDE-Live-x86_64-kde_live_upgrade_leap_42.3@64bit-2G fails in await_install

In general we don't care much whether logs are collected before or after the reboot, except that the system might not boot.
So if we don't find a way to make it work, let's just boot. The consequence is that for failures which prevent the system from booting we won't have logs. But that only applies to cases where YaST wasn't able to detect the issue.

So after discussion we decided to implement a solution that skips collecting logs from the SUT depending on some variable, so that we don't need to catch the reboot pop-up.
This will require un-scheduling logs_from_installation_system and modifying await_install not to wait for the pop-up.
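
The decision above could be sketched roughly as follows. This is only an illustration in Python, not the actual test code: the variable name SKIP_INSTALLATION_LOGS and both helper functions are hypothetical, and only the module names are taken from this ticket.

```python
# Hypothetical sketch: SKIP_INSTALLATION_LOGS is an invented variable name,
# not an existing openQA setting. Module names are the ones from this ticket.
MODULES_NEEDING_POPUP = {"logs_from_installation_system"}


def build_schedule(modules, settings):
    """Return the module list, dropping log collection when the variable is set."""
    if settings.get("SKIP_INSTALLATION_LOGS") == "1":
        return [m for m in modules if m not in MODULES_NEEDING_POPUP]
    return list(modules)


def must_catch_reboot_popup(settings):
    """await_install only has to stop at the pop-up if logs are collected afterwards."""
    return settings.get("SKIP_INSTALLATION_LOGS") != "1"


schedule = [
    "await_install",
    "logs_from_installation_system",
    "reboot_after_installation",
    "grub_test",
]
```

With the variable set, the schedule no longer contains logs_from_installation_system and await_install can let the system reboot on its own; without it, the behaviour stays as before.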

The mentioned scenario is the only one affected.

Test suite description

Uses the live installer on the kde live media for upgrading the system.

Acceptance criteria

  1. Test suite doesn't fail if we miss the reboot screen

Reproducible

Fails since (at least) Build 20190421 (current job)

Expected result

Last good: 20190420 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #53534: [opensuse][kde] test fails in await_install - timeout not working properly (Resolved, 2019-06-26)

Related to openQA Tests - action #51983: [functional][y][sporadic] test fails in "await_install" to detect the end of installation (Rejected, 2019-05-25)

Related to openQA Infrastructure - action #58727: openqa-aarch64 from o3 slower than usual aka. os-autoinst is too slow pressing F2 causing ARM tests to fail in "boot_to_desktop" (Resolved, 2019-10-28)

Related to openQA Infrastructure - action #20914: [tools] configure vm settings for workers with rotating discs (Resolved, 2017-07-28, 2019-11-05)

Has duplicate openQA Tests - action #58802: test fails in await_install (Rejected, 2019-10-29)

Has duplicate openQA Tests - action #58832: test fails in await_install, seems to be stuck on grub menu (Rejected, 2019-10-29)

History

#1 Updated by SLindoMansilla over 2 years ago

  • Subject changed from test fails in await_install - does not catch rebootnow to [opensuse] test fails in await_install - does not catch rebootnow

As a result of backlog triaging (see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging for more information).

Please, feel free to adjust the category or the "[label]" if you think different.

#2 Updated by okurz over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: update_Leap_42.1_kde
https://openqa.opensuse.org/tests/935316

#3 Updated by okurz over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: update_Leap_42.1_kde
https://openqa.opensuse.org/tests/945390

#4 Updated by okurz over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: update_Leap_42.3_kde+system_performance
https://openqa.opensuse.org/tests/959088

#5 Updated by okurz over 2 years ago

  • Subject changed from [opensuse] test fails in await_install - does not catch rebootnow to [functional][y] test fails in await_install - does not catch rebootnow
  • Priority changed from Normal to High

This is happening seemingly more and more now, e.g. see https://openqa.opensuse.org/tests/overview?version=Tumbleweed&failed_modules=await_install showing already 5 jobs within a single build right now.

riafarov I think QSF-y can handle this better. Could you please help? It seems to me as if the installer changed its performance impact somewhat, so that we have a more loaded system which is more prone to miss the screen. Or did something change in test behaviour? Or does the installer by now have an option to always stop at the end without a timeout? :) I guess as a workaround we could still try to accept the fact when we found a successfully booted system instead that we do not even need the installation logs from the next module.

#6 Updated by okurz over 2 years ago

  • Related to action #53534: [opensuse][kde] test fails in await_install - timeout not working properly added

#7 Updated by okurz over 2 years ago

  • Related to action #51983: [functional][y][sporadic] test fails in "await_install" to detect the end of installation added

#9 Updated by riafarov over 2 years ago

  • Target version set to Milestone 27

From the logs, there is no gap of 9 seconds, so the screen should have matched, even with 33 needles to match against the screen. I will attempt to reduce the number of needles, but this is an issue with the tooling. In the logs we have evidence of the message being displayed.
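
A check like the one described here could be sketched as below. It assumes that "gap" means the interval between two consecutive screen checks recorded in the log; the function names are hypothetical and any timestamps passed in are up to the caller, nothing here is taken from the actual job log.

```python
def largest_gap(check_times):
    """Return the longest interval in seconds between consecutive screen checks."""
    ordered = sorted(check_times)
    return max((later - earlier for earlier, later in zip(ordered, ordered[1:])), default=0.0)


def popup_could_be_missed(check_times, popup_visible_for):
    """A pop-up can only slip through if some gap is longer than its display time."""
    return largest_gap(check_times) > popup_visible_for
```

If the largest gap stays below the time the pop-up is shown, at least one screen check must have seen it, so a failed match would point at the matching itself rather than at the timing.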

#10 Updated by riafarov about 2 years ago

  • Description updated (diff)
  • Due date set to 2019-09-24

#11 Updated by riafarov about 2 years ago

  • Priority changed from High to Normal
  • Target version changed from Milestone 27 to Milestone 28

#12 Updated by riafarov about 2 years ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Estimated time set to 5.00 h

#13 Updated by riafarov about 2 years ago

  • Due date changed from 2019-09-24 to 2019-10-08
  • Assignee set to riafarov

#14 Updated by riafarov about 2 years ago

  • Status changed from Workable to Blocked

There is a problem with the image: it looks like we upgrade a UEFI installation using legacy boot, which breaks.

#15 Updated by riafarov about 2 years ago

  • Due date changed from 2019-10-08 to 2019-10-22

#16 Updated by riafarov about 2 years ago

  • Target version changed from Milestone 28 to Milestone 30+

#17 Updated by riafarov about 2 years ago

  • Due date deleted (2019-10-22)
  • Target version changed from Milestone 30+ to future

#18 Updated by okurz about 2 years ago

Hi riafarov, your last comment in #50615#note-14 indicates a temporary problem? https://openqa.opensuse.org/tests/1067386#step/await_install/5 shows a recent failure with the same symptoms but for sure not related to a UEFI upgrade or legacy boot. What do you think about my suggestion in #50615#note-5 "to accept the fact when we found a successfully booted system instead that we do not even need the installation logs from the next module."? That should be even easier now as it is possible to dynamically change the test schedule from the test itself. However, it might be better to explicitly record the "skipping" in each test module that comes before grub_test, i.e. "logs_from_installation_system" and "reboot_after_installation".

As an alternative I still see what was discussed in https://bugzilla.suse.com/show_bug.cgi?id=1122493 , managed by the YaST development team: https://trello.com/c/CDedArHx , namely an option for an indefinite timeout at the end of the installation.

EDIT: Trying myself with a suggestion for the YaST installer: https://github.com/yast/yast-installation/pull/823

#19 Updated by ggardet_arm almost 2 years ago

  • Related to action #58727: openqa-aarch64 from o3 slower than usual aka. os-autoinst is too slow pressing F2 causing ARM tests to fail in "boot_to_desktop" added

#20 Updated by riafarov almost 2 years ago

  • Due date set to 2019-12-03
  • Status changed from Blocked to Workable
  • Assignee deleted (riafarov)

okurz wrote:

Hi riafarov, your last comment in #50615#note-14 indicates a temporary problem? https://openqa.opensuse.org/tests/1067386#step/await_install/5 shows a recent failure with the same symptoms but for sure not related to a UEFI upgrade or legacy boot. What do you think about my suggestion in #50615#note-5 "to accept the fact when we found a successfully booted system instead that we do not even need the installation logs from the next module."? That should be even easier now as it is possible to dynamically change the test schedule from the test itself. However, it might be better to explicitly record the "skipping" in each test module that comes before grub_test, i.e. "logs_from_installation_system" and "reboot_after_installation".

As an alternative I still see what was discussed in https://bugzilla.suse.com/show_bug.cgi?id=1122493 , managed by the YaST development team: https://trello.com/c/CDedArHx , namely an option for an indefinite timeout at the end of the installation.

EDIT: Trying myself with a suggestion for the YaST installer: https://github.com/yast/yast-installation/pull/823

Hi @okurz. I would not call a problem which has been there for a month temporary. Have you checked the failure in the job mentioned here? Also, as has been said, there is no easy way out, as our tools cannot handle these scenarios properly, meaning they are unreliable. For what you are suggesting, there is already a variable called GRUB_TIMEOUT. An alternative would be to use the startshell=1 boot parameter, which provides a console before the reboot and doesn't require syncing on the pop-up.

As you are part of the tools team now, maybe you could take a look at why we cannot match the pop-up which is there for 10 seconds?

The bug you are referring to is against SLE 15 SP1 and about general performance, so please do not mix everything into a single issue.
As we now have a job where we can reproduce the issue, it can be worked on.

#21 Updated by okurz almost 2 years ago

  • Related to action #20914: [tools] configure vm settings for workers with rotating discs added

#22 Updated by okurz almost 2 years ago

riafarov wrote:

Hi @okurz. I would not call a problem which has been there for a month temporary.

Yes, for sure, that's my point. But you updated the ticket status to "Workable" so that's what I meant, thanks! :)

Have you checked the failure in the job mentioned here?

Yes, I have checked. Did I miss something?

Also, as has been said, there is no easy way out, as our tools cannot handle these scenarios properly, meaning they are unreliable. For what you are suggesting, there is already a variable called GRUB_TIMEOUT.

Of course, I know about the variable. What I meant with "successfully booted system" is when we reached the grub menu, not a booted Linux system.

An alternative would be to use the startshell=1 boot parameter, which provides a console before the reboot and doesn't require syncing on the pop-up.

Yes, we discussed this already. It might be a bit too different from a normal test flow though.

As you are part of the tools team now, maybe you could take a look at why we cannot match the pop-up which is there for 10 seconds?

The reason is simple: Linux is not a realtime operating system and we cannot guarantee that we are able to interact with a system in time. Also see #20914 for more details. It is unfortunate that we cannot make it work even within 8s, but all safe alternatives would come with a severe slowdown which we cannot take lightly.

The bug you are referring to is against SLE 15 SP1 and about general performance, so please do not mix everything into a single issue.

You know just the same as I do that the installer in SLE15SP1 is hardly any different from Tumbleweed so I don't know why you don't see this connection. However, I mentioned the bug because the proposed solution is in there: to give a possibility to not have a timeout at all. Maybe you have an idea how we could hot-patch a live system to change https://github.com/yast/yast-installation/blob/master/src/lib/installation/clients/inst_finish.rb#L155 within the installer? In the end, it should be all Ruby code, not compiled C code, right?

As we now have a job where we can reproduce the issue, it can be worked on.

Hm, I doubt we have a more reproducible problem. At least this one is back to "works often, not always".

#23 Updated by okurz almost 2 years ago

#24 Updated by okurz almost 2 years ago

  • Has duplicate action #58832: test fails in await_install, seems to be stuck on grub menu added

#25 Updated by riafarov almost 2 years ago

  • Due date changed from 2019-12-03 to 2019-12-17

There was a change in the installer code to disable timeout in the live installer, so might be that we don't need this fix anymore.

#26 Updated by okurz almost 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

correct. I am on it currently. The product change is right now pending in Tumbleweed staging where we should be able to test it out. According to riafarov older derived products do not seem to be affected, i.e. more stable so maybe the product change is good enough. As an alternative we could still follow the "startshell" approach in parallel.

#27 Updated by okurz almost 2 years ago

  • Status changed from In Progress to Blocked

still waiting for staging :F to build a new medium. We need to wait for at least build 317 in https://build.opensuse.org/package/binaries/openSUSE:Factory:Staging:F/000product:openSUSE-dvd5-dvd-x86_64/images including the necessary product changes before we can test again.

#28 Updated by okurz almost 2 years ago

I resolved both https://bugzilla.suse.com/show_bug.cgi?id=1157476 and https://bugzilla.suse.com/show_bug.cgi?id=1122493 , now waiting for https://build.opensuse.org/project/show/openSUSE:Factory:Staging:E to be accepted for https://build.opensuse.org/request/show/751336. Afterwards we can apply the new linuxrc parameter "reboot_timeout=0" for all but older products, i.e. Tumbleweed.

EDIT: 2019-12-16: The corresponding SRs for Tumbleweed, Leap 15.2 and SLE15SP2 are all accepted now, so we can set reboot_timeout=0 for all tests on newer products:

openqa-clone-job --within-instance https://openqa.opensuse.org/tests/1113903 BUILD= _GROUP= CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#feature/install_timeout TEST=minimalx_no_reboot_timeout_okurz_poo50615

Created job #1114648: opensuse-Tumbleweed-DVD-x86_64-Build20191214-minimalx@64bit -> https://openqa.opensuse.org/t1114648

-> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/9176
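
For illustration only, the "newer products only" rule for reboot_timeout=0 could look like the Python sketch below. It is not the code from the pull request: the product labels are simply the ones named in this comment, and the helper boot_parameters and the example base parameter are hypothetical.

```python
# Products named in this comment as having the installer change accepted.
# The labels are illustrative strings, not real openQA DISTRI/VERSION values.
PRODUCTS_WITH_REBOOT_TIMEOUT = {"Tumbleweed", "Leap 15.2", "SLE15SP2"}


def boot_parameters(product, base_params):
    """Append reboot_timeout=0 only for products that understand it."""
    params = list(base_params)
    if product in PRODUCTS_WITH_REBOOT_TIMEOUT:
        params.append("reboot_timeout=0")
    return " ".join(params)
```

Older products would keep their existing boot line unchanged, so they still depend on catching the pop-up before it times out.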

#29 Updated by okurz almost 2 years ago

  • Due date deleted (2019-12-17)
  • Status changed from Blocked to Feedback
  • Target version changed from future to Current Sprint

#31 Updated by okurz almost 2 years ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

that was before the PR was merged. Rescheduled: https://openqa.suse.de/tests/3747435 , https://openqa.suse.de/tests/3747435#step/bootloader/4 shows the parameter "reboot_timeout=0" entered. https://openqa.suse.de/tests/3747435#step/await_install/5 shows the reboot confirmation dialog waiting for explicit action. With this I see this issue fixed.
