Project

General

Profile

Actions

coordination #35215

closed

[functional][u][epic][medium] test fails on shutdown module

Added by okurz over 6 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Bugs in existing tests
Target version:
SUSE QA (private) - Milestone 19
Start date:
2018-04-19
Due date:
2018-09-25
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-15.0-GNOME-Live-x86_64-gnome-live@uefi-usb-2G fails in
shutdown
to shutdown in time.

Reproducible

Fails since (at least) Build 15.1

Expected result

Last good: 14.8 (or more recent)

Problem

  • H1: ACCEPT: Only usb-based images are affected -> other scenarios seem to be fine in general
  • H2: REJECT: live-images shutdown does not work at all and needs to be reported as bug and worked around -> oorlov conducted two local verification runs, one passed and one fail so it can work
  • H3: shutdown only takes much longer
  • H4: Shutdown test fails because some of the installed services slow down it
    • H4.1 It fails due to services, that were installed during the test run
    • H4.2 It fails due to pre-installed service in the system itself

Suggestions

  • Reproduce locally
  • Investigate existing failures and see if we would maybe just need to bump the timeout
  • Crosscheck other tickets about shutdown
  • Make sure tests reliably shutdown and reboot by either just bumping timeout or finding out the real cause of being stuck or using another more reliant method to reboot as workaround
  • If increasing timeout will not help or will show that shutdown takes too long time, then next step would be removing splash screen and check what is happening in background
  • Merge three shutdown files into one (x11/shutdown.pm, shutdown/shutdown.pm, kernel/shutdown_ltp.pm) to avoid duplication and have single place where shutdown logic is located (1d);
  • Add generic method to raise 'soft fail' if some amount of time is exceeded but action is not finished. If the time exceeded, then soft fail should be reported and test could wait more time for the action to be finished (it is required, because we want to raise 'soft fail' if assert_shutdown() is not completed in time and then wait until shutdown will actually finish) (2d);
  • Add ability to debug system shutdown, e.g. it could be done if job is started with "DEBUG_SHUTDOWN" parameter (4h)
  • Test changed code on several combinations of tests (1d)

Further details

Always latest result in this scenario: latest


Files

shutdown-logs.tar.gz (275 KB) shutdown-logs.tar.gz oorlov, 2018-06-04 13:06

Related issues 8 (0 open8 closed)

Related to openQA Tests (public) - coordination #35302: [qe-core][opensuse][functional][epic][sporadic] Various unstable tests on o3ResolvedSLindoMansilla2018-04-26

Actions
Related to openQA Tests (public) - action #33571: [opensuse][functional][u][medium] test fails in shutdown - emoticon settings are openedResolved2018-03-212018-06-05

Actions
Related to openQA Tests (public) - action #36150: [opensuse][functional][u][sporadic] test fails in shutdown on unsafe code looking for "sddm_shutdown_option_btn"Resolvedokurz2018-05-142018-07-31

Actions
Related to openQA Tests (public) - action #35011: [functional][sle][u][sporadic] test fails in user_defined_snapshot during reboot because shutdown is not working or it takes too longResolvedzluo2018-04-162018-08-28

Actions
Related to openQA Tests (public) - action #36040: [opensuse][functional][u] test fails in shutdown - need a way to check after shutdown needle got matched but shutdown hasn't been executed for unknown reasonResolvedoorlov2018-05-092018-09-25

Actions
Related to openQA Tests (public) - action #40055: [functional][u][bsc#1105517][sporadic][fast] - running application preventing the system from shutdown on otherDE_mateRejectedoorlov2018-08-212018-08-28

Actions
Related to openQA Tests (public) - action #43880: [functional][u][s390x][sporadic] test fails in shutdown on s390xResolvedzluo2018-11-16

Actions
Blocks openQA Tests (public) - action #36997: [opensuse][functional][u][wayland] test fails in shutdown - xterm does not startResolvedoorlov2018-06-082018-09-25

Actions
Actions #1

Updated by oorlov over 6 years ago

  • Assignee set to oorlov
Actions #2

Updated by okurz over 6 years ago

  • Subject changed from [functional][u][sporadic][fast] test fails in shutdown of new leap live image tests to [functional][u][sporadic][medium] test fails in shutdown of new leap live image tests
  • Status changed from New to Workable
  • Difficulty set to medium
Actions #3

Updated by okurz over 6 years ago

  • Target version changed from Milestone 15 to Milestone 16

correcting milestone, it's actually M16

Actions #4

Updated by okurz over 6 years ago

  • Description updated (diff)
  • Status changed from Workable to In Progress

@oorlov I know you are actually working on this one :)

Please update this ticket accordingly on a more timely base because it is urgent.

See how I updated the description as an example.

Actions #5

Updated by oorlov over 6 years ago

I've reproduced the issue locally. From 3 runs it failed twice.

After that I've commented all the modules in between boot and shutdown and executed 10 times. All the tests passed.

So, my assumption is that some of the installed packages slow the shutdown. I'll try to increase the timeout and see what will happen.

Actions #8

Updated by oorlov over 6 years ago

  • Description updated (diff)

After discussion with okurz it was decided to increase the timeout dramatically and trigger huge amount of test runs on o3 in order to check if it may solve the issue or it will still fail regardless of the timeout.

Pull request: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4956

Build on o3: https://openqa.opensuse.org/tests/overview?version=15.0&build=poo35215_investigation&distri=opensuse&groupid=39

Actions #9

Updated by okurz over 6 years ago

  • Related to coordination #35302: [qe-core][opensuse][functional][epic][sporadic] Various unstable tests on o3 added
Actions #10

Updated by oorlov over 6 years ago

I've verified a lot of failed 'shutdown' modules on o3 and found out, that they could be divided into two groups (for now I'm not sure if it is caused by the same reason):

  • First group: KDE tests. After pressing 'ctrl+alt+del' the popup with 'power-off' button is not appeared and test fails after 15 seconds of waiting for it. For these tests I've made snapshots and will try investigate the problem locally.
  • Second group: Gnome tests. After pressing 'Power Off' button, test fails after 60 seconds of waiting in assert_shutdown. There are similar issue for sle15, with soft-fail mark bsc#1055462. So I'll try to increase timeout and see what will happen.
Actions #11

Updated by oorlov over 6 years ago

So, for KDE case I've created a ticket bsc#1091933, increased wait timeout and added soft-fail for case if waiting for "Power Off" popup to be appeared is more, than 15 seconds.

Gnome case seems to be different issue, as it is stuck on Plymouth splash screen while powering off the machine. Trying to investigate the root case of that behavior.

Also, I'm working on improving shutdown hook (to grab more information about the system in case of failed test).

Actions #12

Updated by oorlov over 6 years ago

Added pool request for workaround on KDE.

Need a time to investigate Gnome problem, as I slow down with the snapshots due to qemu issue. I've found workaround and can proceed further now.

Also I want to improve the shutdown fail hook (to add more info to the logs on).

Actions #13

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-05-08 to 2018-05-22
Actions #14

Updated by okurz over 6 years ago

  • Related to action #33571: [opensuse][functional][u][medium] test fails in shutdown - emoticon settings are opened added
Actions #15

Updated by oorlov over 6 years ago

Improved post_fail_hook with more logging was added. PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5048

Currently, the issue looks more like timeout issue. And it happened randomly to different test runs, regardless of desktop (gnome/kde), mostly for livecd installation and update OS.

I'm running today a lot of builds locally with different timeouts to find the timeout that will allow the test to check the shutdown properly.

Actions #16

Updated by okurz over 6 years ago

  • Related to action #36040: [opensuse][functional][u] test fails in shutdown - need a way to check after shutdown needle got matched but shutdown hasn't been executed for unknown reason added
Actions #17

Updated by okurz over 6 years ago

one of the latest examples: https://openqa.opensuse.org/tests/678386#step/shutdown/15

any news?

Actions #18

Updated by oorlov over 6 years ago

I've added PR with the increased timeout for Live CD, as it solves the issue.

I've executed 18 runs and all of them were passed.

Pull Request: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5094

Actions #19

Updated by oorlov over 6 years ago

  • Status changed from In Progress to Feedback
Actions #20

Updated by oorlov over 6 years ago

  • Status changed from Feedback to Resolved

I've checked two latest builds. Shutdown module stopped failing on Live CD due to timeout issue.

We still have several failed shutdown modules in other jobs, but they are related to another issue (e.g. poo#36150 ).

Actions #21

Updated by okurz over 6 years ago

  • Due date changed from 2018-05-22 to 2018-06-05
  • Status changed from Resolved to In Progress
  • Priority changed from Urgent to Normal

As decided in the sprint review meeting: You removed the urgency by introducing a higher timeout for the live media but we IMHO we have not solved the underlying issue or understood it well enough. This is something we should track better still.

Actions #22

Updated by riafarov over 6 years ago

  • Related to deleted (action #36040: [opensuse][functional][u] test fails in shutdown - need a way to check after shutdown needle got matched but shutdown hasn't been executed for unknown reason)
Actions #23

Updated by riafarov over 6 years ago

  • Blocks action #36040: [opensuse][functional][u] test fails in shutdown - need a way to check after shutdown needle got matched but shutdown hasn't been executed for unknown reason added
Actions #24

Updated by oorlov over 6 years ago

I've executed the tests in the following way (as suggested by @nsinger):

  1. Specify parameters on bootload: "systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on enforcing=0";
  2. Mount a drive to the system;
  3. Add 'debug.sh' script with the following content

#!/bin/sh
mount /dev/vda1 /tmp/test/
dmesg > /tmp/test/shutdown-log.txt
umount /dev/vda1

to /usr/lib/systemd/system-shutdown.

After the job finishes, mount a qcow2 image in the local system and copy shutdown-log.txt.

I've added logs of 3 failed jobs and 1 passed. Please see them in the attached 'shutdown-logs.tar.gz' archive.
Tests failed while shutdown timeout was set up to 60 sec.

Actions #25

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-06-05 to 2018-06-19
  • Target version changed from Milestone 16 to Milestone 17
Actions #26

Updated by oorlov over 6 years ago

After investigating the logs with @riafarov and @SLindoMansilla it was decided to create a bcs#1096241.

An appropriate workaround with soft-fail will be added to the tests.

Actions #27

Updated by oorlov over 6 years ago

  • Description updated (diff)
Actions #28

Updated by oorlov over 6 years ago

  • Description updated (diff)
Actions #29

Updated by okurz over 6 years ago

Please add your time estimation for the individual tasks

Actions #30

Updated by okurz over 6 years ago

  • Related to action #36150: [opensuse][functional][u][sporadic] test fails in shutdown on unsafe code looking for "sddm_shutdown_option_btn" added
Actions #31

Updated by oorlov over 6 years ago

  • Description updated (diff)
Actions #32

Updated by okurz over 6 years ago

  • Target version changed from Milestone 17 to Milestone 17
Actions #33

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-06-19 to 2018-07-03
Actions #34

Updated by oorlov over 6 years ago

Added PR with merged shutdown modules and divided shutdown to 'cleanup' and 'shutdown' parts.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5296

Also, I've made a debug part that should be called while running job with "DEBUG_SHUTDOWN" flag, but did not made a PR for it as it depends on the code from the previous PR.

Currently I'm working on the assert shutdown with soft fail.

Actions #35

Updated by okurz over 6 years ago

oorlov wrote:

Also, I've made a debug part that should be called while running job with "DEBUG_SHUTDOWN" flag, but did not made a PR for it as it depends on the code from the previous PR.

Please provide it nevertheless as WIP-PR. We won't merge it but we should provide feedback as early as possible, ok?

Actions #36

Updated by riafarov over 6 years ago

  • Related to deleted (action #36150: [opensuse][functional][u][sporadic] test fails in shutdown on unsafe code looking for "sddm_shutdown_option_btn")
Actions #37

Updated by riafarov over 6 years ago

  • Blocks action #36150: [opensuse][functional][u][sporadic] test fails in shutdown on unsafe code looking for "sddm_shutdown_option_btn" added
Actions #38

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-07-03 to 2018-07-31
Actions #39

Updated by oorlov over 6 years ago

[WIP]PR for shutdown debugging: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5314

I've also tried to make assert shutdown with soft fail message without changing os-autoinst code. But it seems like it is impossible, as the assert_shutdown also marks test as failed if the shutdown is not finished and then sends 'croak'. Even if catch the 'croak' with the 'eval' or 'try-catch', the test will proceed further (i.e. no post_fail_hook will be called), but it will be marked as failed anyway.

So, that part needs further discussion.

Actions #40

Updated by oorlov over 6 years ago

  • Status changed from In Progress to Feedback

Implemented assert_shutdown_soft_fail helper function: [WIP] PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5371

It also dependent on the PR in os-autoinst repo: https://github.com/os-autoinst/os-autoinst/pull/988

So, all the tasks for the ticket are covered. Now, waiting for the PRs to be reviewed.

Actions #41

Updated by oorlov over 6 years ago

I've added unit tests to os-autoinst project, as they were requested by the reviewers: https://github.com/os-autoinst/os-autoinst/pull/988

Actions #42

Updated by okurz over 6 years ago

  • Blocks deleted (action #36150: [opensuse][functional][u][sporadic] test fails in shutdown on unsafe code looking for "sddm_shutdown_option_btn")
Actions #43

Updated by okurz over 6 years ago

  • Related to action #36150: [opensuse][functional][u][sporadic] test fails in shutdown on unsafe code looking for "sddm_shutdown_option_btn" added
Actions #44

Updated by okurz over 6 years ago

  • Target version changed from Milestone 17 to Milestone 18
Actions #45

Updated by zluo over 6 years ago

  • Related to action #35011: [functional][sle][u][sporadic] test fails in user_defined_snapshot during reboot because shutdown is not working or it takes too long added
Actions #46

Updated by okurz over 6 years ago

  • Blocks action #36997: [opensuse][functional][u][wayland] test fails in shutdown - xterm does not start added
Actions #47

Updated by okurz over 6 years ago

  • Priority changed from Normal to High

-> blocked tickets

Actions #48

Updated by okurz over 6 years ago

  • Priority changed from High to Urgent

This seems to become more severe now. Please urgently visit current test failures, e.g.

https://openqa.opensuse.org/tests/715855#step/shutdown/2

Actions #49

Updated by oorlov over 6 years ago

So, what I've got for now:

  • The issue occurs on Live images only;
  • The issue started happening after my PR with cleanup_shutdown was merged. The difference between previous shutdown and current one is that instead of X11 Terminal the test opens tty console. And 'Stall detected' occurs while switching back to desktop;
  • Some of the tests failed with 'Test died: ensure_unlocked_desktop repeated too much. Check for X-server crash.', some with 'Stall was detected during assert_screen fail';
  • It is failing with 100% ratio on Krypton and Argon builds.
Actions #50

Updated by oorlov over 6 years ago

  • Status changed from Feedback to In Progress
Actions #51

Updated by oorlov over 6 years ago

After deeper investigation I've found that it is not related to 'cleanup_before_shutdown' module.

The issue seems to be appear, because of console selecting in 'power_action' function. In the scenario, Wayland is used, but in power_action function it tries to select X11 console. Currently, I'm trying to reproduce the issue locally by removing all unnecessary modules in between to ensure, that the issue is related to the console.

Actions #52

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-07-31 to 2018-08-14

to remove urgency it might make sense to just stop execution of some modules to make it pass and solve it properly afterwards

Actions #53

Updated by oorlov over 6 years ago

fvogt made a PR, that fixes an issue with setting up the console for Wayland session.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5496

Actions #54

Updated by oorlov over 6 years ago

  • Status changed from In Progress to Feedback
Actions #55

Updated by oorlov over 6 years ago

  • Blocks deleted (action #36040: [opensuse][functional][u] test fails in shutdown - need a way to check after shutdown needle got matched but shutdown hasn't been executed for unknown reason)
Actions #56

Updated by oorlov over 6 years ago

  • Subject changed from [functional][u][sporadic][medium] test fails in shutdown of new leap live image tests to [functional][u][epic][medium] test fails on shutdown module
Actions #57

Updated by oorlov over 6 years ago

  • Related to action #36040: [opensuse][functional][u] test fails in shutdown - need a way to check after shutdown needle got matched but shutdown hasn't been executed for unknown reason added
Actions #58

Updated by oorlov over 6 years ago

  • Description updated (diff)

PR that allows to gather more logs on shutdown: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5314. Merged.

PR that allows to set shutdown timeout to any value by passing it to DEBUG_SHUTDOWN parameter: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5578

Actions #59

Updated by okurz over 6 years ago

  • Due date changed from 2018-08-14 to 2018-08-28

bulk move to next sprint as could not be discussed in SR

Actions #60

Updated by SLindoMansilla over 6 years ago

  • Status changed from Feedback to In Progress
  • No PR open
  • not waiting for a verification run
Actions #61

Updated by oorlov over 6 years ago

PR on os-autoinst, that introduces check_shutdown function: https://github.com/os-autoinst/os-autoinst/pull/988
PR on os-autoinst-distri-opensuse, that is dependent on the PR above: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5371

Actions #62

Updated by oorlov over 6 years ago

  • Related to action #40055: [functional][u][bsc#1105517][sporadic][fast] - running application preventing the system from shutdown on otherDE_mate added
Actions #64

Updated by oorlov over 6 years ago

The PR in os-autoinst with check_shutdown has been merged.

So, currently I'm waiting for it to be appeared on o3 and on osd. Then my PR in os-autoinst-distri-opensuse could be merged: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5371

Actions #65

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-08-28 to 2018-09-11
Actions #66

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-09-11 to 2018-09-25
  • Target version changed from Milestone 18 to Milestone 19
Actions #67

Updated by okurz over 6 years ago

I checked the workers and both the osd and o3 workers have the os-autoinst change – have found #40880 though in the process. Please rebase your PR for the tests and expedite the merging.

Actions #68

Updated by oorlov over 6 years ago

  • Status changed from In Progress to Feedback

The PR that introduces assert_shutdown_with_soft_timeout merged: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5371

After several successful builds, this ticket could be finally closed, as all the requirements from this ticket are covered.

Further shutdown issues could be opened as alone tickets as for other issues.

Actions #69

Updated by okurz over 6 years ago

oorlov wrote:

After several successful builds, this ticket could be finally closed, as all the requirements from this ticket are covered.

Fine, don't hesitate to actually close it :) Latest job https://openqa.opensuse.org/tests/752806 in the originally mentioned scenario is fine regarding shutdown.

Actions #70

Updated by okurz over 6 years ago

  • Status changed from Feedback to Resolved
Actions #71

Updated by okurz about 6 years ago

  • Related to action #43880: [functional][u][s390x][sporadic] test fails in shutdown on s390x added
Actions #72

Updated by szarate over 4 years ago

  • Tracker changed from action to coordination
  • Difficulty deleted (medium)
Actions

Also available in: Atom PDF