action #37000

closed

[opensuse][functional][u][sporadic] test fails in reboot_plasma5 - either stuck in shutdown or not enough waiting time for grub2?

Added by mloviska over 6 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version: SUSE QA (private) - Milestone 18
Start date: 2018-06-08
Due date: 2018-09-25
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-update_Leap_42.1_kde@64bit fails in
reboot_plasma5

Reproducible

Fails since (at least) Build 20180606 (current job)

Expected result

Last good: 20180605 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 2 (0 open, 2 closed)

Related to openQA Tests (public) - action #37003: [opensuse][functional][u][sporadic] test fails in network_configuration - xterm does not start | Resolved | okurz | 2018-06-08 | 2018-07-31

Copied to openQA Tests (public) - action #47246: [opensuse][functional][y] Get rid of TIMEOUT_SCALE in kde testsuite on o3 if still there or adjust test suite | Rejected | riafarov | 2019-03-26

Actions #1

Updated by okurz over 6 years ago

  • Due date set to 2018-07-17
  • Target version set to Milestone 17
Actions #2

Updated by okurz over 6 years ago

  • Target version changed from Milestone 17 to Milestone 17
Actions #3

Updated by okurz over 6 years ago

  • Target version changed from Milestone 17 to Milestone 18
Actions #4

Updated by okurz over 6 years ago

  • Due date changed from 2018-07-17 to 2018-07-31

It's hackweek time!

Actions #5

Updated by zluo over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to zluo

take over

Actions #6

Updated by zluo over 6 years ago

https://openqa.opensuse.org/tests/700235#step/reboot_plasma5/5

The run from 3 days ago still shows this issue. The latest test run looks good.

Actions #7

Updated by okurz over 6 years ago

  • Subject changed from [opensuse][functional][u] test fails in reboot_plasma5 - extend wait time to load grub2 to [opensuse][functional][u][sporadic] test fails in reboot_plasma5 - extend wait time to load grub2

From the job history I can see that the test module failure is sporadic, i.e. we need better statistics than just single jobs -> https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation

Actions #8

Updated by okurz over 6 years ago

  • Assignee changed from zluo to okurz

looking into it.

Actions #10

Updated by okurz over 6 years ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by okurz over 6 years ago

I see. Thanks for the observation. I guess we need to handle the longish shutdown better -> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5314 from @oorlov

Actions #14

Updated by okurz over 6 years ago

  • Related to action #37003: [opensuse][functional][u][sporadic] test fails in network_configuration - xterm does not start added
Actions #15

Updated by okurz over 6 years ago

  • Subject changed from [opensuse][functional][u][sporadic] test fails in reboot_plasma5 - extend wait time to load grub2 to [opensuse][functional][u][sporadic] test fails in reboot_plasma5 - either stuck in shutdown or not enough waiting time for grub2?
  • Due date changed from 2018-07-31 to 2018-08-14
  • Status changed from Workable to Blocked
Actions #16

Updated by okurz over 6 years ago

  • Status changed from Blocked to Feedback

Seeing the latest example https://openqa.opensuse.org/tests/714457#step/reboot_plasma5/3 this again clearly looks like we just do not wait long enough.

-> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5476

Actions #17

Updated by okurz over 6 years ago

PR merged.

The latest job ran before the PR was merged, so I guess we need to collect better statistics … by waiting.

Actions #18

Updated by oorlov over 6 years ago

Finally the PR with the ability to gather more logs on shutdown is merged.

So, I've added the DEBUG_SHUTDOWN=1 property to the 'kde' scenarios.

Let's see after several executions if it will give us some meaningful logs.

Actions #19

Updated by okurz over 6 years ago

  • Assignee changed from okurz to oorlov

Please closely monitor the scenario then to make sure this does not introduce even more failures.

Actions #20

Updated by okurz over 6 years ago

I recommend making use of the weekend capacity and triggering some more jobs for statistical investigation on o3 to crosscheck.

Actions #21

Updated by oorlov over 6 years ago

I've increased the timeouts with TIMEOUT_SCALE=3 for the test suite and the test passed.

https://openqa.opensuse.org/tests/724980
https://openqa.opensuse.org/tests/724981

The long wait (~90 sec) during shutdown occurs between these two steps:

[ 3159.891194] display-manager[5540]: Shutting down service sddm..done
[ 3248.440114] systemd[1807]: dbus.service: State 'stop-final-sigterm' timed out. Killing.
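
The size of that gap can be read directly from the bracketed timestamps. The following is a minimal Python sketch, assuming the leading `[ seconds.micros]` field is the only timestamp on each line; the helper names `timestamp_seconds` and `gap_seconds` are made up for illustration and are not part of openQA or os-autoinst.

```python
import re

# Matches the leading "[ seconds.micros]" timestamp of a log line.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamp_seconds(line: str) -> float:
    """Return the leading timestamp of a log line in seconds."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return float(match.group(1))


def gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two log lines."""
    return timestamp_seconds(later) - timestamp_seconds(earlier)


# The two log lines quoted above.
sddm_done = "[ 3159.891194] display-manager[5540]: Shutting down service sddm..done"
dbus_killed = (
    "[ 3248.440114] systemd[1807]: dbus.service: "
    "State 'stop-final-sigterm' timed out. Killing."
)

print(f"{gap_seconds(sddm_done, dbus_killed):.1f} s")  # prints "88.5 s"
```

This matches the "~90 sec" estimate: the wait ends when systemd gives up on dbus.service and kills it.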

Actions #22

Updated by okurz over 6 years ago

  • Due date changed from 2018-08-14 to 2018-08-28

bulk move to next sprint as could not be discussed in SR

Actions #23

Updated by SLindoMansilla over 6 years ago

okurz, why is this ticket in "Feedback"? I cannot see any open PR, nor is it waiting for any verification run. Can I change the status to "In Progress"?

Actions #24

Updated by okurz over 6 years ago

oorlov wrote:

I've increased the timeouts with TIMEOUT_SCALE=3 for the test suite and the test passed.

https://openqa.opensuse.org/tests/724980
https://openqa.opensuse.org/tests/724981

The long wait (~90 sec) during shutdown occurs between these two steps:

[ 3159.891194] display-manager[5540]: Shutting down service sddm..done
[ 3248.440114] systemd[1807]: dbus.service: State 'stop-final-sigterm' timed out. Killing.

I am not sure if we actually need any change in os-autoinst although your PR there looks fine.

For issues like the one above I recommend debugging further what causes the long wait in between and reporting a bug.

Can you take a look into https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/force_scheduled_tasks.pm#L29 and apply the same method here? It should be ok to just call assert_shutdown with a huge timeout and record a soft failure depending on the actual elapsed time.
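
The idea of "huge timeout plus soft failure on elapsed time" can be sketched independently of the test API. The Python below is only an illustration, not the os-autoinst Perl code: `wait_and_classify`, the `wait_for_shutdown` callback, and the 60 s / 600 s thresholds are all hypothetical stand-ins.

```python
import time
from typing import Callable, List


def wait_and_classify(
    wait_for_shutdown: Callable[[float], bool],
    expected_seconds: float,
    hard_timeout_seconds: float,
    soft_failures: List[str],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Wait with a generous hard timeout; soft-fail if the wait was merely slow.

    wait_for_shutdown(timeout) is a hypothetical callback that blocks until the
    machine is off and returns True, or returns False once the timeout expires.
    """
    start = clock()
    if not wait_for_shutdown(hard_timeout_seconds):
        # Only a wait beyond the generous limit is a hard failure.
        raise TimeoutError(f"no shutdown within {hard_timeout_seconds} s")
    elapsed = clock() - start
    if elapsed > expected_seconds:
        # Slow but successful: record it instead of failing the test.
        soft_failures.append(
            f"shutdown took {elapsed:.0f} s, expected at most {expected_seconds:.0f} s"
        )
    return elapsed


# Simulated run: a fake clock makes the shutdown appear to take 90 seconds.
ticks = iter([0.0, 90.0])
soft_failures: List[str] = []
elapsed = wait_and_classify(
    wait_for_shutdown=lambda timeout: True,
    expected_seconds=60,
    hard_timeout_seconds=600,
    soft_failures=soft_failures,
    clock=lambda: next(ticks),
)
print(elapsed, soft_failures)
```

With this split, a slow shutdown stays visible in the results as a soft failure while only a shutdown that never completes fails the module.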

Actions #25

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-08-28 to 2018-09-11
Actions #26

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-09-11 to 2018-09-25

let's discuss the state offline

Actions #27

Updated by oorlov over 6 years ago

  • Status changed from Feedback to Resolved

After updating the shutdown module and scaling the timeouts with TIMEOUT_SCALE=3, the test never failed in the reboot_plasma5 or shutdown modules (in more than the last 20 builds).

I've checked the job in the last 6 builds; shutdown never took more than 15 seconds to finish.

I'm closing the ticket as 'Resolved', as the issue no longer reproduces.

Actions #28

Updated by okurz over 6 years ago

Good observation. Do we actually set TIMEOUT_SCALE anywhere or was this just used for investigation?

Actions #29

Updated by okurz almost 6 years ago

  • Copied to action #47246: [opensuse][functional][y] Get rid of TIMEOUT_SCALE in kde testsuite on o3 if still there or adjust test suite added
Actions #30

Updated by okurz almost 5 years ago

As I realized myself, I can answer my question in #37000#note-28: We do set "TIMEOUT_SCALE=3" in tests, e.g. as visible in https://openqa.opensuse.org/tests/1171538# which the test suite description also reflects. TIMEOUT_SCALE is meant as a temporary measure or for really slow workers, which we should not have at all in production. Where necessary we should bump internal timeouts, then remove TIMEOUT_SCALE and also adjust the testsuite settings please. I recorded this in #63388 rather than reopening this ticket, which is already a bit old.
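
To illustrate why bumping one internal timeout is preferable to a global factor, here is a small Python sketch. It assumes a simplified model in which the scale factor multiplies every timeout; the step names and the per-step values are hypothetical and not taken from the test code.

```python
def effective_timeout(base_seconds: float, timeout_scale: float = 1.0) -> float:
    """Simplified model: a global scale factor multiplies every timeout."""
    return base_seconds * timeout_scale


# Hypothetical per-step timeouts in seconds, for illustration only.
timeouts = {"shutdown": 60, "grub2": 30, "login": 30}

# Global workaround: every wait becomes three times as lenient.
scaled = {step: effective_timeout(t, timeout_scale=3) for step, t in timeouts.items()}

# Targeted fix: bump only the wait that needs it and drop the global factor.
bumped = {**timeouts, "shutdown": 180}
targeted = {step: effective_timeout(t) for step, t in bumped.items()}

print(scaled)
print(targeted)
```

Both variants give the slow step the same allowance, but only the targeted one keeps the other waits strict, so unrelated slowdowns are still caught.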
