action #43880

closed

[functional][u][s390x][sporadic] test fails in shutdown on s390x

Added by oorlov about 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Target version: SUSE QA (private) - Milestone 23
Start date: 2018-11-16
Due date:
% Done: 0%
Estimated time:
Difficulty:
Description

Observation

openQA test in scenario sle-15-SP1-Installer-DVD-s390x-create_hdd_gnome@s390x-kvm-sle12 fails in shutdown

Reproducible

Fails since (at least) Build 96.7 (current job)

Expected result

Last good: 95.1 (or more recent)

Further details


Related issues 8 (0 open, 8 closed)

  • Related to openQA Tests (public) - coordination #35215: [functional][u][epic][medium] test fails on shutdown module (Resolved, oorlov, 2018-04-19, 2018-09-25)
  • Related to openQA Tests (public) - action #43064: [functional][u] test fails in boot_into_snapshot and reboot_gnome with encrypted setup - test not waiting long enough for shutdown/reboot until we end up in grub which shows in post_fail_hook (Resolved, mgriessmeier, 2018-10-30)
  • Related to openQA Tests (public) - action #43616: [functional][u][sporadic] test fails in shutdown - SUT took longer than 60 seconds to shutdown, no logs available (non-s390x) (Rejected, zluo, 2018-11-09)
  • Related to openQA Tests (public) - action #41183: [functional][u] soft-fail in shutdown should have valid bugref (Resolved, jorauch, 2018-09-18)
  • Related to openQA Tests (public) - action #42038: [sle][functional[u] test fails in shutdown - add post_fail_hook for shutdown module if possible (Resolved, zluo, 2018-10-05)
  • Related to openQA Tests (public) - action #38108: [openqa][kernel] Power down needs to have longer timeout due longer shutdown of BTRFS related service (Resolved, 2018-07-03)
  • Related to openQA Tests (public) - action #35892: [qe-core][functional][hard] test fails in kdump_and_crash - improve bootup/shutdown debugging approach (Rejected, SLindoMansilla, 2018-05-04)
  • Related to openQA Tests (public) - action #46076: [sle][functional][u][medium] test fails in shutdown on minimalx (Resolved, jorauch, 2019-01-14)

Actions #1

Updated by okurz about 6 years ago

  • Status changed from New to Workable
  • Priority changed from Normal to High

Hm, there is check_shutdown(60), which again is not that long. I recommend bumping the timeout and cross-checking against the code branches that might already wait longer.
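
To illustrate what bumping the timeout changes, here is a minimal Python sketch of a poll-until-deadline loop. It is not the actual check_shutdown implementation (which lives in the Perl test framework and is not shown in this ticket); the function name wait_for_shutdown and all of its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_for_shutdown(is_shut_off: Callable[[], bool],
                      timeout: float = 60.0,
                      interval: float = 1.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_shut_off() until it returns True or `timeout` seconds pass.

    Hypothetical helper: `is_shut_off` stands in for whatever the backend
    uses to query the machine state. `clock` and `sleep` are injectable so
    the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_shut_off():
            return True          # machine reported as shut off in time
        if clock() >= deadline:
            return False         # gave up: shutdown took longer than timeout
        sleep(interval)
```

With a machine that needs about 90 seconds to power off, a 60 second timeout returns False while a tripled timeout returns True, which is the effect a larger timeout has on the test result.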

Actions #2

Updated by okurz about 6 years ago

  • Target version set to Milestone 22

Actions #3

Updated by mgriessmeier about 6 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mgriessmeier

I will take a look into this, though the 60s seems to have worked in the past. It might be a different issue - let's investigate.

Actions #4

Updated by okurz about 6 years ago

Try SCHEDULE=tests/boot/boot_to_desktop,tests/shutdown/shutdown, or add "bootloader_zkvm" in there.

Actions #5

Updated by mgriessmeier about 6 years ago

okurz wrote:

Try SCHEDULE=tests/boot/boot_to_desktop,tests/shutdown/shutdown, or add "bootloader_zkvm" in there.

The problem is that the only failures in shutdown I can find are on create_hdd jobs, so we might miss the issue if we just use a qcow image.
So I will first schedule the whole job once and grab the image from there.

Actions #6

Updated by okurz about 6 years ago

well, I thought the "boot+shutdown" test would be super-fast to finish so you can schedule 100 runs easily on production and reject the hypothesis that it can also appear in the image-booting scenarios.
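
A rough sketch of how such a batch could be prepared: the Python snippet below only builds the command lines and does not run them. It assumes the openqa-clone-job tool with its --from option and KEY=VALUE setting overrides; the job id and the BUILD label are hypothetical placeholders, and the SCHEDULE value is the one suggested above.

```python
import shlex


def clone_commands(job_id, runs, build, schedule,
                   source="https://openqa.suse.de"):
    """Return `runs` identical openqa-clone-job command lines (not executed).

    Each command clones the same job and overrides SCHEDULE so that only
    the listed test modules run, plus a BUILD label to group the results.
    """
    settings = [f"SCHEDULE={schedule}", f"BUILD={build}"]
    cmd = ["openqa-clone-job", "--from", source, str(job_id), *settings]
    return [shlex.join(cmd) for _ in range(runs)]


if __name__ == "__main__":
    for line in clone_commands(
            job_id=12345,                      # hypothetical job id
            runs=100,
            build="poo43880_investigation",    # hypothetical build label
            schedule="tests/boot/boot_to_desktop,tests/shutdown/shutdown"):
        print(line)
```

Using a dedicated BUILD value makes all runs show up on one overview page, so the failure count can be read off directly.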

Actions #7

Updated by mgriessmeier about 6 years ago

Triggered 50 jobs with a custom schedule - so far 2 of 11 failed =)
http://opeth.suse.de/tests/overview?build=107.5&distri=sle&version=15-SP1

Actions #8

Updated by mgriessmeier about 6 years ago

mgriessmeier wrote:

Triggered 50 jobs with a custom schedule - so far 2 of 11 failed =)
http://opeth.suse.de/tests/overview?build=107.5&distri=sle&version=15-SP1

So 8 out of 50 failed. I will investigate those and trigger 50 more with an increased timeout to see whether the failure rate is reduced or the failures are completely gone.

Actions #9

Updated by mgriessmeier about 6 years ago

OK, with double the timeout it's 1 failure out of 50...
I will go with *3, which we also have for other cases...

Actions #10

Updated by okurz about 6 years ago

  • Related to coordination #35215: [functional][u][epic][medium] test fails on shutdown module added

Actions #11

Updated by okurz about 6 years ago

  • Related to action #43064: [functional][u] test fails in boot_into_snapshot and reboot_gnome with encrypted setup - test not waiting long enough for shutdown/reboot until we end up in grub which shows in post_fail_hook added

Actions #12

Updated by okurz about 6 years ago

  • Related to action #43616: [functional][u][sporadic] test fails in shutdown - SUT took longer than 60 seconds to shutdown, no logs available (non-s390x) added

Actions #13

Updated by okurz about 6 years ago

  • Related to action #41183: [functional][u] soft-fail in shutdown should have valid bugref added

Actions #14

Updated by okurz about 6 years ago

  • Related to action #42038: [sle][functional[u] test fails in shutdown - add post_fail_hook for shutdown module if possible added

Actions #15

Updated by okurz about 6 years ago

  • Related to action #38108: [openqa][kernel] Power down needs to have longer timeout due longer shutdown of BTRFS related service added

Actions #16

Updated by okurz about 6 years ago

  • Related to action #35892: [qe-core][functional][hard] test fails in kdump_and_crash - improve bootup/shutdown debugging approach added

Actions #17

Updated by okurz about 6 years ago

custom scheduling rocks, right? ;)

Good evaluation. Please see the related issues as well. On the one hand we should fix the failing test, which we can do by bumping the timeout; on the other hand we should also provide bug reports or help with the investigation of already existing ones. The approach you found might be especially helpful for that.

@oorlov, as the "shutdown specialist", what suggestions can you give on how to investigate the issue further or how to track the corresponding bugs with soft-fails?

@mgriessmeier maybe just find corresponding bugs about "shutdown takes too long" and comment in there with suggestions on how to investigate, e.g. point to the openqa-clone-job commands you used to reproduce easily.

Actions #18

Updated by mgriessmeier about 6 years ago

  • Priority changed from High to Normal

The triple timeout still fails 2 out of 50, but since it improves the situation significantly, here is the PR:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6374

Therefore reducing the urgency and continuing with the suggestions from okurz.
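
For a sense of how much the three measurements reported in this ticket (8/50 with the original timeout, 1/50 with double, 2/50 with triple) can tell us, the Python sketch below computes the failure rates together with an approximate 95% Wilson score interval. The choice of the Wilson interval is an assumption made for illustration, not something used in the ticket.

```python
import math


def wilson_interval(failures, runs, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate."""
    p = failures / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Failure counts reported in this ticket, 50 runs each.
    results = {"original timeout": 8, "double timeout": 1, "triple timeout": 2}
    for label, failures in results.items():
        low, high = wilson_interval(failures, 50)
        print(f"{label}: {failures}/50 = {failures / 50:.0%}, "
              f"95% interval {low:.1%} to {high:.1%}")
```

The intervals for the double and triple timeout overlap almost completely, so 50 runs are not enough to say which of the two is better; both are clearly lower than the point estimate for the original timeout.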

Actions #19

Updated by mgriessmeier about 6 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (mgriessmeier)

Urgency was removed - not working on the follow-up at the moment.

Actions #20

Updated by okurz almost 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_textmode@zkvm
https://openqa.suse.de/tests/2347165

Actions #21

Updated by okurz almost 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_gnome
https://openqa.suse.de/tests/2369949

Actions #22

Updated by zluo almost 6 years ago

My PR (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6702, see #42038 for details) might be helpful for finding the cause of the shutdown issue.

Actions #23

Updated by szarate almost 6 years ago

I wonder if setting the $soft_fail_data hash to have a hard timeout of 180 and a soft timeout of 60 would work in this case. Today we got 3 failures so far, most of them with about 2 minutes between the moment the poweroff command is sent and the moment virsh actually reports that the SUT is shut off...
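
The decision logic behind a soft and a hard timeout could look like the following Python sketch. The layout of $soft_fail_data is not shown in this ticket, so the function classify_shutdown and its parameter names are hypothetical; only the values 60 and 180 come from the comment above.

```python
def classify_shutdown(duration_s, soft_timeout=60, hard_timeout=180):
    """Map a measured shutdown duration (in seconds) to a test result.

    Hypothetical illustration of a two-level timeout:
    - within the soft timeout: the shutdown is fine
    - between soft and hard timeout: slow but finished, record a soft-fail
      (ideally pointing to a bug reference)
    - beyond the hard timeout: the test fails
    """
    if duration_s <= soft_timeout:
        return "ok"
    if duration_s <= hard_timeout:
        return "softfail"
    return "fail"


if __name__ == "__main__":
    for seconds in (45, 120, 200):
        print(seconds, classify_shutdown(seconds))
```

With these values, a shutdown that takes around 2 minutes, as observed in the failures mentioned above, would be recorded as a soft-fail instead of failing the job.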

Actions #24

Updated by okurz almost 6 years ago

  • Target version changed from Milestone 22 to Milestone 23

Actions #25

Updated by SLindoMansilla almost 6 years ago

  • Subject changed from [functional][u][s390x] test fails in shutdown on s390x to [functional][u][s390x][sporadic] test fails in shutdown on s390x

New occurrence on OSD: https://openqa.suse.de/tests/2496334
It seems to be sporadic.

Actions #26

Updated by zluo almost 6 years ago

  • Assignee set to zluo

Yesterday I filed ticket #48545; maybe it is a related issue. Let me check and give an update on this.

Actions #27

Updated by zluo almost 6 years ago

  • Status changed from Workable to In Progress

Actions #28

Updated by zluo almost 6 years ago

Actions #29

Updated by zluo almost 6 years ago

WIP PR:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6940

with an extended timeout for check_shutdown:

Created job #2508629: sle-15-SP1-Installer-DVD-s390x-Build178.3-create_hdd_gnome@s390x-kvm-sle12 -> https://openqa.suse.de/t2508629

Actions #31

Updated by okurz almost 6 years ago

  • Priority changed from Normal to High

back to "High" as still or again linked to currently failing tests.

Actions #32

Updated by okurz almost 6 years ago

  • Related to action #46076: [sle][functional][u][medium] test fails in shutdown on minimalx added

Actions #33

Updated by zluo almost 6 years ago

My WIP PR, which was tested on OSD (https://openqa.suse.de/t2508629), shows no issue, and I also tested on my local server:

http://f40.suse.de/tests/1526#next_previous shows all test runs successful, 100 in total; a couple of tests failed at the very beginning in bootloader_zkvm, which is a different issue.

Actions #35

Updated by zluo almost 6 years ago

  • Status changed from In Progress to Feedback

Waiting for the PR to be merged.

Actions #37

Updated by SLindoMansilla over 5 years ago

I don't know how to interpret the link from okurz in https://progress.opensuse.org/issues/43880#note-36

Waiting for verification run on OSD: https://openqa.suse.de/tests/overview?build=poo43880_osd_verification

Actions #38

Updated by zluo over 5 years ago

  • Status changed from Feedback to Resolved

Actions #39

Updated by SLindoMansilla over 5 years ago

  • Status changed from Resolved to Feedback

That is not enough, because we need to verify a sporadic bug.

We need to verify with 100 jobs.

Waiting for: https://openqa.suse.de/tests/overview?build=poo43880_osd_verification

Please, feel free to assign the ticket to me if you don't want to have it, so I take care of updating the status when the jobs finish.
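
Why a large number of runs matters for a sporadic failure can be sketched with a little arithmetic in Python. The sketch assumes independent runs with a fixed failure rate, which is a simplification; the rates 8/50 and 2/50 are the ones measured earlier in this ticket, and the function names are hypothetical.

```python
import math


def prob_all_pass(failure_rate, runs):
    """Chance that `runs` independent jobs all pass despite the bug."""
    return (1 - failure_rate) ** runs


def runs_needed(failure_rate, confidence=0.95):
    """Smallest number of runs after which a bug with the given failure
    rate would have shown up at least once with probability `confidence`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    for failures in (8, 2):          # failures per 50 runs, from this ticket
        rate = failures / 50
        print(f"rate {rate:.0%}: "
              f"P(100 passes in a row) = {prob_all_pass(rate, 100):.4f}, "
              f"runs for 95% confidence = {runs_needed(rate)}")
```

Under these assumptions, a failure rate of 2/50 would still let 100 jobs pass by chance in fewer than 2% of cases, so 100 green runs are reasonable evidence that the failure is gone or has become much rarer.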

Actions #40

Updated by zluo over 5 years ago

I already checked this over 100 times: http://f40.suse.de/tests/1526#next_previous

Please see my previous comment; this has also already been tested on OSD with the WIP PR.

I think this is enough, but if you want to do extra testing, please go ahead...

Actions #41

Updated by zluo over 5 years ago

  • Status changed from Feedback to Resolved

verification tests on osd look good. Resolved.
