action #43880

[functional][u][s390x][sporadic] test fails in shutdown on s390x

Added by oorlov almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Target version: SUSE QA - Milestone 23
Start date: 2018-11-16
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP1-Installer-DVD-s390x-create_hdd_gnome@s390x-kvm-sle12 fails in shutdown

Reproducible

Fails since (at least) Build 96.7 (current job)

Expected result

Last good: 95.1 (or more recent)

Further details


Related issues

Related to openQA Tests - coordination #35215: [functional][u][epic][medium] test fails on shutdown module (Resolved; start 2018-04-19, due 2018-09-25)

Related to openQA Tests - action #43064: [functional][u] test fails in boot_into_snapshot and reboot_gnome with encrypted setup - test not waiting long enough for shutdown/reboot until we end up in grub which shows in post_fail_hook (Resolved; start 2018-10-30)

Related to openQA Tests - action #43616: [functional][u][sporadic] test fails in shutdown - SUT took longer than 60 seconds to shutdown, no logs available (non-s390x) (Rejected; start 2018-11-09)

Related to openQA Tests - action #41183: [functional][u] soft-fail in shutdown should have valid bugref (Resolved; start 2018-09-18)

Related to openQA Tests - action #42038: [sle][functional[u] test fails in shutdown - add post_fail_hook for shutdown module if possible (Resolved; start 2018-10-05)

Related to openQA Tests - action #38108: [openqa][kernel] Power down needs to have longer timeout due longer shutdown of BTRFS related service (Resolved; start 2018-07-03)

Related to openQA Tests - action #35892: [qe-core][functional][hard] test fails in kdump_and_crash - improve bootup/shutdown debugging approach (Rejected; start 2018-05-04)

Related to openQA Tests - action #46076: [sle][functional][u][medium] test fails in shutdown on minimalx (Resolved; start 2019-01-14)

History

#1 Updated by okurz almost 3 years ago

  • Status changed from New to Workable
  • Priority changed from Normal to High

Hm, there is check_shutdown(60), which again is not that long. I recommend bumping the timeout and cross-checking with the code branches that might already wait longer.
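
For illustration, a minimal sketch of a bounded wait with a scalable timeout, written in Python rather than the test suite's own language. The names `wait_for_shutdown`, `is_shut_off` and `scale` are hypothetical and this is not the actual check_shutdown implementation; the `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_shutdown(
    is_shut_off: Callable[[], bool],
    timeout: float = 60.0,
    scale: float = 1.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_shut_off() until it is true or timeout * scale seconds pass.

    Returns True if the machine was seen shut off in time, False otherwise.
    """
    deadline = clock() + timeout * scale
    while True:
        if is_shut_off():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Bumping the timeout then means changing `timeout` or `scale` at the call site, which keeps the polling logic itself untouched.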

#2 Updated by okurz almost 3 years ago

  • Target version set to Milestone 22

#3 Updated by mgriessmeier almost 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mgriessmeier

I will take a look into this,
though the 60s seems to have worked in the past, so it might be a different issue. Let's investigate.

#4 Updated by okurz almost 3 years ago

Try SCHEDULE=tests/boot/boot_to_desktop,tests/shutdown/shutdown, or add "bootloader_zkvm" in there.

#5 Updated by mgriessmeier almost 3 years ago

okurz wrote:

Try SCHEDULE=tests/boot/boot_to_desktop,tests/shutdown/shutdown, or add "bootloader_zkvm" in there.

The problem is that the only shutdown failures I can find are on create_hdd jobs, so we might miss the issue if we just use a qcow image.
So I will first schedule the whole job once and grab the image from there.

#6 Updated by okurz almost 3 years ago

well, I thought the "boot+shutdown" test would be super-fast to finish so you can schedule 100 runs easily on production and reject the hypothesis that it can also appear in the image-booting scenarios.

#7 Updated by mgriessmeier almost 3 years ago

Triggered 50 jobs with custom scheduling; so far 2 of 11 have failed =)
http://opeth.suse.de/tests/overview?build=107.5&distri=sle&version=15-SP1

#8 Updated by mgriessmeier almost 3 years ago

mgriessmeier wrote:

Triggered 50 jobs with custom scheduling; so far 2 of 11 have failed =)
http://opeth.suse.de/tests/overview?build=107.5&distri=sle&version=15-SP1

So 8 out of 50 failed. I will investigate those and trigger 50 more with an increased timeout to see whether the failure rate is reduced or the failures are gone completely.

#9 Updated by mgriessmeier almost 3 years ago

OK, with double the timeout it's 1 failure out of 50.
I will go with a factor of 3, which we also use for other cases.

#10 Updated by okurz almost 3 years ago

  • Related to coordination #35215: [functional][u][epic][medium] test fails on shutdown module added

#11 Updated by okurz almost 3 years ago

  • Related to action #43064: [functional][u] test fails in boot_into_snapshot and reboot_gnome with encrypted setup - test not waiting long enough for shutdown/reboot until we end up in grub which shows in post_fail_hook added

#12 Updated by okurz almost 3 years ago

  • Related to action #43616: [functional][u][sporadic] test fails in shutdown - SUT took longer than 60 seconds to shutdown, no logs available (non-s390x) added

#13 Updated by okurz almost 3 years ago

  • Related to action #41183: [functional][u] soft-fail in shutdown should have valid bugref added

#14 Updated by okurz almost 3 years ago

  • Related to action #42038: [sle][functional[u] test fails in shutdown - add post_fail_hook for shutdown module if possible added

#15 Updated by okurz almost 3 years ago

  • Related to action #38108: [openqa][kernel] Power down needs to have longer timeout due longer shutdown of BTRFS related service added

#16 Updated by okurz almost 3 years ago

  • Related to action #35892: [qe-core][functional][hard] test fails in kdump_and_crash - improve bootup/shutdown debugging approach added

#17 Updated by okurz almost 3 years ago

custom scheduling rocks, right? ;)

Good evaluation. Please see the related issues as well. On the one hand we should fix the failing test, which we can do by bumping the timeout; on the other hand we should also provide bug reports or help with the investigation of already existing ones. The approach you found might be especially helpful for that.

oorlov, as the "shutdown specialist", what suggestions can you give on how to investigate the issue further, or on how to track the corresponding bugs with soft-fails?

mgriessmeier, maybe just find the corresponding bugs about "shutdown takes too long" and comment there with suggestions on how to investigate, e.g. point to the openqa-clone-job commands you used to reproduce this easily.

#18 Updated by mgriessmeier almost 3 years ago

  • Priority changed from High to Normal

With the triple timeout, 2 out of 50 still fail, but since it improves the situation significantly, here is the PR:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6374

I am therefore reducing the urgency and continuing with the suggestions from okurz.
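
To put the three 50-job batches reported in this ticket (8, 1 and 2 failures) side by side, here is a rough sketch that computes the observed failure rate and a 95% Wilson score interval for each. The function name `wilson_interval` and the batch labels are illustrative only, and the interval is a generic statistical estimate, not something the test suite computes.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the Wilson score interval for a failure proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0 or not 0 <= failures <= runs:
        raise ValueError("need 0 <= failures <= runs and runs > 0")
    p = failures / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Failure counts as reported in the comments above, 50 runs each.
batches = {"60s (baseline)": 8, "double timeout": 1, "triple timeout": 2}

for label, failed in batches.items():
    low, high = wilson_interval(failed, 50)
    print(f"{label}: {failed}/50 = {failed / 50:.0%}, 95% interval {low:.1%} to {high:.1%}")
```

With only 50 runs per batch the intervals are wide, so the difference between 1 and 2 failures for the doubled and tripled timeout is well within noise, while both are clearly lower than the baseline rate.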

#19 Updated by mgriessmeier almost 3 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (mgriessmeier)

Urgency was removed - not working on the followup atm

#20 Updated by okurz over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_textmode@zkvm
https://openqa.suse.de/tests/2347165

#21 Updated by okurz over 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_gnome
https://openqa.suse.de/tests/2369949

#22 Updated by zluo over 2 years ago

My PR (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6702 , see #42038 for details) might be helpful for checking the cause of the shutdown issue.

#23 Updated by szarate over 2 years ago

I wonder if setting the $soft_fail_data hash to have a hard timeout of 180 and a soft timeout of 60 would work in this case. Today we have had 3 failures so far, most of them taking around 2 minutes between the moment the poweroff command is sent and the moment virsh reports that the SUT is shut off.
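
A minimal sketch of the soft/hard timeout idea, assuming the measured shutdown duration is available in seconds. The function `classify_shutdown` and its result strings are hypothetical and do not reflect the actual structure of $soft_fail_data; only the 60 and 180 second values come from the comment above.

```python
def classify_shutdown(
    elapsed_s: float,
    soft_timeout_s: float = 60.0,
    hard_timeout_s: float = 180.0,
) -> str:
    """Classify a shutdown by how long it took.

    "ok"       - finished within the soft timeout
    "softfail" - slower than the soft timeout but within the hard timeout
    "fail"     - exceeded the hard timeout
    """
    if soft_timeout_s > hard_timeout_s:
        raise ValueError("soft timeout must not exceed hard timeout")
    if elapsed_s <= soft_timeout_s:
        return "ok"
    if elapsed_s <= hard_timeout_s:
        return "softfail"
    return "fail"
```

With these values, a shutdown that takes about 2 minutes, as in the failures described above, would be recorded as a soft-fail rather than a hard failure.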

#24 Updated by okurz over 2 years ago

  • Target version changed from Milestone 22 to Milestone 23

#25 Updated by SLindoMansilla over 2 years ago

  • Subject changed from [functional][u][s390x] test fails in shutdown on s390x to [functional][u][s390x][sporadic] test fails in shutdown on s390x

New occurrence on OSD: https://openqa.suse.de/tests/2496334
It seems to be sporadic.

#26 Updated by zluo over 2 years ago

  • Assignee set to zluo

I filed ticket #48545 yesterday; maybe it is a related issue. Let me check and give an update here.

#27 Updated by zluo over 2 years ago

  • Status changed from Workable to In Progress

#28 Updated by zluo over 2 years ago

#29 Updated by zluo over 2 years ago

WIP PR:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6940

with extended timeout for check_shutdown:

Created job #2508629: sle-15-SP1-Installer-DVD-s390x-Build178.3-create_hdd_gnome@s390x-kvm-sle12 -> https://openqa.suse.de/t2508629

#31 Updated by okurz over 2 years ago

  • Priority changed from Normal to High

Back to "High", as this is still (or again) linked to currently failing tests.

#32 Updated by okurz over 2 years ago

  • Related to action #46076: [sle][functional][u][medium] test fails in shutdown on minimalx added

#33 Updated by zluo over 2 years ago

My WIP PR, which was tested on osd (https://openqa.suse.de/t2508629), shows no issue. On my local server I also tested:

http://f40.suse.de/tests/1526#next_previous shows all successful test runs, 100 in total. A couple of tests failed at the very beginning in bootloader_zkvm, which is a different issue.

#35 Updated by zluo over 2 years ago

  • Status changed from In Progress to Feedback

Waiting for the PR to be merged.

#37 Updated by SLindoMansilla over 2 years ago

I don't know how to interpret the link from okurz in https://progress.opensuse.org/issues/43880#note-36

Waiting for verification run on OSD: https://openqa.suse.de/tests/overview?build=poo43880_osd_verification

#38 Updated by zluo over 2 years ago

  • Status changed from Feedback to Resolved

#39 Updated by SLindoMansilla over 2 years ago

  • Status changed from Resolved to Feedback

That is not enough, because we need to verify a sporadic bug.

We need to verify with 100 jobs.

Waiting for: https://openqa.suse.de/tests/overview?build=poo43880_osd_verification

Please feel free to assign the ticket to me if you don't want to keep it, so that I can take care of updating the status when the jobs finish.
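
As a rough illustration of why the number of verification runs matters for a sporadic failure: if each run fails independently with probability p, the chance that N runs all pass despite the bug still being present is (1 - p)^N. The sketch below assumes independent runs and uses failure rates taken from the batches reported earlier in this ticket; the function names are illustrative.

```python
import math


def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Probability that all runs pass if each fails independently at failure_rate."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, alpha: float = 0.05) -> int:
    """Smallest number of runs for which an all-pass result has probability <= alpha."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - failure_rate))


for failed in (8, 2, 1):  # failures per 50 runs seen earlier in this ticket
    rate = failed / 50
    print(f"rate {rate:.0%}: P(100 passes) = {prob_all_pass(rate, 100):.4f}, "
          f"runs for 95% confidence = {runs_needed(rate)}")
```

Under these assumptions, 100 passing runs would be fairly convincing against a residual failure rate of 4% (all-pass probability of roughly 2%), but less so against 2% (roughly 13%).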

#40 Updated by zluo over 2 years ago

I have already checked this over 100 times: http://f40.suse.de/tests/1526#next_previous

Please see my previous comment; this has already been tested on osd with the WIP PR.

I think this is enough, but if you want to test more, please go ahead...

#41 Updated by zluo over 2 years ago

  • Status changed from Feedback to Resolved

verification tests on osd look good. Resolved.
