action #169249

open

coordination #154768: [saga][epic][ux] State-of-art user experience for openQA

coordination #157345: [epic] Improved test reviewer user experience

[sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command[\s\S]*openqa-cli api jobs" size:S

Added by tinita about 2 months ago. Updated about 24 hours ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-11-04
Due date:
2024-12-31 (Due in 10 days)
% Done:

0%

Estimated time:

Description

Observation

openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_multimachine@64bit-4G fails in
test_running

Reproducible

Fails since (at least) Build :TW.32491 (current job)

Expected result

Last good: :TW.32490 (or more recent)

Further details

Always latest result in this scenario: latest

Suggestions


Related issues 2 (1 open, 1 closed)

Related to openQA Project (public) - action #169204: [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running size:S (Resolved, mkittler, 2024-11-01)

Related to openQA Project (public) - action #170296: [openqa-in-openqa][sporadic] test fails in test_running - ping_client is not complete size:S (Workable, 2024-11-26)
Actions #1

Updated by tinita about 2 months ago

  • Subject changed from [sporadic] [openqa-in-openqa] test fails in test_running - ping test fails to [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - ping test fails
Actions #2

Updated by tinita about 2 months ago

  • Related to action #169204: [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running size:S added
Actions #3

Updated by tinita about 1 month ago

I had a look at the logfiles, but was only able to spot a failing needle match.
We upload the test results directory, but the screenshots are symlinks to /var/lib/openqa/..., so they are not part of the uploaded tarball and we can't see what the screen looked like.
Maybe that can be improved.
We haven't seen the failure again so far.

Actions #4

Updated by okurz about 1 month ago

  • Status changed from New to Resolved
  • Assignee set to okurz

ok, thank you for looking into that. sometimes we just have to accept individual, spurious failures. Considering that otherwise https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-4G&test=openqa_install_multimachine&version=Tumbleweed#next_previous looks very green we should be ok to accept as is.

Actions #5

Updated by tinita about 1 month ago

  • Status changed from Resolved to New
Actions #6

Updated by livdywan about 1 month ago

  • Subject changed from [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - ping test fails to openqa_install_multimachine test fails in test_running - ping test fails size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by okurz about 1 month ago

  • Assignee deleted (okurz)
Actions #8

Updated by okurz about 1 month ago

  • Subject changed from openqa_install_multimachine test fails in test_running - ping test fails size:S to [sporadic] openqa_install_multimachine test fails in test_running - ping test fails size:S
  • Priority changed from High to Normal

sporadic but shouldn't bother us over hack week and this is now mostly about test code improvement anyway.

Actions #9

Updated by okurz 17 days ago

  • Related to action #170296: [openqa-in-openqa][sporadic] test fails in test_running - ping_client is not complete size:S added
Actions #10

Updated by okurz 17 days ago

  • Tags set to reactive work
Actions #11

Updated by okurz 17 days ago

  • Parent task set to #166556
Actions #12

Updated by okurz 17 days ago

  • Parent task changed from #166556 to #157345
Actions #13

Updated by okurz 10 days ago

  • Priority changed from Normal to Urgent
Actions #14

Updated by tinita 9 days ago

Actions #15

Updated by ybonatakis 9 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #16

Updated by openqa_review 8 days ago

  • Due date set to 2024-12-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by ybonatakis 8 days ago · Edited

  • Due date deleted (2024-12-27)
  • Status changed from In Progress to Resolved

No failures on the main jobs on O3.
We also have the logs of the cloned jobs, including some screenshots, but they are not helpful because they apparently were not uploaded properly.
When I try to open them I see

Extraction of the entry:
    ‘testresults/00000/00000001-opensuse-Tumbleweed-DVD-x86_64-Build20241209-ping_server@64bit/boot_to_desktop-2.png’
failed with the error message:
    Hard-link target 'testresults/00000/00000002-opensuse-Tumbleweed-DVD-x86_64-Build20241209-ping_client@64bit/boot_to_desktop-1.png' does not exist.

Do you want to continue extraction?

The retry is terminated after a few attempts. Not sure where this termination comes from tho.
I wonder whether setting different retry params would change anything.
However, I will resolve this for now, considering that we have #169249#note-14 and can provide some info next time.
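
The extraction error above can be checked for without unpacking anything. This is a minimal sketch, assuming Python's standard `tarfile` module is acceptable for inspection; `find_dangling_links` is a hypothetical helper name, not part of openQA or the test distribution. It lists hard links and symlinks whose target is not stored in the archive.

```python
import posixpath
import tarfile


def find_dangling_links(archive_path):
    """Return (member name, link target) pairs whose target is not in the archive."""
    dangling = []
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        names = {posixpath.normpath(m.name) for m in members}
        for m in members:
            if m.islnk():
                # Hard-link targets are stored as paths relative to the archive root.
                target = posixpath.normpath(m.linkname)
            elif m.issym():
                # Symlink targets are relative to the directory holding the link;
                # absolute targets stay absolute and thus never match a member.
                target = posixpath.normpath(
                    posixpath.join(posixpath.dirname(m.name), m.linkname)
                )
            else:
                continue
            if target not in names:
                dangling.append((m.name, m.linkname))
    return dangling
```

An empty result would mean every link in the tarball resolves to a file that was actually uploaded.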

Actions #18

Updated by tinita 8 days ago

  • Status changed from Resolved to Workable

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/217 will not bring any new information. It's just that so far we only had the screenshot of the non-pretty json output. Hard to spot the status, result, reason on that. The uploaded pretty-printed json will just make it a bit nicer.

Actions #19

Updated by tinita 8 days ago

ybonatakis wrote in #note-17:

The retry is terminated after a few attempts. Not sure where this termination comes from tho.

It is terminated because https://openqa.opensuse.org/tests/4695731#step/test_running/4 is waiting for a finished, passed job. When the job is failed or incomplete, the retry will stop and fail the test.
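
To illustrate the behaviour described here, a minimal sketch of such a wait loop follows. It is not the actual test code: `wait_for_passed_job` and `fetch_job` are hypothetical names, and the `state` and `result` keys are assumed to mirror the job JSON.

```python
import time


def wait_for_passed_job(fetch_job, attempts=10, delay=0.0):
    """Poll until the job is done; succeed only on a passed result.

    fetch_job is a hypothetical callable returning a dict with "state" and
    "result" keys, standing in for a query of the inner openQA job.
    """
    for _ in range(attempts):
        job = fetch_job()
        if job["state"] == "done":
            # A finished job ends the polling either way: passed is success,
            # anything else (failed, incomplete, ...) fails right away.
            return job["result"] == "passed"
        time.sleep(delay)
    # The job never finished within the allowed attempts.
    return False
```

With a loop of this shape, raising the number of attempts would only help a job that is still running; a job that already finished as failed or incomplete ends the wait on the first poll that sees it.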

Actions #20

Updated by ybonatakis 5 days ago

@tinita openqa_install_multimachine hasn't failed in the last 5 days (since 4695731). I suggest resolving it for now unless you have a better idea. I don't think there is anything more I can do.

Actions #22

Updated by ybonatakis 5 days ago

  • Subject changed from [sporadic] openqa_install_multimachine test fails in test_running - ping test fails size:S to [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command.*retry" size:S
  • Status changed from Workable to In Progress

add auto_review to prevent further notifications

Actions #23

Updated by ybonatakis 5 days ago

  • Subject changed from [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command.*retry" size:S to [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command[\s\S]*openqa-cli api jobs" size:S

update the auto_review
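
A short sketch of why `[\s\S]*` can matter compared to `.*`, shown with Python's `re` module. The message below is a made-up two-line example, since the real log text is not quoted in this ticket, and other regex engines may use different defaults for `.`.

```python
import re

# Hypothetical failure message spanning two lines (assumption for illustration).
message = "Test died: command 'wrapper\nopenqa-cli api jobs' failed"

# "." does not match a newline by default, so this cannot span both lines.
dot_pattern = re.compile(r"Test died: command.*openqa-cli api jobs")

# "[\s\S]" matches any character including newlines.
any_pattern = re.compile(r"Test died: command[\s\S]*openqa-cli api jobs")

dot_matches = dot_pattern.search(message) is not None
any_matches = any_pattern.search(message) is not None
```

Here `dot_matches` is `False` and `any_matches` is `True`, so only the `[\s\S]*` form finds a command that wraps onto the next line.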

Actions #24

Updated by livdywan 5 days ago

  • Priority changed from Urgent to High

ybonatakis wrote in #note-23:

update the auto_review

Let's lower the urgency then, since that was the motivation behind the auto_review expression. Do you also have a plan for a fix?

Actions #25

Updated by openqa_review 4 days ago

  • Due date set to 2024-12-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #26

Updated by okurz 4 days ago

For the suggestion of the ticket

  • Ensure test_running-testresults.tar.gz contains image files and no (broken) symlinks

from the help of tar

-h, --dereference          follow symlinks; archive and dump the files they
                             point to

that could help
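
For comparison, the same idea as a minimal Python sketch, assuming the standard `tarfile` module; `archive_with_dereference` is a hypothetical helper and not what the test code uses. Its `dereference=True` option stores the content of link targets instead of the links themselves.

```python
import os
import tarfile


def archive_with_dereference(src_dir, archive_path):
    """Create a gzip tarball of src_dir, storing link targets' content, not links."""
    with tarfile.open(archive_path, "w:gz", dereference=True) as tar:
        # Store entries under the directory's own name rather than its full path.
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
```

GNU tar also documents a separate `--hard-dereference` option for hard links. Whether that is needed for the extraction error in #note-17 is not established in this ticket.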

Actions #27

Updated by ybonatakis 4 days ago

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/218 for

okurz wrote in #note-26:

For the suggestion of the ticket

  • Ensure test_running-testresults.tar.gz contains image files and no (broken) symlinks

from the help of tar

-h, --dereference          follow symlinks; archive and dump the files they
                             point to

that could help

Actions #28

Updated by livdywan 3 days ago

ybonatakis wrote in #note-27:

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/218 for

With this merged, I'd suggest running 300 jobs or so to reproduce and get a failure with the missing files. Hopefully that will yield a clue about the underlying issue.

Maybe make the contained needles visible in the "outer" job (e.g. https://openqa.opensuse.org/tests/4643016#step/tests/1 )

If this is tricky locally, feel free to run it on o3 or open platform

Actions #29

Updated by ybonatakis 2 days ago · Edited

livdywan wrote in #note-28:

ybonatakis wrote in #note-27:

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/218 for

With this merged, I'd suggest running 300 jobs or so to reproduce and get a failure with the missing files. Hopefully that will yield a clue about the underlying issue.

Maybe make the contained needles visible in the "outer" job (e.g. https://openqa.opensuse.org/tests/4643016#step/tests/1 )

If this is tricky locally, feel free to run it on o3 or open platform

https://openqa.opensuse.org/tests/overview?distri=openqa&groupid=24&version=Tumbleweed&build=iob_openqa

Actions #30

Updated by ybonatakis 2 days ago

first 300 pass. triggering again

Actions #31

Updated by livdywan about 24 hours ago · Edited

Looks like there's a bunch of failed jobs now that fail in start_test timing out in openqa-clone-job, but it seems consistent:

# Test died: command 'retry -e -- openqa-clone-job --show-progress --skip-chained-deps --from http://openqa.opensuse.org $job_id' timed out at /usr/lib/os-autoinst/autotest.pm line 411.

Is this the same underlying issue? And if not I wonder why we only see this one now 🤔
