action #169249
closedcoordination #154768: [saga][epic][ux] State-of-art user experience for openQA
coordination #157345: [epic] Improved test reviewer user experience
[sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command[\s\S]*openqa-cli api jobs" size:S
Description
Observation
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_multimachine@64bit-4G fails in test_running
Reproducible
Fails since (at least) Build :TW.32491 (current job)
Expected result
Last good: :TW.32490 (or more recent)
Further details
Always latest result in this scenario: latest
Suggestions
- Ensure test_running-testresults.tar.gz contains image files and no (broken) symlinks
- Maybe make the contained needles visible in the "outer" job (e.g. https://openqa.opensuse.org/tests/4643016#step/tests/1 )
Updated by tinita 3 months ago
- Related to action #169204: [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running size:S added
Updated by tinita 2 months ago
I had a look at the logfiles, but was only able to spot a failing needle match.
We are uploading the test results directory; however, the screenshots are symlinks to /var/lib/openqa/..., so they are not part of the uploaded tarball and we can't see what the screen looked like.
Maybe that can be improved.
We haven't seen the failure again so far.
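One way to verify the first suggestion is to scan an uploaded tarball for link entries whose target is not stored in the archive. The following is a minimal sketch, not the project's actual check: the helper names, the member names, and the concrete path under /var/lib/openqa are made up for illustration.

```python
import io
import posixpath
import tarfile


def broken_links(archive: tarfile.TarFile) -> list[str]:
    """Return names of link members whose target is not stored in the archive."""
    names = set(archive.getnames())
    broken = []
    for member in archive.getmembers():
        if member.issym():
            # A symlink entry stores only the target path, never the file data.
            if posixpath.isabs(member.linkname):
                broken.append(member.name)  # points outside the tarball
                continue
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(member.name), member.linkname)
            )
            if target not in names:
                broken.append(member.name)
        elif member.islnk():
            # A hard-link entry refers to another member by its archive path.
            if member.linkname not in names:
                broken.append(member.name)
    return broken


def demo_archive() -> bytes:
    """Build a small in-memory tarball with one real file and two symlinks."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"fake png data"
        real = tarfile.TarInfo("testresults/real.png")
        real.size = len(data)
        tar.addfile(real, io.BytesIO(data))

        dangling = tarfile.TarInfo("testresults/shot.png")
        dangling.type = tarfile.SYMTYPE
        dangling.linkname = "/var/lib/openqa/images/shot.png"  # hypothetical path
        tar.addfile(dangling)

        fine = tarfile.TarInfo("testresults/alias.png")
        fine.type = tarfile.SYMTYPE
        fine.linkname = "real.png"
        tar.addfile(fine)
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(demo_archive()), mode="r:gz") as tar:
    print(broken_links(tar))
```

Only the entry pointing outside the archive is reported; the relative symlink resolves to a member that is actually stored.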
Updated by okurz 2 months ago
- Status changed from New to Resolved
- Assignee set to okurz
ok, thank you for looking into that. sometimes we just have to accept individual, spurious failures. Considering that otherwise https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-4G&test=openqa_install_multimachine&version=Tumbleweed#next_previous looks very green we should be ok to accept as is.
Updated by tinita 2 months ago
- Status changed from Resolved to New
It happened again: http://openqa.opensuse.org/tests/4643020
Updated by livdywan 2 months ago
- Subject changed from [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - ping test fails to openqa_install_multimachine test fails in test_running - ping test fails size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 2 months ago
- Subject changed from openqa_install_multimachine test fails in test_running - ping test fails size:S to [sporadic] openqa_install_multimachine test fails in test_running - ping test fails size:S
- Priority changed from High to Normal
sporadic but shouldn't bother us over hack week and this is now mostly about test code improvement anyway.
Updated by okurz about 2 months ago
- Related to action #170296: [openqa-in-openqa][sporadic] test fails in test_running - ping_client is not complete size:S added
Updated by okurz about 2 months ago
- Parent task changed from #166556 to #157345
Updated by okurz about 1 month ago
- Priority changed from Normal to Urgent
From today: https://openqa.opensuse.org/tests/4695731
Updated by tinita about 1 month ago
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/217 Upload list of jobs for easier debugging
Updated by ybonatakis about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by openqa_review about 1 month ago
- Due date set to 2024-12-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis about 1 month ago · Edited
- Due date deleted (2024-12-27)
- Status changed from In Progress to Resolved
No failures on the main jobs on O3
Then we have the logs of the cloned jobs, including some screenshots, although they are not helpful because they were apparently not uploaded properly.
When I try to open them I see
Extraction of the entry:
‘testresults/00000/00000001-opensuse-Tumbleweed-DVD-x86_64-Build20241209-ping_server@64bit/boot_to_desktop-2.png’
failed with the error message:
Hard-link target 'testresults/00000/00000002-opensuse-Tumbleweed-DVD-x86_64-Build20241209-ping_client@64bit/boot_to_desktop-1.png' does not exist.
Do you want to continue extraction?
The retry is terminated after a few attempts. Not sure where this termination comes from tho.
I wonder whether setting different retry params could change anything.
However I will resolve this for now, considering that we have #169249#note-14 and can provide some info next time
Updated by tinita about 1 month ago
- Status changed from Resolved to Workable
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/217 will not bring any new information. It's just that so far we only had the screenshot of the non-pretty json output. Hard to spot the status, result, reason on that. The uploaded pretty-printed json will just make it a bit nicer.
Updated by tinita about 1 month ago
ybonatakis wrote in #note-17:
The retry is terminated after a few attempts. Not sure where this termination comes from tho.
It is terminated because https://openqa.opensuse.org/tests/4695731#step/test_running/4 is waiting for a finished, passed job. When the job is failed or incomplete, the retry will stop and fail the test.
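The behaviour described here can be sketched as a polling loop that stops early once the job reaches a final state with a non-passing result. This is an illustration only, not the code used in test_running: `fetch_job`, `JobFailed`, and the `state`/`result` field names are assumptions.

```python
import time
from typing import Callable


class JobFailed(Exception):
    """The job finished, but not with result 'passed'."""


def wait_for_passed(fetch_job: Callable[[], dict], attempts: int = 10,
                    delay: float = 0.0) -> int:
    """Poll until the job is done; return the number of attempts used.

    fetch_job is a hypothetical callable returning a dict with the
    assumed keys 'state' and 'result'.
    """
    for attempt in range(1, attempts + 1):
        job = fetch_job()
        if job["state"] == "done":
            if job["result"] == "passed":
                return attempt
            # Failed or incomplete: retrying cannot help, so stop right away.
            raise JobFailed(f"job finished with result {job['result']!r}")
        time.sleep(delay)
    raise TimeoutError(f"job not done after {attempts} attempts")


# Simulated job that is still running on the first poll and has failed on the second.
responses = iter([
    {"state": "running", "result": "none"},
    {"state": "done", "result": "failed"},
    {"state": "done", "result": "passed"},  # never reached
])
try:
    wait_for_passed(lambda: next(responses))
except JobFailed as err:
    print(f"stopped early: {err}")
```

The loop gives up on the second poll even though attempts remain, which matches the observation that the retry ends after only a few attempts.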
Updated by ybonatakis about 1 month ago
@tinita the openqa_install_multimachine test has not failed in the last 5 days (since 4695731). I suggest resolving it for now unless you have a better idea. I don't think there is anything I can actually do.
Updated by okurz about 1 month ago
fail rate seems to be around 1/300 based on https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-4G&test=openqa_install_multimachine&version=Tumbleweed#next_previous
Updated by ybonatakis about 1 month ago
- Subject changed from [sporadic] openqa_install_multimachine test fails in test_running - ping test fails size:S to [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command.*retry" size:S
- Status changed from Workable to In Progress
add auto_review to prevent further notifications
Updated by ybonatakis about 1 month ago
- Subject changed from [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command.*retry" size:S to [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command[\s\S]*openqa-cli api jobs" size:S
update the auto_review
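The difference between the two expressions can be illustrated with Python's `re` module. This is a sketch under assumptions: the clone-job message is the one quoted later in this ticket, the multi-line message is made up, `dot_only` is a hypothetical variant added for comparison, and the way auto_review actually applies the expression to job logs is not shown here.

```python
import re

old = re.compile(r"Test died: command.*retry")
new = re.compile(r"Test died: command[\s\S]*openqa-cli api jobs")
# Hypothetical variant: same target as `new`, but with a plain dot.
dot_only = re.compile(r"Test died: command.*openqa-cli api jobs")

# Message quoted later in this ticket (openqa-clone-job timing out).
clone_timeout = (
    "# Test died: command 'retry -e -- openqa-clone-job --show-progress "
    "--skip-chained-deps --from http://openqa.opensuse.org $job_id' timed out "
    "at /usr/lib/os-autoinst/autotest.pm line 411."
)

# Made-up message where the command spans more than one line.
multi_line = "# Test died: command 'retry -e -- sh -c \"\nopenqa-cli api jobs\"' failed"

print(bool(old.search(clone_timeout)))    # the broad pattern also catches clone-job timeouts
print(bool(new.search(clone_timeout)))    # the narrower pattern does not
print(bool(dot_only.search(multi_line)))  # '.' does not cross the newline
print(bool(new.search(multi_line)))       # '[\s\S]' does
```

The updated expression is therefore both narrower (it requires `openqa-cli api jobs`) and tolerant of line breaks inside the quoted command.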
Updated by livdywan about 1 month ago
- Priority changed from Urgent to High
ybonatakis wrote in #note-23:
update the auto_review
Let's lower the urgency then since that was the motivation behind the auto_review expression. Do you also have a plan for a fix?
Updated by openqa_review about 1 month ago
- Due date set to 2024-12-31
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 1 month ago
For the suggestion of the ticket
- Ensure test_running-testresults.tar.gz contains image files and no (broken) symlinks
from the help of tar
-h, --dereference follow symlinks; archive and dump the files they point to
that could help
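The effect of dereferencing can be illustrated with Python's `tarfile` module, which has a `dereference` option with the same meaning as `tar -h`. This is a sketch with made-up file names in a temporary directory, not the change made in the test distribution.

```python
import io
import os
import tarfile
import tempfile


def member_kind(src_dir: str, member: str, dereference: bool) -> str:
    """Archive src_dir in memory and report how `member` was stored."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", dereference=dereference) as tar:
        tar.add(src_dir, arcname="testresults")
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:gz") as tar:
        info = tar.getmember(member)
        if info.issym():
            return "symlink"
        return "file" if info.isfile() else "other"


def demo() -> tuple[str, str]:
    """Create a results dir whose screenshot is a symlink to a file outside it."""
    with tempfile.TemporaryDirectory() as tmp:
        store = os.path.join(tmp, "store")      # stands in for the image store
        results = os.path.join(tmp, "results")  # stands in for the results dir
        os.makedirs(store)
        os.makedirs(results)
        with open(os.path.join(store, "shot.png"), "wb") as fh:
            fh.write(b"fake png data")
        os.symlink(os.path.join(store, "shot.png"),
                   os.path.join(results, "shot.png"))
        plain = member_kind(results, "testresults/shot.png", dereference=False)
        followed = member_kind(results, "testresults/shot.png", dereference=True)
        return plain, followed


print(demo())
```

Without dereferencing, the archive contains only the link; with it, the screenshot data itself is stored as a regular file.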
Updated by ybonatakis about 1 month ago
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/218 for
okurz wrote in #note-26:
For the suggestion of the ticket
- Ensure test_running-testresults.tar.gz contains image files and no (broken) symlinks
from the help of tar
-h, --dereference follow symlinks; archive and dump the files they point to
that could help
Updated by livdywan about 1 month ago
ybonatakis wrote in #note-27:
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/218 for
With this merged, I'd suggest running 300 jobs or so to reproduce and get a failure with the missing files. Hopefully that will yield a clue on the underlying issue.
Maybe make the contained needles visible in the "outer" job (e.g. https://openqa.opensuse.org/tests/4643016#step/tests/1 )
If this is tricky locally, feel free to run it on o3 or open platform
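As a rough sanity check on the number of runs: assuming the failure rate of about 1/300 mentioned earlier and assuming the runs are independent (both are assumptions, not measurements), the chance of seeing at least one failure can be estimated as follows. The function names are made up for this sketch.

```python
import math


def p_at_least_one_failure(fail_rate: float, runs: int) -> float:
    """Probability of at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - fail_rate) ** runs


def runs_needed(fail_rate: float, confidence: float) -> int:
    """Smallest number of independent runs that reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))


rate = 1 / 300
print(f"300 runs: {p_at_least_one_failure(rate, 300):.2f}")
print(f"runs for 95%: {runs_needed(rate, 0.95)}")
```

Under these assumptions, 300 runs reproduce the failure only about 63% of the time, and roughly 900 runs would be needed for a 95% chance.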
Updated by ybonatakis about 1 month ago · Edited
livdywan wrote in #note-28:
ybonatakis wrote in #note-27:
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/218 for
With this merged, I'd suggest running 300 jobs or so to reproduce and get a failure with the missing files. Hopefully that will yield a clue on the underlying issue.
Maybe make the contained needles visible in the "outer" job (e.g. https://openqa.opensuse.org/tests/4643016#step/tests/1 )
If this is tricky locally, feel free to run it on o3 or open platform
Updated by livdywan about 1 month ago · Edited
Looks like there's a bunch of failed jobs now that fail in start_test timing out in openqa-clone-job, but it seems consistent:
# Test died: command 'retry -e -- openqa-clone-job --show-progress --skip-chained-deps --from http://openqa.opensuse.org $job_id' timed out at /usr/lib/os-autoinst/autotest.pm line 411.
Is this the same underlying issue? And if not I wonder why we only see this one now 🤔
Updated by ybonatakis 28 days ago
As requested I filed https://progress.opensuse.org/issues/174715 but this does not change anything for the referred issue.
Updated by jbaier_cz 14 days ago
- Status changed from In Progress to Workable
Did a minor improvement for the already implemented first suggestion in https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/219. Will look at the second suggestion after that.
Updated by jbaier_cz 10 days ago
- Status changed from In Progress to Feedback
I implemented the second suggestion in https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/220. That should show us the failure (if any) inside the test within openQA webui.
Updated by jbaier_cz 7 days ago
There was some instability in the test over the weekend. I tried to address it in https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/221; so far it is looking much better.
Updated by jbaier_cz 7 days ago
- Status changed from Feedback to Resolved
Both suggestions are implemented, the test looks stable and I didn't spot the original issue for a while (at least for a month at this moment). As the logging is now improved, we should get more info if the issue reappears. I believe it is not feasible to dig more into this, hence closing as resolved.
Updated by tinita 6 days ago
- Status changed from Resolved to Feedback
Today we saw https://progress.opensuse.org/issues/169249
Updated by jbaier_cz 6 days ago
Can you provide a link to the test? If you mean https://openqa.opensuse.org/tests/4772286 then I am pretty sure that is not the same issue as the original one and that failure is more related to the other currently failing tests and not this ticket.
Updated by jbaier_cz 6 days ago
I do see a related failure in https://openqa.opensuse.org/tests/4772811 though. Will try to improve it further with https://github.com/os-autoinst/os-autoinst-needles-openQA/pull/33