action #166445
openopenQA Project (public) - coordination #127031: [saga][epic] openQA for SUSE customers
openQA Project (public) - coordination #165393: [epic] Improved code coverage in openQA
[openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry size:S
0%
Description
Observation¶
The inner test is incomplete despite the only step simple_boot is passed.
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_from_bootstrap@64bit-2G fails in
tests
autoinst-log.txt for the inner tests shows:
[2024-09-11T07:00:30.593688-04:00] [debug] [pid:51558] [autotest] process exited: 0
…
[2024-09-11T07:00:30.705630-04:00] [debug] [pid:51578] backend got TERM
[2024-09-11T07:00:30.705933-04:00] [info] [pid:51578] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-09-11T07:00:31.758537-04:00] [debug] [pid:51578] Passing remaining frames to the video encoder
[2024-09-11T07:00:31.794569-04:00] [debug] [pid:51578] Waiting for video encoder to finalize the video
[2024-09-11T07:00:31.794634-04:00] [debug] [pid:51578] The built-in video encoder (pid 51580) terminated
[2024-09-11T07:00:31.794967-04:00] [debug] [pid:51578] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 51578 (/usr/bin/isotovideo: backend)
[2024-09-11T07:00:31.795581-04:00] [debug] [pid:51578] sending magic and exit
[2024-09-11T07:00:31.908863-04:00] [debug] [pid:51558] done with backend process
51558: EXIT 1
[2024-09-11T07:00:31.927014-04:00] [info] Isotovideo exit status: 1
[2024-09-11T07:00:31.965003-04:00] [info] +++ worker notes +++
[2024-09-11T07:00:31.965099-04:00] [info] End time: 2024-09-11 11:00:31
[2024-09-11T07:00:31.965147-04:00] [info] Result: died
[2024-09-11T07:00:31.974998-04:00] [info] Uploading video.ogv
[2024-09-11T07:00:32.010558-04:00] [info] Uploading autoinst-log.txt
also from the worker
[2024-09-11T18:03:44.887585Z] [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4472003/status): Connection error: Premature connection close (remaining tries: 59)
Reproducible¶
Fails since (at least) Build :TW.31137 (current job). Judging from https://openqa.opensuse.org/tests/4471721#next_previous it seems the job fails the same at least 1/100 times.
Expected result¶
Last good: :TW.31136 (or more recent)
Suggestions¶
Take a look into the inner job's autoinst-log.txt and also consider to upload the worker-log.txt(DONE)Consider compressing and uploading the whole results directory(DONE)Look into the error message visible in https://openqa.opensuse.org/tests/4471721/logfile?filename=tests-journal.log.txt#line-5968(DONE, this happens in successful tests as well) tests-journal.log.txt- Make the error more easily reproducible, e.g. by skipping test modules and ensuring that the problem still reproduces clearly
- Crosscheck with a different OS base, e.g. Leap vs. Tumbleweed, on multiple levels, could be OS within the inner test has an impact or qemu on the outer layer or something else
Create another more specific ticket about the underlying issue independant of os-autoinst-distri-openQA#174259- Make the inner distri-example test longer to see if the shortness of the test is the reason
Further details¶
Always latest result in this scenario: latest
- Issue can be reproduced with just
isotovideo -e
in a loop. The issue usually happens at least once in a 20 invocations. inner job's autoinst-log.txt
autoinst-log.txtinner job's worker-log.txt
[2024-11-11T11:57:08.888829-05:00] [info] Preparing cgroup to start isotovideo [2024-11-11T11:57:08.890332-05:00] [warn] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission den ied at /usr/lib/perl5/vendor_perl/5.40.0/Mojo/File.pm line 85. [2024-11-11T11:57:08.890537-05:00] [info] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROU P_FS. [2024-11-11T11:57:08.891118-05:00] [info] Starting isotovideo container [2024-11-11T11:57:08.893714-05:00] [info] isotovideo has been started (PID: 124599) [2024-11-11T11:57:08.894866-05:00] [info] 124599: WORKING 1 [2024-11-11T11:57:46.750369-05:00] [info] Uploading video.webm [2024-11-11T11:57:46.803320-05:00] [info] Uploading autoinst-log.txt [2024-11-11T11:57:46.829894-05:00] [info] Uploading worker-log.txt
Updated by jbaier_cz 3 months ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle with tag(s) .*openqa-(test-details|search-results).*matched" to [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle with tag(s) .*openqa-test-details.*matched"
Updated by jbaier_cz 3 months ago
- Related to action #165692: [openQA-in-openQA] test does not fail if state is done but incomplete auto_review:"no candidate needle.*openqa-dashboard.*matched" size:S added
Updated by jbaier_cz 3 months ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle with tag(s) .*openqa-test-details.*matched" to [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry
Updated by tinita 3 months ago
Just a small PR so we have the actual logfile for investigation: https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/206 Correct uploading of autoinst-log.txt
Updated by jbaier_cz 3 months ago
Example of failure with the autoinst-log.txt attached: https://openqa.opensuse.org/tests/4471721
Updated by okurz 3 months ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry to [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by ybonatakis 3 months ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by openqa_review 3 months ago
- Due date set to 2024-09-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 3 months ago
- Due date deleted (
2024-09-27)
Updated by openqa_review 3 months ago
- Due date set to 2024-10-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 3 months ago
- Status changed from In Progress to Feedback
baseclass calls stop_vm the qemu error comes between $self->_stop_video_encoder();
and $self->close_pipes();
messages.
However before we terminate test runner and isotovideo. The error code is assigned when $runner->testfd
. the code is commented with unusual shutdown
Updated by ybonatakis 3 months ago
Updated by ybonatakis 3 months ago · Edited
Updated by ybonatakis 3 months ago
ybonatakis wrote in #note-15:
I set up a virtual machine and I compared the logs:
{{collapse(failed job logs)
[1;34m[2024-09-16T09:38:21.788738-04:00] [debug] [pid:124359] ||| finished boot tests (runtime: 11 s)[0m
[37m[2024-09-16T09:38:21.793705-04:00] [debug] [pid:124339] [autotest] process exited: 0[0m
[37m[2024-09-16T09:38:21.794268-04:00] [debug] [pid:124339] stopping command server 124356 because test execution ended[0m
[37m[2024-09-16T09:38:21.794336-04:00] [debug] [pid:124339] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/IUjZf4snrxneedUn/broadcast[0m
[37m[2024-09-16T09:38:21.821683-04:00] [debug] [pid:124339] commands process exited: 0[0m
[37m[2024-09-16T09:38:21.921957-04:00] [debug] [pid:124339] done with command server[0m
[37m[2024-09-16T09:38:21.922056-04:00] [debug] [pid:124339] stopping autotest process 124359[0m
[37m[2024-09-16T09:38:21.922138-04:00] [debug] [pid:124339] done with autotest process[0m
[37m[2024-09-16T09:38:21.922191-04:00] [debug] [pid:124339] isotovideo failed[0m
[37m[2024-09-16T09:38:21.923372-04:00] [debug] [pid:124339] stopping backend process 124360[0m
[37m[2024-09-16T09:38:21.923566-04:00] [debug] [pid:124360] backend got TERM[0m
[33m[2024-09-16T09:38:21.924008-04:00] [info] [pid:124360] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json[0m
[37m[2024-09-16T09:38:22.978555-04:00] [debug] [pid:124360] Passing remaining frames to the video encoder[0m
[out#0/webm @ 0x555a322309c0] video:20KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 3.039424%
frame= 24 fps=2.0 q=50.0 Lsize= 20KiB time=00:00:00.95 bitrate= 174.3kbits/s speed=0.0779x
[37m[2024-09-16T09:38:24.081336-04:00] [debug] [pid:124360] Waiting for video encoder to finalize the video[0m
[37m[2024-09-16T09:38:24.081465-04:00] [debug] [pid:124360] The external video encoder (pid 124362) terminated[0m
[37m[2024-09-16T09:38:24.081668-04:00] [debug] [pid:124360] The built-in video encoder (pid 124363) terminated[0m
[37m[2024-09-16T09:38:24.082421-04:00] [debug] [pid:124360] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 124360 (/usr/bin/isotovideo: backend)[0m
[37m[2024-09-16T09:38:24.083632-04:00] [debug] [pid:124360] sending magic and exit[0m
[37m[2024-09-16T09:38:24.228532-04:00] [debug] [pid:124339] done with backend process[0m
124339: EXIT 1
[2024-09-16T09:38:24.283590-04:00] [info] Isotovideo exit status: 1
}}{{collapse(good job logs)
[2024-09-17T04:27:18.239032-04:00] [debug] [pid:10746] stopping autotest process 10782
[2024-09-17T04:27:18.242803-04:00] [debug] [pid:10746] [autotest] process exited: 0
[2024-09-17T04:27:18.346308-04:00] [debug] [pid:10746] done with autotest process
[2024-09-17T04:27:18.347247-04:00] [debug] [pid:10746] stopping command server 10775 because test execution ended
[2024-09-17T04:27:18.347427-04:00] [debug] [pid:10746] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/1BzRWhMNYgs94svW/broadcast
[2024-09-17T04:27:18.499693-04:00] [debug] [pid:10746] commands process exited: 0
[2024-09-17T04:27:18.600627-04:00] [debug] [pid:10746] done with command server
[2024-09-17T04:27:18.600955-04:00] [debug] [pid:10746] isotovideo done
[2024-09-17T04:27:18.608242-04:00] [debug] [pid:10786] QEMU status is not 'shutdown', it is 'running'
[2024-09-17T04:27:18.609286-04:00] [debug] [pid:10746] backend shutdown state: 0
[2024-09-17T04:27:18.611449-04:00] [info] [pid:10786] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-09-17T04:28:01.158750-04:00] [debug] [pid:10786] Passing remaining frames to the video encoder[out#0/webm @ 0x55a05fdbf9c0] video:12KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 6.289207%frame= 66 fps=0.9 q=50.0 Lsize= 13KiB time=00:00:02.70 bitrate= 39.0kbits/s speed=0.0362x
[2024-09-17T04:28:02.324960-04:00] [debug] [pid:10786] Waiting for video encoder to finalize the video
[2024-09-17T04:28:02.325313-04:00] [debug] [pid:10786] The built-in video encoder (pid 10791) terminated
[2024-09-17T04:28:02.325423-04:00] [debug] [pid:10786] The external video encoder (pid 10790) terminated
[2024-09-17T04:28:02.332402-04:00] [debug] [pid:10786] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 10786 (/usr/bin/isotovideo: backend)
[2024-09-17T04:28:02.335732-04:00] [debug] [pid:10786] sending magic and exit
[2024-09-17T04:28:02.336304-04:00] [debug] [pid:10746] received magic close
[2024-09-17T04:28:02.343881-04:00] [debug] [pid:10746] backend process exited: 0
[2024-09-17T04:28:02.450525-04:00] [debug] [pid:10746] stopping backend process 10786
[2024-09-17T04:28:02.450650-04:00] [debug] [pid:10746] done with backend process10746: EXIT 0
[2024-09-17T04:28:02.702003-04:00] [info] Isotovideo exit status: 0
[2024-09-17T04:27:18.239032-04:00] [debug] [pid:10746] stopping autotest process 10782
10746: EXIT 0
[2024-09-17T04:28:02.702003-04:00] [info] Isotovideo exit status: 0
[2024-09-17T04:28:02.450650-04:00] [debug] [pid:10746] done with backend process
}}
autotest process seems to start termination earlier but I am not sure. I tried to dig into the code but there are different places which are executing tasks at that point and I cant understand whats the sequence
Updated by ybonatakis 3 months ago
- Status changed from Feedback to Workable
Seems I focus on different problem. I tried to track the QEMU: qemu-system-x86_64: terminating on signal 15 from pid 51578
when the real problem is that the simple_test
passes but its status on the inner openqa is incomplete. QEMU: qemu-system-x86_64: terminating on signal 15 from pid 51578
comes after the end of the job, so propably another issue!!?
Running >60 jobs with isotovideo
I encountered the original problem only once. test failed a few times though in different occurrence. So I am running out of ideas and I will be unassigned to get the chance to someone for a fresh look.
Updated by livdywan 3 months ago · Edited
I tried to reproduce like so, using https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_bootstrap&version=Tumbleweed#downloads for the image:
wget https://openqa.opensuse.org/tests/4493151/asset/hdd/opensuse-Tumbleweed-x86_64-20240917-minimalx@64bit.qcow2
isotovideo -e CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-openQA/ NEEDLES_DIR=https://github.com/os-autoinst/os-autoinst-needles-openQA QEMU_NO_KVM=1 QEMU=x86_64 HDD_1=opensuse-Tumbleweed-x86_64-20240917-minimalx@64bit.qcow2
[...]
[2024-09-19T16:46:25.500999+02:00] [info] [pid:171843] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'openqa-desktop' matched
[...]
[2024-09-19T16:46:59.397550+02:00] [debug] [pid:171843] post_fail_hook failed: command 'ip a del 10.0.2.2/15 dev br1' timed out at ~/bau/oa/repos/distri-openQA/os-autoinst-distri-openQA/lib/openQAcoretest.pm line 10.
After removing the failing script_run from lib/openQAcoretest.pm
I get this:
[2024-09-19T17:16:22.339125+02:00] [debug] [pid:249716] post_fail_hook failed: command 'ps -ef | tee running_processes.txt' timed out at ~/bau/oa/repos/distri-openQA/os-autoinst-distri-openQA/lib/utils.pm line 11.
That's presumably the upload to a non-existing web UI timing out. Is there any way to get to the logs produced by post_fail_hook
?
Updated by openqa_review about 1 month ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: openqa_install+publish
https://openqa.opensuse.org/tests/4634462#step/tests/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by jbaier_cz about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
Updated by openqa_review about 1 month ago
- Due date set to 2024-11-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz about 1 month ago
So to be sure, I used isotovideo
directly as mentioned earlier. The issue can be reproduced usually within twenty runs:
[2024-11-11T11:57:46.375712-05:00] [debug] [pid:124619] Waiting for video encoder to finalize the video
[2024-11-11T11:57:46.376082-05:00] [debug] [pid:124619] The external video encoder (pid 124621) terminated
[2024-11-11T11:57:46.376251-05:00] [debug] [pid:124619] The built-in video encoder (pid 124622) terminated
[2024-11-11T11:57:46.378280-05:00] [debug] [pid:124619] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 124619 (/usr/bin/isotovideo: backend)
[2024-11-11T11:57:46.380877-05:00] [debug] [pid:124619] sending magic and exit
[2024-11-11T11:57:46.545817-05:00] [debug] [pid:124599] done with backend process
124599: EXIT 1
[2024-11-11T11:57:46.613745-05:00] [info] Isotovideo exit status: 1
[2024-11-11T11:57:46.722113-05:00] [info] +++ worker notes +++
[2024-11-11T11:57:46.722836-05:00] [info] End time: 2024-11-11 16:57:46
[2024-11-11T11:57:46.722981-05:00] [info] Result: died
[2024-11-11T11:57:46.750790-05:00] [info] Uploading video.webm
[2024-11-11T11:57:46.803408-05:00] [info] Uploading autoinst-log.txt
after that I did try to just execute the inner test with isotovideo to see if that is stable, executed 500 tests without any issue.
As a next step I would like to create a minimal working example for the bug to reproduce. From the latest failure, one can see that the test is successfully executed inside osautoinst/test_running but a test module later, it is already showing as incomplete a few modules later and even long before the shutdown, so it shouldn't be related to shutting down the openQA inside the test (as I initially thought).
Updated by jbaier_cz about 1 month ago
Minor improvement https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/216 for the published HDD to be reusable right away by our current tests.
Updated by jbaier_cz about 1 month ago
- Due date deleted (
2024-11-29) - Status changed from In Progress to Workable
I tried to reuse the published image and spawn 50 inner jobs inside one test. I am struggle to reproduce the issue that way, so I will need to use another approach after HW.
Updated by jbaier_cz 20 days ago
So far I am able to reproduce the issue within 20 tests; when using the image generated by a successful run and spawning the same test (without the openQA installation part of course), the issue is not reproducible anymore (or it needs a way more runs, which I am trying at this moment). Anyway, my observation leads to a hypothesis that the issue is connected either to a very first test or that the machine needs a reboot before first use. I will see, if the issue goes away if I introduce reboot just after the installation.
Updated by jbaier_cz 14 days ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry size:M to [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry
- Description updated (diff)
- Status changed from Workable to New
- Assignee deleted (
jbaier_cz)
I collected the results and to my surprise even the reboot did not help to mitigate the issue. So either I did something wrong in my experiments or my hypothesis is completely wrong. I updated the description with all available data, from the log investigation the only real log difference is isotovideo failed
log line together with error code 1 from isotovideo. No other difference spotted in any log. Not sure, if we can achieve more verbosity? Anyway, as I am out of suggestions, I guess the way forward right now is to put this one back to re-estimate.
Updated by mkittler 14 days ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry to [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita 13 days ago
What I don't understand is why stop_commands
is apparently called before stop_autotest
:
[2024-11-11T11:57:25.417909-05:00] [debug] [pid:124599] stopping command server 124615 because test execution ended <----- shouldn't be called that early
[2024-11-11T11:57:25.418077-05:00] [debug] [pid:124599] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/ZaiH7djgcxpbtus1/broadcast
[2024-11-11T11:57:25.441790-05:00] [debug] [pid:124599] commands process exited: 0
[2024-11-11T11:57:25.542427-05:00] [debug] [pid:124599] done with command server
[2024-11-11T11:57:25.542619-05:00] [debug] [pid:124599] stopping autotest process 124618
[2024-11-11T11:57:25.542794-05:00] [debug] [pid:124599] done with autotest process
I ran the inner test locally, shortened the test module to return immediately, added some sleep in the tests_done
handling, but the stop_commands
was always executed after that.
Updated by tinita 13 days ago · Edited
I started a bunch of tests with DEBUG_JSON_RPC=1
in the inner test and got one failure:
https://openqa.opensuse.org/tests/4688380
We can see that tests_done
is sent via myjsorpc:
https://openqa.opensuse.org/tests/4688380/logfile?filename=tests-autoinst-log.txt#line-211
But it is also read 4 lines below. Actually I cannot see that the token XcpFXfMJ
is ever read.
But even before that apparently stop_command_processing
is called. That would happen when the response is not defined.
[2024-12-06T13:57:41.592234-05:00] [debug] [pid:8923] send_json(5) JSON={"cmd":"tests_done","completed":1,"died":0,"json_cmd_token":"XcpFXfMJ"}
[2024-12-06T13:57:41.595469-05:00] [debug] [pid:8904] [autotest] process exited: 0
[2024-12-06T13:57:41.595601-05:00] [debug] [pid:8904] send_json(3) JSON={"json_cmd_token":"WqAryXaW","stop_processing_isotovideo_com":1}
[2024-12-06T13:57:41.595677-05:00] [debug] [pid:8920] read_json(4)
[2024-12-06T13:57:41.595757-05:00] [debug] [pid:8920] read_json(4) json_cmd_token=WqAryXaW
[2024-12-06T13:57:41.595883-05:00] [debug] [pid:8904] stopping command server 8920 because test execution ended
Updated by tinita 8 days ago
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/217 Upload list of jobs for easier debugging
Updated by openqa_review 7 days ago
- Due date set to 2024-12-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 7 days ago
- Copied to action #174259: [sporadic] isotovideo fails complaining about still existing testfd filehandle added
Updated by livdywan about 22 hours ago
- Due date deleted (
2024-12-26)
Due date with no assignee is not meaningful, hence resetting