action #174259 (closed)
coordination #127031: [saga][epic] openQA for SUSE customers
coordination #165393: [epic] Improved code coverage in openQA
[sporadic] isotovideo fails complaining about still existing testfd filehandle size:M
Description
Observation
See also #166445
The inner test is incomplete even though its only step, simple_boot, passed.
It seems that in some circumstances the process forked by isotovideo exits before the tests_done handling can run and testfd is
closed, so isotovideo complains about the still open filehandle: https://github.com/os-autoinst/os-autoinst/blob/master/script/isotovideo#L144
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_from_bootstrap@64bit-2G fails in
tests
autoinst-log.txt for the inner tests shows:
[2024-09-11T07:00:30.593688-04:00] [debug] [pid:51558] [autotest] process exited: 0
…
[2024-09-11T07:00:30.705630-04:00] [debug] [pid:51578] backend got TERM
[2024-09-11T07:00:30.705933-04:00] [info] [pid:51578] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-09-11T07:00:31.758537-04:00] [debug] [pid:51578] Passing remaining frames to the video encoder
[2024-09-11T07:00:31.794569-04:00] [debug] [pid:51578] Waiting for video encoder to finalize the video
[2024-09-11T07:00:31.794634-04:00] [debug] [pid:51578] The built-in video encoder (pid 51580) terminated
[2024-09-11T07:00:31.794967-04:00] [debug] [pid:51578] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 51578 (/usr/bin/isotovideo: backend)
[2024-09-11T07:00:31.795581-04:00] [debug] [pid:51578] sending magic and exit
[2024-09-11T07:00:31.908863-04:00] [debug] [pid:51558] done with backend process
51558: EXIT 1
[2024-09-11T07:00:31.927014-04:00] [info] Isotovideo exit status: 1
[2024-09-11T07:00:31.965003-04:00] [info] +++ worker notes +++
[2024-09-11T07:00:31.965099-04:00] [info] End time: 2024-09-11 11:00:31
[2024-09-11T07:00:31.965147-04:00] [info] Result: died
[2024-09-11T07:00:31.974998-04:00] [info] Uploading video.ogv
[2024-09-11T07:00:32.010558-04:00] [info] Uploading autoinst-log.txt
Also, from the worker log:
[2024-09-11T18:03:44.887585Z] [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4472003/status): Connection error: Premature connection close (remaining tries: 59)
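For illustration, a heavily simplified, hypothetical sketch of this failure mode (not the actual isotovideo code; the variable names and the final check are modelled after the description above):

# Hypothetical, simplified sketch: the parent only tears down $testfd inside
# its tests_done handling; if the forked test process exits before that
# handling runs, the final check still sees an open handle and the run fails.
use strict;
use warnings;

pipe(my $testfd, my $child_end) or die "pipe: $!";

my $pid = fork() // die "fork: $!";
if ($pid == 0) {    # forked "autotest"-like child
    close $testfd;
    exit 0;         # bad case: exits before the parent handled tests_done
}

close $child_end;
waitpid($pid, 0);    # child is already gone at this point

my $tests_done_handled = 0;    # would be set by the tests_done handling
if ($tests_done_handled) {
    close $testfd;
    undef $testfd;
}

# isotovideo-like sanity check before exiting
if (defined $testfd) {
    warn "test process exited, but testfd is still open, failing the run\n";
    exit 1;
}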
Steps to reproduce
Steps to reproduce with isotovideo:
1. Download HDD_1 from the latest openqa_install+publish test from https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_bootstrap&version=Tumbleweed&status=done#next_previous
2. Download vars.json
3. Run isotovideo until you hit the issue (~1/100 times)
For example, for an older build that would look like this:
wget https://openqa.opensuse.org/tests/4699317/asset/hdd/opensuse-Tumbleweed-x86_64-20241210-minimalx@64bit.qcow2
wget https://openqa.opensuse.org/tests/4699317/file/vars.json
for i in $(seq 1 20); do echo RUN $i; isotovideo -e CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-openQA NEEDLES_DIR=https://github.com/os-autoinst/os-autoinst-needles-openQA PRODUCTDIR=. 2>log || break; done
Also see #174259-7
Acceptance Criteria
- AC1: the scenario openqa_from_bootstrap is consistently stable with a fail ratio < 0.1% and without the inner openQA job ending up incomplete
Suggestions
- Run the inner example tests in a VM (DONE, didn't reproduce it)
- Try to run the openqa-in-openqa test, which needs nested virtualization or emulation
- Change the test code to schedule the inner test multiple times instead of once and run openqa-in-openqa many times and see if it makes a difference. Is it more often reproducible? Does it only fail for the first inner test?
- Run the openqa-in-openqa test, pause it and schedule the inner test manually multiple times
- Look for "isotovideo failed" (https://github.com/os-autoinst/os-autoinst/blob/master/script/isotovideo#L151) in the uploaded logs of the inner openQA job to detect the problem
Updated by tinita 3 months ago
- Copied from action #166445: [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry size:S added
Updated by tinita 3 months ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry to [sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry
Updated by jbaier_cz 3 months ago
- Description updated (diff)
I added steps to reproduce inside a distrobox (just a plain opensuse/distrobox:latest with os-autoinst-devel and openQA-devel installed).
RUN 12
Web application available at http://[::]:20183
GOT GO
3292083: EXIT 101
For some reason, I was not able to reproduce this with podman and the isotovideo:qemu-x86 container image.
Updated by openqa_review 3 months ago
- Due date set to 2024-12-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 2 months ago
So, running the inner example tests in my VM didn't reproduce it.
I now tried to run the openqa-in-openqa test, but nested virtualization is not enabled in the VM. I asked for help in the openplatform channel.
Additional ideas:
- Change the test code to schedule the inner test multiple times instead of once and run openqa-in-openqa many times and see if it makes a difference. Is it more often reproducible? Does it only fail for the first inner test?
- Run openqa-in-openqa test and pause the test and schedule the inner test manually multiple times
Updated by tinita 2 months ago
- Status changed from In Progress to Workable
I wanted to run openqa-in-openqa tests in that VM, but wasn't able to get nested virtualization working. I asked in the openplatform channel and got some help, but something still seems to be missing.
Setting back to workable due to absence.
Updated by okurz 2 months ago
So what I did is run:
runs=200 count-fail-ratio isotovideo -e CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-openQA NEEDLES_DIR=https://github.com/os-autoinst/os-autoinst-needles-openQA PRODUCTDIR=. GIT_CACHE_DIR=. CACHEDIRECTORY=. TIMEOUT_SCALE=3 2>> log
I needed GIT_CACHE_DIR=. CACHEDIRECTORY=. as I was running as non-root and otherwise tests would fail trying to write to /var/lib/openqa, which yields "Permission denied". I also needed TIMEOUT_SCALE=3 because my system is slow. Then in run 6 I observed the first failure. In the local directory ulogs/ I found log files uploaded there during the last failure (I assume). In ulogs/tests-autoinst-log.txt I found:
[2024-12-19T09:05:33.649386-05:00] [debug] [pid:8947] no match: 21.0s, best candidate: bootloader (0.00)
[2024-12-19T09:05:34.763255-05:00] [debug] [pid:8947] no match: 20.0s, best candidate: bootloader (0.00)
[2024-12-19T09:05:35.792371-05:00] [debug] [pid:8946] >>> testapi::_handle_found_needle: found bootloader, similarity 1.00 @ 0/370
[2024-12-19T09:05:35.796324-05:00] [debug] [pid:8946] ||| finished boot tests (runtime: 11 s)
[2024-12-19T09:05:35.798447-05:00] [debug] [pid:8946] Sending tests_done
[2024-12-19T09:05:35.802355-05:00] [debug] [pid:8925] [autotest] process exited: 0
[2024-12-19T09:05:35.802691-05:00] [debug] [pid:8925] isotovideo handle_shutdown
[2024-12-19T09:05:35.802815-05:00] [debug] [pid:8925] stopping command server 8943 because test execution ended
[2024-12-19T09:05:35.802911-05:00] [debug] [pid:8925] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/x5vGNXUdPjX1sPxZ/broadcast
[2024-12-19T09:05:35.823290-05:00] [debug] [pid:8925] commands process exited: 0
[2024-12-19T09:05:35.923850-05:00] [debug] [pid:8925] done with command server
[2024-12-19T09:05:35.924013-05:00] [debug] [pid:8925] stopping autotest process 8946
[2024-12-19T09:05:35.924160-05:00] [debug] [pid:8925] done with autotest process
[2024-12-19T09:05:35.924276-05:00] [debug] [pid:8925] isotovideo failed
[2024-12-19T09:05:35.925186-05:00] [debug] [pid:8925] stopping backend process 8947
[2024-12-19T09:05:35.925538-05:00] [debug] [pid:8947] backend got TERM
[2024-12-19T09:05:35.925948-05:00] [info] [pid:8947] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-19T09:05:36.983996-05:00] [debug] [pid:8947] Passing remaining frames to the video encoder
[2024-12-19T09:05:37.043069-05:00] [debug] [pid:8947] Waiting for video encoder to finalize the video
[2024-12-19T09:05:37.043190-05:00] [debug] [pid:8947] The built-in video encoder (pid 8949) terminated
[2024-12-19T09:05:37.043954-05:00] [debug] [pid:8947] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 8947 (/usr/bin/isotovideo: backend)
[2024-12-19T09:05:37.045321-05:00] [debug] [pid:8947] sending magic and exit
[2024-12-19T09:05:37.231040-05:00] [debug] [pid:8925] done with backend process
8925: EXIT 1
so unlike #174259-7 there is no EXIT 101 and nothing about a "filehandle". Is this the same or a different problem?
Updated by tinita about 1 month ago
- Assignee deleted (tinita)
I checked the job history for all scenarios and this happened 11 days ago: https://openqa.opensuse.org/tests/4758316#step/tests/7
As I have more urgent tasks I will unassign myself.
Updated by mkittler 29 days ago · Edited
It looks like it fails to start ffmpeg:
[2024-09-11T14:00:17.588199-04:00] [debug] [pid:51552] loaded 1 needles
[2024-09-11T14:00:17.787552-04:00] [debug] [pid:51552] 51552: channel_out 15, channel_in 14
[2024-09-11T14:00:17.816477-04:00] [debug] [pid:51573] 51573: cmdpipe 13, rsppipe 16
[2024-09-11T14:00:17.816576-04:00] [debug] [pid:51573] started mgmt loop with pid 51573
Can't exec "ffmpeg": No such file or directory at /usr/lib/os-autoinst/backend/baseclass.pm line 347.
[2024-09-11T14:00:18.012504-04:00] [debug] [pid:51573] qemu version detected: 9.0.2
[2024-09-11T14:00:18.016391-04:00] [debug] [pid:51573] running `/usr/bin/chattr +C /var/lib/openqa/pool/1/raid`
[2024-09-11T14:00:18.024234-04:00] [debug] [pid:51573] Command `/usr/bin/chattr +C /var/lib/openqa/pool/1/raid` terminated with 0
[2024-09-11T14:00:18.024305-04:00] [debug] [pid:51573] Configuring storage controllers and block devices
This doesn't look fatal, though, and may even be the case when the inner test passes. Considering the line "[2024-09-11T14:00:32.443868-04:00] [debug] [pid:51573] The built-in video encoder (pid 51575) terminated" it just fell back to the built-in video encoder.
So the inner test matches one needle, concludes one test module as passed and shuts down all processes as expected, but then nevertheless incompletes. There are no suspicious log lines except that the line "Sending tests_done" is missing. Very puzzling.
Updated by openqa_review 28 days ago
- Due date set to 2025-02-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 28 days ago · Edited
I only see one reason how isotovideo (and autotest specifically) could get this far but then not reach "Sending tests_done": it received SIGTERM. In this case autotest simply invokes _exit(1) when there is no current test module (and that is the case here), so it would not reach "Sending tests_done", causing isotovideo to assume a premature test exit. It would also not flag the current test module as cancelled because there is no current test module.
If this is true, the question is why it received SIGTERM and how we could make it more obvious what happened.
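A minimal sketch of this theory (hypothetical and simplified, not the actual autotest code):

# Hypothetical sketch: a SIGTERM that arrives while no test module is current
# makes the process bail out via POSIX::_exit(1), so "Sending tests_done" is
# never reached and isotovideo assumes a premature test exit.
use strict;
use warnings;
use POSIX ();

my $current_test;    # undef in between test modules

$SIG{TERM} = sub {
    if (defined $current_test) {
        $current_test->{canceled} = 1;    # flag the running module instead
    }
    else {
        POSIX::_exit(1);                  # hard exit, skips everything below
    }
};

# ... run the test modules, setting and clearing $current_test ...

print "Sending tests_done\n";             # never reached if TERM hit the else branch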
EDIT: With this change it would be easy to know whether my theory is correct: https://github.com/os-autoinst/os-autoinst/pull/2640
Updated by mkittler 28 days ago
I suppose this theory is wrong because then autotest would have exited with return code 1, but it exited with return code 0 according to the log message from the "collected" event.
Then I really have no clue what happened. Even if some exception had happened, it would have been caught (via eval) and a warning would have been printed. I also haven't found any other relevant places where we call exit.
It also cannot be a problem with myjsonrpc because not even "Sending tests_done" is logged. It also cannot be an exception in bmwqemu::save_vars(no_secret => 1) because then we would get a warning. The same applies to any exception in autotest::runalltests().
Updated by mkittler 28 days ago
This test ran before commit https://github.com/os-autoinst/os-autoinst/commit/5c15841d7e208c9a7e9c1ca829edf7f031f3fa0d, which improved the logging. So before this commit we would have silently ignored errors happening in bmwqemu::save_vars(no_secret => 1) and in myjsonrpc::send_json($isotovideo, {cmd => 'tests_done', died => $died, completed => $completed}).
There are no other relevant changes since then, and as mentioned before I also don't see any other explanations. So I suspect a failure in one of those functions occurred.
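For illustration, roughly the kind of pattern such error logging follows, as a hypothetical sketch (the run_step helper and the step bodies are illustrative, not code from os-autoinst or the commit):

# Hypothetical sketch: run each shutdown step under eval and log the error
# instead of silently discarding it. The step bodies are stand-ins for
# bmwqemu::save_vars(no_secret => 1) and the send_json call quoted above.
use strict;
use warnings;

sub run_step {
    my ($name, $code) = @_;
    eval { $code->(); 1 } or warn "error on test shutdown ($name): $@";
}

run_step('save_vars',  sub { die "simulated failure\n" });
run_step('tests_done', sub { print "tests_done sent\n" });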
Since the failure was silently ignored I don't know what to improve here. I would simply wait until the issue occurs again in production because running the full-blown test 100 times (or even more) would be very time consuming (16 hours on my local machine).
I could also create 100 test runs on production if that is wanted. I don't think it is very useful because by now we ran a lot more than 100 tests on production over time anyway and I'm not aware that we encountered this issue again. I will double-check by going through the "Next & Previous" tab of this scenario, though (over 500 jobs).
I also don't see a problem with the "isotovideo failed" message, considering isotovideo rightfully returned with a non-zero return code. Now that we log the actual error it will hopefully also be clear why.
Updated by mkittler 28 days ago · Edited
There were over 20 more failures of this kind in the past months but logs are already gone. The most recent job I found is https://openqa.opensuse.org/tests/4758316 from just 22 days ago and it still has logs.
It looks a little bit different:
[2025-01-09T07:00:28.610026-05:00] [debug] [pid:8956] ||| finished boot tests (runtime: 11 s)
[2025-01-09T07:00:28.611396-05:00] [debug] [pid:8956] Sending tests_done
[2025-01-09T07:00:28.615050-05:00] [debug] [pid:8937] [autotest] process exited: 0
[2025-01-09T07:00:28.615269-05:00] [debug] [pid:8937] isotovideo handle_shutdown
Here we see the new log message ("Sending tests_done"), but this is also the last log message. So there was no exception, but presumably sending the tests_done message nevertheless didn't work. So send_json is probably the culprit after all.
This actually still happens quite frequently, maybe one job out of 20. All jobs I looked into looked the same.
EDIT: I see that we do proper error handling in send_json. Maybe the problematic part is the corresponding read_json call in isotovideo, which might simply not be invoked anymore before handle_shutdown is entered.
EDIT: The problem is the following code:
# stop main loop as soon as one of the child processes terminates
my $stop_loop = sub (@) { $self->loop(0) if $self->loop; };
$self->testprocess->once(collected => $stop_loop);
$self->backend->process->once(collected => $stop_loop);
$self->cmd_srv_process->once(collected => $stop_loop);
This means the loop will not continue to run (and thus not continue to read the tests_done message via read_json) once autotest exits. This is not a problem in most cases because most of the time the loop is waiting for new data to read anyway and thus will still invoke read_json once more. However, when it has just finished reading something and is checking whether it should continue to run, it would not continue (and thus not read tests_done).
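A stripped-down, hypothetical model of this race, not the actual os-autoinst loop, just to make the timing concrete:

# Hypothetical model: the "collected" handler clears the loop flag as soon as
# the child is reaped. If that happens right after a read, the loop condition
# is checked before read_json would run again, so a tests_done message already
# sitting in the pipe is never read.
use strict;
use warnings;

my $loop    = 1;
my @pending = ('status_update', 'tests_done');   # messages already buffered in the pipe

sub on_child_collected { $loop = 0 }             # mirrors: my $stop_loop = sub { $self->loop(0) ... }

while ($loop) {
    my $msg = shift(@pending) // last;           # stands in for read_json on the test fd
    print "read message: $msg\n";
    # The autotest child exits (and gets "collected") right here, after a
    # successful read but before the loop condition is checked again:
    on_child_collected() if $msg eq 'status_update';
}

print "unread messages left: @pending\n" if @pending;   # 'tests_done' was never read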
EDIT: This PR will likely fix the issue: https://github.com/os-autoinst/os-autoinst/pull/2641
I still want to do more testing, though, to be sure it doesn't cause any regressions.
Updated by okurz 28 days ago
mkittler wrote in #note-24:
I could also create 100 test runs on production if that is wanted. I don't think it is very useful because by now we ran a lot more than 100 tests on production over time anyway and I'm not aware that we encountered this issue again. I will double-check by going through the "Next & Previous" tab of this scenario, though (over 500 jobs).
As you already found out by now, normal production tests reproduce the issue, but the logs might be gone. The benefit of triggering multiple tests in parallel is that you have an easier time gathering statistics and can be sure that logs and data will be preserved for as long as the investigation needs, and that nearly no other parameters change within the runtime of the test set, e.g. no vastly different combinations of package versions on the workers over time. So I can still recommend triggering hundreds of jobs on o3, both to reproduce the original issue and to try out potential fixes.
Updated by mkittler 25 days ago
- Status changed from In Progress to Feedback
The PR has been merged. I'm keeping this ticket in feedback until it is deployed and in production for a few days.
Not sure whether it makes sense to do any further testing. We usually already create more than 20 jobs per day (see https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_bootstrap&version=Tumbleweed#next_previous), but the most recent job reproducing the issue I could find is from 24 days ago (not within the last 501 jobs of the scenario anymore). So it is probably too much effort to create/run/review 500+ jobs for this. We would find out at some point if the fix doesn't work and can then still continue working on it.