action #174259 (closed)
coordination #127031: [saga][epic] openQA for SUSE customers
coordination #165393: [epic] Improved code coverage in openQA
[sporadic] isotovideo fails complaining about still existing testfd filehandle size:M
Description
Observation
See also #166445
The inner test is incomplete even though its only step, simple_boot, passed.
It seems that in some circumstances the process forked by isotovideo exits before the tests_done handling can run and testfd is
closed, so isotovideo complains about the still open filehandle: https://github.com/os-autoinst/os-autoinst/blob/master/script/isotovideo#L144
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_from_bootstrap@64bit-2G fails in
tests
autoinst-log.txt for the inner tests shows:
[2024-09-11T07:00:30.593688-04:00] [debug] [pid:51558] [autotest] process exited: 0
…
[2024-09-11T07:00:30.705630-04:00] [debug] [pid:51578] backend got TERM
[2024-09-11T07:00:30.705933-04:00] [info] [pid:51578] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-09-11T07:00:31.758537-04:00] [debug] [pid:51578] Passing remaining frames to the video encoder
[2024-09-11T07:00:31.794569-04:00] [debug] [pid:51578] Waiting for video encoder to finalize the video
[2024-09-11T07:00:31.794634-04:00] [debug] [pid:51578] The built-in video encoder (pid 51580) terminated
[2024-09-11T07:00:31.794967-04:00] [debug] [pid:51578] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 51578 (/usr/bin/isotovideo: backend)
[2024-09-11T07:00:31.795581-04:00] [debug] [pid:51578] sending magic and exit
[2024-09-11T07:00:31.908863-04:00] [debug] [pid:51558] done with backend process
51558: EXIT 1
[2024-09-11T07:00:31.927014-04:00] [info] Isotovideo exit status: 1
[2024-09-11T07:00:31.965003-04:00] [info] +++ worker notes +++
[2024-09-11T07:00:31.965099-04:00] [info] End time: 2024-09-11 11:00:31
[2024-09-11T07:00:31.965147-04:00] [info] Result: died
[2024-09-11T07:00:31.974998-04:00] [info] Uploading video.ogv
[2024-09-11T07:00:32.010558-04:00] [info] Uploading autoinst-log.txt
Also, from the worker log:
[2024-09-11T18:03:44.887585Z] [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4472003/status): Connection error: Premature connection close (remaining tries: 59)
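For illustration, a heavily simplified, hypothetical sketch of this failure mode (not the actual isotovideo code; the variable names and the final check are modelled after the description above):

# Hypothetical, simplified sketch: the parent only tears down $testfd inside
# its tests_done handling; if the forked test process exits before that
# handling runs, the final check still sees an open handle and the run fails.
use strict;
use warnings;

pipe(my $testfd, my $child_end) or die "pipe: $!";

my $pid = fork() // die "fork: $!";
if ($pid == 0) {    # forked "autotest"-like child
    close $testfd;
    exit 0;         # bad case: exits before the parent handled tests_done
}

close $child_end;
waitpid($pid, 0);    # child is already gone at this point

my $tests_done_handled = 0;    # would be set by the tests_done handling
if ($tests_done_handled) {
    close $testfd;
    undef $testfd;
}

# isotovideo-like sanity check before exiting
if (defined $testfd) {
    warn "test process exited, but testfd is still open, failing the run\n";
    exit 1;
}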
Steps to reproduce
Steps to reproduce with isotovideo:
1. Download HDD_1 from the latest openqa_install+publish test from https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_bootstrap&version=Tumbleweed&status=done#next_previous
2. Download vars.json
3. Run isotovideo until you hit the issue (~1/100 times)
For example, for an older build that would look like this:
wget https://openqa.opensuse.org/tests/4699317/asset/hdd/opensuse-Tumbleweed-x86_64-20241210-minimalx@64bit.qcow2
wget https://openqa.opensuse.org/tests/4699317/file/vars.json
for i in $(seq 1 20); do echo RUN $i; isotovideo -e CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-openQA NEEDLES_DIR=https://github.com/os-autoinst/os-autoinst-needles-openQA PRODUCTDIR=. 2>log || break; done
Also see #174259-7
Acceptance Criteria
- AC1: the scenario openqa_from_bootstrap is consistently stable with a fail ratio < 0.1% and without the inner openQA job ending up incomplete
Suggestions
- Run the inner example tests in a VM (DONE, didn't reproduce it)
- Try to run the openqa-in-openqa test, which needs nested virtualization or emulation
- Change the test code to schedule the inner test multiple times instead of once and run openqa-in-openqa many times and see if it makes a difference. Is it more often reproducible? Does it only fail for the first inner test?
- Run the openqa-in-openqa test, pause it and schedule the inner test manually multiple times
- Look for "isotovideo failed" (https://github.com/os-autoinst/os-autoinst/blob/master/script/isotovideo#L151) in the uploaded logs of the inner openQA job to detect the problem
Updated by tinita 3 months ago
- Copied from action #166445: [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry size:S added
Updated by tinita 3 months ago
- Subject changed from [openQA-in-openQA][sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry to [sporadic] test fails in tests, simple_boot incomplete auto_review:"no candidate needle.*openqa-test-details.*matched":retry
Updated by jbaier_cz 3 months ago
- Description updated (diff)
I added steps to reproduce inside a distrobox (just a plain opensuse/distrobox:latest with os-autoinst-devel and openQA-devel installed).
RUN 12
Web application available at http://[::]:20183
GOT GO
3292083: EXIT 101
For some reason, I was not able to reproduce this with podman and the isotovideo:qemu-x86 container image.
Updated by openqa_review 3 months ago
- Due date set to 2024-12-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 2 months ago
So, running the inner example tests in my VM didn't reproduce it.
I now tried to run the openqa-in-openqa test, but nested virtualization is not enabled in the VM. I asked for help in the openplatform channel.
Additional ideas:
- Change the test code to schedule the inner test multiple times instead of once and run openqa-in-openqa many times and see if it makes a difference. Is it more often reproducible? Does it only fail for the first inner test?
- Run openqa-in-openqa test and pause the test and schedule the inner test manually multiple times
Updated by tinita 2 months ago
- Status changed from In Progress to Workable
I wanted to run openqa-in-openqa tests in that VM, but wasn't able to get nested virtualization working. I asked in the openplatform channel and got some help, but something still seems to be missing.
Setting back to workable due to absence.
Updated by okurz 2 months ago
So what I did is run:
runs=200 count-fail-ratio isotovideo -e CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-openQA NEEDLES_DIR=https://github.com/os-autoinst/os-autoinst-needles-openQA PRODUCTDIR=. GIT_CACHE_DIR=. CACHEDIRECTORY=. TIMEOUT_SCALE=3 2>> log
I needed GIT_CACHE_DIR=. CACHEDIRECTORY=. as I was running as non-root and otherwise tests would fail trying to write to /var/lib/openqa, which yields "Permission denied". I also needed TIMEOUT_SCALE=3 because my system is slow. Then in run 6 I observed the first failure. In the local directory ulogs/ I found log files uploaded there during the last failure (I assume). In ulogs/tests-autoinst-log.txt I found:
[2024-12-19T09:05:33.649386-05:00] [debug] [pid:8947] no match: 21.0s, best candidate: bootloader (0.00)
[2024-12-19T09:05:34.763255-05:00] [debug] [pid:8947] no match: 20.0s, best candidate: bootloader (0.00)
[2024-12-19T09:05:35.792371-05:00] [debug] [pid:8946] >>> testapi::_handle_found_needle: found bootloader, similarity 1.00 @ 0/370
[2024-12-19T09:05:35.796324-05:00] [debug] [pid:8946] ||| finished boot tests (runtime: 11 s)
[2024-12-19T09:05:35.798447-05:00] [debug] [pid:8946] Sending tests_done
[2024-12-19T09:05:35.802355-05:00] [debug] [pid:8925] [autotest] process exited: 0
[2024-12-19T09:05:35.802691-05:00] [debug] [pid:8925] isotovideo handle_shutdown
[2024-12-19T09:05:35.802815-05:00] [debug] [pid:8925] stopping command server 8943 because test execution ended
[2024-12-19T09:05:35.802911-05:00] [debug] [pid:8925] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/x5vGNXUdPjX1sPxZ/broadcast
[2024-12-19T09:05:35.823290-05:00] [debug] [pid:8925] commands process exited: 0
[2024-12-19T09:05:35.923850-05:00] [debug] [pid:8925] done with command server
[2024-12-19T09:05:35.924013-05:00] [debug] [pid:8925] stopping autotest process 8946
[2024-12-19T09:05:35.924160-05:00] [debug] [pid:8925] done with autotest process
[2024-12-19T09:05:35.924276-05:00] [debug] [pid:8925] isotovideo failed
[2024-12-19T09:05:35.925186-05:00] [debug] [pid:8925] stopping backend process 8947
[2024-12-19T09:05:35.925538-05:00] [debug] [pid:8947] backend got TERM
[2024-12-19T09:05:35.925948-05:00] [info] [pid:8947] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-19T09:05:36.983996-05:00] [debug] [pid:8947] Passing remaining frames to the video encoder
[2024-12-19T09:05:37.043069-05:00] [debug] [pid:8947] Waiting for video encoder to finalize the video
[2024-12-19T09:05:37.043190-05:00] [debug] [pid:8947] The built-in video encoder (pid 8949) terminated
[2024-12-19T09:05:37.043954-05:00] [debug] [pid:8947] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 8947 (/usr/bin/isotovideo: backend)
[2024-12-19T09:05:37.045321-05:00] [debug] [pid:8947] sending magic and exit
[2024-12-19T09:05:37.231040-05:00] [debug] [pid:8925] done with backend process
8925: EXIT 1
so unlike #174259-7 there is no EXIT 101 and nothing about a "filehandle". Is this the same or a different problem?
Updated by tinita about 1 month ago
- Assignee deleted (tinita)
I checked the job history for all scenarios and this happened 11 days ago: https://openqa.opensuse.org/tests/4758316#step/tests/7
As I have more urgent tasks I will unassign myself.
Updated by mkittler 29 days ago · Edited
It looks like it fails to start ffmpeg:
[2024-09-11T14:00:17.588199-04:00] [debug] [pid:51552] loaded 1 needles
[2024-09-11T14:00:17.787552-04:00] [debug] [pid:51552] 51552: channel_out 15, channel_in 14
[2024-09-11T14:00:17.816477-04:00] [debug] [pid:51573] 51573: cmdpipe 13, rsppipe 16
[2024-09-11T14:00:17.816576-04:00] [debug] [pid:51573] started mgmt loop with pid 51573
Can't exec "ffmpeg": No such file or directory at /usr/lib/os-autoinst/backend/baseclass.pm line 347.
[2024-09-11T14:00:18.012504-04:00] [debug] [pid:51573] qemu version detected: 9.0.2
[2024-09-11T14:00:18.016391-04:00] [debug] [pid:51573] running `/usr/bin/chattr +C /var/lib/openqa/pool/1/raid`
[2024-09-11T14:00:18.024234-04:00] [debug] [pid:51573] Command `/usr/bin/chattr +C /var/lib/openqa/pool/1/raid` terminated with 0
[2024-09-11T14:00:18.024305-04:00] [debug] [pid:51573] Configuring storage controllers and block devices
This doesn't look fatal, though, and may even be the case when the inner test passes. Considering the line "[2024-09-11T14:00:32.443868-04:00] [debug] [pid:51573] The built-in video encoder (pid 51575) terminated" it just fell back to the built-in video encoder.
So the inner test matches one needle, concludes one test module as passed and shuts down all processes as expected, but then nevertheless incompletes. There are no suspicious log lines except that the line "Sending tests_done" is missing. Very puzzling.
Updated by openqa_review 28 days ago
- Due date set to 2025-02-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 28 days ago · Edited
I only see one reason how isotovideo (and autotest specifically) could get this far but then not reach "Sending tests_done": it received SIGTERM. In this case autotest simply invokes _exit(1) when there is no current test module (and that is the case here), so it would not reach "Sending tests_done", causing isotovideo to assume a premature test exit. It would also not flag the current test module as cancelled because there is no current test module.
If this is true, the question is why it received SIGTERM and how we could make it more obvious what happened.
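A minimal sketch of this theory (hypothetical and simplified, not the actual autotest code):

# Hypothetical sketch: a SIGTERM that arrives while no test module is current
# makes the process bail out via POSIX::_exit(1), so "Sending tests_done" is
# never reached and isotovideo assumes a premature test exit.
use strict;
use warnings;
use POSIX ();

my $current_test;    # undef in between test modules

$SIG{TERM} = sub {
    if (defined $current_test) {
        $current_test->{canceled} = 1;    # flag the running module instead
    }
    else {
        POSIX::_exit(1);                  # hard exit, skips everything below
    }
};

# ... run the test modules, setting and clearing $current_test ...

print "Sending tests_done\n";             # never reached if TERM hit the else branch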
EDIT: With this change it would be easy to know whether my theory is correct: https://github.com/os-autoinst/os-autoinst/pull/2640
Updated by mkittler 28 days ago
I suppose this theory is wrong because then autotest would have exited with return code 1, but it exited with return code 0 according to the log message from the "collected" event.
Then I really have no clue what happened. Even if some exception had happened, it would have been caught (via eval) and a warning would have been printed. I also haven't found any other relevant places where we call exit.
It also cannot be a problem with myjsonrpc because not even "Sending tests_done" is logged. It also cannot be an exception in bmwqemu::save_vars(no_secret => 1) because then we would get a warning. The same applies to any exception in autotest::runalltests().
Updated by mkittler 28 days ago
This test ran before commit https://github.com/os-autoinst/os-autoinst/commit/5c15841d7e208c9a7e9c1ca829edf7f031f3fa0d, which improved the logging. So before this commit we would have silently ignored errors happening in bmwqemu::save_vars(no_secret => 1) and in myjsonrpc::send_json($isotovideo, {cmd => 'tests_done', died => $died, completed => $completed}).
There are no other relevant changes since then, and as mentioned before I also don't see any other explanations. So I suspect a failure in one of those functions occurred.
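For illustration, roughly the kind of pattern such error logging follows, as a hypothetical sketch (the run_step helper and the step bodies are illustrative, not code from os-autoinst or the commit):

# Hypothetical sketch: run each shutdown step under eval and log the error
# instead of silently discarding it. The step bodies are stand-ins for
# bmwqemu::save_vars(no_secret => 1) and the send_json call quoted above.
use strict;
use warnings;

sub run_step {
    my ($name, $code) = @_;
    eval { $code->(); 1 } or warn "error on test shutdown ($name): $@";
}

run_step('save_vars',  sub { die "simulated failure\n" });
run_step('tests_done', sub { print "tests_done sent\n" });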
Since the failure was silently ignored I don't know what to improve here. I would simply wait until the issue occurs again in production because running the full-blown test 100 times (or even more) would be very time consuming (16 hours on my local machine).
I could also create 100 test runs on production if that is wanted. I don't think it is very useful because by now we ran a lot more than 100 tests on production over time anyway and I'm not aware that we encountered this issue again. I will double-check by going through the "Next & Previous" tab of this scenario, though (over 500 jobs).
I also don't see a problem with the "isotovideo failed" message, considering isotovideo rightfully returned with a non-zero return code. Now that we log the actual error it will hopefully also be clear why.
Updated by mkittler 28 days ago · Edited
There were over 20 more failures of this kind in the past months but logs are already gone. The most recent job I found is https://openqa.opensuse.org/tests/4758316 from just 22 days ago and it still has logs.
It looks a little bit different:
[2025-01-09T07:00:28.610026-05:00] [debug] [pid:8956] ||| finished boot tests (runtime: 11 s)
[2025-01-09T07:00:28.611396-05:00] [debug] [pid:8956] Sending tests_done
[2025-01-09T07:00:28.615050-05:00] [debug] [pid:8937] [autotest] process exited: 0
[2025-01-09T07:00:28.615269-05:00] [debug] [pid:8937] isotovideo handle_shutdown
Here we see the new log message ("Sending tests_done"), but this is also the last log message. So there was no exception, but presumably sending the tests_done message nevertheless didn't work. So send_json is probably the culprit after all.
This actually still happens quite frequently, maybe one job out of 20. All jobs I looked into looked the same.
EDIT: I see that we do proper error handling in send_json. Maybe the problematic part is the corresponding read_json call in isotovideo, which might simply not be invoked anymore before handle_shutdown is entered.
EDIT: The problem is the following code:
# stop main loop as soon as one of the child processes terminates
my $stop_loop = sub (@) { $self->loop(0) if $self->loop; };
$self->testprocess->once(collected => $stop_loop);
$self->backend->process->once(collected => $stop_loop);
$self->cmd_srv_process->once(collected => $stop_loop);
This means the loop will not continue to run (and thus not continue to read the tests_done message via read_json) once autotest exits. This is not a problem in most cases because most of the time the loop is waiting for new data to read anyway and thus will still invoke read_json once more. However, when it has just finished reading something and is checking whether it should continue to run, it would not continue (and thus not read tests_done).
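A stripped-down, hypothetical model of this race, not the actual os-autoinst loop, just to make the timing concrete:

# Hypothetical model: the "collected" handler clears the loop flag as soon as
# the child is reaped. If that happens right after a read, the loop condition
# is checked before read_json would run again, so a tests_done message already
# sitting in the pipe is never read.
use strict;
use warnings;

my $loop    = 1;
my @pending = ('status_update', 'tests_done');   # messages already buffered in the pipe

sub on_child_collected { $loop = 0 }             # mirrors: my $stop_loop = sub { $self->loop(0) ... }

while ($loop) {
    my $msg = shift(@pending) // last;           # stands in for read_json on the test fd
    print "read message: $msg\n";
    # The autotest child exits (and gets "collected") right here, after a
    # successful read but before the loop condition is checked again:
    on_child_collected() if $msg eq 'status_update';
}

print "unread messages left: @pending\n" if @pending;   # 'tests_done' was never read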
EDIT: This PR will likely fix the issue: https://github.com/os-autoinst/os-autoinst/pull/2641
I still want to do more testing, though, to be sure it doesn't cause any regressions.
Updated by okurz 28 days ago
mkittler wrote in #note-24:
I could also create 100 test runs on production if that is wanted. I don't think it is very useful because by now we ran a lot more than 100 tests on production over time anyway and I'm not aware that we encountered this issue again. I will double-check by going through the "Next & Previous" tab of this scenario, though (over 500 jobs).
As you already found out by now, normal production tests reproduce the issue, but the logs might be gone. The benefit of triggering multiple tests in parallel is that you have an easier time gathering statistics and can be sure that logs and data will be preserved for as long as the investigation needs, and that nearly no other parameters change within the runtime of the test set, e.g. no vastly different combinations of package versions on the workers over time. So I can still recommend triggering hundreds of jobs on o3, both to reproduce the original issue and to try out potential fixes.
Updated by mkittler 25 days ago
- Status changed from In Progress to Feedback
The PR has been merged. I'm keeping this ticket in feedback until it is deployed and in production for a few days.
Not sure whether it makes sense to do any further testing. We usually already create more than 20 jobs per day (see https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=openqa&flavor=dev&machine=64bit-2G&test=openqa_from_bootstrap&version=Tumbleweed#next_previous), but the most recent job reproducing the issue I could find is from 24 days ago (not within the last 501 jobs of the scenario anymore). So it is probably too much effort to create/run/review 500+ jobs for this. We would find out at some point if the fix doesn't work and can then still continue working on it.