coordination #64746: [saga][epic] Scale up: Efficient handling of large storage to be able to run current tests efficiently but keep big archives of old results
Support using a "generic" video encoder like ffmpeg instead of only relying on the one provided by os-autoinst itself
This ticket is a concrete idea to improve the general storage situation by storing the video more efficiently:
- The main benefit of using e.g. ffmpeg would be that we could easily switch the format as new formats are developed. We could also easily tweak parameters e.g. for slow workers or to utilize possibly available hardware encoding capabilities.
- Additionally, we would not need to invest effort into making our custom video encoder more efficient and fix its unresolved bugs. (It is not very efficient compared to e.g. VP9. It produces keyframes which are not marked correctly.)
AC1: The video encoder to be invoked can be configured (including the CLI parameters).
AC2: The format of the raw video passed to the video encoder (via a pipe) should be something most encoders can work with and should be documented.
AC3: Provide documentation including examples e.g. for using ffmpeg.
AC4: The custom video encoder still works as this is likely not hard to achieve and gives us room for experimentation with the new approach.
AC5: So far the name of the video file is hard-coded at several places within os-autoinst/openQA. It should at least be possible to allow different extensions because we don't want to be stuck with the Ogg container or use an incorrect extension.
- Find out which formats are usually supported as input by video encoding tools such as ffmpeg and x264. Likely the video encoder provided by os-autoinst itself already consumes some common raw video format so we wouldn't have to adapt much.
- Find out at which level the video encoder should be configurable for AC1.
- We could simply introduce yet another variable which would then likely be set within the worker config.
- We could use a placeholder for the position in the command where the output file goes. However, it likely makes sense to leave the file extension configurable.
#2 Updated by mkittler over 1 year ago
videoencoder binary of os-autoinst does more than just encoding the video. It implements a simple "protocol"¹ which allows to repeat the last frame and produces a PNG of the last frame as needed.
So we either
- move these features into the backend (implemented in Perl) and spawn the "generic" video encoder from there
- or we keep using the current
videoencoderbinary and make it fork the "generic" video encoder as configured.
Option 2. can likely be implement by changing less existing code but obviously means that the images go through two pipes instead of just one.
Option 1. on the other hand would likely let us get rid of the "protocol"¹ and lead to a simpler architecture by saving one intermediate process. Note that saving the PNGs is implemented by the current
videoencoder binary via OpenCV. We have OpenCV bindings for Perl anyways so moving that feature into the backend shouldn't make that much of a difference.
The format which is so far passed from the backend to the
videoencoder binary is
PPM. As coolo pointed out at least ffmpeg would support that directly:
for i in *.png; do pngtopnm $i; done | ffmpeg -y -r 60 -f image2pipe -vcodec ppm -i - -codec:v vp9 -qscale:v 7 /tmp/pnged.webm
¹The "protocol" used so far is basically:
- Enqueue a new frame:
"E " + to_string(length_of_ppm_image) + "\n" + ppm_image
- Repeat last frame:
The creation of the last PNG is not controlled via pipe input but simply by the presence of the file
live_log which the worker creates/removes.
But I'm still wondering whether I'm actually right about how the current
videoencoder works. Judging from the code it looks like it takes PPM images and it produces PNGs using the OpenCV function
imwrite. However, the comments at the top of
videoencoder.cpp say it would take PNGs.
#4 Updated by mkittler over 1 year ago
PR for os-autoinst: https://github.com/os-autoinst/os-autoinst/pull/1421
PR for openQA (worker and frontend): https://github.com/os-autoinst/openQA/pull/3125
See the descriptions of the PR/commits itself. These PRs should not change anything unless one configures os-autoinst to use a different encoder.
I implemented the configuration part via job settings. For os-autoinst it makes no difference where the job settings come from. However, in the context of openQA it makes sense to prevent setting the video encoder command line by arbitrary jobs¹ and only allow this configuration within
workers.ini. So it would be the best to merge the openQA PR first (or at least at the same time) because it prevents that.
¹ The introduced setting basically allows one to execute an arbitrary binary. Even if we would hard code it to be ffmpeg I wouldn't guarantee that it will be safe. Besides, it is easy to cause massive CPU usage with these commands so they should only be set in a controlled way.
#5 Updated by mkittler over 1 year ago
I am currently encountering issues with our unstable CI tests but generally both PR should be ready to merge.
The issue regarding truncated files mentioned in the os-autoinst PR is only a worker problem. I suppose the worker should not directly kill all processes when cancelling a job (user cancelled the job, a parallel job failed or timeout). It would make sense to send SIGTERM first and only kill the processes after a timeout.
#6 Updated by mkittler over 1 year ago
It seems that the killing behavior is part of Mojo-IOLoop-ReadWriteProcess which is fortunately configurable. With these changes it output file is no longer truncated: https://github.com/os-autoinst/openQA/pull/3152
That's at least how it looks like when looking into the pool directory after the job has finished with
--no-cleanup. The uploaded file is still truncated because the worker doesn't wait with the upload until ffmpeg has finalized the file.
Note that the previously mentioned PRs have been merged meanwhile. So I basically covered all ACs although I still need to avoid truncated files. (The PPM format might not be that common but at least ffmpeg handles it well so I consider AC2 covered.)
#7 Updated by mkittler over 1 year ago
Another problem is that ffmpeg's output disturbs the log with default flags. I think it is best not to suppress it completely and simply mention good flags in the documentation: https://github.com/os-autoinst/os-autoinst/pull/1425
#9 Updated by mkittler over 1 year ago
As mentioned in the meeting this setting is mainly a worker capability and therefore these settings need to go into the worker configuration. To make it test suite specific one needed to use a separate worker class for this test suite and configure a worker accordingly.
There's still the unresolved problem of truncated files in case of jobs which are cancelled by the user, because a parallel test failed or has been restarted or because the timeout has been exceeded. The videos truncated are still usable but seeking is impaired and the time isn't shown. (The format I'd like to try out is VP9 and it is not possible to mux it into an Ogg file which isn't as problematic in that regards.)
If that's not an issue I would configure a single worker instance on o3 to encode VP9 and supervise the CPU usage. It will be a little bit annoying that I needed to restart the worker all the time I want to tweak settings. Maybe it is better to test this on the staging worker first.
#10 Updated by mkittler over 1 year ago
The video can be truncated even if isotovideo exists normally. That's just not that likely but Mojo-IOLoop-ReadWriteProcess also within os-autoinst in the same way as it is used within the worker. However, it is only used to stop the backend and the command server. All sub processes managed by the backend itself are only indirectly affected by it. I now "tamed" Mojo-IOLoop-ReadWriteProcess within the worker similarly to my approach for the worker and also ensured that the backend waits until the video encoder exits. That works now but it is still annoying how the stop function in Mojo-IOLoop-ReadWriteProcess works. When one wants to give the process 10 seconds to terminate that always means a sleep of 10 seconds - even if the process terminates instantly. I suppose I'll have to change that directly within Mojo-IOLoop-ReadWriteProcess.
#11 Updated by okurz over 1 year ago
[…] I suppose I'll have to change that directly within Mojo-IOLoop-ReadWriteProcess.
well. I already looked into Mojo-IOLoop-ReadWriteProcess, e.g. see
for what we did recently. Maybe you are lucky and improve further ;)
#12 Updated by mkittler over 1 year ago
Yes, so far I came up with the following change: https://github.com/mudler/Mojo-IOLoop-ReadWriteProcess/compare/master...Martchus:improve-stopping?expand=1
I have also updated my branches for os-autoinst and openQA but I haven't created a PR yet because there are still problems. Depending on timing different problems occur:
- The worker terminates/kills the whole process group by using negative signal numbers. (It puts isotovideo its forks into a new process group via
setpgrp(0, 0).) That makes sense for the worker's because it helps avoiding leftover processes. However, when trying to properly terminate the video encoder already on backend-level it might lead to multiple SIGTERM signals being sent to the video encoder so ffmpeg stops finalizing the file.
- The file in the pool directory looks good but the uploaded file is still broken. That means ffmpeg is not forcefully terminated/killed anymore but stopping the backend within the worker does not block until ffmpeg has finished. Therefore the upload is already started while ffmpeg hasn't finalized the file yet.
I think the easiest solution for 1. is to simply not add my code for taking care of terminating the video encoder correctly on backend-level.
To fix 2. I need to implement the process group killing within Mojo-IOLoop-ReadWriteProcess (and not the worker) so it can appropriately wait for the entire group to terminate.
#13 Updated by mkittler over 1 year ago
PRs implementing the ideas mentioned in the last comment:
#15 Updated by mkittler about 1 year ago
All PR have been merged. Let me know if you still encounter any truncated videos (where seeking doesn't work correctly or the duration is not displayed).
Does it make sense to configure some test suites on o3 with a different video codec?
It would make sense now so I've configured o3 worker
openqaworker1 to use the following ffmpeg command:
ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
That change will have an effect after the next nightly deployment. The video will be using VP9 and the worker CPU usage will be a little bit higher. Let's see how well that works. If there are any problems, one can simply comment out or adjust the
EXTERNAL_VIDEO_ENCODER_CMD setting within
/etc/openqa/workers.ini on that worker and restart the worker services.
#16 Updated by mkittler about 1 year ago
- Status changed from In Progress to Resolved
- Target version deleted (
It turns out that the o3 workers are in their current configuration (transnational server) a terrible place to try things out. This ticket was mainly about supporting a generic video encoder within os-autoinst and openQA and I consider this done without utilizing the new feature in one of our production instances.
My current attempts to test this on an o3 worker where not successful so far because I have not been able to install ffmpeg and other tools and have already caused enough incomplete jobs. Using a different encoder and improving the o3 worker setup can be tracked as an infrastructure ticket.
Note that I've tested this locally of course. It would have only been interesting to see how certain encoder configurations perform on our production machine considering that there run up to 12 jobs in parallel.
#17 Updated by okurz about 1 year ago
It turns out that the o3 workers are in their current configuration (transnational server) a terrible place to try things out.
Come on, it's not terrible :) You can only make the change effective by restarting the worker instances anyway so you would either restart systemd services or just reboot the machine (after installing packages). Any incompleted openQA tests would be retriggered automatically anyway. The tradeoff of transactional vs. non-transactional is "consistency + easier rollback" vs. "faster turnaround without reboots", shouldn't be much more than that.
#19 Updated by mkittler about 1 year ago
Come on, it's not terrible
Last time I've tried
transactional-update pkg install … the packages I've tried to install weren't even there after the reboot. Maybe that was because I've also tried the read-write mounting you've suggested and that interfered. Additionally, it might turn out that I need to install yet another package after all. Besides, the interference between the different
transactional-update commands is not described well in the man page so for me even using that command itself is trial and error. I would never have assumed the overriding behavior you've mentioned after reading "General Commands can be used together in any combination" and "Package Commands … Can be combined with any number of General Commands." in the man page.
You can only make the change effective by restarting the worker instances anyway so you would either restart systemd services or just reboot the machine (after installing packages).
No. In my first attempt I've tried a third option: Wait for the nightly reboot which will happen anyways.
Any incompleted openQA tests would be retriggered automatically anyway.
At least the incompletes caused by the service restart. I could try to disable all worker instances except one first to reduce the number of incompletes.
I've been creating a follow-up ticket: https://progress.opensuse.org/issues/68923
I've left it open in that ticket whether we try it on an OSD worker or on an o3 worker first.