action #129955

closed

coordination #103938: [saga][epic] Scale up: Efficient handling of large storage on o3

openQA Infrastructure - coordination #68923: [epic] Use external videoencoder in production auto_review:"External encoder not accepting data"

Second attempt to try out AV1 video codec as potential new default as of 2023 size:M

Added by okurz 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Motivation

According to
https://en.m.wikipedia.org/wiki/AV1#Software_implementations
AV1 seems to have good support by now in current decoders, browsers and encoders. We should explore using the codec as a potential new default to make efficient use of storage space, taking encoding performance and time impact into account. We already tried two to three years ago. It's time to try again.

Acceptance criteria

  • AC1: AV1 is used as new default video encoder where supported in os-autoinst or we researched why AV1 is not suitable for our purposes
  • AC2: Older products running os-autoinst potentially not supporting AV1 still run the old or an alternative as fallback

Suggestions

  • Read what has been done in the predecessor ticket
  • Research how AV1 can be supported for us, e.g. by using ffmpeg as external video encoder as supported by #67342
  • Either replace the internal encoder or use an AV1 capable one as "external" default (and switch off the internal one in this case)
    • Note that the internal encoder is also responsible for producing the PNG for the live view so it can not be replaced completely.
  • Check if AV1 provides better performance for our needs than the old internal encoder
  • Consider also comparing against its predecessor VP9
  • Consider fallback handling for older products potentially not supporting AV1
  • Also see suggestions in #68923

Related issues 1 (0 open, 1 closed)

Copied from openQA Project - action #75256: Try out AV1 video codec as potential new default | Resolved | mkittler | 2020-10-25

Actions #1

Updated by okurz 11 months ago

  • Copied from action #75256: Try out AV1 video codec as potential new default added
Actions #2

Updated by okurz 11 months ago

  • Parent task set to #68923
Actions #3

Updated by livdywan 11 months ago

  • Subject changed from Second attempt to try out AV1 video codec as potential new default as of 2023 to Second attempt to try out AV1 video codec as potential new default as of 2023 size:M
  • Status changed from New to Workable

Same as the previous ticket. Evaluate the state of affairs.

Actions #4

Updated by okurz 11 months ago

  • Priority changed from Low to High

Actually it would be good to do this earlier, even before we add the external video encoder to OSD, so I'm increasing the priority.

Actions #5

Updated by mkittler 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #6

Updated by mkittler 10 months ago

I posted a summary two years ago in comment #75256#note-6. This is an updated version.

I have done all encodings with the same source video for comparability. I picked a random openQA video from my disk to do all the test encodings. Only midway through did I notice that the video is actually 1920x1080 (it must have come from some experiment), but for consistency I kept using it. It should not make a big difference, except that the encoding speeds are a bit lower than they would normally have been. Since the speed also depends a lot on the CPU, it will differ between our various production machines anyway.


  • SVT-AV1
  • libaom
    • available in ffmpeg-4 as provided by TW/packman
      • not available in Leap 15.4 and 15.5, I get Unknown encoder 'libaom_av1' despite --enable-encoder='…,libaom_av1,… showing up in the banner (same for just libaom).
    • documentation: https://trac.ffmpeg.org/wiki/Encode/AV1 and https://ffmpeg.org/ffmpeg-codecs.html#libaom_002dav1
    • the speed is better than it was two years ago but still not ideal, at least with the version provided by TW via ffmpeg -i video.ogv -c:v libaom-av1 -crf 35 -b:v 1500k -cpu-used 8 video-aom-av1.mkv (-cpu-used 8 is already the setting where it is as fast as possible)
      • I get around 1.5x but the encoder is using around three CPU cores. Likely we cannot afford to spend that much CPU time in production.
      • The Theora video shrunk from 19 MiB to 2.9 MiB and the quality was still acceptable.
  • rav1e
    • available in ffmpeg-4 as provided by TW/packman
      • not available in Leap 15.4 and 15.5, I get Unknown encoder 'librav1e' despite --enable-encoder='…,librav1e,… showing up in the banner.
    • documentation: https://ffmpeg.org/ffmpeg-codecs.html#librav1e
    • the speed is bad, at least with the version provided by TW (rav1e package, only 2 patch releases behind upstream) via ffmpeg -i video.ogv -c:v librav1e -qp 128 -speed 10 video-rav1e.mkv (-speed 10 seems to be the fastest)
      • I only get around 0.191x in the beginning but then it reached at least 0.485x. It was only using a single core, though.
      • The video was 3.7 MiB. That is bigger compared to the other encoders and the quality is nevertheless worse. So not a good result in comparison.
    • I've also tried ffmpeg -i video.ogv -c:v librav1e -b:v 1500k -speed 10 video-rav1e-cbr.mkv to see whether bitrate mode is any better. The encoding speed was even slightly lower. At 11 MiB the file was bigger (as expected with that bitrate setting), but at least the quality was on par with the other encoders.
  • libgav1
  • dav1d
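
Since the --enable-encoder entries in the ffmpeg banner turned out to be misleading on Leap, it is safer to ask ffmpeg for its actual encoder list. The following is a minimal sketch, assuming the listing printed by ffmpeg -encoders has the encoder name in the second column; the helper name has_encoder and the sample listing are made up for illustration.

```shell
# has_encoder NAME: succeed if NAME appears as an encoder in the
# "ffmpeg -encoders" listing supplied on stdin.
# Assumption: each encoder line looks like " <flags> <name> <description>".
has_encoder() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Typical use (requires ffmpeg in PATH):
#   ffmpeg -hide_banner -encoders 2>/dev/null | has_encoder libsvtav1

# Self-contained demonstration with an illustrative (made-up) listing:
sample_listing=' V..... libvpx-vp9           libvpx VP9 (codec vp9)
 V..... libtheora            libtheora Theora (codec theora)'

if printf '%s\n' "$sample_listing" | has_encoder libvpx-vp9; then
    echo "libvpx-vp9: available"
fi
if ! printf '%s\n' "$sample_listing" | has_encoder libsvtav1; then
    echo "libsvtav1: not available"
fi
```

Reading the listing from stdin keeps the check independent of which ffmpeg binary (system package, custom build or container) produced it.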

For comparison:

  • vp9
    • The speed is not great either, we currently use ffmpeg -i video.ogv -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0 video-vp9.mkv in production (-cpu-used 0 means it is as slow as possible)
      • Locally I only get 0.6x, I'm wondering how this can work in production. At least it is just using one core (unlike libaom which used at least 2 cores).
      • The Theora video shrunk from 19 MiB to 2.7 MiB and the quality was still acceptable. So VP9 with -cpu-used 0 is a bit more efficient than libaom with -cpu-used 8.
    • I tried with -cpu-used 4 as well because -cpu-used 0 seems a little extreme¹. With that I get 3.06x but also 3.7 MiB.
    • I have also tried with -cpu-used 3, 2 and 1 and with that I get 3.09x/3.6MiB, 2.71x/3.6MiB, and 1.71x/3.3MiB respectively.
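
The -cpu-used comparison above can be repeated with a small loop. This is only a sketch: it prints the ffmpeg commands as a dry run instead of executing them, and the function name and the output file naming scheme are assumptions for illustration (the encoder options are the ones quoted above).

```shell
# vp9_sweep_cmds INPUT: print one ffmpeg command per -cpu-used value (dry run).
# The output file names are a hypothetical naming scheme.
vp9_sweep_cmds() {
    input=$1
    for cpu_used in 0 1 2 3 4; do
        printf 'ffmpeg -i %s -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used %s video-vp9-cpu-used-%s.mkv\n' \
            "$input" "$cpu_used" "$cpu_used"
    done
}

# Print the commands for review; pipe the output to "sh" to actually run them.
vp9_sweep_cmds video.ogv
```

Printing first and running second makes it easy to check the parameters before spending several minutes of CPU time per encode.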

  • SVT-AV1 is fast and produces small files with acceptable quality. It is better than all other encoders, although VP9 does quite well, too.
  • VP9 produced a better quality/size ratio than SVT-AV1 only when using more CPU time. That would mean SVT-AV1 is better. However, I'm not sure whether I have actually picked comparable quality levels. So maybe one should not read too much into these figures and just assume SVT-AV1 and VP9 are very comparable for typical openQA footage.
  • I could have tried to tweak the settings further. Maybe some encoders, e.g. rav1e, would perform better with different settings. Considering SVT-AV1 and VP9 are far ahead, I doubt it, though.
  • Considering that AV1 is the future and the encoder supposedly still has more room for optimizations, I suppose SVT-AV1 is the winner.
  • Considering VP9 is the best we can do in Leap 15.4 and 15.5 without a custom ffmpeg build, we will likely nevertheless keep using VP9 for the time being. Maybe with -cpu-used 1, though.

¹ When I recently enabled VP9 on all o3 workers I've just copied the settings from workers where it has already been enabled. So I'm not sure why we use -cpu-used 0 in production.

Actions #7

Updated by okurz 10 months ago

cool.

So what are your plans?

mkittler wrote:

is that something you could test and compare against libaom?

  • libaom
    • available in ffmpeg-4 as provided by TW/packman/Leap15.4
    • documentation: https://trac.ffmpeg.org/wiki/Encode/AV1 and https://ffmpeg.org/ffmpeg-codecs.html#libaom_002dav1
    • the speed is too bad, at least with the version provided by TW via ffmpeg -i video.ogv -c:v libaom-av1 -crf 35 -b:v 1500k -cpu-used 8 video-aom-av1.mkv (-cpu-used 8 is already the setting where it is as fast as possible)
      • I get around 1.5x but the encoder is hammering the CPU. Likely we cannot afford to spend that much CPU time in production.

It is probably to be expected that libaom is not as quick as SVT-AV1 or others, but do you have any idea why it is that bad?

plans to test it then?

EDIT: I tested multiple different variants and also container images. The results are very comparable. This means at least that we are ok to use ffmpeg from openSUSE. There is no benefit from other OS bases. I also tried

podman run --pull=newer --rm -it --privileged -v "$(pwd):/videos" docker.io/masterofzen/av1an:latest -i video.ogv

which crashed my computer so I don't have the detailed notes anymore. But then finally I found one thing that was impressive:

podman run --pull=newer --rm -it -v "$PWD:/data" ghcr.io/tamara-schmitz/ffmpeg-docker-container -i /data/video.ogv -c:v libsvtav1 -preset 10 -crf 35 -c:a copy /data/video-svtav1-preset10_crf35.mkv

-> frame= 1478 fps= 98 q=35.0 Lsize= 4777kB time=00:01:33.41 bitrate= 418.9kbits/s speed=6.18x, video:4769kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.157033%

which is using ffmpeg 5.1.3 in a Tumbleweed container with SVT-AV1 encoder lib 1.4.1
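
The speed factor at the end of ffmpeg's statistics line (speed=6.18x in the run above) is the figure compared throughout this ticket. A minimal sketch for extracting it from captured ffmpeg output follows; the helper names are made up, and it assumes the progress output has been captured from ffmpeg's stderr.

```shell
# ffmpeg_speed: read ffmpeg progress output on stdin and print the last
# reported speed factor (the number in front of the trailing "x").
ffmpeg_speed() {
    tr '\r' '\n' | sed -n 's/.*speed= *\([0-9.]*\)x.*/\1/p' | tail -n 1
}

# faster_than_realtime FACTOR: succeed if FACTOR is at least 1.
faster_than_realtime() {
    awk -v s="$1" 'BEGIN { exit !(s >= 1) }'
}

# Statistics line taken from the SVT-AV1 run quoted in this comment.
stats='frame= 1478 fps= 98 q=35.0 Lsize=    4777kB time=00:01:33.41 bitrate= 418.9kbits/s speed=6.18x'
speed=$(printf '%s\n' "$stats" | ffmpeg_speed)
echo "speed factor: $speed"
if faster_than_realtime "$speed"; then
    echo "encoder keeps up with real time"
fi
```

A factor below 1 means the encoder cannot keep up with the incoming frames on that machine, which matters for an encoder that is fed live by the test backend. The factor alone does not show how many CPU cores were busy.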

Actions #8

Updated by mkittler 10 months ago

  • Status changed from In Progress to Feedback

I guess these findings are good enough to talk about it together on the next occasion.

@okurz I have frequently updated the comment you're quoting from. I suppose all your questions should be answered in the most recent version. Unfortunately this means that the Leap 15.4 support is not like I initially stated (the ffmpeg banner was misleading).

Actions #9

Updated by okurz 10 months ago

mkittler wrote:

I guess these findings are good enough to talk about it together on the next occasion.

@okurz I have frequently updated the comment you're quoting from. I suppose all your questions should be answered in the most recent version. Unfortunately this means that the Leap 15.4 support is not like I initially stated (the ffmpeg banner was misleading).

Yes, thank you. Very thorough report. So I suggest we do the following:

  1. Suggest in the openQA documentation that SVT-AV1 can be used with an example command line that would work on Tumbleweed when put into the external video encoder setting
  2. Change o3 production settings to cpu-used X with X in the range [1:8]
  3. Change one production o3 or osd worker to encode to AV1 with SVT-AV1 and a podman command line, e.g. based on what I wrote in #129955-7
Actions #10

Updated by mkittler 10 months ago

  • Status changed from Feedback to In Progress

PR for 1. https://github.com/os-autoinst/os-autoinst/pull/2326

For 2. I have invoked for i in aarch64 openqaworker4 openqaworker7 qa-power8-3 rebel; do echo $i && ssh root@$i "sed -i -e 's|-cpu-used 0|-cpu-used 1|g' /etc/openqa/workers.ini" ; done on ariel. I've excluded openqaworker19 and openqaworker20 because they are very powerful and not maxing out their CPU.
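
For anyone wanting to try the substitution safely before running it on a worker, here is a small sketch that applies the same sed expression to a throwaway copy. The sample file content mirrors the encoder setting used on the o3 workers; the temporary file is an assumption for illustration and stands in for /etc/openqa/workers.ini.

```shell
# Work on a throwaway file so the sketch never touches a real workers.ini.
ini=$(mktemp)
cat > "$ini" <<'EOF'
EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
EOF

# Preview: print only the lines that the substitution would rewrite.
sed -n -e 's|-cpu-used 0|-cpu-used 1|gp' "$ini"

# Apply the change in place (GNU sed), using the same expression as above.
sed -i -e 's|-cpu-used 0|-cpu-used 1|g' "$ini"
grep -- '-cpu-used' "$ini"

# Remove the temporary file when done: rm -f "$ini"
```

The preview step with -n and the p flag shows what will change without modifying anything, which is handy when the same command is about to be run on several hosts over ssh.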

I'm going to try out SVT-AV1 on a production machine tomorrow.

Actions #11

Updated by openqa_review 10 months ago

  • Due date set to 2023-06-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by mkittler 10 months ago

  • Status changed from In Progress to Feedback

Unfortunately I see no way to make a containerized ffmpeg invocation work without modifying os-autoinst first to avoid using an absolute path. So I've created https://github.com/os-autoinst/os-autoinst/pull/2327 for that. It works locally and I'll try it in production once it has been deployed.

Actions #13

Updated by okurz 10 months ago

https://github.com/os-autoinst/os-autoinst/pull/2327 is merged and most certainly deployed. Would be great to see some AV1 video on o3 by tomorrow's workshop, which is about the topic "More efficient video encoder used on o3 - how to work with videos" :)

Actions #14

Updated by mkittler 10 months ago

Ok, then I'll pick one of the new overpowered o3 workers for this.

Actions #15

Updated by mkittler 10 months ago

I've just created a test job on a special worker slot on the o3 worker openqaworker19: https://openqa.opensuse.org/tests/3358984 - To make this work I installed podman, configured user/group IDs so _openqa-worker can use it and configured the command from https://github.com/os-autoinst/os-autoinst/pull/2327.

If it works I'll enable SVT-AV1 on openqaworker19 for all regular slots. So far it looks good:

ffprobe /var/lib/openqa/pool/200/video.webm 
ffprobe version 4.4 Copyright (c) 2007-2021 the FFmpeg developers
…
[libdav1d @ 0x55dd49e6c7c0] libdav1d 0.9.2
Input #0, matroska,webm, from '/var/lib/openqa/pool/200/video.webm':
  Metadata:
    ENCODER         : Lavf60.3.100
  Duration: 00:00:34.63, start: 0.000000, bitrate: 772 kb/s
  Stream #0:0: Video: av1 (Main), yuv420p(tv, progressive), 1024x768 [SAR 1:1 DAR 4:3], 24 fps, 24 tbr, 1k tbn, 1k tbc
    Metadata:
      ENCODER         : Lavc60.3.100 libsvtav1
      DURATION        : 00:00:34.625000000
[libdav1d @ 0x55dd49e71780] libdav1d 0.9.2
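
To check the codec of many videos without reading the full ffprobe output, the relevant "Stream ... Video:" line can be filtered. A minimal sketch, assuming the human-readable ffprobe output format shown above; the function name video_codec is made up.

```shell
# video_codec: read human-readable ffprobe output on stdin and print the
# codec name of the first video stream.
video_codec() {
    sed -n 's/^ *Stream #[^ ]* Video: \([^ ,]*\).*/\1/p' | head -n 1
}

# Typical use (ffprobe prints this information on stderr):
#   ffprobe -hide_banner /var/lib/openqa/pool/200/video.webm 2>&1 | video_codec
# ffprobe can also print the codec directly:
#   ffprobe -v error -select_streams v:0 -show_entries stream=codec_name \
#       -of default=noprint_wrappers=1:nokey=1 video.webm

# Demonstration with the stream line from the output above:
printf '%s\n' '  Stream #0:0: Video: av1 (Main), yuv420p(tv, progressive), 1024x768 [SAR 1:1 DAR 4:3], 24 fps, 24 tbr, 1k tbn, 1k tbc' | video_codec
```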
Actions #16

Updated by okurz 10 months ago

Great to see this working in production. For comparison: for now AV1 is bigger than VP9.

Actions #17

Updated by mkittler 10 months ago

So this generally works. However, it relied on giving every user access to /var/lib/empty/, which is not an option as it prevents login via ssh. Without this access we're running into the following podman error: Error: creating runtime static files directory: mkdir /var/lib/empty/.local: permission denied

It is expected that SVT-AV1 is less efficient compared to the VP9 encoding which has been done with -cpu-used 0. However, I wouldn't have thought it makes such a big difference. So I'll refrain from enabling this everywhere before I do some further local testing. (The tests I've conducted so far are 2.7 MiB from VP9 vs. 2.9 MiB from SVT-AV1 for the same video. Likely the -preset 10 option I picked up from #129955#note-7 made things worse.)

Actions #18

Updated by mkittler 10 months ago

The podman setup was fixed after just giving _openqa-worker a proper home directory. So I basically just did mkdir /var/lib/openqa/worker && chown _openqa-worker:users /var/lib/openqa/worker and changed /etc/passwd accordingly.

I've also just removed -preset 10 and started another test job: https://openqa.opensuse.org/tests/3359054
Let's see how big the video will be now.
EDIT: It reduced the file size from 3.34 MiB to 3.01 MiB. This is still bigger so I'm trying -crf 45 now: https://openqa.opensuse.org/tests/3359063
EDIT: With -crf 45 we're at 2.7 MiB. Maybe it makes sense to increase the CRF further but I'll have to compare the quality first.

EDIT: I think reducing the quality to -crf 50 is still acceptable. Decreasing it further would definitely be worse than the VP9 encoding we're comparing with (and I guess 50 is also already worse). With 50 we still get 2.3 MiB by default. By using a lower preset one can make it more efficient. I think the lowest preset we can use in production is 6, otherwise it gets quite slow. With that we're comparable¹ to VP9:

1,7M    video-2-svt50-preset-6.mkv
1,8M    video-2-svt50-preset-7.mkv
2,1M    video-2-svt50-preset-8.mkv
2,3M    video-2-svt50-preset-default.mkv
3,1M    video-2-vp9-crf35-cpu-used-1.mkv

It is hard to compare because the CRF parameter's scale is not identical, so the videos have slightly different quality. From a brief comparison SVT-AV1's 50 is somewhere between VP9's 35 and 45. The speed is also hard to compare because it is also relevant how many CPU cores were kept busy (and SVT-AV1 seems to be faster but at the cost of utilizing more cores).
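
To put the listing above into relative terms, the sizes can be compared against the VP9 file. This is a rough sketch only: the listing rounds to one decimal, the qualities are not identical as noted, and the function name relative_sizes is made up. It assumes du-style input with sizes in M and a decimal comma.

```shell
# relative_sizes BASELINE: read "<size>M <file>" lines on stdin and print how
# much smaller each file is than the file named BASELINE, in percent.
relative_sizes() {
    LC_ALL=C awk -v baseline="$1" '
        {
            size = $1
            sub(/M$/, "", size)   # strip the unit suffix
            sub(/,/, ".", size)   # decimal comma -> decimal point
            sizes[NR] = size + 0
            names[NR] = $2
            if ($2 == baseline) base = size + 0
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", names[i], (1 - sizes[i] / base) * 100
        }'
}

relative_sizes video-2-vp9-crf35-cpu-used-1.mkv <<'EOF'
1,7M    video-2-svt50-preset-6.mkv
1,8M    video-2-svt50-preset-7.mkv
2,1M    video-2-svt50-preset-8.mkv
2,3M    video-2-svt50-preset-default.mkv
3,1M    video-2-vp9-crf35-cpu-used-1.mkv
EOF
```

With these rounded figures the preset 6 file comes out roughly 45% smaller than the VP9 file and the default preset roughly 26% smaller.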

Overall I think -c:v libsvtav1 -crf 50 -preset 6 would be good parameters for SVT-AV1 (I've configured that on openqaworker19 now for all regular slots¹), maybe also -preset 7 for slower workers.


¹ If you see any problems with that, feel free to activate the VP9 encoder config in /etc/openqa/workers.ini on that worker again. The worker services are supposed to restart automatically after editing the config.

Actions #19

Updated by mkittler 10 months ago

It still looks good on o3, e.g. https://openqa.opensuse.org/tests/3361186/video?filename=video.webm and https://openqa.opensuse.org/tests/3361570/video?filename=video.webm - bitrate and quality are comparable with VP9 encodings.

As discussed, isotovideo should probe by default whether ffmpeg is installed and what it can do. It would then use SVT-AV1 (or VP9 as fallback) with reasonable default parameters. Only if neither is possible would the normal video encoder be used. PR for this: https://github.com/os-autoinst/os-autoinst/pull/2328
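
The probing order described here can be illustrated with a short shell sketch. This is not the implementation from the PR (os-autoinst is written in Perl); the function name, the "internal" marker and the assumption about the ffmpeg -encoders listing format are made up for illustration, and the encoder parameters are simply the ones discussed in this ticket.

```shell
# pick_encoder_args: read an "ffmpeg -encoders" listing on stdin and print the
# encoder arguments to use, preferring SVT-AV1, then VP9, else "internal".
# Assumption: the encoder name is the second column of each listing line.
pick_encoder_args() {
    listing=$(cat)
    for candidate in libsvtav1 libvpx-vp9; do
        if printf '%s\n' "$listing" |
            awk -v n="$candidate" '$2 == n { f = 1 } END { exit !f }'; then
            case $candidate in
                libsvtav1)  echo "-c:v libsvtav1 -crf 50 -preset 6" ;;
                libvpx-vp9) echo "-c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1" ;;
            esac
            return 0
        fi
    done
    echo "internal"
}

# Typical use; an empty listing (ffmpeg missing) falls through to "internal":
#   ffmpeg -hide_banner -encoders 2>/dev/null | pick_encoder_args

# Demonstration with an illustrative listing that only offers VP9:
printf '%s\n' ' V..... libvpx-vp9           libvpx VP9 (codec vp9)' | pick_encoder_args
```

Checking the encoder list instead of the ffmpeg version keeps the fallback working on distributions such as Leap 15.4 and 15.5, where ffmpeg is installed but the AV1 encoders are not usable.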

Actions #20

Updated by dimstar 10 months ago

OW19 workers currently fail all jobs with

Reason: backend died: External encoder not accepting data: Broken pipe at /usr/lib/os-autoinst/backend/baseclass.pm line 141.

In the logs, I can find

Launching external video encoder: podman run --workdir /pool --pull=newer --rm -i -v .:/pool ghcr.io/tamara-schmitz/ffmpeg-docker-container-free -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libsvtav1 -crf 50 -preset 7 'video.webm'
Error: opening database /var/lib/openqa/worker/.local/share/containers/storage/libpod/bolt_state.db: open /var/lib/openqa/worker/.local/share/containers/storage/libpod/bolt_state.db: permission denied

The "permission denied" error seems to be a consequence of AppArmor blocking the access:

type=AVC msg=audit(1687185372.888:3644): apparmor="DENIED" operation="open" profile="/usr/share/openqa/script/worker" name="/var/lib/openqa/worker/.local/share/containers/storage/libpod/bolt_state.db" pid=14267 comm="podman" requested_mask="wrc" denied_mask="wrc" fsuid=103 ouid=103
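
When several such denials show up, it helps to pull the relevant fields out of the audit lines. A minimal sketch, assuming the key="value" layout of the audit record above; the function name audit_field is made up.

```shell
# audit_field KEY: read audit log lines on stdin and print the quoted value
# of KEY for each line that contains it.
audit_field() {
    sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

# The audit record quoted above:
line='type=AVC msg=audit(1687185372.888:3644): apparmor="DENIED" operation="open" profile="/usr/share/openqa/script/worker" name="/var/lib/openqa/worker/.local/share/containers/storage/libpod/bolt_state.db" pid=14267 comm="podman" requested_mask="wrc" denied_mask="wrc" fsuid=103 ouid=103'

echo "profile: $(printf '%s\n' "$line" | audit_field profile)"
echo "path:    $(printf '%s\n' "$line" | audit_field name)"
echo "denied:  $(printf '%s\n' "$line" | audit_field denied_mask)"
```

For this record it shows that the worker's AppArmor profile denied podman write/read/create access to its own state database under the new home directory.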

Actions #21

Updated by mkittler 10 months ago

I have enabled VP9 again on openqaworker19.

I don't think it is worth making podman and AppArmor work together. As we've seen, issues might not be immediately obvious and we'd likely only be able to find them one by one when running this over a longer period of time.

Actions #22

Updated by okurz 10 months ago

Still waiting for second approval of https://github.com/os-autoinst/os-autoinst/pull/2328

Actions #23

Updated by okurz 10 months ago

https://github.com/os-autoinst/os-autoinst/pull/2328 has been merged, meaning that all openQA workers using that version of os-autoinst should by default try AV1 and fall back to VP9. So far all o3 machines have the external video encoder set, so the setting overrides the internal default. And OSD workers failed deployment today, so the new version is not yet installed. I suggest awaiting the deployment on OSD and waiting a night to see if there is an immediate regression.

Actions #24

Updated by mkittler 10 months ago

Done:

martchus@ariel:~> for i in aarch64 openqaworker4 openqaworker7 qa-power8-3 rebel; do echo $i && ssh root@$i "sed -i -e 's|^EXTERNAL_VIDEO_ENCODER_CMD=|#EXTERNAL_VIDEO_ENCODER_CMD=|g' /etc/openqa/workers.ini" ; done
aarch64
openqaworker4
openqaworker7
qa-power8-3
rebel
martchus@ariel:~> for i in aarch64 openqaworker4 openqaworker7 qa-power8-3 rebel; do echo $i && ssh root@$i "grep -i external /etc/openqa/workers.ini" ; done
aarch64
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1
EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
openqaworker4
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1
EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
openqaworker7
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1
EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
qa-power8-3
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1
EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
rebel
#EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1
EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm

I kept the line on worker19/20 as those are powerful enough for -cpu-used 0. I added a corresponding remark.

I'll have a look tomorrow to check a few jobs.

Actions #25

Updated by okurz 10 months ago

Please ensure that worker instances on worker19 are online again before closing the ticket

Actions #26

Updated by mkittler 10 months ago

Strange, looks like someone disabled them. I'll enable them again.

Actions #27

Updated by mkittler 10 months ago

  • Status changed from Feedback to Resolved

I enabled all slots again. It looks good (e.g. https://openqa.opensuse.org/tests/3375665#downloads and https://openqa.opensuse.org/tests/3375894#downloads) so I'm considering this ticket resolved.

Actions #28

Updated by okurz 10 months ago

  • Due date deleted (2023-06-28)