FFmpeg is a flexible multimedia framework capable of handling nearly any media in existence. Decoding, encoding, transcoding, muxing, demuxing, streaming, filtering: it can do all of it. More than a tool, it powers massive online streaming platforms like YouTube, Facebook, Instagram, Disney, and Netflix. It's so popular that FFmpeg has even found its way onto Mars, powering the Perseverance Rover's stream in 2021!
Very commonly, FFmpeg is used to transcode videos: manipulating video codecs, audio channels, resolutions, and so on. This process is software-based, granting compatibility with almost every cloud instance out there (like EC2). Why add a GPU and lose that versatility? Well, the answer is very simple. Money!
To clarify, the goal is to save costs by having an instance work much faster so that it has less uptime. For example, a g4dn.xlarge costs twice as much as an octa-core instance (like a c7i.2xlarge), but it will transcode much faster (about 3-4x faster!).
Another consideration is the acceleration of other operations using that same GPU. Very commonly, FFmpeg is a means to an end: you're using FFmpeg to process a file for your service. Why not accelerate those other steps as well? After all, a micro-service doesn't entail micro levels of processing. With the emergence of machine learning & AI, the use of GPU instances has grown as well. In a case like running a model for video analysis, it only makes sense to move transcoding operations onto the GPU rather than keeping them on the CPU.
On the Verity team at GumGum, we integrate FFmpeg into our video pipelines, granting us incredible flexibility & speed when manipulating content for contextual analysis. It's also allowed us to avoid expensive instances that have both a powerful CPU & GPU, and entire micro-services that would've been dedicated to this transcoding.
An important thing to know is that the encoder & decoder on an Nvidia GPU are independent pieces of hardware, meaning they can run separately from the main CUDA processing cores. Built for speed, these NVENC / NVDEC cores were made to alleviate the computational strain of video processing (like streaming gameplay).
As such, there are three distinctions between CPU and GPU transcoding: speed, storage space, and quality.
CPU transcoding is slow, but has a minimal disk footprint and great quality. It also scales vertically, so the more cores you can throw at it the better.
GPU-based transcoding is almost the opposite: it's incredibly fast, but can triple disk space usage with varying image quality. Proper settings can match CPU quality except at the highest levels.
Consider a simple FFmpeg use case: take an .mp4, scale it to 720p, and output it as another file. Here FFmpeg will perform several steps (a minimal one-line version appears after the list).
- Split the container into individual streams (audio, video)
- Decode the streams into their raw formats
- Apply filters to said streams (like scaling to 720p to reduce file size)
- Encode the streams in the specified codecs
- Mux the streams back into a single file
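As a minimal sketch of that use case (the file names here are hypothetical), the whole pipeline is a one-liner, with FFmpeg handling the demux/decode/filter/encode/mux steps implicitly:

ffmpeg -i input.mp4 -vf scale=-2:720 -c:v libx264 -c:a copy output.mp4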
Hardware acceleration can happen at the decoding, filtering (for certain filters), and encoding steps.
After decoding, raw video frames are sent to VRAM, allowing for GPU-accelerated filters. Post-filtering, the frames are encoded and sent back to the main system's RAM to be muxed and finished.
If certain filters or transformations can't be done on the GPU, FFmpeg can be configured to send the decoded frames back into system memory / RAM as needed. Just remember each transfer is costly, so try to keep the data in one place as much as possible.
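As a sketch of such a roundtrip (the eq filter is just a stand-in for any CPU-only filter, and the file names are hypothetical), hwdownload and hwupload_cuda move frames between VRAM and system RAM:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -vf "scale_npp=1280:-2,hwdownload,format=nv12,eq=brightness=0.06,hwupload_cuda" \
  -c:v h264_nvenc output.mp4

Note the two extra hops; on long videos that transfer cost can eat into the GPU speedup.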
Now let's take a look at building a container capable of transcoding on our GPU.
There's a long list of packages needed to build an accelerated FFmpeg: toolkits, drivers, and so on. For simplicity, we'll use a g4dn.xlarge instance running Ubuntu and equipped with Nvidia's Deep Learning Base AMI, starting us off with Nvidia drivers and Docker.
Once we launch with this AMI, we're going to modify Docker's default runtime to use Nvidia's, allowing for GPU capabilities in the container.
After launching the instance, edit /etc/docker/daemon.json like so:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then restart Docker to apply the changes: sudo service docker restart
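To confirm the change took effect, docker info should now report nvidia as the default runtime:

docker info | grep -i runtime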
With that done, we can take a look at a sample Dockerfile that I've adapted from Nvidia's FFmpeg guide.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Keep package installs from waiting for our input
ENV DEBIAN_FRONTEND noninteractive
# Install build tools and libraries for video/audio encoding (x264, libmp3lame), and SSL support for handling URLs
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    openssl \
    libssl-dev \
    yasm \
    cmake \
    libtool \
    libc6 libc6-dev \
    unzip \
    wget \
    libnuma1 libnuma-dev \
    pkg-config \
    nvidia-cuda-toolkit \
    git \
    libx264-163 libx264-dev libmp3lame-dev \
    zlib1g-dev
# Install the Nvidia Codec Headers: files FFmpeg uses to enable hardware acceleration
RUN git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git /opt/nv-codec-headers && cd /opt/nv-codec-headers && git checkout sdk/12.0 && make install
# Clone a specific FFmpeg version, configure it for Nvidia acceleration and codecs, and compile to /opt/ffmpeg
RUN git clone https://git.ffmpeg.org/ffmpeg.git /opt/ffmpeg && cd /opt/ffmpeg && git checkout release/6.0
RUN cd /opt/ffmpeg && ./configure --enable-nonfree --enable-cuda-nvcc --enable-nvenc --nvccflags="-gencode arch=compute_52,code=sm_52 -O2" --enable-libnpp \
    --enable-gpl --enable-libx264 --enable-libmp3lame --enable-openssl --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --enable-shared --prefix=/opt/ffmpeg \
    && make -j 8 && make install
RUN rm -rf /opt/nv-codec-headers
# Set up environment variables to enable ffmpeg & ffprobe from the command line
ENV PATH="${PATH}:/opt/ffmpeg"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/ffmpeg/lib"
ENV NVIDIA_VISIBLE_DEVICES="all"
ENV NVIDIA_DRIVER_CAPABILITIES="compute,utility,video"
From there we can create a small docker-compose.yml file:
---
version: "3"
services:
  accelerated-ffmpeg:
    build:
      context: ./
    platform: linux/amd64
    user: root
    privileged: true
    stdin_open: true
    tty: true
And launch our container via docker compose run --rm accelerated-ffmpeg bash, which should show similar output upon entry:
==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
While inside, we can run nvidia-smi and ffmpeg -version to verify everything's working:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   20C    P8              9W /  70W  |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ffmpeg version n6.0.1-6-gcd49ee45ba Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 11 (Ubuntu 11.4.0-1ubuntu1~22.04)
configuration: --enable-nonfree --enable-cuda-nvcc --enable-nvenc --nvccflags='-gencode arch=compute_52,code=sm_52 -O2' --enable-libnpp --enable-gpl --enable-libx264 --enable-libmp3lame --enable-openssl --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --enable-shared --prefix=/opt/ffmpeg
libavutil      58.  2.100 / 58.  2.100
libavcodec     60.  3.100 / 60.  3.100
libavformat    60.  3.100 / 60.  3.100
libavdevice    60.  1.100 / 60.  1.100
libavfilter     9.  3.100 /  9.  3.100
libswscale      7.  1.100 /  7.  1.100
libswresample   4. 10.100 /  4. 10.100
libpostproc    57.  1.100 / 57.  1.100
You may be wondering: we started from an Ubuntu CUDA 11.8 image, so why the 12.2?

What you're seeing is the magic of the Nvidia Container Toolkit. This library mounts the host's drivers and libraries for use inside our container, so when we run nvidia-smi, what we're seeing is the host machine's drivers. CUDA has a certain amount of backwards compatibility, so as long as nvidia-smi works fine both in and out of the container, you're golden. Now, this was a gross oversimplification of the runtime vs driver API, but it's enough to continue working with FFmpeg.
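As an extra sanity check inside the container (assuming the build above), you can confirm the hardware entry points were actually compiled in:

ffmpeg -hide_banner -encoders | grep nvenc
ffmpeg -hide_banner -filters | grep -E "npp|cuda"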
To test our accelerated build of FFmpeg, we'll do some transcoding on this public video:
https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4
Let's take a look at our CPU transcoding command. We'll be using a c7i.2xlarge instance, equipped with 8 CPU cores and 16GB of DDR5 RAM. With these cutting-edge specs, this instance is well suited to CPU-intensive workloads.
ffmpeg -y -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -reset_timestamps 1 \
  -map 0:v:0 \
  -vf scale=1280:-2 \
  -c:v:0 libx264 \
  -map 0:a:0 \
  -c:a:0 libmp3lame \
  "cpu.mp4"
- -y automatically overwrites existing output files
- -i specifies the input (either a URL or a path)
- -reset_timestamps 1 resets timestamps to start at 0 in the output
- -map 0:v:0 maps the first video stream of the input to the first output video stream
- -vf scale=1280:-2 resamples the video to a 1280px width, preserving aspect ratio
- -c:v:0 libx264 specifies the libx264 encoder for the video stream
- -map 0:a:0 maps the first audio stream of the input to the first audio stream of the output
- -c:a:0 libmp3lame specifies libmp3lame for the first audio stream
- "cpu.mp4" name of the output file
This command took 1 minute and 57 seconds to complete, resulting in a transcoded file of 153 MB.
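To reproduce timings like these, the shell's time builtin works, and FFmpeg's own -benchmark flag prints CPU and wall-clock time on exit. A sketch, assuming a local copy of the file:

time ffmpeg -benchmark -y -i TearsOfSteel.mp4 -vf scale=1280:-2 -c:v libx264 -c:a libmp3lame cpu.mp4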
Now let's take a look at our GPU version, running on a g4dn.xlarge.
ffmpeg -y \
  -hwaccel cuda \
  -hwaccel_output_format cuda \
  -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -reset_timestamps 1 \
  -map 0:v:0 \
  -vf "scale_npp=1280:-2:interp_algo=super" \
  -c:v:0 h264_nvenc \
  -preset p7 \
  -tune:v hq \
  -rc:v vbr \
  -cq:v 19 \
  -b:v 0 \
  -map 0:a:0 \
  -c:a:0 libmp3lame \
  -fps_mode passthrough \
  "transcoded.mp4"
- -hwaccel cuda uses hardware acceleration for decoding video
- -hwaccel_output_format cuda keeps decoded frames in GPU VRAM
- -vf scale_npp=1280:-2:interp_algo=super same scaling but accelerated with scale_npp; interp_algo=super uses the supersampling algorithm while scaling, greatly improving image quality when downscaling
- -c:v:0 h264_nvenc encodes video with NVENC in h264 format
- -preset p7 high-quality encoder preset
- -tune:v hq prioritizes higher quality over speed
- -rc:v vbr enables variable bit rate
- -cq:v 19 chooses a constant quality setting of 19 (lower is better, but results in a larger file size; diminishing returns below 19)
- -b:v 0 sets the bit rate to auto; together with variable bit rate and constant quality, the encoder maintains the constant quality by adjusting the bit rate as needed
- -fps_mode passthrough prevents FFmpeg from creating output with duplicate and extra frames. In FFmpeg versions ≤ 5.1 this was -vsync 0 (recommended option from Nvidia)
Our GPU-accelerated command took 44 seconds to complete, resulting in a transcoded file of 422 MB. That's 2.65x faster!
All those extra argument flags (-cq:v 19, interp_algo=super) are there to increase image quality at the cost of speed/storage. For example, our 422 MB file is ~2.75x larger than our CPU result, so tweak settings as you see fit.
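As one illustrative example of trading quality for disk space (the values here are assumptions, not a recommendation), a faster preset and a higher CQ value shrink the output at some quality cost:

ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda \
  -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -vf "scale_npp=1280:-2" \
  -c:v h264_nvenc -preset p4 -rc:v vbr -cq:v 28 -b:v 0 \
  -c:a copy smaller.mp4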
During transcoding we can also query our GPU to see our utilization:
watch -n 1 nvidia-smi --query-gpu=utilization.gpu,utilization.encoder,utilization.decoder,power.draw,memory.total,memory.used,memory.free --format=csv
utilization.gpu [%], utilization.encoder [%], utilization.decoder [%], power.draw [W], memory.total [MiB], memory.used [MiB], memory.free [MiB]
5 %, 52 %, 15 %, 37.49 W, 15360 MiB, 174 MiB, 14756 MiB
Keep in mind the encoder and decoder are, once again, separate pieces of hardware, so it's not uncommon to see GPU utilization sit near 0%.
Helpful flags for FFmpeg:
- -loglevel error will only show error messages instead of everything like stream info, duration, speed, and so on
- -hide_banner hides the banner that displays the FFmpeg version, what it was built with, configuration settings, and so on
- -nostdin disables interaction with the FFmpeg process, useful in scripts
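Combined, a quiet, script-friendly invocation might look like this (file names are hypothetical):

ffmpeg -hide_banner -loglevel error -nostdin -y -i input.mp4 -c copy remuxed.mp4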
Staying within the AWS ecosystem allows the use of their EKS Optimized AMIs, specifically their GPU-accelerated versions.
Simply use this command with your EKS version to grab the correct AMI ID:
export EKS_VERSION=1.26
export REGION=us-east-1
aws ssm get-parameter --name /aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2-gpu/recommended/image_id --region ${REGION} --query "Parameter.Value" --output text
Just note that the CUDA and driver versions on that AMI must be compatible with the FFmpeg version you compile, and it must also run on a GPU-based instance (g4dn, for example).
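One quick compatibility check is to query the driver on the node itself:

nvidia-smi --query-gpu=driver_version --format=csv,noheader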
To make sure that the nodes (EC2 machines) with this GPU-optimized AMI are matched to pods that contain our accelerated FFmpeg, we'll need to set affinities on our nodes and pods, taints on our nodes, and tolerations on our pods.
Affinities affect scheduling preference, meaning nodes that match the key & value of the pod will be favored (this can be set to not schedule at all if no matching nodes are available).
Taints are at the node level, containing a key and value. If these don't match a pod's tolerations, the node will not accept the pod; they're used to repel pods without matching tolerations.
Tolerations are placed at the pod level, same idea here: if they don't match the node's taint, the pod will not be placed on that node.
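To make this concrete, here's a sketch of the pairing, with an assumed label/taint of accelerator=nvidia-t4 (the key, value, and node name are illustrative, not prescribed). First taint and label the GPU node:

kubectl taint nodes <gpu-node> accelerator=nvidia-t4:NoSchedule
kubectl label nodes <gpu-node> accelerator=nvidia-t4

Then give the pod a matching affinity and toleration:

# Sketch pod spec pairing a node affinity with a toleration for the taint above
apiVersion: v1
kind: Pod
metadata:
  name: accelerated-ffmpeg
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator
                operator: In
                values: ["nvidia-t4"]
  tolerations:
    - key: accelerator
      operator: Equal
      value: nvidia-t4
      effect: NoSchedule
  containers:
    - name: ffmpeg
      image: accelerated-ffmpeg:latest
      resources:
        limits:
          nvidia.com/gpu: 1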