FFmpeg is a flexible multimedia framework capable of handling nearly any media in existence. Decoding, encoding, transcoding, muxing, demuxing, streaming, filtering: it can do all of it. More than a tool, it powers massive online streaming platforms like YouTube, Facebook, Instagram, Disney, and Netflix. It's so popular that FFmpeg has even found its way onto Mars, powering the Perseverance Rover's stream in 2021!
Very commonly, FFmpeg is used to transcode videos: manipulating video codecs, audio channels, resolutions, and so on. This process is software-based, granting compatibility with almost every cloud instance out there (like EC2). Why add a GPU and lose that versatility? Well, the answer is very simple. Money!
To clarify, the goal is to save costs by having an instance work much faster so that it has less uptime. For example, a g4dn.xlarge costs twice as much as an octa-core instance (like a c7i.2xlarge), but it will transcode much faster (about 3-4x faster!).
Another consideration is the acceleration of other operations using that same GPU. Very commonly, FFmpeg is a means to an end: you're using FFmpeg to process a file for your service. Why not accelerate those other steps as well? After all, a micro-service doesn't entail micro levels of processing. With the emergence of machine learning & AI, the use of GPU instances has grown as well. In a case like running a model for video analysis, it only makes sense to move transcoding operations onto the GPU rather than keeping them on the CPU.
On the Verity team at GumGum, we integrate FFmpeg into our video pipelines, granting us incredible flexibility & speed when manipulating content for contextual analysis. It's also allowed us to avoid expensive instances that have both a powerful CPU & GPU, and entire micro-services that would've been dedicated to this transcoding.
An important thing to know is that the encoder & decoder on an Nvidia GPU are independent pieces of hardware, meaning they can run separately from the main CUDA processing cores. Built for speed, these NVENC / NVDEC cores were made to alleviate the computational strain of video processing (like streaming gameplay).
As such, there are three distinctions between CPU and GPU transcoding: speed, storage space, and quality.
CPU transcoding is slow, but has a minimal disk footprint and great quality. It also scales vertically, so the more cores you can throw at it the better.
GPU-based transcoding is almost the opposite: it's incredibly fast, but can triple disk space usage with varying image quality. Proper settings can match CPU quality except at the highest levels.
Consider a simple FFmpeg use case: take an .mp4, scale it to 720p, and output it as another file. Here FFmpeg will perform several steps (a minimal one-line version appears after the list).
- Split the container into individual streams (audio, video)
- Decode the streams into their raw formats
- Apply filters to said streams (like scaling to 720p to reduce file size)
- Encode the streams in the specified codecs
- Mux the streams back into a single file
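As a minimal sketch of that use case (the file names here are hypothetical), the whole pipeline is a one-liner, with FFmpeg handling the demux/decode/filter/encode/mux steps implicitly:

ffmpeg -i input.mp4 -vf scale=-2:720 -c:v libx264 -c:a copy output.mp4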
Hardware acceleration can happen at the decoding, filtering (for certain filters), and encoding steps.
After decoding, raw video frames are sent to VRAM, allowing for GPU-accelerated filters. Post-filtering, the frames are encoded and sent back to the main system's RAM to be muxed and finished.
If certain filters or transformations can't be done on the GPU, FFmpeg can be configured to send the decoded frames back into system memory / RAM as needed. Just remember each transfer is costly, so try to keep the data in one place as much as possible.
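As a sketch of such a roundtrip (the eq filter is just a stand-in for any CPU-only filter, and the file names are hypothetical), hwdownload and hwupload_cuda move frames between VRAM and system RAM:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -vf "scale_npp=1280:-2,hwdownload,format=nv12,eq=brightness=0.06,hwupload_cuda" \
  -c:v h264_nvenc output.mp4

Note the two extra hops; on long videos that transfer cost can eat into the GPU speedup.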
Now let's take a look at building a container capable of transcoding on our GPU.
There's a long list of packages needed to build an accelerated FFmpeg: toolkits, drivers, and so on. For simplicity, we'll use a g4dn.xlarge instance running Ubuntu and equipped with Nvidia's Deep Learning Base AMI, starting us off with Nvidia drivers and Docker.
Once we launch with this AMI, we're going to modify Docker's default runtime to use Nvidia's, allowing for GPU capabilities in the container.
After launching the instance, edit /etc/docker/daemon.json like so:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then restart Docker to apply the changes: sudo service docker restart
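To confirm the change took effect, docker info should now report nvidia as the default runtime:

docker info | grep -i runtime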
With that done, we can take a look at a sample Dockerfile that I've adapted from Nvidia's FFmpeg guide.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Keep package installs from waiting for our input
ENV DEBIAN_FRONTEND noninteractive
# Install build tools and libraries for video/audio encoding (x264, libmp3lame), and SSL support for handling URLs
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    openssl \
    libssl-dev \
    yasm \
    cmake \
    libtool \
    libc6 libc6-dev \
    unzip \
    wget \
    libnuma1 libnuma-dev \
    pkg-config \
    nvidia-cuda-toolkit \
    git \
    libx264-163 libx264-dev libmp3lame-dev \
    zlib1g-dev
# Install the Nvidia Codec Headers: files FFmpeg uses to enable hardware acceleration
RUN git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git /opt/nv-codec-headers && cd /opt/nv-codec-headers && git checkout sdk/12.0 && make install
# Clone a specific FFmpeg version, configure it for Nvidia acceleration and codecs, and compile to /opt/ffmpeg
RUN git clone https://git.ffmpeg.org/ffmpeg.git /opt/ffmpeg && cd /opt/ffmpeg && git checkout release/6.0
RUN cd /opt/ffmpeg && ./configure --enable-nonfree --enable-cuda-nvcc --enable-nvenc --nvccflags="-gencode arch=compute_52,code=sm_52 -O2" --enable-libnpp \
    --enable-gpl --enable-libx264 --enable-libmp3lame --enable-openssl --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --enable-shared --prefix=/opt/ffmpeg \
    && make -j 8 && make install
RUN rm -rf /opt/nv-codec-headers
# Set up environment variables to enable ffmpeg & ffprobe from the command line
ENV PATH="${PATH}:/opt/ffmpeg"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/ffmpeg/lib"
ENV NVIDIA_VISIBLE_DEVICES="all"
ENV NVIDIA_DRIVER_CAPABILITIES="compute,utility,video"
From there we can create a small docker-compose.yml file:
---
version: "3"
services:
  accelerated-ffmpeg:
    build:
      context: ./
    platform: linux/amd64
    user: root
    privileged: true
    stdin_open: true
    tty: true
And launch our container via docker compose run --rm accelerated-ffmpeg bash, which should show similar output upon entry:
==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
While inside, we can run nvidia-smi and ffmpeg -version to verify everything's working:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   20C    P8              9W /  70W  |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ffmpeg version n6.0.1-6-gcd49ee45ba Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 11 (Ubuntu 11.4.0-1ubuntu1~22.04)
configuration: --enable-nonfree --enable-cuda-nvcc --enable-nvenc --nvccflags='-gencode arch=compute_52,code=sm_52 -O2' --enable-libnpp --enable-gpl --enable-libx264 --enable-libmp3lame --enable-openssl --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --enable-shared --prefix=/opt/ffmpeg
libavutil      58.  2.100 / 58.  2.100
libavcodec     60.  3.100 / 60.  3.100
libavformat    60.  3.100 / 60.  3.100
libavdevice    60.  1.100 / 60.  1.100
libavfilter     9.  3.100 /  9.  3.100
libswscale      7.  1.100 /  7.  1.100
libswresample   4. 10.100 /  4. 10.100
libpostproc    57.  1.100 / 57.  1.100
You may be wondering: we started from an Ubuntu CUDA 11.8 image, so why the 12.2?

What you're seeing is the magic of the Nvidia Container Toolkit. This library mounts the host's drivers and libraries for use inside our container, so when we run nvidia-smi, what we're seeing is the host machine's drivers. CUDA has a certain amount of backwards compatibility, so as long as nvidia-smi works fine both in and out of the container, you're golden. Now, this was a gross oversimplification of the runtime vs driver API, but it's enough to continue working with FFmpeg.
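As an extra sanity check inside the container (assuming the build above), you can confirm the hardware entry points were actually compiled in:

ffmpeg -hide_banner -encoders | grep nvenc
ffmpeg -hide_banner -filters | grep -E "npp|cuda"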
To test our accelerated build of FFmpeg, we'll do some transcoding on this public video:
https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4
Let's take a look at our CPU transcoding command. We'll be using a c7i.2xlarge instance, equipped with 8 CPU cores and 16GB of DDR5 RAM. With these cutting-edge specs, this instance is well suited to CPU-intensive workloads.
ffmpeg -y -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -reset_timestamps 1 \
  -map 0:v:0 \
  -vf scale=1280:-2 \
  -c:v:0 libx264 \
  -map 0:a:0 \
  -c:a:0 libmp3lame \
  "cpu.mp4"
- -y automatically overwrites existing output files
- -i specifies the input (either a URL or a path)
- -reset_timestamps 1 resets timestamps to start at 0 in the output
- -map 0:v:0 maps the first video stream of the input to the first output video stream
- -vf scale=1280:-2 resamples the video to a 1280px width, preserving aspect ratio
- -c:v:0 libx264 specifies the libx264 encoder for the video stream
- -map 0:a:0 maps the first audio stream of the input to the first audio stream of the output
- -c:a:0 libmp3lame specifies libmp3lame for the first audio stream
- "cpu.mp4" name of the output file
This command took 1 minute and 57 seconds to complete, resulting in a transcoded file of 153 MB.
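To reproduce timings like these, the shell's time builtin works, and FFmpeg's own -benchmark flag prints CPU and wall-clock time on exit. A sketch, assuming a local copy of the file:

time ffmpeg -benchmark -y -i TearsOfSteel.mp4 -vf scale=1280:-2 -c:v libx264 -c:a libmp3lame cpu.mp4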
Now let's take a look at our GPU version, running on a g4dn.xlarge.
ffmpeg -y \
  -hwaccel cuda \
  -hwaccel_output_format cuda \
  -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -reset_timestamps 1 \
  -map 0:v:0 \
  -vf "scale_npp=1280:-2:interp_algo=super" \
  -c:v:0 h264_nvenc \
  -preset p7 \
  -tune:v hq \
  -rc:v vbr \
  -cq:v 19 \
  -b:v 0 \
  -map 0:a:0 \
  -c:a:0 libmp3lame \
  -fps_mode passthrough \
  "transcoded.mp4"
- -hwaccel cuda uses hardware acceleration for decoding video
- -hwaccel_output_format cuda keeps decoded frames in GPU VRAM
- -vf scale_npp=1280:-2:interp_algo=super same scaling but accelerated with scale_npp; interp_algo=super uses the supersampling algorithm while scaling, greatly improving image quality when downscaling
- -c:v:0 h264_nvenc encodes video with NVENC in h264 format
- -preset p7 high-quality encoder preset
- -tune:v hq prioritizes higher quality over speed
- -rc:v vbr enables variable bit rate
- -cq:v 19 chooses a constant quality setting of 19 (lower is better, but results in a larger file size; diminishing returns below 19)
- -b:v 0 sets the bit rate to auto; together with variable bit rate and constant quality, the encoder maintains the constant quality by adjusting the bit rate as needed
- -fps_mode passthrough prevents FFmpeg from creating output with duplicate and extra frames. In FFmpeg versions ≤ 5.1 this was -vsync 0 (recommended option from Nvidia)
Our GPU-accelerated command took 44 seconds to complete, resulting in a transcoded file of 422 MB. That's 2.65x faster!
All those extra argument flags (-cq:v 19, interp_algo=super) are there to increase image quality at the cost of speed/storage. For example, our 422 MB file is ~2.75x larger than our CPU result, so tweak settings as you see fit.
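As one illustrative example of trading quality for disk space (the values here are assumptions, not a recommendation), a faster preset and a higher CQ value shrink the output at some quality cost:

ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda \
  -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -vf "scale_npp=1280:-2" \
  -c:v h264_nvenc -preset p4 -rc:v vbr -cq:v 28 -b:v 0 \
  -c:a copy smaller.mp4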
During transcoding we can also query our GPU to see our utilization:
watch -n 1 nvidia-smi --query-gpu=utilization.gpu,utilization.encoder,utilization.decoder,power.draw,memory.total,memory.used,memory.free --format=csv
utilization.gpu [%], utilization.encoder [%], utilization.decoder [%], power.draw [W], memory.total [MiB], memory.used [MiB], memory.free [MiB]
5 %, 52 %, 15 %, 37.49 W, 15360 MiB, 174 MiB, 14756 MiB
Keep in mind the encoder and decoder are, once again, separate pieces of hardware, so it's not uncommon to see GPU utilization sit near 0%.
Helpful flags for FFmpeg:
- -loglevel error will only show error messages instead of everything like stream info, duration, speed, and so on
- -hide_banner hides the banner that displays the FFmpeg version, what it was built with, configuration settings, and so on
- -nostdin disables interaction with the FFmpeg process, useful in scripts
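Combined, a quiet, script-friendly invocation might look like this (file names are hypothetical):

ffmpeg -hide_banner -loglevel error -nostdin -y -i input.mp4 -c copy remuxed.mp4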
Staying within the AWS ecosystem allows the use of their EKS Optimized AMIs, specifically their GPU-accelerated versions.
Simply use this command with your EKS version to grab the correct AMI ID:
export EKS_VERSION=1.26
export REGION=us-east-1
aws ssm get-parameter --name /aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2-gpu/recommended/image_id --region ${REGION} --query "Parameter.Value" --output text
Just note that the CUDA and driver versions on that AMI must be compatible with the FFmpeg version you compile, and it must also run on a GPU-based instance (g4dn, for example).
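One quick compatibility check is to query the driver on the node itself:

nvidia-smi --query-gpu=driver_version --format=csv,noheader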
To make sure that the nodes (EC2 machines) with this GPU-optimized AMI are matched to pods that contain our accelerated FFmpeg, we'll need to set affinities on our nodes and pods, taints on our nodes, and tolerations on our pods.
Affinities affect scheduling preference, meaning nodes that match the key & value of the pod will be favored (this can be set to not schedule at all if no matching nodes are available).
Taints are at the node level, containing a key and value. If these don't match a pod's tolerations, the node will not accept the pod; they're used to repel pods without matching tolerations.
Tolerations are placed at the pod level, same idea here: if they don't match the node's taint, the pod will not be placed on that node.
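To make this concrete, here's a sketch of the pairing, with an assumed label/taint of accelerator=nvidia-t4 (the key, value, and node name are illustrative, not prescribed). First taint and label the GPU node:

kubectl taint nodes <gpu-node> accelerator=nvidia-t4:NoSchedule
kubectl label nodes <gpu-node> accelerator=nvidia-t4

Then give the pod a matching affinity and toleration:

# Sketch pod spec pairing a node affinity with a toleration for the taint above
apiVersion: v1
kind: Pod
metadata:
  name: accelerated-ffmpeg
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator
                operator: In
                values: ["nvidia-t4"]
  tolerations:
    - key: accelerator
      operator: Equal
      value: nvidia-t4
      effect: NoSchedule
  containers:
    - name: ffmpeg
      image: accelerated-ffmpeg:latest
      resources:
        limits:
          nvidia.com/gpu: 1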