SD3 Medium was released on June 12th, 2024. Like everyone else, we gained access to the model on the same day. From then on, it was a race to deploy the model to Draw Things users on iPhone, iPad, and Mac. In this post, I'll outline the tools we used, the lessons we learned, and the unique optimizations we applied to ensure best-in-class performance across a broad range of Apple devices.
Over the past year, we've significantly streamlined our model conversion workflow. What used to take weeks with Stable Diffusion 1.4 now takes about a day. For example, we implemented our FP16 version of SD3 Medium on June 13th, 24 hours after the release.
To deploy cutting-edge image/text generative models to local devices, we use Swift implementations that compile natively on these platforms. This involves translating Python code, typically written in PyTorch, into Swift. We begin by setting up the correct Python environment, creating minimal viable inference code to correctly call the model, inspecting the result, and then implementing the Swift code.
PythonKit has been essential for our conversion work, allowing us to run Python reference code directly alongside our Swift reimplementation. The first-class support of s4nnc on CUDA systems also allows us to run our Swift reimplementation on Linux systems with CUDA, which is often the most hassle-free environment for running PyTorch inference code.
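As a simplified illustration of that workflow (using a plain torch.nn.LayerNorm as a stand-in for the real reference model), here is how a PyTorch reference output can be pulled into Swift for comparison:

```swift
import PythonKit

// A minimal sketch of the comparison harness, with torch.nn.LayerNorm standing
// in for the reference model. The point is the workflow: run the PyTorch
// reference from Swift, then pull its output into Swift arrays.
let torch = Python.import("torch")

torch.manual_seed(42)
let reference = torch.nn.LayerNorm(64)
let x = torch.randn([1, 4, 64])
let y = reference(x)

// Convert the PyTorch output into a flat Swift array for element-wise checks.
let expected = [Float](y.detach().numpy().flatten().tolist())!

// `actual` would come from the Swift reimplementation fed the same input; we
// then check the maximum absolute difference against an FP16-level tolerance:
// assert(zip(expected, actual).allSatisfy { abs($0 - $1) < 1e-4 })
```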
Our reimplementation typically involves rewriting the PyTorch model into a more declarative Swift model and comparing outputs layer by layer. This is particularly straightforward with transformer models, where every layer follows the same architecture.
Our implementation: https://github.com/liuliu/swift-diffusion/blob/main/examples/sd3/main.swift#L502-L661
SD3 Ref: https://github.com/Stability-AI/sd3-ref/blob/master/mmdit.py#L11-L619
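For a flavor of what that declarative style looks like, here is a simplified feed-forward block in the spirit of the linked implementation (names and sizes are illustrative, not the real MM-DiT block):

```swift
import NNC

// Sketch of a transformer feed-forward block in s4nnc's declarative style,
// simplified relative to the real MM-DiT blocks in the linked code.
func FeedForward(hiddenSize: Int, intermediateSize: Int) -> Model {
  let x = Input()
  let fc1 = Dense(count: intermediateSize)
  let fc2 = Dense(count: hiddenSize)
  let out = fc2(GELU()(fc1(x)))
  return Model([x], [out])
}

let graph = DynamicGraph()
let x = graph.variable(.CPU, .NC(2, 64), of: Float.self)
x.randn(std: 1, mean: 0)
let ff = FeedForward(hiddenSize: 64, intermediateSize: 256)
ff.compile(inputs: x)
// This output is what we diff, layer by layer, against the PyTorch reference.
let y = ff(inputs: x)[0].as(of: Float.self)
```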
Deploying large models to local devices often requires weight quantization. For image generative models, we carefully balance quality and size trade-offs. With Draw Things, we ensure all our quantized models are practically "lossless." We focus on sensible reductions that maintain compatibility across a wide range of devices rather than pushing for the smallest possible model size.
Currently, s4nnc supports limited quantization options, including 4-bit, 6-bit, and 8-bit block palettization as our main schemes. For diffusion models, we use the mean squared error of the final image between the quantized and non-quantized models to guide our decisions. We selected 8-bit quantization for SD3 Medium and 6-bit for the T5 encoder.
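In simplified form, the selection criterion looks like the sketch below. The decode step is elided; decode(scheme:) and the Scheme names are placeholders, not our production API:

```swift
// Simplified sketch of the selection criterion: generate the same prompt and
// seed with each quantization scheme, then compare the final decoded images
// against the non-quantized baseline by mean squared error.
func meanSquaredError(_ a: [Float], _ b: [Float]) -> Float {
  precondition(a.count == b.count)
  var sum: Float = 0
  for (x, y) in zip(a, b) {
    let d = x - y
    sum += d * d
  }
  return sum / Float(a.count)
}

// Hypothetical usage, with placeholder names:
// let reference = decode(scheme: .fp16)      // non-quantized baseline
// for scheme in [Scheme.q8p, .q6p, .q4p] {   // 8/6/4-bit palettized candidates
//   print(scheme, meanSquaredError(decode(scheme: scheme), reference))
// }
```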
Unlike the UNet in SDXL/SD v1.5, SD3 Medium uses plain transformer blocks, which limits optimization opportunities, especially regarding FLOPs. However, we managed to split the model to reduce peak RAM usage during the diffusion sampling process to roughly 2.2 GiB for the quantized model (around 3.3 GiB for the non-quantized model).
This is possible by observing that while the adaptive layer norm blocks are minimal in FLOPs, they have a high parameter count, around 670M. Since the input to the adaptive layer norm includes timestep conditioning, we cannot reduce the FLOP computation. However, since there are no dependencies on intermediate model activations, we can batch the adaptive layer norm computation for every timestep at the beginning of diffusion sampling all at once, converting matrix-vector multiplication into matrix-matrix multiplication, which is slightly more efficient. Once these modulations are precomputed, the adaptive layer norm weights no longer need to stay resident during the sampling loop, which is what enables the RAM reduction above.
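A sketch of the idea in s4nnc terms (the sizes are illustrative, and the real model has one such projection per transformer block rather than a single Dense layer):

```swift
import NNC

// The AdaLN modulation is a linear projection of the per-timestep conditioning
// vector. Because it depends on no intermediate activations, we can stack the
// conditioning vectors for all sampling steps into one (steps x dim) matrix and
// run a single matrix-matrix multiplication up front, instead of one
// matrix-vector product per step.
let graph = DynamicGraph()
let steps = 28    // number of sampling steps (illustrative)
let dim = 1536    // conditioning width (illustrative)

let adaLNProjection = Dense(count: 6 * dim)  // shift/scale/gate for attention and MLP

let c = graph.variable(.CPU, .NC(steps, dim), of: Float.self)
c.randn(std: 1, mean: 0)  // stands in for the real timestep + pooled-text embeddings
adaLNProjection.compile(inputs: c)
// One GEMM for all steps; row i is sliced out later at sampling step i.
let modulations = adaLNProjection(inputs: c)[0].as(of: Float.self)
```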
Thanks to these optimizations, we implemented the fastest SD3 Medium inference on macOS, iOS, and iPadOS systems with minimal RAM usage, and successfully shipped it to real users within a practical app.
Our implementations can provide useful feedback into the training process. Moving forward, we aim to conduct more analysis and ablation studies to explore:
1. Optimal parameter count distribution for adaptive layer norm: could we allocate fewer parameters here, and more to the MLP/QKV projections?
2. Evaluating more quantization schemes to identify per-layer improvements, and establishing an unbiased prompt dataset for future data-free fine-tuning.
3. Leveraging torch.compile to rewrite PyTorch models in Swift, all from within Swift using PythonKit.
We are excited to continue our research and share our development work in the future.