Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model
Authors: Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto
Abstract: In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may completely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D generation from the same text prompt.
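To make the hybrid optimization idea concrete, below is a minimal sketch (not the authors' released code) of an objective that combines an SDS-style gradient from a multi-view diffusion prior with an L2 loss against a few text-aligned reference views. All names here (`sds_gradient`, `hybrid_step`, `lambda_ref`) are hypothetical placeholders chosen for illustration, and the diffusion model's noise prediction is stubbed with random tensors.

```python
# Hypothetical sketch of a hybrid SDS + sparse-reference objective; the noise
# prediction would normally come from a pre-trained multi-view diffusion model.
import torch
import torch.nn.functional as F


def sds_gradient(noise_pred: torch.Tensor, noise: torch.Tensor,
                 weight: float = 1.0) -> torch.Tensor:
    """SDS-style update direction (eps_hat - eps), detached so gradients flow
    only into the 3D representation, not the diffusion model."""
    return weight * (noise_pred - noise).detach()


def hybrid_step(rendered_views: torch.Tensor,   # (V, 3, H, W) renders of the 3D asset
                reference_views: torch.Tensor,  # (V, 3, H, W) sparse text-aligned RGB refs
                noise_pred: torch.Tensor,       # diffusion model's noise estimate
                noise: torch.Tensor,            # noise actually added to the renders
                lambda_ref: float = 0.5) -> torch.Tensor:
    # SDS term: a surrogate loss whose gradient w.r.t. the renders equals the
    # detached SDS update direction.
    grad = sds_gradient(noise_pred, noise)
    loss_sds = (grad * rendered_views).sum()
    # Reference term: anchor appearance/geometry to the sparse reference images.
    loss_ref = F.mse_loss(rendered_views, reference_views)
    return loss_sds + lambda_ref * loss_ref


if __name__ == "__main__":
    V, H, W = 4, 64, 64
    renders = torch.rand(V, 3, H, W, requires_grad=True)
    refs = torch.rand(V, 3, H, W)
    noise = torch.randn(V, 3, H, W)
    noise_pred = torch.randn(V, 3, H, W)  # placeholder for the diffusion model output
    loss = hybrid_step(renders, refs, noise_pred, noise)
    loss.backward()
    print(loss.item(), renders.grad.shape)
```

The weighting `lambda_ref` between the two terms is an assumed knob, not a value from the paper; in practice it would trade off fidelity to the 4-view reference images against the generative prior.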