Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model
Authors: Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto
Abstract: In this paper, we propose an efficient two-stage method named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to respect compositional text prompts and may entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D content from the same text prompt.
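To make the hybrid optimization idea concrete, below is a minimal PyTorch-style sketch of how an SDS guidance term can be combined with an RGB reconstruction term against a few fixed reference views. It is an illustration under stated assumptions, not the authors' implementation: the `nerf.render`, `diffusion.sds_gradient`, and surrounding object APIs are hypothetical placeholders.

```python
# Hedged sketch: one optimization step mixing sparse-view RGB supervision
# (from the stage-one, text-aligned 4-view images) with SDS guidance from a
# pretrained multi-view diffusion model. All object APIs below are hypothetical.
import torch
import torch.nn.functional as F

def hybrid_step(nerf, optimizer, diffusion, text_embed,
                ref_cameras, ref_images, rand_cameras,
                lambda_rgb=1.0, lambda_sds=1.0):
    optimizer.zero_grad()

    # (1) Reconstruction term: re-render the reference camera poses and match
    #     the sparse RGB reference images produced in stage one.
    pred_ref = nerf.render(ref_cameras)          # hypothetical renderer, (4, H, W, 3)
    loss_rgb = F.mse_loss(pred_ref, ref_images)

    # (2) SDS term: render novel viewpoints and inject the score-distillation
    #     gradient from the pretrained multi-view diffusion model.
    pred_rand = nerf.render(rand_cameras)
    grad = diffusion.sds_gradient(pred_rand, text_embed)  # hypothetical call
    # Common SDS formulation: attach the precomputed gradient so that
    # d(loss_sds)/d(pred_rand) equals grad.
    loss_sds = (pred_rand * grad.detach()).sum()

    loss = lambda_rgb * loss_rgb + lambda_sds * loss_sds
    loss.backward()
    optimizer.step()
    return loss_rgb.item()
```

In such a setup, the reconstruction term anchors the 3D asset to the compositionally correct reference views, while the SDS term fills in unseen viewpoints; the relative weights (`lambda_rgb`, `lambda_sds`) would need to be tuned for a specific pipeline.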