Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependency across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free2Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free2Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
Overall pipeline of Free2Guide, which leverages path integral control to enhance text-video alignment without requiring reward gradients. During the sampling process, Free2Guide generates multiple denoised video samples and evaluates their text alignment using non-differentiable Large Vision-Language Models (LVLMs).
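The pipeline can be summarized as a reward-weighted sampling loop: at each reverse-diffusion step, several stochastic candidates are drawn, each candidate's one-step denoised estimate is scored by a non-differentiable reward, and a candidate is kept with probability proportional to the exponentiated reward, in the spirit of path integral control. The sketch below illustrates this idea; `denoise_step`, `predict_x0`, `decode_frames`, and `reward_fn` are hypothetical placeholders for the actual video diffusion model components and reward model, not the paper's exact implementation.

```python
import torch

def guided_sampling_step(x_t, t, denoise_step, predict_x0, decode_frames,
                         reward_fn, num_candidates=4, temperature=0.1):
    """One reverse-diffusion step with gradient-free, reward-weighted selection.

    Minimal sketch of path-integral-style guidance: draw several stochastic
    candidates for x_{t-1}, score each candidate's one-step clean estimate with
    a (possibly black-box) reward, and resample a candidate with probability
    proportional to exp(reward / temperature). No reward gradients are used.
    """
    candidates, rewards = [], []
    for _ in range(num_candidates):
        x_prev = denoise_step(x_t, t)              # stochastic DDPM-style step
        x0_hat = predict_x0(x_prev, t)             # one-step denoised estimate
        frames = decode_frames(x0_hat)             # latent -> video frames
        rewards.append(float(reward_fn(frames)))   # non-differentiable score
        candidates.append(x_prev)

    # Reward-weighted (softmax) selection over the candidate set.
    weights = torch.softmax(torch.tensor(rewards) / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=1).item()
    return candidates[idx]
```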
Image-based models are inherently limited in processing time-dependent features like motion, flow, and dynamics, making them unsuitable for providing high-quality guidance on temporal dynamics. In contrast, although LVLMs are trained on static image-text data, their extensive pretraining on diverse visual contexts allows them to effectively capture motion elements, yielding superior temporal guidance.
Since our framework does not require a differentiable reward model, we can utilize powerful black-box LVLMs that are capable of capturing temporal information for video alignment. These LVLMs serve as reward models, leveraging their temporal awareness to assess text-video alignment.
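Because only a scalar score is needed, the reward model can be any black-box LVLM queried through an API. The sketch below shows one way such a reward could be wrapped; `lvlm_client` and its `query` method are hypothetical stand-ins for whatever vision-language model interface is available, and the scoring prompt is an illustrative example rather than the paper's exact prompt.

```python
import re

def lvlm_alignment_reward(frames, prompt, lvlm_client, num_keyframes=4):
    """Score text-video alignment with a black-box LVLM (no gradients needed).

    Sketch under assumptions: `lvlm_client.query(images=..., text=...)` is a
    hypothetical wrapper that sends frames plus a question to an LVLM and
    returns its free-form answer as a string.
    """
    # Subsample a few keyframes so the query stays compact.
    step = max(1, len(frames) // num_keyframes)
    keyframes = frames[::step][:num_keyframes]

    question = (
        f"Rate from 0 to 100 how well these video frames match the prompt: "
        f"'{prompt}'. Reply with a single number."
    )
    answer = lvlm_client.query(images=keyframes, text=question)

    # Parse the first number in the answer and normalize it to [0, 1].
    match = re.search(r"\d+(\.\d+)?", answer)
    return float(match.group()) / 100.0 if match else 0.0
```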
Leveraging LVLMs' ability to model temporal dynamics, our framework can ensemble their guidance with large-scale image-based reward models, whose complementary signals further enhance text-video alignment. Unlike gradient-based guidance, our method avoids computationally intensive backpropagation and thus significantly reduces memory requirements. This enables us to concurrently employ multiple rewards for sampling guidance, yielding synergistic benefits.
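Since no reward gradients are ever computed, ensembling reduces to evaluating each reward and mixing the scores. A minimal sketch, assuming each entry of `reward_fns` is a callable mapping (frames, prompt) to a scalar in [0, 1]:

```python
def ensemble_reward(frames, prompt, reward_fns, weights=None):
    """Combine several non-differentiable rewards into one guidance signal.

    Because no backpropagation is involved, a temporal LVLM reward and
    large-scale image-based rewards (e.g., a per-frame image-text similarity
    averaged over frames) can simply be evaluated and mixed with scalar weights.
    """
    weights = weights or [1.0 / len(reward_fns)] * len(reward_fns)
    scores = [fn(frames, prompt) for fn in reward_fns]
    return sum(w * s for w, s in zip(weights, scores))
```

For instance, the LVLM reward above could be paired with a frame-averaged image-text similarity score and weighted to favor the temporal signal; the combined scalar then plugs directly into the reward-weighted sampling step as `reward_fn`.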