Diffusion models have achieved impressive results in generative tasks such as text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches for enhancing text alignment often require differentiable reward functions trained on videos, hindering their scalability and applicability. In this paper, we propose Free2Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. Specifically, leveraging principles from path integral control, Free2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. To enable image-trained LVLMs to assess text-to-video alignment, we leverage \textit{stitching} of video frames and use system prompts to capture their sequential attributes. Our framework supports flexible ensembling of multiple reward models to synergistically enhance alignment without significant computational overhead. Experimental results confirm that Free2Guide with image-trained LVLMs significantly improves text-to-video alignment, thereby enhancing overall video quality.
Overall pipeline of Free2Guide, which leverages path integral control to enhance text-video alignment without requiring reward gradients. During sampling, Free2Guide generates multiple denoised video samples and evaluates their text alignment using non-differentiable Large Vision-Language Models (LVLMs).
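The gradient-free selection step described above can be sketched as follows. This is only an illustrative approximation, not the paper's exact formulation: the function name, the softmax weighting, and the scalar reward interface are our assumptions.

```python
import math
import random

def select_candidate(candidates, reward_fn, temperature=1.0):
    """Path-integral-style guidance sketch: score several denoised
    candidates with a non-differentiable reward and sample one in
    proportion to its exponentiated reward (no gradients needed)."""
    scores = [reward_fn(c) for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # low temperature -> nearly greedy; high temperature -> nearly uniform
    return random.choices(candidates, weights=probs, k=1)[0]
```

With a very low temperature this reduces to picking the highest-reward candidate, which mirrors why no backpropagation through the reward model is ever required.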
Image-based models are inherently limited in processing time-dependent features such as motion, flow, and dynamics, making them unsuitable for providing high-quality guidance on temporal dynamics. In contrast, although LVLMs are trained on static image-text data, their extensive pretraining on diverse visual contexts allows them to capture motion cues effectively and thus provide superior temporal guidance.
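To let an image-trained LVLM see a temporal sequence at a glance, frames can be stitched into a single grid image. A minimal sketch of this stitching, assuming frames arrive as equal-sized `(H, W, C)` arrays (the function name and grid layout are our illustrative choices):

```python
import numpy as np

def stitch_frames(frames, cols=4):
    """Tile sampled video frames into one grid image, row by row,
    so an image-trained LVLM can assess the frame sequence jointly."""
    h, w, c = frames[0].shape
    rows = -(-len(frames) // cols)  # ceiling division
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)  # left-to-right, top-to-bottom order
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid
```

A system prompt would then tell the LVLM that the tiles are ordered left-to-right, top-to-bottom, so it can reason about sequential attributes such as motion.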
Since our framework does not require the reward model to be differentiable, we can employ powerful black-box LVLMs, which are capable of capturing temporal information, as reward models, exploiting their temporal awareness to assess text-video alignment.
Because LVLMs can model temporal dynamics, their guidance can be ensembled with that of large-scale image models, so that complementary signals jointly enhance text-video alignment. Unlike gradient-based guidance, our method avoids computationally intensive backpropagation and thus significantly reduces memory requirements. This allows us to employ multiple reward models concurrently during sampling, yielding synergistic benefits when combined with large-scale image models.
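One simple way such an ensemble could combine non-differentiable rewards is to normalize each model's scores across the candidate set and take a weighted sum; this is a hypothetical sketch of that idea, not the paper's stated aggregation rule.

```python
def ensemble_scores(score_lists, weights=None):
    """Combine per-candidate scores from several non-differentiable
    reward models: z-normalize each model's scores across candidates
    (so differently scaled rewards are comparable), then weight-sum.
    No backpropagation is involved, so rewards stack cheaply."""
    n_models = len(score_lists)
    weights = weights or [1.0 / n_models] * n_models
    combined = [0.0] * len(score_lists[0])
    for w, scores in zip(weights, score_lists):
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        std = var ** 0.5 or 1.0  # guard against a constant-score model
        for i, s in enumerate(scores):
            combined[i] += w * (s - mean) / std
    return combined
```

The combined scores can then drive the same gradient-free selection step used for a single reward, which is why adding reward models incurs only the cost of extra forward evaluations.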