Free2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

KAIST AI
Code (coming soon)

Abstract

Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free2Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free2Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.

Method

fig_pipeline

Overall pipeline of Free2Guide, which leverages path integral control to enhance text-video alignment without requiring reward gradients. During the sampling process, Free2Guide generates multiple denoised video samples and evaluates their text alignment using non-differentiable Large Vision-Language Models (LVLMs).
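To make this procedure concrete, the following is a minimal sketch of one guided sampling step, assuming a generic stochastic video diffusion sampler. The names `denoise_step`, `predict_x0`, and `reward_fn` are placeholders for the sampler update, the denoised-sample (x0) prediction, and a black-box reward model, and the number of candidates and the weighting temperature are illustrative hyperparameters rather than the paper's exact settings.

```python
import torch

def guided_denoise_step(x_t, t, denoise_step, predict_x0, reward_fn,
                        num_candidates=4, temperature=0.1):
    """One gradient-free guidance step in the spirit of path integral control.

    Draw several stochastic candidates for the next latent, score the
    corresponding denoised video estimates with a non-differentiable reward,
    and recombine the candidates with exponentially tilted weights.
    """
    candidates, rewards = [], []
    for _ in range(num_candidates):
        x_prev = denoise_step(x_t, t)          # stochastic sampler update
        x0_hat = predict_x0(x_prev, t - 1)     # denoised video estimate
        candidates.append(x_prev)
        rewards.append(reward_fn(x0_hat))      # black-box score, no gradients

    stacked = torch.stack(candidates)          # (K, ...)
    weights = torch.softmax(
        torch.tensor(rewards) / temperature, dim=0
    ).to(stacked)                              # exp(r / lambda), normalized
    # Weighted recombination of candidates; selecting the argmax candidate
    # is a simpler alternative to the weighted average.
    return (weights.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(dim=0)
```

Guidance of this kind can be applied at only a subset of denoising steps to keep the number of reward-model queries manageable.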

Qualitative Comparison of Reward Models

LaVie

"Gwen Stacy reading a book, tilt down."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o
"A space shuttle launching into orbit, with flames and smoke ..."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o
"A beautiful coastal beach in spring, waves lapping on sand, pixel art."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o
"A cute fluffy panda eating Chinese food in a restaurant."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o

VideoCrafter2

"A cat on the right of a dog, front view."
GIF 1
Baseline
GIF 2
Baseline + CLIP
GIF 3
Baseline + IR
GIF 4
Baseline + GPT4o
"Cinematic shot of Van Gogh's selfie, Van Gogh style."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o
"Two pandas discussing an academic paper."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o
"A raccoon is playing the electronic guitar."
GIF 4
Baseline
GIF 5
Baseline + CLIP
GIF 6
Baseline + IR
GIF 7
Baseline + GPT4o

Image-based reward models are inherently limited in processing time-dependent features such as motion, flow, and dynamics, making them ill-suited to providing high-quality guidance on temporal behavior. In contrast, although LVLMs are trained on static image-text data, their extensive pretraining on diverse visual contexts allows them to capture motion-related elements across frames, yielding superior temporal guidance.

Qualitative Results

1. LVLMs as Reward Models

Since our framework does not require the reward model to be differentiable, we can employ powerful black-box LVLMs that are capable of capturing temporal information for video alignment. These LVLMs serve as reward models, leveraging their temporal awareness to assess text-video alignment effectively.
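As an illustration, a black-box LVLM can be queried with frames sampled from the denoised video estimate and asked to grade how well they match the prompt. The sketch below uses the OpenAI Python client with GPT4o as one possible backend; the grading instruction and the 0-10 scale are assumptions for illustration, not the paper's exact prompting protocol, and `frames` is assumed to be a list of uint8 RGB arrays.

```python
import base64, io

from openai import OpenAI   # assumes the official OpenAI Python client
from PIL import Image

client = OpenAI()

def lvlm_alignment_score(frames, text_prompt, model="gpt-4o"):
    """Ask an LVLM to rate text-video alignment for a list of RGB frames."""
    content = [{
        "type": "text",
        "text": (f"Rate from 0 to 10 how well these video frames, given in "
                 f"temporal order, match the prompt: '{text_prompt}'. "
                 f"Reply with only the number."),
    }]
    for frame in frames:                      # encode each frame as a data URL
        buf = io.BytesIO()
        Image.fromarray(frame).save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})

    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return float(reply.choices[0].message.content.strip())
```

Because the score is consumed only as a scalar, any LVLM that accepts multiple images could be substituted without changing the sampling loop.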

2. Ensembling with LVLMs

Improving CLIP with GPT4o
Improving ImageReward with GPT4o

Leveraging LVLMs' ability to model temporal dynamics, we can ensemble them with large-scale image-based reward models so that the two sources provide complementary guidance on text-video alignment. Unlike gradient-based guidance, our method avoids computationally intensive backpropagation and therefore has a much smaller memory footprint, which allows multiple reward models to be queried concurrently during sampling and can yield synergistic benefits with large-scale image models.
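A minimal sketch of such an ensemble is given below; it assumes each reward function already returns scores on a comparable (e.g., normalized) scale, and the helper name and equal default weights are illustrative choices rather than the paper's configuration.

```python
def ensembled_reward(x0_hat, text_prompt, reward_fns, weights=None):
    """Combine several non-differentiable reward models into one scalar score.

    Since no backpropagation is involved, an image-text scorer applied per
    frame and an LVLM judging the whole frame sequence can be queried side
    by side and mixed with scalar weights.
    """
    if weights is None:
        weights = [1.0 / len(reward_fns)] * len(reward_fns)
    return sum(w * fn(x0_hat, text_prompt) for w, fn in zip(weights, reward_fns))
```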