Diffusion models have achieved impressive results in generative tasks like text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging due to the complex temporal dependency across frames. Existing reinforcement learning (RL)-based approaches to enhance text alignment often require differentiable reward functions or are constrained to limited prompts, hindering their scalability and applicability. In this paper, we propose Free2Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free2Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
Overall pipeline of Free2Guide, which leverages path integral control to enhance text-video alignment without requiring reward gradients. During the sampling process, Free2Guide generates multiple denoised video samples and evaluates their text alignment using non-differentiable Large Vision-Language Models (LVLMs).
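The pipeline can be summarized as a reward-weighted sampling loop: at each reverse-diffusion step, several stochastic candidates are drawn, each candidate's one-step denoised estimate is scored by a non-differentiable reward, and a candidate is kept with probability proportional to the exponentiated reward, in the spirit of path integral control. The sketch below illustrates this idea; `denoise_step`, `predict_x0`, `decode_frames`, and `reward_fn` are hypothetical placeholders for the actual video diffusion model components and reward model, not the paper's exact implementation.

```python
import torch

def guided_sampling_step(x_t, t, denoise_step, predict_x0, decode_frames,
                         reward_fn, num_candidates=4, temperature=0.1):
    """One reverse-diffusion step with gradient-free, reward-weighted selection.

    Minimal sketch of path-integral-style guidance: draw several stochastic
    candidates for x_{t-1}, score each candidate's one-step clean estimate with
    a (possibly black-box) reward, and resample a candidate with probability
    proportional to exp(reward / temperature). No reward gradients are used.
    """
    candidates, rewards = [], []
    for _ in range(num_candidates):
        x_prev = denoise_step(x_t, t)              # stochastic DDPM-style step
        x0_hat = predict_x0(x_prev, t)             # one-step denoised estimate
        frames = decode_frames(x0_hat)             # latent -> video frames
        rewards.append(float(reward_fn(frames)))   # non-differentiable score
        candidates.append(x_prev)

    # Reward-weighted (softmax) selection over the candidate set.
    weights = torch.softmax(torch.tensor(rewards) / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=1).item()
    return candidates[idx]
```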
Image-based models are inherently limited in processing time-dependent features like motion, flow, and dynamics, making them unsuitable for providing high-quality guidance on temporal dynamics. In contrast, although LVLMs are trained on static image-text data, their extensive pretraining on diverse visual contexts allows them to effectively capture motion elements, yielding superior temporal guidance.
Since our framework does not require a differentiable reward model, we can utilize powerful black-box LVLMs that are capable of capturing temporal information for video alignment. These LVLMs serve as reward models, leveraging their temporal awareness to assess text-video alignment.
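Because only a scalar score is needed, the reward model can be any black-box LVLM queried through an API. The sketch below shows one way such a reward could be wrapped; `lvlm_client` and its `query` method are hypothetical stand-ins for whatever vision-language model interface is available, and the scoring prompt is an illustrative example rather than the paper's exact prompt.

```python
import re

def lvlm_alignment_reward(frames, prompt, lvlm_client, num_keyframes=4):
    """Score text-video alignment with a black-box LVLM (no gradients needed).

    Sketch under assumptions: `lvlm_client.query(images=..., text=...)` is a
    hypothetical wrapper that sends frames plus a question to an LVLM and
    returns its free-form answer as a string.
    """
    # Subsample a few keyframes so the query stays compact.
    step = max(1, len(frames) // num_keyframes)
    keyframes = frames[::step][:num_keyframes]

    question = (
        f"Rate from 0 to 100 how well these video frames match the prompt: "
        f"'{prompt}'. Reply with a single number."
    )
    answer = lvlm_client.query(images=keyframes, text=question)

    # Parse the first number in the answer and normalize it to [0, 1].
    match = re.search(r"\d+(\.\d+)?", answer)
    return float(match.group()) / 100.0 if match else 0.0
```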
Leveraging LVLMs' ability to model temporal dynamics, our framework can ensemble their guidance with large-scale image-based reward models, whose complementary signals further enhance text-video alignment. Unlike gradient-based guidance, our method avoids computationally intensive backpropagation and thus significantly reduces memory requirements. This enables us to concurrently employ multiple rewards for sampling guidance, yielding synergistic benefits.
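Since no reward gradients are ever computed, ensembling reduces to evaluating each reward and mixing the scores. A minimal sketch, assuming each entry of `reward_fns` is a callable mapping (frames, prompt) to a scalar in [0, 1]:

```python
def ensemble_reward(frames, prompt, reward_fns, weights=None):
    """Combine several non-differentiable rewards into one guidance signal.

    Because no backpropagation is involved, a temporal LVLM reward and
    large-scale image-based rewards (e.g., a per-frame image-text similarity
    averaged over frames) can simply be evaluated and mixed with scalar weights.
    """
    weights = weights or [1.0 / len(reward_fns)] * len(reward_fns)
    scores = [fn(frames, prompt) for fn in reward_fns]
    return sum(w * s for w, s in zip(weights, scores))
```

For instance, the LVLM reward above could be paired with a frame-averaged image-text similarity score and weighted to favor the temporal signal; the combined scalar then plugs directly into the reward-weighted sampling step as `reward_fn`.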