Rethinking Score Distillation as a Bridge
Between Image Distributions

University of Maryland, College Park   UC Berkeley
TLDR: A unified framework to explain SDS and its variants, plus a new method that is fast & high-quality.


A Better Understanding of SDS

Score distillation sampling [1] enables the use of diffusion priors for tasks operating in data-poor domains. However, it suffers from significant artifacts that a number of follow-up works have tried to address [2, 3, 4, 5, 6]. Here, we show how SDS is used to optimize a 3D NeRF using a text-to-image model.
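As a concrete reference, below is a minimal PyTorch-style sketch of one SDS update. It is a sketch under common assumptions, not the exact implementation: render, eps_model, and prompt_emb are hypothetical stand-ins for a differentiable renderer, a pre-trained denoiser, and an encoded text prompt, and w(t) = 1 - alpha_bar_t is one common weighting choice.

import torch

def sds_step(render, eps_model, prompt_emb, alphas_cumprod, optimizer):
    # Render the current optimized image (e.g., from a NeRF) with gradients enabled.
    x = render()                                     # hypothetical differentiable renderer
    t = torch.randint(20, 980, (1,))                 # sample a random diffusion timestep
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)        # cumulative alpha for timestep t
    noise = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * noise  # forward-diffuse the rendering

    with torch.no_grad():                            # one denoiser call; no backprop through it
        eps_pred = eps_model(x_t, t, prompt_emb)     # hypothetical pre-trained denoiser

    w = 1 - a_t                                      # a common choice of timestep weighting w(t)
    grad = w * (eps_pred - noise)                    # SDS gradient with respect to the image
    loss = (grad * x).sum()                          # surrogate loss: d(loss)/dx equals grad
    optimizer.zero_grad()
    loss.backward()                                  # chain rule carries grad back to the parameters
    optimizer.step()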

We show that SDS (and its variants) can be cast as a Schrödinger Bridge (SB) problem, which aims to find the optimal transport between two distributions. SDS approximates this optimal path between the current optimized image distribution (e.g., renderings from a NeRF) and a target distribution (e.g., text-conditioned natural image distribution).


While SDS tries to model this optimal path, it only approximates it at each iteration, and this approximation ultimately causes its characteristic artifacts.

An SB transports samples between source and target distributions; SDS approximates this optimal path per iteration.
There are two main types of errors in this approximation:

First-Order Approximation Error

Instead of training a model to solve this problem directly, we use pre-trained diffusion models to approximate the bridge. Doing so exactly requires solving two PF-ODEs, which takes dozens of neural function evaluations (NFEs) to estimate the gradient at each iteration. SDS instead uses a single-step estimate, which is more practical but in turn less accurate. Recent works ISM [2] and SDI [3] can be interpreted as reducing this error with a multi-step simulation.
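For intuition, here is a hedged sketch of such a multi-step simulation: a few deterministic DDIM-style steps from the clean image toward x_t, in the spirit of (but not identical to) ISM and SDI. eps_model and alphas_cumprod are the same hypothetical stand-ins as above.

import torch

@torch.no_grad()
def multi_step_noising(x0, eps_model, prompt_emb, alphas_cumprod, t_steps):
    # Deterministically simulate the PF-ODE from the clean image toward x_t,
    # paying one NFE per step instead of using SDS's single-step estimate.
    # t_steps is an increasing list of timesteps, e.g. [0, 200, 400, ..., t].
    x = x0
    for t_cur, t_next in zip(t_steps[:-1], t_steps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, torch.tensor([t_cur]), prompt_emb)     # one NFE
        x0_hat = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()    # predicted clean image
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps    # DDIM-style update to t_next
    return x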

Source Distribution Mismatch Error

Estimating the Schrödinger Bridge relies on \(\epsilon_{\phi, \text{src}}\) accurately estimating the distribution of the current sample, \(x_{\theta}\). SDS (under high CFG) uses the unconditional distribution as a proxy for the current distribution, contributing to its characteristic artifacts. A series of works can be viewed as reducing this error [4, 5, 6].
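The role of the unconditional prediction is easiest to see in code. In this sketch (same hypothetical eps_model as above), the guided prediction at the large scales SDS typically requires is dominated by the s * (eps_cond - eps_uncond) term, so the unconditional branch effectively acts as the source-distribution estimate:

def sds_grad_with_cfg(x_t, t, noise, eps_model, cond_emb, uncond_emb, s=100.0):
    # Classifier-free guidance inside the SDS gradient. At high guidance scale s,
    # the s * (eps_cond - eps_uncond) term dominates, so the unconditional
    # prediction stands in for the source (current-image) distribution.
    eps_cond = eps_model(x_t, t, cond_emb)       # target: text-conditioned score
    eps_uncond = eps_model(x_t, t, uncond_emb)   # proxy source: unconditional score
    eps_cfg = eps_uncond + s * (eps_cond - eps_uncond)
    return eps_cfg - noise                       # approx. s * (eps_cond - eps_uncond) for large s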


A Unified Framework


We show that this unified framework can be applied to a number of different methods: it explains VSD's high-quality results, as well as the shortcomings of other, more efficient methods. Check out our paper for a more thorough analysis.



A Fast but Effective Alternative

We know that pre-trained diffusion models understand the distributions of both high-quality and corrupted images, as well as their correspondence with natural language. So, by simply describing image corruptions with a text prompt, we can better model our original source distribution.


Instead of approximating the current distribution with the unconditional distribution as in SDS, we use this negatively prompted conditional distribution to better model the type of artifacts our optimized image variables might exhibit. This simple change considerably improves results.
We initialize from the SDS result for the prompt "a potato tree" and run our second-stage gradient for 5K iterations.
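A minimal sketch of this change, with the same hypothetical eps_model as above and any timestep weighting w(t) omitted: the source score comes from a conditional prediction on a prompt describing the expected corruptions, rather than from the unconditional prediction.

def bridge_grad(x_t, t, eps_model, target_emb, source_emb):
    # Source (current) distribution is modeled by a negatively prompted
    # conditional prediction, e.g. a prompt naming the artifacts one expects
    # ("oversaturated, blurry, low quality" is an illustrative choice),
    # instead of the unconditional prediction used by SDS.
    eps_target = eps_model(x_t, t, target_emb)   # e.g. "a potato tree, high quality"
    eps_source = eps_model(x_t, t, source_emb)   # negatively prompted source score
    return eps_target - eps_source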

Text-based Image Optimization

We validate our findings by using different score distillation methods to optimize an image. Ours achieves better quality (lower FID) without sacrificing efficiency.

Method               COCO-FID   Time (min)
SDS                  86.02       4.48
NFSD                 91.70       7.20
CSD                  89.96       6.21
VSD                  59.22      16.02
VSD + Full Bridge    55.65      21.46
Ours                 67.89       4.48

Text-based NeRF Optimization




NeRF results for VSD, SDS, and Ours. Prompt: "A pineapple, detailed, high resolution, high quality, sharp."

Painting-to-Real


We examine our method's ability to serve as a general-purpose realism prior. An effective image prior should guide a painting toward a nearby natural image through optimization. We simply append the negative descriptor "painting" to our gradient's source prompt and initialize from the painting.
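Concretely, assuming the hypothetical bridge_grad, eps_model, and alphas_cumprod sketches above, plus a hypothetical text encoder encode_text, an input tensor painting, and its description prompt, painting-to-real only changes the source prompt:

import torch

# Hypothetical painting-to-real loop: initialize the optimized image from the
# painting and describe the source distribution by appending "painting".
x = painting.clone().requires_grad_(True)            # painting: (1, 3, H, W) tensor
target_emb = encode_text(prompt)                     # hypothetical text encoder
source_emb = encode_text(prompt + ", painting")      # the only change for this task
opt = torch.optim.Adam([x], lr=1e-2)

for _ in range(2000):
    t = torch.randint(20, 980, (1,))
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * noise  # forward-diffuse the current image
    with torch.no_grad():
        grad = bridge_grad(x_t, t, eps_model, target_emb, source_emb)
    opt.zero_grad()
    ((1 - a_t) * grad * x).sum().backward()          # same surrogate-loss trick as SDS
    opt.step()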

Acknowledgment

We thank Matthew Tancik, Jiaming Song, Riley Peterlinz, Ayaan Haque, Ethan Weber, Konpat Preechakul, Ruiqi Gao, Amit Kohli and Ben Poole for their helpful feedback and discussion.

This project is supported in part by a Google Research Scholar award and IARPA DOI/IBC No. 140D0423C0035. The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.

BibTeX

@article{mcallister2024rethinking,
    title={Rethinking Score Distillation as a Bridge Between Image Distributions},
    author={David McAllister and Songwei Ge and Jia-Bin Huang and David W. Jacobs and Alexei A. Efros and Aleksander Holynski and Angjoo Kanazawa},
    journal={arXiv preprint arXiv:2406.09417},
    year={2024}
}