# vqgan-clip

## VQGAN+CLIP methods

Generates images from text prompts with VQGAN and CLIP (codebook sampling method).

Codebook sampling optimizes a grid of independent categorical distributions over VQGAN codes, parameterized by logits, by gradient descent to maximize the decoded image's similarity to the CLIP prompt.
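
A minimal sketch of this objective in PyTorch, using the OpenAI `clip` package. The codebook size, the 16x16 latent grid, and the `decode` stand-in are assumptions in place of a real VQGAN, and CLIP's input normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, _ = clip.load('ViT-B/32', device=device, jit=False)
clip_model = clip_model.eval().float().requires_grad_(False)
with torch.no_grad():
    tokens = clip.tokenize(['a watercolor painting of a fox']).to(device)
    text_embed = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)

n_codes, embed_dim, grid = 1024, 256, 16                     # assumed VQGAN codebook size and latent grid
codebook = torch.randn(n_codes, embed_dim, device=device)    # stand-in for the VQGAN codebook embeddings
logits = torch.zeros(grid * grid, n_codes, device=device, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)

def decode(z):
    # stand-in for the VQGAN decoder: maps a [1, embed_dim, grid, grid] latent to an RGB image in [0, 1]
    return torch.sigmoid(z.mean(dim=1, keepdim=True)).repeat(1, 3, 1, 1)

for step in range(500):
    probs = F.softmax(logits, dim=-1)                            # one categorical distribution per grid cell
    z = (probs @ codebook).T.reshape(1, embed_dim, grid, grid)   # expected codebook vector per cell
    image = F.interpolate(decode(z), size=224, mode='bilinear')  # resize to CLIP's input resolution
    image_embed = F.normalize(clip_model.encode_image(image).float(), dim=-1)
    loss = (1 - image_embed @ text_embed.T).mean()               # maximize similarity to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```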

Generates images from text prompts with VQGAN and CLIP (z+quantize method).
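
The notebook's exact implementation is not reproduced here, but a common way to realize a z+quantize style update is to optimize a continuous latent grid and snap it to the nearest codebook vectors with a straight-through gradient before decoding. A minimal sketch, with stand-ins for the VQGAN decoder and the CLIP loss:

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    # z: [N, D], codebook: [K, D]; snap each row to its nearest codebook vector,
    # passing gradients straight through to the unquantized z
    d = torch.cdist(z, codebook)
    z_q = codebook[d.argmin(dim=-1)]
    return z + (z_q - z).detach()

embed_dim, n_codes, grid = 256, 1024, 16        # assumed VQGAN codebook size and latent grid
codebook = torch.randn(n_codes, embed_dim)      # stand-in for the VQGAN codebook embeddings
z = torch.randn(grid * grid, embed_dim, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

def clip_similarity_loss(image):
    # stand-in for the CLIP prompt-similarity loss shown in the codebook sampling sketch above
    return image.mean()

for step in range(100):
    z_q = vector_quantize(z, codebook).T.reshape(1, embed_dim, grid, grid)
    image = torch.sigmoid(z_q.mean(dim=1, keepdim=True))   # stand-in for the VQGAN decoder
    loss = clip_similarity_loss(image)
    opt.zero_grad()
    loss.backward()
    opt.step()
```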

Generates images from text prompts with VQGAN and CLIP (z+quantize method) regularized with MSE.

Zero-shot semantic style transfer

## Non-VQGAN CLIP methods

Generates images from text prompts with the OpenAI discrete VAE and CLIP.

Codebook sampling optimizes a grid of independent categorical distributions over OpenAI discrete VAE codes, parameterized by logits, by gradient descent to maximize the decoded image's similarity to the CLIP prompt.
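
The decode step differs from the VQGAN sketch above in that the per-cell probabilities can be fed to the discrete VAE decoder directly as a relaxed one-hot. A minimal sketch of that step using the `dall_e` package; the decoder URL is OpenAI's published checkpoint, and the 32x32 grid of 8192-way codes matches the released model.

```python
import torch
import torch.nn.functional as F
from dall_e import load_model, unmap_pixels

device = 'cuda' if torch.cuda.is_available() else 'cpu'
dec = load_model('https://cdn.openai.com/dall-e/decoder.pkl', device)

# a 32x32 grid of independent 8192-way categorical distributions, parameterized by logits
logits = torch.zeros(1, 8192, 32, 32, device=device, requires_grad=True)
probs = F.softmax(logits, dim=1)                      # relaxed one-hot codes, differentiable w.r.t. logits
x_stats = dec(probs).float()
image = unmap_pixels(torch.sigmoid(x_stats[:, :3]))   # [1, 3, 256, 256] RGB in [0, 1], ready for CLIP scoring
```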

Generates images from text prompts with CLIP guided diffusion (256x256 output size).

CLIP guided diffusion samples from the diffusion model conditional on the output image being near the target CLIP embedding. Since CLIP is not conditioned on the noise level, this notebook applies a Gaussian blur with a timestep-dependent radius to the current timestep's output before scoring it with CLIP.
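
A minimal sketch of a guidance function in this style; the `cond_fn` interface, the blur schedule, and the guidance scale are assumptions rather than the notebook's exact code.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import functional as TF
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, _ = clip.load('ViT-B/32', device=device, jit=False)
clip_model = clip_model.eval().float().requires_grad_(False)
with torch.no_grad():
    tokens = clip.tokenize(['a fantasy landscape']).to(device)
    text_embed = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)

total_steps, guidance_scale = 1000, 1000.

def cond_fn(x, t):
    # x: current noisy sample in [-1, 1]; t: integer timestep (larger = noisier)
    with torch.enable_grad():
        x = x.detach().requires_grad_()
        sigma = 0.1 + 3.0 * t / total_steps                      # assumed schedule: more blur at noisier steps
        blurred = TF.gaussian_blur(x, kernel_size=15, sigma=sigma)
        image = F.interpolate((blurred + 1) / 2, size=224, mode='bilinear')
        image_embed = F.normalize(clip_model.encode_image(image).float(), dim=-1)
        sim = (image_embed @ text_embed.T).sum()                 # CLIP similarity to the prompt
        return torch.autograd.grad(sim, x)[0] * guidance_scale   # gradient nudges the reverse diffusion step
```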

Generates images from text prompts with CLIP guided diffusion (256x256 output size).

CLIP guided diffusion samples from the diffusion model conditional on the output image being near the target CLIP embedding. Since CLIP is not conditioned on the noise level, this notebook instead obtains a denoised prediction of the final timestep and scores that with CLIP.
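
A minimal sketch of that variant: the model's predicted noise is converted into a denoised estimate of the final image via the cumulative alpha schedule, and the CLIP gradient is taken through that estimate. The names and interfaces follow common DDPM conventions and are assumptions, not the notebook's code.

```python
import torch
import torch.nn.functional as F

def clip_score(image, clip_model, text_embed):
    # image in [-1, 1]; returns summed CLIP similarity to the prompt embedding
    image = F.interpolate((image + 1) / 2, size=224, mode='bilinear')
    image_embed = F.normalize(clip_model.encode_image(image).float(), dim=-1)
    return (image_embed @ text_embed.T).sum()

def make_cond_fn(model, alphas_cumprod, clip_model, text_embed, guidance_scale=1000.):
    def cond_fn(x, t):
        with torch.enable_grad():
            x = x.detach().requires_grad_()
            eps = model(x, t)                                         # predicted noise at this timestep
            a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
            pred_x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # denoised estimate of the final image
            sim = clip_score(pred_x0, clip_model, text_embed)
            return torch.autograd.grad(sim, x)[0] * guidance_scale
    return cond_fn
```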

Generates images from text prompts with CLIP guided diffusion (512x512 output size).

CLIP guided diffusion samples from the diffusion model conditional on the output image being near the target CLIP embedding. Since CLIP is not conditioned on the noise level, this notebook obtains a denoised prediction of the final timestep and scores that with CLIP. It uses a class-conditional diffusion model, and the class input is randomized on each timestep.
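
A minimal sketch of the class-randomization trick, assuming a guided-diffusion style model that receives class labels through a `y` keyword argument; the class count and batch size are placeholders.

```python
import torch

num_classes, batch_size = 1000, 1  # placeholder values

def random_class_kwargs(device):
    # draw a fresh random class label on every timestep so no single class dominates the sample
    return {'y': torch.randint(0, num_classes, (batch_size,), device=device)}
```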

Generates images from text prompts with CLIP guided diffusion (512x512 output size).

CLIP guided diffusion samples from the diffusion model conditional on the output image being near the target CLIP embedding. Since CLIP is not conditioned on the noise level, this notebook obtains a denoised prediction of the final timestep and scores that with CLIP. It uses an unconditional diffusion model fine-tuned from the released 512x512 class-conditional diffusion model on the same training set but without class labels.

Generates a mask from an image by taking a pixel-wise average over random crops scored by CLIP; in other words, a Monte Carlo method. It needs to be calibrated for optimal results and is also used in CLIP Semantic Segmentation.
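
A minimal sketch of the Monte Carlo estimate: many random crops are scored by CLIP against a prompt, and each crop's score is averaged into the pixels it covers. The crop sizes, sample count, and any final calibration are assumptions, and CLIP's input normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, _ = clip.load('ViT-B/32', device=device, jit=False)
clip_model = clip_model.eval().float().requires_grad_(False)

def clip_mask(image, prompt, n_samples=512, min_size=64, max_size=224):
    # image: [3, H, W] in [0, 1]; returns an [H, W] heatmap of average CLIP similarity to the prompt
    _, H, W = image.shape
    with torch.no_grad():
        text_embed = F.normalize(clip_model.encode_text(clip.tokenize([prompt]).to(device)).float(), dim=-1)
    scores = torch.zeros(H, W, device=device)
    counts = torch.zeros(H, W, device=device)
    for _ in range(n_samples):
        size = int(torch.randint(min_size, min(max_size, H, W) + 1, ()).item())
        top = int(torch.randint(0, H - size + 1, ()).item())
        left = int(torch.randint(0, W - size + 1, ()).item())
        crop = image[:, top:top + size, left:left + size].unsqueeze(0).to(device)
        crop = F.interpolate(crop, size=224, mode='bilinear')
        with torch.no_grad():
            crop_embed = F.normalize(clip_model.encode_image(crop).float(), dim=-1)
        sim = (crop_embed @ text_embed.T).item()
        scores[top:top + size, left:left + size] += sim    # spread the crop's score over its pixels
        counts[top:top + size, left:left + size] += 1
    return scores / counts.clamp(min=1)                    # pixel-wise Monte Carlo average
```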

Generates images from text prompts with a CLIP conditioned Decision Transformer.

This model outputs logits for the next VQGAN token conditioned on a CLIP embedding, a target CLIP score, and the sequence of past VQGAN tokens (possibly of length zero). It can be used to sample images conditioned on CLIP text prompts.
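
A minimal sketch of how such a model could be sampled autoregressively; the `model(clip_embed, target_score, tokens)` interface and the 256-token grid are assumptions about the checkpoint, not its actual API.

```python
import torch

@torch.no_grad()
def sample_tokens(model, clip_embed, target_score, seq_len=256, temperature=1.0):
    # autoregressively sample a grid of VQGAN codes, one token at a time
    tokens = torch.zeros(1, 0, dtype=torch.long, device=clip_embed.device)   # start from an empty prefix
    for _ in range(seq_len):
        logits = model(clip_embed, target_score, tokens)       # logits over the next VQGAN token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens   # decode with the VQGAN decoder to obtain the image
```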