Latent Denoising Diffusion Probabilistic Models

Role

Graduate Researcher · Deep Learning Researcher

Skills & Tools

Machine Learning · Deep Learning · Computer Vision · PyTorch · Research · DDPM · DDIM

Work Responsibilities

Graduate Researcher · Deep Learning Researcher
  • Reviewed literature on DDPM, DDIM, VAE, and CFG to analyze trade-offs in image generation quality, inference speed, and controllability; set up the ImageNet-100 benchmark and co-designed standardized preprocessing pipelines.
  • Co-implemented multiple generation pipelines and noise schedulers, and conducted quantitative analysis to evaluate efficiency, cost, and controllability trade-offs in diffusion models.

Abstract

Diffusion models have emerged as a robust class of generative models, excelling in tasks like image synthesis, video generation, and text-to-image transformations. This project implements and evaluates Denoising Diffusion Probabilistic Models (DDPM) and Denoising Diffusion Implicit Models (DDIM), focusing on their ability to generate high-quality images.

Key components include the U-Net architecture for noise prediction, a DDPM noise scheduler for forward and reverse diffusion processes, and enhancements like Variational Autoencoders (VAE) for latent space mapping and Classifier-Free Guidance (CFG) for conditional image generation.

The project evaluates model performance using Fréchet Inception Distance (FID) and Inception Score (IS), analyzing both the fidelity and diversity of generated samples. Initial results indicate that DDPM effectively models noise-injection and denoising processes, while DDIM significantly accelerates inference.

Introduction & Motivation

Diffusion models are a powerful class of generative models that have gained attention for their ability to generate high-quality and diverse data. These models function by progressively adding random noise to input data and then training a neural network to reverse the noise process step by step.

The framework has seen widespread application in domains such as image synthesis, text-to-image generation, medical imaging, 3D object generation for gaming and VR, and many other fields.

This study addresses the challenge of understanding and optimizing diffusion models for generative tasks, particularly image synthesis. The motivation stems from the growing demand for generative models that can efficiently produce high-quality outputs across a range of applications.

Input Datasets

  • CIFAR-10: 60,000 images (32×32) across 10 classes
  • ImageNet-100: 131,689 images (128×128) across 100 classes
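
As a rough illustration of the standardized preprocessing for these datasets, the sketch below uses torchvision to resize, crop, and map images to [-1, 1], a common convention for diffusion training; the exact transforms and augmentations used in the project may differ.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# Illustrative preprocessing: crop to the working resolution and map pixels to [-1, 1].
IMG_SIZE = 32  # 32 for CIFAR-10, 128 for ImageNet-100

transform = T.Compose([
    T.Resize(IMG_SIZE),
    T.CenterCrop(IMG_SIZE),
    T.RandomHorizontalFlip(),
    T.ToTensor(),                                 # scales to [0, 1]
    T.Normalize(mean=[0.5] * 3, std=[0.5] * 3),   # shifts to [-1, 1]
])

train_set = CIFAR10(root="./data", train=True, download=True, transform=transform)
```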

Key Applications

  • Image synthesis and generation
  • Text-to-image transformations
  • Medical imaging enhancement
  • 3D object generation for gaming/VR

Model Architecture & Methods

Denoising Diffusion Probabilistic Models (DDPM)

The DDPM model serves as the foundational approach, modeling the gradual noise injection and removal processes using a U-Net architecture. The forward process gradually corrupts data by adding Gaussian noise across multiple time steps, while the reverse process learns to denoise step-by-step.

Key Features:

  • Progressive noise addition following a Markov chain
  • U-Net architecture for noise prediction with skip connections
  • Time embedding for conditioning on diffusion step
  • Trained with 1000 diffusion time steps; initial sampling also used 1000 inference steps (see the training sketch below)
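
A minimal sketch of the DDPM forward (noising) process and training objective, assuming a linear beta schedule over 1000 steps and a U-Net `model(x_t, t)` that predicts the added noise; names and hyperparameters are illustrative, not the project's exact implementation.

```python
import torch
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)         # linear beta schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative alpha_bar_t

def ddpm_training_loss(model, x0):
    """One DDPM training step: noise clean images and predict that noise."""
    b = x0.shape[0]
    t = torch.randint(0, T_STEPS, (b,), device=x0.device)   # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward (noising) process
    pred = model(x_t, t)                                     # U-Net predicts the added noise
    return F.mse_loss(pred, noise)
```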

Denoising Diffusion Implicit Models (DDIM)

DDIM introduces a deterministic, non-Markovian sampling procedure that accelerates generation without compromising quality, reducing inference from 1000 steps to 200 in our experiments.

Key Advantages:

  • 80% reduction in sampling steps (1000 → 200)
  • Deterministic reverse process: the same initial noise always maps to the same image
  • Maintains image quality comparable to DDPM
  • Significantly faster inference for practical applications (see the sampling sketch below)
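
A minimal DDIM sampling sketch under the same assumptions (a noise-predicting U-Net `model(x_t, t)` and precomputed cumulative alphas); it strides 200 evenly spaced timesteps out of the original 1000 and applies the deterministic update at each step.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=200, device="cpu"):
    """Deterministic DDIM sampling over a strided subset of the training timesteps."""
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, num_steps).long()
    a_bar = alphas_cumprod.to(device)
    x = torch.randn(shape, device=device)                        # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = a_bar[t]
        a_prev = a_bar[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x, t_batch)                                   # U-Net noise prediction
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()     # implied clean image
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps # deterministic update
    return x
```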

Variational Autoencoder (VAE) Integration

VAE integration compresses high-dimensional input data into a structured latent space, significantly reducing computational overhead while preserving essential features.

Benefits:

  • Encoder-decoder architecture reduces data dimensionality
  • Gaussian-distributed latent space for efficient training
  • Enables faster and more scalable high-resolution generation
  • Particularly effective for large-scale image synthesis (a toy encode/decode sketch follows below)
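
The toy VAE below illustrates the encode/decode interface and the idea of 8x spatial compression; it is not the architecture used in the project, only a sketch of how diffusion can be moved into a lower-dimensional latent space.

```python
import torch
import torch.nn as nn

class SmallVAE(nn.Module):
    """Toy convolutional VAE: a 3x128x128 image maps to a 4x16x16 latent."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 2 * latent_ch, 4, stride=2, padding=1),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def encode(self, x):
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterization

    def decode(self, z):
        return self.decoder(z)

# The diffusion model then operates on z = vae.encode(x) instead of raw pixels,
# and generated latents are mapped back to images with vae.decode(z_sampled).
```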

Classifier-Free Guidance (CFG)

CFG enhances conditional image generation by integrating conditional and unconditional training within a single model, eliminating the need for external classifiers.

Implementation:

  • Interpolates between conditional and unconditional score functions
  • Guidance scales from 3 to 6 were tested to balance adherence to the condition against sample diversity
  • Provides flexible control over generation process
  • Improves sample quality and conditional generation accuracy (see the guidance sketch below)
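
A sketch of the two pieces of classifier-free guidance, assuming a conditional U-Net with signature `model(x_t, t, labels)` and a reserved null label; the label-dropout probability of 0.1 is a common choice, not necessarily the one used in this project.

```python
import torch

def maybe_drop_labels(labels, null_label, p_uncond=0.1):
    """Training-time label dropout: replace labels with a reserved null token
    a fraction of the time so a single model learns both conditional and
    unconditional noise prediction."""
    drop = torch.rand(labels.shape[0], device=labels.device) < p_uncond
    return torch.where(drop, torch.full_like(labels, null_label), labels)

@torch.no_grad()
def cfg_noise_prediction(model, x_t, t, labels, null_label, guidance_scale=4.0):
    """Sampling-time guidance: interpolate between conditional and unconditional predictions."""
    eps_cond = model(x_t, t, labels)                                 # conditioned on class labels
    eps_uncond = model(x_t, t, torch.full_like(labels, null_label))  # unconditional branch
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```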

Experimental Results

Model               | FID Score | Inception Score
DDPM (CIFAR-10)     | 519       | 1.2757, 1.0749
DDPM (ImageNet-100) | 340       | -
DDIM (CIFAR-10)     | 556       | 1.4757, 1.0449
DDIM (ImageNet-100) | 335       | -
VAE                 | 332       | 1.357, 1.72
DDPM + CFG          | 290       | 1.33, 1.021
DDPM + VAE + CFG    | 248.5     | 1.0849, 1.0610
Highlights:

  • Best FID score: 248.5 (DDPM + VAE + CFG)
  • Inference speed-up: 80% (DDIM vs. DDPM)
  • Inference steps: 200, reduced from 1000
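
For reference, FID and Inception Score can be computed with torchmetrics as sketched below; this is an assumed tooling choice, and the project's evaluation code may differ. torchmetrics reports IS as a (mean, std) pair, which would be consistent with the paired IS values in the table above.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def to_uint8(x):
    """Map images from [-1, 1] to uint8 [0, 255], the range torchmetrics expects by default."""
    return ((x.clamp(-1, 1) + 1) * 127.5).to(torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
inception = InceptionScore()

# Stand-in batches; in practice these come from the dataset and the sampler.
real_batch = torch.rand(16, 3, 32, 32) * 2 - 1
fake_batch = torch.rand(16, 3, 32, 32) * 2 - 1

fid.update(to_uint8(real_batch), real=True)
fid.update(to_uint8(fake_batch), real=False)
print("FID:", fid.compute().item())

inception.update(to_uint8(fake_batch))
is_mean, is_std = inception.compute()   # Inception Score is reported as (mean, std)
print("IS:", is_mean.item(), is_std.item())
```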

Key Findings

  • DDPM + VAE + CFG achieved the best overall performance, with an FID score of 248.5
  • DDIM successfully reduced inference time by 80% while maintaining comparable quality
  • VAE integration significantly reduced computational overhead for high-resolution images
  • CFG provided flexible control and improved conditional generation accuracy

Challenges & Future Work

Initial Challenges

  • Image Quality: Initial experiments produced blurry images, requiring parameter optimization and longer training
  • Computational Cost: High computational requirements limited experimentation speed and scale
  • VAE Scaling: Encountered scaling issues during VAE experiments that required careful tuning

Future Directions

Advanced Noise Scheduling

Further exploration of cosine, sigmoid, and adaptive scheduling strategies for optimal performance
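
For instance, the cosine schedule of Nichol & Dhariwal (2021) can be written as a drop-in alternative to the linear beta schedule; this is only an illustration of one candidate direction, not something evaluated in this project.

```python
import math
import torch

def cosine_alphas_cumprod(num_steps=1000, s=0.008):
    """Cosine noise schedule: alpha_bar follows a squared-cosine curve,
    retaining more signal at early timesteps than a linear beta schedule."""
    t = torch.linspace(0, num_steps, num_steps + 1) / num_steps
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = f / f[0]
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return alphas_cumprod[1:], betas.clamp(max=0.999)
```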

Learned Variance

Implementing learnable variance in Gaussian transitions for more flexible modeling

Architecture Enhancements

Exploring Diffusion Transformers (DiT) and larger U-Net architectures

Extended Applications

Applying to video generation, 3D synthesis, and medical imaging tasks

Conclusion

This project successfully implemented and evaluated DDPM and DDIM, demonstrating their effectiveness in high-quality image generation. Our experiments revealed that combining DDPM with VAE and CFG achieves the best results, with an FID score of 248.5.

The scaled linear noise schedule proved most effective for VAE configurations. DDIM achieved an impressive 80% reduction in inference time while maintaining quality, making it practical for real-world applications.

This study deepens understanding of diffusion model mechanisms and lays a foundation for further innovations in image synthesis, text-to-image generation, medical imaging, and other generative AI applications.