Latent Denoising Diffusion Probabilistic Models
Work Responsibilities
- Reviewed literature on DDPM, DDIM, VAE, and CFG to analyze trade-offs in image generation quality, inference speed, and controllability; set up the ImageNet-100 benchmark and co-designed standardized preprocessing pipelines.
- Co-implemented multiple generation pipelines and noise schedulers, and conducted quantitative analysis to evaluate efficiency, cost, and controllability trade-offs in diffusion models.
Abstract
Diffusion models have emerged as a robust class of generative models, excelling in tasks like image synthesis, video generation, and text-to-image transformations. This project implements and evaluates Denoising Diffusion Probabilistic Models (DDPM) and Denoising Diffusion Implicit Models (DDIM), focusing on their ability to generate high-quality images.
Key components include the U-Net architecture for noise prediction, a DDPM noise scheduler for forward and reverse diffusion processes, and enhancements like Variational Autoencoders (VAE) for latent space mapping and Classifier-Free Guidance (CFG) for conditional image generation.
The project evaluates model performance using Fréchet Inception Distance (FID) and Inception Score (IS), analyzing both the fidelity and diversity of generated samples. Initial results indicate that DDPM effectively models the noise-injection and denoising processes, while DDIM significantly accelerates inference.
Introduction & Motivation
Diffusion models are a powerful class of generative models that have gained attention for their ability to generate high-quality and diverse data. These models function by progressively adding random noise to input data and then training a neural network to reverse the noise process step by step.
The framework has seen widespread application in domains such as image synthesis, text-to-image generation, medical imaging, 3D object generation for gaming and VR, and many other fields.
This study addresses the challenge of understanding and optimizing diffusion models for generative tasks, particularly image synthesis. The motivation stems from the growing demand for generative models that can efficiently produce high-quality outputs across a range of applications.
Input Datasets
- CIFAR-10: 60,000 images (32×32) across 10 classes
- ImageNet-100: 131,689 images (128×128) across 100 classes
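Both datasets were run through a standardized preprocessing pipeline. The exact transforms are not documented in this report, so the sketch below is an assumption of what such a pipeline could look like with torchvision:

```python
import torchvision.transforms as T

# Hypothetical preprocessing for ImageNet-100 (transforms are assumed,
# not taken from the project): resize, crop to 128x128, map to [-1, 1].
imagenet_tf = T.Compose([
    T.Resize(144),                      # shorter side slightly above target
    T.CenterCrop(128),                  # 128x128 crops per the dataset spec
    T.RandomHorizontalFlip(),           # light augmentation for training
    T.ToTensor(),                       # HWC uint8 -> CHW float in [0, 1]
    T.Normalize([0.5] * 3, [0.5] * 3),  # [-1, 1], the usual DDPM input range
])

# CIFAR-10 is already 32x32, so no resizing is needed.
cifar_tf = T.Compose([
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize([0.5] * 3, [0.5] * 3),
])
```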
Key Applications
- Image synthesis and generation
- Text-to-image transformations
- Medical imaging enhancement
- 3D object generation for gaming/VR
Model Architecture & Methods
Denoising Diffusion Probabilistic Models (DDPM)
The DDPM serves as the foundational approach, modeling gradual noise injection and removal with a U-Net architecture. The forward process corrupts data by adding Gaussian noise over many time steps, while the reverse process learns to denoise step by step; the closed-form forward process is given after the feature list.
Key Features:
- Progressive noise addition following a Markov chain
- U-Net architecture for noise prediction with skip connections
- Time embedding for conditioning on the diffusion step
- Trained with 1,000 time steps; initial experiments also used 1,000 inference steps
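Concretely, with a variance schedule β_t, the forward process admits a closed form (standard DDPM, Ho et al., 2020), so a noisy x_t can be sampled directly from a clean image x_0 and the U-Net ε_θ is trained to predict the injected noise:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\mathbf{I}\big),
\qquad \alpha_t = 1 - \beta_t, \quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s
```

Equivalently, x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε with ε ~ N(0, I), and training minimizes the simple objective E‖ε − ε_θ(x_t, t)‖².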
Denoising Diffusion Implicit Models (DDIM)
DDIM introduces a deterministic, non-Markovian sampling procedure that accelerates generation without compromising quality, reducing the number of sampling steps from 1,000 to 200 in our setup; a minimal sketch of the update rule follows the list of advantages.
Key Advantages:
- 80% reduction in sampling steps (1,000 → 200)
- Deterministic reverse process eliminates sampling randomness
- Maintains image quality comparable to DDPM
- Significantly faster inference for practical applications
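A minimal sketch of the deterministic DDIM update (the η = 0 case), assuming a trained noise-prediction network `eps_model` and a precomputed `alpha_bar` tensor — both names are hypothetical:

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update (eta = 0): predict the noise,
    estimate x_0, then jump directly to the earlier timestep t_prev."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_model(x_t, t)                                # predicted noise
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # estimate of x_0
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

# Sampling with 200 of the 1,000 training steps (evenly strided):
# timesteps = torch.arange(0, 1000, 5).flip(0)
# for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
#     x = ddim_step(eps_model, x, t, t_prev, alpha_bar)
```

Because the update is deterministic, the same initial noise always maps to the same image.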
Variational Autoencoder (VAE) Integration
Integrating a VAE compresses high-dimensional images into a structured latent space, so that diffusion runs on compact latents rather than full-resolution pixels, significantly reducing computational overhead while preserving essential features; a sketch of the latent training step follows the benefits list.
Benefits:
- Encoder-decoder architecture reduces data dimensionality
- Gaussian-distributed latent space for efficient training
- Enables faster and more scalable high-resolution generation
- Particularly effective for large-scale image synthesis
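A sketch of how the diffusion process moves into latent space, assuming a pretrained autoencoder exposing `encode`/`decode` (a hypothetical interface; the actual latent shape depends on the VAE):

```python
import torch

def latent_train_step(vae, eps_model, x, alpha_bar):
    """One latent-diffusion training step: encode images once,
    then run the usual noise-prediction objective on the latents."""
    with torch.no_grad():
        z0 = vae.encode(x)                  # e.g. 3x128x128 -> Cx16x16 latents
    t = torch.randint(0, len(alpha_bar), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps  # closed-form forward process
    return torch.nn.functional.mse_loss(eps_model(z_t, t), eps)

# At sampling time the reverse process runs entirely on latents,
# and vae.decode is called once at the end to produce the image.
```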
Classifier-Free Guidance (CFG)
CFG enhances conditional image generation by integrating conditional and unconditional training within a single model, eliminating the need for external classifiers; the guided prediction is sketched after the implementation notes.
Implementation:
- Interpolates between conditional and unconditional score functions
- Guidance scales tested from 3 to 6 for optimal balance
- Provides flexible control over the generation process
- Improves sample quality and generation accuracy
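The guided prediction itself is only a few lines. The sketch below assumes an `eps_model` that accepts an optional class label and a reserved `null_label` for the unconditional branch (hypothetical names):

```python
import torch

def cfg_eps(eps_model, x_t, t, y, null_label, w=4.0):
    """Classifier-free guided noise prediction: extrapolate from the
    unconditional prediction toward the conditional one by scale w."""
    eps_cond = eps_model(x_t, t, y)             # class-conditional prediction
    eps_uncond = eps_model(x_t, t, null_label)  # unconditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)

# Training side: drop the true label with some probability (e.g. 10%) so a
# single model learns both the conditional and unconditional score functions.
```

With w = 1 this reduces to plain conditional sampling; the scales of 3 to 6 tested here trade diversity for fidelity.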
Experimental Results
| Model | FID ↓ | Inception Score ↑ |
|---|---|---|
| DDPM (CIFAR) | 519 | 1.2757, 1.0749 |
| DDPM (ImageNet) | 340 | - |
| DDIM (CIFAR) | 556 | 1.4757, 1.0449 |
| DDIM (ImageNet) | 335 | - |
| VAE | 332 | 1.357, 1.72 |
| DDPM + CFG | 290 | 1.33, 1.021 |
| DDPM + VAE + CFG | 248.5 | 1.0849, 1.0610 |
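For reference, FID can be computed with the `torchmetrics` library as sketched below; this is one common option, not necessarily the exact evaluation code used in this project:

```python
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # InceptionV3 pool features
# Both image sets are uint8 tensors in [0, 255] with shape (N, 3, H, W).
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(fid.compute())  # distance between feature statistics; lower is better
```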
Key Findings
- DDPM + VAE + CFG achieved the best overall performance, with an FID of 248.5
- DDIM reduced inference time by 80% while maintaining comparable quality
- VAE integration significantly reduced computational overhead for high-resolution images
- CFG provided flexible control and improved conditional generation accuracy
Challenges & Future Work
Initial Challenges
- Image Quality: Initial experiments produced blurry images, requiring parameter optimization and longer training
- Computational Cost: High computational requirements limited experimentation speed and scale
- VAE Scaling: Scaling issues during VAE experiments required careful tuning
Future Directions
- Advanced Noise Scheduling: Further exploration of cosine, sigmoid, and adaptive scheduling strategies (a sketch of the cosine variant follows this list)
- Learned Variance: Implementing learnable variance in Gaussian transitions for more flexible modeling
- Architecture Enhancements: Exploring Diffusion Transformers (DiT) and larger U-Net architectures
- Extended Applications: Applying to video generation, 3D synthesis, and medical imaging tasks
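As a concrete starting point for the scheduling work, a sketch of the cosine schedule from Nichol & Dhariwal (2021), which decays ᾱ_t more gently than the linear schedule:

```python
import math
import torch

def cosine_schedule(T=1000, s=0.008):
    """Cosine noise schedule: alpha_bar_t = f(t) / f(0) with
    f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    t = torch.arange(T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Derive per-step betas, clipped as in the original paper.
    betas = (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)
    return alpha_bar[1:], betas
```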
Conclusion
This project successfully implemented and evaluated DDPM and DDIM, demonstrating their effectiveness in high-quality image generation. Our experiments showed that combining DDPM with VAE and CFG achieves the best results, with an FID of 248.5.
The scaled linear noise schedule proved most effective for VAE configurations. DDIM achieved an impressive 80% reduction in inference time while maintaining quality, making it practical for real-world applications.
This study deepens understanding of diffusion model mechanisms and lays a foundation for further innovations in image synthesis, text-to-image generation, medical imaging, and other generative AI applications.