Project Analysis
Stable Diffusion-Based Image Generation and Prompt Engineering
Problem Statement
Generating high-quality, realistic, and stylistically accurate images from text prompts is a complex task. Models like Stable Diffusion can produce widely varied outputs depending on several generation parameters:
- Prompt wording: Subtle changes can drastically affect composition and detail
- Negative prompts: Used to suppress unwanted elements
- Scheduler selection: Determines the sampling algorithm used during denoising (e.g., DDIM, Euler)
- CFG scale: Controls the strength of prompt adherence (higher = more literal)
- Inference steps: Affect image quality, detail, and noise reduction
Understanding how each parameter impacts the output is critical for controlling quality, style, and realism in AI art and computer vision applications.
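As a rough illustration of how these parameters fit together, the sketch below collects them in one place and turns them into keyword arguments. The `GenerationConfig` class and its default values are hypothetical helpers, not part of the project code; the keyword names (`prompt`, `negative_prompt`, `guidance_scale`, `num_inference_steps`) match the arguments accepted by the `StableDiffusionPipeline` call in the Hugging Face diffusers library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical container for the generation parameters listed above."""
    prompt: str
    negative_prompt: str = ""
    guidance_scale: float = 7.5        # CFG scale (assumed default)
    num_inference_steps: int = 40      # assumed default
    scheduler: str = "DPMSolverMultistepScheduler"

    def pipeline_kwargs(self) -> dict:
        """Keyword arguments for a pipeline call.

        The scheduler is configured on the pipeline object itself,
        so it is not passed as a per-call argument.
        """
        kwargs = asdict(self)
        kwargs.pop("scheduler")
        if not kwargs["negative_prompt"]:
            kwargs.pop("negative_prompt")
        return kwargs


config = GenerationConfig(
    prompt="A cozy cottage in the forest at sunset, highly detailed",
    negative_prompt="daylight, bright, sunny, day time",
)
print(config.pipeline_kwargs())
```

Keeping every setting in a single immutable object makes it easier to record exactly which combination produced a given image.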
Objective
To explore and evaluate how prompt design, negative prompts, diffusion schedulers, CFG scale, and inference steps impact the output quality of a Stable Diffusion model.
Identify optimal parameter combinations that strike a balance between:
- Realism: Photographic quality and coherence of the output
- Creativity: Diversity and uniqueness of generated content
- Style Control: Consistent aesthetic output aligned with artistic intent
- Main Prompt: "A cozy cottage in the forest at sunset, highly detailed"
- Negative Prompt: "daylight, bright, sunny, day time"
- Insight: Negative prompt effectively transformed the output into a night-time setting by suppressing bright/daytime features.
- Schedulers Tested:
  - DPMSolverMultistepScheduler: Clearer, sharper, photorealistic
  - EulerAncestralDiscreteScheduler: Softer, more artistic but less sharp
- CFG Values Tested: 1.0, 3.0, 7.5, 12.0, 15.0, 19.0, 25.0
- Observations:
  - Low CFG → More artistic, vague
  - Mid-range CFG (7.5–12.0) → Best realism and fidelity
  - High CFG → Over-literal prompt adherence, unwanted artifacts
- Conclusion: Optimal CFG range: 7.5–12.0
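The effect of the CFG scale follows from how classifier-free guidance combines two noise predictions at each denoising step: one conditioned on the prompt and one "unconditional" prediction. The sketch below shows only that arithmetic on made-up three-element lists; `apply_cfg` is a hypothetical helper and the numbers are not model output. In diffusers, when a negative prompt is supplied, its embedding takes the place of the empty-prompt embedding for the unconditional branch, which is why negative prompts push the result away from the listed concepts.

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction
    and move toward (or past) the prompt-conditioned one."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


# Toy "noise predictions" (illustrative values only).
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

for scale in (1.0, 7.5, 12.0, 25.0):
    print(scale, apply_cfg(uncond, cond, scale))
```

At a scale of 1.0 the result equals the conditioned prediction; larger scales extrapolate further along the same direction, which is consistent with the artifacts observed at very high values.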
- Steps Tested: 10, 40, 80
- Findings:
  - 10 steps → Fast but blurry with many artifacts
  - 40 steps → Best balance between quality and speed
  - 80 steps → Slight improvement, but not time-efficient
- Conclusion: 40 steps is the sweet spot
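One way to quantify the speed side of this trade-off is to time a single generation per step count. The sketch below is a minimal example under stated assumptions: `time_step_counts` is a hypothetical helper, and `fake_generate` is a stand-in that merely sleeps so the code runs without a GPU. In a real run it would be replaced by a function that calls the pipeline with `num_inference_steps=steps`.

```python
import time


def time_step_counts(generate, step_counts):
    """Return {steps: seconds} for one call of generate(steps) per setting."""
    timings = {}
    for steps in step_counts:
        start = time.perf_counter()
        generate(steps)
        timings[steps] = time.perf_counter() - start
    return timings


def fake_generate(steps):
    """Stand-in for a pipeline call; cost grows linearly with steps."""
    time.sleep(0.001 * steps)


timings = time_step_counts(fake_generate, [10, 40, 80])
for steps, seconds in timings.items():
    print(f"{steps:>3} steps: {seconds:.3f} s")
```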
- Prompt: "A futuristic glass building glowing at night, cinematic lighting, ultra-realistic"
- Parameter Combos Tested:
  - CFG=7.5, Steps=30 → Acceptable but low clarity
  - CFG=12.0, Steps=40 → Best result: realistic, clean, glowing effects
  - CFG=15.0, Steps=60 → Over-sharpened with lighting artifacts
- Stable Diffusion v1.5: Pretrained text-to-image generation model
- Diffusers Library (Hugging Face): Simplified pipeline for loading models and schedulers
- DPMSolverMultistepScheduler
- EulerAncestralDiscreteScheduler
- CUDA GPU: Used for efficient, fast image generation
- Python: Core programming language
- Jupyter Notebook: Interactive development and visualization
- CFG Balance: Too low leads to vague, unfocused results; too high produces over-constrained, rigid textures
- Inference Time: Significantly longer at higher steps (e.g., 80+), affecting real-time usability
- Scheduler Tradeoff: DPMSolver offers realism and sharpness; EulerA provides softness with an artistic style. The choice depends on the use case
- Manual Prompt Tuning: Required for each new prompt to find optimal configuration (no one-size-fits-all)
- GPU Memory Management: Generating multiple images concurrently can exhaust GPU resources; requires careful batching or scaling
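A simple way to keep memory use bounded is to split a long prompt list into small batches and generate one batch at a time. The sketch below shows only the chunking logic; `batched` is a hypothetical helper and the batch size of 2 is an arbitrary assumption that would need tuning to the available GPU memory.

```python
def batched(items, batch_size):
    """Yield consecutive slices containing at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = [f"prompt {i}" for i in range(5)]
for batch in batched(prompts, batch_size=2):
    # A diffusers pipeline accepts a list of prompts in one call, e.g.
    #   images = pipe(batch, num_inference_steps=40).images
    print(batch)
```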
- Used StableDiffusionPipeline from Hugging Face diffusers
- Deployed on GPU for fast inference
- Schedulers were set dynamically based on experiment
- Tested base prompts, negative prompts, and parameter combinations
- Generated and saved images by varying one key parameter per experiment
- Visual sharpness
- Prompt accuracy
- Artifact presence
- Realism vs artistic expression
- Prompt design and tuning of CFG scale & inference steps are critical to realistic generation
- Best Configuration:
  - Scheduler: DPMSolverMultistepScheduler
  - CFG Scale: 12.0
  - Inference Steps: 40
- Negative prompts and scheduler selection
Result / Outcome
Visual Output
- Epoch 1: Random noise
- Epoch 10: Shapes of digits begin to appear
- Epoch 30: Recognizable, sharp digits generated
Loss Behavior
| Epoch | d_loss | g_loss |
|---|---|---|
| 1 | 0.45 | 0.44 |
| 10 | 0.65 | 0.91 |
| 30 | 0.64 | 0.95 |
- Discriminator loss remained stable
- Generator loss increased as expected (indicates adversarial progress)
Hyperparameter Tuning
- Learning rates tested: 0.0002, 0.0001
- Adam β₁ values: 0.5, 0.4
- Latent dimensions: 100, 128
Best Configuration:
- lr = 0.0002
- β₁ = 0.4
- latent_dim = 100
- Generator loss: 0.80
- Discriminator loss: 0.63
Proposed Solution
This project was executed in 6 experimental parts, each focusing on evaluating a specific parameter of the Stable Diffusion model.
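The one-parameter-per-experiment design can be sketched as follows. The `Settings` class, the `one_at_a_time` helper, and the baseline values are illustrative assumptions (the write-up does not state which baseline was held fixed); the swept values are the ones reported in this document.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Hypothetical baseline; every experiment changes exactly one field."""
    scheduler: str = "DPMSolverMultistepScheduler"
    guidance_scale: float = 7.5
    num_inference_steps: int = 40


def one_at_a_time(baseline, sweeps):
    """Yield (field_name, settings), varying a single field per run."""
    for field_name, values in sweeps.items():
        for value in values:
            yield field_name, replace(baseline, **{field_name: value})


sweeps = {
    "scheduler": [
        "DPMSolverMultistepScheduler",
        "EulerAncestralDiscreteScheduler",
    ],
    "guidance_scale": [1.0, 3.0, 7.5, 12.0, 15.0, 19.0, 25.0],
    "num_inference_steps": [10, 40, 80],
}

runs = list(one_at_a_time(Settings(), sweeps))
for name, settings in runs:
    print(name, settings)
```

Holding everything else at the baseline means any visible difference between two images in the same experiment can be attributed to the single parameter that changed, provided the random seed is also fixed.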
Technologies Used
Challenges Faced
Methodology
| Experiment | Best Configuration | Insight/Output |
|---|---|---|
| Negative Prompting | Added: "daylight, sunny…" | Shifted sunset to night effectively |
| Scheduler Comparison | DPMSolver | Clearer and more realistic than EulerA |
| CFG Scale | 7.5–12.0 | Best balance between accuracy and style |
| Inference Steps | 40 | Fast + high-quality rendering |
| Prompt Optimization | CFG=12.0, Steps=40 | Best result for futuristic architecture scene |
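The best configuration from the table could be reproduced along these lines with the diffusers library. This is a sketch rather than the project's actual code: the function names are hypothetical, and the model ID and `float16` precision are assumptions. `generate_best` accepts any pipeline-like callable, so it can be exercised without a GPU.

```python
def load_pipeline(model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
    """Load Stable Diffusion v1.5 with the DPM-Solver multistep scheduler.

    Requires diffusers, torch, and a CUDA GPU; the model ID is an assumption.
    """
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    # Swap the default scheduler while reusing its configuration.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config
    )
    return pipe.to("cuda")


def generate_best(pipe, prompt, negative_prompt=None):
    """Generate one image with CFG scale 12.0 and 40 inference steps."""
    result = pipe(
        prompt,
        negative_prompt=negative_prompt,
        guidance_scale=12.0,
        num_inference_steps=40,
    )
    return result.images[0]


# Usage on a CUDA machine (downloads model weights on first run):
#   pipe = load_pipeline()
#   image = generate_best(
#       pipe,
#       "A futuristic glass building glowing at night, "
#       "cinematic lighting, ultra-realistic",
#   )
#   image.save("best_config.png")
```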