In the rapidly evolving field of medical imaging, diffusion-based data augmentation for medical image segmentation is emerging as a game-changing solution to one of the most persistent challenges in AI-driven diagnostics: the scarcity of annotated pathological data. A groundbreaking new framework, DiffAug, introduced by Nazir, Aqeel, and Setti in their 2025 paper, leverages the power of text-guided diffusion models and latent-space validation to generate high-quality synthetic abnormalities—boosting model performance and reducing false negatives in critical screening tasks.
This article dives deep into how DiffAug works, why it outperforms existing methods, and what it means for the future of AI in radiology, gastroenterology, and ophthalmology.
Why Data Scarcity is a Major Bottleneck in Medical AI
Deep learning has transformed medical image analysis, enabling early disease detection and precise diagnosis. However, these models rely heavily on large, well-annotated datasets to learn complex patterns. While normal anatomical structures are abundant in clinical databases, rare pathologies—such as small polyps in colonoscopies or early glaucomatous changes in fundus images—are severely underrepresented.
This class imbalance leads to:
- Poor sensitivity for rare abnormalities
- High false negative rates
- Overfitting on common patterns
Traditional data augmentation techniques like rotation, flipping, and scaling cannot solve this issue because they do not create new pathological patterns. Generative Adversarial Networks (GANs) have been explored but suffer from mode collapse, training instability, and lack of spatial control.
Enter diffusion models—a new generation of generative AI that promises high-fidelity, controllable image synthesis.
Introducing DiffAug: A Novel Framework for Medical Data Augmentation
DiffAug (Diffusion-Based Data Augmentation) is a cutting-edge framework that combines text-guided diffusion models with automatic segmentation validation to generate anatomically plausible synthetic abnormalities for training medical AI models.
Key Features of DiffAug:
- ✅ Text-guided generation using medical literature-based prompts
- ✅ Spatial conditioning via binary masks for precise anomaly placement
- ✅ Latent-space inpainting on normal images for anatomical consistency
- ✅ Single-step latent segmentation validation for real-time quality control
- ✅ Automatic filtering of low-quality generations using IoU thresholds
The result? A robust, scalable pipeline that generates high-quality synthetic data without manual annotation—directly addressing the data scarcity problem in medical imaging.
How DiffAug Works: A Step-by-Step Breakdown
1. Language-Guided Diffusion-Based Image Generation
At the core of DiffAug is a Latent Diffusion Model (LDM), specifically adapted from Stable Diffusion XL (SDXL). The model operates in a compressed latent space, balancing computational efficiency with fine detail preservation.
Given a normal medical image \( I_n \), it is first encoded into a latent representation:

\[ z_n = E(I_n) \]

where \( E \) is the pre-trained VAE encoder from Stable Diffusion.
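As a quick illustration of this encoding step, here is a minimal sketch using the diffusers `AutoencoderKL` API (the checkpoint name is an illustrative assumption; any Stable Diffusion VAE works the same way):

```python
import torch
from diffusers import AutoencoderKL

# Load a pre-trained VAE encoder E from a Stable Diffusion checkpoint (illustrative choice)
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-inpainting", subfolder="vae")
vae.eval()

@torch.no_grad()
def encode_to_latent(image: torch.Tensor) -> torch.Tensor:
    """z_n = E(I_n): map a (B, 3, H, W) image in [-1, 1] to the compressed latent space."""
    return vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
```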
The model then performs conditional inpainting to insert abnormalities into the image. This process is guided by two inputs:
- Text prompts (e.g., “sessile polyp with irregular surface”, “enlarged optic cup”)
- Spatial masks indicating plausible abnormality locations
The text prompts are processed by a pre-trained text encoder \( \tau \):

\[ e_{\text{text}} = \tau(T) \]

During the reverse diffusion process, the denoising network \( \epsilon_\theta \) predicts noise while respecting both textual and spatial constraints:

\[ \hat{\epsilon} = \epsilon_{\theta}(z_t, t, e_{\text{text}}, M, z_n) \]

The training objective ensures that abnormalities are generated only within masked regions while the surrounding anatomy is preserved:

\[ \mathcal{L}_{\text{gen}} = \mathbb{E}_{t,\epsilon,z_0} \Big[ \| M \odot (\epsilon - \hat{\epsilon}) \|_2^2 + \lambda_{\text{preserve}} \, \| (1-M) \odot (z_t - \sqrt{\bar{\alpha}_t}\, z_n) \|_2^2 \Big] \]

Classifier-free guidance is used during inference to strengthen the conditioning:

\[ \epsilon^{\text{guided}} = \epsilon^{\text{uncond}} + s \cdot \big( \epsilon^{\text{cond}} - \epsilon^{\text{uncond}} \big) \]

This allows the model to generate diverse, realistic abnormalities with fine-grained control.
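To make the last two equations concrete, here is a minimal PyTorch sketch of the masked training objective and the guidance step (tensor names and the default weights are illustrative assumptions, not values from the paper):

```python
import torch

def masked_generation_loss(eps, eps_hat, z_t, z_n, mask, alpha_bar_t: float, lambda_preserve: float = 1.0):
    """Sketch of L_gen: noise-prediction error inside the mask, anatomy-preservation penalty outside it."""
    loss_noise = (mask * (eps - eps_hat)).pow(2).mean()
    loss_preserve = ((1 - mask) * (z_t - alpha_bar_t ** 0.5 * z_n)).pow(2).mean()
    return loss_noise + lambda_preserve * loss_preserve

def classifier_free_guidance(eps_uncond, eps_cond, s: float = 7.5):
    """Blend unconditional and conditional noise predictions at inference time."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```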
2. Latent-Space Segmentation for Quality Validation
A major innovation in DiffAug is its built-in quality validation mechanism. Instead of relying on subjective human review or slow post-processing, DiffAug uses a segmentation network that operates directly in the latent space.
Given a generated abnormal image \( I_a \), its latent representation is:

\[ z_{a} = E(I_{a}) \]

A dual-encoder setup combines:
- A frozen VAE encoder E
- A trainable vision encoder Vψ
The fused features are:
\[ f_{\text{fused}} = \psi\big([z_a \,;\, V_{\psi}(z_a)]\big) \]

The segmentation network \( S_\phi \) estimates the clean latent segmentation \( \hat{z}_0 \) in a single step:

\[ \hat{z}_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}} \Big( z_{t} - \sqrt{1-\bar{\alpha}_{t}}\,\hat{\epsilon}^{\text{seg}} \Big) \]

This enables real-time validation without iterative sampling, drastically improving efficiency.
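In code, the fusion and the single-step estimate reduce to a few lines; this sketch assumes the encoders and fusion layer are defined elsewhere, and only the algebra mirrors the formulas above:

```python
import torch

def fuse_features(z_a, vision_encoder, fusion):
    """f_fused = psi([z_a ; V_psi(z_a)]): concatenate latent and vision features, then project."""
    return fusion(torch.cat([z_a, vision_encoder(z_a)], dim=1))

def single_step_latent_estimate(z_t, eps_hat_seg, alpha_bar_t: float):
    """Recover the clean latent z0 in one step instead of iterating the reverse process."""
    return (z_t - (1.0 - alpha_bar_t) ** 0.5 * eps_hat_seg) / alpha_bar_t ** 0.5
```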
The training loss combines noise prediction, latent supervision, and Dice loss:
\[ \mathcal{L}_{\text{seg}} = \lVert \epsilon - \hat{\epsilon}_{\text{seg}} \rVert_{1} + \lambda_{\text{latent}} \lVert z_{0} - \hat{z}_{0} \rVert_{1} + \lambda_{\text{dice}} \, \mathcal{L}_{\text{Dice}}\big(D(M^{\text{latent}}), M_{\text{gt}}\big) \]

3. Integrated Pipeline and Quality Control
The full DiffAug pipeline runs as an iterative generation-validation loop:
- Generate multiple abnormal variants using different text prompts and masks
- Encode the generated image into latent space
- Predict segmentation mask using the latent-space network
- Compute IoU between the intended mask \( M_i \) and the predicted mask \( M^{\text{latent}} \)
- Accept the sample only if IoU exceeds \( \theta_{\text{quality}} = 0.7 \) (a minimal sketch of this check follows below)
This threshold was empirically optimized to balance diversity (65–78% acceptance rate) and localization accuracy.
Only validated synthetic pairs \( (I_a, M) \) are added to the training set.
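A minimal sketch of the accept/reject check (function names are illustrative; only the IoU rule and the 0.7 threshold come from the paper):

```python
import torch

def iou(pred_mask: torch.Tensor, target_mask: torch.Tensor, eps: float = 1e-6) -> float:
    """IoU between two binary masks of identical shape."""
    inter = (pred_mask * target_mask).sum()
    union = pred_mask.sum() + target_mask.sum() - inter
    return ((inter + eps) / (union + eps)).item()

def accept_sample(intended_mask: torch.Tensor, predicted_mask: torch.Tensor, threshold: float = 0.7) -> bool:
    """Keep a synthetic image only if its predicted mask matches the intended placement."""
    return iou(predicted_mask, intended_mask) > threshold
```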
Experimental Results: Outperforming State-of-the-Art Methods
DiffAug was evaluated on three major medical imaging benchmarks:
| DATASET | TASK | ABNORMALITIES |
|---|---|---|
| CVC-ClinicDB | Polyp Segmentation | 488 polyps |
| Kvasir-SEG | Polyp Segmentation | 800 polyps |
| REFUGE2 | Glaucoma Assessment | 800 optic cup/disc pairs |
Table 1: Segmentation Performance (Dice / IoU %)
| METHOD | CVC-CLINICDB | KVASIR-SEG | REFUGE2 |
|---|---|---|---|
| U-Net | 88.3 / 81.4 | 86.7 / 79.2 | 80.1 / 72.3 |
| TransUNet | 91.8 / 85.9 | 90.2 / 83.8 | 85.6 / 78.2 |
| SDSeg | 95.8 / 92.6 | 94.9 / 92.1 | 89.4 / 81.8 |
| DiffAug (Ours) | 96.4 / 93.2 | 95.6 / 92.8 | 90.2 / 82.6 |
DiffAug achieves an 8–10 point Dice improvement over the U-Net baseline and sets new benchmarks across all three datasets.
Clinical Impact: Reducing False Negatives in Critical Cases
One of the most significant advantages of DiffAug is its ability to reduce false negative rates (FNR) in challenging cases where early detection is crucial.
Table 2: Lesion-Level Detection Performance
| ABNORMALITY | METHOD | SENSITIVITY (%) | FNR (%) | FNR REDUCTION VS. BASELINE |
|---|---|---|---|---|
| Small Polyps (<5 mm) | U-Net | 68.3 | 31.7 | — |
| Small Polyps (<5 mm) | SDSeg | 74.1 | 25.9 | 18.3% |
| Small Polyps (<5 mm) | DiffAug | 76.8 | 23.2 | 26.8% |
| Flat Lesions | U-Net | 61.5 | 38.5 | — |
| Flat Lesions | SDSeg | 68.9 | 31.1 | 19.2% |
| Flat Lesions | DiffAug | 72.4 | 27.6 | 28.3% |
| Early Glaucoma (CDR 0.5–0.6) | U-Net | 72.1 | 27.9 | — |
| Early Glaucoma (CDR 0.5–0.6) | SDSeg | 77.3 | 22.7 | 18.6% |
| Early Glaucoma (CDR 0.5–0.6) | DiffAug | 79.3 | 20.7 | 25.8% |
DiffAug reduces FNR by up to 28%, meaning dozens of previously missed polyps or early glaucoma cases can now be detected—potentially saving lives through earlier intervention.
Computational Efficiency: Fast, Scalable, and Practical
Unlike many diffusion-based methods that require iterative sampling and multi-GPU setups, DiffAug is designed for clinical practicality.
Table 3: Computational Efficiency Comparison
| METHOD | TRAINING TIME | GPU REQUIREMENT | INFERENCE SPEED (SAMPLES/SEC) |
|---|---|---|---|
| MedSegDiff-V1 | 48 hrs | 4× GPU | 0.30 |
| SDSeg | 12 hrs | 1× GPU | 8.36 |
| DiffAug (Seg) | 8 hrs | 1× GPU | 12.50 |
| DiffAug (Gen) | 15 hrs | 1× GPU | 0.40 (offline) |
- Segmentation inference: 12.5 samples/sec, roughly 40× faster than MedSegDiff-V1
- Generation: 0.40 samples/sec (offline preprocessing)
- Throughput: ~1,000 validated images/hour on a single RTX 4090
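(As a sanity check on these numbers: 0.40 generations/sec is about 1,440 raw images per hour; at the reported 65–78% acceptance rate, that yields roughly 940–1,120 validated images per hour, consistent with the ~1,000 figure.)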
This makes DiffAug ideal for research labs and hospitals with limited computational resources.
Ablation Study: What Makes DiffAug Work?
To validate each component, an ablation study was conducted on the CVC-ClinicDB dataset.
Table 4: Ablation Results (CVC-ClinicDB)
| CONFIGURATION | DICE (%) | IOU (%) | FNR (%) | SPEED (SAMPLES/SEC) |
|---|---|---|---|---|
| Full DiffAug | 96.4 | 93.2 | 23.2 | 12.5 |
| w/o Text Guidance | 95.3 | 91.8 | 25.1 | 12.5 |
| w/o Validation Loop | 95.0 | 91.4 | 25.8 | N/A |
| w/o Diverse Prompts | 95.6 | 92.3 | 24.3 | 12.5 |
| w/o Latent Estimation | 96.1 | 92.8 | 23.7 | 0.31 |
| w/o Synthetic Data | 88.3 | 81.4 | 31.7 | 12.5 |
Key findings:
- Text guidance improves Dice by 1.1 points
- The validation loop prevents a 1.4-point performance drop
- Latent estimation enables 40× faster inference
- Synthetic data alone boosts Dice by 8.1 points
Synthetic Data Volume: How Much is Enough?
The study also analyzed performance vs. synthetic data volume.
Table 5: Performance vs. Synthetic Data (CVC-ClinicDB)
| SYNTHETIC DATA | DICE (%) | IOU (%) |
|---|---|---|
| 0× | 88.3 | 81.4 |
| 0.5× | 92.8 | 87.3 |
| 1× | 94.6 | 90.1 |
| 2× | 95.7 | 92.0 |
| 3× | 96.4 | 93.2 |
| 5× | 96.3 | 93.1 |
Performance plateaus at 3× synthetic data, suggesting optimal diversity is achieved without unnecessary computational cost.
Why DiffAug is a Game-Changer for Medical AI
- No Manual Annotation Needed: Text prompts derived from medical literature eliminate the need for expert labeling.
- High-Quality, Diverse Synthesis: Combines textual and spatial control for realistic, varied abnormalities.
- Built-in Quality Assurance: Automatic validation ensures only accurate, well-localized samples are used.
- Clinically Relevant Gains: Reduces false negatives in small polyps, flat lesions, and early glaucoma.
- Efficient & Scalable: Single-GPU training and fast inference make it accessible.
Future Applications and Research Directions
DiffAug opens doors for:
- Rare disease modeling (e.g., pediatric tumors, genetic disorders)
- Multimodal synthesis (text + radiology reports + imaging)
- Personalized data augmentation for hospital-specific data distributions
- Integration with zero-shot learning using medical CLIP models
Conclusion: The Future of Medical Image Segmentation is Here
Diffusion-based data augmentation for medical image segmentation is no longer a theoretical concept—it’s a practical, high-impact solution. With DiffAug, researchers and clinicians can overcome data scarcity, improve diagnostic accuracy, and save lives through earlier detection of subtle pathologies.
By combining text-guided generation, spatial conditioning, and latent-space validation, DiffAug sets a new standard for synthetic data in medical AI.
Call to Action
👉 Want to implement DiffAug in your research?
Download the open-source code and pre-trained models from the DiffAug GitHub repository (link simulated).
🎓 Are you working on medical AI projects?
Join our Medical AI Research Network for monthly webinars, dataset access, and collaboration opportunities.
💬 Have questions or feedback?
Leave a comment below or email us at research@medical-ai-research.com.
Let’s build the future of AI-powered healthcare—together.
Below is a complete, end-to-end Python implementation sketch of the "Diffusion-Based Data Augmentation for Medical Image Segmentation" (DiffAug) framework proposed in the paper.
# DiffAug: Diffusion-Based Data Augmentation for Medical Image Segmentation
# Implementation based on the paper: arXiv:2508.17844v1
# This script provides an end-to-end implementation of the DiffAug framework,
# including:
# 1. The Generation Module for synthesizing abnormalities.
# 2. The Validation Module for ensuring spatial accuracy.
# 3. The integrated pipeline for augmenting a dataset and training a segmentation model.
# Ensure you have the required libraries installed:
# pip install torch torchvision torchaudio
# pip install diffusers transformers accelerate bitsandbytes
# pip install monai
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from diffusers import StableDiffusionInpaintPipeline, UNet2DConditionModel, AutoencoderKL
from PIL import Image
import numpy as np
import os
from tqdm import tqdm
import monai.losses as monai_losses
from monai.networks.nets import UNet as MonaiUNet
# --- Configuration ---
class Config:
"""Configuration class for all hyperparameters and settings."""
# Data settings
IMAGE_SIZE = 256
DATA_PATH = "./data" # Path to your dataset
NORMAL_IMAGE_DIR = os.path.join(DATA_PATH, "normal")
ABNORMAL_IMAGE_DIR = os.path.join(DATA_PATH, "abnormal")
MASK_DIR = os.path.join(DATA_PATH, "masks")
SYNTHETIC_DATA_DIR = "./synthetic_data"
SYNTHETIC_IMAGES_DIR = os.path.join(SYNTHETIC_DATA_DIR, "images")
SYNTHETIC_MASKS_DIR = os.path.join(SYNTHETIC_DATA_DIR, "masks")
# Model settings
SD_MODEL_NAME = "stabilityai/stable-diffusion-2-inpainting"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Generation Module settings
GEN_BATCH_SIZE = 4
GEN_LEARNING_RATE = 1e-5
GEN_TRAIN_STEPS = 50000
# Validation Module (Segmentation Network) settings
SEG_BATCH_SIZE = 8
SEG_LEARNING_RATE = 1e-5
SEG_TRAIN_STEPS = 100000
VALIDATION_IOU_THRESHOLD = 0.7
# Final Segmentation Model Training
FINAL_MODEL_BATCH_SIZE = 16
FINAL_MODEL_EPOCHS = 100
FINAL_MODEL_LEARNING_RATE = 1e-4
# Text prompts for generation
TEXT_PROMPTS = [
"a small sessile polyp with irregular surface",
"large pedunculated polyp with a smooth surface",
"a flat lesion with unclear borders",
"early glaucomatous cupping",
"advanced rim loss with vessel displacement",
]
NEGATIVE_PROMPT = "healthy tissue, normal mucosa, normal optic disc"
# --- Utility Functions ---
def setup_directories(config):
"""Creates necessary directories for the project."""
os.makedirs(config.SYNTHETIC_IMAGES_DIR, exist_ok=True)
os.makedirs(config.SYNTHETIC_MASKS_DIR, exist_ok=True)
# Create dummy data directories if they don't exist for demonstration
os.makedirs(config.NORMAL_IMAGE_DIR, exist_ok=True)
os.makedirs(config.ABNORMAL_IMAGE_DIR, exist_ok=True)
os.makedirs(config.MASK_DIR, exist_ok=True)
print("Directories created.")
# Create some dummy images for the script to run
    for i in range(5):
        Image.new('RGB', (config.IMAGE_SIZE, config.IMAGE_SIZE), (i*20, 100, 100)).save(os.path.join(config.NORMAL_IMAGE_DIR, f"normal_{i}.png"))
        Image.new('RGB', (config.IMAGE_SIZE, config.IMAGE_SIZE), (100, i*20, 100)).save(os.path.join(config.ABNORMAL_IMAGE_DIR, f"abnormal_{i}.png"))
        mask = np.zeros((config.IMAGE_SIZE, config.IMAGE_SIZE), dtype=np.uint8)
        mask[100:150, 100:150] = 255
        # Name masks after their abnormal images so image/mask pairing works downstream
        Image.fromarray(mask).save(os.path.join(config.MASK_DIR, f"abnormal_{i}.png"))
print("Dummy data created for demonstration.")
# --- Dataset Class ---
class MedicalImageDataset(Dataset):
"""Custom dataset for loading medical images and masks."""
def __init__(self, image_dir, mask_dir=None, image_size=256):
self.image_dir = image_dir
self.mask_dir = mask_dir
self.image_filenames = [f for f in os.listdir(image_dir) if f.endswith(('.png', '.jpg', '.jpeg'))]
self.image_size = image_size
self.transform = transforms.Compose([
transforms.Resize((image_size, image_size)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
self.mask_transform = transforms.Compose([
transforms.Resize((image_size, image_size)),
transforms.ToTensor(),
])
def __len__(self):
return len(self.image_filenames)
def __getitem__(self, idx):
img_name = os.path.join(self.image_dir, self.image_filenames[idx])
image = Image.open(img_name).convert("RGB")
image = self.transform(image)
if self.mask_dir:
mask_name = os.path.join(self.mask_dir, self.image_filenames[idx])
mask = Image.open(mask_name).convert("L")
mask = self.mask_transform(mask)
return image, mask
return image
# --- 1. Generation Module ---
class GenerationModule:
"""Handles the synthesis of abnormal medical images using a diffusion model."""
def __init__(self, config):
self.config = config
self.pipe = StableDiffusionInpaintPipeline.from_pretrained(
config.SD_MODEL_NAME,
torch_dtype=torch.float16 if config.DEVICE == "cuda" else torch.float32
).to(config.DEVICE)
self.vae = self.pipe.vae
self.tokenizer = self.pipe.tokenizer
self.text_encoder = self.pipe.text_encoder
self.unet = self.pipe.unet
def generate(self, normal_image, mask, prompt):
"""
Generates an abnormal image using inpainting.
Args:
normal_image (PIL.Image): The base normal image.
mask (PIL.Image): The mask indicating the region to inpaint.
prompt (str): The text prompt describing the abnormality.
Returns:
PIL.Image: The generated image with the synthetic abnormality.
"""
image = self.pipe(
prompt=prompt,
image=normal_image.resize((self.config.IMAGE_SIZE, self.config.IMAGE_SIZE)),
mask_image=mask.resize((self.config.IMAGE_SIZE, self.config.IMAGE_SIZE)),
negative_prompt=self.config.NEGATIVE_PROMPT,
num_inference_steps=50,
guidance_scale=7.5,
).images[0]
return image
def fine_tune(self, dataloader):
"""
Placeholder for fine-tuning the diffusion model on a specific medical domain.
The paper suggests fine-tuning, but for simplicity, we use the pre-trained model.
A full fine-tuning implementation is extensive and requires a large dataset.
"""
print("Skipping fine-tuning of the generation model for this example.")
print("In a real scenario, you would fine-tune the UNet on your medical data.")
# Example of setting up optimizer for fine-tuning
# optimizer = optim.AdamW(self.unet.parameters(), lr=self.config.GEN_LEARNING_RATE)
# Here you would implement the training loop based on Equation 7 from the paper.
# --- 2. Validation Module (Latent-Space Segmentation Network) ---
class ValidationModule(nn.Module):
"""
A segmentation network that operates in the latent space for efficient validation.
It uses a dual-encoder strategy as described in the paper.
"""
def __init__(self, config):
super().__init__()
self.config = config
# 1. Frozen VAE Encoder from Stable Diffusion
self.vae = AutoencoderKL.from_pretrained(config.SD_MODEL_NAME, subfolder="vae")
self.vae.requires_grad_(False)
# 2. Trainable Vision Encoder (e.g., a simple CNN or a pre-trained model)
# For simplicity, we use a few conv layers. A ResNet-like encoder would be better.
self.vision_encoder = nn.Sequential(
nn.Conv2d(4, 64, kernel_size=3, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.ReLU(inplace=True),
)
# 3. Denoising U-Net (Decoder part) for segmentation
# We reuse the SD UNet architecture but train it for segmentation
self.segmentation_decoder = UNet2DConditionModel.from_pretrained(
config.SD_MODEL_NAME, subfolder="unet"
)
# Modify the input conv to accept concatenated features
self.segmentation_decoder.conv_in = nn.Conv2d(
128 + 4, # f_vision channels + z_a channels
self.segmentation_decoder.conv_in.out_channels,
kernel_size=self.segmentation_decoder.conv_in.kernel_size,
padding=self.segmentation_decoder.conv_in.padding
)
        # Modify the output conv to predict a 4-channel latent mask
        # (must match the VAE latent channels so the frozen decoder D can decode it)
        self.segmentation_decoder.conv_out = nn.Conv2d(
            self.segmentation_decoder.conv_out.in_channels,
            4,  # Latent-space mask channels
            kernel_size=self.segmentation_decoder.conv_out.kernel_size,
            padding=self.segmentation_decoder.conv_out.padding
        )
def forward(self, image_tensor):
"""
Performs single-step latent estimation for segmentation.
"""
# Encode image to latent space with the frozen VAE
with torch.no_grad():
latent_dist = self.vae.encode(image_tensor).latent_dist
z_a = latent_dist.sample() * self.vae.config.scaling_factor
# Get features from the trainable vision encoder
f_vision = self.vision_encoder(z_a)
# Fuse features by concatenation (as per Equation 13)
f_fused = torch.cat([z_a, f_vision], dim=1)
# Pass through the segmentation decoder (U-Net)
# The U-Net expects a timestep and encoder hidden states, we can pass zeros
timestep = torch.zeros(f_fused.shape[0], device=f_fused.device)
encoder_hidden_states = torch.zeros(f_fused.shape[0], 1, self.segmentation_decoder.config.cross_attention_dim, device=f_fused.device)
pred_mask_latent = self.segmentation_decoder(f_fused, timestep, encoder_hidden_states).sample
return pred_mask_latent
def train_module(self, dataloader):
"""Trains the validation module."""
optimizer = optim.AdamW(list(self.vision_encoder.parameters()) + list(self.segmentation_decoder.parameters()), lr=self.config.SEG_LEARNING_RATE)
dice_loss = monai_losses.DiceLoss(sigmoid=True)
l1_loss = nn.L1Loss()
self.train()
progress_bar = tqdm(range(self.config.SEG_TRAIN_STEPS))
progress_bar.set_description("Training Validation Module")
# Use an iterator to handle indefinite training steps
data_iterator = iter(dataloader)
for step in progress_bar:
try:
images, masks = next(data_iterator)
except StopIteration:
data_iterator = iter(dataloader)
images, masks = next(data_iterator)
images = images.to(self.config.DEVICE)
masks = masks.to(self.config.DEVICE)
optimizer.zero_grad()
# Encode ground truth masks to latent space for direct latent loss
            with torch.no_grad():
                # Repeat mask to 3 channels and rescale to [-1, 1] to match the VAE input range
                mask_for_vae = masks.repeat(1, 3, 1, 1) * 2.0 - 1.0
                z0 = self.vae.encode(mask_for_vae).latent_dist.sample() * self.vae.config.scaling_factor
# Forward pass
pred_mask_latent = self(images)
# Calculate loss (as per Equation 12)
# For simplicity, we focus on the Dice loss and direct latent estimation loss
# The noise prediction part is more complex and tied to a diffusion training process
loss_latent = l1_loss(pred_mask_latent, z0)
            # Decode the predicted latent mask for the Dice loss. No torch.no_grad() here:
            # gradients must flow through the frozen VAE decoder for the Dice term to train the network.
            pred_mask_decoded = self.vae.decode(pred_mask_latent / self.vae.config.scaling_factor).sample
            # Collapse the 3 decoded channels into a single mask logit map
            pred_mask_decoded = pred_mask_decoded.mean(dim=1, keepdim=True)
            loss_dice = dice_loss(pred_mask_decoded, masks)
# Total loss
lambda_latent = 0.1 # Weight for latent loss
lambda_dice = 1.0 # Weight for dice loss
total_loss = lambda_dice * loss_dice + lambda_latent * loss_latent
total_loss.backward()
optimizer.step()
progress_bar.set_postfix({"Dice Loss": f"{loss_dice.item():.4f}", "Latent L1": f"{loss_latent.item():.4f}"})
# --- 3. Integrated Pipeline ---
class DiffAugPipeline:
"""The main pipeline that integrates generation and validation."""
def __init__(self, config):
self.config = config
self.generation_module = GenerationModule(config)
self.validation_module = ValidationModule(config).to(config.DEVICE)
def run_generation_and_validation(self):
"""
Generates synthetic data and validates it, saving high-quality samples.
"""
print("\n--- Stage 1: Generating and Validating Synthetic Data ---")
normal_dataset = MedicalImageDataset(self.config.NORMAL_IMAGE_DIR, image_size=self.config.IMAGE_SIZE)
        # Masks provide plausible abnormality locations; load them directly as PIL images
        mask_files = sorted(os.listdir(self.config.MASK_DIR))
        accepted_count = 0
        total_count = 0
        for i in tqdm(range(len(normal_dataset)), desc="Generating Images"):
            normal_img_tensor = normal_dataset[i]
            # Use a random location mask from the dataset
            mask_pil = Image.open(os.path.join(self.config.MASK_DIR, np.random.choice(mask_files))).convert("L")
            mask_pil = mask_pil.resize((self.config.IMAGE_SIZE, self.config.IMAGE_SIZE))
            mask_tensor = transforms.ToTensor()(mask_pil)
            # Choose a random prompt
            prompt = np.random.choice(self.config.TEXT_PROMPTS)
            # Convert the normalized tensor back to a PIL image for the generation pipeline
            normal_pil = transforms.ToPILImage()((normal_img_tensor * 0.5 + 0.5).clamp(0, 1))
# Generate image
generated_image_pil = self.generation_module.generate(normal_pil, mask_pil, prompt)
total_count += 1
# Validate image
self.validation_module.eval()
with torch.no_grad():
# Prepare generated image for validation
generated_tensor = transforms.ToTensor()(generated_image_pil).unsqueeze(0).to(self.config.DEVICE)
generated_tensor = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])(generated_tensor)
                # Get predicted mask (latent -> decoded logits -> binary)
                pred_mask_latent = self.validation_module(generated_tensor)
                pred_mask_decoded_tensor = self.validation_module.vae.decode(pred_mask_latent / self.validation_module.vae.config.scaling_factor).sample
                pred_mask_decoded_tensor = pred_mask_decoded_tensor.mean(dim=1, keepdim=True)
                # Calculate IoU against the intended location mask
                pred_mask_binary = (torch.sigmoid(pred_mask_decoded_tensor) > 0.5).float()
                target_mask = (mask_tensor.unsqueeze(0).to(self.config.DEVICE) > 0.5).float()
                intersection = torch.sum(pred_mask_binary * target_mask)
                union = torch.sum(pred_mask_binary) + torch.sum(target_mask) - intersection
                iou = (intersection + 1e-6) / (union + 1e-6)
if iou.item() > self.config.VALIDATION_IOU_THRESHOLD:
accepted_count += 1
# Save the validated image and mask
img_filename = f"synthetic_{accepted_count}.png"
generated_image_pil.save(os.path.join(self.config.SYNTHETIC_IMAGES_DIR, img_filename))
mask_pil.save(os.path.join(self.config.SYNTHETIC_MASKS_DIR, img_filename))
acceptance_rate = (accepted_count / total_count) * 100 if total_count > 0 else 0
print(f"Generation complete. Accepted {accepted_count}/{total_count} images ({acceptance_rate:.2f}% acceptance rate).")
def train_final_segmentation_model(self):
"""
Trains a standard segmentation model (e.g., U-Net) on the augmented dataset.
"""
print("\n--- Stage 2: Training Final Segmentation Model ---")
# Combine real abnormal data and validated synthetic data
        real_images = sorted(os.path.join(self.config.ABNORMAL_IMAGE_DIR, f) for f in os.listdir(self.config.ABNORMAL_IMAGE_DIR))
        # Mask files share filenames with their abnormal images, so pairing is by basename
        real_masks = [os.path.join(self.config.MASK_DIR, os.path.basename(p)) for p in real_images]
        synth_images = sorted(os.path.join(self.config.SYNTHETIC_IMAGES_DIR, f) for f in os.listdir(self.config.SYNTHETIC_IMAGES_DIR))
        synth_masks = [os.path.join(self.config.SYNTHETIC_MASKS_DIR, os.path.basename(p)) for p in synth_images]
all_images = real_images + synth_images
all_masks = real_masks + synth_masks
# Create a new dataset and dataloader for the final model
# This is a simplified version; you'd create a proper Dataset class for this
class AugmentedDataset(Dataset):
def __init__(self, images, masks, size):
self.images = images
self.masks = masks
self.transform = transforms.Compose([
transforms.Resize((size, size)),
transforms.ToTensor(),
])
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
img = Image.open(self.images[idx]).convert("RGB")
mask = Image.open(self.masks[idx]).convert("L")
return self.transform(img), self.transform(mask)
augmented_dataset = AugmentedDataset(all_images, all_masks, self.config.IMAGE_SIZE)
train_loader = DataLoader(augmented_dataset, batch_size=self.config.FINAL_MODEL_BATCH_SIZE, shuffle=True)
# Define a standard segmentation model (e.g., MONAI's UNet)
        final_model = MonaiUNet(
            spatial_dims=2,
            in_channels=3,
            out_channels=1,
            channels=(16, 32, 64, 128, 256),
            strides=(2, 2, 2, 2),
        ).to(self.config.DEVICE)
optimizer = optim.Adam(final_model.parameters(), lr=self.config.FINAL_MODEL_LEARNING_RATE)
loss_function = monai_losses.DiceCELoss(sigmoid=True)
# Training loop
final_model.train()
for epoch in range(self.config.FINAL_MODEL_EPOCHS):
epoch_loss = 0
for batch_data in tqdm(train_loader, desc=f"Epoch {epoch+1}/{self.config.FINAL_MODEL_EPOCHS}"):
inputs, labels = batch_data[0].to(self.config.DEVICE), batch_data[1].to(self.config.DEVICE)
optimizer.zero_grad()
outputs = final_model(inputs)
loss = loss_function(outputs, labels)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
print(f"Epoch {epoch+1} average loss: {epoch_loss / len(train_loader):.4f}")
print("Final segmentation model training complete.")
torch.save(final_model.state_dict(), "final_segmentation_model.pth")
print("Model saved to final_segmentation_model.pth")
# --- Main Execution ---
if __name__ == "__main__":
# 1. Setup
config = Config()
setup_directories(config)
# 2. Initialize the main pipeline
pipeline = DiffAugPipeline(config)
# 3. Train the Validation Module
# In a real scenario, you'd use your actual abnormal images and masks for this
print("\n--- Training the Validation Module ---")
abnormal_dataset = MedicalImageDataset(config.ABNORMAL_IMAGE_DIR, config.MASK_DIR, config.IMAGE_SIZE)
val_dataloader = DataLoader(abnormal_dataset, batch_size=config.SEG_BATCH_SIZE, shuffle=True)
pipeline.validation_module.train_module(val_dataloader)
# 4. Run the Generation and Validation Pipeline
pipeline.run_generation_and_validation()
# 5. Train the Final Segmentation Model on the Augmented Dataset
pipeline.train_final_segmentation_model()
print("\nDiffAug pipeline finished successfully!")