DPFR: A Breakthrough in AI-Powered Gland Segmentation for Cancer Diagnosis

Introduction: The Critical Challenge in Digital Pathology

The early detection and accurate grading of cancer remain among modern medicine’s most pressing challenges. For pathologists worldwide, the assessment of gland morphology in histopathological images serves as the gold standard for cancer diagnosis—particularly in colorectal and prostate cancers. However, this critical diagnostic process faces a fundamental bottleneck that has hindered progress for decades.

Traditional histopathological analysis requires pathologists to manually examine tissue sections stained with Hematoxylin and Eosin (H&E), evaluating whole slide images (WSIs) whose ultra-high-resolution scans can span hundreds of thousands of pixels along each dimension. This painstaking process is not only extraordinarily time-consuming and labor-intensive but also highly susceptible to human error and inter-observer variability. The sheer volume of data in a single WSI can overwhelm even the most experienced pathologists, potentially compromising diagnostic accuracy when fatigue sets in.

Enter computer-aided automated gland segmentation—a technological frontier where artificial intelligence promises to transform diagnostic pathology. While deep convolutional neural networks have achieved remarkable success in medical image analysis, they come with a significant caveat: traditional fully supervised methods demand massive quantities of pixel-level annotations. Creating these annotations requires pathologists to meticulously trace gland boundaries, a process that can take hours per image and imposes substantial annotation costs on healthcare systems.

This annotation bottleneck has catalyzed intense research into semi-supervised learning (SSL) approaches that can leverage small amounts of labeled data alongside abundant unlabeled images. Among these, the Mean-Teacher framework with consistency regularization has emerged as the dominant paradigm. Yet existing methods continue to struggle with two persistent challenges that plague gland segmentation:

Gland-background confusion—where AI systems mistakenly classify background tissue as glandular structures due to their visual similarity

Gland adhesion—where adjacent glands merge into single segmented regions, destroying critical morphological information needed for accurate grading

A groundbreaking solution has now emerged from researchers at Hefei University of Technology. Their novel method, DPFR (Density Perturbation and Feature Recalibration), addresses these fundamental limitations through an elegant two-pronged approach that fundamentally reimagines how semi-supervised learning operates in histopathological contexts.


Understanding the DPFR Framework: Architecture and Innovation

The DPFR method represents a significant architectural advancement over existing semi-supervised approaches. Rather than treating feature learning as a black-box optimization problem, DPFR explicitly models the probability density distributions of different tissue classes—glands, contours, and background—to guide more discriminative feature learning.

The Three-Pillar Architecture

At its core, DPFR consists of three interconnected modules working in concert:

1. Feature Density Learning Module

The foundation of DPFR lies in its sophisticated normalizing flow-based density estimator. Unlike heuristic perturbation methods that inject random noise into feature spaces, DPFR learns explicit probability density distributions for each semantic class.

The mathematical formulation employs an invertible mapping ψ_θ that transforms complex feature distributions into a tractable latent space:

\[ p_{F_t}(e) = p_{\omega}\!\left(\psi(e)\right) \cdot \left| \det \left( \frac{\partial \psi(e)}{\partial e} \right) \right| \]

Where:

  • p_ω represents the prior distribution (a Gaussian mixture)
  • ψ(e) is the normalizing flow transformation
  • the Jacobian determinant |det(∂ψ(e)/∂e)| accounts for the change in volume under the transformation (see the toy example below)
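
To make the change-of-variables formula concrete, here is a minimal toy sketch (our own illustration, not the paper's estimator): a single elementwise affine map stands in for ψ_θ and a standard Gaussian for the prior, so the log-density is log p_ω(ψ(e)) plus the log-Jacobian term.

import torch

# Toy invertible map psi(e) = a * e + b (elementwise), standing in for the flow.
# Its Jacobian is diagonal, so log|det(d psi / d e)| = sum(log|a|).
a = torch.tensor([1.5, 0.5, 2.0])
b = torch.tensor([0.1, -0.3, 0.0])

def log_density(e: torch.Tensor) -> torch.Tensor:
    z = a * e + b                                              # psi(e): latent code
    log_2pi = torch.log(torch.tensor(2.0 * torch.pi))
    log_prior = -0.5 * (z ** 2 + log_2pi).sum()                # standard normal prior p_w
    log_det = torch.log(a.abs()).sum()                         # change-of-volume correction
    return log_prior + log_det                                 # log p(e) = log p_w(psi(e)) + log|det J|

print(log_density(torch.tensor([0.2, -1.0, 0.7])))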

For labeled features F_t^l, the conditional likelihood for class c is:

\[ p_{F_t}\!\left(F_t^{l} \mid Y^{l} = c ; \theta \right) = \mathcal{N}\!\left( \psi_{\theta}\!\left(F_t^{l}\right) \,\middle|\, \mu_c, \Sigma_c \right) \cdot \left| \det \frac{\partial \psi_{\theta}\!\left(F_t^{l}\right)} {\partial F_t^{l}} \right|. \]

For unlabeled features, the density is modeled as a Gaussian mixture:

\[ p_{F_t}\!\left(F_t^{u}; \theta \right) = \sum_{c=1}^{C} \pi_c \, \mathcal{N} \!\left( \psi_{\theta}(F_t^{u}) \,\middle|\, \mu_c, \Sigma_c \right) \cdot \left| \det \left( \frac{\partial \psi_{\theta}(F_t^{u})} {\partial F_t^{u}} \right) \right|. \]

The flow loss L_f combines the labeled (conditional) and unlabeled (marginal) log-likelihoods:

\[ L_{f} = - C_{1} \left( \sum_{c=1}^{C} \log p_{F_t}\!\left(F_{t}^{l} \mid Y^{l} = c ; \theta \right) + \log p_{F_t}\!\left(F_{t}^{u} ; \theta \right) \right) \]

Key Advantage: Unlike kernel density estimation or diffusion-based methods, normalizing flows provide exact likelihood computation through invertible transformations, enabling more principled density-guided perturbations.

2. Perturbation Generation and Injection Module

Once the density estimator is trained, DPFR leverages gradient information to generate optimal perturbations. The goal is to push features toward low-density regions of the feature space—precisely where decision boundaries should reside according to semi-supervised learning theory.

The optimal perturbation ϱ* is obtained by maximizing the negative log-likelihood of the perturbed features within an L2 ball of radius ξ:

\[ \varrho^{*} = \arg\max_{\|\varrho\|_{2} \le \xi} \left( - \log p_{F_s}\!\left(F_s + \varrho\right) \right) \]

Using first-order Taylor expansion and the Cauchy-Schwarz inequality, this simplifies to:

\[ \varrho^{*} = \xi \cdot \frac{ \nabla_{F_s} \!\left( -\log p_{F_s}(F_s) \right) }{ \left\lVert \nabla_{F_s} \!\left( -\log p_{F_s}(F_s) \right) \right\rVert_{2} } \]

The perturbed features are then:

\[ \Gamma(F_s) = F_s + \varrho^{*} \]

Critical Insight: This density-descending perturbation strategy ensures that features are pushed in directions that maximally increase uncertainty, forcing the model to learn more robust decision boundaries between semantically similar classes.
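
In code, the perturbation is simply the normalized gradient of the negative log-likelihood, scaled by ξ. The sketch below is a simplified illustration assuming an arbitrary differentiable log_density function (a stand-in for the trained flow), not the exact training code:

import torch

def density_descending_perturbation(features, log_density, xi=3.0):
    """Return rho* = xi * g / ||g||, where g is the gradient of the NLL w.r.t. the features."""
    features = features.detach().requires_grad_(True)
    nll = -log_density(features).sum()                         # negative log-likelihood
    grad = torch.autograd.grad(nll, features)[0]               # direction of steepest density descent
    grad = grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-8)
    return (xi * grad).detach()

# Toy usage with a standard-normal log density over 16-dim features:
feats = torch.randn(4, 16)
toy_log_density = lambda x: -0.5 * (x ** 2).sum(dim=-1)
perturbed = feats + density_descending_perturbation(feats, toy_log_density)

In the full model the gradient is taken per spatial location of the student feature map, as in the implementation at the end of this post.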

3. Feature Recalibration Module

While density perturbations improve separability, they can introduce class confusion in low-density regions. DPFR addresses this through a contrastive learning-based recalibration mechanism that explicitly enforces inter-class separability.

The module employs confidence-filtered pseudo-labeling: pseudo-labels are kept only at pixels whose prediction entropy falls below a threshold, with the entropy defined as:

\[ H(M_{u_j}) = – \sum_{c \in \mathcal{C}} M_{c u_j} \log M_{c u_j} \]

For each class c, the positive sample ϑ_c^+ is computed as the class mean, hard anchors q_c ∈ R_c are selected as the T features of that class farthest from the positive centroid, and negatives ϑ_c^- ∈ Q_c are drawn from the other classes. The contrastive loss then pulls anchors toward the positive while pushing them away from the negatives:

\[ L_{\mathrm{cl}} = – \sum_{q_c \in R_c} \log \left( \frac{ \exp\!\left( \dfrac{q_c \cdot \vartheta_c^{+}}{\tau} \right) }{ \sum_{\vartheta_c^{-} \in Q_c} \exp\!\left( \dfrac{q_c \cdot \vartheta_c^{-}}{\tau} \right) } \right) \]

Where τ=0.5 is the temperature parameter controlling the sharpness of the contrastive objective.
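
The loss above can be written compactly once anchors, the positive prototype, and negatives have been selected. This is a minimal sketch of the objective only (toy tensors of our own; the full selection logic with entropy filtering and hard-anchor mining appears in the implementation at the end of the post):

import torch

def recalibration_loss(anchors, positive, negatives, tau=0.5):
    """InfoNCE-style loss: anchors [T, D], positive prototype [D], negatives [M, D]."""
    pos_sim = anchors @ positive / tau                 # similarity to the class prototype
    neg_sim = anchors @ negatives.t() / tau            # similarities to other-class samples
    denom = torch.exp(neg_sim).sum(dim=1)              # negatives-only denominator, as in L_cl
    return -(pos_sim - torch.log(denom + 1e-8)).mean()

# Toy usage: 8 gland anchors, their mean as the positive, 16 non-gland negatives
anchors = torch.randn(8, 128)
loss = recalibration_loss(anchors, anchors.mean(dim=0), torch.randn(16, 128))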


Experimental Validation: State-of-the-Art Performance Across Three Benchmark Datasets

The DPFR framework was rigorously evaluated on three publicly available gland segmentation datasets representing diverse clinical scenarios:

| Dataset | Type | Images | Resolution | Task Level |
|---|---|---|---|---|
| GlaS | Colorectal Cancer | 165 | 430×567 to 522×755 | Instance |
| CRAG | Colorectal Cancer | 213 | 1512×1512 | Instance |
| PGland | Prostate Cancer | 1500 | 1500×1500 | Semantic |

Quantitative Results: Dominating Semi-Supervised Benchmarks

Table 1: Performance on GlaS and CRAG Datasets (1/8 Labeled Data)

| Method | GlaS O.F1↑ | GlaS O.Dice↑ | GlaS O.HD↓ | CRAG O.F1↑ | CRAG O.Dice↑ | CRAG O.HD↓ |
|---|---|---|---|---|---|---|
| RFS (Baseline) | 78.11 | 80.13 | 119.53 | 63.01 | 64.13 | 388.52 |
| URPC | 82.77 | 82.45 | 93.84 | 69.06 | 71.99 | 289.64 |
| BCP | 83.02 | 83.28 | 87.71 | 71.97 | 73.68 | 272.37 |
| CAT | 83.41 | 83.85 | 85.93 | 73.55 | 74.58 | 253.03 |
| DCCL-Seg | 85.53 | 85.90 | 78.36 | 75.47 | 79.41 | 212.35 |
| DPFR (Ours) | 86.74 | 87.07 | 73.21 | 80.43 | 83.19 | 197.89 |

Key Takeaways:

  • DPFR achieves 86.74% O.F1 on GlaS with only 12.5% labeled data—within 3% of fully supervised performance
  • A 5.15-point reduction in Hausdorff Distance compared to the second-best method, indicating substantially better boundary accuracy
  • Consistent superiority across all metrics and labeling ratios (1/8, 1/4, 1/2)

Table 2: Performance on PGland Dataset (Semantic Segmentation)

| Method | F1↑ | Dice↑ | mIoU↑ |
|---|---|---|---|
| RFS | 73.41 | 71.13 | 70.48 |
| CDMA+ | 78.51 | 76.37 | 75.88 |
| ILECGJL | 77.50 | 78.61 | 76.91 |
| DCCL-Seg | 78.92 | 79.59 | 77.99 |
| DPFR (Ours) | 80.36 | 81.50 | 79.62 |

Qualitative Analysis: Visualizing the Improvement

The visual comparison reveals DPFR’s critical advantages in challenging scenarios:

Figure: Qualitative comparison on the GlaS dataset. Columns show the original H&E-stained image, the ground-truth instance masks, and predictions from BCP, ILECGJL, CDMA+, DCCL-Seg, and DPFR. Yellow dashed boxes mark regions where competing methods merge adjacent glands (gland adhesion) or under-segment, while DPFR preserves clear boundaries and accurate instance separation.

Figure: t-SNE visualization of feature distributions on the GlaS dataset. In the original feature space, gland, background, and contour features are heavily intermingled; with DPFR, the three classes form compact, well-separated clusters, indicating more discriminative feature representations.

Ablation Studies: Validating Each Component’s Contribution

Systematic ablation studies on the GlaS dataset with 1/8 labeled data confirm the complementary nature of DPFR’s components:

Table 3: Component Ablation Study

| Configuration | O.F1↑ | O.Dice↑ | O.HD↓ |
|---|---|---|---|
| Baseline (Mean-Teacher + CE-Net) | 81.88 | 82.35 | 94.37 |
| Baseline + FDP only | 85.57 | 85.36 | 78.76 |
| Baseline + FR only | 84.23 | 86.09 | 80.95 |
| Baseline + FDP + FR (Full DPFR) | 86.74 | 87.07 | 73.21 |

Critical Findings:

  • FDP alone improves O.F1 by 3.69 points through density-guided perturbations
  • FR alone improves O.Dice by 3.74 points via contrastive recalibration
  • Combined, the modules achieve synergistic gains—the whole exceeds the sum of the parts
  • A Hausdorff Distance reduction of 21.16 points over the baseline demonstrates substantially improved boundary precision

Hyperparameter Optimization

The research team conducted extensive sensitivity analyses (the resulting settings are collected into a configuration sketch after this list):

  • Perturbation magnitude ξ = 3 provides the optimal balance—smaller values insufficiently separate features, larger values cause excessive migration and confusion
  • Anchor number T = 256 in the FR module maximizes contrastive learning effectiveness
  • Gaussian components N = 3 for GlaS/CRAG and N = 2 for PGland balance model capacity, avoiding both underfitting and overfitting
  • Loss weights α₁ = 0.5 (semi-supervised) and α₂ = 0.3 (contrastive) achieve the best balance with the supervised loss
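
For convenience, these settings can be grouped into a single configuration object. The sketch below is a hypothetical wrapper of our own (the field names are not from the paper); the values mirror those reported above and those used in the reference implementation at the end of this post.

from dataclasses import dataclass

@dataclass
class DPFRConfig:
    perturbation_magnitude: float = 3.0   # xi, density-descending step size
    num_anchors: int = 256                # T, hard anchors per class in the FR module
    num_gaussians: int = 3                # N, mixture components (2 for PGland)
    alpha1: float = 0.5                   # weight of the semi-supervised (consistency) loss
    alpha2: float = 0.3                   # weight of the contrastive recalibration loss
    temperature: float = 0.5              # tau in the contrastive objective
    ema_decay: float = 0.99               # Mean-Teacher EMA decay used in the implementation

config = DPFRConfig()  # override fields per dataset as needed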

Comparison with Foundation Models: DPFR vs. SAM and MedSAM

The emergence of foundation models like SAM (Segment Anything Model) and its medical variant MedSAM has transformed computer vision. However, DPFR demonstrates significant advantages for specialized medical segmentation tasks:

Table 4: Foundation Model Comparison (1/2 Labeled Data on GlaS)

| Method | O.F1↑ | O.Dice↑ | O.HD↓ |
|---|---|---|---|
| SAM | 79.70 | 78.88 | 102.65 |
| MedSAM | 86.74 | 87.95 | 70.87 |
| DPFR | 89.36 | 89.84 | 59.98 |

Why DPFR Outperforms Foundation Models:

  1. Prompt Dependency: SAM requires bounding box or point prompts, but densely packed glands cause single prompts to cover multiple instances, leading to adhesion artifacts
  2. Unlabeled Data Utilization: Foundation models cannot leverage abundant unlabeled histopathological images, creating performance bottlenecks
  3. Domain Specificity: DPFR’s architecture is explicitly designed for gland morphology, while foundation models are general-purpose

Figure: Comparison with foundation models on GlaS, CRAG, and PGland. Each row shows the input H&E image, the ground truth, and predictions from SAM, MedSAM, and DPFR. Yellow dashed boxes highlight SAM and MedSAM failure modes—internal holes within gland lumens and merging of adjacent gland instances—where DPFR maintains accurate boundary delineation and complete instance separation.

Clinical Impact and Future Directions

Reducing Pathologist Burden

The implications of DPFR extend far beyond benchmark metrics. By achieving near fully-supervised performance with only 50% of labeled data, DPFR could:

  • Reduce annotation time by 50-75% for large-scale histopathological studies
  • Enable rapid deployment of AI diagnostic tools in resource-limited settings
  • Improve inter-observer consistency by providing standardized segmentation baselines
  • Accelerate digital pathology adoption by lowering the cost barrier

Limitations and Ongoing Challenges

Despite its advances, DPFR faces important limitations that point to future research directions:

  • Severely deformed glands: When glandular structures exhibit low differentiation or extreme morphological deformation, performance degrades significantly
  • Limited cancer type validation: Current evaluation focuses on colorectal and prostate cancers; broader validation across cancer types is needed
  • Extremely low-label regimes: At 1% labeled data, performance gaps remain substantial, though DPFR still outperforms alternatives

The Path Forward: Integrating with Foundation Models

The most promising future direction involves hybrid approaches that combine DPFR’s semi-supervised learning capabilities with foundation model pre-training. By initializing with SAM’s general visual knowledge and fine-tuning with DPFR’s density-aware semi-supervised framework, researchers may achieve:

  • Zero-shot unsupervised segmentation capabilities
  • Cross-cancer generalization without domain-specific retraining
  • Extreme low-label performance suitable for rare cancer types

Conclusion: A New Paradigm for Medical Image Analysis

DPFR represents a fundamental advancement in semi-supervised medical image segmentation, specifically addressing the unique challenges of glandular tissue analysis in cancer pathology. By explicitly modeling feature density distributions and employing contrastive recalibration, the method achieves:

  • State-of-the-art performance across three major benchmark datasets
  • Significant reduction in annotation requirements (50–87.5% fewer labels)
  • Superior boundary accuracy, with a 21-point reduction in Hausdorff Distance over the baseline
  • Better instance separation, substantially reducing gland adhesion artifacts
  • Computational efficiency, with no inference-time overhead

The integration of normalizing flows for density estimation, gradient-based perturbation generation, and contrastive feature recalibration establishes a new architectural template for semi-supervised learning in medical imaging. As digital pathology continues its rapid expansion, methods like DPFR will prove essential for scaling AI-assisted diagnosis while maintaining the accuracy standards demanded by clinical practice.


Call to Action: Join the Digital Pathology Revolution

Are you a pathologist, researcher, or AI practitioner working to transform cancer diagnosis? The DPFR framework is open-source and available for implementation. We encourage you to:

  • Experiment with DPFR on your own histopathological datasets
  • Contribute to the growing ecosystem of semi-supervised medical AI tools
  • Share your findings on how density-aware learning impacts your specific diagnostic challenges
  • Explore integration opportunities with existing digital pathology platforms

What challenges do you face in annotating medical images for AI training? Share your experiences in the comments below—your insights could shape the next generation of semi-supervised learning methods.

For researchers interested in the technical implementation, the complete code repository and pre-trained models are available at the project page. Whether you’re developing clinical decision support systems or advancing the theoretical foundations of medical AI, DPFR offers a robust foundation for building the future of computational pathology.

Below is a reference PyTorch implementation of the DPFR (Density Perturbation and Feature Recalibration) model. It is a semi-supervised segmentation model with three main components: feature density learning, perturbation generation and injection, and feature recalibration.

"""
DPFR: Semi-supervised Gland Segmentation via Density Perturbation and Feature Recalibration
Complete PyTorch Implementation
Based on: Yu & Liu, Medical Image Analysis 2026
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Tuple, Optional, List
import math


# ============================================================================
# 1. BACKBONE: CE-Net with ResNet101 Encoder (Simplified for clarity)
# ============================================================================

class BasicBlock(nn.Module):
    """ResNet Basic Block"""
    expansion = 1
    
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample
        
    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out


class Bottleneck(nn.Module):
    """ResNet Bottleneck Block"""
    expansion = 4
    
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        
    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out


class ResNetEncoder(nn.Module):
    """ResNet101 Encoder"""
    def __init__(self, block=Bottleneck, layers=[3, 4, 23, 3], num_classes=1000):
        super(ResNetEncoder, self).__init__()
        self.in_channels = 64
        
        # Initial conv
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        # ResNet layers
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        # Feature dimensions
        self.feature_dims = [256, 512, 1024, 2048]  # After each layer
        
    def _make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion),
            )
        
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        
        x1 = self.layer1(x)   # 1/4 resolution
        x2 = self.layer2(x1)  # 1/8 resolution
        x3 = self.layer3(x2)  # 1/16 resolution
        x4 = self.layer4(x3)  # 1/32 resolution
        
        return x4, [x1, x2, x3, x4]


class DACBlock(nn.Module):
    """Dense Atrous Convolution Block"""
    def __init__(self, in_channels):
        super(DACBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, in_channels//4, 1, dilation=1, padding=0)
        self.conv2 = nn.Conv2d(in_channels, in_channels//4, 3, dilation=3, padding=3)
        self.conv3 = nn.Conv2d(in_channels, in_channels//4, 3, dilation=5, padding=5)
        self.conv4 = nn.Conv2d(in_channels, in_channels//4, 3, dilation=7, padding=7)
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        
    def forward(self, x):
        x1 = self.conv1(x)
        x2 = self.conv2(x)
        x3 = self.conv3(x)
        x4 = self.conv4(x)
        out = torch.cat([x1, x2, x3, x4], dim=1)
        out = self.bn(out)
        out = self.relu(out)
        return out


class RMPBlock(nn.Module):
    """Residual Multi-kernel Pooling"""
    def __init__(self, in_channels):
        super(RMPBlock, self).__init__()
        self.pool1 = nn.MaxPool2d(2, stride=2)
        self.pool2 = nn.MaxPool2d(3, stride=3)
        self.pool3 = nn.MaxPool2d(5, stride=5)
        self.pool4 = nn.MaxPool2d(6, stride=6)
        
        # 4 pooled branches plus the input itself are concatenated below (5x channels)
        self.conv = nn.Conv2d(in_channels * 5, in_channels, 1)
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        
    def forward(self, x):
        h, w = x.size()[2:]
        x1 = F.interpolate(self.pool1(x), size=(h, w), mode='bilinear', align_corners=True)
        x2 = F.interpolate(self.pool2(x), size=(h, w), mode='bilinear', align_corners=True)
        x3 = F.interpolate(self.pool3(x), size=(h, w), mode='bilinear', align_corners=True)
        x4 = F.interpolate(self.pool4(x), size=(h, w), mode='bilinear', align_corners=True)
        
        out = torch.cat([x1, x2, x3, x4, x], dim=1)
        out = self.conv(out)
        out = self.bn(out)
        out = self.relu(out)
        return out


class CENetDecoder(nn.Module):
    """CE-Net Decoder"""
    def __init__(self, num_classes=3, feature_channels=2048):
        super(CENetDecoder, self).__init__()
        self.dac = DACBlock(feature_channels)
        self.rmp = RMPBlock(feature_channels)
        
        # Decoder
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(feature_channels, 512, 4, stride=2, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True)
        )
        self.up2 = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True)
        )
        self.up3 = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True)
        )
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True)
        )
        
        self.final = nn.Conv2d(64, num_classes, 1)
        
    def forward(self, x, skip_connections=None):
        x = self.dac(x)
        x = self.rmp(x)
        
        x = self.up1(x)
        x = self.up2(x)
        x = self.up3(x)
        x = self.up4(x)
        
        out = self.final(x)
        # The encoder downsamples by 32x while the four up-blocks recover only 16x,
        # so upsample once more to return predictions at the input resolution.
        out = F.interpolate(out, scale_factor=2, mode='bilinear', align_corners=True)
        return out


class CENet(nn.Module):
    """Complete CE-Net Architecture"""
    def __init__(self, num_classes=3, pretrained=True):
        super(CENet, self).__init__()
        self.encoder = ResNetEncoder()
        self.decoder = CENetDecoder(num_classes=num_classes)
        
    def forward(self, x):
        features, skip = self.encoder(x)
        out = self.decoder(features, skip)
        return out, features


# ============================================================================
# 2. NORMALIZING FLOW DENSITY ESTIMATOR
# ============================================================================

class ActNorm(nn.Module):
    """Activation Normalization"""
    def __init__(self, num_features):
        super(ActNorm, self).__init__()
        self.num_features = num_features
        self.log_scale = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        self.initialized = False
        
    def forward(self, x, reverse=False):
        if not self.initialized and self.training:
            self._initialize(x)
            
        if reverse:
            return (x - self.bias) * torch.exp(-self.log_scale)
        else:
            return x * torch.exp(self.log_scale) + self.bias
            
    def _initialize(self, x):
        with torch.no_grad():
            mean = x.mean(dim=[0, 2, 3], keepdim=True)
            std = x.std(dim=[0, 2, 3], keepdim=True)
            self.bias.data = -mean
            self.log_scale.data = -torch.log(std + 1e-6)
            self.initialized = True


class InvertibleConv1x1(nn.Module):
    """Invertible 1x1 Convolution"""
    def __init__(self, num_features):
        super(InvertibleConv1x1, self).__init__()
        self.num_features = num_features
        
        # Initialize with a random orthogonal matrix
        # (torch.linalg.qr replaces the deprecated torch.qr)
        w_init = torch.linalg.qr(torch.randn(num_features, num_features))[0]
        self.weight = nn.Parameter(w_init)
        
    def forward(self, x, reverse=False):
        batch_size, channels, height, width = x.size()
        
        if reverse:
            # Inverse operation
            weight_inv = torch.inverse(self.weight)
            weight_inv = weight_inv.view(channels, channels, 1, 1)
            out = F.conv2d(x, weight_inv)
        else:
            weight = self.weight.view(channels, channels, 1, 1)
            out = F.conv2d(x, weight)
            
        return out


class AffineCoupling(nn.Module):
    """Affine Coupling Layer"""
    def __init__(self, in_channels, hidden_channels=512):
        super(AffineCoupling, self).__init__()
        self.in_channels = in_channels
        self.split_channels = in_channels // 2
        
        self.net = nn.Sequential(
            nn.Conv2d(self.split_channels, hidden_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, (in_channels - self.split_channels) * 2, 3, padding=1)
        )
        
    def forward(self, x, reverse=False):
        x1, x2 = torch.split(x, [self.split_channels, self.in_channels - self.split_channels], dim=1)
        
        if reverse:
            # Reverse: use x1 to compute shift and scale, apply to x2
            h = self.net(x1)
            shift, log_scale = torch.chunk(h, 2, dim=1)
            log_scale = torch.tanh(log_scale)  # Stabilize
            x2 = (x2 - shift) * torch.exp(-log_scale)
            return torch.cat([x1, x2], dim=1)
        else:
            # Forward: use x1 to compute shift and scale, apply to x2
            h = self.net(x1)
            shift, log_scale = torch.chunk(h, 2, dim=1)
            log_scale = torch.tanh(log_scale)  # Stabilize
            x2 = x2 * torch.exp(log_scale) + shift
            return torch.cat([x1, x2], dim=1), log_scale.sum(dim=[1, 2, 3])


class FlowStep(nn.Module):
    """Single Flow Step: ActNorm -> InvertibleConv -> AffineCoupling"""
    def __init__(self, in_channels, hidden_channels=512):
        super(FlowStep, self).__init__()
        self.actnorm = ActNorm(in_channels)
        self.invconv = InvertibleConv1x1(in_channels)
        self.coupling = AffineCoupling(in_channels, hidden_channels)
        
    def forward(self, x, reverse=False):
        if reverse:
            x = self.coupling(x, reverse=True)
            x = self.invconv(x, reverse=True)
            x = self.actnorm(x, reverse=True)
            return x
        else:
            x = self.actnorm(x)
            x = self.invconv(x)
            x, logdet = self.coupling(x)
            return x, logdet


class NormalizingFlow(nn.Module):
    """Normalizing Flow Density Estimator"""
    def __init__(self, in_channels=2048, num_steps=8, hidden_channels=512):
        super(NormalizingFlow, self).__init__()
        self.in_channels = in_channels
        self.num_steps = num_steps
        
        self.flows = nn.ModuleList([
            FlowStep(in_channels, hidden_channels) for _ in range(num_steps)
        ])
        
        # Gaussian mixture prior parameters (learnable)
        self.num_classes = 3  # Background, Gland, Contour
        self.prior_means = nn.Parameter(torch.randn(self.num_classes, in_channels))
        self.prior_logvars = nn.Parameter(torch.zeros(self.num_classes, in_channels))
        self.mixture_weights = nn.Parameter(torch.ones(self.num_classes) / self.num_classes)
        
    def forward(self, x, reverse=False):
        """
        Args:
            x: Input features [B, C, H, W]
        Returns:
            z: Latent representation
            logdet: Log determinant of Jacobian
        """
        if reverse:
            # Generate from prior (not used in DPFR)
            for flow in reversed(self.flows):
                x = flow(x, reverse=True)
            return x
        else:
            logdet_total = 0
            for flow in self.flows:
                x, logdet = flow(x)
                logdet_total = logdet_total + logdet
            return x, logdet_total
    
    def log_prob(self, z, class_idx=None):
        """
        Compute log probability under Gaussian mixture prior
        Args:
            z: Latent features [B, C, H, W]
            class_idx: Optional class labels [B, H, W]
        """
        B, C, H, W = z.size()
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, C)  # [B*H*W, C]
        
        # Compute log probability for each component
        log_probs = []
        for c in range(self.num_classes):
            mean = self.prior_means[c:c+1]  # [1, C]
            logvar = self.prior_logvars[c:c+1]  # [1, C]
            
            # Log probability of Gaussian
            log_prob = -0.5 * (
                logvar.sum(dim=1, keepdim=True) +  # log(sigma^2)
                ((z_flat - mean) ** 2 / torch.exp(logvar)).sum(dim=1, keepdim=True) +  # (x-mu)^2/sigma^2
                C * math.log(2 * math.pi)  # constant
            )
            log_probs.append(log_prob)
        
        log_probs = torch.cat(log_probs, dim=1)  # [B*H*W, num_classes]
        log_weights = F.log_softmax(self.mixture_weights, dim=0)  # [num_classes]
        
        if class_idx is not None:
            # Conditional log prob
            class_idx_flat = class_idx.reshape(-1).long()
            log_prob = log_probs.gather(1, class_idx_flat.unsqueeze(1)).squeeze(1)
            log_prob = log_prob + log_weights[class_idx_flat]
        else:
            # Marginal log prob (mixture)
            log_prob = torch.logsumexp(log_probs + log_weights.unsqueeze(0), dim=1)
        
        log_prob = log_prob.view(B, H, W)
        return log_prob
    
    def compute_loss(self, features, labels=None, unlabeled=False):
        """
        Compute flow loss (negative log-likelihood)
        
        Args:
            features: Input features [B, C, H, W]
            labels: Class labels [B, H, W] (0=background, 1=gland, 2=contour)
            unlabeled: Whether features are unlabeled
        """
        z, logdet = self.forward(features)
        
        if not unlabeled and labels is not None:
            # Labels arrive at image resolution; match them to the feature map
            # before computing the conditional log-likelihood
            labels = F.interpolate(labels.unsqueeze(1).float(),
                                   size=z.shape[2:], mode='nearest').squeeze(1).long()
            # Supervised: conditional log-likelihood
            log_prob = self.log_prob(z, labels)
        else:
            # Unsupervised: marginal log-likelihood (mixture)
            log_prob = self.log_prob(z, None)
        
        # Negative log-likelihood with Jacobian correction
        # (log_prob is [B, H, W]; logdet is per-sample [B], so broadcast it spatially)
        nll = -(log_prob + logdet.view(-1, 1, 1))
        return nll.mean()


# ============================================================================
# 3. FEATURE RECALIBRATION MODULE
# ============================================================================

class FeatureRecalibration(nn.Module):
    """Contrastive Learning-based Feature Recalibration"""
    def __init__(self, feature_dim=2048, temperature=0.5, num_anchors=256):
        super(FeatureRecalibration, self).__init__()
        self.temperature = temperature
        self.num_anchors = num_anchors
        self.feature_dim = feature_dim
        
        # Projection head for contrastive learning
        self.projector = nn.Sequential(
            nn.Conv2d(feature_dim, 512, 1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 128, 1)
        )
        
    def forward(self, features, pseudo_labels, confidence_mask):
        """
        Args:
            features: Perturbed features [B, C, H, W]
            pseudo_labels: Pseudo labels [B, H', W'] with values {0, 1, 2, 255(ignore)}
            confidence_mask: High confidence mask [B, H', W']
        """
        B, C, H, W = features.size()
        
        # Project features
        proj_features = self.projector(features)  # [B, 128, H, W]
        
        # Pseudo labels and the confidence mask are at image resolution;
        # bring them down to the feature resolution (nearest keeps class ids intact)
        pseudo_labels = F.interpolate(pseudo_labels.unsqueeze(1).float(),
                                      size=(H, W), mode='nearest').squeeze(1).long()
        confidence_mask = F.interpolate(confidence_mask.unsqueeze(1).float(),
                                        size=(H, W), mode='nearest').squeeze(1).bool()
        
        # Flatten spatially so pixels can be selected with boolean masks
        proj_flat = proj_features.permute(0, 2, 3, 1).reshape(-1, 128)  # [B*H*W, 128]
        labels_flat = pseudo_labels.reshape(-1)                         # [B*H*W]
        valid_flat = confidence_mask.reshape(-1) & (labels_flat != 255)
        
        # Compute contrastive loss
        contrastive_loss = features.new_zeros(())
        num_valid_classes = 0
        
        for c in range(3):  # Background, Gland, Contour
            class_sel = (labels_flat == c) & valid_flat
            if class_sel.sum() < 10:  # Skip if too few samples
                continue
            
            class_features = proj_flat[class_sel]                       # [N, 128]
            
            # Compute positive prototype (class mean)
            positive_proto = class_features.mean(dim=0, keepdim=True)   # [1, 128]
            
            # Select hard anchors: the T features farthest from the prototype
            distances = torch.norm(class_features - positive_proto, dim=1)  # [N]
            if distances.size(0) > self.num_anchors:
                _, anchor_indices = torch.topk(distances, self.num_anchors, largest=True)
            else:
                anchor_indices = torch.arange(distances.size(0), device=features.device)
            anchors = class_features[anchor_indices]                    # [T, 128]
            
            # Negative samples: confident pixels from the other classes
            neg_sel = (labels_flat != c) & valid_flat
            if neg_sel.sum() == 0:
                continue
            neg_features = proj_flat[neg_sel]                           # [M, 128]
            
            # For each anchor, sample roughly 2 negatives
            num_negatives = min(2 * anchors.size(0), neg_features.size(0))
            neg_indices = torch.randperm(neg_features.size(0), device=features.device)[:num_negatives]
            negatives = neg_features[neg_indices]                       # [<=2T, 128]
            
            # InfoNCE loss: pull anchors toward the prototype, push from negatives
            pos_sim = (anchors * positive_proto).sum(dim=1) / self.temperature      # [T]
            neg_sim = torch.matmul(anchors, negatives.t()) / self.temperature       # [T, <=2T]
            
            numerator = torch.exp(pos_sim)
            denominator = numerator + torch.exp(neg_sim).sum(dim=1)
            
            loss = -torch.log(numerator / (denominator + 1e-8))
            contrastive_loss = contrastive_loss + loss.mean()
            num_valid_classes += 1
        
        if num_valid_classes > 0:
            contrastive_loss = contrastive_loss / num_valid_classes
        
        return contrastive_loss


# ============================================================================
# 4. COMPLETE DPFR MODEL
# ============================================================================

class DPFR(nn.Module):
    """
    DPFR: Density Perturbation and Feature Recalibration
    Complete semi-supervised segmentation model
    """
    def __init__(
        self,
        num_classes=3,
        ema_decay=0.99,
        flow_steps=8,
        perturbation_magnitude=3.0,
        temperature=0.5,
        num_anchors=256,
        confidence_threshold=0.5
    ):
        super(DPFR, self).__init__()
        
        self.num_classes = num_classes
        self.ema_decay = ema_decay
        self.xi = perturbation_magnitude
        self.confidence_threshold = confidence_threshold
        
        # Student network
        self.student_encoder = ResNetEncoder()
        self.student_decoder = CENetDecoder(num_classes=num_classes)
        
        # Teacher network (EMA of student)
        self.teacher_encoder = ResNetEncoder()
        self.teacher_decoder = CENetDecoder(num_classes=num_classes)
        
        # Initialize teacher with student weights
        self._initialize_teacher()
        
        # Freeze teacher
        for param in self.teacher_encoder.parameters():
            param.requires_grad = False
        for param in self.teacher_decoder.parameters():
            param.requires_grad = False
        
        # Normalizing Flow Density Estimator
        self.density_estimator = NormalizingFlow(
            in_channels=2048,
            num_steps=flow_steps
        )
        
        # Feature Recalibration Module
        self.feature_recalibration = FeatureRecalibration(
            feature_dim=2048,
            temperature=temperature,
            num_anchors=num_anchors
        )
        
    def _initialize_teacher(self):
        """Copy student weights to teacher"""
        for t_param, s_param in zip(self.teacher_encoder.parameters(), 
                                     self.student_encoder.parameters()):
            t_param.data.copy_(s_param.data)
        for t_param, s_param in zip(self.teacher_decoder.parameters(),
                                     self.student_decoder.parameters()):
            t_param.data.copy_(s_param.data)
    
    @torch.no_grad()
    def update_teacher(self):
        """EMA update of teacher network"""
        for t_param, s_param in zip(self.teacher_encoder.parameters(),
                                     self.student_encoder.parameters()):
            t_param.data = self.ema_decay * t_param.data + (1 - self.ema_decay) * s_param.data
        for t_param, s_param in zip(self.teacher_decoder.parameters(),
                                     self.student_decoder.parameters()):
            t_param.data = self.ema_decay * t_param.data + (1 - self.ema_decay) * s_param.data
    
    def generate_density_perturbation(self, features):
        """
        Generate perturbation along density descent direction
        
        Args:
            features: Student features [B, C, H, W]
        Returns:
            perturbation: Density descent perturbation [B, C, H, W]
        """
        features = features.detach().requires_grad_(True)
        
        # Forward through density estimator
        z, logdet = self.density_estimator(features)
        
        # Compute negative log-likelihood
        # (log_prob is [B, H, W]; logdet is per-sample [B], so broadcast it spatially)
        log_prob = self.density_estimator.log_prob(z)
        nll = -(log_prob + logdet.view(-1, 1, 1)).mean()
        
        # Compute gradient w.r.t. features
        grad = torch.autograd.grad(nll, features)[0]
        
        # Normalize gradient
        grad_norm = torch.norm(grad, p=2, dim=1, keepdim=True)
        grad_normalized = grad / (grad_norm + 1e-8)
        
        # Scale by perturbation magnitude
        perturbation = self.xi * grad_normalized
        
        return perturbation.detach()
    
    def compute_entropy(self, probs):
        """Compute prediction entropy for confidence filtering"""
        entropy = -torch.sum(probs * torch.log(probs + 1e-8), dim=1)
        return entropy
    
    def forward(self, labeled_img=None, unlabeled_img_weak=None, unlabeled_img_strong=None):
        """
        Forward pass for DPFR training
        
        Args:
            labeled_img: Labeled images [B, 3, H, W]
            unlabeled_img_weak: Weakly augmented unlabeled images [B, 3, H, W]
            unlabeled_img_strong: Strongly augmented unlabeled images [B, 3, H, W]
        """
        outputs = {}
        
        # === Labeled Data Forward ===
        if labeled_img is not None:
            s_features_labeled, _ = self.student_encoder(labeled_img)
            s_pred_labeled = self.student_decoder(s_features_labeled)
            outputs['student_pred_labeled'] = s_pred_labeled
        
        # === Unlabeled Data Forward ===
        if unlabeled_img_weak is not None and unlabeled_img_strong is not None:
            # Teacher forward (weak augmentation)
            with torch.no_grad():
                t_features, _ = self.teacher_encoder(unlabeled_img_weak)
                t_pred = self.teacher_decoder(t_features)
                t_probs = F.softmax(t_pred, dim=1)
                
                # Generate pseudo labels with confidence filtering
                entropy = self.compute_entropy(t_probs)
                confidence_mask = entropy < self.confidence_threshold
                pseudo_labels = torch.argmax(t_probs, dim=1)
                pseudo_labels[~confidence_mask] = 255  # Ignore low confidence
            
            # Student forward (strong augmentation)
            s_features_unlabeled, _ = self.student_encoder(unlabeled_img_strong)
            s_pred_unlabeled_clean = self.student_decoder(s_features_unlabeled)
            
            # Generate density perturbation
            perturbation = self.generate_density_perturbation(s_features_unlabeled)
            s_features_perturbed = s_features_unlabeled + perturbation
            
            # Decode perturbed features
            s_pred_unlabeled_perturbed = self.student_decoder(s_features_perturbed)
            
            # Feature recalibration
            contrastive_loss = self.feature_recalibration(
                s_features_perturbed, pseudo_labels, confidence_mask
            )
            
            outputs.update({
                'student_pred_unlabeled_clean': s_pred_unlabeled_clean,
                'student_pred_unlabeled_perturbed': s_pred_unlabeled_perturbed,
                'pseudo_labels': pseudo_labels,
                'contrastive_loss': contrastive_loss,
                'teacher_pred': t_pred,
                'confidence_mask': confidence_mask
            })
        
        return outputs
    
    def forward_test(self, img):
        """Inference mode (uses student network only)"""
        features, _ = self.student_encoder(img)
        pred = self.student_decoder(features)
        return pred


# ============================================================================
# 5. LOSS FUNCTIONS
# ============================================================================

class DPFRLoss(nn.Module):
    """Combined loss for DPFR training"""
    def __init__(self, alpha1=0.5, alpha2=0.3):
        super(DPFRLoss, self).__init__()
        self.alpha1 = alpha1  # Semi-supervised loss weight
        self.alpha2 = alpha2  # Contrastive loss weight
        self.ce_loss = nn.CrossEntropyLoss(ignore_index=255)
        
    def forward(self, outputs, labels_labeled=None):
        total_loss = 0
        loss_dict = {}
        
        # Supervised loss
        if 'student_pred_labeled' in outputs and labels_labeled is not None:
            sup_loss = self.ce_loss(outputs['student_pred_labeled'], labels_labeled)
            total_loss += sup_loss
            loss_dict['supervised'] = sup_loss.item()
        
        # Semi-supervised loss (consistency)
        if 'student_pred_unlabeled_clean' in outputs:
            pseudo_labels = outputs['pseudo_labels']
            
            # Clean prediction loss
            semi_loss_clean = self.ce_loss(outputs['student_pred_unlabeled_clean'], pseudo_labels)
            
            # Perturbed prediction loss
            semi_loss_perturbed = self.ce_loss(outputs['student_pred_unlabeled_perturbed'], pseudo_labels)
            
            semi_loss = semi_loss_clean + semi_loss_perturbed
            total_loss += self.alpha1 * semi_loss
            loss_dict['semi_supervised'] = semi_loss.item()
        
        # Contrastive loss
        if 'contrastive_loss' in outputs:
            cl_loss = outputs['contrastive_loss']
            total_loss += self.alpha2 * cl_loss
            loss_dict['contrastive'] = cl_loss.item()
        
        loss_dict['total'] = total_loss.item()
        return total_loss, loss_dict


# ============================================================================
# 6. TRAINING PIPELINE
# ============================================================================

class DPFRTrainer:
    """Training pipeline for DPFR"""
    def __init__(self, model, device='cuda'):
        self.model = model.to(device)
        self.device = device
        
        # Separate optimizers for main network and density estimator
        self.optimizer_main = torch.optim.Adam(
            list(model.student_encoder.parameters()) + 
            list(model.student_decoder.parameters()) +
            list(model.feature_recalibration.parameters()),
            lr=5e-4, weight_decay=1e-4
        )
        
        self.optimizer_flow = torch.optim.Adam(
            model.density_estimator.parameters(),
            lr=5e-4, weight_decay=1e-4
        )
        
        self.criterion = DPFRLoss(alpha1=0.5, alpha2=0.3)
        
    def train_step(self, labeled_batch, unlabeled_batch, epoch):
        """
        Single training step
        
        Args:
            labeled_batch: (images, labels) tuple
            unlabeled_batch: (img_weak, img_strong) tuple
            epoch: Current epoch number
        """
        self.model.train()
        
        # Unpack batches
        labeled_img, labels = labeled_batch
        labeled_img = labeled_img.to(self.device)
        labels = labels.to(self.device)
        
        unlabeled_weak, unlabeled_strong = unlabeled_batch
        unlabeled_weak = unlabeled_weak.to(self.device)
        unlabeled_strong = unlabeled_strong.to(self.device)
        
        # === Step 1: Update Density Estimator (starting from epoch 5) ===
        if epoch >= 5:
            self.optimizer_flow.zero_grad()
            
            with torch.no_grad():
                t_features, _ = self.model.teacher_encoder(unlabeled_weak)
            
            # Flow loss on unlabeled features
            flow_loss_unlabeled = self.model.density_estimator.compute_loss(
                t_features, labels=None, unlabeled=True
            )
            
            # Flow loss on labeled features if available
            with torch.no_grad():
                t_features_labeled, _ = self.model.teacher_encoder(labeled_img)
            
            flow_loss_labeled = self.model.density_estimator.compute_loss(
                t_features_labeled, labels=labels, unlabeled=False
            )
            
            flow_loss = flow_loss_labeled + flow_loss_unlabeled
            flow_loss.backward()
            self.optimizer_flow.step()
        
        # === Step 2: Update Main Network ===
        self.optimizer_main.zero_grad()
        
        outputs = self.model(
            labeled_img=labeled_img,
            unlabeled_img_weak=unlabeled_weak,
            unlabeled_img_strong=unlabeled_strong
        )
        
        total_loss, loss_dict = self.criterion(outputs, labels)
        total_loss.backward()
        self.optimizer_main.step()
        
        # Update teacher network
        self.model.update_teacher()
        
        return loss_dict
    
    @torch.no_grad()
    def validate(self, val_loader):
        """Validation"""
        self.model.eval()
        total_dice = 0
        num_samples = 0
        
        for images, labels in val_loader:
            images = images.to(self.device)
            labels = labels.to(self.device)
            
            preds = self.model.forward_test(images)
            preds = torch.argmax(preds, dim=1)
            
            # Compute Dice score
            dice = self.compute_dice(preds, labels, num_classes=3)
            total_dice += dice * images.size(0)
            num_samples += images.size(0)
        
        return total_dice / num_samples
    
    def compute_dice(self, pred, target, num_classes=3, ignore_index=255):
        """Compute mean Dice score"""
        dice_scores = []
        
        for c in range(num_classes):
            pred_c = (pred == c).float()
            target_c = (target == c).float()
            
            # Mask out ignore regions
            mask = (target != ignore_index).float()
            pred_c = pred_c * mask
            target_c = target_c * mask
            
            intersection = (pred_c * target_c).sum()
            union = pred_c.sum() + target_c.sum()
            
            if union > 0:
                dice = (2 * intersection) / (union + 1e-8)
                dice_scores.append(dice.item())
        
        return np.mean(dice_scores) if dice_scores else 0


# ============================================================================
# 7. DATA AUGMENTATION
# ============================================================================

class WeakAugmentation:
    """Weak augmentation: flip, crop, brightness"""
    def __init__(self, image_size=(416, 416)):
        self.image_size = image_size
        
    def __call__(self, img):
        # Random horizontal flip
        if np.random.rand() > 0.5:
            img = torch.flip(img, dims=[2])
        
        # Random crop
        _, h, w = img.size()
        if h > self.image_size[0] and w > self.image_size[1]:
            top = np.random.randint(0, h - self.image_size[0])
            left = np.random.randint(0, w - self.image_size[1])
            img = img[:, top:top+self.image_size[0], left:left+self.image_size[1]]
        else:
            img = F.interpolate(img.unsqueeze(0), size=self.image_size, 
                               mode='bilinear', align_corners=True).squeeze(0)
        
        # Random brightness
        brightness_factor = np.random.uniform(0.9, 1.1)
        img = img * brightness_factor
        img = torch.clamp(img, 0, 1)
        
        return img


class StrongAugmentation:
    """Strong augmentation: includes CutMix"""
    def __init__(self, image_size=(416, 416)):
        self.image_size = image_size
        self.weak = WeakAugmentation(image_size)
        
    def __call__(self, img):
        # First apply weak augmentation
        img = self.weak(img)
        
        # Apply CutMix
        if np.random.rand() > 0.5:
            img = self.cutmix(img)
        
        return img
    
    def cutmix(self, img, alpha=1.0):
        """CutMix augmentation"""
        lam = np.random.beta(alpha, alpha)
        
        _, h, w = img.size()
        cut_ratio = np.sqrt(1 - lam)
        cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
        
        # Random position
        cx, cy = np.random.randint(w), np.random.randint(h)
        x1 = np.clip(cx - cut_w // 2, 0, w)
        y1 = np.clip(cy - cut_h // 2, 0, h)
        x2 = np.clip(cx + cut_w // 2, 0, w)
        y2 = np.clip(cy + cut_h // 2, 0, h)
        
        # Fill with random noise or zeros (simplified)
        img[:, y1:y2, x1:x2] = torch.rand_like(img[:, y1:y2, x1:x2])
        
        return img


# ============================================================================
# 8. UTILITY FUNCTIONS
# ============================================================================

def create_dpfr_model(num_classes=3, pretrained=True):
    """Factory function to create DPFR model"""
    model = DPFR(
        num_classes=num_classes,
        ema_decay=0.99,
        flow_steps=8,
        perturbation_magnitude=3.0,
        temperature=0.5,
        num_anchors=256,
        confidence_threshold=0.5
    )
    
    if pretrained:
        # Load ImageNet pretrained ResNet101 weights for encoder
        import torchvision.models as models
        resnet101 = models.resnet101(pretrained=True)
        
        # Copy weights to student encoder
        student_dict = model.student_encoder.state_dict()
        pretrained_dict = resnet101.state_dict()
        
        # Filter and copy matching layers
        pretrained_dict = {k: v for k, v in pretrained_dict.items() 
                          if k in student_dict and v.size() == student_dict[k].size()}
        student_dict.update(pretrained_dict)
        model.student_encoder.load_state_dict(student_dict, strict=False)
        
        # Copy to teacher
        model._initialize_teacher()
    
    return model


# ============================================================================
# 9. EXAMPLE USAGE
# ============================================================================

def example_training_loop():
    """Example of how to use DPFR for training"""
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Create model
    model = create_dpfr_model(num_classes=3, pretrained=True)
    trainer = DPFRTrainer(model, device=device)
    
    # Example data (replace with actual data loaders)
    batch_size = 8
    labeled_img = torch.randn(batch_size, 3, 416, 416)
    labels = torch.randint(0, 3, (batch_size, 416, 416))
    unlabeled_weak = torch.randn(batch_size, 3, 416, 416)
    unlabeled_strong = torch.randn(batch_size, 3, 416, 416)
    
    # Training step
    for epoch in range(1000):
        loss_dict = trainer.train_step(
            (labeled_img, labels),
            (unlabeled_weak, unlabeled_strong),
            epoch
        )
        
        if epoch % 10 == 0:
            print(f"Epoch {epoch}: {loss_dict}")
        
        # Update teacher EMA
        # (already done in train_step)
    
    # Inference
    model.eval()
    with torch.no_grad():
        test_img = torch.randn(1, 3, 416, 416).to(device)
        prediction = model.forward_test(test_img)
        print(f"Prediction shape: {prediction.shape}")


if __name__ == "__main__":
    example_training_loop()

References

All claims in this article are drawn from the original semi-supervised gland segmentation publication (Yu et al., Medical Image Analysis, January 2026).
