5 Shocking Mistakes in Knowledge Distillation (And the Brilliant Framework KD2M That Fixes Them)

Visual comparison of misaligned vs. aligned neural network features using KD2M, showing dramatic improvement in model performance.

In the fast-evolving world of deep learning, one of the most promising techniques for deploying AI on edge devices is Knowledge Distillation (KD). But despite its popularity, many implementations suffer from critical flaws that undermine performance. A groundbreaking new paper titled “KD2M: A Unifying Framework for Feature Knowledge Distillation” reveals 5 shocking mistakes commonly made in KD—and introduces a brilliant, theoretically sound solution that outperforms traditional methods.

In this article, we’ll expose these pitfalls, unpack the innovative KD2M framework, and show how it leverages optimal transport and information geometry to achieve superior model compression. Whether you’re a machine learning engineer, researcher, or tech enthusiast, this is a must-read for anyone serious about efficient AI.


What Is Knowledge Distillation? (And Why It Often Fails)

Knowledge Distillation (KD) is the process of transferring knowledge from a large, high-performing teacher model (e.g., ResNet-34) to a smaller, faster student model (e.g., ResNet-18). The goal? To retain accuracy while reducing computational cost.

Traditionally, KD works by matching the output predictions—the final probability distributions over classes—between teacher and student. While this works, it ignores a deeper layer of intelligence: feature representations.

❌ The 5 Shocking Mistakes in Traditional KD

  1. Ignoring Feature-Level Knowledge
    Most methods only distill output logits. But the real “knowledge” lives in the hidden layer activations—the feature maps that capture semantic patterns.
  2. Using Pointwise Losses Instead of Distribution Matching
    Simple L2 or KL divergence on outputs fails to capture the geometric structure of feature spaces.
  3. Neglecting Label-Aware Alignment
    Features should not just match—they should match in context of their labels. Blind matching can align wrong classes.
  4. Overlooking Theoretical Guarantees
    Many methods lack error bounds, making performance unpredictable.
  5. Treating All Metrics Equally
    Not all probability distances are created equal. Some, like Wasserstein, preserve geometry; others, like MMD, may fail in high dimensions.

💡 The Solution? A unified framework that matches feature distributions—not just outputs—using principled metrics from computational optimal transport.


Introducing KD2M: The Brilliant Fix

The paper proposes KD2M (Knowledge Distillation through Distribution Matching), a unifying framework that addresses all five flaws by formalizing feature-level distillation as a distribution matching problem.

✅ How KD2M Works

Instead of just matching predictions, KD2M aligns the distributions of neural activations (features) between the student and teacher models. This is done by minimizing a probability metric D between the push-forward distributions of the data through the student and teacher encoders.

The objective function is:

\[ \theta^{\star} = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(x^{(P)}, y^{(P)}) \sim \mathcal{P}} \left[ \mathcal{L}\left(y^{(P)}, h_S(g_S(x^{(P)}))\right) \right] + \lambda \, \mathcal{D}\left(g_{S,\sharp}\mathcal{P},\; g_{T,\sharp}\mathcal{P}\right) \]

Where:

  • g_S, g_T : Student and teacher encoders (feature extractors)
  • h_S : Student classifier
  • g_{S,♯}P : The push-forward distribution of the data P through g_S
  • D : A probability metric (e.g., Wasserstein, KL)
  • λ : Balances the classification and distillation losses

This simple yet powerful formulation unifies many existing methods under one theoretical umbrella.
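
To make the objective concrete, here is a minimal PyTorch sketch of the per-batch loss. The names (student_encoder, student_head, teacher_encoder, feature_distance, lambda_kd) are illustrative placeholders, not the paper's reference code; a full implementation appears at the end of this article.

# Minimal sketch of the KD2M objective for one mini-batch (illustrative names).
import torch
import torch.nn.functional as F

def kd2m_batch_loss(x, y, student_encoder, student_head, teacher_encoder,
                    feature_distance, lambda_kd=1e-4):
    z_s = student_encoder(x)               # student features g_S(x)
    with torch.no_grad():
        z_t = teacher_encoder(x)           # teacher features g_T(x); teacher is frozen
    logits = student_head(z_s)             # h_S(g_S(x))
    cls_loss = F.cross_entropy(logits, y)  # supervised classification term
    kd_loss = feature_distance(z_s, z_t)   # D(g_{S,#}P, g_{T,#}P) estimated on the batch
    return cls_loss + lambda_kd * kd_loss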


Distribution Metrics That Actually Work

KD2M supports multiple probability metrics to measure the distance between feature distributions. The choice of metric is crucial—it determines how well features are aligned.

1. Empirical Distributions & Optimal Transport

For batch-level feature matching, KD2M uses empirical distributions built from mini-batch samples:

\[ P_{S}(z) = \frac{1}{n} \sum_{i=1}^{n} \delta\left(z - z_i^{(P_S)}\right), \quad z_i^{(P_S)} = g_S\left(x_i^{(P)}\right) \]

The 2-Wasserstein distance measures the minimal “cost” to transform one distribution into another:

\[ W_2^2(P_S, P_T) = \min_{\gamma \in \Gamma(P_S, P_T)} \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| z_i^{(P_S)} - z_j^{(P_T)} \right\|_2^2 \, \gamma_{ij} \]

This captures geometric structure in feature space—something simple L2 losses miss.
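
As a rough illustration (not the paper's reference code), the batch-level W₂² can be computed with the POT library by solving an optimal transport problem over the pairwise squared distances between student and teacher features. This sketch works on NumPy arrays; during training you would keep the computation differentiable, as in the full implementation at the end of this article.

# Sketch: squared 2-Wasserstein distance between two feature batches via POT.
import numpy as np
import ot  # pip install pot

def empirical_w2_squared(z_s, z_t):
    n, m = z_s.shape[0], z_t.shape[0]
    a = np.full(n, 1.0 / n)                      # uniform weights on student samples
    b = np.full(m, 1.0 / m)                      # uniform weights on teacher samples
    M = ot.dist(z_s, z_t, metric='sqeuclidean')  # pairwise squared Euclidean costs
    return ot.emd2(a, b, M)                      # optimal transport cost = W_2^2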

🔹 Class-Conditional Wasserstein (CW₂)

To incorporate label information:

\[ \text{CW}_2(P_S, P_T)^2 = \frac{1}{n_c} \sum_{y=1}^{n_c} W_2\left(P_S(Z \mid Y = y),\; P_T(Z \mid Y = y)\right)^2 \]

This ensures features are aligned within each class, preventing misalignment across categories.

🔹 Joint Wasserstein (JW₂)

Even better: align features and labels together:

\[ \mathcal{JW}_2^2(P_S, P_T) = \min_{\gamma \in \Gamma(P_S, P_T)} \sum_{i,j} \gamma_{ij} \left( \left\| z_i^{(P_S)} - z_j^{(P_T)} \right\|_2^2 + \beta\, \mathcal{L}\left(h(z_i^{(P_S)}), h(z_j^{(P_T)})\right) \right) \]

This joint metric, used in recent domain adaptation work, ensures semantically meaningful alignment.
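
JW₂ is not included in the reference implementation at the end of this article, but a rough sketch could look like the following. It assumes the label-cost term L is a cross-entropy between softmax predictions obtained by passing both feature sets through the student's classifier head; β, joint_w2_squared, and the variable names are illustrative, not from the paper.

# Sketch: joint Wasserstein cost mixing feature distances with a pairwise label loss.
import numpy as np
import torch
import torch.nn.functional as F
import ot

def joint_w2_squared(z_s, z_t, head, beta=1.0):
    # Feature part of the ground cost: pairwise squared Euclidean distances.
    feat_cost = torch.cdist(z_s, z_t, p=2) ** 2
    # Label part: cross-entropy between predictions on student and teacher features.
    log_p_s = F.log_softmax(head(z_s), dim=1)    # (n, C) log-probabilities
    p_t = F.softmax(head(z_t), dim=1)            # (m, C) probabilities
    label_cost = -(log_p_s @ p_t.T)              # (n, m) pairwise cross-entropy
    cost = (feat_cost + beta * label_cost).detach().cpu().numpy()
    n, m = z_s.shape[0], z_t.shape[0]
    gamma = ot.emd(np.full(n, 1.0 / n), np.full(m, 1.0 / m),
                   np.ascontiguousarray(cost, dtype=np.float64))
    return float((gamma * cost).sum())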

2. Gaussian Approximations

For faster computation, features can be modeled as Gaussian distributions:

\[ P_{S}(z) = \mathcal{N}(\mu_{S}, \Sigma_{S}) \]

🔹 Wasserstein Distance for Gaussians

Closed-form solution:

\[ W_2^2(P_S, P_T) = \left\| \hat{\mu}_S - \hat{\mu}_T \right\|_2^2 + \mathcal{B}(\hat{\Sigma}_S, \hat{\Sigma}_T) \]

where

\[ \mathcal{B}(A, B) = \text{Tr}(A) + \text{Tr}(B) - 2\,\text{Tr}\left(\left(A^{1/2} B A^{1/2}\right)^{1/2}\right) \]

For diagonal covariances, this simplifies to:

\[ W_2^2(P_S, P_T) = \lVert \mu_S - \mu_T \rVert_2^2 + \lVert \sigma_S - \sigma_T \rVert_2^2 \]
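
With diagonal covariances, the distance reduces to a couple of tensor operations per batch. A minimal sketch (illustrative names; it is differentiable, so it can be used directly as a loss term):

# Sketch: closed-form W_2^2 between diagonal-Gaussian fits of two feature batches.
import torch

def diagonal_gaussian_w2_squared(z_s, z_t):
    mu_s, sigma_s = z_s.mean(dim=0), z_s.std(dim=0)  # per-dimension mean / std (student)
    mu_t, sigma_t = z_t.mean(dim=0), z_t.std(dim=0)  # per-dimension mean / std (teacher)
    return (mu_s - mu_t).pow(2).sum() + (sigma_s - sigma_t).pow(2).sum()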

🔹 Kullback-Leibler Divergence (KL)

Also supported, especially for Gaussian features:

\[ \text{KL}(P_S \,\|\, P_T) = \frac{1}{2} \left( \mathrm{Tr}\left(\Sigma_T^{-1} \Sigma_S\right) + (\mu_T - \mu_S)^\top \Sigma_T^{-1} (\mu_T - \mu_S) - d + \log \frac{\det \Sigma_T}{\det \Sigma_S} \right) \]

While KL is widely used, it lacks geometric awareness—Wasserstein is often superior.
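
If you implement this formula by hand, torch.distributions offers the same closed form and makes a handy cross-check. A small sketch (illustrative names; the covariances are regularized because mini-batch estimates of high-dimensional covariances are rank-deficient):

# Sketch: Gaussian KL(P_S || P_T) via torch.distributions, useful as a sanity check.
import torch
from torch.distributions import MultivariateNormal, kl_divergence

def gaussian_kl(z_s, z_t, eps=1e-6):
    d = z_s.shape[1]
    eye = eps * torch.eye(d, device=z_s.device)
    p_s = MultivariateNormal(z_s.mean(dim=0), covariance_matrix=torch.cov(z_s.T) + eye)
    p_t = MultivariateNormal(z_t.mean(dim=0), covariance_matrix=torch.cov(z_t.T) + eye)
    return kl_divergence(p_s, p_t)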


Theoretical Breakthrough: Why KD2M Actually Works

One of KD2M’s biggest strengths is its theoretical grounding. Drawing from domain adaptation theory (Redko et al., 2017), the paper proves that the generalization error gap between student and teacher is bounded by the Wasserstein distance between their feature distributions.

📐 Lemma 4.1: Error Bound via Wasserstein

Let P_S = g_{S,♯}P and P_T = g_{T,♯}P. Under mild conditions:

\[ \left| R_{P_S}(h) - R_{P_T}(h) \right| \leq W_2(P_S, P_T) \]

This means: the closer the feature distributions, the closer the performance.

📈 Theorem 4.1: Encoder Alignment Guarantees Performance

Even stronger, the paper shows:

\[ \left| R_{P_S}(h) - R_{P_T}(h) \right| \leq \left\| g_S - g_T \right\|_{L_2(P)} \]

Implication: If the student encoder g_S converges to the teacher encoder g_T in L_2(P), then their generalization errors converge.

This is a game-changer: it provides a theoretical justification for feature distillation, something many prior methods lacked.


Empirical Results: KD2M Outperforms Baselines

The paper evaluates KD2M on SVHN, CIFAR-10, and CIFAR-100 using ResNet-18 (student) and ResNet-34 (teacher).

✅ Key Findings

| Method   | SVHN (%) | CIFAR-10 (%) | CIFAR-100 (%) | Avg (%) |
|----------|----------|--------------|---------------|---------|
| Student  | 93.10    | 85.11        | 56.66         | 78.29   |
| Teacher  | 94.41    | 86.98        | 62.21         | 81.20   |
| W₂ (E)   | 94.00    | 86.45        | 61.07         | 80.51   |
| CW₂ (E)  | 94.06    | 86.54        | 61.47         | 80.69   |
| JW₂ (E)  | 94.00    | 86.60        | 61.07         | 80.55   |
| W₂ (G)   | 93.94    | 86.63        | 60.68         | 80.41   |
| KL (G)   | 94.05    | 86.44        | 60.66         | 80.38   |

🔺 Insight: All KD2M variants improve over the student baseline, with label-aware metrics (CW₂, JW₂) performing best.

  • CW₂ (class-conditional) wins on CIFAR-100, showing that label-aware alignment matters when there are many classes.
  • Even the Gaussian approximations perform well, offering a computationally efficient alternative.
  • JW₂ slightly edges out plain W₂ by jointly aligning features and predictions.

🖼️ Visual Evidence: Feature Alignment

The paper includes t-SNE visualizations (Fig. 3) showing:

  • Baseline student: Features are scattered, poorly aligned with teacher.
  • KD2M student: Features are tightly clustered and well-aligned with teacher.

This visual proof confirms that KD2M doesn’t just improve numbers—it fixes the underlying representation.


How to Implement KD2M (Step-by-Step)

Here’s how to apply KD2M in practice:

1. Choose Your Backbone

  • Teacher: Pre-trained large model (e.g., ResNet-34)
  • Student: Smaller model (e.g., ResNet-18)

2. Select a Distribution Metric

| Metric         | Use case           | Pros                      | Cons                  |
|----------------|--------------------|---------------------------|-----------------------|
| W₂ (Empirical) | High accuracy      | Captures geometry         | Computationally heavy |
| CW₂            | Multi-class tasks  | Label-aware               | Requires class splits |
| JW₂            | Semantic alignment | Joint feature-label match | Needs label loss      |
| W₂ (Gaussian)  | Fast training      | Closed-form, fast         | Assumes normality     |
| KL (Gaussian)  | Classic KD         | Easy to implement         | Less geometric        |

3. Train with KD2M Loss

# Pseudocode
def kd2m_loss(student_logits, student_features, teacher_features, labels, lambda_kd=1e-3):
    classification_loss = CrossEntropy(student_logits, labels)
    distillation_loss = CW2(student_features, teacher_features, labels)  # or W2, JW2, etc.
    return classification_loss + lambda_kd * distillation_loss

4. Tune λ

  • Start with λ between 10⁻⁵ and 10⁻⁴
  • Too high: Student overfits to teacher
  • Too low: No distillation effect

Why KD2M Is a Game-Changer

KD2M isn’t just another KD method—it’s a paradigm shift.

✅ Advantages

  • Unified framework: Combines multiple distillation strategies.
  • Theoretically sound: Proven error bounds.
  • Flexible: Works with any encoder architecture.
  • Empirically strong: Outperforms baselines across datasets.
  • Label-aware: Metrics like CW₂ and JW₂ respect class structure.

🔮 Future Applications

  • Dataset distillation: Compress entire datasets into synthetic examples.
  • Federated learning: Align models across devices.
  • Domain adaptation: Transfer knowledge across domains.

Final Verdict: Stop Wasting Time on Weak KD Methods

If you’re still using output-only distillation or naive L2 losses, you’re leaving performance on the table. The future of model compression lies in feature distribution matching—and KD2M is leading the way.

With strong theoretical backing, superior empirical results, and practical flexibility, KD2M sets a new standard for knowledge distillation.


Call to Action: Dive Deeper Today!

Want to implement KD2M in your own projects?

Get the code: https://github.com/eddardd/kddm
Read the full paper: arXiv:2504.01757
Try it on your dataset—and see the difference feature alignment makes!

💬 Have questions? Drop a comment below or connect with the author on Twitter/X @eddardd . Let’s push the boundaries of efficient AI—together.

Below is a complete, end-to-end implementation of the approach described in “KD²M: A Unifying Framework for Feature Knowledge Distillation.”

# KD²M: A Unifying Framework for Feature Knowledge Distillation
# Complete end-to-end implementation based on the paper (arXiv:2504.01757v2)
# This script uses PyTorch, Torchvision, and the POT (Python Optimal Transport) library.

# --- 1. Imports and Setup ---
# Ensure you have the required libraries installed:
# pip install torch torchvision torchaudio pot

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision.models import resnet18, resnet34
import ot # Python Optimal Transport library
import numpy as np
from tqdm import tqdm

print("KD²M Implementation using PyTorch")
print(f"PyTorch Version: {torch.__version__}")
print(f"Torchvision Version: {torchvision.__version__}")
print(f"POT Version: {ot.__version__}")


# --- 2. Model Definitions (Student and Teacher) ---

# We need to modify the ResNet models to easily extract features from the encoder part (g_S and g_T)
# and pass them to the classifier part (h_S and h_T).
class FeatureExtractor(nn.Module):
    """
    Wrapper for a ResNet model to separate the feature encoder (g) from the classifier (h).
    This corresponds to the g_S and g_T networks in the paper.
    """
    def __init__(self, model_name='resnet18', pretrained=True, num_classes=10):
        super(FeatureExtractor, self).__init__()
        if model_name == 'resnet18':
            original_model = resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT if pretrained else None)
        elif model_name == 'resnet34':
            original_model = resnet34(weights=torchvision.models.ResNet34_Weights.DEFAULT if pretrained else None)
        else:
            raise ValueError("Model name not supported. Choose 'resnet18' or 'resnet34'.")

        # The encoder 'g' consists of all layers except the final fully connected layer.
        self.encoder = nn.Sequential(*list(original_model.children())[:-1])
        # The classifier 'h' is a fresh linear head sized for the target dataset
        # (the pretrained ImageNet 1000-way head would not match CIFAR-10's 10 classes).
        self.classifier = nn.Linear(original_model.fc.in_features, num_classes)

    def forward(self, x):
        # The forward pass returns the features from the encoder.
        features = self.encoder(x)
        features = torch.flatten(features, 1)
        return features

    def classify(self, features):
        # A separate method to get predictions from features.
        return self.classifier(features)

# --- 3. Distribution Distance Metrics ---
# Implementation of the probability metrics discussed in Section 2 of the paper.

def compute_wasserstein_empirical(p_s, p_t):
    """
    Computes the 2-Wasserstein distance between two empirical distributions.
    Corresponds to Equation 3 in the paper.
    Args:
        p_s (Tensor): Student features (batch_size, feature_dim)
        p_t (Tensor): Teacher features (batch_size, feature_dim)
    Returns:
        Tensor: The squared 2-Wasserstein distance.
    """
    # Solve for the optimal transport plan with POT on detached NumPy costs, then
    # re-evaluate the transport cost with torch tensors so that gradients flow
    # through the student features (the optimal plan is treated as a constant).
    n, m = p_s.shape[0], p_t.shape[0]
    a, b = np.full((n,), 1.0 / n), np.full((m,), 1.0 / m)
    # Cost matrix: pairwise squared Euclidean distance (C-contiguous NumPy array)
    M = ot.dist(p_s.detach().cpu().numpy(), p_t.detach().cpu().numpy(), metric='sqeuclidean')
    M = np.ascontiguousarray(M)

    # EMD (Earth Mover's Distance) yields the optimal coupling gamma
    gamma = torch.tensor(ot.emd(a, b, M), dtype=p_s.dtype, device=p_s.device)
    # Differentiable transport cost: sum_ij gamma_ij * ||z_i - z_j||^2 = W_2^2
    cost = torch.cdist(p_s, p_t, p=2) ** 2
    return torch.sum(gamma * cost)

def compute_class_conditional_wasserstein(p_s, p_t, labels):
    """
    Computes the Class-Conditional Wasserstein distance.
    Corresponds to Equation 4 in the paper.
    Args:
        p_s (Tensor): Student features
        p_t (Tensor): Teacher features
        labels (Tensor): Ground truth labels for the batch
    Returns:
        Tensor: The squared CW2 distance.
    """
    unique_labels = torch.unique(labels)
    total_cw2_squared = 0.0
    num_classes_in_batch = 0

    for y in unique_labels:
        # Get features for the current class
        s_class_features = p_s[labels == y]
        t_class_features = p_t[labels == y]

        # Only compute distance if there's more than one sample to form a distribution
        if s_class_features.shape[0] > 1 and t_class_features.shape[0] > 1:
            total_cw2_squared += compute_wasserstein_empirical(s_class_features, t_class_features)
            num_classes_in_batch += 1

    if num_classes_in_batch == 0:
        return torch.tensor(0.0, device=p_s.device)
        
    return total_cw2_squared / num_classes_in_batch

def compute_gaussian_wasserstein(p_s, p_t):
    """
    Computes the 2-Wasserstein distance between two Gaussian distributions.
    Corresponds to Equation 6 in the paper.
    Args:
        p_s (Tensor): Student features
        p_t (Tensor): Teacher features
    Returns:
        Tensor: The squared W2 distance for Gaussian approximations.
    """
    # Estimate mean and covariance
    mu_s, cov_s = p_s.mean(dim=0), torch.cov(p_s.T)
    mu_t, cov_t = p_t.mean(dim=0), torch.cov(p_t.T)

    # Add a small epsilon for numerical stability
    epsilon = 1e-6
    eye = torch.eye(cov_s.shape[0], device=p_s.device)

    # Term 1: Squared norm of mean difference
    term_mean = (mu_s - mu_t).pow(2).sum()

    # Term 2: Bures metric between covariance matrices
    # B(A,B) = tr(A) + tr(B) - 2*tr((A^{1/2} B A^{1/2})^{1/2})
    # A Cholesky factor L of A gives L^T B L, which has the same eigenvalues as
    # A^{1/2} B A^{1/2}, so the trace of the matrix square root equals the sum
    # of the square roots of those eigenvalues.
    chol_s = torch.linalg.cholesky(cov_s + epsilon * eye)
    inner = chol_s.T @ (cov_t + epsilon * eye) @ chol_s
    eigvals = torch.linalg.eigvalsh(inner)
    trace_sqrt = torch.sqrt(torch.clamp(eigvals, min=0.0)).sum()

    term_bures = torch.trace(cov_s) + torch.trace(cov_t) - 2 * trace_sqrt

    w2_squared = term_mean + term_bures
    return w2_squared

def compute_gaussian_kl(p_s, p_t):
    """
    Computes the KL divergence between two Gaussian distributions.
    Corresponds to Equation 7 in the paper.
    Args:
        p_s (Tensor): Student features
        p_t (Tensor): Teacher features
    Returns:
        Tensor: The KL divergence KL(P_S || P_T).
    """
    mu_s, cov_s = p_s.mean(dim=0), torch.cov(p_s.T)
    mu_t, cov_t = p_t.mean(dim=0), torch.cov(p_t.T)
    d = p_s.shape[1]
    
    # Add a small epsilon for numerical stability of the inverse and log-determinant
    epsilon = 1e-6
    eye = epsilon * torch.eye(d, device=p_s.device)
    cov_t_inv = torch.inverse(cov_t + eye)

    term1 = torch.trace(cov_t_inv @ cov_s)
    term2 = (mu_t - mu_s).T @ cov_t_inv @ (mu_t - mu_s)
    term3 = -d
    # logdet is numerically safer than log(det(.) / det(.)) in high dimensions
    term4 = torch.logdet(cov_t + eye) - torch.logdet(cov_s + eye)

    kl_div = 0.5 * (term1 + term2 + term3 + term4)
    return kl_div


# --- 4. Data Loading ---
# As per the paper, we use CIFAR-10 with standard augmentations.

def get_cifar10_loaders(batch_size=128):
    """
    Prepares the CIFAR-10 data loaders.
    """
    # Normalization values from the paper for CIFAR-10/100
    mean = (0.5071, 0.4867, 0.4408)
    std = (0.2675, 0.2565, 0.2761)

    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])

    transform_test = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])

    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=batch_size, shuffle=True, num_workers=2)

    testset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform_test)
    testloader = torch.utils.data.DataLoader(
        testset, batch_size=batch_size, shuffle=False, num_workers=2)
        
    return trainloader, testloader

# --- 5. Training and Evaluation Logic ---

def train(student, teacher, trainloader, optimizer, scheduler, criterion_cls, criterion_dist, lambda_kd, epoch):
    """
    The main training function for one epoch, corresponding to Algorithm 1.
    """
    student.train()
    teacher.eval() # Teacher is frozen
    
    total_loss = 0
    total_cls_loss = 0
    total_dist_loss = 0
    
    progress_bar = tqdm(trainloader, desc=f"Epoch {epoch+1} Training")
    
    for batch_idx, (inputs, targets) in enumerate(progress_bar):
        inputs, targets = inputs.to(device), targets.to(device)
        
        optimizer.zero_grad()
        
        # --- Forward pass ---
        # Get features from student and teacher encoders
        student_features = student(inputs)
        with torch.no_grad(): # Teacher is not trained
            teacher_features = teacher(inputs)
            
        # Get predictions from student classifier
        student_outputs = student.classify(student_features)
        
        # --- Loss Calculation ---
        # 1. Classification Loss (L_c)
        loss_cls = criterion_cls(student_outputs, targets)
        
        # 2. Feature Distillation Loss (L_d)
        loss_dist = criterion_dist(student_features, teacher_features, targets)
        
        # 3. Total Loss
        total_epoch_loss = loss_cls + lambda_kd * loss_dist
        
        # --- Backward pass and optimization ---
        total_epoch_loss.backward()
        optimizer.step()
        
        total_loss += total_epoch_loss.item()
        total_cls_loss += loss_cls.item()
        total_dist_loss += loss_dist.item()
        
        progress_bar.set_postfix({
            'Loss': f'{total_loss/(batch_idx+1):.3f}',
            'Cls': f'{total_cls_loss/(batch_idx+1):.3f}',
            'Dist': f'{total_dist_loss/(batch_idx+1):.3f}'
        })
        
    scheduler.step()


def test(model, testloader, criterion):
    """
    Evaluates the model on the test set.
    """
    model.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        progress_bar = tqdm(testloader, desc="Testing")
        for batch_idx, (inputs, targets) in enumerate(progress_bar):
            inputs, targets = inputs.to(device), targets.to(device)
            features = model(inputs)
            outputs = model.classify(features)
            loss = criterion(outputs, targets)

            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
            
            progress_bar.set_postfix({
                'Loss': f'{test_loss/(batch_idx+1):.3f}',
                'Acc': f'{100.*correct/total:.2f}%'
            })
    
    acc = 100. * correct / total
    print(f"Test Accuracy: {acc:.2f}%")
    return acc


# --- 6. Main Execution Block ---

if __name__ == '__main__':
    # --- Hyperparameters ---
    EPOCHS = 15
    BATCH_SIZE = 128
    LEARNING_RATE = 0.01
    MOMENTUM = 0.9
    LAMBDA_KD = 1e-5 # Distillation loss weight, can be tuned (as in Fig. 2)
    
    # Choose the distribution metric here
    # Options: 'W2_E', 'CW2_E', 'W2_G', 'KL_G'
    DIST_METRIC = 'CW2_E'
    
    # --- Setup ---
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")
    
    # --- Load Data ---
    trainloader, testloader = get_cifar10_loaders(BATCH_SIZE)
    
    # --- Initialize Models ---
    print("Initializing Student (ResNet-18) and Teacher (ResNet-34)...")
    student_net = FeatureExtractor('resnet18', pretrained=True, num_classes=10).to(device)
    teacher_net = FeatureExtractor('resnet34', pretrained=True, num_classes=10).to(device)
    
    # --- Define Loss and Optimizer ---
    criterion_classification = nn.CrossEntropyLoss()
    
    # Select the distillation loss function
    dist_metrics_map = {
        'W2_E': lambda s, t, y: compute_wasserstein_empirical(s, t),
        'CW2_E': compute_class_conditional_wasserstein,
        'W2_G': lambda s, t, y: compute_gaussian_wasserstein(s, t),
        'KL_G': lambda s, t, y: compute_gaussian_kl(s, t)
    }
    
    if DIST_METRIC not in dist_metrics_map:
        raise ValueError(f"Unknown distance metric: {DIST_METRIC}")
        
    criterion_distillation = dist_metrics_map[DIST_METRIC]
    print(f"Using distillation metric: {DIST_METRIC}")

    optimizer = optim.SGD(student_net.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM, weight_decay=5e-4)
    # Cosine Annealing scheduler as mentioned in the paper
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-4)
    
    # --- Run Training and Evaluation ---
    print("\n--- Starting Training ---")
    best_acc = 0
    for epoch in range(EPOCHS):
        train(student_net, teacher_net, trainloader, optimizer, scheduler, criterion_classification, criterion_distillation, LAMBDA_KD, epoch)
        print(f"\n--- Evaluating after Epoch {epoch+1} ---")
        acc = test(student_net, testloader, criterion_classification)
        
        if acc > best_acc:
            print("Saving new best model...")
            best_acc = acc
            # You can save the model state if needed
            # torch.save(student_net.state_dict(), f'kddm_student_{DIST_METRIC}.pth')

    print("\n--- Training Finished ---")
    print(f"Best Test Accuracy with {DIST_METRIC}: {best_acc:.2f}%")
