In the rapidly evolving field of artificial intelligence (AI), knowledge distillation (KD) has emerged as a cornerstone technique for compressing powerful, resource-intensive neural networks into smaller, more efficient models suitable for deployment on mobile and edge devices. However, traditional KD methods often fall short in capturing the full richness of a teacher model’s knowledge, especially when dealing with imbalanced data or limited mini-batch diversity.
A groundbreaking new approach, Anchor-Based Knowledge Distillation (AKD), is changing the game. Introduced by Spanos, Passalis, and Tefas in their 2025 paper published in Knowledge-Based Systems, AKD leverages strategically selected representative samples—called anchors—to significantly improve the accuracy, generalization, and trustworthiness of distilled models.
This article dives deep into the AKD framework, explaining its core principles, benefits, and real-world performance. Whether you’re an AI researcher, a machine learning engineer, or simply interested in the future of efficient AI, understanding AKD is essential.
What Is Knowledge Distillation (KD)?
Before we explore AKD, let’s briefly recap what knowledge distillation is.
Knowledge distillation is a model compression technique where a smaller “student” neural network learns from a larger, pre-trained “teacher” network. Instead of training the student solely on raw labels (hard targets), it also learns from the teacher’s soft predictions—such as class probabilities or intermediate feature representations.
This soft supervision helps the student capture nuanced patterns, relationships, and decision boundaries that are not evident from labels alone, leading to better performance than training from scratch.
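To make this concrete, here is a minimal sketch of the classic soft-target distillation loss (Hinton et al., 2015). The temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values taken from the AKD paper:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic knowledge distillation: blend the hard-label loss with a KL term
    between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 keeps the soft-target gradients on a comparable scale across temperatures
    distill = F.kl_div(log_soft_student, soft_teacher, reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * hard
```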
Why Traditional KD Falls Short
Despite its success, conventional KD—especially distribution learning-based methods—has notable limitations:
- Mini-batch Imbalance: Random sampling can leave entire classes out of a mini-batch, especially in datasets with many categories (e.g., CIFAR-100, ImageNet), which biases learning and weakens knowledge transfer (see the quick simulation after this list).
- Lack of Representativeness: Not all samples are equally informative. Standard methods treat all data points equally, potentially diluting the learning signal.
- Poor Uncertainty Estimation: Many KD methods produce overconfident predictions, especially on out-of-distribution (OoD) data, reducing model trustworthiness.
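As a quick illustration of the first point, the short simulation below estimates how many distinct classes a randomly sampled batch actually contains; the setup (uniform sampling, a CIFAR-100-sized label space, a batch of 128) is illustrative rather than taken from the paper:

```python
import random

def average_class_coverage(num_classes=100, batch_size=128, trials=2000):
    """Monte-Carlo estimate of the number of distinct classes in a random batch."""
    total = 0
    for _ in range(trials):
        labels = [random.randrange(num_classes) for _ in range(batch_size)]
        total += len(set(labels))
    return total / trials

# With 100 classes and a batch of 128, roughly 72 classes show up on average,
# i.e. about a quarter of the classes are absent from any given batch.
print(average_class_coverage())
```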
Introducing Anchor-Based Knowledge Distillation (AKD)
To address these challenges, the authors propose Anchor-Based Knowledge Distillation (AKD)—a novel two-stage framework that enhances distribution learning by introducing anchor samples.
Anchors are highly representative, informative data points selected from each class to serve as stable reference points in the feature space.
By matching the student’s representation distribution not just to the mini-batch, but also to these static anchors, AKD ensures a more balanced, consistent, and robust knowledge transfer process.
How AKD Works: The Core Mechanism
AKD builds upon distribution-matching approaches such as Probabilistic Knowledge Transfer (PKT) but introduces three key innovations:
- Anchor Selection
- Anchor-Guided Distribution Learning
- Attention-Based Feature Enhancement
Let’s break down each component.
1. Anchor Selection: Finding the Most Representative Samples
Anchors are not chosen randomly. Instead, they are selected based on their centrality within each class in the teacher model’s feature space.
Here’s the process:
- Pass all training data through the pre-trained teacher model to extract feature representations.
- For each class, compute a similarity matrix using cosine similarity: \[ M_c(i,j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} \]
- Calculate a centrality score for each sample by summing its row in the similarity matrix: \[ C(s_i) = \sum_{j} M_c(i,j) \]
- Select the \(N_{A_c}\) samples with the highest centrality scores in each class as that class's anchors.
These anchors act as “prototypes” that encapsulate the most typical and discriminative features of their respective classes.
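A minimal sketch of this selection step for a single class, assuming `features` already holds the teacher's feature vectors for that class (the full, training-ready version appears in the implementation at the end of this article):

```python
import torch
import torch.nn.functional as F

def select_class_anchors(features, num_anchors):
    """Return the indices of the most central samples of one class.

    features: (N, D) tensor of teacher representations for that class.
    """
    normed = F.normalize(features, p=2, dim=1)
    similarity = normed @ normed.t()       # cosine similarity matrix M_c
    centrality = similarity.sum(dim=1)     # C(s_i) = sum_j M_c(i, j)
    k = min(num_anchors, features.shape[0])
    return torch.topk(centrality, k=k).indices
```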
2. Anchor-Guided Distribution Learning
Once anchors are selected, AKD uses them to guide the student’s learning in two ways:
- Batch-to-Batch Matching: Aligns the similarity structure within the current mini-batch (like traditional methods).
- Batch-to-Anchor Matching: Forces the student to align its representations with the fixed anchor set, ensuring consistent class coverage.
The overall AKD loss function combines three Kullback-Leibler (KL) divergence terms:
\[ L_{\text{AKD}} = \lambda_{1} L_{1} + (1 - \lambda_{2}) L_{2} + \lambda_{2} L_{3} \]
Where:
- \(L_1\): Matches intra-batch similarities.
- \(L_2\): Matches batch samples to anchors.
- \(L_3\): Ensures symmetry by matching anchors to batch samples (acts as a regularizer).
This dual alignment ensures that even if a mini-batch lacks certain classes, the student still learns from their representative anchors.
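In code, all three terms reduce to the same operation: convert a similarity matrix into row-wise probability distributions and measure the KL divergence between teacher and student. One common way to do this, mirroring the `AKDLoss` module in the full implementation below, is a row-wise softmax:

```python
import torch.nn.functional as F

def distribution_matching_loss(student_sim, teacher_sim):
    """KL divergence between row-wise distributions derived from similarity matrices."""
    teacher_dist = F.softmax(teacher_sim, dim=1)          # rows become probability distributions
    student_log_dist = F.log_softmax(student_sim, dim=1)
    return F.kl_div(student_log_dist, teacher_dist, reduction='batchmean')

# L1 compares batch-to-batch similarities, L2 batch-to-anchor similarities, and
# L3 the transposed anchor-to-batch similarities; the total AKD loss is
# lambda1 * L1 + (1 - lambda2) * L2 + lambda2 * L3.
```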
3. Attention-Based Feature Enhancement
AKD further enhances learning by applying a learnable attention mask to the anchor images during training.
- The attention mechanism highlights semantically important regions (e.g., object parts in images).
- It is applied only on the student side, guiding it to focus on the same critical features the teacher uses.
- Crucially, the attention module is discarded after training, adding zero inference overhead.
This not only improves performance but also provides interpretability—you can visualize which parts of an anchor the student is focusing on.
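A minimal sketch of such a learnable mask applied to the anchor images on the student side; the mask resolution and bilinear upsampling are assumptions of this sketch, and the `AKD_Trainer` in the full implementation below realizes the same idea with a raw parameter tensor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorAttention(nn.Module):
    """Learnable spatial mask modulating anchor images before the student sees them."""
    def __init__(self, mask_size=14):
        super().__init__()
        self.mask = nn.Parameter(torch.ones(1, 1, mask_size, mask_size))

    def forward(self, anchor_images):
        # Upsample the mask to the image resolution and reweight each pixel.
        m = F.interpolate(self.mask, size=anchor_images.shape[-2:],
                          mode='bilinear', align_corners=False)
        return anchor_images * m
```

Because the mask only shapes what the student sees during distillation, it can simply be dropped at inference time.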
Key Benefits of AKD
✅ Improved Accuracy and Generalization
By ensuring balanced class representation via anchors, AKD mitigates the negative effects of mini-batch sampling bias. This leads to:
- Higher retrieval accuracy
- Better classification performance
- Stronger generalization across datasets
✅ Enhanced Model Trustworthiness
AKD contributes to trustworthy AI by enabling intrinsic uncertainty estimation.
The authors propose a simple yet effective method:
\[ u(x) = 1 - \max_{i = 1, \ldots, N_A} K\big(f(x), f(a_i)\big) \]
Where:
- \(u(x)\) is the uncertainty of input \(x\)
- \(K(\cdot,\cdot)\) is the similarity kernel
- \(a_i\) are the anchor representations
Intuition: If a test sample is dissimilar to all anchors, it lies in a poorly understood region of the feature space—indicating high uncertainty.
This method outperforms traditional baselines like softmax entropy and even deep ensembles in out-of-distribution detection, despite using only a single model.
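Given student embeddings for a query and for the anchor set, the estimator is only a few lines (a sketch of the formula above; `query_features` and `anchor_features` are assumed to be the student's feature vectors):

```python
import torch
import torch.nn.functional as F

def anchor_uncertainty(query_features, anchor_features):
    """u(x) = 1 - max_i K(f(x), f(a_i)), with K rescaling cosine similarity to [0, 1]."""
    q = F.normalize(query_features, p=2, dim=1)
    a = F.normalize(anchor_features, p=2, dim=1)
    kernel = 0.5 * (q @ a.t() + 1.0)          # cosine similarity mapped to [0, 1]
    return 1.0 - kernel.max(dim=1).values     # large value => far from every anchor
```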
✅ Computational Efficiency
Despite its advanced design, AKD introduces minimal computational overhead:
DATASET | METHOD | EPOCH TIME (SEC) | MEMORY (MB) |
---|---|---|---|
CIFAR-10 | PKT | 15.04 | 598.15 |
CIFAR-10 | AKD | 17.60 | 611.80 |
CIFAR-100 | PKT | 22.36 | 915.58 |
CIFAR-100 | AKD | 29.98 | 1,180.69 |
Table: Training time and memory footprint comparison (batch size = 128). Source: Spanos et al., 2025.
The slight increase is justified by significant performance gains, and no overhead is incurred at inference time.
Performance Evaluation: AKD vs. State-of-the-Art
The paper evaluates AKD across multiple datasets and tasks, demonstrating consistent superiority.
📊 Image Retrieval Results
MODEL | MAP (CIFAR-10) | TOP-100 PRECISION |
---|---|---|
Student | 38.96 | 59.17 |
PKT | 51.19 | 63.39 |
SP | 52.09 | 62.95 |
AKD | 52.45 | 64.43 |
AKD achieves the highest mAP and precision across all metrics.
📈 Image Classification on CIFAR-100
METHOD | AVG. TOP-1 ACCURACY (%) |
---|---|
Student | 71.53 |
CRD | 74.05 |
WTTM | 74.37 |
AKD | 74.25 |
AKD+CRD | 74.72 |
AKD alone matches or exceeds many state-of-the-art methods. When combined with CRD, it sets a new benchmark.
🛡️ Uncertainty Estimation (AUROC)
METHOD | CIFAR-10 | SVHN | TINY-IMAGENET |
---|---|---|---|
Softmax Entropy | 0.8120 | 0.6846 | 0.7692 |
Deep Ensemble (5) | 0.8575 | 0.6568 | 0.8164 |
AKD (Proposed) | 0.8594 | 0.7776 | 0.8434 |
AKD outperforms both softmax-based methods and even deep ensembles in average AUROC, proving its effectiveness in OoD detection.
Ablation Studies: What Makes AKD Work?
The authors conducted extensive ablation studies to validate each component.
Impact of Attention Mechanism
DATASET | AKD (W/O ATTENTION) | AKD (WITH ATTENTION) |
---|---|---|
CIFAR-100 | 75.76% | 76.44% |
Tiny-ImageNet | 49.95% | 50.02% |
Attention consistently improves performance by focusing on key features.
Effect of the Number of Anchors (\(N_A\))
- Performance improves slightly with more anchors.
- Recommended: \(N_A = N_c\) or \(N_A = 2N_c\), where \(N_c\) is the number of classes.
- Beyond this, gains are marginal.
Performance with Limited Labels
LABELED DATA (%) | TOP-1 ACCURACY (%) |
---|---|
100% | 76.44 |
20% | 75.95 |
0% (unsupervised) | 74.89 |
AKD remains effective even with very few labels, making it ideal for semi-supervised and online learning scenarios.
Why AKD Matters for Real-World AI
AKD isn’t just an academic advance—it has practical implications:
- Edge AI: Enables high-performance models on smartphones, IoT devices, and embedded systems.
- Healthcare & Robotics: Improves model reliability in safety-critical applications.
- Efficient Training: Reduces the need for large batch sizes or complex architectures.
- Trustworthy AI: Provides built-in uncertainty quantification without extra cost.
By focusing on representative samples and balanced distribution learning, AKD makes AI models not just smaller, but smarter and more trustworthy.
How to Implement AKD
While the full code is available on GitHub, here's a high-level implementation guide:
- Pre-train or load a teacher model.
- Extract features for all training data.
- Select anchors per class using centrality scoring.
- Train the student using the combined AKD loss: \[ L_{\text{total}} = L_{\text{KD}} + \lambda_{\text{AKD}} L_{\text{AKD}} \]
- Optionally combine with other KD methods (e.g., CRD, NORM) for even better results.
The method is model-agnostic and requires no architectural changes to the teacher or student.
Conclusion: The Future of Knowledge Distillation is Anchored
Anchor-Based Knowledge Distillation (AKD) represents a significant leap forward in model compression and trustworthy AI. By leveraging representative anchor samples, attention-guided learning, and intrinsic uncertainty estimation, AKD addresses the core weaknesses of traditional KD methods.
Its ability to improve accuracy, generalization, and reliability—while remaining computationally efficient—makes it a powerful tool for deploying AI in real-world applications.
As AI systems become more pervasive, methods like AKD will be essential for building models that are not only intelligent but also reliable, interpretable, and trustworthy.
Call to Action
Want to try AKD in your own projects?
👉 Download the official code from GitHub and start experimenting today!
Have questions or want to discuss the paper?
💬 Join the conversation on Reddit, Twitter, or LinkedIn using #AnchorKD and #TrustworthyAI.
And if you found this article helpful, share it with your network to help spread the knowledge!
Based on the research paper "Trustworthy knowledge distillation via anchor-guided distribution learning," I have written a complete, end-to-end Python implementation of the proposed Anchor-based Knowledge Distillation (AKD) method below; it is an independent reimplementation that follows the paper's description rather than the authors' official code.
```python
# Full implementation of Anchor-based Knowledge Distillation (AKD)
# Based on the paper: "Trustworthy knowledge distillation via anchor-guided distribution learning"
# Spanos, D., Passalis, N., & Tefas, A. (2025). Knowledge-Based Systems, 329, 114297.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18
import numpy as np
from tqdm import tqdm
from collections import defaultdict
# --- 1. Helper Functions and Loss Module ---
def pairwise_cosine_similarity(x, y=None, eps=1e-8):
"""
Calculates the pairwise cosine similarity matrix.
"""
x_norm = F.normalize(x, p=2, dim=1, eps=eps)
y_norm = F.normalize(y, p=2, dim=1, eps=eps) if y is not None else x_norm
return torch.mm(x_norm, y_norm.transpose(0, 1))
def similarity_kernel(cos_sim_matrix):
"""
The similarity kernel from the paper, which maps cosine similarity [-1, 1] to [0, 1].
K(v1, v2) = 0.5 * ( (v1^T * v2) / (||v1||*||v2||) + 1 )
"""
return 0.5 * (cos_sim_matrix + 1.0)
class AKDLoss(nn.Module):
"""
Implements the full Anchor-based Knowledge Distillation loss.
L_AKD = lambda_1 * L_1 + (1 - lambda_2) * L_2 + lambda_2 * L_3
"""
def __init__(self, lambda1=1.0, lambda2=0.1):
super(AKDLoss, self).__init__()
self.lambda1 = lambda1
self.lambda2 = lambda2
self.kl_div = nn.KLDivLoss(reduction='batchmean')
def _get_dist_loss(self, student_sim, teacher_sim):
"""Calculates the KL divergence between student and teacher distributions."""
# Use softmax to convert similarity scores to probability distributions
teacher_dist = F.softmax(teacher_sim, dim=1)
student_log_dist = F.log_softmax(student_sim, dim=1)
# KL divergence loss
loss = self.kl_div(student_log_dist, teacher_dist)
return loss
def forward(self, f_s_batch, f_t_batch, f_s_anchor, f_t_anchor):
"""
Calculates the three components of the AKD loss.
Args:
f_s_batch: Student features for the current mini-batch.
f_t_batch: Teacher features for the current mini-batch.
f_s_anchor: Student features for the anchor set.
f_t_anchor: Teacher features for the anchor set.
"""
# L1: Batch-to-Batch similarity loss
s_bb_sim = pairwise_cosine_similarity(f_s_batch)
t_bb_sim = pairwise_cosine_similarity(f_t_batch)
loss_1 = self._get_dist_loss(s_bb_sim, t_bb_sim)
# L2: Batch-to-Anchor similarity loss
s_ba_sim = pairwise_cosine_similarity(f_s_batch, f_s_anchor)
t_ba_sim = pairwise_cosine_similarity(f_t_batch, f_t_anchor)
loss_2 = self._get_dist_loss(s_ba_sim, t_ba_sim)
# L3: Anchor-to-Batch similarity loss (symmetric KL divergence)
s_ab_sim = s_ba_sim.t() # Transpose of student batch-anchor similarity
t_ab_sim = t_ba_sim.t() # Transpose of teacher batch-anchor similarity
loss_3 = self._get_dist_loss(s_ab_sim, t_ab_sim)
total_loss = self.lambda1 * loss_1 + (1 - self.lambda2) * loss_2 + self.lambda2 * loss_3
return total_loss
# --- 2. Model Definitions ---
# We define a generic wrapper to extract intermediate features from any model.
class FeatureExtractor(nn.Module):
def __init__(self, model, feature_layer_name):
super().__init__()
self.model = model
self.feature_layer_name = feature_layer_name
self.features = None
# Register a forward hook to capture the output of the desired layer
layer = dict([*self.model.named_modules()])[self.feature_layer_name]
layer.register_forward_hook(self._hook)
    def _hook(self, module, input, output):
        # Flatten spatial feature maps to (batch, dim) so that they can be used
        # directly in the pairwise similarity computations.
        self.features = torch.flatten(output, start_dim=1)
def forward(self, x):
logits = self.model(x)
# The hook will automatically store the features
return self.features, logits
# Example of a simple Student CNN
def create_student_model():
return nn.Sequential(
nn.Conv2d(3, 16, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2),
nn.Conv2d(16, 32, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2),
nn.Conv2d(32, 64, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2, 2), # Features will be extracted here
nn.Flatten(),
nn.Linear(64 * 4 * 4, 128),
nn.ReLU(),
nn.Linear(128, 10) # 10 classes
)
# --- 3. Main AKD Orchestrator Class ---
class AKD_Trainer:
def __init__(self, teacher, student, num_classes, akd_lambda1=1.0, akd_lambda2=0.1, use_attention=True, attention_size=14):
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {self.device}")
self.teacher = teacher.to(self.device).eval()
self.student = student.to(self.device)
self.num_classes = num_classes
self.akd_loss_fn = AKDLoss(akd_lambda1, akd_lambda2).to(self.device)
self.ce_loss_fn = nn.CrossEntropyLoss()
self.use_attention = use_attention
if self.use_attention:
# Learnable attention mask (Algorithm 1, applied to student anchors)
            # Created directly on the target device so it stays a leaf tensor the optimizer can update.
            self.attention_mask = nn.Parameter(torch.ones(1, 1, attention_size, attention_size, device=self.device))
self.anchors = None
self.anchor_labels = None
def select_anchors(self, dataset, num_anchors_per_class):
print("Selecting anchors...")
dataloader = DataLoader(dataset, batch_size=128, shuffle=False)
all_features = []
all_labels = []
with torch.no_grad():
for data, labels in tqdm(dataloader, desc="Extracting teacher features"):
data = data.to(self.device)
features, _ = self.teacher(data)
all_features.append(features.cpu())
all_labels.append(labels.cpu())
all_features = torch.cat(all_features)
all_labels = torch.cat(all_labels)
anchor_indices = []
for c in range(self.num_classes):
class_indices = (all_labels == c).nonzero(as_tuple=True)[0]
class_features = all_features[class_indices]
# Compute similarity matrix and centrality score (Eq. 6 & 7)
cos_sim = pairwise_cosine_similarity(class_features)
centrality_scores = cos_sim.sum(dim=1)
# Select top N_A samples with highest centrality
num_to_select = min(num_anchors_per_class, len(class_indices))
top_k_indices = torch.topk(centrality_scores, k=num_to_select).indices
# Map back to original dataset indices
anchor_indices.extend(class_indices[top_k_indices].tolist())
self.anchors = dataset.tensors[0][anchor_indices].to(self.device)
self.anchor_labels = dataset.tensors[1][anchor_indices].to(self.device)
print(f"Selected {len(self.anchors)} anchors.")
def train(self, train_loader, epochs, lr=1e-3, akd_weight=1.0, ce_weight=1.0):
params_to_optimize = list(self.student.parameters())
if self.use_attention:
params_to_optimize.append(self.attention_mask)
optimizer = optim.Adam(params_to_optimize, lr=lr)
self.student.train()
for epoch in range(epochs):
total_loss, total_akd_loss, total_ce_loss = 0, 0, 0
for data, target in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
data, target = data.to(self.device), target.to(self.device)
optimizer.zero_grad()
# Get Teacher outputs (no gradients needed)
with torch.no_grad():
f_t_batch, _ = self.teacher(data)
f_t_anchor, _ = self.teacher(self.anchors)
# Get Student outputs
f_s_batch, logits_s_batch = self.student(data)
student_anchor_input = self.anchors
if self.use_attention:
# Apply attention mask to anchor inputs for the student
img_size = self.anchors.shape[2:]
resized_mask = F.interpolate(self.attention_mask, size=img_size, mode='bilinear', align_corners=False)
student_anchor_input = self.anchors * resized_mask
f_s_anchor, _ = self.student(student_anchor_input)
# Calculate losses
loss_akd = self.akd_loss_fn(f_s_batch, f_t_batch, f_s_anchor, f_t_anchor)
loss_ce = self.ce_loss_fn(logits_s_batch, target)
loss = (akd_weight * loss_akd) + (ce_weight * loss_ce)
loss.backward()
optimizer.step()
total_loss += loss.item()
total_akd_loss += loss_akd.item()
total_ce_loss += loss_ce.item()
avg_loss = total_loss / len(train_loader)
avg_akd = total_akd_loss / len(train_loader)
avg_ce = total_ce_loss / len(train_loader)
print(f"Epoch {epoch+1} Summary: Avg Loss: {avg_loss:.4f}, AKD Loss: {avg_akd:.4f}, CE Loss: {avg_ce:.4f}")
def estimate_uncertainty(self, x):
"""
Estimates predictive uncertainty for a given input tensor x.
u = 1 - max_i K(f(x), f(a_i)) (Eq. 16 & 17)
"""
if self.anchors is None:
raise ValueError("Anchors must be selected before estimating uncertainty.")
self.student.eval()
x = x.to(self.device)
with torch.no_grad():
f_x, _ = self.student(x)
f_anchors, _ = self.student(self.anchors)
# Cosine similarity between input features and all anchor features
cos_sims = pairwise_cosine_similarity(f_x, f_anchors)
# Map to [0, 1] using the paper's kernel
kernel_sims = similarity_kernel(cos_sims)
# Uncertainty is 1 minus the max similarity to any anchor
max_sim_per_sample, _ = torch.max(kernel_sims, dim=1)
uncertainty = 1.0 - max_sim_per_sample
return uncertainty.cpu().numpy()
# --- 4. Example Usage ---
if __name__ == '__main__':
# Hyperparameters
NUM_CLASSES = 10
NUM_ANCHORS_PER_CLASS = 5
IMG_SIZE = 32
BATCH_SIZE = 64
# Create dummy data (replace with a real dataset like CIFAR-10)
print("Creating dummy dataset...")
dummy_data = torch.randn(1000, 3, IMG_SIZE, IMG_SIZE)
dummy_labels = torch.randint(0, NUM_CLASSES, (1000,))
dummy_dataset = TensorDataset(dummy_data, dummy_labels)
dummy_loader = DataLoader(dummy_dataset, batch_size=BATCH_SIZE, shuffle=True)
# Initialize Teacher and Student models
# Teacher: A pre-trained ResNet-18 (we use a random one for demo)
teacher_base = resnet18(weights=None, num_classes=NUM_CLASSES)
teacher_model = FeatureExtractor(teacher_base, 'layer4') # Extract features from last block
# Student: A smaller custom CNN
student_base = create_student_model()
    student_model = FeatureExtractor(student_base, '8')  # Extract features after the last MaxPool (index 8 in the Sequential)
# Setup AKD Trainer
distiller = AKD_Trainer(teacher_model, student_model, num_classes=NUM_CLASSES, use_attention=True)
# 1. Select Anchors using the teacher model
distiller.select_anchors(dummy_dataset, num_anchors_per_class=NUM_ANCHORS_PER_CLASS)
# 2. Train the student model using knowledge distillation
print("\nStarting distillation training...")
distiller.train(dummy_loader, epochs=5, lr=1e-3, akd_weight=2.5, ce_weight=1.0)
print("Distillation complete.")
# 3. Demonstrate uncertainty estimation on a few samples
print("\nEstimating uncertainty for 5 random samples...")
sample_to_test = torch.randn(5, 3, IMG_SIZE, IMG_SIZE)
uncertainty_scores = distiller.estimate_uncertainty(sample_to_test)
for i, score in enumerate(uncertainty_scores):
print(f"Sample {i+1}: Uncertainty = {score:.4f}")
    # A higher score means the sample is less similar to any of the learned anchors,
    # indicating it might be out-of-distribution or from an ambiguous region.
```
References
- Spanos, D., Passalis, N., & Tefas, A. (2025). Trustworthy knowledge distillation via anchor-guided distribution learning. Knowledge-Based Systems, 329, 114297. https://doi.org/10.1016/j.knosys.2025.114297
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531.
- Passalis, N., & Tefas, A. (2018). Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV).
This article is based on the original research paper and aims to make advanced AI concepts accessible to a broader audience.