Knowledge Distillation Meets Integrated Gradients: A Smarter Way to Compress Neural Networks

Analysis by the aitrendblend editorial team • Published June 2026 • 8 min read

Tags: Model Compression, Knowledge Distillation, Explainable AI, Edge AI, CIFAR-10, MobileNetV2

Figure: Integrated gradient maps overlaid on CIFAR-10 training images, highlighting the class-discriminative features used in knowledge distillation compression.

Imagine watching someone take an expert’s detailed reasoning, strip out everything except the most important cues, and hand those cues to a student who has never seen the full picture. That is roughly what a team from National Cheng Kung University pulled off with a deceptively simple idea: take the explainability maps a large model produces and bake them directly into the training data for the smaller model that replaces it.

Key Points

A 4.1x compressed MobileNetV2 student trained with integrated gradient augmentation reaches 92.45% accuracy on CIFAR-10, beating both the standard student baseline (91.43%) and knowledge distillation alone (92.29%).
The approach reduces inference latency from 140 ms to 13 ms, a 10.8x speedup, without any hardware-specific tricks.
Integrated gradient maps are precomputed once before student training starts, so the runtime overhead is minimal.
The optimal IG overlay probability is surprisingly low at p = 0.1. The higher values tested, 0.25 and 0.5, both hurt generalization.
Ablation results confirm that KD and IG augmentation contribute different and complementary signals, not redundant ones.
The method works at the pixel level, giving the student model explicit visual hints rather than abstract probability distributions.

The Problem That Has Not Gone Away

Anyone who has tried to ship a decent image classifier on a microcontroller or a low-end phone knows the core tension. The models that perform well are enormous. The devices that need to run them are not. Quantisation and pruning chip away at parameters, but aggressive compression tends to degrade accuracy in nonlinear ways. Knowledge distillation, the technique where a compact student model learns from a larger teacher model, has become the standard middle ground, and it works well enough that most published benchmarks on CIFAR-10 report accuracy drops of well under one percentage point at moderate compression ratios.

But there is a gap in what knowledge distillation actually transfers. When the teacher passes soft probability distributions to the student, it is communicating which classes are similar, not which pixels or regions drove that conclusion. A soft label that reads 40% automobile, 35% truck, and 25% everything else tells the student something real about the decision boundary, but it says nothing about whether the model was looking at the wheels, the grille, or the silhouette. That is useful information, and standard KD throws it away.

The paper from Hernandez, Chang, and Nordling at arXiv:2503.13008v1 asks a reasonable question: what if you did not throw it away? What if you used the teacher’s own explainability output as a data augmentation signal, literally overlaying the attention map onto the training image before the student ever sees it?

Where Integrated Gradients Fit In

Integrated gradients (IG), introduced by Sundararajan, Taly, and Yan in 2017, is one of the more principled attribution methods in the explainability toolkit. The core idea is to accumulate gradient information along a straight path from a baseline input (typically a black image) to the actual input, measuring how much each pixel’s shift away from zero contributed to the final prediction. The result is a heatmap that satisfies two formal properties: sensitivity, meaning a pixel that changes the output is always attributed a nonzero value, and implementation invariance, meaning two functionally identical networks produce identical attributions regardless of their internal structure.

That rigor matters in a compression context because you need the attribution signal to be trustworthy. If the heatmap is unstable or depends on implementation quirks, overlaying it on training images introduces noise rather than signal. IG’s axiomatic guarantees give you some confidence that the bright regions in the map actually reflect the teacher’s decision process.

The formulation from the paper expresses this as a path integral over the prediction function F of the teacher model:

EQUATION 1 — Integrated Gradients attribution

$$\text{IG}_i(x) = (x_i - x'_i) \int_{\beta=0}^{1} \frac{\partial F(x' + \beta(x - x'))}{\partial x_i} \, d\beta$$

Here x is the input image, x’ is the baseline (black image), and the integral accumulates how the gradient of the prediction with respect to pixel i changes as you interpolate from baseline to input. In practice this integral is approximated with a finite number of steps, typically 50 to 300, which is why computing it for the entire training set is expensive enough to worry about.
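To make the finite-step approximation concrete, here is a minimal sketch in plain Python. It assumes a toy three-input function `f` with a hand-written gradient in place of a real network, and a midpoint Riemann sum with 50 steps; none of these names come from the paper. It also checks the completeness property of integrated gradients, namely that the attributions sum to F(x) minus F(baseline).

```python
import math

def f(x):
    """Toy differentiable 'model' (an assumption): F(x) = x0 * x1 + sin(x2)."""
    return x[0] * x[1] + math.sin(x[2])

def grad_f(x):
    """Analytic gradient of the toy function f."""
    return [x[1], x[0], math.cos(x[2])]

def integrated_gradients(x, baseline, n_steps=50):
    """Midpoint Riemann-sum approximation of Equation 1."""
    dim = len(x)
    avg_grad = [0.0] * dim
    for k in range(n_steps):
        beta = (k + 0.5) / n_steps                      # midpoint of step k
        point = [b + beta * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(dim):
            avg_grad[i] += g[i] / n_steps               # running average gradient
    # Scale the averaged gradient by each input's distance from the baseline
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]

x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]                              # stands in for a black image
attributions = integrated_gradients(x, baseline)
completeness_gap = abs(sum(attributions) - (f(x) - f(baseline)))
print(attributions, completeness_gap)
```

Raising `n_steps` shrinks `completeness_gap`, which is the same accuracy-versus-cost trade-off that makes the step count matter on a full training set.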

Why This Differs From Attention Transfer

Attention transfer, used by Zhao et al. (2020) to reach 94.42% at 3.19x compression, guides the student by matching intermediate feature maps from the teacher. IG augmentation works earlier in the pipeline. Instead of adding a loss term that penalizes mismatched activations, it modifies the input data itself, letting the student’s own supervised training signal absorb the feature hint without any architectural coupling between teacher and student.

The Overlay Strategy: Simple and Specific

The way the team deploys IG is worth unpacking in detail, because the design choices are not obvious.

First, the maps are precomputed. Before student training begins, the team runs every training image through the teacher and computes its IG map. This converts what would be a per-batch attribution computation, at roughly 50 to 300 forward passes per image per step, into a one-time preprocessing cost that scales with dataset size, not training duration. On CIFAR-10 with 50,000 training images that is manageable. On ImageNet it would require more careful planning, but the principle holds.

Second, the map intensity is randomized with a factor sampled log-uniformly between 1 and 2, that is, the exponential of a uniform draw between ln 1 and ln 2. The map is then normalized to the range zero to one. This prevents the overlay from always emphasizing regions at fixed contrast, which could encourage the student to latch onto a specific visual artifact rather than learning to recognize the underlying feature.
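The randomization step can be sketched in a few lines of plain Python. One assumption is worth flagging: this sketch applies the sampled factor as an exponent on attribution magnitudes, matching the implementation listing later in this article. A purely multiplicative factor would cancel out under min-max normalization, so the exponent reading is the one that actually changes the map. The helper names and the toy values are hypothetical.

```python
import math
import random

def sample_log_uniform(low=1.0, high=2.0, rng=random):
    """Draw s so that ln(s) is uniform on [ln(low), ln(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def min_max_normalize(values):
    """Rescale a flat list to [0, 1]; a constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi - lo < 1e-8:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def randomize_ig_map(ig_values, rng=random):
    """Assumption: s acts as an exponent on the attribution magnitudes."""
    s = sample_log_uniform(rng=rng)
    powered = [abs(v) ** s for v in ig_values]
    return min_max_normalize(powered), s

rng = random.Random(0)
toy_map = [0.02, -0.10, 0.40, 0.05]    # hypothetical flattened IG values
normalized, s = randomize_ig_map(toy_map, rng=rng)
print(s, normalized)
```

Larger exponents push small attributions toward zero relative to the peak, so each draw yields a slightly different contrast for the same underlying map.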

Third, the overlay itself is stochastic. With probability p, the final training image blends the original and the normalized IG map at equal weight:

EQUATION 2 — Stochastic overlay augmentation

$$\hat{x}_{\text{aug}} = \begin{cases} 0.5 \cdot x + 0.5 \cdot \hat{\text{IG}}(x) & \text{with probability } p \\ x & \text{otherwise} \end{cases}$$

The equal blend weight is a design choice worth scrutinizing. A heavier weight on the IG map would more aggressively highlight the discriminative region but would also distort the raw photometric signal the student relies on for other learned features. The authors do not report sensitivity to this blend ratio, which is a gap future work could fill.
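As a rough illustration of Equation 2, the sketch below applies the stochastic overlay to a flat list of pixel values and measures how often the blend fires. It assumes the image and the IG map are both already scaled to [0, 1]; the function name `ig_overlay` and the toy values are hypothetical.

```python
import random

def ig_overlay(image, ig_map, p=0.1, rng=random):
    """Equation 2 on flat pixel lists: 50/50 blend with probability p."""
    if rng.random() < p:
        blended = [0.5 * px + 0.5 * ig for px, ig in zip(image, ig_map)]
        return blended, True
    return list(image), False             # clean copy of the original image

rng = random.Random(42)
image = [0.2, 0.8, 0.5, 0.1]              # hypothetical pixels in [0, 1]
ig_map = [0.0, 1.0, 0.3, 0.0]             # hypothetical normalized IG map

n_draws = 10_000
n_blended = sum(ig_overlay(image, ig_map, p=0.1, rng=rng)[1] for _ in range(n_draws))
overlay_fraction = n_blended / n_draws
print(f"fraction of blended samples: {overlay_fraction:.3f}")
```

Because the blend is a convex combination, the output stays inside [0, 1] whenever both inputs do, so no extra clipping is needed under that assumption.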

The student model never sees a direct gradient from the teacher’s explainability system. It just sees images where important regions look slightly different, and it learns accordingly.

Paraphrase of the central mechanism from arXiv:2503.13008v1

Knowledge Distillation as the Backbone

The IG overlay does not replace knowledge distillation; it augments it. The distillation loss keeps its standard form: a weighted sum of cross-entropy against hard labels and Kullback-Leibler divergence against the teacher’s softened output distribution.

EQUATION 3 — Combined KD loss

$$\mathcal{L}_{\text{KD}} = (1 - \alpha)\,\mathcal{L}_H + \alpha\,\mathcal{L}_{\text{KL}}$$
$$\mathcal{L}_H = -\sum_i y_i \log f_s(x)_i$$
$$\mathcal{L}_{\text{KL}} = \sum_i f_t(x;T)_i \log \frac{f_t(x;T)_i}{f_s(x;T)_i}$$

The temperature parameter T softens the teacher’s probability distribution before the KL divergence is computed. A higher T spreads probability mass more evenly across classes, amplifying the inter-class similarity signal. A lower T preserves the hard, peaked distribution the teacher produces at inference time.
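The effect of temperature is easy to see numerically. The sketch below applies a temperature-scaled softmax to a made-up logit vector (the values are an assumption, not taken from the paper) and compares T = 1 with T = 2.5.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 3.0, 0.5]                  # hypothetical teacher logits
sharp = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=2.5)
print("T=1.0:", [round(p, 3) for p in sharp], "entropy", round(entropy(sharp), 3))
print("T=2.5:", [round(p, 3) for p in soft], "entropy", round(entropy(soft), 3))
```

At the higher temperature the top class loses mass to the others and the entropy rises, which is the softening that exposes inter-class similarity to the student.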

Through a grid search over 5 to 9 values of T, the distillation weight alpha, and the overlay probability p, the team found that T = 2.5 and alpha = 0.01 worked best. The low alpha is notable. It means hard labels still dominate the training signal, and the KL divergence term contributes a small corrective nudge. That makes sense on CIFAR-10, where the training labels are clean and reliable. On a noisier dataset the optimal alpha might shift upward.
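To see how little weight alpha = 0.01 puts on the distillation term, the following sketch evaluates Equation 3 by hand for a single sample. It follows the equation as printed above, without the T-squared factor that many KD implementations, including the PyTorch listing later in this article, multiply into the KL term. The logits are made up for illustration.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax with max-shift for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss_single(student_logits, teacher_logits, label, T=2.5, alpha=0.01):
    """Equation 3 for one sample: (1 - alpha) * CE + alpha * KL."""
    ce = -math.log(softmax(student_logits)[label])       # hard-label term
    q_t = softmax(teacher_logits, T)                      # softened teacher
    q_s = softmax(student_logits, T)                      # softened student
    kl = sum(t * math.log(t / s) for t, s in zip(q_t, q_s))
    return (1 - alpha) * ce + alpha * kl, ce, kl

teacher_logits = [4.0, 3.0, 0.5]          # hypothetical values
student_logits = [2.0, 2.5, 0.0]          # hypothetical values
loss, ce, kl = kd_loss_single(student_logits, teacher_logits, label=0)
print(f"CE={ce:.4f}  KL={kl:.4f}  combined={loss:.4f}")
```

With these settings the combined loss is 99% cross-entropy, so the KL term acts as the small corrective nudge described above.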

Optimal Hyperparameter Summary

Temperature T = 2.5 balances class relationship preservation against categorical sharpness. Distillation weight alpha = 0.01 keeps hard labels dominant. IG overlay probability p = 0.1 provided the best generalization. Probabilities of 0.25 and 0.5 degraded accuracy, suggesting the student benefits from seeing clean, unadulterated images most of the time.

What the Numbers Actually Say

The teacher here is MobileNetV2 at 2.2 million parameters, trained to 93.91% on the CIFAR-10 test set. The student is a reduced version with 543,498 parameters, achieved by trimming both depth and width while preserving the early feature extraction layers. The 4.1x parameter ratio is more moderate than the most aggressive results in the literature, where Bhardwaj et al. (2019) reached 19.35x compression with a custom NoNN architecture, but it targets a different point on the tradeoff curve.

Table 1. Test accuracy by method, CIFAR-10 (4.1x compression, 543K parameters)
Method	Accuracy (%)	Gain vs baseline
Teacher (MobileNetV2)	93.91	Reference
Baseline student (no distillation)	91.43	0.00
KD only	92.29	+0.86
IG augmentation only	92.01	+0.58
KD and IG combined	92.45	+1.02

The combined result preserves 98.4% of the teacher’s accuracy while using 24.3% of its parameters. The inference speedup is disproportionately large: latency drops from 140 ms to 13 ms, a 10.8x reduction despite only a 4.1x parameter reduction. The authors attribute this to reduced memory bandwidth requirements and better cache utilization in the smaller model, both of which scale nonlinearly with model size on modern GPU memory hierarchies.

Table 2. Inference latency, NVIDIA RTX 3090
Model	Latency (ms)	Speedup
Teacher (MobileNetV2, 2.2M params)	140	1.0x
Student (KD and IG, 543K params)	13	10.8x

The ablation results are arguably the most interesting piece of the paper. IG augmentation alone gives +0.58 percentage points. KD alone gives +0.86. The combination gives +1.02, which is less than the 1.44 points the two individual gains would sum to. That sub-additive result suggests the two signals overlap partially, which is expected: both ultimately try to help the student focus on what matters. But the fact that the combination still outperforms either alone indicates they contribute non-redundant information.
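The derived figures quoted in this section can be reproduced directly from the two tables. The short script below does that arithmetic; the dictionary keys are just labels chosen for this sketch.

```python
# Accuracies (%) from Table 1 and latencies (ms) from Table 2
acc = {"teacher": 93.91, "baseline": 91.43, "kd": 92.29, "ig": 92.01, "kd_ig": 92.45}
latency_ms = {"teacher": 140, "student": 13}

# Gain of each method over the undistilled baseline student
gains = {k: round(acc[k] - acc["baseline"], 2) for k in ("kd", "ig", "kd_ig")}

# If the two signals were fully independent, the gains would simply add
additive_gain = round(gains["kd"] + gains["ig"], 2)
overlap = round(additive_gain - gains["kd_ig"], 2)

retention_pct = round(100 * acc["kd_ig"] / acc["teacher"], 1)
speedup = round(latency_ms["teacher"] / latency_ms["student"], 1)

print("gains over baseline:", gains)
print("sum of individual gains:", additive_gain, "| shortfall of combined:", overlap)
print("accuracy retention (%):", retention_pct, "| latency speedup:", speedup)
```

The shortfall between the additive figure and the combined gain is one simple way to quantify how much the two signals overlap.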

Reading the Broader Comparison

The paper situates itself well in the literature. Figure 1 in the original paper plots every comparison study as a line from teacher accuracy to student accuracy, with the x-axis as compression factor. The mean compression factor across the compared studies is 9.57x and the mean accuracy is 91.51%. This paper’s result sits at a lower compression factor (4.1x) but with accuracy above the mean.

The fairest comparison is probably against Zhao et al. (2020), which achieved 94.42% at 3.19x compression using a feature-highlighting approach called collaborative teaching, and against Su et al. (2022), which reached 94.69% at 3.26x. Both of those target a similar compression range and both outperform this paper in absolute accuracy. What this paper trades for its 2.0 to 2.2 percentage point deficit is a fundamentally different mechanism. It does not require intermediate feature matching between teacher and student, which means the teacher and student architectures can differ more freely. You do not need to align the activation dimensions of corresponding layers.

The comparison against Bhardwaj et al. (2019) is less direct. Their 19.35x compression at 94.53% accuracy uses a custom NoNN (Network of Neural Networks) architecture designed for IoT deployment, so the result is not directly comparable to a general KD framework. Still, it is a useful reminder that 4.1x is a conservative compression target, and the team’s framing around edge computing would be strengthened by testing at higher compression ratios.

If you have been following our coverage on model compression and knowledge distillation, the pattern here fits a broader shift. The field is moving away from treating the student’s output distribution as the only thing that matters and toward richer representations of what the teacher actually learned. Attention transfer, hint-based distillation, and now IG augmentation are all variants of the same instinct: the soft labels are not enough, tell the student more.

A Reproducible PyTorch Implementation

Below is a runnable implementation of the KD and IG augmentation pipeline. All components are present, including the IG precomputation, the augmented dataset, the distillation loss, the training loop, and a smoke test on dummy data. The student network is a simple illustrative CNN, not the paper’s reduced MobileNetV2.

# ============================================================
# KD + Integrated Gradients Compression — Full Implementation
# Based on arXiv:2503.13008v1 (Hernandez, Chang, Nordling)
# Tested with PyTorch 2.2, torchvision 0.17
# ============================================================

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset
import numpy as np


# ------------------------------------------------------------
# 1. Integrated Gradients computation
# ------------------------------------------------------------
def compute_integrated_gradients(
    model: nn.Module,
    input_img: torch.Tensor,   # (1, C, H, W)
    target_class: int,
    n_steps: int = 50,
    baseline: torch.Tensor = None,
) -> torch.Tensor:
    """
    Returns a (C, H, W) tensor with the IG attribution map.
    Positive values indicate pixels that increase prediction score.
    """
    if baseline is None:
        baseline = torch.zeros_like(input_img)   # black-image baseline

    model.eval()
    # Interpolation coefficients along the straight path baseline -> input
    alphas = torch.linspace(0, 1, n_steps + 1, device=input_img.device)
    interpolated = torch.stack([
        baseline + alpha * (input_img - baseline)
        for alpha in alphas
    ]).squeeze(1)   # (steps+1, C, H, W)

    interpolated.requires_grad_(True)
    logits = model(interpolated)
    score = logits[:, target_class].sum()
    # autograd.grad avoids accumulating gradients on the model's parameters
    grads = torch.autograd.grad(score, interpolated)[0]   # (steps+1, C, H, W)

    avg_grads = grads.mean(dim=0)     # Riemann-sum estimate of the path integral
    ig = (input_img.squeeze(0) - baseline.squeeze(0)) * avg_grads
    return ig.detach()


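# ------------------------------------------------------------
# 2. Normalize and scale a single IG map
# ------------------------------------------------------------
def normalize_ig_map(ig: torch.Tensor) -> torch.Tensor:
    """
    Log-uniform random exponent, then min-max normalization to [0, 1].
    ig shape: (C, H, W). Magnitudes are used, so signed maps are accepted.
    """
    # Exponent s is log-uniform in [1, 2]; .item() yields a Python float,
    # so the power below works whether ig lives on the CPU or the GPU
    s = torch.empty(1).uniform_(np.log(1), np.log(2)).exp().item()
    ig_scaled = ig.abs() ** s   # abs() avoids NaN from negative bases
    lo, hi = ig_scaled.min(), ig_scaled.max()
    if (hi - lo) < 1e-8:
        return torch.zeros_like(ig)
    return (ig_scaled - lo) / (hi - lo)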


# ------------------------------------------------------------
# 3. Precompute IG maps for the whole training set
# ------------------------------------------------------------
def precompute_ig_maps(
    teacher: nn.Module,
    dataset,                # raw dataset that returns (img_tensor, label)
    device: str = "cpu",
    n_steps: int = 50,
) -> list:
    """
    Returns a list of (C, H, W) normalized IG tensors,
    one per training sample.
    Expensive once; free during training.
    """
    teacher.to(device).eval()
    ig_maps = []
    for idx in range(len(dataset)):
        img, label = dataset[idx]
        inp = img.unsqueeze(0).to(device)
        # Attribute with respect to the ground-truth class of each sample
        ig_raw = compute_integrated_gradients(
            teacher, inp, int(label), n_steps=n_steps
        )
        ig_norm = normalize_ig_map(ig_raw.abs())
        ig_maps.append(ig_norm.cpu())   # keep maps on the CPU to save GPU memory
        if idx % 1000 == 0:
            print(f"  Precomputed {idx}/{len(dataset)} IG maps")
    return ig_maps


# ------------------------------------------------------------
# 4. Dataset that applies the stochastic IG overlay
# ------------------------------------------------------------
class IGAugmentedDataset(Dataset):
    def __init__(self, base_dataset, ig_maps, p: float = 0.1):
        self.base = base_dataset
        self.ig_maps = ig_maps
        self.p = p

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]
        if torch.rand(1).item() < self.p:
            ig = self.ig_maps[idx]
            img = 0.5 * img + 0.5 * ig   # blend
        return img, label


# ------------------------------------------------------------
# 5. Compact student model (illustrative stand-in, see docstring)
# ------------------------------------------------------------
class CompactStudent(nn.Module):
    """
    Lightweight CNN for CIFAR-10.
    About 1.17M parameters at base_ch=32, so this is NOT the paper's
    543K-parameter reduced MobileNetV2; lower base_ch to shrink it.
    """
    def __init__(self, num_classes: int = 10, base_ch: int = 32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, base_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(base_ch),
            nn.ReLU(inplace=True),
        )
        self.block1 = self._make_block(base_ch, base_ch * 2, stride=2)
        self.block2 = self._make_block(base_ch * 2, base_ch * 4, stride=2)
        self.block3 = self._make_block(base_ch * 4, base_ch * 8, stride=2)
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(base_ch * 8, num_classes)

    def _make_block(self, in_ch, out_ch, stride):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)


# ------------------------------------------------------------
# 6. Knowledge distillation loss
# ------------------------------------------------------------
def kd_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    hard_labels: torch.Tensor,
    T: float = 2.5,
    alpha: float = 0.01,
) -> torch.Tensor:
    """
    Weighted sum of cross-entropy (hard labels) and
    KL-divergence (teacher soft labels).
    alpha = weight on KL; (1 - alpha) on CE.
    """
    ce = F.cross_entropy(student_logits, hard_labels)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # The T ** 2 factor is a common KD convention that keeps gradient
    # magnitudes comparable across temperatures; Equation 3 omits it
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    return (1 - alpha) * ce + alpha * kl


# ------------------------------------------------------------
# 7. Training loop
# ------------------------------------------------------------
def train_one_epoch(
    student, teacher, loader, optimizer, device, T=2.5, alpha=0.01
):
    student.train()
    teacher.eval()   # teacher stays frozen throughout
    total_loss, correct, total = 0.0, 0, 0
    for imgs, labels in loader:
        imgs, labels = imgs.to(device), labels.to(device)
        with torch.no_grad():
            t_logits = teacher(imgs)
        s_logits = student(imgs)
        loss = kd_loss(s_logits, t_logits, labels, T=T, alpha=alpha)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * imgs.size(0)
        preds = s_logits.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += imgs.size(0)
    return total_loss / total, 100.0 * correct / total


# ------------------------------------------------------------
# 8. Evaluation function
# ------------------------------------------------------------
def evaluate(model, loader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for imgs, labels in loader:
            imgs, labels = imgs.to(device), labels.to(device)
            preds = model(imgs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += imgs.size(0)
    return 100.0 * correct / total


# ------------------------------------------------------------
# 9. Smoke test on dummy data (no real dataset required)
# ------------------------------------------------------------
if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Device: {device}")

    # Tiny dummy dataset
    N, C, H, W = 64, 3, 32, 32
    dummy_imgs = torch.randn(N, C, H, W)
    dummy_labels = torch.randint(0, 10, (N,))

    class TinyDataset(Dataset):
        def __len__(self): return N
        def __getitem__(self, i): return dummy_imgs[i], dummy_labels[i]

    base_ds = TinyDataset()

    # Build teacher (tiny version for smoke test)
    teacher = CompactStudent(num_classes=10, base_ch=64).to(device)
    teacher.eval()

    # Precompute IG maps (n_steps=5 for speed in smoke test)
    print("Precomputing IG maps...")
    ig_maps = precompute_ig_maps(teacher, base_ds, device=device, n_steps=5)

    # Build augmented dataset and loader
    aug_ds = IGAugmentedDataset(base_ds, ig_maps, p=0.1)
    loader = DataLoader(aug_ds, batch_size=16, shuffle=True)

    # Build student
    student = CompactStudent(num_classes=10, base_ch=32).to(device)
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

    # One training epoch
    loss, acc = train_one_epoch(student, teacher, loader, optimizer, device)
    print(f"Epoch loss: {loss:.4f} | Train acc: {acc:.1f}%")

    # Eval pass
    val_loader = DataLoader(base_ds, batch_size=16)
    val_acc = evaluate(student, val_loader, device)
    print(f"Val acc (random baseline expected ~10%): {val_acc:.1f}%")
    print("Smoke test passed.")

Honest Limitations

What This Paper Does Not Settle

Dataset scope. Every experiment runs on CIFAR-10, a dataset with 32×32 images and 10 balanced classes. How the IG overlay generalizes to ImageNet, with its 224×224 images and 1,000 classes, is entirely unknown. The spatial structure of IG maps at higher resolution may behave very differently.

Compression ratio coverage. The 4.1x compression factor sits well below the literature mean of 9.57x. Whether the IG advantage holds or degrades at deeper compression is not tested. It is plausible that at very high compression ratios, the student’s limited capacity cannot absorb the pixel-level hints efficiently.

Blend weight sensitivity. The equal 50/50 blend between the original image and the IG map is a fixed design choice that the paper does not ablate. A heavier IG weight might help in some domains and hurt in others.

Computational cost of IG precomputation. The paper claims this is a manageable one-time cost on CIFAR-10. On a 50,000-image dataset with 50 integration steps per image, that is 2.5 million teacher gradient evaluations (a forward and a backward pass each) before training even begins. On larger datasets this could become the bottleneck, and alternatives like SmoothGrad or gradient-times-input might be worth comparing on both cost and augmentation quality.
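A back-of-the-envelope estimate makes the scaling concern concrete. The sketch below counts gradient evaluations and the storage needed to keep every map in memory. It assumes full-resolution, three-channel float32 maps and, for ImageNet, the rounded figure of 1.2 million training images at 224×224; both helper functions are hypothetical.

```python
def ig_gradient_evaluations(n_images, n_steps):
    """One teacher forward and backward pass per integration step per image."""
    return n_images * n_steps

def ig_storage_bytes(n_images, channels, height, width, bytes_per_value=4):
    """Storage for one full-resolution map per image (float32 by default)."""
    return n_images * channels * height * width * bytes_per_value

# CIFAR-10: 50,000 training images at 3 x 32 x 32
cifar_evals = ig_gradient_evaluations(50_000, 50)
cifar_bytes = ig_storage_bytes(50_000, 3, 32, 32)

# ImageNet-scale assumption: 1.2M images at 3 x 224 x 224
imagenet_evals = ig_gradient_evaluations(1_200_000, 50)
imagenet_bytes = ig_storage_bytes(1_200_000, 3, 224, 224)

print(f"CIFAR-10: {cifar_evals:,} evaluations, {cifar_bytes / 1e6:.1f} MB")
print(f"ImageNet: {imagenet_evals:,} evaluations, {imagenet_bytes / 1e9:.1f} GB")
```

Under these assumptions the storage footprint grows faster than the compute, which suggests that single-channel or lower-precision maps would be worth considering at larger scale.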

Single architecture pair. MobileNetV2 teacher to a custom compact student is the only pair tested. Whether the method works when the teacher and student architectures are more dissimilar, say a ResNet-50 teacher to a small CNN student, is an open question.

Why This Matters for Edge Deployment

The practical argument for this method is straightforward. Most edge deployment pipelines already run some form of knowledge distillation. The only new engineering requirement here is a preprocessing step that generates and stores IG maps before training starts. The training loop itself does not change. The loss function does not change. The student architecture does not need to expose intermediate layers for matching. That is a low adoption cost for an accuracy gain that, on CIFAR-10, ranged from +0.16 percentage points over KD alone to +1.02 over the undistilled baseline.

The interpretability angle is genuinely useful, not just a selling point. When you deploy a compressed model to a medical device or a safety-critical system, knowing that the compressed model attends to the same image regions as the full model is a meaningful assurance. Attention transfer methods provide a similar guarantee implicitly, through matched feature maps. IG augmentation makes it explicit at the input level, which is easier to visualize and audit.

The connection to explainable AI is also worth noting for teams working on explainability-constrained deployments. If a regulation or an internal policy requires that a model’s decisions be attributable to specific input features, a student trained with IG augmentation inherits a documented feature-alignment relationship with the teacher model, which is a stronger audit trail than a model trained purely with soft labels.

Frequently Asked Questions

What is integrated gradients and why use it for knowledge distillation?

Integrated gradients is an attribution method that quantifies how much each input pixel contributed to a model’s prediction, computed by accumulating gradients along a path from a blank baseline to the actual image. In knowledge distillation, the teacher’s soft labels communicate class similarity but not which image regions drove the decision. Using IG maps as a data augmentation layer gives the student explicit pixel-level guidance about what the teacher considered important, complementing the output-level signal from standard distillation.

Why is the IG overlay probability set so low at p = 0.1?

The authors found through grid search that the higher probabilities tested, 0.25 and 0.5, degraded student accuracy. The interpretation is that seeing the IG overlay too often causes the student to over-focus on specific attributed regions at the expense of learning diverse representations. At p = 0.1, the student sees the clean original image 90% of the time, with occasional IG-highlighted versions providing feature guidance without dominating the training signal.

How does this approach compare to attention transfer in knowledge distillation?

Attention transfer methods like the one used by Zhao et al. (2020) add a loss term that penalizes the student for having different intermediate feature activations than the teacher, which requires aligning the architectures so that corresponding layers can be compared. IG augmentation works entirely on the input data and requires no architectural coupling. The teacher and student can have completely different internal structures, which gives more flexibility when designing the student for a specific hardware target.

Is the 10.8x inference speedup just from the parameter reduction?

Not entirely. A 4.1x parameter reduction produces a 10.8x latency speedup because smaller models have reduced memory bandwidth requirements and better cache utilization on GPU hardware. Memory access patterns scale nonlinearly with model size, so the efficiency gain from fitting more of the model into on-chip cache can exceed the raw parameter ratio. This effect is more pronounced on memory-bandwidth-limited hardware like mobile processors than on data center GPUs.

Can this method be used with other explainability techniques besides integrated gradients?

In principle yes. SmoothGrad, gradient-times-input, and SHAP-based attribution methods all produce pixel-level importance maps that could substitute for IG in the augmentation pipeline. The advantage of IG is its axiomatic guarantees around sensitivity and implementation invariance, which make the attribution signal more trustworthy. Gradient-times-input is faster to compute and could be a practical alternative when precomputing maps for large datasets, though its theoretical properties are weaker.
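The weaker theoretical footing of gradient-times-input shows up on a standard saturating toy function, F(x) = 1 - ReLU(1 - x). The sketch below is a one-dimensional illustration with a hand-written gradient, not a statement about how either method behaves on real images. Once the function has saturated, the local gradient is zero, so gradient-times-input assigns no credit even though the output moved from 0 to 1 relative to the baseline.

```python
def f(x):
    """Saturating toy function: rises linearly on [0, 1], then stays at 1."""
    return 1.0 - max(0.0, 1.0 - x)

def grad_f(x):
    """Derivative of f: 1 before saturation, 0 after."""
    return 1.0 if x < 1.0 else 0.0

def gradient_times_input(x):
    """Uses only the local gradient at the input itself."""
    return grad_f(x) * x

def integrated_gradients_1d(x, baseline=0.0, n_steps=50):
    """Midpoint Riemann sum of the gradient along the baseline-to-input path."""
    total = 0.0
    for k in range(n_steps):
        beta = (k + 0.5) / n_steps
        total += grad_f(baseline + beta * (x - baseline))
    return (x - baseline) * total / n_steps

x = 2.0
gxi = gradient_times_input(x)
ig = integrated_gradients_1d(x)
print(f"F(x) - F(baseline) = {f(x) - f(0.0)}")
print(f"gradient-times-input = {gxi}, integrated gradients = {ig}")
```

Integrated gradients recovers the full output change here because it averages the gradient over the whole path rather than sampling it only at the saturated endpoint.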

Has this been tested on datasets larger than CIFAR-10?

No, the paper reports only CIFAR-10 experiments. The authors acknowledge this as a limitation and suggest future work should test on more complex datasets and architectures. The method’s behavior on ImageNet with 224×224 images and 1,000 classes is an open question, both in terms of the quality of IG maps at higher resolution and the computational feasibility of precomputing them for 1.2 million training images.

Read the full paper and run the official code from National Cheng Kung University’s NordlingLab.


Closing Thoughts

The paper from Hernandez, Chang, and Nordling is a clean piece of applied research. The central idea, using explainability output as training-time augmentation rather than as a post-hoc inspection tool, is genuinely novel. Most prior work treats attribution methods as something you run after training to understand a model. This paper asks whether you can run them before training to improve one.

The results are encouraging but carefully scoped. A 4.1x compression ratio with a 98.4% accuracy retention rate and a 10.8x latency reduction is a solid outcome, but it is not the most aggressive compression in the literature. The paper wisely avoids overclaiming. The comparison figure makes clear that higher compression ratios exist, at the cost of accuracy, and that this method occupies a specific point on that frontier rather than redefining it.

The deeper contribution is methodological. By demonstrating that IG and KD contribute complementary non-redundant signals in the ablation study, the paper establishes a principle: feature-level attribution guidance is not just useful for human understanding, it is useful for machine learning. That principle could extend to other compression techniques, other attribution methods, and other tasks beyond image classification.

For practitioners, the immediate takeaway is pragmatic. If your pipeline already uses knowledge distillation, adding IG augmentation costs one preprocessing run and a minor change to the dataset loader. The accuracy gain is consistent and the mechanism is sound. The caveats are real: test it on your specific dataset and architecture before treating the CIFAR-10 numbers as a guarantee.

For the research community, the next obvious step is scale. Does this work at ImageNet resolution? Does it hold at 10x or 20x compression? Does the benefit transfer to tasks beyond classification? Those are questions worth answering, and this paper gives a credible starting point from which to answer them.

Academic citation: Hernandez, D. E., Chang, J. R., and Nordling, T. E. M. (2025). Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients. arXiv:2503.13008v1 [cs.LG]. March 17, 2025.

This analysis is based on the published paper and an independent evaluation of its claims.