DSKD: How Sense Dictionaries Are Finally Making Decoder LLMs Smarter Without Slowing Them Down


DSKD: The Lexical Knowledge Injection That Finally Works for Decoder Language Models

How researchers at RPI and IBM Research taught generative LLMs to understand word senses, synonyms, and antonyms during training—without adding a single extra step at inference time—and why it consistently beats standard knowledge distillation across five benchmarks.

[Figure 1 diagram: conceptual view of the Llama-3-8B-Instruct token embedding space. Synonyms of red (ruby, crimson, scarlet, vermilion) form one compact cluster, vehicle words (car, bus, truck, auto) another; polysemous tokens like bank, spring, and light overlap multiple clusters.]
Figure 1: Token embedding space in Llama-3-8B-Instruct visualized conceptually (based on t-SNE analysis in the paper). Semantically related tokens cluster together—synonyms of red occupy a compact region, while vehicle-related words form their own neighborhood. Polysemous words like bank exhibit overlapping neighborhoods across multiple senses. This structured geometry is the core motivation for building a sense dictionary.

Modern large language models are extraordinary at capturing context. Train them on enough text and they develop a surprisingly nuanced feel for how words relate to one another. But there is a persistent gap between that learned intuition and the structured, explicit knowledge baked into human-crafted dictionaries and thesauri. Words have senses. Synonyms exist. Antonyms encode contrast. For years, bridging this gap has worked reasonably well for encoder models—but generative decoders presented a harder problem. Now a team from Rensselaer Polytechnic Institute and IBM Research has found a clean solution: inject all that lexical structure at training time, then throw the dictionary away before inference begins.

Their framework, DSKD (Decoder-based Sense Knowledge Distillation), doesn’t require architectural surgery, retraining from scratch, or dictionary lookups during serving. Instead, it constructs a rich sense dictionary from the teacher model’s own contextual embeddings, augments it with synonym and antonym relationships drawn from Wiktionary and Roget’s Thesaurus, and uses this combined resource to sharpen the supervision signal during student training. The result is a student model that inherits not just the teacher’s output distribution but its semantic geometry—and does so without carrying any of the dictionary overhead into production.

With Llama-3-8B-Instruct and Mistral-7B-Instruct as teachers, DSKD students consistently outperform standard knowledge distillation across ARC, CommonsenseQA, MMLU, PIQA, and SQuADv2. The gains are especially striking for Mistral, where a smaller vocabulary forces each token to carry a heavier semantic load—exactly the scenario where explicit sense-level supervision pays off most.


Why Decoder Models Have Always Struggled with Lexical Knowledge

The tension between contextual embeddings and structured lexical knowledge runs deep in NLP history. Dense word vectors like Word2Vec captured broad semantic similarity but lost fine-grained sense distinctions—the word bank sits somewhere between river and finance rather than clearly belonging to either. Contextual embeddings from transformers solved this by conditioning every representation on its surrounding text, but introduced a new problem: all that context-sensitivity makes the embedding space continuous and highly variable, resisting the kind of discrete symbolic structure that dictionaries encode.

Prior work on Sense Knowledge Distillation (SKD) tackled this for encoder models by clustering a teacher’s contextual embeddings into discrete sense prototypes and training students to predict which sense cluster a given token falls into. This approach worked well for classification—but it required replacing contextual embeddings with dictionary sense vectors at inference time, a dependency that makes deployment awkward and rules out free-form text generation entirely.

Applying the same idea directly to decoder models hits a fundamental mismatch. In an encoder, the hidden state at position \(t\) corresponds directly to the input token at that position, so sense lookup is straightforward. In a decoder, the hidden state at position \(t\) is optimized to predict the next token \(x_{t+1}\)—not to represent the current one. The model can in principle generate any token from the full vocabulary, making it impractical to query a sense dictionary for every position during inference. The dictionary would need to cover the entire vocabulary for every forward pass.

DSKD’s insight is to sidestep this inference-time problem entirely. Rather than asking the model to look up senses during generation, it uses the sense dictionary as a training signal only—a structured external resource that guides how the student’s hidden states are organized, without ever appearing in the deployed model’s computational graph.

Key Insight

DSKD resolves the decoder-dictionary incompatibility by using sense knowledge only during training. The deployed student model is fully self-contained—given any input, it generates text exactly like a standard language model, with no dictionary dependency at inference time.


Building the Sense Dictionary: Three Interlocking Layers

The DSKD sense dictionary is not a simple off-the-shelf resource. It is constructed in three stages that progressively enrich the representation from raw contextual statistics to structured lexical relationships to composed word-level sense embeddings.

Stage 1: Token-Level Sense Embeddings via Clustering

The construction begins with a single forward pass of the pretrained teacher model over the English Wikipedia dump (March 2024 edition) combined with the training datasets for the downstream tasks. For each token in the vocabulary, the framework collects up to 2,000 contextual embeddings from the teacher’s final hidden layer—the most semantically rich representations the teacher produces. These embeddings are then clustered using k-means:

Eq. 1 — Sense Embeddings for Token t $$S_t = [s_{t,1}; s_{t,2}; \ldots; s_{t,k}] \in \mathbb{R}^{k \times d}$$

Each centroid \(s_{t,i}\) represents a prototypical semantic usage of token \(t\) under different contexts—a river bank, a financial institution, a blood bank. The number of clusters \(k\) is a hyperparameter that controls semantic granularity; ablation studies in the paper show that \(k=10\) works best for LLaMA (with its 128K vocabulary) and \(k=5\) for Mistral (32K vocabulary).
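As a toy illustration of Eq. 1 (the synthetic data and small dimensions below are my own; the paper clusters real teacher hidden states with d = 4096 and up to 2,000 embeddings per token):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, k = 8, 3  # toy embedding dimension and cluster count (paper: d=4096, k=5-10)

# Simulate contextual embeddings of one polysemous token drawn from 3 "senses"
centers = rng.normal(size=(3, d)) * 5.0
contexts = np.concatenate([c + rng.normal(size=(200, d)) for c in centers])

# Cluster the contextual embeddings; the centroids are the sense embeddings
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(contexts)
S_t = km.cluster_centers_  # Eq. 1: S_t ∈ R^{k×d}, one row per sense prototype
print(S_t.shape)  # (3, 8)
```

Each row of `S_t` plays the role of one \(s_{t,i}\): a prototypical usage of the token averaged over similar contexts.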

Stage 2: Lexical Relationship Extraction and Morphological Expansion

Raw clustering captures distributional sense distinctions but misses the relational structure that humans encode in thesauri. DSKD addresses this by extracting synonym and antonym word pairs from two resources: Wiktionary (via Wiktextract) and Roget’s Thesaurus. Each pair is labeled as a synonym or antonym relation edge.

A particularly clever extension handles morphological negation. In English, prefixes like un-, in-, dis- and suffixes like -less systematically invert meaning. Rather than treating these as edge cases, DSKD uses MorphoLEX-English to identify base forms and flip relationship labels accordingly:

Original Pair | Original Relation | New Pair (after morpho) | New Relation
quit, discontinue | synonym | quit, continue | antonym
accurate, faultless | synonym | accurate, fault | antonym
variable, unchangeable | antonym | variable, changeable | synonym
unkind, friendly | antonym | kind, friendly | synonym
disinterest, zeal | antonym | interest, zeal | synonym

Table 1: Morphological negation systematically transforms lexical relations. MorphoLEX-English recovers base forms, allowing DSKD to expand antonym coverage substantially beyond what standard thesauri provide.

This morphological expansion is particularly valuable because antonym pairs are naturally scarcer than synonym pairs in standard resources—a known limitation that DSKD explicitly compensates for. The final dictionary for LLaMA contains 923,950 synonym pairs and 75,949 antonym pairs.

Stage 3: Word-Level Composition for Multi-Token Words

There is a mismatch between lexical resources, which operate at the word level, and tokenizers, which split words into subword tokens. The word vermilion might be tokenized into two or three pieces, each with its own sense embeddings. DSKD handles this through an iterative composition procedure that builds word-level sense embeddings by nearest-neighbor alignment across the token sequence:

Eq. 2 — Composed Word-Level Sense Embedding $$\tilde{S}_w = \frac{1}{m} \sum_{h=1}^{m} \text{aligned}(S_{t_h})$$

The composition starts from the first token’s sense embeddings, then at each step uses the current mean to identify the nearest matching sense in the next token’s inventory via L2 distance. This ensures that each token contributes equally to the final representation while late tokens refine earlier alignment decisions. Setting the maximum span \(m=3\) retains over 90% of synonym/antonym pairs for both LLaMA and Mistral tokenizers—a practical balance between coverage and computational cost.
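A minimal sketch of this composition, assuming a running mean guides the nearest-neighbor alignment as described above (the function name and toy shapes are illustrative, not from the paper's code):

```python
import torch

def compose_word_senses(token_senses):
    """Sketch of Eq. 2: align and average sense inventories across subword tokens.

    token_senses: list of (k_h, d) tensors, one sense inventory per subword token.
    """
    acc = token_senses[0].clone()        # aligned senses of the first token (k, d)
    for step, S_next in enumerate(token_senses[1:], start=2):
        mean = acc / (step - 1)          # current running mean guides alignment
        # For each current sense, pick the nearest sense in the next inventory (L2)
        nearest = S_next[torch.cdist(mean, S_next).argmin(dim=1)]
        acc = acc + nearest
    return acc / len(token_senses)       # equal contribution per token (1/m factor)

# Toy word split into two subword tokens with 3 and 2 candidate senses (d = 4)
word_senses = compose_word_senses([torch.zeros(3, 4), torch.ones(2, 4)])
print(word_senses.shape)  # torch.Size([3, 4])
```

The result keeps the first token's sense count while each later token contributes only its best-aligned sense per slot.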


[Figure 2 diagram: Wikipedia (March 2024 dump) + training datasets → teacher forward pass (last hidden layer, ≤ 2,000 embeddings per token) → k-means clustering (k=10 LLaMA, k=5 Mistral) → token sense dictionary (LLaMA: 7.08 GB, 69,832 tokens; Mistral: 2.06 GB, 27,541 tokens). Lexical resources (Wiktionary via Wiktextract, Roget's Thesaurus: 923,950 synonym pairs, 75,949 antonym pairs; MorphoLEX-English negation expansion) feed word-level composition (m ≤ 3), yielding the enriched DSKD sense dictionary, used only during student training with zero inference cost.]
Figure 2: The three-stage construction pipeline for the DSKD enriched sense dictionary. Token-level sense embeddings are built from clustering the teacher’s contextual representations over Wikipedia; lexical resources (Wiktionary, Roget’s Thesaurus, MorphoLEX-English) contribute relational structure; and word-level composition handles multi-token words. The final dictionary is used only during training.

The Training Objective: Semantic Consistency with Hinge MSE Loss

With the sense dictionary in hand, DSKD extends the standard knowledge distillation objective with a sense-guided supervision term. The standard KD loss combines cross-entropy against ground-truth tokens with KL divergence between teacher and student output distributions:

Eq. 3 — Standard KD Objective $$\mathcal{L}_{KD}(t) = \mathcal{L}_{CE}(t) + \alpha \cdot \mathcal{L}_{KL}(t)$$

DSKD adds a semantic consistency term \(\mathcal{L}_{sem}\) that operates on the student’s last hidden state at position \(t\), denoted \(\mathbf{n}_t\). The key idea is to retrieve sense embeddings for the ground-truth next token \(x_{t+1}\) and its lexical neighbors, then use the teacher’s hidden state \(\mathbf{m}_t\) as a guide to select the most contextually relevant subset of those embeddings.

Specifically, the positive candidate set \(\mathcal{P}_{t+1}\) consists of the sense embeddings of \(x_{t+1}\) itself plus its synonyms, while the negative set \(\mathcal{A}_{t+1}\) contains antonym sense embeddings. Since a single token can carry multiple senses and its synonyms may span different meanings, DSKD selects only the \(\kappa\) nearest embeddings to \(\mathbf{m}_t\) from each set—ensuring the selected candidates reflect the actual intended meaning rather than all possible uses of the word.

Eq. 4 — Semantic Consistency Loss (Hinge MSE) $$\mathcal{L}_{sem}(t) = \beta_p \frac{1}{|\mathcal{P}^\kappa_{t+1}|} \sum_{p \in \mathcal{P}^\kappa_{t+1}} \text{MSE}(\mathbf{n}_t, p) + \beta_n \frac{1}{|\mathcal{A}^\kappa_{t+1}|} \sum_{a \in \mathcal{A}^\kappa_{t+1}} \left[ \gamma - \text{MSE}(\mathbf{n}_t, a) \right]_+$$

The first term pulls the student representation toward its positive (sense + synonym) neighbors. The second term is a margin hinge that pushes the student away from antonym embeddings—but only until the MSE distance reaches the margin \(\gamma\), preventing the antonym repulsion from dominating training. The overall DSKD objective simply sums these components:

Eq. 5 — Full DSKD Training Objective $$\mathcal{L}_{DSKD}(t) = \mathcal{L}_{KD}(t) + \mathcal{L}_{sem}(t)$$
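A toy computation shows how the hinge in Eq. 4 behaves (the vectors and margin below are illustrative values, not from the paper):

```python
import torch

def mse(x, y):
    """MSE(x, y) = (1/d) ||x - y||^2, as used in the semantic consistency loss."""
    return ((x - y) ** 2).sum() / x.numel()

gamma = 1.0
n_t = torch.zeros(4)                  # toy student hidden state (d = 4)
near_antonym = torch.full((4,), 0.1)  # antonym embedding close to n_t
far_antonym = torch.full((4,), 5.0)   # antonym embedding already far away

# Hinge term [gamma - MSE]_+ : penalize only antonyms within the margin
hinge_near = torch.clamp(gamma - mse(n_t, near_antonym), min=0.0)
hinge_far = torch.clamp(gamma - mse(n_t, far_antonym), min=0.0)
# near antonym incurs a ~0.99 penalty; far antonym contributes exactly 0.0
print(hinge_near.item(), hinge_far.item())
```

Once an antonym's MSE distance exceeds the margin \(\gamma\), its gradient vanishes, which is what keeps the repulsion term from dominating training.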

A practical efficiency choice: only the final two decoder layers of the student model are trained. The first 14 layers (in the 16-layer student derived from a 32-layer teacher) are copied and frozen from the teacher, preserving the teacher’s general representations while focusing training resources on the semantic refinement layers where DSKD’s signal matters most.
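In code, this freezing scheme might look like the sketch below, applied to a generic stack of decoder layers (the helper name and toy Linear layers are illustrative; real attribute paths such as `model.layers` vary by implementation):

```python
import torch.nn as nn

def freeze_all_but_last(layers: nn.ModuleList, num_trainable: int = 2):
    """Freeze every decoder layer except the last `num_trainable` (paper: 2 of 16)."""
    for layer in layers[:-num_trainable]:
        for p in layer.parameters():
            p.requires_grad_(False)

# Toy 16-layer "student": only the last two layers keep gradients
student_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(16)])
freeze_all_but_last(student_layers, num_trainable=2)
trainable = sum(p.requires_grad for layer in student_layers for p in layer.parameters())
print(trainable)  # 4 tensors: weight + bias for each of the last two layers
```

Only parameters with `requires_grad=True` receive gradients, so the optimizer touches just the final two layers.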

“By grounding supervision in the input tokens and their lexical neighborhoods, we encourage the learned representations to capture finer-grained semantic structure beyond exact token-level prediction.” — Wang et al., arXiv:2602.22351v1
[Figure 3 diagram: input corpus x₁, x₂, …, xᵤ feeds the teacher ℳ (Llama-3-8B or Mistral-7B, 32 frozen layers), producing last hidden state mₜ. The DSKD sense dictionary supplies positive candidates 𝒫ₜ₊₁ (sense embeddings + synonyms) and negative candidates 𝒜ₜ₊₁ (antonym embeddings), with the κ = 5 nearest to mₜ selected. The student 𝒩 (16 layers, 4.5B params, 2 trainable) produces nₜ. Losses ℒ_CE, ℒ_KL, and ℒ_sem (hinge MSE pull/push) combine into ℒ_DSKD = ℒ_KD + ℒ_sem for backprop. At inference, no sense dictionary is used; the student performs standard autoregressive generation. Approximately 14% of corpus tokens have associated synonym/antonym relations in the dictionary.]
Figure 3: The DSKD training pipeline. The teacher model’s last hidden state \(\mathbf{m}_t\) serves dual duty: it feeds into the standard KL divergence loss and acts as a guide for selecting the \(\kappa\) most contextually relevant positive and negative sense embeddings from the DSKD dictionary. The student’s hidden state \(\mathbf{n}_t\) is then optimized via the hinge MSE loss to align with positive candidates and stay distant from antonyms. At inference, the dictionary plays no role whatsoever.

Experimental Results: Consistent Gains Across Five Benchmarks

The evaluation covers a deliberately diverse set of challenges: ARC-Challenge (multi-step science reasoning), CommonsenseQA (commonsense five-choice questions), MMLU (57-subject knowledge breadth), PIQA (physical commonsense), and SQuADv2 (generative reading comprehension with unanswerable questions). All classification benchmarks use a 5-shot setting; SQuADv2 uses 1-shot with demonstrations drawn from within the same paragraph. Both teachers use 32 layers; students use 16, training only the last 2.

Model | # Layers | ARC | CSQA | MMLU | PIQA | SQuAD
LLaMA3-8B Teacher | 32 | 81.45 | 78.08 | 67.22 | 78.49 | 72.90
KD (standard) | 16 | 77.89 | 76.43 | 61.26 | 81.78 | 68.58
DSKD (ours) | 16 | 79.66 | 76.70 | 64.38 | 82.63 | 69.11
Mistral-7B Teacher | 32 | 70.33 | 68.34 | 55.37 | 73.72 | 66.26
KD (standard) | 16 | 61.11 | 66.16 | 52.04 | 78.88 | 57.82
DSKD (ours) | 16 | 64.78 | 68.44 | 54.44 | 79.53 | 58.09

Table 2: Performance comparison between KD and DSKD students across both teacher backbones. DSKD achieves the best student score on every benchmark. All experiments use 25% of available training data with 5-shot evaluation (SQuAD uses 1-shot F1).

The pattern is clear and consistent. For LLaMA, DSKD improves on KD across every benchmark—most dramatically on ARC (+1.77 pp) and MMLU (+3.12 pp), where semantic disambiguation is most consequential. Even on CSQA, where the teacher-student performance gap was already narrow, DSKD delivers a small but reliable lift, suggesting that sense-aware supervision adds value even near the ceiling of what distillation can recover.

For Mistral, the gains are larger in absolute terms. Standard KD suffers visibly here—reducing ARC accuracy from 70.33% to 61.11% and MMLU from 55.37% to 52.04%. DSKD recovers a substantial portion of that gap, reaching 64.78% on ARC and 54.44% on MMLU. The explanation lies in vocabulary size: Mistral’s 32K-token vocabulary is roughly four times smaller than LLaMA’s 128K. Each Mistral token therefore encodes greater semantic breadth, making the compression from teacher to student more lossy—and sense-level supervision correspondingly more valuable.

Key Finding

DSKD delivers its largest gains where semantic compression is most severe. Smaller vocabularies benefit more—individual tokens carry more senses, so explicit sense supervision recovers distinctions that standard KL distillation loses. This suggests DSKD’s value scales with the difficulty of the distillation regime.


Ablation Studies: What Actually Matters

The paper includes thorough ablation studies across four design dimensions, and the results reveal a few genuinely non-obvious findings.

Number of Sense Clusters (k)

More clusters generally help—finer-grained sense partitioning captures more semantic variability—but with diminishing returns. For LLaMA, going from k=5 to k=10 delivers a clear improvement, but gains flatten between k=10 and k=25 while the dictionary size grows substantially. For Mistral, k=5 strikes the better balance. This asymmetry tracks the vocabulary size difference: LLaMA’s larger vocabulary benefits from richer sense representation per token.

Number of Selected Senses During Training (κ)

Setting κ=1 consistently underperforms, confirming that a single nearest positive sense doesn’t provide enough supervision diversity. The sweet spot is κ=5, where performance stabilizes. Larger values of κ offer marginal additional benefit while increasing training overhead—a classic diminishing-returns pattern that suggests most of the useful signal lies in the five most contextually relevant sense candidates.

Loss Coefficients (β_p and β_n)

The semantic consistency loss needs careful tuning. Too small and it’s drowned out by the KD objective, adding no value. Too large and it dominates training, overriding the distributional knowledge from the teacher. The selected values—β=1.0 for LLaMA, β=1.5 for Mistral—represent the balanced operating point where both objectives contribute meaningfully.

Number of Trainable Layers

Moving from one to two trainable layers produces reliable improvements. Beyond that, additional trainable layers offer no consistent benefit while increasing training time. This finding supports the efficiency design choice: freezing the early layers preserves the teacher’s broad representation quality, while training only the final two layers gives DSKD’s semantic loss a focused surface to work on.

Training Efficiency

The overhead is minimal. DSKD adds approximately 15 minutes per epoch compared to standard KD (5h17m vs. 5h02m for LLaMA)—a 5% increase. Evaluation times are statistically identical between KD and DSKD since the dictionary plays no role at inference. Memory footprint is unchanged: both student variants use 8.5GB for LLaMA, compared to 15GB for the full teacher. That roughly 43% memory reduction from teacher to student is preserved entirely.


[Figure 4 data: DSKD − KD improvement in absolute percentage points, all positive. LLaMA-3-8B: ARC +1.77, CSQA +0.27, MMLU +3.12, PIQA +0.85, SQuAD +0.53. Mistral-7B: ARC +3.67, CSQA +2.28, MMLU +2.40, PIQA +0.65, SQuAD +0.27.]
Figure 4: Improvement of DSKD over standard KD (in absolute percentage points) across all five benchmarks and both model backbones. Every bar is positive—DSKD never regresses. Mistral benefits most from DSKD on reasoning-heavy tasks (ARC, CSQA, MMLU), consistent with the hypothesis that smaller vocabularies suffer more from semantic compression during distillation.

Why This Matters Beyond the Numbers

What DSKD ultimately demonstrates is that the gap between statistical language modeling and structured linguistic knowledge is not as fundamental as it might seem. The two can be bridged—but the bridge needs to be built with care about when each type of knowledge is applied.

The prior approach (SKD) tried to keep the dictionary in the loop at inference time, replacing contextual embeddings with discrete sense vectors. This preserved the explicit symbolic structure but broke the generative flow and introduced architectural dependencies. DSKD inverts the logic: use the dictionary freely during training, let the model internalize the structure, then remove the scaffolding before deployment. The resulting model carries the semantic geometry of the sense dictionary without carrying the dictionary itself.

This design philosophy has obvious practical appeal. Deployment pipelines hate dependencies. Every external lookup at inference time is a latency cost, a point of failure, a memory requirement. By confining the dictionary entirely to training, DSKD delivers the semantic benefits without the operational complexity. From an engineering standpoint, a DSKD-trained model is indistinguishable from a standard language model—it just performs better.

The morphological negation mechanism is worth a closer look for what it reveals about knowledge engineering. The observation that English uses prefixes and suffixes to invert meaning—and that this pattern is productive, regular, and systematically exploitable—is exactly the kind of insight that separates a thoughtfully designed system from a brute-force one. Rather than relying solely on manually curated antonym pairs (which are inherently sparse), DSKD manufactures new antonym relationships from the structural regularities of the language itself. The result is a substantially richer negative signal during training at minimal additional cost.

The Mistral results deserve particular attention for practitioners working with constrained vocabularies. The pattern suggests a general principle: the harder the compression problem (smaller vocabulary, larger teacher-student gap), the more valuable explicit semantic supervision becomes. This isn’t surprising in retrospect—when you’re forcing a lot of meaning into fewer tokens, you need more guidance about what that meaning actually is. But having a concrete empirical demonstration of this principle, and a method that exploits it, is genuinely useful for selecting acceleration strategies in production environments.

Looking ahead, the paper identifies an obvious next step: variable cluster numbers per token. The current framework assigns every token the same number of sense clusters \(k\), which is clearly suboptimal—the word the needs far fewer clusters than bank. Allowing polysemous tokens to have richer sense inventories while assigning fewer clusters to function words would reduce dictionary size, improve sense precision, and potentially sharpen the training signal further. It’s a natural extension that the current clean infrastructure should support without fundamental redesign.


Conceptual Implementation (Python)

The code below illustrates the core components of DSKD: sense dictionary construction via k-means clustering, lexical relation management with morphological expansion, and the hinge MSE semantic consistency loss. This is an educational implementation of the mathematical concepts in the paper.

# ─────────────────────────────────────────────────────────────────────────────
# DSKD: Decoder-based Sense Knowledge Distillation
# Wang, Zaki, Kollias, Kalantzis · arXiv:2602.22351v1 [cs.CL] 2026
# Conceptual implementation of sense dictionary + hinge MSE training
# ─────────────────────────────────────────────────────────────────────────────

import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans
import numpy as np
from typing import Dict, List, Optional, Tuple, Set


# ─── Section 1: Sense Dictionary Construction ─────────────────────────────

class SenseDictionary:
    """
    Constructs a token-level sense dictionary by clustering contextual
    embeddings from a pretrained teacher model (Eq. 1 in paper).
    
    Usage:
        sense_dict = SenseDictionary(k=10, d=4096)
        sense_dict.add_embeddings(token_id, embeddings)  # during corpus pass
        sense_dict.cluster_all()                         # after corpus pass
    """
    
    def __init__(self, k: int, d: int, max_per_token: int = 2000):
        """
        Args:
            k: Number of sense clusters per token
            d: Embedding dimension of teacher model
            max_per_token: Max contextual embeddings stored per token (paper: 2000)
        """
        self.k = k
        self.d = d
        self.max_per_token = max_per_token
        self.raw_embeddings: Dict[int, List[np.ndarray]] = {}
        self.sense_embeddings: Dict[int, torch.Tensor] = {}   # token_id → (k, d)
    
    def add_embeddings(self, token_id: int, embs: torch.Tensor):
        """Collect contextual embeddings during corpus forward pass"""
        if token_id not in self.raw_embeddings:
            self.raw_embeddings[token_id] = []
        
        batch = embs.detach().cpu().numpy()
        for emb in batch:
            if len(self.raw_embeddings[token_id]) < self.max_per_token:
                self.raw_embeddings[token_id].append(emb)
    
    def cluster_all(self):
        """
        Build sense embeddings S_t ∈ R^{k×d} for each token (Eq. 1).
        Tokens with fewer than k observations get fewer clusters.
        """
        for token_id, emb_list in self.raw_embeddings.items():
            X = np.stack(emb_list, axis=0)          # shape: (N, d)
            n_clusters = min(self.k, len(emb_list))
            
            km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
            km.fit(X)
            
            centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32)
            self.sense_embeddings[token_id] = centroids    # (k, d)
    
    def get_senses(self, token_id: int) -> Optional[torch.Tensor]:
        """Retrieve sense embeddings for a token; None if not in dictionary"""
        return self.sense_embeddings.get(token_id, None)


# ─── Section 2: Lexical Relations with Morphological Expansion ────────────

class LexicalRelationGraph:
    """
    Manages synonym/antonym word-level relationships including
    morphological negation expansion (Table 1 in paper).
    """
    
    NEGATION_PREFIXES = {'un', 'in', 'im', 'dis', 'non', 'anti', 'il', 'ir'}
    NEGATION_SUFFIXES = {'less', 'free'}
    
    def __init__(self):
        self.synonyms: Dict[str, Set[str]] = {}
        self.antonyms: Dict[str, Set[str]] = {}
    
    def add_pair(self, w1: str, w2: str, relation: str):
        """Add a lexical pair from Wiktionary/Roget's, then expand via morphology"""
        self._add_direct(w1, w2, relation)
        self._expand_morphological(w1, w2, relation)
    
    def _add_direct(self, w1: str, w2: str, relation: str):
        target = self.synonyms if relation == 'synonym' else self.antonyms
        target.setdefault(w1, set()).add(w2)
        target.setdefault(w2, set()).add(w1)
    
    def _expand_morphological(self, w1: str, w2: str, relation: str):
        """
        Morphological negation expansion (Section 3.2 of paper).
        If w1 has a negating prefix/suffix, extract base form and flip relation.
        """
        base1 = self._extract_base(w1)
        if base1 and base1 != w1:
            # Flip relation when we strip the negation (Table 1 logic)
            flipped = 'antonym' if relation == 'synonym' else 'synonym'
            self._add_direct(base1, w2, flipped)
        
        base2 = self._extract_base(w2)
        if base2 and base2 != w2:
            flipped = 'antonym' if relation == 'synonym' else 'synonym'
            self._add_direct(w1, base2, flipped)
    
    def _extract_base(self, word: str) -> Optional[str]:
        """Simple morphological base extraction via prefix/suffix stripping"""
        for prefix in self.NEGATION_PREFIXES:
            if word.startswith(prefix) and len(word) > len(prefix) + 2:
                return word[len(prefix):]
        for suffix in self.NEGATION_SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return None


# ─── Section 3: Semantic Consistency Loss ─────────────────────────────────

class SemanticConsistencyLoss(nn.Module):
    """
    Implements L_sem from Eq. 4: Hinge MSE loss that pulls student
    representations toward positive sense embeddings and pushes them
    away from antonym embeddings.
    """
    
    def __init__(self, 
                 kappa: int = 5, 
                 beta_p: float = 1.0, 
                 beta_n: float = 1.0, 
                 gamma: float = 1.0):
        """
        Args:
            kappa: Number of nearest sense candidates to select (paper: κ=5)
            beta_p: Weight for positive (pull) term
            beta_n: Weight for negative (push) term
            gamma: Margin for hinge function
        """
        super().__init__()
        self.kappa = kappa
        self.beta_p = beta_p
        self.beta_n = beta_n
        self.gamma = gamma
    
    def forward(self,
                student_hidden: torch.Tensor,      # n_t: (B, d)
                teacher_hidden: torch.Tensor,      # m_t: (B, d)
                positive_senses: List[Optional[torch.Tensor]],  # P_{t+1}
                negative_senses: List[Optional[torch.Tensor]]   # A_{t+1}
                ) -> torch.Tensor:
        """
        Compute L_sem for a batch of token positions.
        
        Args:
            student_hidden: Student last hidden state at positions t
            teacher_hidden: Teacher last hidden state at positions t (guides selection)
            positive_senses: List of (num_senses, d) tensors or None per position
            negative_senses: List of (num_antonyms, d) tensors or None per position
            
        Returns:
            Scalar semantic consistency loss
        """
        total_loss = torch.tensor(0.0, device=student_hidden.device, requires_grad=True)
        n_valid = 0
        
        for i in range(student_hidden.size(0)):
            pos = positive_senses[i]
            neg = negative_senses[i]
            
            # Skip positions without dictionary entries (~86% of tokens)
            if pos is None:
                continue
            
            n_t = student_hidden[i]   # (d,)
            m_t = teacher_hidden[i]   # (d,)
            
            # Select κ nearest positive candidates to teacher hidden state
            pos_k = self._select_top_k(m_t, pos.to(m_t.device))   # (κ, d)
            
            # Positive term: minimize MSE to nearest positive senses
            pos_loss = self._mse_mean(n_t, pos_k)
            
            loss_i = self.beta_p * pos_loss
            
            # Negative term (only if antonyms available)
            if neg is not None:
                neg_k = self._select_top_k(m_t, neg.to(m_t.device))  # (κ, d)
                neg_mse = self._mse_mean(n_t, neg_k)
                # Hinge: only penalize if student too close to antonyms
                hinge = torch.clamp(self.gamma - neg_mse, min=0.0)
                loss_i = loss_i + self.beta_n * hinge
            
            total_loss = total_loss + loss_i
            n_valid += 1
        
        return total_loss / max(n_valid, 1)
    
    def _select_top_k(self, query: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        """
        Select κ nearest candidates to query via L2 distance.
        Implements: Top_κ_{p ∈ P} ||m_t - p||_2
        """
        dists = torch.cdist(query.unsqueeze(0), candidates)  # (1, num_candidates)
        k = min(self.kappa, candidates.size(0))
        _, top_idx = torch.topk(dists[0], k=k, largest=False)
        return candidates[top_idx]  # (k, d)
    
    def _mse_mean(self, x: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        """MSE(x, y) = (1/d) ||x - y||^2 averaged across candidate set"""
        d = x.size(0)
        diffs = x.unsqueeze(0) - candidates               # (k, d)
        mse_per = (diffs ** 2).sum(dim=-1) / d              # (k,)
        return mse_per.mean()


# ─── Section 4: DSKD Training Step ───────────────────────────────────────

class DSKDTrainer:
    """
    Orchestrates one DSKD training step combining standard KD and L_sem.
    Assumes the student's early layers are already frozen; the paper
    trains only the final two decoder layers.
    """
    
    def __init__(self,
                 student_model,
                 teacher_model,
                 sense_dict: SenseDictionary,
                 lex_graph: LexicalRelationGraph,
                 tokenizer,
                 alpha: float = 1.0,
                 kappa: int = 5,
                 beta_p: float = 1.0,
                 beta_n: float = 1.0,
                 gamma: float = 1.0):
        self.student = student_model
        self.teacher = teacher_model
        self.sense_dict = sense_dict
        self.lex_graph = lex_graph
        self.tokenizer = tokenizer
        self.alpha = alpha
        
        self.sem_loss_fn = SemanticConsistencyLoss(
            kappa=kappa, beta_p=beta_p, beta_n=beta_n, gamma=gamma
        )
    
    def training_step(self, input_ids: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        One DSKD training step implementing L_DSKD = L_KD + L_sem.
        
        Args:
            input_ids: Token ID sequence (B, T)
            
        Returns:
            Dictionary containing total loss and component losses
        """
        B, T = input_ids.shape
        
        # ── Teacher forward (no grad) ──
        with torch.no_grad():
            teacher_out = self.teacher(input_ids, output_hidden_states=True)
            teacher_logits = teacher_out.logits          # (B, T, vocab)
            teacher_hidden = teacher_out.hidden_states[-1]  # (B, T, d)
        
        # ── Student forward ──
        student_out = self.student(input_ids, output_hidden_states=True)
        student_logits = student_out.logits              # (B, T, vocab)
        student_hidden = student_out.hidden_states[-1]  # (B, T, d)
        
        # ── L_CE: cross-entropy against ground-truth next tokens ──
        shift_logits = student_logits[:, :-1, :].contiguous()
        shift_labels = input_ids[:, 1:].contiguous()
        l_ce = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=-100
        )
        
        # ── L_KL: KL divergence between teacher and student distributions ──
        temp = 1.0
        t_probs = F.softmax(teacher_logits[:, :-1] / temp, dim=-1)
        s_log_probs = F.log_softmax(shift_logits / temp, dim=-1)
        # Flatten to (B*(T-1), V): 'batchmean' divides by the first dim only,
        # so flattening yields a true per-token average
        l_kl = F.kl_div(
            s_log_probs.view(-1, s_log_probs.size(-1)),
            t_probs.view(-1, t_probs.size(-1)),
            reduction='batchmean'
        ) * (temp ** 2)
        
        l_kd = l_ce + self.alpha * l_kl
        
        # ── L_sem: sense-guided semantic consistency ──
        # Retrieve sense embeddings for next-token targets at sampled positions
        pos_senses = []
        neg_senses = []
        
        for t in range(T - 1):
            next_token = input_ids[0, t + 1].item()  # simplified: single example
            
            # Get sense embeddings for the token itself
            s_emb = self.sense_dict.get_senses(next_token)
            
            if s_emb is None:
                pos_senses.append(None)
                neg_senses.append(None)
                continue
            
            # Get synonym/antonym sense embeddings if available
            token_str = self.tokenizer.decode([next_token])
            synonyms = self.lex_graph.synonyms.get(token_str, set())
            antonyms = self.lex_graph.antonyms.get(token_str, set())
            
            # Collect synonym sense embeddings
            syn_embs = [s_emb]  # start with token's own senses
            for syn in synonyms:
                syn_tok_ids = self.tokenizer.encode(syn, add_special_tokens=False)
                for tid in syn_tok_ids:
                    syn_emb = self.sense_dict.get_senses(tid)
                    if syn_emb is not None:
                        syn_embs.append(syn_emb)
            
            pos_senses.append(torch.cat(syn_embs, dim=0))  # P_{t+1}
            
            # Collect antonym sense embeddings
            ant_embs = []
            for ant in antonyms:
                ant_tok_ids = self.tokenizer.encode(ant, add_special_tokens=False)
                for tid in ant_tok_ids:
                    ant_emb = self.sense_dict.get_senses(tid)
                    if ant_emb is not None:
                        ant_embs.append(ant_emb)
            
            neg_senses.append(torch.cat(ant_embs, dim=0) if ant_embs else None)
        
        # Semantic loss over positions 0..T-2 (first batch element, matching the loop above)
        l_sem = self.sem_loss_fn(
            student_hidden[0, :-1],   # n_t: positions 0..T-2
            teacher_hidden[0, :-1],   # m_t: positions 0..T-2
            pos_senses,
            neg_senses
        )
        
        # ── Full DSKD objective (Eq. 5) ──
        l_dskd = l_kd + l_sem
        
        return {
            'loss': l_dskd,
            'l_ce': l_ce.detach(),
            'l_kl': l_kl.detach(),
            'l_sem': l_sem.detach()
        }


if __name__ == "__main__":
    print("DSKD Core Components:")
    print("  SenseDictionary    — k-means clustering over teacher's last hidden layer")
    print("  LexicalRelationGraph — Wiktionary + Roget's + morphological expansion")
    print("  SemanticConsistencyLoss — Hinge MSE pull/push on student hidden states")
    print("  DSKDTrainer        — L_DSKD = L_KD + L_sem, trains only final 2 layers")
    print("\nKey parameters (from paper ablations):")
    print("  k=10 (LLaMA), k=5 (Mistral)  |  κ=5  |  β=1.0 (LLaMA), 1.5 (Mistral)")
    print("  Training: 2 epochs, bf16, 6×V100 GPUs, ~5.2 hours per epoch")
    print("  Inference: no dictionary, standard autoregressive decoding")
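The sense-dictionary construction itself sits outside the listing above. As a rough, self-contained sketch of the idea — Lloyd's k-means over a token's contextual hidden states from the teacher — with invented function names, not the authors' code:

```python
import torch

def build_sense_centroids(hidden_states: torch.Tensor,
                          k: int = 10, iters: int = 20) -> torch.Tensor:
    """Toy k-means over one token's contextual states (N, d).
    Returns up to k centroid 'sense' vectors (k, d). Illustrative only."""
    N, _ = hidden_states.shape
    k = min(k, N)
    # Initialize centroids from k random distinct contextual states
    centroids = hidden_states[torch.randperm(N)[:k]].clone()
    for _ in range(iters):
        # Assign each contextual state to its nearest centroid
        assign = torch.cdist(hidden_states, centroids).argmin(dim=1)
        for j in range(k):
            members = hidden_states[assign == j]
            if members.numel() > 0:          # leave empty clusters in place
                centroids[j] = members.mean(dim=0)
    return centroids
```

In the paper's pipeline, such centroids are computed offline from the teacher's last hidden layer (k=10 for LLaMA, k=5 for Mistral per the ablations) and stored per token; at inference the dictionary is discarded entirely.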

Access the Paper and Resources

The full DSKD framework, experimental details, and sense-dictionary construction are available on arXiv. The RPI and IBM Research team plans to release the project code, sense dictionary, and model weights on GitHub after the review period.

Academic Citation:
Wang, Q., Zaki, M. J., Kollias, G., & Kalantzis, V. (2026). Decoder-based Sense Knowledge Distillation. arXiv preprint arXiv:2602.22351. Rensselaer Polytechnic Institute & IBM Research.

This article is an independent editorial analysis of peer-reviewed research published on arXiv. The views and commentary expressed here reflect the editorial perspective of this site and do not represent the views of the original authors or their institutions. Code implementations are provided for educational purposes to illustrate the technical concepts described in the paper. Always refer to the original publication for authoritative details and official implementations.
