Connectionist Temporal Classification (CTC) powers countless speech recognition systems. But here’s the catch: its “context-independent” assumption is a myth. Modern encoders do learn context-dependent patterns, and ignoring this wastes potential. This paper shows how to harness that hidden structure, cutting word error rate (WER) by more than 13% relative in cross-domain tasks. If your ASR system uses CTC, this is more than an incremental upgrade.
The Connectionist Temporal Classification Paradox: Assumed Independence vs. Hidden Dependence
CTC’s core premise is label independence: each output depends only on acoustic input, not prior labels. Yet, powerful encoders (e.g., Conformers) implicitly model label context. This creates an Internal Language Model (ILM) within CTC. Traditional ILM estimation fails here because:
- Heuristic methods (e.g., acoustic input masking) are crude approximations.
- Frame-level priors ignore context, hurting cross-domain adaptation.
- Transcription LMs (used as ILM proxies) aren’t derived from CTC outputs.
💡 The breakthrough? CTC’s ILM is context-dependent. Treating it as context-independent leaves performance on the table.
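To see why this matters, recall how decoding combines scores when an external LM (ELM) is added. With an ILM estimate in hand, shallow fusion turns into a density-ratio-style correction: the implicit ILM score is subtracted so the ELM doesn’t double-count linguistic evidence. A minimal sketch of the per-hypothesis score (the weights `lam_elm` and `lam_ilm` are illustrative tuning knobs, not values from the paper):

```python
def combined_score(log_p_ctc, log_p_elm, log_p_ilm, lam_elm=0.6, lam_ilm=0.4):
    """Score of one hypothesis during beam search or N-best rescoring:
    CTC posterior + weighted external LM - weighted (estimated) internal LM."""
    return log_p_ctc + lam_elm * log_p_elm - lam_ilm * log_p_ilm
```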
Label-Context-Dependent ILM: The Knowledge Distillation Solution
The authors propose knowledge distillation (KD) to extract CTC’s implicit, label-context-dependent ILM. A small LSTM “student” learns from the frozen CTC “teacher” via two strategies:
🔬 1. Label-Level Distillation
- How it works: Compute label posteriors from CTC prefix probabilities (Eq. 6). For a label-sequence prefix a₁…aₛ, the prefix probability P(a₁…aₛ… | X) sums the probabilities of all label sequences that start with that prefix; the posterior of the next label is then a ratio of two such prefix probabilities.
- EOS handling: The end-of-sequence probability is inferred the same way: P(<EOS> | a₁…a_S, X) = P(a₁…a_S | X) / P(a₁…a_S… | X), i.e., the probability that the output is exactly the prefix, divided by the probability that it merely starts with it.
- Training: Minimize the KL divergence between these CTC label posteriors and the LSTM’s outputs (Eq. 7). A small sketch of the posterior computation follows below.
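If you already have CTC prefix probabilities, the label posterior of Eq. 6 (and the <EOS> case) is just a ratio. A minimal sketch with hypothetical helper names (the full prefix-probability computation appears in the listing at the end):

```python
def next_label_log_posterior(log_prefix_probs, s):
    """Eq. 6 as a ratio: log P(a_{s+1} | a_1..a_s, X)
    = log P(a_1..a_{s+1}... | X) - log P(a_1..a_s... | X),
    where log_prefix_probs[k] = log P(a_1..a_k ... | X) ('...' = any continuation)."""
    return log_prefix_probs[s + 1] - log_prefix_probs[s]

def eos_log_posterior(log_seq_prob, log_prefix_probs, s):
    """log P(<EOS> | a_1..a_s, X) = log P(a_1..a_s | X) - log P(a_1..a_s ... | X)."""
    return log_seq_prob - log_prefix_probs[s]
```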
⚖️ 2. Regularization: Fixing CTC Overconfidence
CTC overfits the training data, assigning near-0/1 label probabilities. Two remedies (both sketched in code below):
- Smoothing (Eq. 9-10): Interpolate the empirical data distribution with marginal distributions (factor α = 0.5).
- Masking (Eq. 11): Randomly mask acoustic input segments (p_mask = 0.4) using alignment boundaries.
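Both regularizers are only a few lines in practice. A hedged sketch (the within-batch marginal and the zero-masking are simplifications of ours; the paper’s Eq. 9-11 may differ in detail):

```python
import torch

def smooth_teacher(teacher_probs, marginal_probs, alpha=0.5):
    """Interpolate a teacher distribution with a marginal label distribution
    to soften near-0/1 probabilities (cf. Eq. 9-10)."""
    return alpha * teacher_probs + (1.0 - alpha) * marginal_probs

def mask_acoustics(features, spans, p_mask=0.4):
    """Zero out aligned acoustic segments (start_frame, end_frame) with probability p_mask (cf. Eq. 11)."""
    masked = features.clone()
    for start, end in spans:
        if torch.rand(1).item() < p_mask:
            masked[start:end] = 0.0
    return masked
```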
🔮 Sequence-Level Distillation
Distills entire sequence probabilities (Eq. 12). Spoiler: it underperforms label-level KD because it discards the per-label distributions. A generic sketch follows below.
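For contrast, a generic sequence-level KD objective can be written as a teacher-weighted negative log-likelihood over an N-best list. This is a sketch only: it assumes the N-best hypotheses and their CTC log-probabilities are already available, and it is not necessarily the exact form of the paper’s Eq. 12.

```python
import torch
import torch.nn.functional as F

def sequence_level_kd_loss(ilm, nbest, vocab_size):
    """nbest: list of (tokens: LongTensor [S], teacher_log_prob: float) from the CTC teacher;
    ilm: autoregressive model mapping token ids [B, S] to logits [B, S, vocab_size + 1]."""
    weights = torch.softmax(torch.tensor([lp for _, lp in nbest]), dim=0)  # renormalise over the N-best
    loss = 0.0
    for (tokens, _), w in zip(nbest, weights):
        bos = torch.full((1,), vocab_size, dtype=torch.long)          # reuse the extra index as <BOS>
        logits = ilm(torch.cat([bos, tokens]).unsqueeze(0))[0]        # [S + 1, vocab_size + 1]
        targets = torch.cat([tokens, torch.tensor([vocab_size])])     # labels shifted, then <EOS>
        loss = loss + w * F.cross_entropy(logits, targets, reduction="sum")
    return loss
```

Because this loss only ever sees whole-sequence probabilities, it throws away exactly the per-label detail that label-level KD exploits.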
Results: 13.8% WER & Cross-Domain Dominance
Experiments on LibriSpeech (in-domain) and TED-LIUM v2 (cross-domain) show:
📊 Table: WER comparison with external LM (ELM) integration (TED-LIUM v2 test set)
| ILM Method | Context | WER (%) |
|---|---|---|
| Shallow Fusion (Baseline) | – | 15.9 |
| Frame-Level Prior (FP) | None | 14.7 |
| Transcription LM | Full | 14.5 |
| Label-Level KD (Masking) | Full | 14.0 |
| Label-Level KD (Smoothing) | Full | 13.8 |
💥 Key Findings:
- Cross-domain superiority: Context-dependent ILMs (e.g., label-KD) beat context-independent priors (FP/unigram) by >1% absolute WER.
- Smoothing wins: Label-KD + smoothing achieves 13.8% WER vs. 15.9% for shallow fusion (13.2% relative gain).
- PPL is irrelevant: ILM perplexity doesn’t correlate with WER (unlike external LMs). Optimize on dev-set WER, not PPL.
- In-domain similarity: All methods perform about equally well (LibriSpeech WERs ~4.8%); ILM correction shines under domain shift.
Practical Insights: Implementation Tips
- Forget Sequence-Level KD: Label-level provides richer signal (per-label distributions vs. whole sequences).
- Combine with Frame Prior? Only helps for weak ILMs (e.g., transcription LM). Label-KD + smoothing replaces it.
- Context Length: Full-context LSTM ILMs work best. Limited-context (e.g., 6 labels) underperforms.
- Training Efficiency: Smoothing within mini-batches avoids full-dataset costs.
Conclusion: Stop Guessing, Start Distilling
CTC’s ILM is context-dependent, and ignoring this caps your system’s potential. By distilling label-context dependencies via regularized KD, this work unlocks a 13%+ relative WER reduction in cross-domain scenarios. It is not an incremental change; it is a practical path to robust, production-ready CTC systems.
🚀 Call to Action
Ready to slash your ASR error rates?
- Implement Now: Code is available on GitHub.
- Experiment: Try label-KD + smoothing on your cross-domain data.
- Share: Comment below with your results!
Below is an illustrative PyTorch sketch of label-context-dependent ILM estimation for CTC. The model choices, hyperparameters, and helper routines are ours, not the authors’ reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchaudio.models import Conformer
class CTCConformer(nn.Module):
    """Conformer-based CTC acoustic model (the frozen teacher)."""
    def __init__(self, input_dim, vocab_size, num_layers=12, dim=512):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, dim)  # project features to the model dimension
        self.conformer = Conformer(
            input_dim=dim,
            num_heads=8,
            ffn_dim=2048,
            num_layers=num_layers,
            depthwise_conv_kernel_size=31,
        )
        self.output_layer = nn.Linear(dim, vocab_size + 1)  # +1 for the blank token

    def forward(self, x, lengths):
        x = self.input_proj(x)
        encoded, _ = self.conformer(x, lengths)  # torchaudio Conformer returns (output, lengths)
        return self.output_layer(encoded)
class ILMEstimator(nn.Module):
"""LSTM-based Internal Language Model Estimator"""
def __init__(self, vocab_size, embed_dim=128, hidden_dim=1000):
super().__init__()
        self.embedding = nn.Embedding(vocab_size + 1, embed_dim)  # extra index shared by <BOS>/<EOS>
self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
self.output_layer = nn.Linear(hidden_dim, vocab_size + 1) # +1 for <EOS>
def forward(self, labels):
embedded = self.embedding(labels)
lstm_out, _ = self.lstm(embedded)
logits = self.output_layer(lstm_out)
return logits
class CTCILMTrainer:
"""Knowledge Distillation for CTC Internal Language Model"""
def __init__(self, ctc_model, ilm_estimator, vocab_size, alpha=0.5, p_mask=0.4):
self.ctc = ctc_model
self.ilm = ilm_estimator
self.vocab_size = vocab_size
self.alpha = alpha # Smoothing factor
self.p_mask = p_mask # Masking probability
self.ctc.eval() # Freeze CTC model
    def compute_prefix_probabilities(self, logits, labels, label_lengths):
        """Teacher label posteriors for every utterance in the batch
        (per-utterance Python loop for clarity, not optimised for speed)."""
        log_probs = F.log_softmax(logits, dim=-1)
        next_token_probs = []
        for i in range(logits.size(0)):
            seq_labels = labels[i, : int(label_lengths[i])].tolist()
            next_token_probs.append(self.ctc_label_posteriors(log_probs[i], seq_labels))
        return next_token_probs

    def ctc_label_posteriors(self, log_probs, seq_labels):
        """CTC prefix forward algorithm (Graves-style prefix scoring).
        For each ground-truth prefix a_1..a_s, returns the teacher distribution over the
        next label (all vocabulary tokens, plus <EOS> in the last slot), i.e. the label
        posteriors of Eq. 6 together with the <EOS> probability."""
        T = log_probs.size(0)
        V = self.vocab_size
        blank = self.vocab_size
        neg_inf = float("-inf")
        # r_n[t] / r_b[t]: log prob of having emitted the current prefix within the first
        # t frames, with the last frame being its final label / a blank.  Index 0 is a
        # dummy "before the first frame" state; the empty prefix is covered by all-blank paths.
        r_n = torch.full((T + 1,), neg_inf, dtype=log_probs.dtype)
        r_b = torch.zeros(T + 1, dtype=log_probs.dtype)
        r_b[1:] = torch.cumsum(log_probs[:, blank], dim=0)
        last = None
        posteriors = []
        for s in range(len(seq_labels) + 1):
            r_sum = torch.logaddexp(r_n, r_b)
            # phi[t, c]: log prob of paths that emitted the prefix by frame t and may emit
            # token c at frame t + 1 (a repeated label needs an intervening blank)
            phi = r_sum.unsqueeze(1).expand(T + 1, V).clone()
            if last is not None:
                phi[:, last] = r_b
            # log P(prefix + c ... | X): prefix probability of every one-token extension
            log_psi = torch.logsumexp(phi[:-1] + log_probs[:, :V], dim=0)
            log_psi_eos = r_sum[T].unsqueeze(0)  # log P(exactly this prefix | X)
            # Normalising over {tokens, <EOS>} gives Eq. 6 and
            # P(<EOS> | a_1..a_s, X) = P(a_1..a_s | X) / P(a_1..a_s ... | X)
            posteriors.append(torch.softmax(torch.cat([log_psi, log_psi_eos]), dim=0))
            if s == len(seq_labels):
                break
            # Advance the forward variables to the prefix extended by the true next label
            c = seq_labels[s]
            new_r_n = torch.full((T + 1,), neg_inf, dtype=log_probs.dtype)
            new_r_b = torch.full((T + 1,), neg_inf, dtype=log_probs.dtype)
            for t in range(1, T + 1):
                new_r_n[t] = log_probs[t - 1, c] + torch.logaddexp(new_r_n[t - 1], phi[t - 1, c])
                new_r_b[t] = log_probs[t - 1, blank] + torch.logaddexp(new_r_b[t - 1], new_r_n[t - 1])
            r_n, r_b, last = new_r_n, new_r_b, c
        return torch.stack(posteriors)  # shape: (S + 1, vocab_size + 1)
    def smooth_distribution(self, probs_batch):
        """Simplified within-mini-batch stand-in for the smoothing of Eq. 9-10:
        interpolate each teacher distribution with the mini-batch marginal label
        distribution, which damps the teacher's near-0/1 probabilities."""
        # Marginal next-label distribution over all prefix positions in the batch
        marginal = torch.cat(probs_batch, dim=0).mean(dim=0, keepdim=True)
        return [self.alpha * p + (1.0 - self.alpha) * marginal for p in probs_batch]
def apply_acoustic_mask(self, features, alignments):
"""Mask acoustic features based on alignments"""
masked_features = features.clone()
for i, alignment in enumerate(alignments):
for start, end in alignment:
if torch.rand(1) < self.p_mask:
masked_features[i, start:end] = 0 # Simple zero masking
return masked_features
    def label_level_kd_loss(self, teacher_probs, ilm_logits, label_lengths):
        """Label-level KD (Eq. 7): KL divergence between the (smoothed) CTC label
        posteriors and the ILM predictions, averaged over all prefix positions."""
        total_loss = 0.0
        total_items = 0
        for i, seq_probs in enumerate(teacher_probs):
            num_positions = int(label_lengths[i]) + 1  # prefixes of length 0..S (last one predicts <EOS>)
            for s in range(num_positions):
                teacher_dist = seq_probs[s]
                student_log_dist = F.log_softmax(ilm_logits[i, s], dim=-1)
                # KL(teacher || student); 'sum' reduction because this is a single distribution
                total_loss = total_loss + F.kl_div(
                    student_log_dist, teacher_dist, reduction="sum", log_target=False
                )
                total_items += 1
        return total_loss / max(total_items, 1)
    def train_step(self, features, feature_lengths, labels, label_lengths, alignments):
        # Randomly mask aligned acoustic regions (Eq. 11) before querying the teacher
        if self.p_mask > 0:
            features = self.apply_acoustic_mask(features, alignments)
        with torch.no_grad():
            # Frozen CTC teacher
            ctc_logits = self.ctc(features, feature_lengths)
            # Teacher label posteriors for every ground-truth prefix (Eq. 6)
            next_token_probs = self.compute_prefix_probabilities(ctc_logits, labels, label_lengths)
            # Distribution smoothing (Eq. 9-10)
            if self.alpha < 1.0:
                next_token_probs = self.smooth_distribution(next_token_probs)
        # Student ILM consumes <BOS> + labels so that position s predicts label a_{s+1}
        bos = torch.full((labels.size(0), 1), self.vocab_size, dtype=labels.dtype, device=labels.device)
        ilm_logits = self.ilm(torch.cat([bos, labels], dim=1))
        # Label-level KD loss (Eq. 7)
        return self.label_level_kd_loss(next_token_probs, ilm_logits, label_lengths)
# Example Usage
if __name__ == "__main__":
# Hyperparameters
    INPUT_DIM = 80       # Mel features
    VOCAB_SIZE = 10000   # BPE tokens
    BATCH_SIZE = 16
    SEQ_LEN = 100        # acoustic frames
    LABEL_LEN = 20       # labels per utterance (must fit into SEQ_LEN frames)
# Initialize models
ctc_model = CTCConformer(INPUT_DIM, VOCAB_SIZE)
ilm_model = ILMEstimator(VOCAB_SIZE)
# Initialize trainer
trainer = CTCILMTrainer(ctc_model, ilm_model, VOCAB_SIZE, alpha=0.5, p_mask=0.4)
    # Sample data (replace with an actual dataloader)
    features = torch.randn(BATCH_SIZE, SEQ_LEN, INPUT_DIM)
    feature_lengths = torch.full((BATCH_SIZE,), SEQ_LEN)
    labels = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, LABEL_LEN))
    label_lengths = torch.full((BATCH_SIZE,), LABEL_LEN)
    # Dummy alignment spans (start_frame, end_frame) per utterance
    alignments = [[(i, i + 5) for i in range(0, SEQ_LEN, 5)] for _ in range(BATCH_SIZE)]
# Training step
loss = trainer.train_step(
features, feature_lengths, labels, label_lengths, alignments
)
print(f"Knowledge Distillation Loss: {loss.item():.4f}")
# Save ILM model
torch.save(ilm_model.state_dict(), "ctc_ilm_estimator.pt")
References
Das, N. et al. (2023). Mask the Bias. ICASSP.
Zhao, Z. & Bell, P. (2025). On CTC’s Internal LM. ICASSP.
Yang, Z. et al. (2025). Label-Context-Dependent ILM for CTC. arXiv:2506.06096.