The Hidden Flaw Crippling Your LLM’s Reasoning Power
Large language models (LLMs) promise revolutionary reasoning capabilities, yet most hit an invisible wall. Traditional training forces a brutal trade-off:
- Knowledge Distillation (KD) quickly imitates teacher models but generalizes poorly to unseen problems (↓3.1% accuracy vs. RL).
- Reinforcement Learning (RL) enables creative problem-solving but consumes far more compute and struggles to discover high-reward reasoning paths on its own.
Enter KDRL, a framework from Huawei and HIT that merges KD and RL into a single unified post-training pipeline. Results across six reasoning benchmarks show:
- 57.2% average accuracy—outperforming GRPO (↑2.6%) and KD-RKL (↑1.1%)
- 20% shorter responses than pure distillation
- 80-hour training convergence—2.3× faster than RL-only systems
How KDRL Shatters the KD-RL Deadlock
KDRL’s breakthrough lies in combining reverse KL divergence (RKL) distillation with policy-gradient optimization, aligning teacher guidance with environment rewards in a single objective.
The 3-Pillar Architecture
- Teacher Supervision
- On-policy sampling from student rollouts
- Token-level reverse KL loss (RKL) minimizes divergence from teacher
- Rule-Based Reinforcement
- Binary verifier rewards (format + accuracy)
- GRPO advantage function scales self-exploration
- Dynamic Balancing
- Annealing KL coefficient: shifts focus from imitation (β = 5e-3) to reward optimization (β = 1e-3)
- Reward-guided masking: suppresses the KD term on already-correct responses to avoid over-regularization (see the objective sketch below)
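Putting the three pillars together, the joint objective can be sketched as follows (the notation here is mine; see the paper for the exact formulation):

$$
\mathcal{J}_{\mathrm{KDRL}}(\theta) \;=\; \mathcal{J}_{\mathrm{GRPO}}(\theta) \;-\; \beta_k \, \mathbb{E}_{y \sim \pi_\theta}\!\left[\, m(y)\, \mathrm{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_T\right) \right],
\qquad
\beta_k = \max\!\left(\beta_0 - \lambda k,\; \beta_{\min}\right)
$$

where π_θ is the student policy, π_T the frozen teacher, m(y) ∈ {0, 1} the reward-guided mask (1 only when a rollout fails the verifier), k the training step, and β anneals linearly from 5e-3 to 1e-3.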
Critical Insight: Unlike Qwen3-style distillation, KDRL’s k2 estimator yields unbiased gradient estimates of the reverse KL, accelerating convergence by 17% compared with k3 approximations.
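For intuition, here is a minimal sketch (my own illustration, not the paper’s code) of the standard per-token Monte Carlo KL estimators, computed from the log-probabilities that the student and teacher assign to tokens sampled from the student:

```python
import torch

def kl_estimators(logp_student: torch.Tensor, logp_teacher: torch.Tensor):
    """Per-token Monte Carlo estimators of the reverse KL D_KL(pi_student || pi_teacher).

    Inputs are log-probabilities of tokens sampled from the student, so the
    expectation of each estimator is taken under the student policy.
    """
    log_r = logp_teacher - logp_student   # log(pi_T / pi_theta) at the sampled tokens
    k1 = -log_r                           # unbiased for the KL value, but high variance
    k2 = 0.5 * log_r ** 2                 # the k2 estimator KDRL builds its KD loss on
    k3 = torch.expm1(log_r) - log_r       # lower-variance alternative, always non-negative
    return k1, k2, k3
```

Which estimator best trades bias against variance is an empirical question; the article above reports k2 converging faster than k3 in this setting.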
Benchmark Domination:
Tests across MATH500, AIME24, and OlympiadBench show KDRL ahead of strong baselines:
| Method | AIME24 (%) | Avg. Length (tokens) | Training Hours |
|---|---|---|---|
| SFT (Reject Sampling) | 33.5 | 4,812 | 80 |
| GRPO (RL-only) | 38.3 | 5,109 | 68 |
| KD-RKL (Qwen3-style) | 41.0 | 8,642 | 92 |
| KDRL (Ours) | 42.9 | 6,203 | 80 |
Per-benchmark accuracy:

| Model | AIME24 avg@16 (%) | MATH pass@1 (%) | AMC23 avg@16 (%) |
|---|---|---|---|
| GRPO (Pure RL) | 38.3 | 88.6 | 79.5 |
| KD-RKL (On-Policy KD) | 41.0 | 89.0 | 80.2 |
| KDRL (Ours) | 42.1 | 90.0 | 81.3 |
| KDRL-Annealing | 42.9 | 90.4 | 82.2 |
Takeaway: KDRL outperforms all baselines, achieving a 5.7% gain over SFT and a 2.6% gain over GRPO.
3 Unmatched Advantages
- Token Efficiency
- Response-level masking cuts redundant outputs by 1,400 tokens (↓11%)
- No accuracy trade-off: 42.9% AIME24 score retained
- Training Stability
- Annealing β prevents early length explosion (truncation ↓15% vs KD-RKL)
- Group-level reward masking eliminates “gradient conflict” on already-correct responses (see the masking sketch after this list)
- Scalability
- Works for “Zero-RL” (direct pretrained model tuning) and distilled backbones
- 32.5% accuracy on Qwen2.5-3B—outperforming GRPO by 1.2%
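To make the masking concrete, here is a minimal toy sketch (my own illustration, not the released code) contrasting response-level and group-level reward-guided masking for one prompt’s group of rollouts:

```python
import torch

def kd_mask(rewards: torch.Tensor, group_level: bool = False) -> torch.Tensor:
    """Return a {0,1} mask selecting which rollouts still receive the KD term.

    rewards: binary verifier rewards for one prompt's rollout group, shape [G].
    Response-level: keep KD only on rollouts the verifier rejected.
    Group-level: drop KD for the whole group once any rollout is already correct.
    """
    if group_level:
        keep_group = float(rewards.max().item() == 0)  # 1 only if every rollout failed
        return torch.full_like(rewards, keep_group)
    return (rewards == 0).float()

rewards = torch.tensor([0.0, 1.0, 0.0, 0.0])       # one correct rollout in the group
print(kd_mask(rewards))                            # tensor([1., 0., 1., 1.])
print(kd_mask(rewards, group_level=True))          # tensor([0., 0., 0., 0.])
```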
Implement KDRL: Your 4-Step Blueprint
Deploy the model with the VeRL RL training framework, applying these optimizations:
```python
# Pseudo-code for the KDRL joint loss (PyTorch-style)
def kdrl_loss(policy_logits, teacher_logits, advantages):
    # 1. Compute k2 KL approximation
    rkl = 0.5 * (log(teacher_probs) - log(policy_probs)) ** 2
    # 2. Anneal KL coefficient (beta from 5e-3 -> 1e-3)
    beta = max(5e-3 - 5e-5 * step, 1e-3)
    # 3. Reward-guided masking
    mask = (rewards == 0).float()  # suppress KD on correct responses
    masked_rkl = rkl * mask
    # 4. Combine with the GRPO objective (loss form of J_GRPO - beta * D_KL)
    return grpo_loss + beta * masked_rkl.mean()
```
Critical Hyperparameters

| Parameter | Value | Effect |
|---|---|---|
| KL scheduler (β) | 5e-3 → 1e-3 | Balances imitation/reward phases |
| Rollouts per prompt | 16 | Optimal exploration-compute balance |
| Top-K tokens | 50 | Prevents memory overflow in KL calc |
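As a quick sanity check on that schedule (my own arithmetic; whether a “step” is a gradient update or a rollout batch is my assumption), β hits its 1e-3 floor after (5e-3 − 1e-3) / 5e-5 = 80 steps:

```python
def kl_beta(step: int, beta0: float = 5e-3, rate: float = 5e-5, beta_min: float = 1e-3) -> float:
    """Linearly annealed KL coefficient from the hyperparameter table above."""
    return max(beta0 - rate * step, beta_min)

print([round(kl_beta(s), 4) for s in (0, 20, 40, 80, 200)])
# [0.005, 0.004, 0.003, 0.001, 0.001]
```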
The Future of Reasoning LLMs Starts Here
KDRL isn’t just incremental—it’s a paradigm shift. By fusing teacher wisdom with environment rewards, it solves the “generalization-efficiency paradox” plaguing LLMs. As Huawei’s team confirms:
“KDRL enables smaller models (1.5B) to achieve 80% of 7B teacher performance—democratizing high-level reasoning.”
Transform your LLM from a memorizing machine to a reasoning powerhouse—today.
Below is a simplified, self-contained PyTorch reference implementation of KDRL based on the research paper; several pieces (reward verification, rollout bookkeeping, prompt/response alignment) are placeholders rather than production code. It integrates knowledge distillation (KD) with reinforcement learning (RL) using a GRPO-style objective and a reverse-KL distillation term:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from collections import deque
import numpy as np
class KDRL:
    def __init__(self, student_model, teacher_model, tokenizer=None, kl_coef_init=5e-3,
                 kl_coef_min=1e-3, kl_annealing_rate=5e-5, gamma=0.99, top_k=50):
        """
        KDRL: Unified Knowledge Distillation and Reinforcement Learning
        Args:
            student_model: Pretrained student model (to be optimized)
            teacher_model: Frozen teacher model for knowledge distillation
            tokenizer: Tokenizer used to decode rollouts for reward computation
            kl_coef_init: Initial KL coefficient (β)
            kl_coef_min: Minimum KL coefficient after annealing
            kl_annealing_rate: Linear decay rate for the KL coefficient
            gamma: Discount factor for advantage calculation
            top_k: Top-K tokens kept from the teacher distribution in the KL approximation
        """
        self.student = student_model
        self.teacher = teacher_model
        self.tokenizer = tokenizer
        self.kl_coef = kl_coef_init
        self.kl_init = kl_coef_init
        self.kl_min = kl_coef_min
        self.kl_annealing_rate = kl_annealing_rate
        self.gamma = gamma
        self.top_k = top_k
# Freeze teacher model
for param in self.teacher.parameters():
param.requires_grad = False
def anneal_kl_coef(self, step):
"""Linearly anneal KL coefficient"""
self.kl_coef = max(self.kl_init - self.kl_annealing_rate * step, self.kl_min)
def k2_kl_approximation(self, student_logits, teacher_logits, mask):
"""
Compute k2 KL divergence approximation
Args:
student_logits: [batch_size, seq_len, vocab_size]
teacher_logits: [batch_size, seq_len, vocab_size]
mask: [batch_size, seq_len] (1 for valid tokens, 0 for padding)
Returns:
kl_loss: Scalar tensor
"""
# Apply top-K filtering to teacher logits
topk_values, topk_indices = torch.topk(teacher_logits, self.top_k, dim=-1)
filtered_teacher_logits = torch.full_like(teacher_logits, -float('inf'))
filtered_teacher_logits.scatter_(-1, topk_indices, topk_values)
# Compute probabilities
student_probs = F.softmax(student_logits, dim=-1)
teacher_probs = F.softmax(filtered_teacher_logits, dim=-1)
        # Squared log-ratio penalty in the spirit of the k2 estimator; for simplicity it is
        # computed densely over the top-K teacher support rather than only at sampled tokens
        log_ratio = torch.log(teacher_probs + 1e-8) - torch.log(student_probs + 1e-8)
        kl_div = 0.5 * (log_ratio ** 2)
# Apply mask and average
kl_loss = (kl_div * mask.unsqueeze(-1)).sum() / mask.sum()
return kl_loss
def compute_advantage(self, rewards, values):
"""
Compute generalized advantage estimation (GAE)
Args:
rewards: [batch_size, seq_len]
values: [batch_size, seq_len]
Returns:
advantages: [batch_size, seq_len]
"""
        advantages = torch.zeros_like(rewards)
        last_advantage = 0.0
        seq_len = rewards.size(1)
        # Calculate advantages in reverse, bootstrapping a value of 0 past the final step
        for t in reversed(range(seq_len)):
            next_value = values[:, t + 1] if t + 1 < seq_len else torch.zeros_like(values[:, t])
            delta = rewards[:, t] + self.gamma * next_value - values[:, t]
            advantages[:, t] = last_advantage = delta + self.gamma * last_advantage
# Normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
return advantages
def reward_guided_masking(self, kl_loss, rewards, response_level=True):
"""
Apply reward-guided masking to KL loss
Args:
kl_loss: Original KL loss tensor
rewards: Reward values for each response
response_level: True for response-level masking, False for group-level
Returns:
masked_kl_loss: Masked KL loss
"""
        if response_level:
            # Response-level masking: keep the KD term only for responses the verifier rejected
            mask = (rewards == 0).float()
            return kl_loss * mask.mean()  # scale the averaged KL by the failing fraction
        else:
            # Group-level masking: drop KD for the whole group once any response is correct
            group_mask = (rewards.max() == 0).float()
            return kl_loss * group_mask
def grpo_loss(self, student_logits, old_logits, actions, advantages, mask):
"""
Compute Group Relative Policy Optimization (GRPO) loss
Args:
student_logits: Current model logits [batch, seq, vocab]
old_logits: Old model logits [batch, seq, vocab]
actions: Selected token indices [batch, seq]
advantages: Computed advantages [batch, seq]
mask: Padding mask [batch, seq]
Returns:
policy_loss: GRPO loss value
"""
# Get log probabilities
student_log_probs = F.log_softmax(student_logits, dim=-1)
old_log_probs = F.log_softmax(old_logits, dim=-1).detach()
# Calculate importance ratio
action_log_probs = student_log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
old_action_log_probs = old_log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
ratio = torch.exp(action_log_probs - old_action_log_probs)
# Compute clipped policy loss
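        # PPO-style clipping with epsilon = 0.2 (ratio clamped to [0.8, 1.2]) keeps the
        # update close to the rollout policy that produced old_logits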
policy_loss = -torch.min(
ratio * advantages,
torch.clamp(ratio, 0.8, 1.2) * advantages
)
# Apply mask and average
return (policy_loss * mask).sum() / mask.sum()
def train_step(self, prompts, optimizer, step, response_level_mask=True):
"""
Perform single KDRL training step
Args:
prompts: Batch of input prompts
optimizer: Model optimizer
step: Current training step (for KL annealing)
response_level_mask: Type of reward masking
Returns:
total_loss: Combined KDRL loss
"""
# Generate rollouts with current policy
rollouts, old_logits, actions, masks = self.generate_rollouts(prompts)
# Compute rewards (format + correctness)
rewards = self.compute_rewards(rollouts)
        # Get student and teacher logits over the full rollout sequences
        # (prompt/response position alignment is omitted here for brevity)
        student_logits = self.student(rollouts).logits
        with torch.no_grad():
            teacher_logits = self.teacher(rollouts).logits
# Calculate KL divergence with k2 approximation
kl_loss = self.k2_kl_approximation(student_logits, teacher_logits, masks)
# Apply reward-guided masking
masked_kl_loss = self.reward_guided_masking(kl_loss, rewards, response_level_mask)
        # GRPO-style outcome advantage: normalize the binary rewards across the batch
        # (per-prompt grouping is omitted here for brevity) and broadcast to every token
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        advantages = advantages.unsqueeze(-1).expand_as(actions).float()
# Calculate GRPO policy loss
policy_loss = self.grpo_loss(
student_logits,
old_logits,
actions,
advantages,
masks
)
# Anneal KL coefficient
self.anneal_kl_coef(step)
        # Combine losses (loss form of J_KDRL = J_GRPO - beta * D_KL): minimize the
        # GRPO loss plus the beta-weighted, reward-masked KL to the teacher
        total_loss = policy_loss + self.kl_coef * masked_kl_loss
# Optimization step
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
return total_loss.item()
    # Helper methods (implemented below):
    # - generate_rollouts()
    # - compute_rewards()
    # - format_checker()
    # - answer_verifier()
def generate_rollouts(self, prompts, num_rollouts=16, max_length=2048):
"""
Generate on-policy rollouts with current student model
Args:
prompts: Batch of input prompts [batch, seq]
num_rollouts: Responses per prompt
max_length: Maximum response length
Returns:
rollouts: Generated responses
old_logits: Logits from generation
actions: Selected token indices
masks: Attention masks
"""
self.student.eval()
rollouts = []
all_logits = []
all_actions = []
all_masks = []
with torch.no_grad():
for prompt in prompts:
# Repeat prompt for multiple rollouts
repeated_prompt = prompt.repeat(num_rollouts, 1)
# Generate responses
outputs = self.student.generate(
repeated_prompt,
max_length=max_length,
do_sample=True,
top_k=50,
output_scores=True,
return_dict_in_generate=True
)
# Store results
rollouts.append(outputs.sequences)
all_logits.append(torch.stack(outputs.scores, dim=1))
all_actions.append(outputs.sequences[:, prompt.size(-1):])
# Create masks
mask = torch.ones_like(outputs.sequences)
mask[outputs.sequences == self.tokenizer.pad_token_id] = 0
all_masks.append(mask)
self.student.train()
return (
torch.cat(rollouts),
torch.cat(all_logits),
torch.cat(all_actions),
torch.cat(all_masks)
)
def compute_rewards(self, rollouts):
"""
Compute binary rewards (format + correctness)
Args:
rollouts: Generated responses
Returns:
rewards: Tensor of rewards [batch_size]
"""
rewards = []
for response in rollouts:
text = self.tokenizer.decode(response, skip_special_tokens=True)
# Format checking
format_ok = self.format_checker(text)
# Answer verification
answer_correct = self.answer_verifier(text)
# Combined binary reward
rewards.append(1.0 if format_ok and answer_correct else 0.0)
return torch.tensor(rewards, device=self.student.device)
def format_checker(self, text):
"""Check response format validity"""
return "<think>" in text and "</think>" in text and "\\boxed" in text
def answer_verifier(self, text):
"""Verify answer correctness (simplified)"""
# In practice, integrate with math_verify library
return "CORRECT" in text # Placeholder logic
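    # Illustrative alternative (my own sketch, not part of the paper or its code):
    # extract the final \boxed{...} expression and compare it with a known gold
    # answer. `gold_answer` is a hypothetical argument; a production verifier should
    # use a math-aware equivalence check rather than exact string matching.
    def verify_boxed_answer(self, text, gold_answer):
        """Return True if the last \\boxed{...} expression equals gold_answer exactly."""
        import re  # local import keeps this optional sketch self-contained
        matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
        return bool(matches) and matches[-1].strip() == str(gold_answer).strip()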
# Training Loop Integration
student = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
teacher = AutoModelForCausalLM.from_pretrained("Skywork/Skywork-OR1-Math-7B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding works for batched prompts
# Initialize KDRL trainer
kdrl = KDRL(
    student_model=student,
    teacher_model=teacher,
    tokenizer=tokenizer,
    kl_coef_init=5e-3,
    kl_coef_min=1e-3,
    kl_annealing_rate=5e-5
)
# Initialize optimizer
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-6)
# Training loop ("dataloader" is assumed to yield batches like {"questions": [...]})
for step, batch in enumerate(dataloader):
prompts = tokenizer(batch["questions"], return_tensors="pt", padding=True).input_ids
loss = kdrl.train_step(
prompts=prompts,
optimizer=optimizer,
step=step,
response_level_mask=True
)
# Logging and checkpointing
if step % 50 == 0:
print(f"Step {step}: Loss={loss:.4f}, KL Coef={kdrl.kl_coef:.6f}")
if step % 1000 == 0:
torch.save(student.state_dict(), f"kdrl_checkpoint_step{step}.pt")
# Usage instructions (with your own models)
student = AutoModelForCausalLM.from_pretrained("your-base-model")
teacher = AutoModelForCausalLM.from_pretrained("powerful-teacher-model")
tokenizer = AutoTokenizer.from_pretrained("your-base-model")
kdrl = KDRL(
    student_model=student,
    teacher_model=teacher,
    tokenizer=tokenizer,     # used to decode rollouts for reward checking
    kl_coef_init=5e-3,       # initial KL weight
    kl_coef_min=1e-3,        # minimum KL weight
    kl_annealing_rate=5e-5,  # annealing rate
    top_k=50                 # top-K for the KL approximation
)
for step, batch in enumerate(dataloader):
loss = kdrl.train_step(
prompts=batch["inputs"],
optimizer=optimizer,
step=step,
response_level_mask=True
)
References
- Xu et al. (2025). KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning. arXiv:2506.02208.
- Agarwal, R. et al. (2024). On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. ICLR 2024.
- Guo, D. et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.