Introduction: The Challenge of Audio-Vision Integration in Omnimodal LLMs
Omnimodal Large Language Models (OLLMs) like GPT-4o and Megrez have revolutionized how AI interacts with the world by seamlessly processing text, images, and audio. However, a critical performance gap persists: OLLMs perform significantly better with vision-text inputs than with vision-audio inputs.
For example, when asked “What’s the name of the book on the top of the pile?” in text form, models like Megrez answer correctly with “Ariel”. But when the same question is spoken, the model might respond with “Plays pleasant” — a plausible but incorrect answer. This inconsistency reveals a fundamental flaw: vision-audio integration lags far behind vision-text integration.
In this article, we dive into a groundbreaking solution proposed in the paper “Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models”: Self-Knowledge Distillation (Self-KD). We’ll explore why this gap exists, how Self-KD bridges it, and why this advancement is pivotal for the future of multimodal AI.
Why Vision-Audio Performance Lags Behind Vision-Text
1. The Performance Gap is Real and Widespread
The paper evaluates several OLLMs — VITA, VITA-1.5, and Megrez — across multiple benchmarks like MME, HallusionBench, and TextVQA. Text questions are converted to audio using Text-to-Speech (TTS), ensuring a fair comparison.
MODEL | VISION-TEXT SCORE | VISION-AUDIO SCORE | PERFORMANCE DROP |
---|---|---|---|
VITA-8x7B | 79.45 | 18.19 | 62.20 |
VITA-1.5-7B | 71.34 | 36.28 | 35.06 |
Megrez-3B | 68.96 | 49.72 | 19.24 |
As shown, all models suffer a significant drop in accuracy when switching from text to audio queries, with VITA losing over 60 points. This isn’t a minor glitch — it’s a systemic issue in current OLLM architectures.
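For context, voicing the text questions is straightforward with any off-the-shelf TTS system. The snippet below uses the gTTS library purely as an illustration; the paper does not tie its evaluation to this particular engine.

# tts_example.py -- illustrative only; not necessarily the TTS engine used by the authors.
from gtts import gTTS  # pip install gTTS

question = "What's the name of the book on the top of the pile?"
gTTS(text=question, lang="en").save("question.mp3")  # spoken version of the text query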
2. Attention Imbalance: Audio Queries Distract from Visual Cues
Using attention weight analysis, the authors found that:
- In text queries, OLLMs allocate strong attention to both query and image tokens.
- In audio queries, the model focuses more on the audio input itself and less on the visual information.
This imbalance explains why models generate relevant but inaccurate answers. For instance:
- Question: “What’s the least popular game in the chart?”
- Text response: ✅ Simulation
- Audio response: ❌ Puzzle (present in chart, but not the correct answer)
The model sees the chart and hears the question — but fails to integrate them effectively.
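As a rough sketch of this kind of attention analysis, the snippet below measures how much attention mass the query tokens place on the image tokens versus on themselves. It assumes you have already extracted per-layer attention maps (for example via output_attentions=True in Hugging Face transformers) and know which positions hold image and query tokens; the token layout and the random tensor here are placeholders.

import torch

def attention_share(attn_maps, from_idx, to_idx):
    # attn_maps: (num_layers, num_heads, seq_len, seq_len); each row sums to 1 over keys.
    avg = attn_maps.mean(dim=(0, 1))                      # average over layers and heads
    return avg[from_idx][:, to_idx].sum(dim=-1).mean().item()

# Hypothetical layout: image tokens at positions 0-255, query tokens at 256-287.
attn = torch.rand(4, 8, 300, 300).softmax(dim=-1)         # stand-in for real attention maps
image_idx, query_idx = torch.arange(0, 256), torch.arange(256, 288)
print("query -> image attention:", attention_share(attn, query_idx, image_idx))
print("query -> query attention:", attention_share(attn, query_idx, query_idx))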
3. Weak Modality Alignment Between Vision and Audio
The paper introduces MMAlign, a new benchmark to measure alignment between modalities. It’s based on the ARO dataset and tests whether models can choose the correct caption (e.g., “The white horse is in front of the fence”) over a perturbed one.
Results show:
- Vision-text alignment is strong (e.g., 75.67% accuracy for VITA-1.5).
- Vision-audio alignment is weak (e.g., 32.83% accuracy for VITA-1.5).
This confirms the hypothesis: during training, OLLMs align vision with text and audio with text, but never directly align vision with audio.
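In essence, the benchmark scores each candidate caption under the model and checks whether the correct one wins. The sketch below assumes a hypothetical helper, caption_logprob(model, image, query, caption), that returns the model's summed token log-probability for a caption; MMAlign's actual prompt format is not reproduced here.

def mmalign_accuracy(model, examples, caption_logprob):
    # examples: list of (image, query, correct_caption, perturbed_caption) tuples.
    # caption_logprob: hypothetical scorer returning log p(caption | image, query).
    correct = 0
    for image, query, good, bad in examples:
        if caption_logprob(model, image, query, good) > caption_logprob(model, image, query, bad):
            correct += 1
    return correct / len(examples)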
The Root Cause: Limitations in OLLM Training Pipelines
Current OLLM training follows four stages:
- Vision-Text Alignment
- Vision-Text Supervised Fine-Tuning (SFT)
- Audio-Text Alignment
- Vision-Audio SFT
While this pipeline enables multimodal understanding, it lacks direct vision-audio alignment. The model is expected to learn vision-audio integration implicitly during SFT — but as the results show, this is insufficient.
“Conventional vision-audio SFT alone is insufficient for enabling the model to effectively integrate vision and audio.”
— Hu et al., 2025
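As a purely illustrative sketch of such a schedule, the configuration below lists which modules might be trainable at each stage; the module names and freezing choices are assumptions, not details taken from the paper. The point to notice is that no stage pairs vision with audio before the final SFT step.

# Illustrative four-stage schedule; module names and freezing choices are assumed.
TRAINING_STAGES = [
    {"name": "vision-text alignment", "trainable": ["vision_projector"]},
    {"name": "vision-text SFT",       "trainable": ["vision_projector", "llm"]},
    {"name": "audio-text alignment",  "trainable": ["audio_projector"]},
    {"name": "vision-audio SFT",      "trainable": ["audio_projector", "llm"]},
]
# No stage directly aligns vision with audio -- the gap Self-KD is designed to close.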
Introducing Self-Knowledge Distillation (Self-KD)
To close the vision-audio performance gap, the authors propose Self-Knowledge Distillation (Self-KD) — a novel training framework where:
- The vision-text component of the OLLM acts as the teacher.
- The vision-audio component acts as the student.
- The student learns to mimic the teacher’s behavior, even when the input modality differs.
How Self-KD Works
Self-KD leverages the KL divergence between the teacher’s output distribution and the student’s:
\[ \mathcal{L}_{\text{Self-KD}} = \mathbb{E}_{x_a \sim X_A, \; x_t \sim X_T} \Big[ D_{\mathrm{KL}}\big( p_T(y \mid x_t) \,\big\|\, p_S(y \mid x_a) \big) \Big] \]

Where:

\[ \begin{aligned} p_{T}(y \mid x_{t}) &: \; \text{Teacher output distribution (vision-text)} \\ p_{S}(y \mid x_{a}) &: \; \text{Student output distribution (vision-audio)} \\ x_{t} &: \; \text{Text query} \\ x_{a} &: \; \text{Audio query} \\ y &: \; \text{Answer, the variable over which both output distributions range} \end{aligned} \]

The total training loss combines Self-KD and standard SFT:

\[ \mathcal{L} = \alpha \, \mathcal{L}_{\text{Self-KD}} \;+\; (1-\alpha) \, \mathcal{L}_{\text{SFT}} \]

Here, α ∈ [0, 1] controls the balance between distillation and direct supervision.
🔍 Key Insight: Unlike traditional KD, Self-KD uses different inputs (text vs. audio) for teacher and student, making it a cross-modal distillation technique.
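To make the objective concrete, here is a minimal PyTorch sketch of the combined loss (the full training script at the end of this article follows the same structure); the logits, labels, temperature, and α values are placeholders.

import torch
import torch.nn.functional as F

alpha, T = 0.75, 2.0                                      # KD ratio and softening temperature (assumed)
teacher_logits = torch.randn(4, 50)                       # vision-text pass (computed without gradients)
student_logits = torch.randn(4, 50, requires_grad=True)   # vision-audio pass
labels = torch.randint(0, 50, (4,))

# KL(teacher || student) on temperature-softened distributions, scaled by T^2.
kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   F.softmax(teacher_logits / T, dim=-1),
                   reduction="batchmean") * T * T
sft_loss = F.cross_entropy(student_logits, labels)
loss = alpha * kd_loss + (1 - alpha) * sft_loss
loss.backward()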
Experimental Results: Self-KD Delivers Significant Gains
The authors tested Self-KD on InternVL2 and Qwen2VL models of varying sizes.
Table: Vision-Audio Performance Before and After Self-KD
MODEL | SFT ONLY | SELF-KD | IMPROVEMENT |
---|---|---|---|
InternVL2-1B | 21.16 | 33.84 | +12.68 |
InternVL2-2B | 22.52 | 36.58 | +14.06 |
InternVL2-4B | 32.22 | 42.30 | +10.08 |
InternVL2-8B | 38.71 | 51.45 | +12.74 |
Qwen2VL-2B | 46.21 | 52.58 | +6.37 |
Qwen2VL-7B | 67.75 | 68.27 | +0.52 |
- Average improvement across the six models shown: about +9.4 points
- Best gain: +14.06 (InternVL2-2B)
Even on models already strong in vision-audio tasks (like Qwen2VL-7B), Self-KD provides a small but consistent boost.
Why Self-KD Works: Behavioral Alignment
1. Attention Weights Become More Balanced
After Self-KD training, the student model’s attention pattern closely matches the teacher’s:
- Higher attention to vision tokens during audio queries.
- Lower self-attention on query tokens, reducing over-focus on audio input.
This behavioral alignment ensures the model integrates both modalities effectively.
2. Improved Modality Alignment on MMAlign
MODEL | RELATION (SFT) | RELATION (SELF-KD) | ATTRIBUTE (SFT) | ATTRIBUTE (SELF-KD) |
---|---|---|---|---|
InternVL2-1B | 42.67 | 50.67 | 45.33 | 47.33 |
InternVL2-8B | 53.33 | 57.33 | 54.67 | 56.00 |
Qwen2VL-7B | 71.00 | 71.33 | 58.33 | 61.67 |
Self-KD improves both relation and attribute understanding, indicating that it strengthens semantic alignment between vision and audio.
Ablation Study: Finding the Optimal KD Ratio
The hyperparameter α (KD loss ratio) was tested across values 0.0 to 1.0.
KD RATIO (α) | AVG. PERFORMANCE |
---|---|
0.0 (SFT only) | 21.16 |
0.25 | 26.62 |
0.50 | 29.47 |
0.75 | 35.53 |
1.0 (KD only) | 33.84 |
✅ Best performance at α = 0.75
➡️ A mix of 75% distillation + 25% direct supervision works best.
Takeaway: Pure distillation (α=1.0) underperforms because the model still needs direct feedback on audio inputs.
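Reproducing such a sweep amounts to a loop over α around the training routine. Assuming the names defined in the script at the end of this article (OLLM, train, data_loader, and the hyperparameters), a sketch appended to its main block could look like this; the evaluation step is left as a placeholder.

# Hypothetical KD-ratio sweep, reusing OLLM / train() from the script below.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    model = OLLM(num_labels=NUM_LABELS)
    optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
    train(model, data_loader, optimizer, alpha=alpha, num_epochs=NUM_EPOCHS)
    # evaluate(model)  # plug in your vision-audio benchmark here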
Case Study: Real-World Output Comparison
Prompt (Audio): “Describe the color of the dress of the kids.”
- Base Model (Text Query): “The first child wears a red and white striped sweater, the second a yellow sweater, the third a blue shirt, the fourth a red sweater with a pocket, and the fifth a green and orange striped sweater.”
- SFT Model (Audio Query): “The kids are wearing colorful clothes, mostly sweaters in red, yellow, and blue.”
- Self-KD Model (Audio Query): “One child has a red and white striped sweater, another wears yellow, a third has a blue shirt, one has a red sweater with a pocket, and the last has a multicolored striped sweater.”
🔍 Observation: The Self-KD model’s output is nearly as detailed as the base model, while the SFT model gives only a general summary.
Implications for the Future of Multimodal AI
Self-KD represents a paradigm shift in how we train OLLMs:
- No need for new data — leverages existing vision-text capabilities.
- Improves cross-modal reasoning — aligns behavior across input modalities.
- Scalable and model-agnostic — works across different OLLM architectures.
This approach could be extended to:
- Video-audio integration
- Speech-to-image generation
- Multimodal robotics (e.g., voice-controlled visual navigation)
Limitations and Future Work
While Self-KD is effective, it has two key limitations:
- Higher Training Cost: Requires teacher inference during training.
- Performance Ceiling: Vision-audio performance still lags behind vision-text.
Future research could explore:
- Asymmetric distillation (e.g., image → audio mapping)
- Joint vision-audio pre-training
- Dynamic KD scheduling based on query complexity
Conclusion: Self-KD is a Game-Changer for OLLMs
The paper “Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models” delivers a powerful insight: OLLMs can learn to process audio like text by learning from themselves.
By introducing Self-Knowledge Distillation, the authors provide a simple yet effective method to:
- Reduce the vision-audio performance gap
- Improve attention to visual cues in audio queries
- Enhance modality alignment without new data
As OLLMs move toward true multimodal parity, techniques like Self-KD will be essential for building AI that sees, hears, and understands the world as humans do.
Call to Action: Join the Multimodal AI Revolution
🚀 Want to implement Self-KD in your own OLLM?
👉 Visit our GitHub repo: https://github.com/isruihu/Self-KD
📚 Read the full paper: arXiv:2503.00059v2
💬 Have questions or ideas?
Leave a comment below or connect with us on Twitter @AIResearchBlog. Let’s build the future of multimodal AI — together.
Below is a self-contained Python script that demonstrates the Self-KD training method end to end. It uses a simplified, mock OLLM with small, randomly initialized encoders and dummy data, so treat it as an illustration of the training recipe rather than the authors' actual implementation.
# main.py
# -----------------------------------------------------------------------------
# This script provides a complete, end-to-end implementation of the
# Self-Knowledge Distillation (Self-KD) training method for Omnimodal Large
# Language Models (OLLMs), as described in the research paper.
#
# The core idea is to improve the vision-audio capabilities of an OLLM by
# using its stronger vision-text component as a "teacher" to guide the
# training of its vision-audio "student" component.
#
# This implementation includes:
# 1. A mock OLLM architecture.
# 2. The Self-KD loss function using KL divergence.
# 3. A complete training loop demonstrating the proposed method.
# -----------------------------------------------------------------------------
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertModel, BertConfig, WhisperModel, WhisperConfig
# --- 1. Model Architecture ---
# We define a simplified OLLM to demonstrate the Self-KD process.
# In a real-world scenario, these would be pre-trained, state-of-the-art models.
class OLLM(nn.Module):
"""
A simplified Omnimodal Large Language Model (OLLM).
This model includes components for processing vision, audio, and text inputs.
- Vision Encoder: A placeholder for an image processing model (e.g., ViT).
- Audio Encoder: A mock Whisper model to process audio features.
- Text Embedding: Standard text embeddings.
- LLM: A mock BERT model as the core language processing unit.
"""
def __init__(self, num_labels=50):
super(OLLM, self).__init__()
# Vision Encoder (Placeholder)
# In a real implementation, this would be a pre-trained vision transformer.
# For this example, we use a simple linear layer to simulate its output.
self.vision_encoder = nn.Linear(256, 768) # Input feature size 256 -> BERT hidden size 768
# Audio Encoder
# Using a small, randomly initialized Whisper model configuration for demonstration.
whisper_config = WhisperConfig(
vocab_size=51865,
num_mel_bins=80,
encoder_layers=2,
encoder_attention_heads=4,
decoder_layers=2,
decoder_attention_heads=4,
            d_model=768,
            max_source_positions=8,  # small positional table; the encoder then expects 2 * 8 = 16 mel frames
)
self.audio_encoder = WhisperModel(whisper_config)
# Large Language Model (LLM)
# Using a small, randomly initialized BERT model configuration.
llm_config = BertConfig(
vocab_size=30522,
hidden_size=768,
num_hidden_layers=4,
num_attention_heads=4,
intermediate_size=1024
)
self.llm = BertModel(llm_config)
# Output Classifier
# A linear layer to map the LLM's output to the desired number of labels.
self.classifier = nn.Linear(768, num_labels)
def forward(self, image_input=None, text_input=None, audio_input=None, attention_mask=None):
"""
Forward pass for the OLLM.
It can process either vision-text or vision-audio pairs.
"""
# 1. Process Vision Input
# The image input is always present.
vision_embeds = self.vision_encoder(image_input)
# Add a sequence dimension for concatenation with text/audio embeds.
vision_embeds = vision_embeds.unsqueeze(1)
# 2. Process Text or Audio Input
if text_input is not None:
# Get text embeddings from the LLM's embedding layer.
text_embeds = self.llm.embeddings(input_ids=text_input)
# Combine vision and text embeddings.
combined_embeds = torch.cat((vision_embeds, text_embeds), dim=1)
elif audio_input is not None:
# Get audio embeddings from the audio encoder.
# The Whisper encoder expects a specific input format, which we simulate here.
audio_embeds = self.audio_encoder.encoder(audio_input).last_hidden_state
# Combine vision and audio embeddings.
combined_embeds = torch.cat((vision_embeds, audio_embeds), dim=1)
else:
raise ValueError("Either text_input or audio_input must be provided.")
# 3. Pass through LLM
# The combined embeddings are processed by the core language model.
outputs = self.llm(inputs_embeds=combined_embeds, attention_mask=attention_mask)
# We take the output of the [CLS] token (first token) for classification.
pooled_output = outputs.pooler_output
# 4. Get Final Logits
logits = self.classifier(pooled_output)
return logits
# --- 2. Self-Knowledge Distillation (Self-KD) Loss ---
def self_kd_loss(teacher_logits, student_logits, temperature=2.0):
"""
Calculates the Self-Knowledge Distillation loss.
Args:
teacher_logits: The output logits from the teacher model (vision-text).
student_logits: The output logits from the student model (vision-audio).
temperature: A softening parameter for the probability distributions.
Returns:
The KL divergence loss.
"""
# Soften the probability distributions using the temperature parameter.
soft_teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
soft_student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
# Calculate the KL divergence loss.
# The `batchmean` reduction averages the loss over the batch.
loss = F.kl_div(soft_student_log_probs, soft_teacher_probs, reduction='batchmean')
return loss * (temperature ** 2) # Scale the loss as suggested in Hinton's paper.
# --- 3. Training Setup ---
def train(model, data_loader, optimizer, alpha=0.75, num_epochs=5):
"""
The main training loop implementing the Self-KD methodology.
"""
model.train()
sft_criterion = nn.CrossEntropyLoss() # Standard SFT loss
print("--- Starting Training ---")
for epoch in range(num_epochs):
total_sft_loss = 0
total_kd_loss = 0
total_combined_loss = 0
for batch_idx, (image_input, text_input, audio_input, labels) in enumerate(data_loader):
optimizer.zero_grad()
# --- Teacher Forward Pass (Vision-Text) ---
# The teacher component processes the image and text query.
# We detach its output to prevent gradients from flowing back to it.
with torch.no_grad():
teacher_logits = model(image_input=image_input, text_input=text_input)
# --- Student Forward Pass (Vision-Audio) ---
# The student component processes the image and audio query.
student_logits = model(image_input=image_input, audio_input=audio_input)
# --- Loss Calculation ---
# 1. Standard Supervised Fine-Tuning (SFT) Loss
# This is the conventional loss for the vision-audio task.
sft_loss = sft_criterion(student_logits, labels)
# 2. Self-Knowledge Distillation (Self-KD) Loss
# This loss encourages the student to mimic the teacher's output distribution.
kd_loss = self_kd_loss(teacher_logits, student_logits)
# 3. Combined Loss
# The final loss is a weighted sum of the SFT and Self-KD losses.
# The hyperparameter 'alpha' controls the balance between them.
combined_loss = (alpha * kd_loss) + ((1 - alpha) * sft_loss)
# --- Backpropagation ---
combined_loss.backward()
optimizer.step()
total_sft_loss += sft_loss.item()
total_kd_loss += kd_loss.item()
total_combined_loss += combined_loss.item()
# --- Logging ---
avg_sft_loss = total_sft_loss / len(data_loader)
avg_kd_loss = total_kd_loss / len(data_loader)
avg_combined_loss = total_combined_loss / len(data_loader)
print(f"Epoch {epoch+1}/{num_epochs} | "
f"SFT Loss: {avg_sft_loss:.4f} | "
f"KD Loss: {avg_kd_loss:.4f} | "
f"Combined Loss: {avg_combined_loss:.4f}")
print("--- Training Finished ---")
# --- 4. Main Execution Block ---
if __name__ == '__main__':
# --- Hyperparameters ---
BATCH_SIZE = 8
NUM_EPOCHS = 5
LEARNING_RATE = 5e-5
ALPHA = 0.75 # As suggested by the paper for best performance
NUM_LABELS = 50 # Example number of output classes
# --- Model Initialization ---
ollm_model = OLLM(num_labels=NUM_LABELS)
optimizer = optim.Adam(ollm_model.parameters(), lr=LEARNING_RATE)
# --- Dummy Data Generation ---
# Create mock data to simulate the training process.
# In a real application, this data would be loaded from a pre-processed dataset.
num_samples = 128
image_dim = 256
text_seq_len = 10
    audio_seq_len = 16  # must equal 2 * max_source_positions so the Whisper positional embeddings line up
vocab_size = 30522
# Generate random tensors for each modality.
dummy_images = torch.randn(num_samples, image_dim)
dummy_texts = torch.randint(0, vocab_size, (num_samples, text_seq_len))
# For Whisper encoder, input is (batch_size, num_mel_bins, sequence_length)
dummy_audios = torch.randn(num_samples, 80, audio_seq_len)
dummy_labels = torch.randint(0, NUM_LABELS, (num_samples,))
# Create a PyTorch DataLoader.
dataset = TensorDataset(dummy_images, dummy_texts, dummy_audios, dummy_labels)
data_loader = DataLoader(dataset, batch_size=BATCH_SIZE)
# --- Start Training ---
train(ollm_model, data_loader, optimizer, alpha=ALPHA, num_epochs=NUM_EPOCHS)