In the rapidly evolving landscape of artificial intelligence and wireless communication, a groundbreaking new framework—ProMSC-MIS (Prompt-based Multimodal Semantic Communication for Multi-Spectral Image Segmentation)—is setting a new benchmark in task-driven data transmission. Developed by Haoshuo Zhang, Yufei Bo, and Meixia Tao from Shanghai Jiao Tong University, this innovative system redefines how multimodal data is processed, transmitted, and utilized for high-impact applications like autonomous driving and nighttime surveillance.
This article dives deep into the architecture, training methodology, and performance advantages of ProMSC-MIS, offering a comprehensive overview for researchers, engineers, and AI enthusiasts interested in semantic communication, multimodal learning, and efficient image segmentation.
What Is ProMSC-MIS?
ProMSC-MIS stands at the intersection of deep learning, multimodal fusion, and semantic-aware transmission. Unlike traditional communication systems that focus on pixel-accurate image reconstruction, ProMSC-MIS prioritizes the transmission of task-relevant semantic information. This means only the most meaningful features needed for image segmentation are extracted, compressed, and sent over bandwidth-limited channels.
The framework is specifically designed for multi-spectral image segmentation, where spatially aligned RGB and thermal images are processed together to enhance scene understanding under diverse lighting conditions.
🔍 Primary Keywords: ProMSC-MIS, multimodal semantic communication, multi-spectral image segmentation, prompt learning, semantic fusion
Why ProMSC-MIS Matters: The Problem with Traditional Methods
Conventional image transmission pipelines follow a source-channel separation principle: images are compressed (e.g., using JPEG2000 or BPG), transmitted, reconstructed, and then fed into a segmentation model. This approach has critical drawbacks:
- High bandwidth consumption due to full-image transmission.
- Poor performance in low-light or extreme conditions.
- Redundant data transmission, including visually irrelevant pixels.
- Decoupled processing between communication and task execution.
Semantic communication addresses these issues by integrating task awareness into the transmission pipeline. However, most existing systems are unimodal or lack intelligent fusion strategies.
ProMSC-MIS fills this gap with a prompt-guided, contrastive pre-training mechanism and an advanced cross-attention + SE network fusion module, enabling superior performance at 50–70% lower bandwidth.
How ProMSC-MIS Works: A Step-by-Step Breakdown
1. Input: RGB and Thermal Image Pair
Let the input images be:
- \( x_r \in \mathbb{R}^{H \times W \times 3} \): RGB image
- \( x_t \in \mathbb{R}^{H \times W \times 1} \): thermal image
Both are spatially aligned and captured from the same viewpoint.
2. Unimodal Semantic Encoders with Prompt Learning
ProMSC-MIS employs two parallel encoders:
- \( f_{\text{RGB}}(\cdot) \): extracts RGB semantic features \( y_r^{\text{RGB}} \)
- \( f_{\text{THE}}(\cdot) \): extracts thermal semantic features \( y_t^{\text{THE}} \)
But here’s the innovation: during pre-training, each modality is used as a prompt for the other.
🔄 Prompt Learning in Action
\[ x_r' = \text{ConvertRGBtoGray}(x_r) \]
\[ x_t' = \text{ExpandTo3Channels}(x_t) \]
\[ y_t^{\text{RGB}} = f_{\text{RGB}}(x_t'; \theta_r) \]
\[ y_r^{\text{THE}} = f_{\text{THE}}(x_r'; \theta_t) \]
These cross-modal outputs serve as “semantic hints” or prompts, guiding the encoders to learn complementary features rather than redundant ones.
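A minimal sketch of this cross-prompting step in PyTorch (the encoders here are simple stand-in modules, not the paper's ResNet-based backbones; see the full implementation at the end of this article):

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the two unimodal encoders (not the paper's backbones).
f_RGB = nn.Conv2d(3, 64, kernel_size=3, padding=1)     # expects 3-channel input
f_THE = nn.Conv2d(1, 64, kernel_size=3, padding=1)     # expects 1-channel input

x_r = torch.rand(2, 3, 480, 640)                       # RGB batch
x_t = torch.rand(2, 1, 480, 640)                       # thermal batch

# Convert each modality so it can be fed to the *other* encoder as a prompt.
w = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
x_r_prime = (x_r * w).sum(dim=1, keepdim=True)         # RGB -> grayscale
x_t_prime = x_t.repeat(1, 3, 1, 1)                     # thermal -> 3 channels

y_RGB_r = f_RGB(x_r)         # native RGB features
y_THE_t = f_THE(x_t)         # native thermal features
y_RGB_t = f_RGB(x_t_prime)   # thermal image viewed through the RGB encoder (prompt)
y_THE_r = f_THE(x_r_prime)   # grayscale RGB viewed through the thermal encoder (prompt)
```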
🧠 Contrastive Learning Objective
To ensure diversity in the learned features, a contrastive loss drives the two projected embeddings toward orthogonality by minimizing their (absolute) cosine similarity:
\[ L_v(v_r, v_t) = \frac{\lvert v_r \cdot v_t \rvert}{\| v_r \|_2 \, \| v_t \|_2} \]
where:
\[ v_r = g_{\text{RGB}}\Big( \text{concat}(y_{r}^{\text{RGB}}, \, y_{t}^{\text{RGB}}); \, \phi_r \Big) \]
\[ v_t = g_{\text{THE}}\Big( \text{concat}(y_{t}^{\text{THE}}, \, y_{r}^{\text{THE}}); \, \phi_t \Big) \]
Minimizing \( L_v \) pushes the embeddings apart, encouraging the model to capture distinct, modality-specific semantics.
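In code, this objective reduces to a few lines (a compact sketch; the projection heads `g_RGB` and `g_THE` are replaced by random placeholder embeddings here):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v_r, v_t):
    """Mean absolute cosine similarity between the two projected embeddings."""
    v_r = F.normalize(v_r, p=2, dim=1)
    v_t = F.normalize(v_t, p=2, dim=1)
    return (v_r * v_t).sum(dim=1).abs().mean()

v_r = torch.randn(4, 256, requires_grad=True)   # placeholder RGB-branch projection
v_t = torch.randn(4, 256, requires_grad=True)   # placeholder thermal-branch projection
loss = contrastive_loss(v_r, v_t)
loss.backward()                                  # gradients push the embeddings apart
```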
3. Semantic Fusion Module: Cross-Attention + SE Networks
After encoding, features are fused using a hybrid architecture:
✅ Cross-Attention Transformer Block
- Self-Attention: Enhances intra-modal features
- Cross-Attention: Fuses information across modalities
For RGB features \( F_r \in \mathbb{R}^{L_r \times D} \) and thermal features \( F_t \in \mathbb{R}^{L_t \times D} \):
- Queries are computed from the RGB features
- Keys and values are computed from the thermal features
Multi-Head Cross-Attention (MHCA):
\[ \mathrm{CA}_h(F_r^s) = \text{softmax}\!\left( \frac{Q_{r,h} (K_{t,h})^{T}}{\sqrt{D_m / H}} \right) V_{t,h} \]
\[ \mathrm{MHCA}(F_r^s) = \Big[ \text{concat}_{h=1}^{H} \; \mathrm{CA}_h(F_r^s) \Big] W_{rc} \]
A learnable matrix \( M_r \) dynamically adjusts the cross-modal contribution:
\[ \alpha_{r} = F_{rc} \odot M_{r} + F_{rc} \]
The same process applies in the thermal → RGB direction, with queries from thermal and keys/values from RGB.
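As a minimal sketch of one fusion direction using PyTorch's built-in multi-head attention (the head count, dimensions, and gating-matrix shape below are illustrative assumptions, not the paper's exact settings):

```python
import torch
import torch.nn as nn

D, H, L = 256, 8, 784                        # embedding dim, heads, token count (assumed)
mhca = nn.MultiheadAttention(D, H, batch_first=True)
M_r = nn.Parameter(torch.ones(1, L, D))      # learnable contribution matrix

F_r_s = torch.randn(2, L, D)                 # RGB tokens after self-attention
F_t_s = torch.randn(2, L, D)                 # thermal tokens after self-attention

# Query from RGB; key and value from thermal.
F_rc, _ = mhca(query=F_r_s, key=F_t_s, value=F_t_s)
alpha_r = F_rc * M_r + F_rc                  # element-wise gating as in the equation above
```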

✅ Fusion-Enhancement with Squeeze-and-Excitation (SE)
The concatenated outputs \( \tilde{F} = [\alpha_r \, ; \, \alpha_t] \) pass through:
- Fusion Blocks: Mini-Inception modules for multi-scale feature capture
- SE Networks: Channel-wise attention to recalibrate feature importance
After three stages, the final fused semantic vector \( z_s \in \mathbb{R}^{L_s} \) is generated.
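The channel-recalibration step can be sketched with a generic squeeze-and-excitation block over the fused token sequence (layer sizes are placeholders, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-excitation over the channels of a (batch, channels, length) tensor."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=2))        # squeeze: global average over the sequence
        return x * w.unsqueeze(-1)        # excite: rescale each channel

fused = torch.randn(2, 512, 784)          # concatenated [alpha_r ; alpha_t] tokens, channel-first
recalibrated = SEBlock1d(512)(fused)      # same shape, channel-wise reweighted
```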

4. Bit Generation & Transmission (SSCC)
To ensure compatibility with existing digital systems, ProMSC-MIS uses Separate Source-Channel Coding (SSCC):
- Bit Generator \( f_B(\cdot) \): a learnable probabilistic layer maps \( z_s \) to a probability table \( p_s \in \mathbb{R}^{L_b \times 2} \)
- Gumbel-Softmax Sampling enables differentiable discrete bit generation (see the sketch after this list):
\[ b_\ell = \text{one\_hot}\big(\arg\max(\log p_{s,\ell} + g)\big), \quad g \sim \text{Gumbel}(0, 1) \]
- Transmission: the bit sequence \( b \) is sent over a Binary Symmetric Channel (BSC) with bit-flip probability \( p \)
- Reception: the received bit sequence \( \hat{b} \) is mapped back to an estimate of \( z_s \)
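A small sketch of the straight-through Gumbel-Softmax bit sampling and the BSC simulation (the bit length and flip probability are arbitrary here):

```python
import torch
import torch.nn.functional as F

L_b, p_flip = 1024, 0.01                               # illustrative bit length / flip rate
logits = torch.randn(2, L_b, 2, requires_grad=True)    # stand-in for the learned table p_s

# Straight-through Gumbel-Softmax: discrete 0/1 bits, still differentiable w.r.t. logits.
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
b = one_hot[..., 1]                                    # the selected category index is the bit

# Binary Symmetric Channel: each bit flips independently with probability p_flip.
flips = (torch.rand_like(b) < p_flip).float()
b_hat = b + flips - 2 * b * flips                      # XOR for {0, 1}-valued tensors
```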
5. Semantic Decoding for Segmentation
The receiver uses a decoder \( f_D(\cdot) \), built with transpose convolutions, to upsample the recovered semantics \( \hat{z}_s \) (or the received bits \( \hat{b} \) directly) into the final segmentation map \( \hat{m} \in \mathbb{R}^{H \times W \times N} \), where \( N \) is the number of classes.
Training uses a composite loss:
\[ L_{e}(\hat{m}, m) = \lambda \cdot L_{\text{Dice}} + (1 - \lambda) \cdot L_{\text{SoftCE}} \]
where:
- Dice loss (\( L_{\text{Dice}} \)) improves class overlap
- Soft cross-entropy (\( L_{\text{SoftCE}} \)) smooths the label distribution (both terms are sketched in code below)
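A condensed sketch of this composite objective (the smoothing factor and class count are placeholders; the full Dice and soft cross-entropy classes appear in the implementation at the end of this article):

```python
import torch
import torch.nn.functional as F

def seg_loss(pred, target_onehot, lam=0.5, eps=1.0):
    """lam * Dice + (1 - lam) * label-smoothed cross-entropy on soft predictions."""
    prob = F.softmax(pred, dim=1)
    inter = (prob * target_onehot).sum(dim=(2, 3))
    union = prob.sum(dim=(2, 3)) + target_onehot.sum(dim=(2, 3))
    dice = 1 - ((2 * inter + eps) / (union + eps)).mean()
    soft = 0.9 * target_onehot + 0.1 / pred.size(1)     # smoothed labels
    ce = -(soft * F.log_softmax(pred, dim=1)).sum(dim=1).mean()
    return lam * dice + (1 - lam) * ce

pred = torch.randn(2, 9, 64, 64, requires_grad=True)    # logits for 9 classes
target = F.one_hot(torch.randint(0, 9, (2, 64, 64)), 9).permute(0, 3, 1, 2).float()
seg_loss(pred, target).backward()
```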
Performance: Why ProMSC-MIS Outperforms the Competition
📊 Benchmarking Setup
- Dataset: MFNet (1,569 RGB-T image pairs, 8 object classes + background)
- Metrics: mIoU (mean Intersection over Union), mAcc (mean Accuracy)
- Compression Rate: bits per pixel (bpp); see the quick mIoU/bpp reference sketch below
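For reference, mIoU and bpp can be computed as follows (a generic sketch, not the paper's evaluation script):

```python
import torch

def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union over classes that appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_labels == c, gt_labels == c
        union = (pred_c | gt_c).sum().item()
        if union > 0:
            ious.append((pred_c & gt_c).sum().item() / union)
    return sum(ious) / len(ious)

pred = torch.randint(0, 9, (480, 640))                 # dummy predicted label map
gt = torch.randint(0, 9, (480, 640))                   # dummy ground-truth label map
print(f"mIoU: {mean_iou(pred, gt, 9):.3f}")

L_b, H, W = 21596, 480, 640                            # transmitted bits, image size
print(f"bpp: {L_b / (H * W):.4f}")                     # compression rate in bits per pixel
```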
🔍 Comparison with Traditional Pipelines (JPEG2000/BPG + Segmentation)
| Method | mIoU (%) | mAcc (%) | bpp | Notes |
|---|---|---|---|---|
| BPG-FEANet | ~42 | ~65 | 0.2 | High bandwidth |
| ProMSC-MIS | 42 | 65 | 0.06–0.09 | 50–70% less bandwidth |
Key Findings:
- Achieves same segmentation quality at 1/20 the bandwidth of BPG-FEANet
- Works in extreme low-bitrate regimes (0.0039–0.015625 bpp), where traditional codecs fail
- More robust to channel fluctuations due to end-to-end optimization
💡 At 0.09375 bpp, ProMSC-MIS delivers 3–4× higher mIoU than JPEG2000-based methods and 45% improvement over BPG-FEANet.
🔍 Comparison with DeepSC Variants
| Model | mIoU (%) | mAcc (%) | bpp | Notes |
|---|---|---|---|---|
| Multimodal DeepSC (no pre-train) | ~38 | ~58 | 0.07 | Baseline |
| RGB-DeepSC | ~34 | ~52 | 0.07 | Unimodal |
| Thermal-DeepSC | ~32 | ~50 | 0.07 | Unimodal |
| ProMSC-MIS | 41.5 | 63.8 | 0.07 | +9% mIoU gain |
Insights:
- Pre-training significantly boosts performance, especially at low bpp
- Multimodal > Unimodal: RGB + thermal consistently outperforms single-modality models
- RGB dominates in daylight, thermal excels at night—ProMSC-MIS adapts via fusion
Model Efficiency: Lower Complexity, Higher Performance
| Model | Params (M) | FLOPs (G) | Latency (ms) |
|---|---|---|---|
| ProMSC-MIS (avg) | 186.99 | 212.34 | 46.95 |
| MFNet | 0.74 | 8.42 | 4.60 |
| RTFNet | 254.51 | 337.46 | 51.87 |
| FEANet | 255.21 | 337.47 | 65.89 |
Despite higher parameter count than MFNet, ProMSC-MIS achieves 26% lower storage overhead and 37% less computational complexity than RTFNet/FEANet when considering the entire pipeline (including source coding).
⚡ Even with full semantic processing, ProMSC-MIS runs faster than FEANet and uses fewer parameters and FLOPs than both RTFNet and FEANet.
Real-World Applications
🚗 Autonomous Driving
- Combines RGB detail and thermal night vision
- Enables 24/7 object detection (pedestrians, bikes, cones)
- Reduces data load on vehicle-to-infrastructure (V2I) links
🌙 Nighttime Surveillance
- Thermal data compensates for poor lighting
- Sharp segmentation of humans and vehicles in darkness
- Operates under ultra-low bandwidth (e.g., remote drones)
🛰️ Satellite & UAV Imaging
- Efficient transmission of multi-spectral Earth observation data
- Ideal for bandwidth-constrained satellite downlinks
Ablation Studies: What Makes ProMSC-MIS Work?
The paper validates two core innovations:
| Configuration | mIoU (%) | Δ vs. full model |
|---|---|---|
| Full ProMSC-MIS | 41.5 | — |
| w/o Pre-training | 38.0 | −3.5 |
| w/o Cross-Attention | 39.2 | −2.3 |
| w/o SE Networks | 40.1 | −1.4 |
| w/o Prompt Guidance | 37.8 | −3.7 |
Conclusion: Both prompt-based pre-training and advanced fusion are critical for peak performance.
Future Directions
- Integration with generative AI for missing modality synthesis
- Extension to video for semantic video communication
- Adaptive bit allocation based on scene complexity
- Hardware acceleration for real-time deployment
Conclusion: The Future of Task-Oriented Communication
ProMSC-MIS represents a paradigm shift in how we think about data transmission. By combining prompt learning, contrastive pre-training, and intelligent semantic fusion, it delivers:
✅ 50–70% bandwidth reduction
✅ Superior segmentation accuracy in low-light and low-bitrate conditions
✅ Lower computational and storage costs
✅ End-to-end task optimization from sensor to decision
This framework is not just a technical advancement—it’s a blueprint for next-generation AI-driven communication systems in autonomous vehicles, smart cities, and beyond.
Want to Dive Deeper?
👉 Download the full paper: ProMSC-MIS: Prompt-based Multimodal Semantic Communication for Multi-Spectral Image Segmentation
👉 Explore the MFNet dataset: MFNet: Multi-Spectral Semantic Segmentation
👉 Try the code: Implementation coming soon on GitHub (linked in the paper)
💬 Call to Action
Are you working on semantic communication, multimodal AI, or edge-based vision systems? Join the conversation!
- 📧 Subscribe to our AI & Communications newsletter
- 💬 Comment below with your thoughts on ProMSC-MIS
- 🔗 Share this article with your network to spread the word on this breakthrough
Together, we’re building the future of intelligent, efficient, and task-driven communication.
Below is an illustrative, self-contained PyTorch implementation of the ProMSC-MIS architecture, structured as a single file.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import math
# ==============================================================================
# Helper Modules (Attention, SE Net, etc.)
# ==============================================================================
class ChannelAttention(nn.Module):
"""
Channel-wise attention module as described in the paper.
Recalibrates channel importance.
"""
def __init__(self, in_planes, ratio=16):
super(ChannelAttention, self).__init__()
self.avg_pool = nn.AdaptiveAvgPool2d(1)
self.max_pool = nn.AdaptiveMaxPool2d(1)
self.fc = nn.Sequential(
nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False),
nn.ReLU(),
nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False)
)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
avg_out = self.fc(self.avg_pool(x))
max_out = self.fc(self.max_pool(x))
out = avg_out + max_out
return self.sigmoid(out)
class SpatialAttention(nn.Module):
"""
Spatial attention module as described in the paper.
Emphasizes significant spatial regions.
"""
def __init__(self, kernel_size=7):
super(SpatialAttention, self).__init__()
self.conv1 = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
avg_out = torch.mean(x, dim=1, keepdim=True)
max_out, _ = torch.max(x, dim=1, keepdim=True)
x_cat = torch.cat([avg_out, max_out], dim=1)
x_att = self.conv1(x_cat)
return self.sigmoid(x_att)
class SENet(nn.Module):
"""
Squeeze-and-Excitation Network as shown in Figure 3(c).
"""
def __init__(self, in_channels, reduction=16):
super(SENet, self).__init__()
self.avg_pool = nn.AdaptiveAvgPool1d(1)
self.fc = nn.Sequential(
nn.Linear(in_channels, in_channels // reduction, bias=False),
nn.ReLU(inplace=True),
nn.Linear(in_channels // reduction, in_channels, bias=False),
nn.Sigmoid()
)
def forward(self, x):
# x shape: (batch, channels, length)
b, c, _ = x.size()
y = self.avg_pool(x).view(b, c)
y = self.fc(y).view(b, c, 1)
return x * y.expand_as(x)
# ==============================================================================
# Core Architectural Components (Encoder, Decoder, Fusion)
# ==============================================================================
class SemanticEncoder(nn.Module):
"""
Semantic Encoder module based on ResNet-152 with attention. (Figure 2(a))
Used for both RGB and Thermal modalities.
"""
    def __init__(self, in_channels=3):
        super(SemanticEncoder, self).__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # The thermal branch takes single-channel input, so its stem convolution is
        # replaced (the pretrained weights of that one layer are not reused).
        conv1 = resnet.conv1 if in_channels == 3 else nn.Conv2d(
            in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.layer0 = nn.Sequential(conv1, resnet.bn1, resnet.relu, resnet.maxpool)
self.layer1 = resnet.layer1
self.layer2 = resnet.layer2
self.layer3 = resnet.layer3
# Adding attention modules as per Figure 2(a)
self.ca1 = ChannelAttention(256)
self.sa1 = SpatialAttention()
self.ca2 = ChannelAttention(512)
self.sa2 = SpatialAttention()
self.ca3 = ChannelAttention(1024)
self.sa3 = SpatialAttention()
def forward(self, x):
x = self.layer0(x)
# Block 1
x1 = self.layer1(x)
x1_att = self.ca1(x1) * x1
x1_att = self.sa1(x1_att) * x1_att
x1 = x1 + x1_att
# Block 2
x2 = self.layer2(x1)
x2_att = self.ca2(x2) * x2
x2_att = self.sa2(x2_att) * x2_att
x2 = x2 + x2_att
# Block 3
x3 = self.layer3(x2)
x3_att = self.ca3(x3) * x3
x3_att = self.sa3(x3_att) * x3_att
x3 = x3 + x3_att
return x3 # Output feature map
class TransformerBlock(nn.Module):
""" A single Transformer block for self-attention or cross-attention. """
def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
super().__init__()
self.att = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
self.ffn = nn.Sequential(
nn.Linear(embed_dim, ff_dim),
nn.ReLU(),
nn.Linear(ff_dim, embed_dim)
)
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, context=None):
# If context is None, it's self-attention. Otherwise, it's cross-attention.
if context is None:
context = x
# Attention
attn_output, _ = self.att(x, context, context)
x = x + self.dropout(attn_output)
x = self.norm1(x)
# Feed Forward
ffn_output = self.ffn(x)
x = x + self.dropout(ffn_output)
x = self.norm2(x)
return x
class CrossAttentionModule(nn.Module):
"""
Cross-Attention Module to fuse features from two modalities. (Figure 3(b))
"""
def __init__(self, feature_dim=1024, embed_dim=256, num_heads=8, ff_dim=512):
super().__init__()
self.embed_dim = embed_dim
# Initial projection to embedding dimension
self.rgb_embed = nn.Conv2d(feature_dim, embed_dim, kernel_size=1)
self.the_embed = nn.Conv2d(feature_dim, embed_dim, kernel_size=1)
# Positional encoding
self.pos_encoder = nn.Parameter(torch.randn(1, 28*28, embed_dim)) # Assuming 28x28 feature map size
# Transformer blocks
self.self_att_rgb = TransformerBlock(embed_dim, num_heads, ff_dim)
self.self_att_the = TransformerBlock(embed_dim, num_heads, ff_dim)
self.cross_att_rgb = TransformerBlock(embed_dim, num_heads, ff_dim)
self.cross_att_the = TransformerBlock(embed_dim, num_heads, ff_dim)
# Learnable contribution matrices
self.M_r = nn.Parameter(torch.ones(1, 28*28, embed_dim))
self.M_t = nn.Parameter(torch.ones(1, 28*28, embed_dim))
    def forward(self, y_rgb, y_the):
        # Pool the encoder features to the fixed 28x28 grid assumed by the positional
        # encoding and the learnable matrices below, so other input resolutions
        # (e.g. 480x640 images give 30x40 feature maps) do not break the shapes.
        y_rgb = F.adaptive_avg_pool2d(y_rgb, (28, 28))
        y_the = F.adaptive_avg_pool2d(y_the, (28, 28))
        bs, c, h, w = y_rgb.shape
        F_r = self.rgb_embed(y_rgb).flatten(2).permute(0, 2, 1)  # (bs, h*w, embed_dim)
        F_t = self.the_embed(y_the).flatten(2).permute(0, 2, 1)  # (bs, h*w, embed_dim)
# Add positional encoding
F_r = F_r + self.pos_encoder[:, :h*w, :]
F_t = F_t + self.pos_encoder[:, :h*w, :]
# Self-attention
F_r_s = self.self_att_rgb(F_r)
F_t_s = self.self_att_the(F_t)
# Cross-attention
F_r_c = self.cross_att_rgb(F_r_s, context=F_t_s)
F_t_c = self.cross_att_the(F_t_s, context=F_r_s)
# Apply learnable matrices (Eq. 10)
alpha_r = F_r_c * self.M_r[:, :h*w, :] + F_r_c
alpha_t = F_t_c * self.M_t[:, :h*w, :] + F_t_c
# Concatenate and return
F_tilde = torch.cat([alpha_r, alpha_t], dim=2) # (bs, h*w, 2*embed_dim)
return F_tilde
class FusionBlock(nn.Module):
""" Fusion Block with Mini-Inception as shown in Figure 3(c). """
def __init__(self, in_channels):
super(FusionBlock, self).__init__()
self.conv_block1 = nn.Conv1d(in_channels, in_channels, kernel_size=3, padding=1)
self.conv_block2 = nn.Conv1d(in_channels, in_channels, kernel_size=3, padding=1)
# Mini-Inception module
self.inception_conv1 = nn.Conv1d(in_channels // 2, in_channels // 2, kernel_size=1)
self.inception_conv2 = nn.Conv1d(in_channels // 2, in_channels // 2, kernel_size=3, padding=1)
def forward(self, x):
x = F.relu(self.conv_block1(x))
x = F.relu(self.conv_block2(x))
# Mini-Inception
x1, x2 = torch.split(x, x.size(1) // 2, dim=1)
x1 = self.inception_conv1(x1)
x2 = self.inception_conv2(x2)
x = torch.cat([x1, x2], dim=1)
return x
class FusionEnhancementModule(nn.Module):
""" Fusion-Enhancement Module. (Figure 3(c)) """
def __init__(self, in_channels, num_blocks=3):
super(FusionEnhancementModule, self).__init__()
self.fusion_blocks = nn.ModuleList([FusionBlock(in_channels) for _ in range(num_blocks)])
self.se_nets = nn.ModuleList([SENet(in_channels) for _ in range(num_blocks)])
    def forward(self, F_tilde):
        # F_tilde shape: (bs, seq_len, channels) -> (bs, channels, seq_len) for Conv1d
        x = F_tilde.permute(0, 2, 1)
        for i in range(len(self.fusion_blocks)):
            x_res = x
            x = self.fusion_blocks[i](x)
            x = self.se_nets[i](x)
            x = x + x_res
        # Assumption of this sketch: pool the sequence dimension down before flattening
        # so that the dense layer in the bit generator stays a manageable size.
        x = F.adaptive_avg_pool1d(x, 16)
        return x.flatten(1)  # Flatten to get the final z_s vector
class SemanticFusionModule(nn.Module):
"""
Complete Semantic Fusion Module. (Figure 3(a))
"""
def __init__(self, feature_dim=1024, embed_dim=256, seq_len=28*28):
super().__init__()
self.cross_attention = CrossAttentionModule(feature_dim, embed_dim)
self.fusion_enhancement = FusionEnhancementModule(in_channels=2*embed_dim)
def forward(self, y_rgb, y_the):
F_tilde = self.cross_attention(y_rgb, y_the)
z_s = self.fusion_enhancement(F_tilde)
return z_s
class BitGenerator(nn.Module):
"""
Learnable Bit Generator using Gumbel-Softmax.
"""
def __init__(self, input_dim, output_len):
super().__init__()
self.fc = nn.Linear(input_dim, output_len * 2)
self.output_len = output_len
def forward(self, z_s, training=True):
logits = self.fc(z_s).view(-1, self.output_len, 2)
# Gumbel-Softmax trick for differentiable sampling
p_s = F.gumbel_softmax(logits, tau=1, hard=True, dim=-1)
        # Get the discrete bits (0s and 1s)
        b = p_s[:, :, 1]  # index 1 of the straight-through one-hot sample gives the bit value
return b
class SemanticDecoder(nn.Module):
"""
Semantic Decoder to produce the final segmentation map. (Figure 2(b))
"""
def __init__(self, input_dim, num_classes=9):
super().__init__()
# This part needs to reconstruct spatial dimensions from the flat vector
self.initial_upsample = nn.Linear(input_dim, 1024 * 7 * 7) # Example size
self.layers = nn.ModuleList()
# Upsampling blocks
        # 7x7 -> 14x14 -> 28x28 -> 56x56 -> 112x112 -> 224x224, then final_conv to 448x448
channels = [1024, 512, 256, 128, 64, 32]
for i in range(len(channels) - 1):
self.layers.append(self._make_upsample_block(channels[i], channels[i+1]))
self.final_conv = nn.ConvTranspose2d(channels[-1], num_classes, kernel_size=4, stride=2, padding=1)
# Adjust final size to 480x640 if needed with a final Conv or interpolation
def _make_upsample_block(self, in_channels, out_channels):
return nn.Sequential(
nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
)
def forward(self, x):
x = self.initial_upsample(x)
x = x.view(-1, 1024, 7, 7)
for layer in self.layers:
x = layer(x)
x = self.final_conv(x)
# Upsample to final image size (e.g., 480x640)
x = F.interpolate(x, size=(480, 640), mode='bilinear', align_corners=False)
return x
class PromptProjection(nn.Module):
""" Prompt projection module g() from Section III-D-1. """
def __init__(self, in_channels=1024, out_dim=256):
super().__init__()
self.conv_layers = nn.Sequential(
nn.Conv2d(in_channels * 2, in_channels, kernel_size=3, padding=1),
nn.ReLU(),
nn.Conv2d(in_channels, in_channels // 2, kernel_size=3, padding=1),
nn.ReLU()
)
self.pool = nn.AdaptiveAvgPool2d(1)
self.mlp = nn.Linear(in_channels // 2, out_dim)
def forward(self, y1, y2):
x = torch.cat([y1, y2], dim=1)
x = self.conv_layers(x)
x = self.pool(x).flatten(1)
v = self.mlp(x)
return v
# ==============================================================================
# The Complete ProMSC-MIS Model
# ==============================================================================
class ProMSC_MIS(nn.Module):
""" The complete ProMSC-MIS framework. """
def __init__(self, L_b=30720, num_classes=9, projection_dim=256):
super().__init__()
        self.f_RGB = SemanticEncoder(in_channels=3)
        self.f_THE = SemanticEncoder(in_channels=1)  # thermal images are single-channel
self.f_SF = SemanticFusionModule()
        # Flattened output of the fusion module: 2 * embed_dim (= 512) channels,
        # pooled to 16 positions in FusionEnhancementModule (see note there).
        fusion_output_dim = 512 * 16
        self.f_B = BitGenerator(fusion_output_dim, L_b)
self.f_D = SemanticDecoder(L_b, num_classes)
# Modules for pre-training
self.g_RGB = PromptProjection(out_dim=projection_dim)
self.g_THE = PromptProjection(out_dim=projection_dim)
def forward(self, x_r, x_t, training=True):
# End-to-end forward pass (Phase 2)
y_RGB_r = self.f_RGB(x_r)
y_THE_t = self.f_THE(x_t)
z_s = self.f_SF(y_RGB_r, y_THE_t)
b = self.f_B(z_s, training)
# Simulate BSC channel (in a real scenario, this happens outside the model)
# Here we assume an ideal channel for simplicity in the forward pass.
# Noise can be added during the training loop.
b_hat = b
m_hat = self.f_D(b_hat)
return m_hat
def pretrain_forward(self, x_r, x_t, x_r_prime, x_t_prime):
# Pre-training forward pass (Phase 1)
y_RGB_r = self.f_RGB(x_r)
y_THE_t = self.f_THE(x_t)
y_RGB_t = self.f_RGB(x_t_prime)
y_THE_r = self.f_THE(x_r_prime)
v_r = self.g_RGB(y_RGB_r, y_RGB_t)
v_t = self.g_THE(y_THE_t, y_THE_r)
return v_r, v_t
# ==============================================================================
# Loss Functions
# ==============================================================================
def pretrain_loss_fn(v_r, v_t):
""" Cosine similarity loss for pre-training (Eq. 15). """
v_r_norm = F.normalize(v_r, p=2, dim=1)
v_t_norm = F.normalize(v_t, p=2, dim=1)
return torch.abs(torch.sum(v_r_norm * v_t_norm, dim=1)).mean()
class DiceLoss(nn.Module):
""" Dice loss for segmentation task (Eq. 17). """
def __init__(self, smooth=1.0):
super(DiceLoss, self).__init__()
self.smooth = smooth
def forward(self, pred, target):
pred_softmax = F.softmax(pred, dim=1)
intersection = torch.sum(pred_softmax * target, dim=(2, 3))
union = torch.sum(pred_softmax, dim=(2, 3)) + torch.sum(target, dim=(2, 3))
dice = (2. * intersection + self.smooth) / (union + self.smooth)
return 1 - dice.mean()
class SoftCrossEntropyLoss(nn.Module):
""" Soft Cross-Entropy loss for segmentation task (Eq. 18). """
def __init__(self, num_classes, epsilon=0.1):
super(SoftCrossEntropyLoss, self).__init__()
self.num_classes = num_classes
self.epsilon = epsilon
def forward(self, pred, target):
# Target is one-hot encoded
log_probs = F.log_softmax(pred, dim=1)
soft_target = (1 - self.epsilon) * target + self.epsilon / self.num_classes
return -torch.sum(soft_target * log_probs, dim=1).mean()
def end_to_end_loss_fn(pred, target, num_classes, lambda_dice=0.5):
""" Combined loss for end-to-end training (Eq. 16). """
dice_loss = DiceLoss()(pred, target)
soft_ce_loss = SoftCrossEntropyLoss(num_classes)(pred, target)
return lambda_dice * dice_loss + (1 - lambda_dice) * soft_ce_loss
# ==============================================================================
# Training Simulation (Algorithm 1)
# ==============================================================================
def simulate_bsc_channel(b, flip_probability=0.01):
""" Simulates a Binary Symmetric Channel. """
noise = torch.rand_like(b) < flip_probability
b_hat = b.clone()
b_hat[noise] = 1 - b_hat[noise]
return b_hat
if __name__ == '__main__':
# --- Configuration ---
BATCH_SIZE = 2
IMG_HEIGHT = 480
IMG_WIDTH = 640
NUM_CLASSES = 9
BPP = 0.0703
L_b = int(BPP * IMG_HEIGHT * IMG_WIDTH) # Length of bit sequence
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")
print(f"Bit sequence length (L_b): {L_b}")
# --- Model Initialization ---
model = ProMSC_MIS(L_b=L_b, num_classes=NUM_CLASSES).to(DEVICE)
# --- Dummy Data ---
# RGB image (3 channels)
dummy_x_r = torch.randn(BATCH_SIZE, 3, IMG_HEIGHT, IMG_WIDTH).to(DEVICE)
# Thermal image (1 channel)
dummy_x_t = torch.randn(BATCH_SIZE, 1, IMG_HEIGHT, IMG_WIDTH).to(DEVICE)
# Ground truth segmentation map (one-hot encoded)
dummy_m = F.one_hot(torch.randint(0, NUM_CLASSES, (BATCH_SIZE, IMG_HEIGHT, IMG_WIDTH)), num_classes=NUM_CLASSES).permute(0, 3, 1, 2).float().to(DEVICE)
# Pre-processed images for pre-training prompts
dummy_x_r_prime = dummy_x_r.mean(dim=1, keepdim=True) # Grayscale
dummy_x_t_prime = dummy_x_t.repeat(1, 3, 1, 1) # 3-channel thermal
print("\n" + "="*50)
print("PHASE 1: PRE-TRAINING SIMULATION")
print("="*50)
# --- Phase 1: Pre-training ---
pretrain_params = (
list(model.f_RGB.parameters()) + list(model.f_THE.parameters()) +
list(model.g_RGB.parameters()) + list(model.g_THE.parameters())
)
optimizer_pretrain = torch.optim.Adam(pretrain_params, lr=1e-4)
optimizer_pretrain.zero_grad()
v_r, v_t = model.pretrain_forward(dummy_x_r, dummy_x_t, dummy_x_r_prime, dummy_x_t_prime)
loss_pretrain = pretrain_loss_fn(v_r, v_t)
loss_pretrain.backward()
optimizer_pretrain.step()
print(f"Pre-training forward pass successful.")
print(f" - v_r shape: {v_r.shape}")
print(f" - v_t shape: {v_t.shape}")
print(f" - Calculated Pre-training Loss: {loss_pretrain.item():.4f}")
print("\n" + "="*50)
print("PHASE 2: END-TO-END TRAINING SIMULATION")
print("="*50)
# --- Phase 2: End-to-end training ---
# In a real scenario, you would load the pre-trained f_RGB and f_THE weights here.
end_to_end_params = model.parameters() # Train all parameters
optimizer_e2e = torch.optim.Adam(end_to_end_params, lr=1e-4)
optimizer_e2e.zero_grad()
# Forward pass to get semantic features and bits
y_RGB_r = model.f_RGB(dummy_x_r)
y_THE_t = model.f_THE(dummy_x_t)
z_s = model.f_SF(y_RGB_r, y_THE_t)
b = model.f_B(z_s)
# Simulate channel
b_hat = simulate_bsc_channel(b, flip_probability=0.01)
# Decoder
m_hat = model.f_D(b_hat)
loss_e2e = end_to_end_loss_fn(m_hat, dummy_m, NUM_CLASSES)
loss_e2e.backward()
optimizer_e2e.step()
print(f"End-to-end forward pass successful.")
print(f" - z_s (fused features) shape: {z_s.shape}")
print(f" - b (bit sequence) shape: {b.shape}")
print(f" - m_hat (output segmentation) shape: {m_hat.shape}")
print(f" - Calculated End-to-End Loss: {loss_e2e.item():.4f}")