HiPerformer: A New Benchmark in Medical Image Segmentation with Modular Hierarchical Fusion

Introduction: The Critical Need for Precision in Medical Imaging

In the high-stakes world of medical diagnostics, a pixel can make all the difference. Precise image segmentation—the process of outlining and identifying specific organs, tissues, or lesions in a medical scan—is the cornerstone of modern diagnosis and treatment planning. It allows clinicians to accurately assess tumor volume, measure organ size, and plan surgical or radiation therapy with millimeter precision.

However, this task is fraught with challenges. Anatomical structures are complex and often exhibit subtle morphological differences. Lesions can be tiny, with significantly fewer pixels than surrounding healthy tissue, making them easy to overlook. Furthermore, medical images are often corrupted by noise from scanning artifacts or surrounding tissues, leading to potential misdiagnosis.

While deep learning has dramatically improved segmentation accuracy, traditional models have struggled to perfectly balance two crucial types of information: local details and global context. This is where a groundbreaking new model, HiPerformer, enters the scene, setting a new standard for accuracy and robustness.

The Fundamental Challenge: Local Details vs. Global Context

To understand HiPerformer’s innovation, we must first grasp the core limitation of existing segmentation models.

  • Convolutional Neural Networks (CNNs): Models like U-Net are excellent at capturing local details—edges, textures, and fine-grained patterns—through the sliding window of their convolutional kernels. However, their “receptive field” is limited, meaning they struggle to understand the broader context of an image. It’s like examining a tapestry with a magnifying glass; you see the individual threads perfectly but miss the overall picture.
  • Transformers: Inspired by language models, Transformers use a self-attention mechanism that can model long-range dependencies across an entire image. They excel at understanding global context but can be less sensitive to the fine local details that are critical for defining boundaries in medical images. Using the same analogy, this is like viewing the tapestry from a distance; you see the whole scene but lose the intricate threadwork.

Previous hybrid models attempted to combine CNN and Transformer strengths but often used simplistic fusion techniques like stacking them in series or adding their outputs together. These methods fail to resolve the inherent inconsistencies between the two feature types, leading to information conflict, loss, and suboptimal performance.
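
To make this limitation concrete, here is a minimal illustrative sketch (our own, not code from the paper) of the kind of naive concatenation-based fusion used by earlier hybrids; HiPerformer's LGFF module, described below, is designed to replace exactly this step:

import torch
import torch.nn as nn

class NaiveFusion(nn.Module):
    """Simple hybrid fusion: concatenate CNN and Transformer features, then project."""
    def __init__(self, cnn_channels, transformer_channels, out_channels):
        super().__init__()
        self.project = nn.Conv2d(cnn_channels + transformer_channels, out_channels, kernel_size=1)

    def forward(self, cnn_feat, transformer_feat):
        # Both inputs are assumed to share the same spatial resolution (B, C, H, W);
        # nothing here reconciles the statistical differences between the two feature types.
        return self.project(torch.cat([cnn_feat, transformer_feat], dim=1))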

Table 1: Comparison of Architectural Paradigms in Medical Image Segmentation

| Model Type | Core Strength | Primary Weakness | Example Models |
| --- | --- | --- | --- |
| Pure CNN | Excellent at capturing local details and textures. | Limited receptive field; poor at modeling long-range dependencies. | U-Net, Attention U-Net |
| Pure Transformer | Superior global context and long-range dependency modeling. | Can be computationally heavy; may overlook fine-grained local details. | Swin-UNet |
| Simple Hybrid | Attempts to leverage both local and global features. | Prone to feature inconsistency and information loss with basic fusion (e.g., concatenation, addition). | TransUNet, FAT-Net |
| Advanced Hybrid (HiPerformer) | Deep, hierarchical fusion that preserves both information types and mitigates conflict. | Higher model complexity and parameter count. | HiPerformer |

HiPerformer: An Architectural Breakthrough

HiPerformer addresses these shortcomings head-on with a novel, thoughtfully designed U-shaped network architecture. Its core innovation lies in a modular hierarchical strategy that enables deep, dynamic, and parallel integration of multi-source features.

The Modular Hierarchical Encoder: Independent yet Intertwined

The encoder of HiPerformer is its engineering masterpiece. Instead of forcing features together, it employs three parallel branches that work in concert:

  1. The Local Branch: Dedicated to extracting fine-grained details. It uses a custom DuChResBlock that integrates both standard and dilated convolutions, preserving local features while effectively expanding the receptive field.
  2. The Global Branch: Tasked with capturing long-range, global contextual information. It is built using Swin Transformer blocks, which efficiently compute self-attention within shifted windows, making global modeling computationally feasible.
  3. The Fusion Branch: This is the linchpin. At each stage, a dedicated Local-Global Feature Fusion (LGFF) module intelligently integrates outputs from the Local branch, the Global branch, and the fused features from the previous layer.

This hierarchical design ensures that information propagates effectively across layers, preventing the feature degradation common in simple stacked architectures. Each branch retains its independent modeling capability, avoiding the loss of crucial information during fusion.
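
Schematically, one encoder stage can be written as the following sketch (an illustrative pseudo-structure, not the paper's exact code; the full implementation appears at the end of this post):

def encoder_stage(local_block, swin_block, lgff_block, l_prev, g_prev, f_prev):
    # L_i, G_i and F_i denote the local, global and fused features at stage i.
    l_i = local_block(l_prev)           # DuChResBlock: fine-grained local features
    g_i = swin_block(g_prev)            # Swin Transformer blocks: global context
    f_i = lgff_block(l_i, g_i, f_prev)  # LGFF: fuse both with the previous stage's fusion
    return l_i, g_i, f_i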

Table 2: Technical Breakdown of HiPerformer’s Core Modules

| Module | Primary Function | Key Components & Mechanisms | Outcome |
| --- | --- | --- | --- |
| DuChResBlock (Local Branch) | Fine-grained feature extraction. | Dual-channel architecture with standard + dilated convolutions; residual connections. | Expands the receptive field while preserving local detail. |
| Swin Transformer (Global Branch) | Global context modeling. | Window-based Multi-Head Self-Attention (W-MSA); shifted window partitioning. | Captures long-range dependencies efficiently. |
| LGFF Module (Fusion Branch) | Hierarchical feature integration. | Adaptive Channel Interaction (ACI); Spatial Perception Enhancement (SPE); Inverted Residual MLP (IRMLP). | Resolves feature inconsistency for a comprehensive representation. |
| PPA Module (Skip Connection) | Multi-scale feature aggregation & noise suppression. | Progressive Multiplicative Integration (PMI); Pyramid Gated Attention (PGA). | Bridges the semantic gap and suppresses background noise. |

The Core Innovation: The LGFF Module

The Local-Global Feature Fusion (LGFF) module is the engine of HiPerformer’s encoder, designed for precise and efficient integration. It comprises several sophisticated sub-modules:

  • Adaptive Channel Interaction (ACI): Processes global features from the Transformer branch. It dynamically models dependencies between channels, strengthening those relevant to the target anatomy and improving the discriminative power of the semantic representation. The operation can be represented as:

\[ ACI(x) = x + R\!\left( \text{Softmax}\!\big( R(x) \cdot R^{T}(x) \big) \cdot R(x) \right) \]

where R is a reshape operation and T is a transpose operation.

  • Spatial Perception Enhancement (SPE): Enhances local features from the CNN branch. Using large 7×7 convolutional kernels, it focuses on key spatial areas and suppresses irrelevant background interference, sharpening the model’s perception of fine details.
\[ \text{SPE}(x) = x \cdot \sigma\big(\text{Conv}_{7\times7}(\text{Conv}_{7\times7}(x))\big) \]
  • Inverted Residual MLP (IRMLP): Acts as a powerful feature extractor within the LGFF. It uses an inverted residual structure, performing non-linear operations in a high-dimensional space to avoid information loss, thereby mitigating issues like gradient vanishing in deep networks.
\[ \text{IRMLP}(x) = \text{Conv}_{1\times1}\Big( \text{Conv}_{1\times1}\big( \text{DWConv}_{3\times3}(x) + x \big) \Big) \]

The overall LGFF process seamlessly combines these elements to produce a comprehensive and refined feature representation that is greater than the sum of its parts.
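
As a quick sanity check on the ACI formula above, the following shape walk-through (our own sketch; the full module appears in the reference code below) shows that the attention is computed between channels, yielding a C×C affinity matrix:

import torch

B, C, H, W = 2, 96, 56, 56
x = torch.randn(B, C, H, W)
r_x = x.view(B, C, H * W)                                # R(x): (B, C, HW)
attn = torch.softmax(r_x @ r_x.transpose(1, 2), dim=-1)  # (B, C, C) channel affinities
context = (attn @ r_x).view(B, C, H, W)                  # re-weighted channels, reshaped back
aci_out = x + context                                    # residual term from the formula
print(aci_out.shape)                                     # torch.Size([2, 96, 56, 56])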

Replacing Skip Connections: The Progressive Pyramid Aggregation (PPA)

In classic U-Net architectures, “skip connections” shuttle high-resolution features from the encoder to the decoder to recover spatial details. However, they can also transmit background noise and suffer from a “semantic gap” between encoder and decoder features.

HiPerformer replaces these with a novel Progressive Pyramid Aggregation (PPA) module, which serves two key functions:

  • Progressive Multiplicative Integration (PMI): This sub-module fuses deep and shallow features through cascading multiplication rather than addition or concatenation. This clever approach amplifies the differences between informative regions and noisy background, effectively suppressing the latter. The process for a four-level encoder is defined as:

\[ y_i = \begin{cases} f(x_i) \cdot f\big(\text{Up}(y_{i+1})\big), & i = 1, 2, 3 \\ x_i, & i = 4 \end{cases} \]

where f(⋅) denotes convolutional layers for feature enhancement and Up denotes upsampling (a minimal sketch of this cascade follows the list below).

  • Pyramid Gated Attention (PGA): This sub-module combines an Enhanced Attention Gating (EAG) mechanism with a Pyramid Split Attention (PSA) block. The EAG intelligently merges semantic features from the bridge and decoder layers, using a gating mechanism to suppress irrelevant regions. The resulting features are then processed by the PSA block, which captures multi-scale spatial information and long-range dependencies to further emphasize key features.
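
The PMI cascade can be sketched as follows (our own illustration, assuming each level has already been projected to a common channel width so that a single enhancement function f applies everywhere):

import torch.nn.functional as F

def pmi_cascade(features, f):
    # features: [x1, x2, x3, x4], ordered shallow -> deep; f: feature-enhancement conv
    y4 = features[3]                                   # i = 4: the deepest level passes through
    ys = [y4]
    for x in reversed(features[:3]):                   # i = 3, 2, 1
        up = F.interpolate(ys[-1], size=x.shape[2:], mode='bilinear', align_corners=True)
        ys.append(f(x) * f(up))                        # cascading multiplication suppresses noise
    return list(reversed(ys))                          # [y1, y2, y3, y4]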


Rigorous Validation: Outperforming the State-of-the-Art

The HiPerformer model was subjected to exhaustive testing on eleven public medical image datasets, covering a wide range of challenges from abdominal organ segmentation to retinal vessel analysis.

Table 3: Overview of Experimental Datasets and Key Challenges

| Dataset | Modality | Primary Segmentation Target | Key Challenge |
| --- | --- | --- | --- |
| Synapse | CT | 8 abdominal organs (aorta, liver, etc.) | Complex organ shapes and adjacencies. |
| ACDC | MRI | 3 cardiac structures (RV, Myo, LV) | Motion artifacts; low contrast between tissues. |
| BTCV | CT | 13 abdominal organs | Large number of organs with varying sizes. |
| AMOS | CT | 15 abdominal organs | Large-scale, multi-organ dataset with high diversity. |
| SegTHOR | CT | 4 thoracic organs at risk | Low contrast; proximity to bones and other tissues. |
| Chase | Fundus | Retinal vessels | Thin, low-contrast structures; uneven illumination. |
| STARE | Fundus | Retinal vessels | Pathologies; faint vessel terminals. |
| Refuge | Fundus | Optic disc & cup | Small target structures; need for precise boundaries. |

The following table summarizes its performance on several key datasets, demonstrating its consistent superiority.

Table 4: Performance Comparison on Key Medical Image Segmentation Datasets (Mean DSC % / HD95 in mm)

| Dataset | HiPerformer | CSWin-UNet | SwinPA-Net | TransUNet | U-Net |
| --- | --- | --- | --- | --- | --- |
| Synapse | 83.93 / 11.58 | 82.23 / 20.57 | 81.11 / 25.27 | 79.32 / 22.77 | 76.83 / 29.88 |
| ACDC | 91.98 / 1.07 | 90.85 / 1.32 | 90.12 / 1.45 | 89.71 / 1.89 | 87.51 / 3.12 |
| BTCV | 76.39 / 13.62 | 73.43 / 15.85 | 75.63 / 14.39 | 69.43 / 16.23 | 71.74 / 34.05 |
| Chase | 77.70 / 12.79 | 77.12 / 13.42 | 76.63 / 13.15 | 76.08 / 19.52 | 63.55 / 48.14 |
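
For readers less familiar with the metrics: DSC measures the overlap between predicted and ground-truth masks (higher is better), while HD95 is the 95th-percentile Hausdorff distance between their boundaries (lower is better, typically computed with dedicated medical-imaging libraries). A textbook Dice implementation, not code from the paper, looks like this:

import torch

def dice_coefficient(pred_mask, gt_mask, eps=1e-6):
    """Dice Similarity Coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, gt = pred_mask.bool(), gt_mask.bool()
    intersection = (pred & gt).sum().float()
    return float((2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps))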

Key Experimental Highlights:

  • On the Synapse dataset, HiPerformer achieved a mean Dice Similarity Coefficient (DSC) of 83.93% and a 95th-percentile Hausdorff Distance (HD95) of 11.58 mm, significantly outperforming all compared models. Visual results showed near-perfect alignment with ground truth, especially for complex organs like the pancreas and stomach.
  • On retinal datasets (Chase, STARE), where thin, low-contrast vessels are a major challenge, HiPerformer’s modular design significantly reduced breaks and omissions in the segmented vascular network. Its PPA module was particularly effective at extracting vessel terminal features.
  • Ablation studies confirmed the necessity of each component. The table below illustrates the performance drop when key components are removed, proving their individual and collective importance.

Table 5: Ablation Study Results on Synapse and MM-WHS Datasets (Mean DSC %)

| Model Configuration | Synapse DSC | MM-WHS DSC | Performance Impact |
| --- | --- | --- | --- |
| HiPerformer (Full Model) | 83.93 | 86.52 | Baseline for optimal performance. |
| Local Branch Only | 78.46 | 78.90 | Severe drop, highlighting the need for global context. |
| Global Branch Only | 82.23 | 84.07 | Significant drop, showing the need for local details. |
| Without PMI Module | 83.14 | 85.08 | Drop shows PMI's role in noise suppression. |
| Without PGA Module | 82.99 | 86.01 | Drop demonstrates PGA's importance in feature emphasis. |
| Without PPA (Both PMI & PGA) | 82.23 | 84.81 | Largest drop in this group, proving the synergy is crucial. |

Practical Implications and Future Directions

HiPerformer’s enhanced accuracy and robustness have direct, real-world applications:

  • Oncological Assessment: More precise tumor volumetry can lead to better radiation therapy planning and treatment monitoring.
  • Surgical Planning: Accurate segmentation of organs-at-risk and target structures enables safer and more effective surgical interventions.
  • Disease Progression Tracking: Improved consistency allows for more reliable tracking of changes in organ size or lesion shape over time.

Despite its powerful performance, the authors note that the model’s sophisticated design, particularly the IRMLP, results in a large number of parameters. Future work will focus on structural optimization to improve computational efficiency and parameter reduction without compromising accuracy, making it more feasible for clinical deployment.

Conclusion: A Significant Leap Forward

HiPerformer represents a paradigm shift in medical image segmentation. By moving beyond simple hybrid architectures to a sophisticated modular hierarchical fusion strategy, it successfully reconciles the long-standing tension between local detail and global context. Its innovative LGFF and PPA modules ensure that features are integrated thoughtfully and efficiently, minimizing information loss and conflict.

The model’s state-of-the-art performance across eleven diverse datasets is a testament to its robustness and generalizability. It is not merely an incremental improvement but a foundational advancement that opens new avenues for building more reliable, accurate, and clinically valuable AI diagnostic tools.


Engage With Our Community

The field of AI in medicine is evolving rapidly. What are your thoughts on the balance between model complexity and clinical utility?

  • What segmentation challenges are you facing in your work?
  • How do you see models like HiPerformer impacting clinical practice in the next 5 years?

Share your insights and questions in the comments below! Let’s foster a discussion on the future of precision medicine.

Below is a complete, end-to-end Python (PyTorch) reference implementation of the HiPerformer architecture, reconstructed from the descriptions and figures in the paper. Module boundaries and hyperparameters follow the text; where the paper leaves details ambiguous, the interpretation chosen is noted in the comments:

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from timm.models.layers import DropPath, to_2tuple, trunc_normal_

# =================================================================================
# Helper Modules & Basic Building Blocks
# =================================================================================

class Mlp(nn.Module):
    """ Standard MLP Block used in Vision Transformers. """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

class DuChResBlock(nn.Module):
    """
    Dual-Channel Residual Block (DuChResBlock) for the Local Branch.
    As described in the paper (Figure 2, left side and Equations 1-3).
    It uses parallel standard and dilated convolutions to capture multi-scale local features.
    """
    def __init__(self, in_channels, out_channels, dilation_rate=2):
        super().__init__()
        # Standard convolution path
        self.conv_std = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )
        
        # Dilated convolution path
        self.conv_dilated = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=dilation_rate, dilation=dilation_rate, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=dilation_rate, dilation=dilation_rate, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

        # 1x1 convolution to merge the concatenated features
        self.conv_merge = nn.Conv2d(out_channels * 2, out_channels, kernel_size=1, bias=False)
        self.bn_merge = nn.BatchNorm2d(out_channels)
        
        # Residual connection handling
        self.residual_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False) if in_channels != out_channels else nn.Identity()

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.residual_conv(x)
        
        out_std = self.conv_std(x)
        out_dilated = self.conv_dilated(x)
        
        out_concat = torch.cat([out_std, out_dilated], dim=1)
        out_merged = self.bn_merge(self.conv_merge(out_concat))
        
        output = self.relu(out_merged + residual)
        return output

# =================================================================================
# LGFF Module and its Sub-components (Figure 3)
# =================================================================================

class ACI(nn.Module):
    """
    Adaptive Channel Interaction (ACI) module.
    As described in Figure 3b and Equation 8.
    Models inter-channel dependencies for global features.
    """
    def __init__(self, in_channels=None):
        super().__init__()
        # Attention is computed directly over channels, so no learnable projection is
        # required; `in_channels` is accepted only for interface consistency with SPE.
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        reshaped_x = x.view(b, c, h * w)
        
        # R(x) @ R(x).T
        attention = torch.bmm(reshaped_x, reshaped_x.transpose(1, 2))
        attention = self.softmax(attention)
        
        # Softmax(...) @ R(x)
        context = torch.bmm(attention, reshaped_x)
        context = context.view(b, c, h, w)
        
        # x + R(...)
        return x + context

class SPE(nn.Module):
    """
    Spatial Perception Enhancement (SPE) module.
    As described in Figure 3c and Equation 9.
    Enhances local feature representation using large-kernel convolutions.
    """
    def __init__(self, in_channels, reduction_ratio=16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, in_channels // reduction_ratio, kernel_size=7, padding=3)
        self.conv2 = nn.Conv2d(in_channels // reduction_ratio, in_channels, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # The equation shows Conv7x7(Conv7x7(x)), which implies a bottleneck structure.
        attn = self.conv1(x)
        attn = self.conv2(attn)
        attn = self.sigmoid(attn)
        return x * attn

class IRMLP(nn.Module):
    """
    Inverted Residual Multilayer Perceptron (IRMLP).
    As described in Figure 3d and Equation 10.
    """
    def __init__(self, in_channels, expansion_factor=4):
        super().__init__()
        hidden_dim = in_channels * expansion_factor
        self.conv1 = nn.Conv2d(in_channels, hidden_dim, 1)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.conv2 = nn.Conv2d(hidden_dim, in_channels, 1)
        self.act = nn.GELU()

    def forward(self, x):
        # Eq. 10 writes Conv1x1(Conv1x1(DWConv3x3(x) + x)), while Figure 3d depicts a
        # standard inverted-residual structure (expand -> depthwise conv -> project,
        # with a skip connection from the block input). We follow the figure here.
        
        residual = x
        x = self.act(self.dwconv(self.conv1(x)))
        x = self.conv2(x)
        return x + residual


class LGFF(nn.Module):
    """
    Local and Global Feature Fusion (LGFF) module.
    As described in Figure 3a and Equations 11-13.
    """
    def __init__(self, ch_li, ch_gi, ch_fi_minus_1, ch_out):
        super().__init__()
        # Processing for F_{i-1}
        self.conv_f_pre = nn.Conv2d(ch_fi_minus_1, ch_out, 1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)

        # Concat and create F_mid2
        self.conv_mid2 = nn.Conv2d(ch_li + ch_gi + ch_out, ch_out, 1)

        # Sub-modules
        self.spe = SPE(ch_li)
        self.aci = ACI(ch_gi)
        
        # The concatenation fed to the IRMLP has ch_li + ch_out + ch_gi channels
        concat_ch = ch_li + ch_out + ch_gi
        self.irmlp = IRMLP(concat_ch)
        self.conv_final_fuse = nn.Conv2d(concat_ch, ch_out, 1)  # project back to ch_out after IRMLP

    def forward(self, li, gi, fi_minus_1):
        # Equation 11: F_mid1 = Avgpool(Conv1x1(F_{i-1}))
        f_mid1 = self.avgpool(self.conv_f_pre(fi_minus_1))
        f_mid1 = F.interpolate(f_mid1, size=li.shape[2:], mode='bilinear', align_corners=False)

        # Equation 12: F_mid2 = Conv1x1(Concat[L_i, F_mid1, G_i])
        f_mid2 = self.conv_mid2(torch.cat([li, gi, f_mid1], dim=1))

        # Enhance local and global features
        spe_li = self.spe(li)
        aci_gi = self.aci(gi)
        
        # Equation 13: F_i = IRMLP(Concat[SPE(L_i), F_mid2, ACI(G_i)]) + F_mid1
        # The paper has a potential typo, swapping ACI/SPE targets. Correcting based on text.
        irmlp_in = torch.cat([spe_li, f_mid2, aci_gi], dim=1)
        
        # The IRMLP preserves its input width (ch_li + ch_out + ch_gi), so a final 1x1
        # convolution projects the result back to ch_out before adding F_mid1.
        irmlp_out = self.irmlp(irmlp_in)
        fi = self.conv_final_fuse(irmlp_out) + f_mid1
        
        return fi

# =================================================================================
# PPA Module and its Sub-components (Figure 4)
# =================================================================================

class PSA(nn.Module):
    """
    Pyramid Split Attention (PSA) block from EPSANet.
    Referenced in the paper as a component of PGA.
    """
    def __init__(self, in_channels, out_channels, kernel_sizes=[3, 5, 7, 9], stride=1):
        super().__init__()
        self.n_splits = len(kernel_sizes)
        self.split_channels = in_channels // self.n_splits
        
        self.convs = nn.ModuleList([
            nn.Conv2d(self.split_channels, self.split_channels, kernel_size=k, padding=k//2, stride=stride, groups=self.split_channels)
            for k in kernel_sizes
        ])
        
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, in_channels, kernel_size=1),
            nn.Sigmoid()
        )
        self.conv_out = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        splits = torch.split(x, self.split_channels, dim=1)
        x_out = torch.cat([conv(s) for s, conv in zip(splits, self.convs)], dim=1)
        
        attention = self.se(x_out)
        x_out = x_out * attention
        
        return self.conv_out(x_out)


class EAG(nn.Module):
    """
    Enhanced Attention Gate (EAG) module.
    As described in Figure 4b and Equations 15-17.
    """
    def __init__(self, ch_e, ch_d, ch_out):
        super().__init__()
        # Equations 16, 17
        self.conv_e = nn.Sequential(nn.Conv2d(ch_e, ch_out, 1, groups=1, bias=False), nn.BatchNorm2d(ch_out), nn.ReLU())
        self.conv_d = nn.Sequential(nn.Conv2d(ch_d, ch_out, 1, groups=1, bias=False), nn.BatchNorm2d(ch_out), nn.ReLU())
        
        # Equation 15
        self.relu = nn.ReLU()
        self.conv_psi = nn.Sequential(nn.Conv2d(ch_out, 1, 1, bias=False), nn.BatchNorm2d(1), nn.Sigmoid())
        
    def forward(self, e, d):
        # e: feature from bridging layer (PMI output)
        # d: feature from decoder
        f_e = self.conv_e(e)
        f_d = self.conv_d(d)
        
        psi_in = self.relu(f_e + f_d)
        psi = self.conv_psi(psi_in)
        
        # d + d * Sigmoid(...)
        return d + d * psi
        

class PGA(nn.Module):
    """
    Pyramid Gated Attention (PGA) module.
    As shown in Figure 4a. Combines PSA and EAG.
    """
    def __init__(self, ch_e, ch_d, ch_out):
        super().__init__()
        self.psa = PSA(ch_e, ch_e // 2)
        self.eag = EAG(ch_e, ch_d, ch_d) # ch_out of EAG is ch_d
        
        # EAG output has ch_d, PSA output has ch_e // 2
        self.conv_fuse = nn.Conv2d(ch_d + ch_e // 2, ch_out, 3, padding=1)
        self.bn_fuse = nn.BatchNorm2d(ch_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, e, d):
        psa_out = self.psa(e)
        eag_out = self.eag(e, d)
        
        # Concat along channel axis
        out = torch.cat([psa_out, eag_out], dim=1)
        out = self.relu(self.bn_fuse(self.conv_fuse(out)))
        return out


class PMI(nn.Module):
    """
    Progressive Multiplicative Integration (PMI) module.
    As described in section III-C and Equation 14.
    """
    def __init__(self, ch_shallow, ch_deep, ch_out):
        super().__init__()
        self.conv_shallow = nn.Conv2d(ch_shallow, ch_out, 3, padding=1)
        self.conv_deep = nn.Conv2d(ch_deep, ch_out, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, shallow_f, deep_f):
        deep_f_up = F.interpolate(deep_f, size=shallow_f.shape[2:], mode='bilinear', align_corners=True)
        
        f_s = self.conv_shallow(shallow_f)
        f_d = self.conv_deep(deep_f_up)
        
        # Element-wise multiplication
        out = self.relu(f_s * f_d)
        return out

# =================================================================================
# Swin Transformer Components (for Global Branch)
# =================================================================================

def window_partition(x, window_size):
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows

def window_reverse(windows, window_size, H, W):
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x

class WindowAttention(nn.Module):
    def __init__(self, dim, window_size, num_heads, qkv_bias=True, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.dim = dim
        self.window_size = window_size
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads))
        
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))
        coords_flatten = torch.flatten(coords, 1)
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()
        relative_coords[:, :, 0] += self.window_size[0] - 1
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)
        self.register_buffer("relative_position_index", relative_position_index)

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        trunc_normal_(self.relative_position_bias_table, std=.02)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, mask=None):
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        q = q * self.scale
        attn = (q @ k.transpose(-2, -1))

        relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view(
            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()
        attn = attn + relative_position_bias.unsqueeze(0)

        if mask is not None:
            nW = mask.shape[0]
            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, N, N)
            attn = self.softmax(attn)
        else:
            attn = self.softmax(attn)

        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x

class SwinTransformerBlock(nn.Module):
    def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.input_resolution = input_resolution
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio
        if min(self.input_resolution) <= self.window_size:
            self.shift_size = 0
            self.window_size = min(self.input_resolution)

        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention(
            dim, window_size=to_2tuple(self.window_size), num_heads=num_heads,
            qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)

        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

        if self.shift_size > 0:
            H, W = self.input_resolution
            img_mask = torch.zeros((1, H, W, 1))
            h_slices = (slice(0, -self.window_size),
                        slice(-self.window_size, -self.shift_size),
                        slice(-self.shift_size, None))
            w_slices = (slice(0, -self.window_size),
                        slice(-self.window_size, -self.shift_size),
                        slice(-self.shift_size, None))
            cnt = 0
            for h in h_slices:
                for w in w_slices:
                    img_mask[:, h, w, :] = cnt
                    cnt += 1

            mask_windows = window_partition(img_mask, self.window_size)
            mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
            attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
            attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
        else:
            attn_mask = None

        self.register_buffer("attn_mask", attn_mask)

    def forward(self, x):
        H, W = self.input_resolution
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"

        shortcut = x
        x = self.norm1(x)
        x = x.view(B, H, W, C)

        if self.shift_size > 0:
            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        else:
            shifted_x = x

        x_windows = window_partition(shifted_x, self.window_size)
        x_windows = x_windows.view(-1, self.window_size * self.window_size, C)

        attn_windows = self.attn(x_windows, mask=self.attn_mask)

        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
        shifted_x = window_reverse(attn_windows, self.window_size, H, W)

        if self.shift_size > 0:
            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        else:
            x = shifted_x
        x = x.view(B, H * W, C)

        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))

        return x

class PatchMerging(nn.Module):
    def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
        super().__init__()
        self.input_resolution = input_resolution
        self.dim = dim
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
        self.norm = norm_layer(4 * dim)

    def forward(self, x):
        H, W = self.input_resolution
        B, L, C = x.shape
        x = x.view(B, H, W, C)

        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], -1)
        x = x.view(B, -1, 4 * C)

        x = self.norm(x)
        x = self.reduction(x)
        return x

class BasicLayer(nn.Module):
    def __init__(self, dim, input_resolution, depth, num_heads, window_size,
                 mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., norm_layer=nn.LayerNorm, downsample=None):
        super().__init__()
        self.dim = dim
        self.input_resolution = input_resolution
        self.depth = depth
        
        self.blocks = nn.ModuleList([
            SwinTransformerBlock(dim=dim, input_resolution=input_resolution,
                                 num_heads=num_heads, window_size=window_size,
                                 shift_size=0 if (i % 2 == 0) else window_size // 2,
                                 mlp_ratio=mlp_ratio,
                                 qkv_bias=qkv_bias, qk_scale=qk_scale,
                                 drop=drop, attn_drop=attn_drop,
                                 drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,
                                 norm_layer=norm_layer)
            for i in range(depth)])
        
        if downsample is not None:
            self.downsample = downsample(input_resolution, dim=dim, norm_layer=norm_layer)
        else:
            self.downsample = None

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        if self.downsample is not None:
            x = self.downsample(x)
        return x

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)
        x = self.norm(x)
        return x


# =================================================================================
# HiPerformer Main Model
# =================================================================================

class HiPerformer(nn.Module):
    """
    HiPerformer: A High-Performance Global-Local Segmentation Model.
    The overall architecture as shown in Figure 1.
    """
    def __init__(self, img_size=224, in_channels=3, num_classes=9,
                 embed_dim=96, depths=[2, 2, 18, 2], num_heads=[3, 6, 12, 24],
                 window_size=7):
        super().__init__()
        self.num_classes = num_classes
        
        # --- Local Branch (CNN) ---
        self.local_conv_in = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3)
        self.local_maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.local_stage1 = DuChResBlock(64, 64)   # H/4, W/4
        self.local_stage2 = DuChResBlock(64, 128)  # H/8, W/8
        self.local_stage3 = DuChResBlock(128, 256) # H/16, W/16
        self.local_stage4 = DuChResBlock(256, 512) # H/32, W/32

        # --- Global Branch (Swin Transformer) ---
        # Downsampling (PatchMerging) is kept separate from the BasicLayers so that each
        # stage's output resolution and channel count line up with the local branch.
        self.img_size = img_size
        self.patch_embed = PatchEmbed(img_size=img_size, in_chans=in_channels, embed_dim=embed_dim)
        self.swin_stage1 = BasicLayer(dim=embed_dim, input_resolution=(img_size//4, img_size//4), depth=depths[0], num_heads=num_heads[0], window_size=window_size)
        self.merge1 = PatchMerging((img_size//4, img_size//4), dim=embed_dim)
        self.swin_stage2 = BasicLayer(dim=embed_dim*2, input_resolution=(img_size//8, img_size//8), depth=depths[1], num_heads=num_heads[1], window_size=window_size)
        self.merge2 = PatchMerging((img_size//8, img_size//8), dim=embed_dim*2)
        self.swin_stage3 = BasicLayer(dim=embed_dim*4, input_resolution=(img_size//16, img_size//16), depth=depths[2], num_heads=num_heads[2], window_size=window_size)
        self.merge3 = PatchMerging((img_size//16, img_size//16), dim=embed_dim*4)
        self.swin_stage4 = BasicLayer(dim=embed_dim*8, input_resolution=(img_size//32, img_size//32), depth=depths[3], num_heads=num_heads[3], window_size=window_size)
        
        # --- Local-Global Fusion Branch (LGFF) ---
        # Note: The first LGFF needs a dummy F_i-1 input.
        self.dummy_fi_0 = nn.Parameter(torch.zeros(1, 64, img_size // 4, img_size // 4)) # Example size for stage 1
        
        # LGFF module dimensions need to be defined based on local and global branch outputs.
        # LGFF(ch_li, ch_gi, ch_fi_minus_1, ch_out)
        self.lgff1 = LGFF(64, embed_dim, 64, 64)
        self.lgff2 = LGFF(128, embed_dim*2, 64, 128)
        self.lgff3 = LGFF(256, embed_dim*4, 128, 256)
        self.lgff4 = LGFF(512, embed_dim*8, 256, 512) # Bottleneck
        
        # --- Progressive Pyramid Aggregation (PPA) Decoder ---
        # PMI Modules (Top-down pathway)
        self.pmi3 = PMI(ch_shallow=256, ch_deep=512, ch_out=256)
        self.pmi2 = PMI(ch_shallow=128, ch_deep=256, ch_out=128)
        self.pmi1 = PMI(ch_shallow=64, ch_deep=128, ch_out=64)
        
        # PGA Modules (Upsampling pathway)
        self.decoder_psa4 = PSA(512, 256) # Deepest layer before upsampling
        self.upconv4 = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.pga3 = PGA(ch_e=256, ch_d=256, ch_out=256)

        self.upconv3 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.pga2 = PGA(ch_e=128, ch_d=128, ch_out=128)
        
        self.upconv2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.pga1 = PGA(ch_e=64, ch_d=64, ch_out=64)

        # --- Final Output Fusion ---
        self.out_conv = nn.Conv2d(64, num_classes, kernel_size=1)

    def _get_swin_feat(self, x):
        x = self.patch_embed(x)                     # (B, L, C) tokens at H/4
        s1 = self.swin_stage1(x)                    # H/4,  C = embed_dim
        s2 = self.swin_stage2(self.merge1(s1))      # H/8,  C = 2 * embed_dim
        s3 = self.swin_stage3(self.merge2(s2))      # H/16, C = 4 * embed_dim
        s4 = self.swin_stage4(self.merge3(s3))      # H/32, C = 8 * embed_dim

        def to_map(tokens, res):
            # Reshape the token sequence (B, L, C) back into a feature map (B, C, res, res)
            B, L, C = tokens.shape
            return tokens.transpose(1, 2).view(B, C, res, res)

        r = self.img_size // 4
        return to_map(s1, r), to_map(s2, r // 2), to_map(s3, r // 4), to_map(s4, r // 8)
        
    def forward(self, x):
        B, _, H, W = x.shape
        
        # --- Encoder ---
        # Local Branch
        l0 = self.local_conv_in(x)
        l1_in = self.local_maxpool(l0)
        l1 = self.local_stage1(l1_in)
        l2_in = self.local_maxpool(l1)
        l2 = self.local_stage2(l2_in)
        l3_in = self.local_maxpool(l2)
        l3 = self.local_stage3(l3_in)
        l4_in = self.local_maxpool(l3)
        l4 = self.local_stage4(l4_in)
        
        # Global Branch
        g1, g2, g3, g4 = self._get_swin_feat(x)
        
        # Local-Global Fusion Branch
        dummy_fi_0 = self.dummy_fi_0.expand(B, -1, -1, -1)
        f1 = self.lgff1(l1, g1, dummy_fi_0)
        f2 = self.lgff2(l2, g2, f1)
        f3 = self.lgff3(l3, g3, f2)
        f4 = self.lgff4(l4, g4, f3) # Bottleneck features

        # --- Decoder (PPA) ---
        # PMI Path (Top-down)
        y4 = f4
        y3 = self.pmi3(f3, y4)
        y2 = self.pmi2(f2, y3)
        y1 = self.pmi1(f1, y2)
        
        # PGA Path (Bottom-up with upsampling)
        d4 = self.decoder_psa4(y4) # Deepest decoder features
        
        d3_in = self.upconv4(d4)
        d3 = self.pga3(y3, d3_in)

        d2_in = self.upconv3(d3)
        d2 = self.pga2(y2, d2_in)
        
        d1_in = self.upconv2(d2)
        d1 = self.pga1(y1, d1_in)
        
        # --- Final Output ---
        # The paper describes upsampling the multi-scale decoder outputs to a common
        # resolution and fusing them by element-wise addition. Since d1..d4 here have
        # different channel widths, we take the simpler route of upsampling the finest
        # decoder output d1 (at H/4) back to the input resolution before the head.
        d1_up = F.interpolate(d1, size=(H, W), mode='bilinear', align_corners=True)

        logits = self.out_conv(d1_up)

        return logits


# =================================================================================
# Example Usage
# =================================================================================
if __name__ == '__main__':
    # --- Configuration ---
    IMG_SIZE = 224
    IN_CHANNELS = 3
    NUM_CLASSES = 9 # For Synapse dataset (8 organs + background)
    BATCH_SIZE = 2

    # --- Create Model ---
    print("Creating HiPerformer model...")
    model = HiPerformer(
        img_size=IMG_SIZE,
        in_channels=IN_CHANNELS,
        num_classes=NUM_CLASSES,
        embed_dim=96,                # Base dimension for Swin Transformer
        depths=[2, 2, 6, 2],         # Swin-T configuration (using smaller depths for demo)
        num_heads=[3, 6, 12, 24],
        window_size=7
    )
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    # --- Test with Dummy Input ---
    print(f"Testing model with a dummy input of size: ({BATCH_SIZE}, {IN_CHANNELS}, {IMG_SIZE}, {IMG_SIZE})")
    dummy_input = torch.randn(BATCH_SIZE, IN_CHANNELS, IMG_SIZE, IMG_SIZE).to(device)
    
    try:
        with torch.no_grad():
            output = model(dummy_input)
        
        print(f"Model forward pass successful!")
        print(f"Output shape: {output.shape}")
        
        # --- Parameter Count ---
        total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print(f"Total trainable parameters: {total_params / 1e6:.2f} M")
        
    except Exception as e:
        print(f"An error occurred during the model test: {e}")
