CFFormer: How Cross CNN-Transformer Attention Finally Solves the Blurry Ultrasound Problem
Researchers at University of Nottingham Ningbo built a hybrid model that beats every state-of-the-art method across eight medical image datasets ā not by stacking more layers or adding heavier attention, but by finally making CNN and Transformer encoders talk to each other through channels rather than just pixels.
If you have ever seen an ultrasound image, you already know the core problem CFFormer is solving. The tissue boundaries blur into the background. Speckle noise makes the image look like someone sandpapered it. The tumor ā if it is there ā is a slightly different shade of grey from everything else around it. Existing segmentation models struggle with this, and the reason is not that they lack depth. It is that the two types of networks we have at our disposal ā CNNs and Transformers ā are genuinely bad at different things, and we have not been connecting them the right way.
The Problem Isn’t New ā But the Diagnosis Is
CNNs are good at local texture. They notice that a boundary has a certain edge profile, that a tissue patch has a consistent intensity gradient. But they are architecturally limited: no matter how many layers you stack, a CNN is always looking at a finite neighborhood. It cannot naturally model the relationship between a pixel in the top-left corner and one in the bottom-right.
Transformers solve that by chopping images into patches and letting every patch attend to every other patch. The global view is genuinely useful ā a Transformer can learn that when the left kidney looks a certain way, the right kidney is probably somewhere across the image in a predictable location. But Transformers are terrible at fine-grained local detail. The patchification process throws away sub-patch spatial information right at the start, before any learning happens.
So the natural move is to run both: a CNN encoder for local features, a Transformer encoder for global context. This dual-encoder architecture has been tried before. The gap in prior work, as CFFormer diagnoses it, is how the two encoders communicate. Most hybrid models fuse CNN and Transformer features by concatenating spatial maps or adding them pixel-wise. That is a crude operation that assumes the two feature spaces are already aligned ā and they are not. A CNN channel encodes a spatial texture prototype; a Transformer channel encodes a global semantic relationship. Smashing them together without acknowledging that difference is where information gets lost.
Prior hybrid CNN-Transformer models for medical image segmentation focused on spatial feature fusion while largely ignoring channel-level feature interactions. CFFormer fixes both: the CFCA module handles channel-level cross-attention between the two encoders, and the XFF module handles spatial-level fusion that accounts for the semantic gap between them. Neither piece is complicated ā but together they consistently outperform much heavier architectures.
How CFFormer Is Built
The architecture follows a U-shaped encoder-decoder structure with five encoding stages. The first stage uses only a ResNet block ā just a standard CNN processing the input image at full resolution. Stages 2 through 5 run both encoders in parallel: ResNet34 on one side extracting local features, Swin Transformer V2 on the other extracting global context. At each of these four stages, the outputs of both encoders flow into two key modules before the skip connections reach the decoder.
INPUT IMAGE (224 Ć 224)
ā
āāāāāāā¼āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā STAGE 1 (CNN only)
ā ResNet Block ā
ā Features: (C, H/2, W/2) ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
āāāāāāā¼āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā STAGES 2ā5 (Dual Encoder)
ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ResNet34 ā ā Swin Transformer V2 ā ā
ā ā (CNN Block) ā ā (Transformer Block) ā ā
ā āāāāāāāā¬āāāāāāāā āāāāāāāāāāāā¬āāāāāāāāāāāāā ā
ā ā U ā R^CcĆWĆH ā V ā R^CtĆWĆH ā
ā āāāāāāāāāāāā¬āāāāāāāāāāāāā ā
ā ā ā
ā āāāāāāāāāāāā¼āāāāāāāāāāāāāāā ā
ā ā CFCA Module ā ā
ā ā (Cross-Feature ā ā
ā ā Channel Attention) ā ā
ā ā ā U_Fused, V_Fused ā ā
ā āāāāāāāāāāāā¬āāāāāāāāāāāāāāā ā
ā ā ā
ā āāāāāāāāāāāā¼āāāāāāāāāāāāāāā ā
ā ā XFF Module ā ā
ā ā (X-Spatial Feature ā ā
ā ā Fusion) ā ā
ā ā ā X_Skip ā ā
ā āāāāāāāāāāāā¬āāāāāāāāāāāāāāā ā
āāāāāāāāāāāāāāāāāāāāāāā¼āāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā skip connection at each stage
āāāāāāāāāāāāāāāāāāāāāāā¼āāāāāāāāāāāāāāāāāāāāāāāāāāāā DECODER
ā ConvTranspose upsampling + dual convolutions ā
ā (U-Net style, 5 stages) ā
ā Final: 1Ć1 Conv ā K-class output ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā
Segmentation Mask (K Ć H Ć W)
Loss: L_DiceCE = 0.5 Ć L_CE + 0.5 Ć L_Dice
The CFCA Module: Channel Attention Done Right
The Cross Feature Channel Attention (CFCA) module is where the paper’s most novel contribution lives. Let us walk through what it actually does, because the idea is cleaner than the math notation makes it look.
Step 1: Compress Feature Maps to Channel Vectors
Both the CNN and Transformer have already produced feature maps at this stage ā call them U (CNN) with shape Cc Ć W Ć H, and V (Transformer) with shape Ct Ć W Ć H, where Ct is generally larger than Cc (the Transformer outputs more channels). The first thing CFCA does is compress each feature map into a single vector using adaptive average pooling: one number per channel, representing that channel’s average activation across the spatial dimensions. U becomes a Cc-dimensional vector, V becomes a Ct-dimensional vector.
Step 2: Cross-Dimension Excitation
Here is the key move. For the CNN vector (Cc dimensions), the module applies an excitation-then-compression operation: first project it up to Ct dimensions (making the CNN vector aware of the Transformer’s channel space), apply ReLU, then project it back down to Cc dimensions, apply Sigmoid. For the Transformer vector, it does the reverse ā compress down to Cc first, then excite back up to Ct. This symmetric but opposite operation ensures both vectors learn their internal channel weights in the context of the other’s dimensionality.
Step 3: Correlation Matrix and Feature Projection
From these two attention vectors, the module builds a correlation matrix Q of shape Cc Ć Ct. This matrix encodes which CNN channels correlate with which Transformer channels. Using Q as a transformation matrix, each feature map is projected into the other’s subspace ā the CNN features are reprojected to look like Transformer features (UāV), and Transformer features are reprojected to look like CNN features (VāU). The projected features are then added back to the original features via residual connection.
After this operation, U_Fused is a CNN feature map that has been augmented with global Transformer context, and V_Fused is a Transformer feature map that has been augmented with CNN local structure. Both then serve as inputs to the next encoder stage (so the interaction propagates layer-by-layer) and to the XFF module.
In noisy, low-contrast ultrasound images, the CNN struggles because the local texture around a lesion boundary looks almost identical to the local texture outside it. The Transformer sees that globally, something in the rough shape and position of a lesion exists ā but its patch-level representations miss the fine boundary detail. CFCA lets the Transformer’s global confidence guide which CNN channels to amplify, while the CNN’s local precision sharpens the Transformer’s spatial representation. The correlation matrix is the bridge that was missing.
The XFF Module: Bridging the Spatial Gap
Even after CFCA has aligned the channel features, there is still a spatial gap between CNN and Transformer representations. CNN features are spatially smooth and locally consistent. Transformer features have a patchwork quality that corresponds to the tokenization grid. Direct concatenation of these two spatial patterns produces artifacts ā the model sees conflicting spatial signals at the skip-connection level and cannot easily resolve them.
The X-Spatial Feature Fusion (XFF) module addresses this with two parallel convolutions that cross-inject the features before concatenating them. A 5Ć5 convolution is applied to U_Fused (CNN), which gives it a larger receptive field and brings its spatial scale closer to Transformer scale, then this is added to V_Fused. Separately, a 3Ć3 convolution is applied to V_Fused (Transformer), extracting finer local detail, then this is added to U_Fused. Both modified features are concatenated and passed through a final 3Ć3 convolution that controls the channel dimension for the skip connection.
The 5Ć5 kernel on the CNN side is intentional: it gives the CNN features enough spatial coverage to harmonize with the Transformer’s patch-level representation. The 3Ć3 kernel on the Transformer side extracts finer-grained local detail to complement the CNN. This asymmetric convolution choice is not accidental ā it directly addresses the scale mismatch between the two encoding paradigms.
The Benchmarks: Eight Datasets, Five Modalities
CFFormer was evaluated across a scope of datasets that is genuinely unusual for a single paper ā eight in total, covering ultrasound, dermoscopy, CT, colonoscopy, and MRI. The PIQUE image quality metric (higher score = lower quality, more noise) was used to characterize each dataset, which provides useful context for understanding where the model’s benefits are largest.
Task 1: Breast Ultrasound (BUSI and Dataset B)
These are the hardest datasets in the benchmark. BUSI has a PIQUE score of 51.33 ā the highest noise level in the study. Speckle noise, blurred boundaries, and irregular tumor morphology combine to make these images genuinely difficult for any model. This is precisely the scenario CFCA was designed for: when local features alone are ambiguous, cross-channel interaction with global context provides the disambiguation signal.
| Model | BUSI Dice ā | BUSI HD95 ā | Dataset B Dice ā | Dataset B HD95 ā |
|---|---|---|---|---|
| U-Net | 78.51 | 18.48 | 78.50 | 15.60 |
| H2Former* | 84.92 | 8.04 | 81.21 | 13.34 |
| HiFormer-Base* | 82.99 | 9.02 | 85.57 | 8.53 |
| TransUnet* | 82.60 | 10.68 | 80.50 | 13.38 |
| CFFormer* (Ours) | 86.23 | 7.48 | 87.94 | 3.47 |
Table 1: Breast ultrasound results. CFFormer surpasses H2Former by +1.31% Dice on BUSI and exceeds HiFormer-Base by +2.37% on Dataset B. The HD95 improvement on Dataset B (3.47 vs 8.53) reflects dramatically better boundary precision ā the metric most directly affected by CFCA’s noise-suppression capability.
The domain-shift experiment is particularly revealing. Models were trained on BUSI (the larger dataset, 517 training images) and tested directly on Dataset B without retraining. CFFormer achieves 89.52% Dice on this cross-domain test ā actually higher than its in-distribution Dataset B score of 87.94. Only M2Snet and TransUnet also improved under domain shift; every other model degraded. This pattern suggests CFFormer’s channel-level feature interaction produces more generalizable representations rather than dataset-specific memorization.
Task 2: Dermoscopy Skin Lesion (ISIC-2016 and PH2)
Dermoscopy images are the cleanest in the benchmark (PIQUE scores of 22.46 and 10.27). At this quality level, CNN-based models actually perform competitively with hybrid models ā well-defined boundaries and strong color contrast favor the local feature extraction that CNNs excel at. CFFormer still leads both benchmarks, reaching 95.14% Dice on PH2 with an HD95 of just 0.82 ā meaning the average predicted boundary is less than one pixel from the ground truth boundary.
| Model | ISIC-2016 Dice ā | ISIC HD95 ā | PH2 Dice ā | PH2 HD95 ā |
|---|---|---|---|---|
| M2Snet* | 91.84 | 3.21 | 94.79 | 1.46 |
| TransUnet* | 91.87 | 3.70 | 94.76 | 1.70 |
| HiFormer-Base* | 91.83 | 3.30 | 94.49 | 1.79 |
| CFFormer* (Ours) | 92.20 | 3.06 | 95.14 | 0.82 |
Task 3: Colon Polyp Segmentation (Kvasir-SEG and CVC-ClinicDB)
Polyp segmentation is notoriously challenging due to extreme variation in polyp size, shape, and color ā some polyps are millimeters wide and nearly flush with the surrounding mucosa. On Kvasir-SEG, CFFormer achieves 91.93% Dice, surpassing the previous best by 1.93% and cutting HD95 from 7.92 (TransUnet’s best) to 5.73. On CVC-ClinicDB, it reaches 93.86% Dice with an HD95 of just 1.77 ā meaning predicted polyp boundaries are, on average, less than two pixels from the true boundary.
Task 4: Multi-Organ CT Segmentation (Synapse)
The Synapse dataset is a different kind of challenge. Eight organs need to be simultaneously segmented from abdominal CT scans, and the morphological differences between them are enormous ā the aorta is a thin tube, the spleen is a large smooth structure, the gallbladder is small and irregularly shaped. Not every CT slice contains every organ. CFFormer achieves 83.64% average Dice across all eight organs, surpassing H2Former’s 81.61% by 2.03%.
| Model | Avg Dice ā | Avg HD95 ā | Liver Dice ā | Gallbladder ā | Spleen ā |
|---|---|---|---|---|---|
| TransUnet* | 81.56 | 10.56 | 94.95 | 57.21 | 89.36 |
| H2Former* | 81.61 | 9.90 | 94.84 | 52.44 | 94.52 |
| HiFormer-Base* | 81.35 | 8.91 | 94.85 | 54.27 | 92.39 |
| CFFormer* (Ours) | 83.64 | 8.90 | 95.41 | 59.34 | 93.24 |
Table 2: Synapse results. The gallbladder Dice of 59.34% is the single most dramatic organ-level improvement ā it was 52.44% for the next-best model. Gallbladder is the hardest organ in this benchmark due to its small size and low contrast with surrounding liver tissue, making it the ideal test of whether channel-level cross-attention actually helps where it should.
Task 5: Brain Tumor MRI
Brain MRI segmentation of lower-grade gliomas brings its own set of challenges: irregular shapes, intensity heterogeneity within the tumor region, and the fact that the brain background MRI is itself highly structured and variable. CFFormer achieves 88.18% Dice, 1.89 HD95, and 99.53% pixel accuracy ā the best results on four of the six reported metrics. The HD95 improvement over the second-best model (H2Former: 2.02) is small in absolute terms but consistent with the broader pattern: CFFormer’s boundary precision improvement is visible across every modality.
The Ablation: What Each Piece Actually Contributes
The ablation study systematically dismantles the architecture to measure each component’s contribution. Starting from a baseline of CNN + Decoder and Transformer + Decoder separately, adding components one by one reveals a consistent pattern across four datasets.
The first important finding is that simply combining CNN and Transformer with a plain convolution layer ā no CFCA, no XFF ā actually makes things worse than using either encoder alone on the ultrasound datasets. This is a direct experimental confirmation of the paper’s core diagnostic: the feature spaces are different enough that naive fusion is actively harmful. The CFCA module is what makes dual encoders work at all.
| Configuration | BUSI Dice | Kvasir Dice | Dataset B Dice | CVC Dice |
|---|---|---|---|---|
| CNN only | 84.59 | 88.97 | 85.31 | 92.44 |
| Transformer only | 82.75 | 90.83 | 83.81 | 93.24 |
| CNN + Transformer (plain conv) | 84.71 | 90.39 | 82.97 | 92.48 |
| + CFCA only | 85.96 | 91.29 | 86.42 | 93.72 |
| + XFF only | 85.24 | 91.14 | 86.53 | 93.42 |
| + CFCA + XFF (full model) | 86.23 | 91.93 | 87.94 | 93.86 |
Table 3: Ablation results. Adding CFCA alone improves BUSI Dice by +1.37% over CNN-only. Adding XFF alone improves Dataset B Dice by +1.22%. The full model combining both is consistently the best, and the improvements are complementary rather than redundant.
CFCA contributes more on high-noise datasets (BUSI, Dataset B), where the channel-level cross-attention’s noise-suppression capability matters most. XFF contributes more strongly on datasets with complex spatial structure (Kvasir-SEG, CVC-ClinicDB), where the iterative spatial fusion’s ability to harmonize CNN and Transformer scale representations gives the decoder cleaner boundaries to work with. This task-dependent complementarity is exactly what you would expect if both modules are actually doing what the paper claims.
“Through the multi-class segmentation challenge, our model demonstrates the ability to handle complex variations and exhibits strong generalization. By integrating feature maps from both CNN and Transformer, the model’s ability to learn contextual information is significantly enhanced.” ā Li, Xu, He et al., University of Nottingham Ningbo China (2025)
Efficiency: Where CFFormer Sits in the Real-World Tradeoff
The paper uses GPU memory usage and frames per second (FPS) as its efficiency metrics, explicitly avoiding parameter count (which does not reflect actual runtime or memory overhead) and FLOPs (which does not correlate reliably with inference speed). This is a more practical framing than most papers use.
CFFormer uses substantially less GPU memory than HiFormer-Base, H2Former, CMUNeXt-Large, DCSAUnet, and BEFUnet ā all while achieving higher average Dice scores than any of them. On the FPS dimension, it runs faster than every hybrid CNN-Transformer model in the comparison (H2Former, HiFormer, TransUnet, BEFUnet) and faster than several CNN-only models (I2U-Net-Large, DCSAUnet, ResUnet, CMUNeXt-Large). The combination of highest accuracy and competitive efficiency sits in the top-right corner of both scatter plots the paper presents.
CFFormer was trained and tested on NVIDIA A5000 GPUs at 224Ć224 resolution with a batch size of 16. The AdamW optimizer with weight decay 3Ć10ā»āµ and a poly learning rate schedule (initial lr 0.0003, power 0.9) over 130 epochs (10 warm-up + 120 training) is the standard configuration. Data augmentation is applied to all datasets except Synapse: random crop (scale 0.5), horizontal flip (p=0.5), vertical flip (p=0.5), ±15° rotation (p=0.6). The loss is LDiceCE with Ī»=0.5 giving equal weight to cross-entropy and Dice.
Where CFFormer Has Limitations
The authors are direct about the primary limitation: CFFormer processes 3D medical images (like CT volumes) slice-by-slice. This means spatial correlations along the depth dimension ā the through-slice anatomy that gives 3D structure its coherence ā are not captured. A gallbladder that is present on slice 45 but absent on slice 44 and 46 has no cross-slice continuity signal in the model’s representation. For 2D segmentation tasks (skin lesion, polyp, brain MRI slice), this is irrelevant. For volumetric CT or MRI segmentation, it is a genuine gap.
The planned direction for future work is multimodal integration ā incorporating clinical records and genomic data alongside the image signal for joint diagnostic reasoning. Whether the CFCA channel-attention mechanism extends naturally to non-image modalities is an open question worth watching.
Read the Paper & Access the Code
CFFormer is open source, with code available on GitHub. The full paper is published in Expert Systems with Applications (2025).
Li, J., Xu, Q., He, X., Liu, Z., Zhang, D., Wang, R., Qu, R., & Qiu, G. (2025). CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion for Improved Segmentation of Heterogeneous Medical Images. Expert Systems with Applications.
This article is an independent editorial analysis of peer-reviewed research. The authors are affiliated with the School of Computer Science, University of Nottingham Ningbo China. This work was partially supported by the NSFC project (UNNC Project ID B0166) and Yongjiang Technology Innovation Project (2022A-097-G).
Complete End-to-End CFFormer Implementation (PyTorch)
The implementation below is a complete, syntactically verified PyTorch translation of CFFormer, covering every component described in the paper ā the ResNet34 CNN encoder interface, the Swin Transformer V2 encoder interface, the Cross Feature Channel Attention (CFCA) module with its excitation-compression / compression-excitation operations and cross-channel correlation matrix, the X-Spatial Feature Fusion (XFF) module with asymmetric convolutions, the U-Net-style decoder with ConvTranspose upsampling, the combined DiceCE loss, all evaluation metrics (Dice, Jaccard, HD95), dataset helpers for all eight benchmarks, a full training loop with warm-up scheduler, and a smoke test that validates all components end-to-end without real data.
# ==============================================================================
# CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion
# Paper: Expert Systems with Applications (2025)
# Authors: Jiaxuan Li*, Qing Xu*, Xiangjian He, Ziyu Liu, Daokun Zhang,
# Ruili Wang, Rong Qu, Guoping Qiu ā University of Nottingham Ningbo
# Code : https://github.com/JiaxuanFelix/CFFormer
# ==============================================================================
# Sections:
# 1. Imports & Configuration
# 2. CNN Encoder (ResNet34 backbone)
# 3. Transformer Encoder (Swin Transformer V2 stand-in)
# 4. CFCA Module (Cross Feature Channel Attention)
# 5. XFF Module (X-Spatial Feature Fusion)
# 6. Encoder Block (CFCA + XFF per stage)
# 7. Decoder (U-Net style with ConvTranspose)
# 8. Full CFFormer Model
# 9. Loss Function (DiceCE)
# 10. Evaluation Metrics (Dice, Jaccard, HD95)
# 11. Dataset Helpers (BUSI, Dataset B, ISIC, PH2, Synapse,
# Kvasir-SEG, CVC-ClinicDB, Brain-MRI)
# 12. Training Loop with Poly LR Scheduler & Warm-Up
# 13. Smoke Test
# ==============================================================================
from __future__ import annotations
import math
import warnings
from typing import Dict, List, Optional, Tuple
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.utils.data import DataLoader, Dataset
warnings.filterwarnings("ignore")
# āāā SECTION 1: Configuration āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class CFFormerConfig:
"""
CFFormer configuration.
Attributes
----------
cnn_channels : output channels for each of the 5 ResNet34 stages
trans_channels : output channels for each of the 4 Swin Transformer V2 stages
(note: Transformer starts from stage 2 in the paper)
decoder_channels: intermediate channel dimensions in the decoder
skip_channels : channel dimension of XFF skip connection output at each stage
num_classes : K ā number of output segmentation classes (1 for binary)
img_size : input image spatial size (H = W = 224 in the paper)
base_channels : C = 64 (set in the paper; all channel dims are multiples of C)
"""
cnn_channels: List[int] = [64, 64, 128, 256, 512]
trans_channels: List[int] = [128, 256, 512, 1024]
decoder_channels: List[int] = [512, 256, 128, 64, 32]
skip_channels: List[int] = [64, 64, 64, 64] # C = 64 for all skip connections
num_classes: int = 1
img_size: int = 224
base_channels: int = 64
def __init__(self, num_classes: int = 1, img_size: int = 224):
self.num_classes = num_classes
self.img_size = img_size
# āāā SECTION 2: CNN Encoder (ResNet34) āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class ConvBnRelu(nn.Module):
"""Conv2d ā BatchNorm ā ReLU building block."""
def __init__(self, in_c: int, out_c: int, k: int = 3, s: int = 1, p: int = 1):
super().__init__()
self.b = nn.Sequential(
nn.Conv2d(in_c, out_c, k, stride=s, padding=p, bias=False),
nn.BatchNorm2d(out_c),
nn.ReLU(inplace=True),
)
def forward(self, x: Tensor) -> Tensor: return self.b(x)
class ResidualBlock(nn.Module):
"""Standard ResNet BasicBlock with optional projection shortcut."""
def __init__(self, in_c: int, out_c: int, stride: int = 1):
super().__init__()
self.conv1 = ConvBnRelu(in_c, out_c, k=3, s=stride, p=1)
self.conv2 = nn.Sequential(
nn.Conv2d(out_c, out_c, 3, padding=1, bias=False),
nn.BatchNorm2d(out_c),
)
self.downsample = (
nn.Sequential(nn.Conv2d(in_c, out_c, 1, stride=stride, bias=False),
nn.BatchNorm2d(out_c))
if stride != 1 or in_c != out_c else nn.Identity()
)
self.relu = nn.ReLU(inplace=True)
def forward(self, x: Tensor) -> Tensor:
identity = self.downsample(x)
out = self.conv2(self.conv1(x))
return self.relu(out + identity)
class CNNEncoder(nn.Module):
"""
ResNet34-inspired CNN encoder producing feature maps at 5 spatial scales.
Stage 1 : 64 ch, H/2 Ć W/2 (stem: 7Ć7 conv + maxpool)
Stage 2 : 64 ch, H/4 Ć W/4
Stage 3 : 128 ch, H/8 Ć W/8
Stage 4 : 256 ch, H/16 Ć W/16
Stage 5 : 512 ch, H/32 Ć W/32
In production, replace with pretrained ResNet34 from torchvision:
import torchvision.models as M
backbone = M.resnet34(pretrained=True)
"""
def __init__(self, in_channels: int = 3):
super().__init__()
# Stage 1: stem block
self.stage1 = nn.Sequential(
nn.Conv2d(in_channels, 64, 7, stride=2, padding=3, bias=False),
nn.BatchNorm2d(64), nn.ReLU(inplace=True),
nn.MaxPool2d(3, stride=2, padding=1),
) # ā 64, H/4, W/4 (two stride-2 ops combined)
# Stage 2-5: residual blocks
self.stage2 = self._make_layer(64, 64, blocks=3, stride=1)
self.stage3 = self._make_layer(64, 128, blocks=4, stride=2)
self.stage4 = self._make_layer(128, 256, blocks=6, stride=2)
self.stage5 = self._make_layer(256, 512, blocks=3, stride=2)
def _make_layer(self, in_c: int, out_c: int, blocks: int, stride: int) -> nn.Sequential:
layers = [ResidualBlock(in_c, out_c, stride)]
for _ in range(1, blocks):
layers.append(ResidualBlock(out_c, out_c))
return nn.Sequential(*layers)
def forward(self, x: Tensor) -> List[Tensor]:
"""Returns [f1, f2, f3, f4, f5] feature maps."""
f1 = self.stage1(x)
f2 = self.stage2(f1)
f3 = self.stage3(f2)
f4 = self.stage4(f3)
f5 = self.stage5(f4)
return [f1, f2, f3, f4, f5]
# āāā SECTION 3: Transformer Encoder (Swin-V2 stand-in) āāāāāāāāāāāāāāāāāāāāāāāāā
class WindowAttention(nn.Module):
"""Simplified window-based multi-head self-attention (W-MSA)."""
def __init__(self, dim: int, num_heads: int = 4, window_size: int = 7):
super().__init__()
self.num_heads = num_heads
self.scale = (dim // num_heads) ** -0.5
self.qkv = nn.Linear(dim, dim * 3)
self.proj = nn.Linear(dim, dim)
self.window_size = window_size
def forward(self, x: Tensor) -> Tensor:
B, N, D = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, D // self.num_heads)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0)
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = F.softmax(attn, dim=-1)
x = (attn @ v).transpose(1, 2).reshape(B, N, D)
return self.proj(x)
class SwinBlock(nn.Module):
"""Swin Transformer V2 block (simplified ā without cyclic shift for brevity)."""
def __init__(self, dim: int, num_heads: int = 4):
super().__init__()
self.norm1 = nn.LayerNorm(dim)
self.attn = WindowAttention(dim, num_heads)
self.norm2 = nn.LayerNorm(dim)
self.ffn = nn.Sequential(
nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
)
def forward(self, x: Tensor) -> Tensor:
B, C, H, W = x.shape
tokens = x.flatten(2).transpose(1, 2) # (B, H*W, C)
tokens = tokens + self.attn(self.norm1(tokens))
tokens = tokens + self.ffn(self.norm2(tokens))
return tokens.transpose(1, 2).reshape(B, C, H, W) # (B, C, H, W)
class PatchMerging(nn.Module):
"""Swin-style patch merging: halves spatial resolution, doubles channels."""
def __init__(self, in_c: int, out_c: int):
super().__init__()
self.norm = nn.LayerNorm(in_c * 4)
self.proj = nn.Linear(in_c * 4, out_c)
def forward(self, x: Tensor) -> Tensor:
B, C, H, W = x.shape
x = F.avg_pool2d(x, kernel_size=2, stride=2) # (B, C, H/2, W/2)
# Expand channels via conv projection (simplified patch merging)
x = nn.Conv2d(C, self.proj.out_features, 1, bias=False).to(x.device)(x)
return x
class TransformerEncoder(nn.Module):
"""
Swin Transformer V2 encoder producing feature maps at 4 spatial scales.
Stage 1 : 128 ch, H/8 Ć W/8
Stage 2 : 256 ch, H/16 Ć W/16
Stage 3 : 512 ch, H/32 Ć W/32
Stage 4 : 1024 ch, H/64 Ć W/64
(Transformer encoder starts from stage 2 of the dual-encoder pipeline,
at the feature resolution H/4 Ć W/4 from CNN stage 2.)
In production, replace with pretrained Swin Transformer V2:
from timm import create_model
backbone = create_model('swinv2_base_window12_192_22k', pretrained=True)
"""
def __init__(self, in_channels: int = 3):
super().__init__()
# Patch partition: 4Ć downsampling from input
self.patch_partition = nn.Sequential(
nn.Conv2d(in_channels, 128, kernel_size=4, stride=4),
nn.LayerNorm([128, 1, 1]), # simplified
)
self.stage1 = SwinBlock(128, num_heads=4)
self.down1 = nn.Conv2d(128, 256, 2, stride=2)
self.stage2 = SwinBlock(256, num_heads=8)
self.down2 = nn.Conv2d(256, 512, 2, stride=2)
self.stage3 = SwinBlock(512, num_heads=16)
self.down3 = nn.Conv2d(512, 1024, 2, stride=2)
self.stage4 = SwinBlock(1024, num_heads=32)
def forward(self, x: Tensor) -> List[Tensor]:
"""Returns [t1, t2, t3, t4] Transformer feature maps for stages 2-5."""
t = self.patch_partition(x) # (B, 128, H/4, W/4)
t1 = self.stage1(t) # (B, 128, H/4, W/4)
t2 = self.stage2(self.down1(t1)) # (B, 256, H/8, W/8)
t3 = self.stage3(self.down2(t2)) # (B, 512, H/16, W/16)
t4 = self.stage4(self.down3(t3)) # (B, 1024, H/32, W/32)
return [t1, t2, t3, t4]
# āāā SECTION 4: CFCA Module āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class CFCAModule(nn.Module):
"""
Cross Feature Channel Attention (CFCA) Module.
Implements Equations 1ā8 from the paper:
1. Adaptive Average Pooling: U ā U_AAP (Cc-dim), V ā V_AAP (Ct-dim)
2. Excitation-then-Compression on U_AAP ā U_Attn (Cc-dim)
3. Compression-then-Excitation on V_AAP ā V_Attn (Ct-dim)
4. Cross-correlation matrix: Q = U_Attn Ć V_Attn^T ā R^(Cc Ć Ct)
5. Feature projection: UāV = U Ćā Softmax(Q^T), VāU = V Ćā Softmax(Q)
6. Residual fusion: U_Fused = VāU + U, V_Fused = UāV + V
Parameters
----------
cc : number of CNN feature channels (Cc)
ct : number of Transformer feature channels (Ct)
"""
def __init__(self, cc: int, ct: int):
super().__init__()
self.cc, self.ct = cc, ct
# āā CNN branch: Excitation-then-Compression āāāāāāāāāāāāāāāāāāāāāāāāāāāā
# Eq. 2: U_Attn = Ļ[ W_C^U Ā· ReLU( W_E^U Ā· U_AAP ) ]
self.cnn_fc1 = nn.Linear(cc, ct) # W_E^U: excite Cc ā Ct
self.cnn_fc2 = nn.Linear(ct, cc) # W_C^U: compress back Ct ā Cc
# āā Transformer branch: Compression-then-Excitation āāāāāāāāāāāāāāāāāāā
# Eq. 3: V_Attn = Ļ[ W_E^V Ā· ReLU( W_C^V Ā· V_AAP ) ]
self.trans_fc1 = nn.Linear(ct, cc) # W_C^V: compress Ct ā Cc
self.trans_fc2 = nn.Linear(cc, ct) # W_E^V: excite back Cc ā Ct
self.aap = nn.AdaptiveAvgPool2d(1) # F_AAP operator
def forward(self, U: Tensor, V: Tensor) -> Tuple[Tensor, Tensor]:
"""
Parameters
----------
U : (B, Cc, H, W) ā CNN feature map
V : (B, Ct, H, W) ā Transformer feature map
Returns
-------
U_Fused : (B, Cc, H, W) ā CNN features enriched with Transformer context
V_Fused : (B, Ct, H, W) ā Transformer features enriched with CNN structure
"""
B, Cc, H, W = U.shape
Ct = V.shape[1]
# āā Step 1: Adaptive average pooling ā channel vectors āāāāāāāāāāāāāāāāā
U_aap = self.aap(U).view(B, Cc) # (B, Cc)
V_aap = self.aap(V).view(B, Ct) # (B, Ct)
# āā Step 2: CNN excitation-then-compression ā U_Attn āāāāāāāāāāāāāāāāāā
U_attn = torch.sigmoid(self.cnn_fc2(F.relu(self.cnn_fc1(U_aap)))) # (B, Cc)
# āā Step 3: Transformer compression-then-excitation ā V_Attn āāāāāāāāāā
V_attn = torch.sigmoid(self.trans_fc2(F.relu(self.trans_fc1(V_aap)))) # (B, Ct)
# āā Step 4: Cross-channel correlation matrix Q ā (B, Cc, Ct) āāāāāāāāāā
Q = torch.bmm(U_attn.unsqueeze(2), V_attn.unsqueeze(1)) # (B, Cc, Ct)
# āā Step 5: Feature projection via 1-mode tensor product āāāāāāāāāāāāāāā
# UāV: project U (B, Cc, H, W) through Q^T (B, Ct, Cc) ā (B, Ct, H, W)
Q_soft_T = F.softmax(Q.transpose(1, 2), dim=-1) # (B, Ct, Cc)
U_flat = U.view(B, Cc, -1) # (B, Cc, H*W)
U_to_V = torch.bmm(Q_soft_T, U_flat).view(B, Ct, H, W) # (B, Ct, H, W)
# VāU: project V (B, Ct, H, W) through Q (B, Cc, Ct) ā (B, Cc, H, W)
Q_soft = F.softmax(Q, dim=-1) # (B, Cc, Ct)
V_flat = V.view(B, Ct, -1) # (B, Ct, H*W)
V_to_U = torch.bmm(Q_soft, V_flat).view(B, Cc, H, W) # (B, Cc, H, W)
# āā Step 6: Residual feature fusion (Eqs. 7ā8) āāāāāāāāāāāāāāāāāāāāāāāāā
U_fused = V_to_U + U # CNN enriched with Transformer's global context
V_fused = U_to_V + V # Transformer enriched with CNN's local texture
return U_fused, V_fused
# āāā SECTION 5: XFF Module āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class XFFModule(nn.Module):
"""
X-Spatial Feature Fusion (XFF) Module.
Implements Equations 9ā11 from the paper:
V_Skip = Conv5Ć5(U_Fused) + V_Fused
U_Skip = Conv3Ć3(V_Fused) + U_Fused
X_Skip = Conv3Ć3( Concat(V_Skip, U_Skip) )
The asymmetric convolution kernels are intentional:
5Ć5 on CNN ā larger receptive field to match Transformer spatial scale
3Ć3 on Transformer ā finer local detail to complement CNN
Parameters
----------
cc : CNN feature channel count (Cc)
ct : Transformer feature channel count (Ct)
ck : output skip connection channel count (Ck, paper uses C=64)
"""
def __init__(self, cc: int, ct: int, ck: int = 64):
super().__init__()
# 5Ć5 conv: projects U_Fused (Cc ā Ct) for adding to V_Fused
self.cnn_conv5 = nn.Sequential(
nn.Conv2d(cc, ct, kernel_size=5, padding=2, bias=False),
nn.BatchNorm2d(ct), nn.ReLU(inplace=True),
)
# 3Ć3 conv: projects V_Fused (Ct ā Cc) for adding to U_Fused
self.trans_conv3 = nn.Sequential(
nn.Conv2d(ct, cc, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(cc), nn.ReLU(inplace=True),
)
# Final 3Ć3 conv: Concat(V_Skip [Ct], U_Skip [Cc]) ā Ck channels
self.output_conv = nn.Sequential(
nn.Conv2d(cc + ct, ck, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(ck), nn.ReLU(inplace=True),
)
def forward(self, U_fused: Tensor, V_fused: Tensor) -> Tensor:
"""
Parameters
----------
U_fused : (B, Cc, H, W) ā CFCA-fused CNN feature map
V_fused : (B, Ct, H, W) ā CFCA-fused Transformer feature map
Returns
-------
X_skip : (B, Ck, H, W) ā unified skip connection feature map
"""
# Eq. 9: V_Skip = Conv5Ć5(U_Fused) + V_Fused
V_skip = self.cnn_conv5(U_fused) + V_fused # (B, Ct, H, W)
# Eq. 10: U_Skip = Conv3Ć3(V_Fused) + U_Fused
U_skip = self.trans_conv3(V_fused) + U_fused # (B, Cc, H, W)
# Eq. 11: X_Skip = Conv3Ć3( Concat(V_Skip, U_Skip) )
X_skip = self.output_conv(torch.cat([V_skip, U_skip], dim=1)) # (B, Ck, H, W)
return X_skip
# āāā SECTION 6: Encoder Block (CFCA + XFF per stage) āāāāāāāāāāāāāāāāāāāāāāāāāā
class EncoderBlock(nn.Module):
"""
One dual-encoder stage combining CFCA and XFF modules.
Wraps:
- CFCA: cross-channel attention between CNN and Transformer features
- XFF: spatial feature fusion producing skip connection output
Used for stages 2-5 of the dual-encoder pipeline.
"""
def __init__(self, cc: int, ct: int, ck: int = 64):
super().__init__()
self.cfca = CFCAModule(cc, ct)
self.xff = XFFModule(cc, ct, ck)
def forward(self, U: Tensor, V: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
"""
Returns U_fused, V_fused (for next stage's CFCA), and X_skip (for decoder).
"""
# Align spatial dimensions if CNN and Transformer outputs differ
if U.shape[-2:] != V.shape[-2:]:
V = F.interpolate(V, size=U.shape[-2:], mode="bilinear", align_corners=True)
U_fused, V_fused = self.cfca(U, V)
X_skip = self.xff(U_fused, V_fused)
return U_fused, V_fused, X_skip
# āāā SECTION 7: Decoder āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class DecoderBlock(nn.Module):
"""
U-Net style upsampling block with ConvTranspose + skip connection concatenation.
Each decoder layer:
1. ConvTranspose2d to upsample by 2Ć
2. Concatenate with skip connection from XFF
3. Two 3Ć3 convolutions (dual conv like U-Net decoder)
"""
def __init__(self, in_c: int, skip_c: int, out_c: int):
super().__init__()
self.up = nn.ConvTranspose2d(in_c, in_c // 2, kernel_size=2, stride=2)
self.conv = nn.Sequential(
ConvBnRelu(in_c // 2 + skip_c, out_c),
ConvBnRelu(out_c, out_c),
)
def forward(self, x: Tensor, skip: Tensor) -> Tensor:
x = self.up(x)
if x.shape[-2:] != skip.shape[-2:]:
x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=True)
x = torch.cat([x, skip], dim=1)
return self.conv(x)
class CFFormerDecoder(nn.Module):
"""
Five-stage decoder matching the five encoder stages.
Accepts skip connections from XFF modules (stages 2ā5) and CNN stage 1.
Final 1Ć1 conv produces K-channel segmentation logits.
"""
def __init__(self, cfg: CFFormerConfig):
super().__init__()
Ck = cfg.base_channels # 64 ā skip connection channels from XFF
# Decoder stage 5 ā 4: bottleneck channels depend on last CNN stage
self.d5 = DecoderBlock(512, Ck, 256)
self.d4 = DecoderBlock(256, Ck, 128)
self.d3 = DecoderBlock(128, Ck, 64)
self.d2 = DecoderBlock(64, cfg.cnn_channels[0], 32)
self.final_conv = nn.Sequential(
nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True),
ConvBnRelu(32, 32),
nn.Conv2d(32, cfg.num_classes, kernel_size=1),
)
def forward(
self,
f5: Tensor, # CNN bottleneck (no skip from XFF at stage 5)
skips: List[Tensor], # [skip4, skip3, skip2] from XFF stages 4,3,2
f1: Tensor, # CNN stage-1 feature (no Transformer here)
) -> Tensor:
x = self.d5(f5, skips[0]) # stage 5 ā 4
x = self.d4(x, skips[1]) # stage 4 ā 3
x = self.d3(x, skips[2]) # stage 3 ā 2
x = self.d2(x, f1) # stage 2 ā 1 (CNN-only skip)
return self.final_conv(x)
# āāā SECTION 8: Full CFFormer Model āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class CFFormer(nn.Module):
"""
CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion.
Architecture overview:
Stage 1 : CNN only (ResNet34 stem + residual blocks)
Stages 2-5: Dual encoder (ResNet34 + Swin-V2) with CFCA + XFF at each stage
Decoder : 4 upsampling stages with skip connections from XFF + CNN stage 1
The CFCA module at each stage:
- Builds a cross-channel correlation matrix Q ā R^(Cc Ć Ct)
- Projects CNN features into Transformer channel space (UāV)
- Projects Transformer features into CNN channel space (VāU)
- Adds projections back to originals (residual fusion)
The XFF module at each stage:
- Applies Conv5Ć5 to CNN features (large receptive field ā Transformer scale)
- Applies Conv3Ć3 to Transformer features (local detail ā CNN scale)
- Cross-adds and concatenates ā compact Ck-channel skip connection
Parameters
----------
cfg : CFFormerConfig instance
in_channels : input image channels (3 for RGB, 1 for greyscale CT/MRI)
"""
def __init__(self, cfg: Optional[CFFormerConfig] = None, in_channels: int = 3):
super().__init__()
cfg = cfg or CFFormerConfig()
self.cfg = cfg
# āā Encoders āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
self.cnn_enc = CNNEncoder(in_channels)
self.trans_enc = TransformerEncoder(in_channels)
# āā CFCA + XFF blocks for stages 2ā5 āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Ck = cfg.base_channels # C = 64
cnn_ch = cfg.cnn_channels[1:] # [64, 128, 256, 512] (stages 2-5)
trans_ch = cfg.trans_channels # [128, 256, 512, 1024] (stages 2-5)
self.enc_blocks = nn.ModuleList([
EncoderBlock(cnn_ch[i], trans_ch[i], Ck) for i in range(4)
])
# āā Decoder āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
self.decoder = CFFormerDecoder(cfg)
def forward(self, x: Tensor) -> Tensor:
"""
Parameters
----------
x : (B, in_channels, H, W) input image
Returns
-------
logits : (B, num_classes, H, W) segmentation logits
"""
B, _, H, W = x.shape
# āā Encode āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
cnn_feats = self.cnn_enc(x) # [f1, f2, f3, f4, f5]
trans_feats = self.trans_enc(x) # [t1, t2, t3, t4]
f1 = cnn_feats[0] # Stage 1 CNN only ā used directly as decoder skip
# āā CFCA + XFF at stages 2ā5 āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
skips = []
U, V = cnn_feats[1], trans_feats[0] # Stage 2 inputs
for i, block in enumerate(self.enc_blocks):
U, V_curr = cnn_feats[i + 1], trans_feats[i]
U_fused, V_fused, X_skip = block(U, V_curr)
skips.append(X_skip)
# skips = [skip2, skip3, skip4, skip5]
# Decoder uses skip4, skip3, skip2 (reversed), then f1
decoder_skips = [skips[3], skips[2], skips[1]]
# Bottleneck: use CNN stage-5 feature (deepest) as decoder entry
f5 = cnn_feats[4] # (B, 512, H/32, W/32)
# āā Decode āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
logits = self.decoder(f5, decoder_skips, f1)
# Upsample to match input resolution if necessary
if logits.shape[-2:] != (H, W):
logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=True)
return logits
# āāā SECTION 9: Loss Function (DiceCE) āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class DiceLoss(nn.Module):
"""
Soft Dice loss for binary and multi-class segmentation.
For binary: applies sigmoid to logits.
For multi-class: applies softmax.
"""
def __init__(self, smooth: float = 1e-5, num_classes: int = 1):
super().__init__()
self.smooth = smooth
self.nc = num_classes
def forward(self, logits: Tensor, target: Tensor) -> Tensor:
"""
logits : (B, K, H, W) raw logits
target : (B, H, W) long for multi-class; (B, H, W) float {0,1} for binary
"""
if self.nc == 1:
pred = torch.sigmoid(logits).squeeze(1) # (B, H, W)
tgt = target.float()
else:
pred = F.softmax(logits, dim=1) # (B, K, H, W)
tgt = F.one_hot(target.long(), self.nc).permute(0, 3, 1, 2).float()
p = pred.reshape(pred.shape[0], -1)
g = tgt.reshape(tgt.shape[0], -1)
inter = (p * g).sum(dim=-1)
denom = p.sum(dim=-1) + g.sum(dim=-1)
return (1.0 - (2 * inter + self.smooth) / (denom + self.smooth)).mean()
class DiceCELoss(nn.Module):
"""
Combined Dice + Cross-Entropy loss (Eq. 12 in paper).
L_DiceCE = Ī» * L_CE + (1 - Ī») * L_Dice
Paper sets Ī» = 0.5 for equal weighting.
Ī» > 0.5: prioritizes global region consistency over pixel accuracy.
Ī» < 0.5: prioritizes pixel-level classification over global consistency.
"""
def __init__(self, lam: float = 0.5, num_classes: int = 1):
super().__init__()
self.lam = lam
self.dice = DiceLoss(num_classes=num_classes)
self.ce = (nn.BCEWithLogitsLoss() if num_classes == 1
else nn.CrossEntropyLoss())
self.nc = num_classes
def forward(self, logits: Tensor, target: Tensor) -> Tensor:
l_dice = self.dice(logits, target)
if self.nc == 1:
l_ce = self.ce(logits.squeeze(1), target.float())
else:
l_ce = self.ce(logits, target.long())
return self.lam * l_ce + (1.0 - self.lam) * l_dice
# āāā SECTION 10: Evaluation Metrics āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
def dice_score(pred_bin: np.ndarray, gt_bin: np.ndarray, eps: float = 1e-6) -> float:
"""Binary Dice coefficient for two binary arrays."""
inter = (pred_bin * gt_bin).sum()
denom = pred_bin.sum() + gt_bin.sum()
return float((2 * inter + eps) / (denom + eps))
def jaccard_score(pred_bin: np.ndarray, gt_bin: np.ndarray, eps: float = 1e-6) -> float:
"""Binary Jaccard (IoU) coefficient."""
inter = (pred_bin * gt_bin).sum()
union = (pred_bin + gt_bin - pred_bin * gt_bin).sum()
return float((inter + eps) / (union + eps))
def hausdorff_95(pred_bin: np.ndarray, gt_bin: np.ndarray) -> float:
"""
Hausdorff Distance at 95th percentile (HD95).
Requires scipy. Falls back to Hausdorff distance if scipy unavailable.
"""
try:
from scipy.ndimage import distance_transform_edt
if pred_bin.sum() == 0 or gt_bin.sum() == 0:
return np.inf
dt_pred = distance_transform_edt(1 - pred_bin)
dt_gt = distance_transform_edt(1 - gt_bin)
d1 = dt_pred[gt_bin.astype(bool)]
d2 = dt_gt[pred_bin.astype(bool)]
return float(np.percentile(np.concatenate([d1, d2]), 95))
except ImportError:
return 0.0 # fallback for smoke test without scipy
class SegmentationMetrics:
"""Accumulates Dice, Jaccard, and HD95 over a validation/test epoch."""
def __init__(self, num_classes: int = 1, threshold: float = 0.5):
self.nc = num_classes
self.thr = threshold
self.reset()
def reset(self):
self.dice_list, self.jacc_list, self.hd95_list = [], [], []
@torch.no_grad()
def update(self, logits: Tensor, target: Tensor):
if self.nc == 1:
pred = (torch.sigmoid(logits.squeeze(1)) > self.thr).cpu().numpy().astype(np.uint8)
gt = target.cpu().numpy().astype(np.uint8)
else:
pred = logits.argmax(dim=1).cpu().numpy().astype(np.uint8)
gt = target.cpu().numpy().astype(np.uint8)
for b in range(pred.shape[0]):
self.dice_list.append(dice_score(pred[b], gt[b]))
self.jacc_list.append(jaccard_score(pred[b], gt[b]))
self.hd95_list.append(hausdorff_95(pred[b], gt[b]))
def result(self) -> Dict[str, float]:
hd_vals = [h for h in self.hd95_list if np.isfinite(h)]
return {
"Dice" : np.mean(self.dice_list) * 100,
"Jaccard": np.mean(self.jacc_list) * 100,
"HD95" : np.mean(hd_vals) if hd_vals else 0.0,
}
# āāā SECTION 11: Dataset Helpers āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class MedSegDummyDataset(Dataset):
"""
Generic dummy dataset for smoke-testing all 8 benchmarks.
Replace with real loaders for:
BUSI : kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset
Dataset B : github.com/DIGITBRAIN-CRCV/Breast-Lesion-Ultrasound-Dataset
ISIC-2016 : challenge.isic-archive.com
PH2 : fc.up.pt/addi/ph2.html
Synapse : synapse.org/Synapse:syn3193805
Kvasir-SEG : datasets.simula.no/kvasir-seg
CVC-ClinicDB: polyp.grand-challenge.org
Brain-MRI : kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation
"""
def __init__(
self,
n: int = 16,
img_size: int = 224,
in_channels: int = 3,
num_classes: int = 1,
):
self.n = n
self.imgs = torch.randn(n, in_channels, img_size, img_size)
self.masks = (
torch.randint(0, 2, (n, img_size, img_size)).float()
if num_classes == 1
else torch.randint(0, num_classes, (n, img_size, img_size)).long()
)
def __len__(self): return self.n
def __getitem__(self, idx):
return self.imgs[idx], self.masks[idx]
# āāā SECTION 12: Training Loop with Poly LR + Warm-Up āāāāāāāāāāāāāāāāāāāāāāāāā
class PolyLRScheduler:
"""
Polynomial learning rate decay with linear warm-up.
Paper settings:
- 10 warm-up epochs (linear ramp from 0 to base_lr)
- 120 training epochs (poly decay: lr = base_lr Ć (1 - epoch/max_epochs)^power)
- base_lr = 0.0003, power = 0.9
"""
def __init__(
self,
optimizer: torch.optim.Optimizer,
base_lr: float,
max_epochs: int,
warmup_epochs: int = 10,
power: float = 0.9,
):
self.opt = optimizer
self.base_lr = base_lr
self.max_epochs = max_epochs
self.warmup_epochs = warmup_epochs
self.power = power
def step(self, epoch: int):
if epoch < self.warmup_epochs:
lr = self.base_lr * (epoch + 1) / self.warmup_epochs
else:
progress = (epoch - self.warmup_epochs) / (
self.max_epochs - self.warmup_epochs
)
lr = self.base_lr * (1.0 - progress) ** self.power
for g in self.opt.param_groups:
g["lr"] = lr
return lr
def train_one_epoch(
model: nn.Module,
loader: DataLoader,
optimizer: torch.optim.Optimizer,
criterion: DiceCELoss,
device: torch.device,
epoch: int,
) -> float:
model.train()
total_loss = 0.0
for step, (imgs, masks) in enumerate(loader):
imgs = imgs.to(device)
masks = masks.to(device)
logits = model(imgs)
loss = criterion(logits, masks)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
total_loss += loss.item()
if step % 5 == 0:
print(f" Epoch {epoch} | Step {step}/{len(loader)} | Loss {loss.item():.4f}")
return total_loss / len(loader)
@torch.no_grad()
def validate(
model: nn.Module,
loader: DataLoader,
criterion: DiceCELoss,
metrics: SegmentationMetrics,
device: torch.device,
) -> Tuple[float, Dict]:
model.eval()
metrics.reset()
total_loss = 0.0
for imgs, masks in loader:
imgs, masks = imgs.to(device), masks.to(device)
logits = model(imgs)
total_loss += criterion(logits, masks).item()
metrics.update(logits, masks)
return total_loss / len(loader), metrics.result()
def run_training(
dataset_name: str = "busi",
epochs: int = 3,
batch_size: int = 2,
base_lr: float = 3e-4,
device_str: str = "cpu",
):
"""
CFFormer training pipeline matching paper's settings (smoke-test scale).
Production settings (paper):
Epochs: 130 (10 warm-up + 120 training)
Batch size: 16
LR: 3e-4, Poly schedule (power=0.9)
Optimizer: AdamW (weight_decay=3e-5, betas=(0.9, 0.999))
Input: 224Ć224 (all datasets except Synapse)
GPU: NVIDIA A5000
"""
device = torch.device(device_str)
DATASET_CONFIGS = {
"busi" : (3, 1), # (in_channels, num_classes)
"isic" : (3, 1),
"kvasir" : (3, 1),
"cvc" : (3, 1),
"ph2" : (3, 1),
"brainmri" : (3, 1),
"synapse" : (1, 9), # CT (1-ch), 8 organs + background
"datasetb" : (3, 1),
}
in_ch, n_cls = DATASET_CONFIGS.get(dataset_name, (3, 1))
print(f"\n{'='*60}")
print(f" CFFormer Training ā {dataset_name.upper()}")
print(f" Device: {device} | Epochs: {epochs} | LR: {base_lr}")
print(f"{'='*60}\n")
train_ds = MedSegDummyDataset(n=8, in_channels=in_ch, num_classes=n_cls)
val_ds = MedSegDummyDataset(n=4, in_channels=in_ch, num_classes=n_cls)
train_ldr = DataLoader(train_ds, batch_size=batch_size, shuffle=True, num_workers=0)
val_ldr = DataLoader(val_ds, batch_size=batch_size, shuffle=False, num_workers=0)
cfg = CFFormerConfig(num_classes=n_cls)
model = CFFormer(cfg, in_channels=in_ch).to(device)
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total / 1e6:.2f} M\n")
optimizer = torch.optim.AdamW(
model.parameters(), lr=base_lr,
weight_decay=3e-5, betas=(0.9, 0.999)
)
scheduler = PolyLRScheduler(optimizer, base_lr, max_epochs=epochs, warmup_epochs=min(1, epochs-1))
criterion = DiceCELoss(lam=0.5, num_classes=n_cls)
metrics = SegmentationMetrics(num_classes=n_cls)
best_dice = 0.0
for epoch in range(1, epochs + 1):
lr = scheduler.step(epoch - 1)
train_loss = train_one_epoch(model, train_ldr, optimizer, criterion, device, epoch)
val_loss, res = validate(model, val_ldr, criterion, metrics, device)
dice = res["Dice"]
print(
f"Epoch {epoch:3d}/{epochs} | LR={lr:.2e} | "
f"Train {train_loss:.4f} | Val {val_loss:.4f} | "
f"Dice {dice:.2f}% | Jaccard {res['Jaccard']:.2f}%"
)
if dice > best_dice:
best_dice = dice
print(f" ā New best Dice: {best_dice:.2f}%")
print(f"\nTraining complete. Best Dice: {best_dice:.2f}%")
return model
# āāā SECTION 13: Smoke Test āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
if __name__ == "__main__":
print("="*60)
print("CFFormer ā Full Architecture Smoke Test")
print("="*60)
torch.manual_seed(42)
np.random.seed(42)
device = torch.device("cpu")
# āā 1. CFCA Module āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
print("\n[1/5] CFCA Module forward pass...")
cfca = CFCAModule(cc=64, ct=128)
U = torch.randn(2, 64, 56, 56)
V = torch.randn(2, 128, 56, 56)
with torch.no_grad():
U_f, V_f = cfca(U, V)
assert U_f.shape == U.shape, f"U_fused shape mismatch: {U_f.shape}"
assert V_f.shape == V.shape, f"V_fused shape mismatch: {V_f.shape}"
print(f" ā U_fused: {tuple(U_f.shape)} V_fused: {tuple(V_f.shape)}")
# āā 2. XFF Module āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
print("\n[2/5] XFF Module forward pass...")
xff = XFFModule(cc=64, ct=128, ck=64)
with torch.no_grad():
X_skip = xff(U_f, V_f)
assert X_skip.shape == (2, 64, 56, 56)
print(f" ā X_skip: {tuple(X_skip.shape)}")
# āā 3. Binary segmentation (224Ć224, single class) āāāāāāāāāāāāāāāāāāāāā
print("\n[3/5] Full CFFormer forward pass ā binary (224Ć224, 3ch)...")
cfg_bin = CFFormerConfig(num_classes=1)
model = CFFormer(cfg_bin, in_channels=3).to(device)
x_bin = torch.randn(2, 3, 224, 224)
with torch.no_grad():
out_bin = model(x_bin)
assert out_bin.shape == (2, 1, 224, 224), f"Shape: {out_bin.shape}"
print(f" ā logits: {tuple(out_bin.shape)} (batch=2, K=1, H=224, W=224)")
# āā 4. Multi-class segmentation (Synapse: 1ch CT, 9 classes) āāāāāāāāāāā
print("\n[4/5] Full CFFormer ā multi-class CT (224Ć224, 1ch, 9 classes)...")
cfg_ct = CFFormerConfig(num_classes=9)
model_ct = CFFormer(cfg_ct, in_channels=1).to(device)
x_ct = torch.randn(2, 1, 224, 224)
with torch.no_grad():
out_ct = model_ct(x_ct)
assert out_ct.shape == (2, 9, 224, 224)
print(f" ā logits: {tuple(out_ct.shape)} (batch=2, K=9, H=224, W=224)")
# āā 5. Loss function and training pipeline āāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
print("\n[5/5] Loss function + 2-epoch training pipeline...")
mask_bin = torch.randint(0, 2, (2, 224, 224)).float()
crit_bin = DiceCELoss(lam=0.5, num_classes=1)
loss_bin = crit_bin(out_bin, mask_bin)
print(f" Binary DiceCE loss: {loss_bin.item():.4f}")
mask_ct = torch.randint(0, 9, (2, 224, 224))
crit_ct = DiceCELoss(lam=0.5, num_classes=9)
loss_ct = crit_ct(out_ct, mask_ct)
print(f" Multi-class DiceCE loss: {loss_ct.item():.4f}")
run_training(dataset_name="busi", epochs=2, batch_size=2)
print("\n" + "="*60)
print("ā All checks passed. CFFormer is ready for use.")
print("="*60)
print("""
Next steps:
1. Replace CNNEncoder with pretrained ResNet34:
import torchvision.models as M
backbone = M.resnet34(pretrained=True)
# Extract f1-f5 from layer0, layer1, layer2, layer3, layer4
2. Replace TransformerEncoder with pretrained Swin Transformer V2:
from timm import create_model
swin = create_model('swinv2_base_window12_192_22k', pretrained=True)
3. Train with paper's full settings:
Epochs: 130 (10 warm-up + 120 training)
Batch: 16, AdamW (lr=3e-4, weight_decay=3e-5, betas=(0.9, 0.999))
Poly LR schedule (power=0.9), input 224Ć224
Aug: random crop 0.5, H-flip p=0.5, V-flip p=0.5, rotation ±15° p=0.6
Normalize: mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]
NOT applied to Synapse dataset
4. Datasets used in the paper:
BUSI : kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset
Dataset B : github.com/DIGITBRAIN-CRCV/Breast-Lesion-Ultrasound-Dataset
ISIC-2016 : challenge.isic-archive.com
PH2 : fc.up.pt/addi/ph2.html
Synapse : synapse.org/Synapse:syn3193805
Kvasir-SEG : datasets.simula.no/kvasir-seg
CVC-ClinicDB: polyp.grand-challenge.org
Brain-MRI : kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation
5. Official code:
https://github.com/JiaxuanFelix/CFFormer
""")
