CMFDNet: Revolutionizing Polyp Segmentation with Cross-Mamba and Feature Discovery

CMFDNet architecture for automated polyp segmentation using Cross-Mamba Decoder and Feature Discovery Module

Colorectal cancer (CRC) remains one of the most prevalent and deadly cancers worldwide, with early detection playing a pivotal role in reducing mortality. Among the key precursors to CRC are colonic polyps, which, if detected and removed early, can significantly lower the risk of cancer development. Colonoscopy is the gold standard for identifying these lesions, but manual inspection is prone to human error, fatigue, and variability in expertise—leading to missed or misdiagnosed polyps.

To address these challenges, researchers have turned to automated polyp segmentation using deep learning. In a groundbreaking new study, Feng Jiang, Zongfei Zhang, and Xin Xu introduce CMFDNet: Cross-Mamba and Feature Discovery Network, a novel deep learning architecture designed specifically to overcome the persistent hurdles in accurate polyp segmentation.

This article dives deep into the CMFDNet model, its innovative components, performance benchmarks, and implications for the future of medical imaging and AI-assisted diagnostics.


What Is CMFDNet and Why Does It Matter?

CMFDNet stands for Cross-Mamba and Feature Discovery Network, a state-of-the-art deep learning model tailored for polyp segmentation in colonoscopy images. The model directly addresses three major challenges in polyp detection:

  1. High variability in polyp shapes and sizes
  2. Blurry or indistinct boundaries between polyps and surrounding tissues
  3. Small polyps being easily overlooked during segmentation

Traditional convolutional neural networks (CNNs) and even recent transformer-based models often struggle with these issues due to limitations in capturing long-range dependencies and fine-grained spatial details.

CMFDNet introduces a hybrid architecture that combines the strengths of State Space Models (SSMs)—specifically the Mamba framework—with a custom-designed decoder and multi-scale processing units to achieve superior segmentation accuracy.


The Core Challenges in Polyp Segmentation

Before exploring how CMFDNet works, it’s essential to understand why polyp segmentation is so difficult:

  • Morphological Diversity: Polyps can appear flat, sessile, or pedunculated, with irregular shapes and textures.
  • Scale Variability: Some polyps are large and easily visible, while others are tiny (less than 5mm), making them hard to detect.
  • Boundary Ambiguity: Due to similar color and texture between polyps and mucosal tissue, boundaries are often poorly defined.
  • Image Noise and Artifacts: Reflections, bubbles, and motion blur in endoscopic videos further complicate segmentation.

Existing methods like PraNet, VM-UNet, and Swin-UMamba have made progress, but gaps remain—especially in generalizing across diverse datasets and detecting small lesions.


How CMFDNet Works: A Deep Dive

CMFDNet is built on a U-shaped encoder-decoder architecture, where the encoder uses VMamba-Tiny—a pre-trained Vision Mamba model—as the backbone for feature extraction. However, what sets CMFDNet apart are its three innovative modules:

  1. CMD Module (Cross-Mamba Decoder)
  2. MSA Module (Multi-Scale Aware)
  3. FD Module (Feature Discovery)

Let’s explore each in detail.


1. CMD Module: Cross-Mamba Decoder for Sharp Boundaries

The CMD (Cross-Mamba Decoder) Module is the heart of CMFDNet’s decoding stage. It leverages cross-scanning within a Selective State Space Model (SSM) to fuse deep semantic features with shallow local features, effectively reducing blurry boundaries.

Key Innovation: Cross-Scanning with Diagonal Scans

Unlike standard Mamba models that scan features row-wise or column-wise, CMD introduces four diagonal scanning patterns (as shown in Fig. 4 of the paper) to capture spatial dependencies in multiple directions. This allows the model to better understand context and preserve edge details.
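
To make the idea of a diagonal scan concrete, the snippet below flattens a feature map along its anti-diagonals, which is one traversal order such a scan could consume. It is purely illustrative (the helper name diagonal_scan_order is ours, not from the paper), and the real SS2D layer runs a selective state-space model over the reordered sequence rather than a simple gather:

import torch

def diagonal_scan_order(h: int, w: int) -> torch.Tensor:
    """Flat indices that traverse an h x w grid anti-diagonal by anti-diagonal."""
    order = []
    for s in range(h + w - 1):                              # each anti-diagonal has a constant row+col sum
        for r in range(max(0, s - w + 1), min(h, s + 1)):
            order.append(r * w + (s - r))
    return torch.tensor(order)

x = torch.arange(16.0).view(1, 1, 4, 4)    # toy (B, C, H, W) feature map
idx = diagonal_scan_order(4, 4)
seq = x.flatten(2)[:, :, idx]              # (B, C, H*W) sequence in diagonal order
print(seq[0, 0])                           # tensor([ 0.,  1.,  4.,  2.,  5.,  8., ...])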

Here’s how it works:

  • Features from the deeper decoder layer (CMD_{i+1}) are upsampled, giving CMD_{i+1}^u.
  • A row-wise pixel exchange between the upsampled features and the features from the MSA module (MSA_i) produces S_i^r and D_i^r.
  • Similarly, a column-wise exchange yields S_i^c and D_i^c.
  • The exchanged features are processed by the VSS Scan, which applies the 2D Selective Scan (SS2D) using diagonal scanning.

The mathematical formulation is:

\[ S_i^r, \, D_i^r = \text{RowExchange}\big(\text{MSA}_i, \, \text{CMD}_{i+1}^u\big) \tag{1} \] \[ S_i^c, \, D_i^c = \text{ColumnExchange}\big(\text{MSA}_i, \, \text{CMD}_{i+1}^u\big) \tag{2} \] \[ R_{i1} = \text{VSS\_Scan}(S_i^r); \quad R_{i2} = \text{VSS\_Scan}(D_i^r) \tag{3} \] \[ C_{i1} = \text{VSS\_Scan}(S_i^c); \quad C_{i2} = \text{VSS\_Scan}(D_i^c) \tag{4} \]
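
The paper describes these exchanges at the pixel level but does not spell out the exact pattern; one plausible reading of RowExchange in Eq. (1) is an alternating swap of rows between the two feature streams (ColumnExchange in Eq. (2) would do the same along the width axis). The helper below is our own illustrative sketch, not the authors' code:

import torch

def row_exchange(msa_i: torch.Tensor, cmd_up: torch.Tensor):
    """Swap every other row between two (B, C, H, W) feature maps.
    Illustrative reading of Eq. (1); not the official implementation."""
    s_r, d_r = msa_i.clone(), cmd_up.clone()
    s_r[:, :, 1::2, :] = cmd_up[:, :, 1::2, :]   # odd rows taken from the decoder stream
    d_r[:, :, 1::2, :] = msa_i[:, :, 1::2, :]    # ...and vice versa
    return s_r, d_r

msa_i = torch.zeros(1, 1, 4, 4)   # toy MSA_i features
cmd_up = torch.ones(1, 1, 4, 4)   # toy upsampled CMD_{i+1} features
s_r, d_r = row_exchange(msa_i, cmd_up)
print(s_r[0, 0, :, 0])            # tensor([0., 1., 0., 1.])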

After scanning, the outputs are fused using pointwise convolutions (PWC) and enhanced with a Global Attention Block (GAB):

\[ B_{i1} = \text{GAB}\big(\text{PWC}(R_{i1} + R_{i2})\big) \tag{5} \] \[ B_{i2} = \text{GAB}\big(\text{PWC}(C_{i1} + C_{i2})\big) \tag{6} \] \[ \text{CMD}_{i} = \text{PWC}\big(\text{DWC}_{3 \times 3}(\text{PWC}(B_{i1} + B_{i2}))\big) \tag{7} \]

Where:

  • PWC: Pointwise (1×1) Convolution
  • DWC_{k×k}: Depthwise Convolution with a k×k kernel
  • GAB: Global Attention Block

This fusion strategy enables progressive recovery of fine-grained features, leading to sharper and more accurate polyp boundaries.


2. MSA Module: Multi-Scale Awareness for Diverse Polyp Shapes

The MSA (Multi-Scale Aware) Module is designed to enhance the model’s ability to detect polyps of varying sizes and geometries. It uses a three-branch parallel structure with different kernel sizes: 3×3, 5×5, and 7×7.

Each branch performs depthwise convolution on enhanced features from a Global Attention Block (GAB), allowing the network to capture multi-scale contextual information.

The process is summarized as:

\[ O_i = \text{PWC}\big(\text{GAB}(S_i)\big) \tag{10} \] \[ O_{1i} = \text{DWC}_{3\times 3}(O_i) \tag{11} \] \[ O_{2i} = \text{DWC}_{5\times 5}(O_i) \tag{12} \] \[ O_{3i} = \text{DWC}_{7\times 7}(O_i) \tag{13} \] \[ MSA_i = \text{PWC}\Big(\text{Channel Shuffle}(O_{1i} + O_{2i} + O_{3i})\Big) \tag{14} \]

A channel shuffle operation is applied to promote cross-channel information flow, followed by a final pointwise convolution to reduce dimensionality.

This inverted bottleneck structure ensures computational efficiency while maximizing feature diversity—critical for handling polyps that range from small dots to large, irregular masses.


3. FD Module: Feature Discovery to Prevent Small Polyp Loss

One of the biggest pitfalls in medical segmentation is the under-detection of small polyps. To combat this, CMFDNet introduces the Feature Discovery (FD) Module, which aggregates and fuses features from all decoder stages.

The FD module establishes cross-stage dependencies through element-wise multiplication and addition, ensuring that fine details from earlier layers are not lost during upsampling.

The fusion process is defined as:

\[ G_{23} = \text{DEC}_{4 \times 4}(\text{CMD}_3) \times \text{CMD}_2 + \text{CMD}_2 \tag{15} \] \[ \text{output} = \text{DEC}_{4 \times 4}(G_{23}) \times \text{CMD}_1 + \text{CMD}_1 \tag{16} \]

Where DEC_{4×4} is a transposed convolution with a 4×4 kernel, stride 2, and padding of 1, which doubles the spatial resolution.

This recursive fusion mechanism acts like a memory bank, preserving small-scale features that might otherwise be diluted in deeper layers.
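
As a toy illustration (not taken from the paper), the multiply-then-add pattern of Eqs. (15)–(16) behaves like a gated residual connection: wherever the upsampled deeper map is close to zero, the shallower CMD features pass through unchanged, and wherever it is large they are amplified, so small activations are retained rather than overwritten:

import torch

deep = torch.tensor([[0.0, 1.0], [0.0, 2.0]])      # toy upsampled deeper features
shallow = torch.tensor([[0.3, 0.3], [0.5, 0.5]])   # toy shallower CMD features
fused = deep * shallow + shallow                   # same multiply-then-add pattern as Eqs. (15)/(16)
print(fused)   # tensor([[0.3000, 0.6000], [0.5000, 1.5000]])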


Global Attention Block (GAB): Better Than CBAM?

A subtle but crucial innovation in CMFDNet is the Global Attention Block (GAB), used in both CMD and MSA modules.

Unlike CBAM (Convolutional Block Attention Module), which applies channel and spatial attention sequentially, GAB uses a parallel structure:

  • Channel attention and spatial attention are computed independently.
  • Their outputs are expanded and fused using a learnable parameter λ :
\[ W_{cs} = (1-\lambda)\times W_{c}' + \lambda \times W_{s}' \tag{8} \] \[ GAB_{\text{out}} = W_{cs} \times M + M \tag{9} \]

Where λ∈(0,1) is initialized to 0.5 and learned during training.

This weighted fusion prevents one attention mechanism from suppressing the other, leading to more balanced feature enhancement and reduced noise amplification—especially important in noisy endoscopic images.


Performance: How CMFDNet Stands Out

CMFDNet was evaluated on five benchmark datasets:

  • Kvasir
  • ClinicDB
  • ETIS
  • ColonDB
  • EndoScene

It was compared against six state-of-the-art models:

  • PraNet
  • FCBFormer
  • EC TransNet
  • Swin-UMamba
  • VM-UNet
  • VM-UNetV2

Quantitative Results (mDice Scores)

| Dataset   | Best SOTA         | CMFDNet (Ours) | Improvement |
|-----------|-------------------|----------------|-------------|
| ETIS      | 80.02 (VM-UNet)   | 81.85          | +1.83%      |
| ColonDB   | 81.50 (VM-UNet)   | 83.05          | +1.55%      |
| EndoScene | 89.95 (VM-UNet)   | 90.62          | +0.67%      |
| Kvasir    | 91.57 (FCBFormer) | 91.74          | +0.17%      |
| ClinicDB  | 93.10 (FCBFormer) | 93.36          | +0.16%      |

Key Insight: CMFDNet shows its largest gains on the unseen test sets (ETIS, ColonDB, EndoScene), indicating superior generalization ability.

In addition to mDice, CMFDNet outperformed competitors in:

  • mIoU (mean Intersection over Union)
  • F_β^w (weighted F-measure)
  • S_α (structure measure)
  • E_ξ (E-measure)
  • MAE (mean absolute error)

For example, on ETIS, CMFDNet achieved:

  • mDice: 81.85%
  • MAE: 1.35 (lowest among all models)
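
For reference, these per-image metrics are typically computed along the following lines; this is a generic sketch of the standard definitions, and the authors' exact evaluation protocol (thresholding, averaging) may differ:

import numpy as np

def dice_iou_mae(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """pred: predicted probability map in [0, 1]; gt: binary ground-truth mask."""
    mae = np.abs(pred - gt).mean()               # mean absolute error on the soft map
    p = (pred > 0.5).astype(np.float64)          # threshold before computing overlap metrics
    inter = (p * gt).sum()
    dice = (2 * inter + eps) / (p.sum() + gt.sum() + eps)
    iou = (inter + eps) / (p.sum() + gt.sum() - inter + eps)
    return dice, iou, mae

pred = np.random.rand(224, 224)                            # dummy prediction
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)   # dummy ground truth
print(dice_iou_mae(pred, gt))

The dataset-level mDice and mIoU reported above are the averages of these per-image scores over each test set.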

Qualitative Results: Visual Superiority

As shown in Figure 6 of the paper, CMFDNet produces sharper, more accurate masks, especially in challenging cases:

  • s1–s3: Better delineation of polyps with complex shapes and small sizes.
  • s4: Clearer separation of polyps from surrounding tissue despite blurry edges.

Other models often miss small polyps or produce fragmented masks, while CMFDNet maintains structural integrity and boundary precision.


Ablation Study: Proving Module Effectiveness

An ablation study on ColonDB and ETIS confirmed the contribution of each module:

| Configuration          | ColonDB (mDice) | ETIS (mDice) |
|------------------------|-----------------|--------------|
| No CMD Module          | 77.12           | 76.23        |
| No MSA Module          | 81.94           | 80.68        |
| No FD Module           | 82.50           | 81.05        |
| GAB replaced with CBAM | 82.46           | 81.21        |
| Full CMFDNet (GAB)     | 83.05           | 81.85        |

  • Removing CMD caused the largest drop (–5.93% on ColonDB), confirming its role as the core decoder.
  • Replacing GAB with CBAM reduced accuracy, validating the superiority of parallel attention.

Why CMFDNet Is a Game-Changer

  1. First to integrate Mamba with cross-scanning for medical segmentation.
  2. Explicitly addresses small polyp omission via FD module.
  3. Superior generalization across diverse clinical environments.
  4. Balanced attention mechanism (GAB) reduces noise sensitivity.
  5. Outperforms 6 SOTA models on key benchmarks.

Limitations and Future Work

Despite its success, the authors acknowledge:

  • No dedicated noise filtering module, which could improve robustness.
  • Computational cost of Mamba-based models may limit real-time deployment.
  • Need for larger, more diverse datasets to further validate generalization.

Future work may include:

  • Integrating denoising autoencoders or adversarial training.
  • Exploring lightweight versions for mobile or edge devices.
  • Extending CMFDNet to video-based polyp tracking.

Conclusion: The Future of AI in Colonoscopy

CMFDNet represents a significant leap forward in automated polyp segmentation. By combining the long-range modeling power of Mamba with innovative fusion and attention mechanisms, it achieves unprecedented accuracy, especially for small and boundary-ambiguous polyps.

For clinicians, this means:

  • Fewer missed polyps
  • Reduced false positives
  • Faster, more reliable diagnoses
  • Improved patient outcomes

As AI continues to transform healthcare, models like CMFDNet will play a crucial role in making early cancer detection more accessible, accurate, and efficient.


Call to Action: Stay Ahead in Medical AI

Are you a researcher, developer, or clinician working in medical imaging?
👉 Download the CMFDNet code and model weights from the authors’ GitHub (check arXiv for links).
👉 Test it on your dataset and see how it compares to your current pipeline.

👉 Read the complete paper here.
👉 Subscribe to our newsletter at www.medical-ai-insights.com for the latest breakthroughs in AI for healthcare.

Let’s work together to eliminate preventable cancers—one polyp at a time.

The remainder of this post is an unofficial, end-to-end PyTorch sketch of CMFDNet assembled from the paper’s description. Note that the VMamba-Tiny backbone and the diagonal SS2D scans are replaced with simplified placeholders (a small CNN encoder and convolutional stand-ins), so the code illustrates the overall architecture, training loop, and inference pipeline rather than reproducing the authors’ exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
import os
from PIL import Image
import numpy as np
from tqdm import tqdm
import argparse

# --- Model Architecture ---

# --- Global Attention Block (GAB) ---
class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ratio=16):
        super(ChannelAttention, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        self.fc = nn.Sequential(nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False),
                               nn.ReLU(),
                               nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = self.fc(self.avg_pool(x))
        max_out = self.fc(self.max_pool(x))
        out = avg_out + max_out
        return self.sigmoid(out)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        assert kernel_size in (3, 7), 'kernel size must be 3 or 7'
        padding = 3 if kernel_size == 7 else 1
        self.conv1 = nn.Conv2d(2, 1, kernel_size, padding=padding, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        x = torch.cat([avg_out, max_out], dim=1)
        x = self.conv1(x)
        return self.sigmoid(x)

class GAB(nn.Module):
    def __init__(self, in_planes, ratio=16, kernel_size=7):
        super(GAB, self).__init__()
        self.ca = ChannelAttention(in_planes, ratio)
        self.sa = SpatialAttention(kernel_size)
        # Learnable fusion weight; a sigmoid of the zero-initialised parameter
        # gives an effective lambda of 0.5 at the start of training (Eq. 8).
        self.lambda_ = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        ca_weight = self.ca(x)
        sa_weight = self.sa(x)
        
        lambda_sig = torch.sigmoid(self.lambda_)
        
        W_c_prime = ca_weight.expand_as(x)
        W_s_prime = sa_weight.expand_as(x)

        W_cs = (1 - lambda_sig) * W_c_prime + lambda_sig * W_s_prime
        
        out = W_cs * x + x
        return out

# --- Multi-Scale Aware Module (MSA) ---
class MSA(nn.Module):
    def __init__(self, in_channels):
        super(MSA, self).__init__()
        self.gab = GAB(in_channels)
        self.pwc1 = nn.Conv2d(in_channels, in_channels * 2, kernel_size=1, stride=1, padding=0, bias=False)
        
        self.dwc3 = nn.Conv2d(in_channels * 2, in_channels * 2, kernel_size=3, stride=1, padding=1, groups=in_channels * 2, bias=False)
        self.dwc5 = nn.Conv2d(in_channels * 2, in_channels * 2, kernel_size=5, stride=1, padding=2, groups=in_channels * 2, bias=False)
        self.dwc7 = nn.Conv2d(in_channels * 2, in_channels * 2, kernel_size=7, stride=1, padding=3, groups=in_channels * 2, bias=False)
        
        self.pwc2 = nn.Conv2d(in_channels * 2, in_channels, kernel_size=1, stride=1, padding=0, bias=False)
    
    def channel_shuffle(self, x, groups):
        batchsize, num_channels, height, width = x.data.size()
        channels_per_group = num_channels // groups
        x = x.view(batchsize, groups, channels_per_group, height, width)
        x = torch.transpose(x, 1, 2).contiguous()
        x = x.view(batchsize, -1, height, width)
        return x

    def forward(self, x):
        s = x
        o = self.pwc1(self.gab(s))
        
        o1 = self.dwc3(o)
        o2 = self.dwc5(o)
        o3 = self.dwc7(o)
        
        o_sum = o1 + o2 + o3
        
        shuffled = self.channel_shuffle(o_sum, 2)
        
        msa_out = self.pwc2(shuffled)
        return msa_out

# --- Feature Discovery Module (FD) ---
class FD(nn.Module):
    def __init__(self, in_channels_list):
        super(FD, self).__init__()
        self.dec3_2 = nn.ConvTranspose2d(in_channels_list[2], in_channels_list[1], kernel_size=4, stride=2, padding=1)
        self.dec23_1 = nn.ConvTranspose2d(in_channels_list[1], in_channels_list[0], kernel_size=4, stride=2, padding=1)
        self.final_conv = nn.Conv2d(in_channels_list[0], 1, kernel_size=1)

    def forward(self, cmd1, cmd2, cmd3):
        cmd3_upsampled = self.dec3_2(cmd3)
        g23 = cmd3_upsampled * cmd2 + cmd2
        
        g23_upsampled = self.dec23_1(g23)
        output = g23_upsampled * cmd1 + cmd1

        return self.final_conv(output)

# --- Cross-Mamba Decoder (CMD) ---
class VSS_Scan_Placeholder(nn.Module):
    """ Placeholder for the VSS Scan (diagonal SS2D) module described in the paper,
    simplified here to a 3x3 convolution followed by batch normalization. """
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.norm = nn.BatchNorm2d(in_channels)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, C, H, W)
        return self.norm(self.conv(x))

class CMD(nn.Module):
    def __init__(self, msa_channels, cmd_channels):
        super(CMD, self).__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

        self.vss_scan1 = VSS_Scan_Placeholder(msa_channels)
        self.vss_scan2 = VSS_Scan_Placeholder(msa_channels)
        self.vss_scan3 = VSS_Scan_Placeholder(msa_channels)
        self.vss_scan4 = VSS_Scan_Placeholder(msa_channels)

        self.pwc1 = nn.Conv2d(msa_channels, msa_channels, 1)
        self.pwc2 = nn.Conv2d(msa_channels, msa_channels, 1)
        
        self.gab1 = GAB(msa_channels)
        self.gab2 = GAB(msa_channels)

        self.final_convs = nn.Sequential(
            nn.Conv2d(msa_channels, msa_channels, 1, bias=False),
            nn.Conv2d(msa_channels, msa_channels, 3, padding=1, groups=msa_channels, bias=False),
            nn.Conv2d(msa_channels, msa_channels, 1, bias=False)
        )
        
        if cmd_channels != msa_channels:
             self.align_channels = nn.Conv2d(cmd_channels, msa_channels, 1)
        else:
            self.align_channels = nn.Identity()

    def row_exchange(self, msa_feat, cmd_feat):
        # Placeholder for the row-wise pixel exchange of Eq. (1); the feature maps
        # are passed through unchanged here (see the illustrative sketch earlier in this post).
        return msa_feat, cmd_feat

    def col_exchange(self, msa_feat, cmd_feat):
        # Placeholder for the column-wise pixel exchange of Eq. (2).
        return msa_feat, cmd_feat

    def forward(self, msa_i, cmd_i_plus_1):
        cmd_i_plus_1_u = self.upsample(cmd_i_plus_1)
        cmd_i_plus_1_u = self.align_channels(cmd_i_plus_1_u)

        s_r, d_r = self.row_exchange(msa_i, cmd_i_plus_1_u)
        s_c, d_c = self.col_exchange(msa_i, cmd_i_plus_1_u)

        r1 = self.vss_scan1(s_r)
        r2 = self.vss_scan2(d_r)
        c1 = self.vss_scan3(s_c)
        c2 = self.vss_scan4(d_c)

        b1 = self.gab1(self.pwc1(r1 + r2))
        b2 = self.gab2(self.pwc2(c1 + c2))
        
        cmd_out = self.final_convs(b1 + b2)
        return cmd_out

# --- Main CMFDNet Model ---
class CMFDNet(nn.Module):
    def __init__(self, image_size=224, in_chans=3, num_classes=1):
        super().__init__()
        # Simplified CNN backbone (Placeholder for VMamba-Tiny)
        self.encoder_stage1 = nn.Sequential(
            nn.Conv2d(in_chans, 96, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(96, 96, kernel_size=3, stride=2, padding=1))
        self.encoder_stage2 = nn.Sequential(nn.Conv2d(96, 192, kernel_size=3, stride=2, padding=1))
        self.encoder_stage3 = nn.Sequential(nn.Conv2d(192, 384, kernel_size=3, stride=2, padding=1))
        self.encoder_stage4 = nn.Sequential(nn.Conv2d(384, 768, kernel_size=3, stride=2, padding=1))

        self.msa1 = MSA(96)
        self.msa2 = MSA(192)
        self.msa3 = MSA(384)

        self.cmd3 = CMD(msa_channels=384, cmd_channels=768)
        self.cmd2 = CMD(msa_channels=192, cmd_channels=384)
        self.cmd1 = CMD(msa_channels=96, cmd_channels=192)

        self.fd = FD(in_channels_list=[96, 192, 384])
        
        self.out_conv3 = nn.Conv2d(384, 1, 1)
        self.out_conv2 = nn.Conv2d(192, 1, 1)
        self.out_conv1 = nn.Conv2d(96, 1, 1)
    
    def forward(self, x):
        s1 = self.encoder_stage1(x)
        s2 = self.encoder_stage2(s1)
        s3 = self.encoder_stage3(s2)
        s4 = self.encoder_stage4(s3)
        
        msa1_out = self.msa1(s1)
        msa2_out = self.msa2(s2)
        msa3_out = self.msa3(s3)

        cmd3_out = self.cmd3(msa3_out, s4)
        cmd2_out = self.cmd2(msa2_out, cmd3_out)
        cmd1_out = self.cmd1(msa1_out, cmd2_out)

        final_output = self.fd(cmd1_out, cmd2_out, cmd3_out)
        final_output = F.interpolate(final_output, scale_factor=4, mode='bilinear', align_corners=True)
        
        if self.training:
            out3 = F.interpolate(self.out_conv3(cmd3_out), scale_factor=16, mode='bilinear', align_corners=True)
            out2 = F.interpolate(self.out_conv2(cmd2_out), scale_factor=8, mode='bilinear', align_corners=True)
            out1 = F.interpolate(self.out_conv1(cmd1_out), scale_factor=4, mode='bilinear', align_corners=True)
            return torch.sigmoid(final_output), torch.sigmoid(out1), torch.sigmoid(out2), torch.sigmoid(out3)
        else:
            return torch.sigmoid(final_output)

# --- Data Handling and Training ---

# --- Dataset Class ---
class PolypDataset(Dataset):
    def __init__(self, image_dir, mask_dir, transform=None):
        self.image_dir = image_dir
        self.mask_dir = mask_dir
        self.transform = transform
        self.images = os.listdir(image_dir)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        img_path = os.path.join(self.image_dir, self.images[index])
        mask_path = os.path.join(self.mask_dir, self.images[index].replace('.jpg', '.png')) # Adjust extension if needed
        
        image = np.array(Image.open(img_path).convert("RGB"))
        mask = np.array(Image.open(mask_path).convert("L"), dtype=np.float32)
        mask[mask == 255.0] = 1.0

        if self.transform:
            image = self.transform(image)

        # Convert the mask to a tensor and resize with nearest-neighbour
        # interpolation so that it stays binary
        mask = torch.from_numpy(mask).unsqueeze(0)
        mask = transforms.functional.resize(
            mask, (IMAGE_HEIGHT, IMAGE_WIDTH),
            interpolation=transforms.InterpolationMode.NEAREST)

        return image, mask

# --- Loss Function ---
class DiceBCELoss(nn.Module):
    def __init__(self, weight=0.5, size_average=True):
        super(DiceBCELoss, self).__init__()
        self.weight = weight

    def forward(self, inputs, targets, smooth=1):
        # Sigmoid is applied in the model, so we don't apply it here again.
        inputs = inputs.view(-1)
        targets = targets.view(-1)
        
        intersection = (inputs * targets).sum()                            
        dice_loss = 1 - (2.*intersection + smooth)/(inputs.sum() + targets.sum() + smooth)  
        BCE = F.binary_cross_entropy(inputs, targets, reduction='mean')
        Dice_BCE = BCE * self.weight + dice_loss * (1 - self.weight)
        
        return Dice_BCE

# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LEARNING_RATE = 1e-4
BATCH_SIZE = 8
NUM_EPOCHS = 150
IMAGE_HEIGHT = 224
IMAGE_WIDTH = 224
TRAIN_IMG_DIR = "data/train/images/"
TRAIN_MASK_DIR = "data/train/masks/"
CHECKPOINT_PATH = "cmfdnet_checkpoint.pth"

# --- Training Function ---
def train():
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((IMAGE_HEIGHT, IMAGE_WIDTH)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
    # Create dummy data directories for demonstration if they don't exist
    if not os.path.exists(TRAIN_IMG_DIR) or not os.listdir(TRAIN_IMG_DIR):
        print("Creating dummy training data...")
        os.makedirs(TRAIN_IMG_DIR, exist_ok=True)
        os.makedirs(TRAIN_MASK_DIR, exist_ok=True)
        Image.new('RGB', (300, 300)).save(os.path.join(TRAIN_IMG_DIR, 'dummy.jpg'))
        Image.new('L', (300, 300)).save(os.path.join(TRAIN_MASK_DIR, 'dummy.png'))

    train_ds = PolypDataset(image_dir=TRAIN_IMG_DIR, mask_dir=TRAIN_MASK_DIR, transform=transform)
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)

    model = CMFDNet().to(DEVICE)
    loss_fn = DiceBCELoss()
    optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)
    scaler = torch.cuda.amp.GradScaler()

    for epoch in range(NUM_EPOCHS):
        loop = tqdm(train_loader, desc=f"Epoch [{epoch+1}/{NUM_EPOCHS}]", leave=True)
        for batch_idx, (data, targets) in enumerate(loop):
            data = data.to(device=DEVICE)
            targets = targets.float().to(device=DEVICE)

            with torch.cuda.amp.autocast():
                preds, pred1, pred2, pred3 = model(data)

            # Compute the deeply supervised loss outside of autocast, because
            # F.binary_cross_entropy is unsafe under autocast; cast to float32 first.
            loss_main = loss_fn(preds.float(), targets)
            loss1 = loss_fn(pred1.float(), targets)
            loss2 = loss_fn(pred2.float(), targets)
            loss3 = loss_fn(pred3.float(), targets)
            loss = loss_main + loss1 + loss2 + loss3

            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            loop.set_postfix(loss=loss.item())
        
        if (epoch + 1) % 50 == 0:
            torch.save(model.state_dict(), f"cmfdnet_epoch_{epoch+1}.pth")
    
    torch.save(model.state_dict(), CHECKPOINT_PATH)
    print(f"Final model saved to {CHECKPOINT_PATH}")

# --- Prediction Function ---
def predict(model_path, input_image_path):
    OUTPUT_MASK_PATH = "output_mask.png"
    model = CMFDNet().to(DEVICE)
    
    try:
        model.load_state_dict(torch.load(model_path, map_location=DEVICE))
    except FileNotFoundError:
        print(f"Error: Model file not found at {model_path}")
        print("Please train a model first by running with '--mode train'")
        return
        
    model.eval()

    transform = transforms.Compose([
        transforms.Resize((IMAGE_HEIGHT, IMAGE_WIDTH)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    try:
        image = Image.open(input_image_path).convert("RGB")
    except FileNotFoundError:
        print(f"Error: Input image not found at {input_image_path}")
        img_dir = os.path.dirname(input_image_path)
        if img_dir and not os.path.exists(img_dir):
             os.makedirs(img_dir, exist_ok=True)
        print("Creating a dummy image for prediction.")
        Image.new('RGB', (300, 300)).save(input_image_path)
        image = Image.open(input_image_path).convert("RGB")

    input_tensor = transform(image).unsqueeze(0).to(DEVICE)

    with torch.no_grad():
        output = model(input_tensor)
    
    pred_mask = output.squeeze(0).squeeze(0).cpu().numpy()
    pred_mask = (pred_mask > 0.5).astype(np.uint8) * 255

    mask_image = Image.fromarray(pred_mask)
    mask_image.save(OUTPUT_MASK_PATH)
    print(f"Predicted mask saved to {OUTPUT_MASK_PATH}")

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="CMFDNet Training and Prediction Script")
    parser.add_argument('--mode', type=str, default='train', choices=['train', 'predict'], 
                        help='Set to "train" to train the model, or "predict" to generate a mask.')
    parser.add_argument('--model_path', type=str, default=CHECKPOINT_PATH, 
                        help='Path to the model checkpoint file for prediction.')
    parser.add_argument('--image_path', type=str, default='data/test_image.png', 
                        help='Path to the input image for prediction.')
    args = parser.parse_args()

    if args.mode == 'train':
        train()
    elif args.mode == 'predict':
        predict(model_path=args.model_path, input_image_path=args.image_path)
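
Assuming the script above is saved as, say, cmfdnet.py (the filename is arbitrary), training can be launched with "python cmfdnet.py --mode train", which reads images and masks from the data/train/ folders configured above, and a single-image prediction can be run with "python cmfdnet.py --mode predict --model_path cmfdnet_checkpoint.pth --image_path data/test_image.png", matching the argument parser defined at the bottom of the script.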

