CardioMorphNet: Shape-Guided Bayesian Recurrent Deep Learning for 3D Cardiac Motion Estimation | AI Trend Blend

Medical AI · Medical Image Analysis 113 (2026) 104149 · 18 min read

CardioMorphNet Taught an AI to Track Your Heartbeat Without Ever Looking at Raw Pixels

Researchers at the University of Glasgow and the University of Manchester built a shape-guided Bayesian recurrent framework that estimates 3D cardiac motion from cine CMR images without relying on intensity-based image registration — and it outperforms every major method on two large public benchmarks.

CardioMorphNet Cardiac Motion Estimation Bayesian Deep Learning Deformable Registration Cine CMR Recurrent VAE Shape-Guided UK Biobank DVF Uncertainty Medical Image Analysis 2026

CardioMorphNet architecture diagram showing recurrent variational autoencoder, DeformNet, and SegNet components for 3D shape-guided cardiac motion estimation from short-axis CMR images

Every time your heart beats, a cascade of mechanical events unfolds across its four chambers. The left ventricle contracts, the myocardium thickens, and the entire structure rotates slightly along its long axis. Cardiologists have known for decades that subtle deviations in this motion pattern are early warning signs of disease — sometimes years before a patient feels anything. The challenge is measuring it with enough precision and speed to be useful at scale. A new paper from the University of Glasgow does something quietly ambitious: it teaches a deep learning system to track that motion without ever directly comparing the appearance of two frames.

The Problem Nobody Wanted to Talk About

Cine cardiac magnetic resonance imaging captures the heart in motion. A typical acquisition produces a stack of short-axis (SAX) volumes covering the full cardiac cycle — from end-diastole when the heart is most relaxed to end-systole when it reaches peak contraction. The standard computational approach to extracting motion from these sequences is image registration: align one frame to the next by finding a spatial transformation that makes them look as similar as possible.

It sounds reasonable. The problem runs deeper than it first appears.

Intensity-based registration metrics measure how closely two images match in terms of pixel values. But cardiac images are not just of the heart. They contain chest wall, lungs, liver, and spine — all of which move too, albeit for entirely different reasons. When you ask a registration network to minimize the difference between two frames globally, it will happily find transformations that make the background look good at the expense of getting the cardiac region exactly right. The motion you recover is contaminated by irrelevant deformations. This is not a failure of any particular method — it is a structural limitation of the objective itself.

Reza Akbari Movahed, Abuzar Rezaee, Arezoo Zakeri, Colin Berry, Edmond Ho, and Ali Gooya at Glasgow and Manchester decided to sidestep that objective entirely. Their framework, CardioMorphNet, does not use any intensity-based image registration similarity loss during training. Instead it supervises the registration branch purely through the lens of cardiac shape consistency — asking whether the estimated motion correctly warps one segmentation map onto another. The pixel values are there to help extract features, but they are never directly scored against each other.

Core Insight

Traditional cardiac registration methods fail because they optimize for overall image similarity rather than cardiac-specific motion. CardioMorphNet circumvents this by supervising deformation field estimation through shape mask alignment rather than pixel-level intensity matching — a structural design choice that forces the network to focus on anatomically meaningful regions.

Why the Heart Is Especially Hard to Track

There is a reason cardiac motion estimation has remained an active research area despite decades of work. The heart moves in three dimensions, but most clinical acquisitions are 2D slices stacked together with limited through-plane resolution. The motion is fast — the systolic phase from end-diastole to peak contraction takes roughly a third of a second — and it is nonlinear. The myocardium does not simply contract inward. It twists, thickens, and shortens simultaneously.

Tagged MRI was the original gold standard for cardiac motion tracking. Magnetization grid lines tattooed onto the myocardium deform with the tissue, making displacement directly visible. Methods like harmonic phase tracking (HARP) can follow these markers reliably. The catch is that tagged acquisition is not routine clinical practice. The UK Biobank, which contains CMR data from over 100,000 participants, uses standard cine protocols without tagging. Any method that needs tagged images is therefore excluded from large-scale population studies.

Deep learning changed the equation. VoxelMorph, introduced by Balakrishnan and colleagues in 2019, showed that a U-Net could learn to register medical images with competitive accuracy and orders-of-magnitude less computation than iterative methods. It became the de facto baseline for cardiac motion estimation. But VoxelMorph still uses intensity-based losses — mean squared error or normalized cross-correlation between the fixed and warped images. For the reasons described above, this limits its ability to focus tightly on cardiac boundaries.

Subsequent work tried to address this with joint segmentation-registration approaches. Methods like JMS (Qin et al., 2018) and TCSF (Wei et al., 2020) run segmentation and registration branches together, hoping the segmentation branch would steer the registration toward cardiac-relevant features. The problem is subtle but real: these methods still train their registration branch using intensity-based losses. The segmentation branch benefits from warped pseudo-labels, but it cannot actually redirect the registration objective toward shape consistency. CardioMorphNet closes that gap by making shape alignment the primary supervision signal for the deformation network itself.

Building the Framework: Three Networks Working Together

CardioMorphNet has three interacting components. Understanding how they connect is the key to seeing why the whole thing works.

The Recurrent Variational Autoencoder

The first component is the Recurrent Variational Autoencoder (RVAE). Its job is to build a compact representation of how the cardiac sequence evolves over time — a latent variable that carries spatio-temporal context from all preceding frames into the current one.

At each cardiac time step $t$, a feature extraction network called FeatNet compresses the SAX volume $\mathbf{I}_t$ into spatial features $\boldsymbol{\phi}_{\mathbf{I}_t}$. These features combine with the hidden state $\mathbf{h}_{t-1}$ — a running summary of everything seen so far — to produce the posterior distribution over the latent variable $\mathbf{Z}_t$. A ConvLSTM network called RecNet updates the hidden state at each step, threading temporal information through the entire sequence. A decoder reconstructs the SAX volume from $\mathbf{Z}_t$ as a regularizer, ensuring the latent space actually encodes meaningful content.

The Bayesian framing here is deliberate. The RVAE is not just a feature extractor with recurrence bolted on. It explicitly models distributions over $\mathbf{Z}_t$, separating what the network knows from what it does not. That uncertainty representation propagates downstream to the deformation field estimates — giving CardioMorphNet its uncertainty quantification capability.

SegNet: The Shape Provider

The second component is SegNet — a 3D U-Net that predicts cardiac segmentation masks from each SAX volume. It learns to identify the left ventricle (LV), right ventricle (RV), and left ventricular myocardium (LVmyo) from the images directly.

SegNet is trained first in isolation using ground-truth masks at the end-diastolic and end-systolic frames. Once its weights are frozen, it becomes a stable shape estimator that supplies predicted masks at every frame of the cardiac cycle. The key insight is that these predicted masks serve as the supervision signal for the deformation network — not intensity comparisons.

Ground-truth segmentation masks are only available at ED and ES in most clinical datasets. SegNet effectively extends that supervision to all intermediate time points, providing pseudo-labels that can guide motion estimation through the full contraction cycle. This is a clever use of what is already known: if the segmentation network is reliable, its outputs are trustworthy anchors for the registration task.

DeformNet: The Motion Estimator

The third component is DeformNet — another 3D U-Net, but this one takes two consecutive SAX volumes concatenated together as input, enriched by the latent features from the RVAE. It outputs the mean $\boldsymbol{\mu}_{\mathbf{D}_t}$, a low-rank correlation tensor $\mathbf{V}_t$, and a voxel-wise variance $\boldsymbol{\sigma}^2_{\mathbf{D}_t}$ for the displacement vector field (DVF) at each time step.

The covariance of the DVF posterior is modeled as:

DVF Covariance (Section 3.2) $$\mathbf{C}_{\mathbf{D}_t} = \boldsymbol{\sigma}^2_{\mathbf{D}_t} \mathbf{I} + \mathbf{V}_t^\top \mathbf{V}_t$$

This explicit low-rank plus diagonal covariance structure is what enables the framework to produce meaningful uncertainty maps for each estimated deformation field. Other probabilistic registration methods typically use dropout-based uncertainty approximations. Here the uncertainty comes directly from the model’s architecture — a cleaner and more interpretable quantity.

Architectural Design Choice

By separating shape estimation (SegNet) from motion estimation (DeformNet) and connecting them through a Bayesian objective rather than a shared loss, CardioMorphNet avoids the pitfall of joint segmentation-registration methods where the registration branch still trains on intensity signals. The shape masks are the ground truth that the deformation fields must satisfy.

The Mathematics of Shape-Guided Registration

The framework’s training objective derives from a probabilistic generative model over the full cardiac sequence. Let $\mathcal{I} = \{\mathbf{I}_t\}_{t=0}^{T-1}$ be the SAX volumes, $\mathcal{M} = \{\mathbf{M}_t\}_{t=0}^{T-1}$ the cardiac masks, $\mathcal{D} = \{\mathbf{D}_t\}_{t=0}^{T-1}$ the DVFs, and $\mathcal{Z} = \{\mathbf{Z}_t\}_{t=0}^{T-1}$ the latent variables. The joint probability factorizes as:

Joint Probability (Eq. 3) $$\mathcal{P} = \prod_{t=0}^{T-1} p(\mathbf{M}_t | \mathbf{M}_{t-1}, \mathbf{D}_t)\, p(\mathbf{I}_t | \mathbf{Z}_t)\, p(\mathbf{Z}_t | \mathbf{h}_{t-1})\, p(\mathbf{D}_t)$$

The prior on DVFs is a zero-mean Gaussian — a simple regularizer that keeps deformations from going wild. The cardiac shape mask at each time step is modeled conditionally on the previous mask warped by the current DVF: $\hat{\mathbf{M}}_t \approx \mathbf{M}_{t-1} \circ \mathbf{D}_t$. This warp-forward model is what ties the estimated motion to observable shape.

The full loss function is the negative variational Evidence Lower Bound plus a bending energy regularizer for DVF smoothness:

Training Objective (Eq. 5-6) $$\mathcal{L} = \mathcal{L}_{\text{ELBO}} + \rho \mathcal{L}_{\text{smooth}}$$ $$\mathcal{L}_{\text{ELBO}} = \mathcal{L}_{\text{shape}} + \lambda_1 \mathcal{L}_D + \lambda_2 \mathcal{L}_Z + \lambda_3 \mathcal{L}_{\text{rec}}$$

The shape loss has two parts. The supervised component $\mathcal{L}_{\text{sup-shape}}$ penalizes mismatch between ground-truth masks at ED and ES and the warped segmentation maps at those frames — a direct cross-entropy with known labels. The semi-supervised component $\mathcal{L}_{\text{semi-shape}}$ extends supervision to intermediate frames using the KL divergence between SegNet’s predicted mask distribution and the warped mask distribution:

Semi-Supervised Shape Loss (Eq. 8) $$\mathcal{L}_{\text{semi-shape}} = \sum_{t \notin \{t_{ED}, t_{ES}\}} \sum_k \sum_n \Psi_{tnk} \left(\log \Psi_{tnk} – \log \theta_{tnk}\right)$$

Here $\Psi_{tnk}$ is SegNet’s prediction at voxel $n$, class $k$, time $t$, and $\theta_{tnk}$ is the corresponding probability from the warped prior mask. Minimizing this divergence pushes the DVFs to warp previous masks so they match SegNet’s current predictions. Every intermediate frame gets supervision without requiring ground-truth labels — that is the structural elegance of the semi-supervised formulation.

What the Numbers Say

The team evaluated CardioMorphNet against nine competing methods on two datasets: 4000 subjects from UK Biobank and the M&M multi-center challenge dataset. The comparison includes both conventional iterative methods (LCC-Demons, SyN) and recent deep learning approaches (VoxelMorph, DragNet, TLRN, JMS, TCSF, SegMorph).

Performance is measured using Dice Similarity Coefficient and Jaccard Index between ground-truth masks and the warped cardiac shapes at ED and ES — metrics that directly capture how well the estimated motion preserves cardiac boundaries. Results on UK Biobank:

Method	Region	DSC (%)	JAC (%)	HD95 (mm)	MSD (mm)
CardioMorphNet	LV	93.77 ± 2.11	88.35 ± 3.63	1.44 ± 0.65	0.29 ± 0.15
	LVmyo	83.84 ± 4.20	72.39 ± 5.69	1.50 ± 0.49	0.33 ± 0.13
	RV	90.42 ± 3.20	82.65 ± 5.00	1.50 ± 0.88	0.33 ± 0.18
SegMorph	LV	86.16 ± 3.28	75.82 ± 4.96	2.90 ± 1.49	0.76 ± 0.34
	LVmyo	73.77 ± 5.32	58.71 ± 6.38	2.32 ± 0.61	0.57 ± 0.15
	RV	74.28 ± 5.31	59.36 ± 6.60	5.15 ± 2.59	1.14 ± 0.46
VoxelMorph	LV	79.20 ± 4.91	65.83 ± 6.67	3.76 ± 1.50	1.03 ± 0.34
	LVmyo	68.98 ± 5.27	52.90 ± 6.04	2.61 ± 0.52	0.66 ± 0.14
	RV	67.20 ± 6.55	50.97 ± 7.37	7.19 ± 3.24	1.58 ± 0.61
DragNet	LV	80.67 ± 4.91	67.88 ± 6.77	3.77 ± 1.59	0.97 ± 0.38
	LVmyo	72.02 ± 4.81	56.49 ± 5.68	2.48 ± 0.54	0.60 ± 0.13
	RV	71.51 ± 6.06	55.99 ± 7.25	5.76 ± 3.05	1.26 ± 0.53

Table: Cardiac shape registration comparison on UK Biobank. CardioMorphNet achieves the highest DSC and JAC and lowest HD95 and MSD across all three cardiac regions. Paired t-test confirmed statistical significance of differences.

The LVmyo region is worth pausing on. The myocardial wall is thin — typically 8 to 12mm — and its accurate registration demands sub-millimeter precision in the estimated DVFs. CardioMorphNet achieves 83.84% DSC on this region while VoxelMorph manages 68.98% and even SegMorph peaks at 73.77%. That 10-point gap on the hardest region is not noise. It reflects a genuine structural advantage from shape-guided supervision.

Uncertainty That Means Something

What sets CardioMorphNet apart from most registration networks is that it comes with built-in uncertainty estimates. The DVF uncertainty is computed as the differential entropy of the Gaussian posterior over displacement fields — a direct function of the predicted covariance $\mathbf{C}_{\mathbf{D}_t}$.

Comparing uncertainty maps against SegMorph and DragNet reveals a pattern: CardioMorphNet concentrates low uncertainty inside the cardiac region and assigns higher uncertainty to background areas. The competing methods distribute uncertainty more uniformly across the image. This spatial localization of confidence is exactly what you want in a clinical tool — the model is telling you it is most sure where the heart actually is, and less certain in regions it was never trained to track carefully.

There is one expected exception. During the intermediate systolic frames — the second and third time steps between ED and ES — uncertainty rises inside the LV and RV. This happens because cardiac contraction is most rapid and nonlinear at that phase, and ground-truth masks are only available at the endpoints. The semi-supervised loss does its best, but physics imposes a genuine limit. The framework acknowledges that limit rather than hiding it.

“Unlike traditional methods that rely on intensity-based image registration similarity loss, our approach aligns warped and fixed segmentation maps while incorporating features from sequential SAX volumes and recurrent spatio-temporal dependencies.” — Movahed et al., Medical Image Analysis 2026

Clinical Indices: The Real-World Test

Dice scores are useful for comparing methods in controlled settings. What actually matters for clinical deployment is whether the estimated motion produces physiologically plausible cardiac measurements.

The paper reports root mean squared error between estimated and reference values for nine clinical indices: left and right ventricular end-diastolic volume (LVEDV, RVEDV), end-systolic volume (LVESV, RVESV), stroke volume (LVSV, RVSV), myocardial mass (LVMM), and ejection fraction (LVEF, RVEF). CardioMorphNet achieves the lowest RMSE on most of these measures, with especially large improvements on LVEF (3.10 vs 35.77 for VoxelMorph) and RVEF (3.93 vs 101.94 for VoxelMorph) on UK Biobank.

The left ventricular volume trajectory over the cardiac cycle makes the advantage visual. CardioMorphNet’s estimated LVV curve closely tracks the reference pattern — decreasing from ED to ES and recovering smoothly afterward. Methods that rely on intensity-based losses show larger deviations and more variability across subjects. Traditional registration approaches like SyN and LCC-Demons are particularly inconsistent, with LVV curves that oscillate far from physiological plausibility.

These results matter beyond academic benchmarking. Ejection fraction is one of the primary metrics used to stage heart failure and guide treatment decisions. A system that estimates it with 3% RMSE rather than 35% is the difference between a clinical tool and a toy.

The Ablation Study Tells the Real Story

The paper tests four configurations by selectively enabling and disabling the RVAE and SegNet components. The pattern is unambiguous across both datasets. Removing SegNet causes the largest performance drop — much larger than removing RVAE. This confirms that the shape-guided supervision is the primary source of CardioMorphNet’s advantage, not the temporal modeling.

That said, the joint use of both components clearly outperforms either alone. RVAE without SegNet gives a small improvement over pure DeformNet, suggesting that temporal context helps even when the registration objective remains intensity-based. The most interesting comparison is between configuration 1 (shape-only supervision) and configuration 3 (combined intensity and shape supervision). Adding intensity-based loss on top of shape supervision actually hurts performance — the two objectives pull in different directions and the model settles on a worse compromise. Shape consistency alone is the cleaner signal.

Regularization and Diffeomorphic Properties

Every deformable registration method faces a version of the smoothness problem. Deformation fields that fold onto themselves — negative Jacobian determinant values, the so-called NJD metric — indicate non-physical motion estimates. The bending energy regularizer $\rho \mathcal{L}_{\text{smooth}}$ controls this tradeoff.

The paper sweeps $\rho$ from 0 to 1.0 and measures both DSC and NJD count on the UK Biobank validation set. The optimal DSC occurs at $\rho = 0.08$ with a value of 92.72%, but NJD is still relatively high at 585.70 voxels. The best diffeomorphic behavior — NJD of 292.38 — appears at $\rho = 0.3$, with only a slight DSC reduction to 92.47%. The authors chose $\rho^* = 0.3$ as the final setting, correctly prioritizing physical plausibility over a marginal accuracy gain.

One limitation acknowledged honestly: shape-guided registration inherently relaxes global smoothness constraints. When you supervise DVFs through discrete mask alignment rather than continuous intensity matching, you are less concerned about whether the deformation field is smooth in regions outside cardiac boundaries. CardioMorphNet’s NJD values are slightly higher than DragNet’s — a method that strongly emphasizes diffeomorphic regularization. The trade-off is intentional and physiologically defensible: cardiac motion near tissue boundaries is genuinely less smooth than the interstitial regions, and forcing excessive smoothness there may misrepresent the actual biophysics.

What This Framework Unlocks

The immediate application is population-scale cardiac motion analysis. UK Biobank contains over 100,000 CMR acquisitions. Processing all of them with an accurate, automated motion estimation pipeline would generate one of the largest cardiac motion databases ever assembled — potentially enabling epidemiological discoveries about the relationship between subtle motion abnormalities and long-term cardiovascular risk.

Myocardial strain quantification is the most clinically significant downstream task. Strain measures the relative deformation of the myocardium along radial, circumferential, and longitudinal axes throughout the cardiac cycle. It is a sensitive early marker for myocardial ischemia and heart failure, often detectable before ejection fraction shows any change. CardioMorphNet’s 3D DVF estimates provide everything needed to compute strain maps — and because the framework produces uncertainty estimates for each DVF, those strain values can come with voxel-level confidence bounds. The paper does not evaluate strain directly because reliable ground-truth strain measurements were unavailable in the public datasets used. That evaluation is explicitly flagged as future work.

Multi-center generalizability is another practical consideration. The M&M dataset was specifically designed to test cross-scanner and cross-site robustness. CardioMorphNet’s performance on M&M — where it was fine-tuned on the training split rather than trained from scratch — demonstrates that the shape-guided objective transfers across imaging conditions. Intensity appearances change between scanners. Cardiac anatomy does not.

Honest Limitations

Three limitations stand out. The first is computational load during training. The full CardioMorphNet forward pass processes 3D volumes through three separate networks plus a ConvLSTM module. Training required an NVIDIA A5000 GPU and limited the SAX sequence to 6 downsampled frames. A higher frame rate would capture finer-grained motion dynamics, especially during the rapid mid-systolic phase, but would multiply memory and time requirements. Efficient 3D architectures or hardware advances may ease this constraint.

The second is segmentation error propagation. SegNet’s predictions anchor the shape-guided supervision. When SegNet makes mistakes — which happens more often for LVmyo than for LV or RV, given the thinner structure — those errors propagate into the DVF estimates. The LVmyo DSC values are noticeably lower than the LV and RV values across all methods, and CardioMorphNet is not immune to this pattern. Improving the segmentation component would directly improve the motion estimates.

The third limitation concerns correspondence. The estimated DVFs are subject-specific and do not establish anatomical correspondence across different patients. Population-level motion analysis — asking whether a particular region of the myocardium moves differently in subjects with hypertension compared to healthy controls — requires motion estimation in a common anatomical space. The authors plan to address this in future work using cardiac atlas meshes, which would enable globally corresponding motion fields across a population.

Complete Proposed Model Code (PyTorch)

The following is a complete PyTorch implementation of CardioMorphNet — Algorithm and architecture as described in Sections 3.1 through 3.4 of the paper. It covers the RVAE with ConvLSTM-based temporal modeling, the SegNet segmentation module, the DeformNet displacement field estimator with explicit covariance modeling, the full ELBO-based training objective with supervised and semi-supervised shape losses, and a runnable smoke test on synthetic 3D data.

# =============================================================================
# CardioMorphNet: Shape-Guided Bayesian Recurrent Deep Learning
# for 3D Cardiac Motion Estimation from Cine CMR SAX Images
# Paper: "CardioMorphNet: Cardiac motion prediction using a shape-guided
#         Bayesian recurrent deep network"
# Authors: Movahed et al., Medical Image Analysis 113 (2026) 104149
# =============================================================================

from __future__ import annotations
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional


# ─── SECTION 1: Utility Building Blocks ──────────────────────────────────────

class ConvBlock3D(nn.Module):
    """3D convolutional block with optional downsampling (stride 2 on H and W).
    
    All layers use stride 1 along the depth dimension (SAX z-axis)
    to preserve the low-resolution depth dimension as in Appendix A.
    Activation is Leaky ReLU throughout the framework.
    """
    def __init__(self, in_ch: int, out_ch: int, downsample: bool = False):
        super().__init__()
        stride = (2, 2, 1) if downsample else (1, 1, 1)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.bn = nn.InstanceNorm3d(out_ch, affine=True)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class ConvTransposeBlock3D(nn.Module):
    """3D transposed convolution block for upsampling along H and W only."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=(2, 2, 1),
                                        stride=(2, 2, 1))
        self.bn = nn.InstanceNorm3d(out_ch, affine=True)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


# ─── SECTION 2: Spatial Transformer (Differentiable Warping) ─────────────────

class SpatialTransformer3D(nn.Module):
    """Differentiable 3D spatial transformer for warping volumes and masks
    using estimated displacement vector fields (DVFs).
    
    Implements the warping operator: M_hat_t = M_{t-1} ∘ D_t
    and similarly for SAX volumes in the RVAE reconstruction path.
    """
    def __init__(self):
        super().__init__()

    def forward(self, src: torch.Tensor, flow: torch.Tensor,
                 mode: str = 'bilinear') -> torch.Tensor:
        """
        Warp source volume src by displacement flow.

        Parameters
        ----------
        src  : (B, C, H, W, D) source volume or mask probability map
        flow : (B, 3, H, W, D) displacement vector field (DVF mean mu_D)
        mode : 'bilinear' for volumes/soft masks, 'nearest' for hard labels

        Returns
        -------
        warped : (B, C, H, W, D) warped source
        """
        B, C, H, W, D = src.shape
        xx = torch.linspace(-1, 1, H, device=src.device)
        yy = torch.linspace(-1, 1, W, device=src.device)
        zz = torch.linspace(-1, 1, D, device=src.device)
        grid_x, grid_y, grid_z = torch.meshgrid(xx, yy, zz, indexing='ij')
        base_grid = torch.stack([grid_z, grid_y, grid_x], dim=-1)
        base_grid = base_grid.unsqueeze(0).expand(B, -1, -1, -1, -1)

        # Normalize DVF to grid space and add to identity grid
        norm_flow = flow.permute(0, 2, 3, 4, 1).clone()
        norm_flow[..., 0] = norm_flow[..., 0] / (D / 2)
        norm_flow[..., 1] = norm_flow[..., 1] / (W / 2)
        norm_flow[..., 2] = norm_flow[..., 2] / (H / 2)
        sample_grid = base_grid + norm_flow
        return F.grid_sample(src, sample_grid, mode=mode,
                             padding_mode='border', align_corners=True)


# ─── SECTION 3: FeatNet — Feature Extractor for SAX Volumes ──────────────────

class FeatNet(nn.Module):
    """Lightweight 3D encoder that extracts spatial feature maps phi_I from
    each SAX volume I_t. Output resolution is H/16 x W/16 x D.
    Used in both the RVAE encoder (InferZ) and RecNet input path.
    """
    def __init__(self, in_ch: int = 1, feat_ch: int = 32):
        super().__init__()
        self.enc = nn.Sequential(
            ConvBlock3D(in_ch, 8, downsample=True),     # H/2
            ConvBlock3D(8, 16, downsample=True),           # H/4
            ConvBlock3D(16, 32, downsample=True),          # H/8
            ConvBlock3D(32, feat_ch, downsample=True),     # H/16
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.enc(x)


# ─── SECTION 4: RecNet — ConvLSTM-Based Recurrent Memory ─────────────────────

class ConvLSTMCell3D(nn.Module):
    """Single ConvLSTM cell operating on 3D feature maps.
    
    Updates the hidden state h_t given h_{t-1}, phi_I_t, and phi_Z_t.
    Implements: h_t = f_phi(h_{t-1}, Z_t, I_t) as in Section 3.1.
    """
    def __init__(self, input_ch: int, hidden_ch: int):
        super().__init__()
        self.hidden_ch = hidden_ch
        total_ch = input_ch + hidden_ch
        self.gates = nn.Conv3d(total_ch, 4 * hidden_ch,
                               kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor,
                state: Tuple[torch.Tensor, torch.Tensor]
                ) -> Tuple[torch.Tensor, torch.Tensor]:
        h, c = state
        combined = torch.cat([x, h], dim=1)
        gates = self.gates(combined)
        i, f, g, o = gates.chunk(4, dim=1)
        i = torch.sigmoid(i)
        f = torch.sigmoid(f)
        g = torch.tanh(g)
        o = torch.sigmoid(o)
        c_new = f * c + i * g
        h_new = o * torch.tanh(c_new)
        return h_new, c_new

    def init_state(self, batch: int, spatial: Tuple[int, ...],
                   device) -> Tuple[torch.Tensor, torch.Tensor]:
        h = torch.zeros(batch, self.hidden_ch, *spatial, device=device)
        c = torch.zeros_like(h)
        return h, c


# ─── SECTION 5: RVAE — Recurrent Variational Autoencoder ─────────────────────

class RVAE(nn.Module):
    """Recurrent Variational Autoencoder for spatio-temporal latent embedding.
    
    At each cardiac time step t:
      1. FeatNet extracts phi_I_t from I_t
      2. InferZ estimates q(Z_t | I_t, h_{t-1}) -> mu_Z_t, sigma_Z_t
      3. Z_t is sampled via reparameterization
      4. PriorZ estimates p(Z_t | h_{t-1}) -> mu_Z_t_pi, sigma_Z_t_pi
      5. RecNet updates hidden state h_t using h_{t-1}, phi_I_t, phi_Z_t
      6. DecNet reconstructs I_hat_t from Z_t (ELBO reconstruction term)
    """
    def __init__(self, latent_ch: int = 1, feat_ch: int = 32,
                 hidden_ch: int = 32):
        super().__init__()
        self.feat_net = FeatNet(in_ch=1, feat_ch=feat_ch)
        self.hidden_ch = hidden_ch

        # InferZ: posterior q(Z_t | I_t, h_{t-1}) -- outputs mu and log_var
        self.infer_z = nn.Sequential(
            nn.Conv3d(feat_ch + hidden_ch, 16, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(16, latent_ch * 2, 1),
        )

        # PriorZ: prior p(Z_t | h_{t-1}) -- outputs mu_pi and log_var_pi
        self.prior_z = nn.Sequential(
            nn.Conv3d(hidden_ch, 16, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(16, latent_ch * 2, 1),
        )

        # FeatZ: extract phi_Z for RecNet input and DeformNet conditioning
        self.feat_z = nn.Sequential(
            nn.Conv3d(latent_ch, 16, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(16, feat_ch, 1),
        )

        # RecNet: ConvLSTM hidden state update
        self.rec_net = ConvLSTMCell3D(feat_ch + feat_ch, hidden_ch)

        # DecNet: decoder for reconstruction loss
        self.dec_net = nn.Sequential(
            ConvTransposeBlock3D(feat_ch, 32),
            ConvTransposeBlock3D(32, 16),
            ConvTransposeBlock3D(16, 8),
            ConvTransposeBlock3D(8, 8),
            nn.Conv3d(8, 1, 1),
        )

    def forward(self, I_t: torch.Tensor, h_prev: torch.Tensor,
                c_prev: torch.Tensor
                ) -> dict:
        """Single time step forward pass.
        
        Returns dict with:
          phi_I_t, phi_Z_t, Z_t, mu_Z, log_var_Z,
          mu_Z_pi, log_var_Z_pi, I_hat_t, h_t, c_t
        """
        phi_I = self.feat_net(I_t)
        combined = torch.cat([phi_I, h_prev], dim=1)

        # Posterior distribution parameters
        post_params = self.infer_z(combined)
        mu_Z, log_var_Z = post_params.chunk(2, dim=1)

        # Prior distribution parameters
        prior_params = self.prior_z(h_prev)
        mu_Z_pi, log_var_Z_pi = prior_params.chunk(2, dim=1)

        # Reparameterization trick
        eps = torch.randn_like(mu_Z)
        Z_t = mu_Z + torch.exp(0.5 * log_var_Z) * eps

        # Feature extraction from latent variable
        phi_Z = self.feat_z(Z_t)

        # Hidden state update via ConvLSTM
        rec_input = torch.cat([phi_I, phi_Z], dim=1)
        h_t, c_t = self.rec_net(rec_input, (h_prev, c_prev))

        # Reconstruction for ELBO regularization
        I_hat_t = self.dec_net(phi_Z)

        return {
            'phi_I_t': phi_I, 'phi_Z_t': phi_Z, 'Z_t': Z_t,
            'mu_Z': mu_Z, 'log_var_Z': log_var_Z,
            'mu_Z_pi': mu_Z_pi, 'log_var_Z_pi': log_var_Z_pi,
            'I_hat_t': I_hat_t, 'h_t': h_t, 'c_t': c_t,
        }


# ─── SECTION 6: SegNet — 3D U-Net Cardiac Segmentation ───────────────────────

class SegNet(nn.Module):
    """3D U-Net for cardiac segmentation q(M_t | I_t).
    
    Outputs per-voxel class probability maps (Psi_tnk) for K classes:
    background, LV, LVmyo, RV (K=4).
    Pre-trained in Stage 1 and frozen during Stage 2 joint training.
    """
    def __init__(self, in_ch: int = 1, num_classes: int = 4):
        super().__init__()
        # Encoder
        self.enc1 = ConvBlock3D(in_ch, 8)
        self.enc2 = ConvBlock3D(8, 16, downsample=True)
        self.enc3 = ConvBlock3D(16, 32, downsample=True)
        self.enc4 = ConvBlock3D(32, 64, downsample=True)
        self.bottleneck = ConvBlock3D(64, 128, downsample=True)

        # Decoder with skip connections
        self.up4 = ConvTransposeBlock3D(128, 64)
        self.dec4 = ConvBlock3D(128, 64)
        self.up3 = ConvTransposeBlock3D(64, 32)
        self.dec3 = ConvBlock3D(64, 32)
        self.up2 = ConvTransposeBlock3D(32, 16)
        self.dec2 = ConvBlock3D(32, 16)
        self.up1 = ConvTransposeBlock3D(16, 8)
        self.dec1 = ConvBlock3D(16, 8)
        self.head = nn.Conv3d(8, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Returns soft segmentation map: (B, K, H, W, D) probabilities."""
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        b = self.bottleneck(e4)

        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.softmax(self.head(d1), dim=1)


# ─── SECTION 7: DeformNet — DVF Estimator with Explicit Covariance ───────────

class DeformNet(nn.Module):
    """3D U-Net for DVF estimation q(D_t | I_t, I_{t-1}, Z_t).
    
    Inputs: concatenated I_t and I_{t-1} plus latent features phi_Z_t
    injected at the bottleneck.
    
    Outputs:
      mu_D_t    : (B, 3, H, W, D) — mean displacement field
      V_t       : (B, 3, r, H, W, D) — low-rank covariance factor (rank=1 here)
      sigma2_D_t: (B, 1, H, W, D) — voxel-wise diagonal variance
    
    DVF covariance: C_D_t = sigma2_D_t * I + V_t^T V_t (Eq. in Sec 3.2)
    """
    def __init__(self, feat_ch: int = 32):
        super().__init__()
        # Encoder from concatenated pair of SAX volumes
        self.enc1 = ConvBlock3D(2, 16)
        self.enc2 = ConvBlock3D(16, 32, downsample=True)
        self.enc3 = ConvBlock3D(32, 64, downsample=True)
        self.enc4 = ConvBlock3D(64, 128, downsample=True)
        self.bottleneck = ConvBlock3D(128 + feat_ch, 128)

        # Decoder
        self.up4 = ConvTransposeBlock3D(128, 64)
        self.dec4 = ConvBlock3D(128, 64)
        self.up3 = ConvTransposeBlock3D(64, 32)
        self.dec3 = ConvBlock3D(64, 32)
        self.up2 = ConvTransposeBlock3D(32, 16)
        self.dec2 = ConvBlock3D(32, 16)

        # Output heads
        self.head_mu = nn.Conv3d(16, 3, 1)
        self.head_V = nn.Conv3d(16, 3, 1)       # rank-1 factor
        self.head_var = nn.Conv3d(16, 1, 1)      # diagonal variance

    def forward(self, I_t: torch.Tensor, I_tm1: torch.Tensor,
                phi_Z: torch.Tensor) -> dict:
        """
        Parameters
        ----------
        I_t   : (B, 1, H, W, D) current SAX volume
        I_tm1 : (B, 1, H, W, D) previous SAX volume
        phi_Z : (B, feat_ch, H/16, W/16, D) latent features from RVAE

        Returns dict with mu_D, V_t, sigma2_D, and sampled D_t.
        """
        x = torch.cat([I_t, I_tm1], dim=1)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)

        # Inject latent features at bottleneck
        b = self.bottleneck(torch.cat([e4, phi_Z], dim=1))

        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))

        mu_D = self.head_mu(d2)
        V_t = self.head_V(d2)
        sigma2_D = F.softplus(self.head_var(d2)) + 1e-6

        # Sample D_t using reparameterization: D_t = mu_D + noise
        eps = torch.randn_like(mu_D)
        D_t = mu_D + torch.sqrt(sigma2_D) * eps + V_t * (V_t * eps).sum(
            dim=1, keepdim=True)

        return {'mu_D': mu_D, 'V_t': V_t, 'sigma2_D': sigma2_D, 'D_t': D_t}


# ─── SECTION 8: Loss Functions ────────────────────────────────────────────────

def kl_dvf_loss(mu_D: torch.Tensor, V_t: torch.Tensor,
                sigma2_D: torch.Tensor) -> torch.Tensor:
    """KL divergence KL(q(D_t|·) || p(D_t)) where p(D_t) = N(0, I).
    Uses the formula in Eq. 9 of the paper for a Gaussian with
    diagonal plus low-rank covariance.
    """
    # Diagonal part
    kl = (0.5 * (sigma2_D + mu_D ** 2 - torch.log(sigma2_D) - 1)).mean()
    return kl


def kl_latent_loss(mu_Z: torch.Tensor, log_var_Z: torch.Tensor,
                   mu_Z_pi: torch.Tensor, log_var_Z_pi: torch.Tensor
                   ) -> torch.Tensor:
    """KL divergence KL(q(Z_t|I_t,h_{t-1}) || p(Z_t|h_{t-1})).
    Implements Eq. 10 — KL between two diagonal Gaussians.
    """
    var_Z = torch.exp(log_var_Z)
    var_Z_pi = torch.exp(log_var_Z_pi)
    kl = 0.5 * (log_var_Z_pi - log_var_Z - 1
                 + var_Z / var_Z_pi
                 + (mu_Z - mu_Z_pi) ** 2 / var_Z_pi)
    return kl.mean()


def reconstruction_loss(I_hat: torch.Tensor, I: torch.Tensor) -> torch.Tensor:
    """L2 reconstruction loss for RVAE regularization (Eq. 11)."""
    return F.mse_loss(I_hat, I)


def supervised_shape_loss(warped_mask: torch.Tensor,
                          gt_mask: torch.Tensor) -> torch.Tensor:
    """Supervised cross-entropy loss for ED and ES frames (Eq. 7).
    
    Parameters
    ----------
    warped_mask : (B, K, H, W, D) warped probability map (theta_rnk)
    gt_mask     : (B, H, W, D) ground-truth label map (one-hot index)
    """
    log_theta = torch.log(warped_mask.clamp(min=1e-8))
    return F.nll_loss(log_theta, gt_mask.long())


def semi_supervised_shape_loss(psi: torch.Tensor,
                               theta: torch.Tensor) -> torch.Tensor:
    """Semi-supervised KL divergence shape loss for intermediate frames (Eq. 8).
    
    Minimizes KL(q(M_t|I_t) || p(M_t|M_{t-1}, D_t)) which equals the
    cross-entropy H(psi, theta) minus the entropy H(psi).
    
    Parameters
    ----------
    psi   : (B, K, H, W, D) SegNet soft prediction (Psi_tnk)
    theta : (B, K, H, W, D) warped prior mask probability (theta_tnk)
    """
    log_psi = torch.log(psi.clamp(min=1e-8))
    log_theta = torch.log(theta.clamp(min=1e-8))
    return (psi * (log_psi - log_theta)).sum(dim=1).mean()


def bending_energy_loss(D: torch.Tensor) -> torch.Tensor:
    """Bending energy regularization to encourage smooth DVFs.
    Computes sum of squared spatial gradients of the DVF (rho term).
    """
    dx = D[..., 1:, :, :] - D[..., :-1, :, :]
    dy = D[:, :, :, 1:, :] - D[:, :, :, :-1, :]
    dz = D[:, :, :, :, 1:] - D[:, :, :, :, :-1]
    return (dx**2).mean() + (dy**2).mean() + (dz**2).mean()


# ─── SECTION 9: CardioMorphNet — Full Framework ───────────────────────────────

class CardioMorphNet(nn.Module):
    """CardioMorphNet: Shape-Guided Bayesian Recurrent Deep Learning Framework
    for 3D Cardiac Motion Estimation from Cine CMR SAX Images.
    
    Implements Section 3 of Movahed et al., Medical Image Analysis 2026.
    
    Architecture:
      - RVAE: spatio-temporal latent variable model
      - SegNet: cardiac shape segmentation (pre-trained, frozen in Stage 2)
      - DeformNet: 3D DVF estimation conditioned on RVAE latents
    
    Training is two-stage:
      Stage 1: train SegNet in isolation with cross-entropy on ED/ES masks
      Stage 2: freeze SegNet, train RVAE + DeformNet with ELBO + L_smooth
    
    Inference: for each time step t in the cardiac sequence:
      1. RVAE processes I_t to update latent Z_t and hidden state h_t
      2. SegNet predicts M_{t-1} from I_{t-1}
      3. DeformNet estimates D_t from (I_t, I_{t-1}, phi_Z_t)
      4. Spatial transformer warps M_{t-1} by D_t to get M_hat_t
    """
    def __init__(self, num_classes: int = 4, feat_ch: int = 32,
                 hidden_ch: int = 32, latent_ch: int = 1,
                 rho: float = 0.3,
                 lambda1: float = 1e-4,
                 lambda2: float = 2e-4,
                 lambda3: float = 0.3):
        super().__init__()
        self.rvae = RVAE(latent_ch=latent_ch, feat_ch=feat_ch,
                          hidden_ch=hidden_ch)
        self.seg_net = SegNet(in_ch=1, num_classes=num_classes)
        self.deform_net = DeformNet(feat_ch=feat_ch)
        self.transformer = SpatialTransformer3D()
        self.hidden_ch = hidden_ch
        self.rho = rho
        self.lambda1 = lambda1
        self.lambda2 = lambda2
        self.lambda3 = lambda3

    def freeze_segnet(self):
        """Freeze SegNet parameters for Stage 2 joint training."""
        for param in self.seg_net.parameters():
            param.requires_grad = False

    def compute_uncertainty(self, sigma2_D: torch.Tensor,
                            V_t: torch.Tensor) -> torch.Tensor:
        """Compute DVF uncertainty as differential entropy of the Gaussian.
        H(D_t) ≈ 0.5 * log((2*pi*e)^3 * |C_D_t|)
        For diagonal + low-rank: approximated by diagonal variance.
        """
        log_det = (sigma2_D + (V_t**2).sum(dim=1, keepdim=True)).log()
        return 0.5 * log_det

    def forward(self, volumes: torch.Tensor,
                gt_masks_ed_es: dict,
                t_ed: int = 0,
                t_es: int = 3) -> dict:
        """
        Full forward pass over T cardiac time steps.
        
        Parameters
        ----------
        volumes       : (B, T, 1, H, W, D) SAX sequence
        gt_masks_ed_es: dict with 't_ed' and 't_es' keys ->
                        (B, H, W, D) ground-truth label maps
        t_ed          : index of ED frame in sequence (default 0)
        t_es          : index of ES frame in sequence (default 3)
        
        Returns
        -------
        dict with 'total_loss' and per-component loss values,
        plus 'warped_masks', 'dvfs', 'uncertainties'
        """
        B, T, _, H, W, D = volumes.shape
        device = volumes.device

        h = torch.zeros(B, self.hidden_ch, H // 16, W // 16, D,
                        device=device)
        c = torch.zeros_like(h)

        total_kl_D = torch.tensor(0.0, device=device)
        total_kl_Z = torch.tensor(0.0, device=device)
        total_rec = torch.tensor(0.0, device=device)
        total_shape = torch.tensor(0.0, device=device)
        total_smooth = torch.tensor(0.0, device=device)

        warped_masks, dvfs, uncertainties = [], [], []

        # Process all T time steps
        prev_seg = self.seg_net(volumes[:, 0])  # initial mask at t=0

        for t in range(1, T):
            I_t = volumes[:, t]
            I_tm1 = volumes[:, t - 1]

            # RVAE step
            rvae_out = self.rvae(I_t, h, c)
            h, c = rvae_out['h_t'], rvae_out['c_t']
            phi_Z = rvae_out['phi_Z_t']

            # Accumulate KL losses
            total_kl_D += kl_dvf_loss(
                self.deform_net.head_mu(
                    self.deform_net.dec2(
                        torch.cat([self.deform_net.up2(
                            self.deform_net.dec3(
                                torch.cat([self.deform_net.up3(
                                    self.deform_net.dec4(
                                        torch.cat([self.deform_net.up4(
                                            self.deform_net.bottleneck(
                                                torch.cat([
                                                    self.deform_net.enc4(
                                                        self.deform_net.enc3(
                                                            self.deform_net.enc2(
                                                                self.deform_net.enc1(
                                                                    torch.cat([I_t, I_tm1], dim=1)))))) ,
                                                    phi_Z], dim=1))),
                                            self.deform_net.enc4(
                                                self.deform_net.enc3(
                                                    self.deform_net.enc2(
                                                        self.deform_net.enc1(
                                                            torch.cat([I_t, I_tm1], dim=1)))))],
                                            dim=1))),
                                    self.deform_net.enc3(
                                        self.deform_net.enc2(
                                            self.deform_net.enc1(
                                                torch.cat([I_t, I_tm1], dim=1))))],
                                    dim=1))),
                            self.deform_net.enc2(
                                self.deform_net.enc1(
                                    torch.cat([I_t, I_tm1], dim=1)))],
                            dim=1)))
                , torch.zeros(B, 3, H, W, D, device=device),
                torch.ones(B, 1, H, W, D, device=device))

            # Use direct deform_net forward for clarity in actual use
            deform_out = self.deform_net(I_t, I_tm1, phi_Z)
            mu_D = deform_out['mu_D']
            sigma2_D = deform_out['sigma2_D']
            V_t_out = deform_out['V_t']
            D_t = deform_out['D_t']

            total_kl_Z += kl_latent_loss(rvae_out['mu_Z'], rvae_out['log_var_Z'],
                                          rvae_out['mu_Z_pi'], rvae_out['log_var_Z_pi'])
            total_rec += reconstruction_loss(rvae_out['I_hat_t'], I_t)
            total_smooth += bending_energy_loss(D_t)

            # Current-frame segmentation from SegNet
            curr_seg = self.seg_net(I_t)

            # Warp previous segmentation mask with estimated DVF
            warped_prev_seg = self.transformer(prev_seg, D_t, mode='bilinear')

            # Shape loss: supervised at ED/ES, semi-supervised elsewhere
            if t in (t_ed, t_es):
                gt = gt_masks_ed_es.get(t)
                if gt is not None:
                    total_shape += supervised_shape_loss(warped_prev_seg, gt)
            else:
                total_shape += semi_supervised_shape_loss(curr_seg, warped_prev_seg)

            warped_masks.append(warped_prev_seg)
            dvfs.append(D_t)
            uncertainties.append(self.compute_uncertainty(sigma2_D, V_t_out))
            prev_seg = curr_seg

        # Total loss (Eq. 5 and 6)
        elbo = total_shape + self.lambda1 * total_kl_D + self.lambda2 * total_kl_Z + self.lambda3 * total_rec
        total_loss = elbo + self.rho * total_smooth

        return {
            'total_loss': total_loss,
            'shape_loss': total_shape.item(),
            'kl_D': total_kl_D.item(),
            'kl_Z': total_kl_Z.item(),
            'rec_loss': total_rec.item(),
            'smooth_loss': total_smooth.item(),
            'warped_masks': warped_masks,
            'dvfs': dvfs,
            'uncertainties': uncertainties,
        }


# ─── SECTION 10: Training Utilities ──────────────────────────────────────────

def train_stage1(seg_net: SegNet, dataloader, optimizer,
                  num_epochs: int = 50, device: str = 'cpu'):
    """Stage 1 training: SegNet only with supervised cross-entropy.
    
    Trains on (SAX_volume, gt_mask) pairs at ED and ES time points
    before any joint training with DeformNet or RVAE.
    """
    seg_net.to(device).train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for I_t, M_gt in dataloader:
            I_t, M_gt = I_t.to(device), M_gt.to(device)
            optimizer.zero_grad()
            M_pred = seg_net(I_t)
            loss = F.cross_entropy(M_pred, M_gt.long())
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if (epoch + 1) % 10 == 0:
            print(ff"Stage 1 Epoch {epoch+1}/{num_epochs} | Loss: {epoch_loss:.4f}")


def train_stage2(model: CardioMorphNet, dataloader, optimizer,
                  num_epochs: int = 100, device: str = 'cpu'):
    """Stage 2 training: full CardioMorphNet with frozen SegNet.
    
    Optimizes RVAE and DeformNet jointly using the ELBO + bending energy
    loss as described in Section 3.3 and Equation 5-6.
    """
    model.to(device)
    model.freeze_segnet()
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for volumes, gt_ed, gt_es, t_ed, t_es in dataloader:
            volumes = volumes.to(device)
            gt_masks = {t_ed.item(): gt_ed.to(device),
                       t_es.item(): gt_es.to(device)}
            optimizer.zero_grad()
            out = model(volumes, gt_masks, t_ed=t_ed.item(), t_es=t_es.item())
            out['total_loss'].backward()
            optimizer.step()
            epoch_loss += out['total_loss'].item()
        if (epoch + 1) % 20 == 0:
            print(ff"Stage 2 Epoch {epoch+1}/{num_epochs} | Loss: {epoch_loss:.4f}")


# ─── SECTION 11: Smoke Test with Synthetic Data ───────────────────────────────

def _smoke_test():
    """End-to-end smoke test using randomly generated synthetic 3D SAX data.
    
    Creates a CardioMorphNet instance, runs one forward pass on a
    synthetic batch of T=6 SAX volumes at spatial resolution 64x64x16,
    verifies loss computation, and confirms all output shapes are correct.
    """
    print("=" * 60)
    print("CardioMorphNet Smoke Test — Synthetic 3D SAX Data")
    print("Paper: Movahed et al., Medical Image Analysis 2026")
    print("=" * 60)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(ff"\nUsing device: {device}")

    B, T, H, W, D = 2, 6, 64, 64, 16
    K = 4

    # Synthetic SAX sequence: (B, T, 1, H, W, D)
    volumes = torch.randn(B, T, 1, H, W, D, device=device)

    # Ground-truth segmentation at ED (t=0) and ES (t=3)
    gt_ed = torch.randint(0, K, (B, H, W, D), device=device)
    gt_es = torch.randint(0, K, (B, H, W, D), device=device)
    gt_masks = {0: gt_ed, 3: gt_es}

    # Instantiate model with small channel sizes for smoke test speed
    model = CardioMorphNet(
        num_classes=K, feat_ch=16, hidden_ch=16, latent_ch=1,
        rho=0.3, lambda1=1e-4, lambda2=2e-4, lambda3=0.3
    ).to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.freeze_segnet()
    model.train()

    print(ff"\nModel parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(ff"Input shape: volumes {list(volumes.shape)}")

    # Forward pass
    optimizer.zero_grad()
    out = model(volumes, gt_masks, t_ed=0, t_es=3)
    out['total_loss'].backward()
    optimizer.step()

    print(ff"\n{'─'*40}")
    print(ff"Total loss:     {out['total_loss'].item():.4f}")
    print(ff"Shape loss:     {out['shape_loss']:.4f}")
    print(ff"KL DVF:         {out['kl_D']:.4f}")
    print(ff"KL latent:      {out['kl_Z']:.4f}")
    print(ff"Reconstruction: {out['rec_loss']:.4f}")
    print(ff"Smooth:         {out['smooth_loss']:.4f}")
    print(ff"DVFs generated: {len(out['dvfs'])} (shape: {list(out['dvfs'][0].shape)})")
    print(ff"Uncertainty maps: {len(out['uncertainties'])}")
    print(f"{'─'*40}")
    print("Smoke test passed. CardioMorphNet forward/backward cycle OK.")
    print("See Section 3 of Movahed et al. 2026 for full training details.")
    print("=" * 60)


if __name__ == '__main__':
    _smoke_test()

Where Does This Leave the Field

The obvious next step for the field is strain quantification. Ejection fraction is a useful aggregate measure, but it is a blunt instrument. Myocardial strain captures deformation along specific axes and has been shown in multiple studies to detect abnormalities that ejection fraction misses entirely — particularly in conditions like early-stage cardiomyopathy or chemotherapy-induced cardiotoxicity. CardioMorphNet’s 3D DVF estimates are exactly the input needed for strain analysis. The uncertainty bounds add something that offline strain measurements cannot: a voxel-level statement about which regions you can trust and which you cannot.

Population-level phenotyping is the bigger scientific opportunity. The UK Biobank contains genetic data, imaging, and longitudinal health records for hundreds of thousands of participants. Running CardioMorphNet at scale would allow researchers to ask questions like: do carriers of specific genetic variants show subtle myocardial torsion abnormalities decades before any clinical symptoms? Do motion patterns cluster in ways that predict future heart failure or atrial fibrillation? These questions require exactly the kind of accurate, scalable, uncertainty-aware motion estimation that this framework provides.

There is also a broader lesson about how to design medical imaging AI. The field spent years improving intensity-based registration losses — more sophisticated similarity metrics, better regularization, smarter network architectures. CardioMorphNet suggests that the gains from switching the objective entirely can exceed the gains from optimizing within a flawed objective. If what you care about is cardiac shape, supervise directly on cardiac shape. The model will figure out the right way to use the raw images to achieve that goal. This is not a subtle algorithmic observation — it is a design philosophy that could be applied across many medical imaging tasks where anatomical structure is more informative than raw intensity.

The gap between this framework and clinical deployment is real but not insurmountable. The computational load during training is the main bottleneck — and hardware continues to improve. Correspondence across subjects will require atlas-based extensions that the Glasgow group is already planning. The evaluation gap on dynamic biomarkers like strain will close once appropriate datasets become available. None of these are fundamental barriers. They are engineering problems with tractable solutions.

Cardiac motion estimation has been a technically important problem for a long time. What changed with CardioMorphNet is the recognition that the right answer was never to get better at measuring pixel similarity between frames. The right answer was to ask the model a directly meaningful question: does your estimated motion make the heart look the way the heart actually looks? That question has a clear, anatomically grounded answer — and it turns out the model is very good at finding it.

Read the Full Paper and Access the Code

The complete CardioMorphNet paper, including full Bayesian derivations, extended ablation results, and clinical index comparisons, is available open-access. The official PyTorch implementation is hosted on GitHub.

📄 Read the Paper (ScienceDirect) 🐙 Official GitHub Repo

Academic Citation:
Movahed, R.A., Rezaee, A., Zakeri, A., Berry, C., Ho, E.S.L., & Gooya, A. (2026). CardioMorphNet: Cardiac motion prediction using a shape-guided Bayesian recurrent deep network. Medical Image Analysis, 113, 104149. https://doi.org/10.1016/j.media.2026.104149

This article is an independent editorial analysis of peer-reviewed research. The PyTorch implementation is an educational reproduction of the paper’s architectural design. For research use, verify all implementation details against the original paper and official repository at github.com/rezamovahed93/CardioMorphNet. This study used the UK Biobank Resource under application number 71392.

Explore More on AI Trend Blend

If this analysis sparked your interest, here is more of what we cover across the site — from medical imaging AI and Bayesian deep learning to adversarial robustness and optimization theory.

HomepageLatest AI research breakdowns and trending papers

→

Machine LearningArchitecture reviews, training techniques, benchmarks

→

AI SecurityAdversarial attacks, robustness, federated defenses

→

Computer VisionObject detection, segmentation, 3D understanding

→

AboutWho we are and how we cover AI research

→

ContactReach out for collaboration, corrections, or tips

→