HP2L Shows How AI Can Now Think Like a Radiologist and Diagnose 23 Brain Disorders Step by Step
A research team from ShanghaiTech University built a hierarchical framework that classifies 23 brain disorders across three diagnostic levels. It achieves 88.43 percent balanced accuracy on 54360 subjects, beats the strongest competing method by more than 5 percentage points, and keeps the performance gap between coarse and fine predictions to about 2 percentage points.
A radiologist does not look at a brain scan and immediately name the exact subtype of a disease. The process starts broad. Is there a vascular problem at all? If so, the next question is whether it is hemorrhagic or ischemic. Only after clearing those earlier steps does the expert narrow things down to the precise subtype. That layered reasoning is years of training compressed into a few seconds. Teaching a deep learning model to replicate that approach, rather than just matching patterns against a flat list of 23 labels, turns out to be genuinely hard. Yuxiao Liu, Kaicong Sun, and their colleagues from ShanghaiTech University, along with Henan Provincial People’s Hospital and Shanghai United Imaging Intelligence, have published a framework that takes this challenge seriously. The numbers they report suggest they got it right.
The Problem with Flat Classification in Brain Imaging
The dominant approach in medical image AI is still what researchers call flat classification. You feed the scan in, produce a probability over every disease class at the same time, and pick the winner. This works reasonably well when a dataset has thousands of examples per class and when the classes look visually different from each other. Brain disorder datasets satisfy neither condition.
The dataset used in this research spans 23 brain disorders. Common conditions like white matter hyperintensity contribute 22934 of the 54360 total subjects while rare ones like penetrating deformity appear in only 1074 cases. That is a 21x ratio. When a flat classifier trains on this kind of distribution it learns to perform well on common classes and becomes quietly unreliable on rare ones. The balanced accuracy metric exists precisely to expose this failure. A standard Vision Transformer baseline drops from 85.53 percent balanced accuracy at the broadest classification level all the way down to 70.65 percent at the fine-grained level. That nearly 15-point collapse is not a minor measurement difference. Patients with rare subtypes are being misclassified at rates that would be unacceptable in any real clinical setting.
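To make the metric concrete, here is a minimal sketch of balanced accuracy computed as the mean of per-class recall. The toy labels and the 95-to-5 split are invented for illustration and are not from the paper, and the paper's multi-label evaluation may aggregate differently. The point is only that a model can score well on plain accuracy while failing every rare case.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy data (hypothetical): 95 common-class scans and 5 rare-class scans.
y_true = ["common"] * 95 + ["rare"] * 5
# A degenerate classifier that always predicts the common class.
y_pred = ["common"] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy rewards the degenerate classifier, while balanced accuracy drops to chance level because recall on the rare class is zero.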
The theoretical solution has been known for years. Hierarchical classification organizes the task into broad categories first and then refines progressively. This mimics how clinicians actually reason and it focuses learning signal on the structural relationships between disease classes. But practical implementation of hierarchical classification has a fatal flaw that most research papers quietly skip over.
Once a hierarchical classifier commits to a higher-level decision every prediction after that gets locked into a branch of that decision. If the model incorrectly routes a sub-acute hemorrhage into the infarction branch at level two there is no way to fix that error at level three. The mistake grows and the final prediction ends up doubly wrong. This is the error propagation problem and it is the core motivation behind everything HP2L sets out to do.
Most hierarchical classifiers lock in their top-level decisions before processing lower levels. One bad routing choice cascades through the full hierarchy and makes the final fine-grained prediction worse than a flat classifier would produce. HP2L breaks this pattern by allowing prompt tokens to be dynamically refined at each level using evidence from class-specific prototypes. This enables cross-level correction rather than rigid top-down propagation.
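The error propagation problem can be illustrated with a toy example. The sketch below uses an invented two-branch taxonomy and made-up scores (none of these names or numbers come from the paper). It compares hard top-down routing, which commits to a level-two branch first, with a prediction that is free to follow the fine-grained evidence.

```python
# Hypothetical two-branch taxonomy: level-2 class -> its level-3 children.
children = {
    "hemorrhage": ["acute_hemorrhage", "subacute_hemorrhage"],
    "infarction": ["acute_infarction", "chronic_infarction"],
}

# Made-up scores for one scan whose true label is subacute_hemorrhage.
level2_scores = {"hemorrhage": 0.48, "infarction": 0.52}  # narrow mistake
level3_scores = {
    "acute_hemorrhage": 0.10,
    "subacute_hemorrhage": 0.70,  # strong fine-grained evidence
    "acute_infarction": 0.15,
    "chronic_infarction": 0.05,
}

def hard_routed_prediction(l2, l3, tree):
    """Commit to the best level-2 branch, then choose only among its children."""
    branch = max(l2, key=l2.get)
    return max(tree[branch], key=l3.get)

def unconstrained_prediction(l3):
    """Choose the best level-3 class without locking in a branch."""
    return max(l3, key=l3.get)

print(hard_routed_prediction(level2_scores, level3_scores, children))  # acute_infarction
print(unconstrained_prediction(level3_scores))                         # subacute_hemorrhage
```

A narrow mistake at level two forces the routed classifier onto the wrong leaf even though the level-three evidence clearly favors the correct one. HP2L's prompt refinement is a softer mechanism than either of these two extremes; the sketch only shows why rigid routing is fragile.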
Three Innovations That Make HP2L Work
The HP2L framework stands for Hierarchical Prompt and Prototype Learning. It introduces three interconnected components that together address the error propagation problem while keeping all the benefits of working hierarchically.
A Hierarchical ViT Backbone That Handles Each Level Differently
The backbone stacks three level-specific Vision Transformer blocks and each one handles a different diagnostic level. The first block is calibrated for broad discrimination between vascular, occupying, and developmental lesion categories. The second handles intermediate distinctions like hemorrhage versus infarction or tumor versus degeneration. The third handles fine-grained classification across all 23 subtypes.
Each block contains two sequential sub-units. The Prompting Transformer Block processes an extended token sequence that includes image patch tokens, a classification token, and a dedicated prompt token. The prompt token carries the accumulated diagnostic context from higher levels and it conditions the self-attention mechanism directly. The model’s visual reading is shaped by what it already suspects about the disease category. After the Prompting Transformer Block runs, the Vanilla Transformer Block refines the patch and classification tokens without the prompt token present. This gives the image features room to develop independently before the next level’s classification head reads from the updated classification token.
Prompt Learning That Refines Rather Than Just Propagates
This is where the design diverges most sharply from earlier work. Instead of passing the prompt token from one level to the next unchanged, HP2L runs it through a cross-attention update against the class-specific prototype tokens at each level. The attention weights measure how much the current prompt should incorporate each class’s semantic signature. The resulting refined prompt is a dynamically weighted blend of class prototypes that is shaped by what the model currently believes about the input scan.
This distinction matters enormously in practice. Fixed propagation means a wrong high-level prompt poisons every level that follows. Prototype-guided refinement means a wrong high-level prompt can be corrected by the evidence emerging at lower levels. If the image features at level two strongly suggest hemorrhage even though level one wavered between vascular and occupying categories, the cross-attention will weight the hemorrhage prototype heavily and the updated prompt will steer level three toward hemorrhage subtypes. The hierarchy becomes bidirectionally informed rather than strictly top-down.
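The refinement step can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' code: it assumes identity query, key, and value projections, a single head, and tiny 2-D vectors invented for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def refine_prompt(prompt, prototypes):
    """Attend from one prompt vector over class prototypes.

    Simplifying assumption: identity Q/K/V projections and a single head.
    Returns the attention weights and the prototype blend they produce.
    """
    d = len(prompt)
    scores = [sum(q * k for q, k in zip(prompt, p)) / math.sqrt(d)
              for p in prototypes]
    weights = softmax(scores)
    refined = [sum(w * p[i] for w, p in zip(weights, prototypes))
               for i in range(d)]
    return weights, refined

# Hypothetical 2-D prototypes for two level-2 classes.
prototypes = [[4.0, 0.0],   # stands in for "hemorrhage"
              [0.0, 4.0]]   # stands in for "infarction"
prompt = [3.0, 1.0]         # prompt whose features lean toward the first class

weights, refined = refine_prompt(prompt, prototypes)
print(weights)
print(refined)
```

Because the weights are a softmax, the refined prompt is a convex combination of the prototypes, and it is pulled toward whichever class the current evidence matches best.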
Prototype Learning with EMA Stabilization for Rare Classes
The prototypes need to be more than randomly initialized vectors. For the cross-attention mechanism to work correctly each prototype must genuinely represent its class’s imaging characteristics. Not just the last batch of examples and not an average corrupted by outliers, but a stable accumulated summary of what each disease class looks like in the model’s learned feature space.
The prototype update rule is an exponential moving average applied after every training batch. The momentum coefficient is set at 0.99 which is deliberately high. Each update only incorporates 1 percent of the current batch’s class representation. This design choice makes prototypes robust to noisy batches, class imbalance, and the inherent variability of medical imaging data across different scanners and hospital sites. When a particular class is absent from a batch entirely its prototype simply holds its previous value unchanged. For rare classes like penetrating deformity with only 1074 training examples this kind of stability is critical to learning anything meaningful at all.
Without exponential moving average stabilization, prototype tokens trained with standard gradient descent show high variance for rare classes where individual batches may contain zero or one example. EMA provides noise-robust accumulation where each prototype integrates information from every training step rather than only recent ones. The ablation study shows that removing EMA from the prototype update drops balanced accuracy at the finest level from 88.43 percent to 85.90 percent. That loss is concentrated precisely on the rare and clinically important disease subtypes.
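The update rule can be written as a short stand-alone sketch. It is simplified to one label per sample (the paper's setting is multi-label), and the feature vectors and class names are invented for the example.

```python
def ema_update(prototypes, batch_features, batch_labels, momentum=0.99):
    """Return new prototypes after one batch.

    prototypes:     dict mapping class name -> list of floats
    batch_features: list of feature vectors (lists of floats)
    batch_labels:   list of class names, one per feature vector
    """
    updated = {}
    for cls, proto in prototypes.items():
        members = [f for f, y in zip(batch_features, batch_labels) if y == cls]
        if not members:
            # Class absent from this batch: hold the previous value.
            updated[cls] = list(proto)
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated[cls] = [momentum * p + (1 - momentum) * m
                        for p, m in zip(proto, mean)]
    return updated

# Hypothetical batch that contains only the common class.
protos = {"common": [0.0, 0.0], "rare": [1.0, 1.0]}
feats = [[1.0, 2.0], [3.0, 4.0]]
labels = ["common", "common"]

new = ema_update(protos, feats, labels)
print(new["common"])  # moves 1 percent of the way toward the batch mean [2.0, 3.0]
print(new["rare"])    # [1.0, 1.0], unchanged
```

With a momentum of 0.99, a single unusual batch can shift a prototype by at most 1 percent of the distance to that batch's mean, which is what keeps rare-class prototypes from being overwritten by one noisy example.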
How the Training Objective Enforces Consistency Across Levels
HP2L trains end to end with a loss function that has two components. The first is a standard sigmoid-based binary cross-entropy applied independently at each hierarchical level. The second component is a hierarchy-consistency penalty that fires when a child class has a lower predicted probability than its parent in cases where the child is a true positive.
The intuition here is worth spelling out carefully. When a fine-grained class is truly present in a scan its probability should be at least as high as its parent’s probability. The consistency loss only fires in one direction. It never penalizes fine-grained confidence that exceeds coarse-level confidence because that is exactly the behavior you want when fine-grained evidence is strong. The penalty is asymmetric by design. This asymmetry is not an accident. It encodes the clinical logic that certainty should increase as you gather more specific evidence, not decrease.
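For a single parent and child pair, the one-sided penalty reduces to a hinge on the probability difference. The sketch below uses made-up probabilities, and the function name is chosen here for illustration.

```python
def consistency_penalty(p_parent, p_child, child_is_true):
    """One-sided hinge: penalize only when a true child trails its parent."""
    if not child_is_true:
        return 0.0
    return max(0.0, p_parent - p_child)

lagging = consistency_penalty(0.9, 0.6, True)    # child trails parent: positive penalty
leading = consistency_penalty(0.6, 0.9, True)    # child exceeds parent: no penalty
absent = consistency_penalty(0.9, 0.6, False)    # child not a true label: no penalty

print(lagging, leading, absent)
```

Only the first case contributes to the loss, which is the asymmetry described above.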
The Dataset Behind the Results
The scale and diversity of the validation process deserve their own discussion because they are unusually rigorous for a medical AI paper.
The primary training cohort comes from Henan Provincial People’s Hospital and covers 47227 subjects, with diagnostic labels extracted from radiology reports using a natural language processing pipeline. The team manually verified a random 10 percent subset before trusting the pipeline at scale. Another 6244 subjects come from three public research cohorts: ADNI with 1432 subjects, OASIS with 823 subjects, and NACC with 3989 subjects. These contribute mainly Alzheimer’s disease and mild cognitive impairment labels.
The real test of generalizability comes from two fully independent external cohorts that were never seen during training. The first is from Fuwai Central China Cardiovascular Disease Hospital with 329 cases focusing on cerebral small vessel disease. The second is from the First Hospital of Xi’an with 560 tumor cases. These cohorts come from different institutions, with different scanners and different disease distributions, and the model had no access to any of them during training or validation.
All five MRI sequences (T1, T2, FLAIR, DWI, and ADC) are concatenated as channels. Missing modalities, which are common in heterogeneous multi-center data, are handled by zero-filling the missing channel. The model processes 3D volumes at 1.5 mm isotropic spacing, mapped to a 32-token sequence through a 3D CNN encoder before entering the transformer hierarchy.
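Zero-filling is simple to sketch. The helper below is hypothetical (its name and the flat-list volume representation are assumptions made to keep the example dependency-free). It stacks the five sequences in a fixed channel order and substitutes zeros for any that are missing.

```python
SEQUENCES = ("T1", "T2", "FLAIR", "DWI", "ADC")  # fixed channel order

def stack_modalities(available, num_voxels):
    """Build a 5-channel list in fixed order, zero-filling absent sequences.

    available:  dict mapping sequence name -> flat list of voxel intensities
    num_voxels: expected number of voxels per volume
    """
    channels = []
    for name in SEQUENCES:
        volume = available.get(name)
        if volume is None:
            channels.append([0.0] * num_voxels)  # missing modality
        elif len(volume) != num_voxels:
            raise ValueError(f"{name} has {len(volume)} voxels, expected {num_voxels}")
        else:
            channels.append(list(volume))
    return channels

# Toy subject with only T1 and DWI available (two voxels each).
channels = stack_modalities({"T1": [1.0, 2.0], "DWI": [3.0, 4.0]}, num_voxels=2)
print(channels)
# [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
```

Keeping the channel order fixed means the network always sees a given sequence in the same input channel, whether or not it was acquired for a particular subject.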
Where HP2L Pulls Clearly Ahead of Everything Else
HP2L achieves 88.43 percent balanced accuracy at the fine-grained third level. The best-performing hierarchical baseline, TransHP, reaches 83.14 percent, while HPDT lands at 80.01 percent. That 5.29 point gap over the best prior method (8.42 points over HPDT) on a task with 23 classes and severe class imbalance is not a marginal improvement. It is the difference between a system that handles common diseases adequately and one that handles the full spectrum of presentations a radiologist actually encounters.
| Method | Level 1 BAcc | Level 1 AUC | Level 3 BAcc | Level 3 AUC | BAcc Drop |
|---|---|---|---|---|---|
| ViT | 85.53% | 85.22% | 70.65% | 71.87% | 14.88% |
| PromptViT | 87.19% | 85.77% | 73.11% | 73.21% | 14.08% |
| TransHP | 89.22% | 91.37% | 83.14% | 82.37% | 6.08% |
| HPDT | 87.23% | 87.01% | 80.01% | 80.12% | 7.22% |
| HP2L (Ours) | 90.45% | 90.03% | 88.43% | 87.58% | 2.02% |
Performance comparison across hierarchical levels. HP2L’s 2.02 percent drop from level 1 to level 3 is dramatically smaller than all competing methods. All improvements are statistically significant at p less than 0.05 via paired bootstrap.
The performance gap metric, the drop in balanced accuracy from level 1 to level 3, is where HP2L’s design advantage is most visible: 2.02 percentage points compared to 14.88 for a vanilla ViT. That comparison reflects how well each method holds up as the labels become finer. The flat baselines lose the most accuracy at the fine-grained level, and the earlier hierarchical methods still lose several points there because errors made at coarse levels propagate downward. For HP2L the hierarchy is a genuine benefit at every level because the prompt refinement mechanism contains the damage from any single wrong prediction.
The Long Tail Diseases Where Gains Matter Most Clinically
Fine-grained AUC comparisons across individual disease subtypes reveal where the improvement actually lands. Penetrating deformity with just 1074 training cases improves from 72.25 percent AUC under TransHP to 80.94 percent under HP2L. Mild cognitive impairment improves from 76.47 percent to 83.46 percent. All three hemorrhage subtypes show consistent gains.
The external cohort results reinforce this story even further. On the cerebral small vessel disease cohort, cerebral microbleeds improve from 66.10 percent AUC under the ViT baseline to 83.88 percent under HP2L. On the tumor cohort metastatic tumor classification improves from 70.28 percent to 86.64 percent. These improvements appear on data from entirely different hospitals with different equipment and the model had no exposure to any of it during training.
“HP2L can revise a suboptimal higher-level preference and reach the correct final label whereas other methods remain on an incorrect trajectory throughout the hierarchy.” Liu et al. Medical Image Analysis 2026
What the Attention Maps Reveal About How the Model Reasons
The attention dynamics across hierarchical levels are one of the most clinically compelling parts of this paper. At each level the prompt token attends over image patches and the resulting attention maps shift in a medically coherent way as the hierarchy deepens.
For hemorrhage cases the level one attention spreads broadly over the lateral ventricles. By level three it has narrowed to the boundary of the specific lesion. That is the exact feature radiologists use to distinguish acute from chronic hemorrhage based on signal intensity at the hematoma margins. For infarction cases attention progressively concentrates on the infarction center. This matches clinical practice where subtype discrimination depends on the lesion core rather than surrounding edema. For tumors early attention covers both the mass and surrounding edema while final-level attention isolates the tumor center itself. This reflects the clinical importance of distinguishing primary tumor morphology from secondary reactive tissue changes.
The prototype evolution during training is equally revealing. At epoch 10 all prototype tokens cluster together with no meaningful separation. By epoch 50 the coarse-level prototypes begin to pull apart. By epoch 100 the full hierarchical structure of the taxonomy has emerged in the embedding space. Fine-grained subtypes cluster around their parent prototypes and the three broad groups are well separated. The model has learned the same organizational structure that clinicians use and it arrived there without that structure being explicitly enforced beyond the hierarchical loss function.
What the Ablation Study Tells Us About Each Design Choice
The ablation results are worth reading carefully because they reveal which design choices matter and by exactly how much.
On prompt configuration, removing all prompting drops level 3 balanced accuracy from 88.43 percent to 71.28 percent. Adding static one-hot prompts barely helps and gets to 72.26 percent. Learnable prompts without prototype guidance reach 80.47 percent. Only when learnable prompts are updated through prototype cross-attention does performance jump to 88.43 percent. The prototype refinement mechanism is responsible for roughly 8 of the 17 percentage points gained over the no-prompt baseline.
On prototype configuration, removing prototypes entirely drops level 3 balanced accuracy to 73.75 percent. Fixed one-hot prototype tokens reach 79.91 percent. Learnable prototypes without EMA reach 85.90 percent. HP2L with EMA achieves 88.43 percent. The stabilization provided by exponential moving average contributes a 2.53-point gain concentrated precisely on the rare fine-grained classes where individual batch variance is highest.
On hierarchy depth, no hierarchy produces 74.23 percent balanced accuracy. A two-level hierarchy reaches 82.38 percent. Three levels reach 88.43 percent. Each additional level of hierarchy contributes between roughly 6 and 8 percentage points (8.15 for the second level, 6.05 for the third). That steady improvement supports the core architectural decision.
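The ablation deltas quoted in this section can be rechecked with simple arithmetic. The snippet below only restates the level 3 balanced accuracy figures given above and subtracts them; the variable names are chosen here for illustration.

```python
FULL = 88.43  # HP2L level 3 balanced accuracy (percent)

def gain(after, before):
    """Difference in percentage points, rounded to two decimals."""
    return round(after - before, 2)

total_prompt_gain = gain(FULL, 71.28)   # full model vs. no prompting
prototype_guidance = gain(FULL, 80.47)  # vs. learnable prompts without prototypes
ema_gain = gain(FULL, 85.90)            # vs. learnable prototypes without EMA
second_level = gain(82.38, 74.23)       # two levels vs. no hierarchy
third_level = gain(FULL, 82.38)         # three levels vs. two

print(total_prompt_gain, prototype_guidance)  # 17.15 7.96
print(ema_gain)                               # 2.53
print(second_level, third_level)              # 8.15 6.05
```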
Complete PyTorch Implementation of HP2L
The following is a complete PyTorch implementation of HP2L covering Sections 3.1 through 3.4 of the paper. It includes the full three-level hierarchical ViT backbone with Prompting Transformer Block and Vanilla Transformer Block sub-units, the cross-attention Prompt Learning module, the EMA-based Prototype Learning module, the level-wise classification heads, the combined training loss with hierarchy consistency penalty, and a runnable smoke test on synthetic 3D brain MRI data.
# ===========================================================================
# HP2L Hierarchical Prompt and Prototype Learning for Brain Disorder Diagnosis
# Paper "A hierarchical prompt and prototype learning framework for brain
# disorder classification"
# Authors Yuxiao Liu et al. ShanghaiTech / Henan Provincial People's Hospital
# Journal Medical Image Analysis 112 (2026) 104063
# ===========================================================================
from __future__ import annotations
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List, Dict, Optional, Tuple
# SECTION 1 Multi-Head Self-Attention and Feed-Forward Utilities
class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention used in both PTB and VTB.

    For PTB the input sequence includes the prompt token.
    For VTB the prompt token is excluded and only patch and CLS tokens remain.
    All math follows Equations 4 through 8 in the paper.
    """

    def __init__(self, D: int, H: int):
        super().__init__()
        self.H = H
        self.d_h = D // H  # per-head dimension; D must be divisible by H
        self.scale = math.sqrt(self.d_h)
        self.W_Q = nn.Linear(D, D, bias=False)
        self.W_K = nn.Linear(D, D, bias=False)
        self.W_V = nn.Linear(D, D, bias=False)
        self.W_O = nn.Linear(D, D, bias=False)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        # Split into heads: (B, N, D) -> (B, H, N, d_h)
        Q = self.W_Q(x).view(B, N, self.H, self.d_h).transpose(1, 2)
        K = self.W_K(x).view(B, N, self.H, self.d_h).transpose(1, 2)
        V = self.W_V(x).view(B, N, self.H, self.d_h).transpose(1, 2)
        attn = self.dropout(torch.softmax(Q @ K.transpose(-2, -1) / self.scale, dim=-1))
        # Merge heads back: (B, H, N, d_h) -> (B, N, D)
        out = (attn @ V).transpose(1, 2).contiguous().view(B, N, D)
        return self.W_O(out)

class FeedForwardNetwork(nn.Module):
    """Position-wise FFN with GELU activation as in standard ViT blocks."""

    def __init__(self, D: int, ffn_dim: Optional[int] = None):
        super().__init__()
        ffn_dim = ffn_dim or D * 4
        self.net = nn.Sequential(
            nn.Linear(D, ffn_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(ffn_dim, D),
            nn.Dropout(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
# SECTION 2 Prompting Transformer Block (PTB)
class PromptingTransformerBlock(nn.Module):
    """PTB processes the full token sequence including the prompt token.

    Implements the operation described in Equations 1 through 9.
        Z_l = [x_cls, x_pro, x_1, ..., x_N]
    After PTB the updated prompt token x_pro_prime is extracted and forwarded
    to the Prompt Learning module. CLS and patch tokens go to VTB.
    """

    def __init__(self, D: int, H: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.attn = MultiHeadSelfAttention(D, H)
        self.ln2 = nn.LayerNorm(D)
        self.ffn = FeedForwardNetwork(D)

    def forward(self, x_cls: torch.Tensor, x_pro: torch.Tensor,
                X: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        x_cls (B, 1, D) classification token
        x_pro (B, 1, D) prompt token
        X     (B, N, D) patch tokens
        Returns updated (x_cls_prime, x_pro_prime, X_prime)
        """
        Z = torch.cat([x_cls, x_pro, X], dim=1)  # (B, N+2, D)
        Z = Z + self.attn(self.ln1(Z))
        Z = Z + self.ffn(self.ln2(Z))
        x_cls_p = Z[:, 0:1, :]  # updated CLS
        x_pro_p = Z[:, 1:2, :]  # updated prompt
        X_p = Z[:, 2:, :]       # updated patches
        return x_cls_p, x_pro_p, X_p
# SECTION 3 Vanilla Transformer Block (VTB)
class VanillaTransformerBlock(nn.Module):
    """VTB refines CLS and patch tokens without the prompt token.

    Implements Equation 10 [x_cls_next, X_next] = B_van([x_cls_prime, X_prime]).
    Excluding the prompt token here lets image features develop independently
    of diagnostic priors, creating a clean separation of concerns.
    """

    def __init__(self, D: int, H: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.attn = MultiHeadSelfAttention(D, H)
        self.ln2 = nn.LayerNorm(D)
        self.ffn = FeedForwardNetwork(D)

    def forward(self, x_cls: torch.Tensor,
                X: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Operates on CLS plus patch tokens only. No prompt token involved."""
        Z = torch.cat([x_cls, X], dim=1)  # (B, N+1, D)
        Z = Z + self.attn(self.ln1(Z))
        Z = Z + self.ffn(self.ln2(Z))
        return Z[:, 0:1, :], Z[:, 1:, :]  # updated CLS and X
# SECTION 4 Prompt Learning Module
class PromptLearningModule(nn.Module):
    """Refines the prompt token through cross-attention over class prototypes.

    Implements Equations 11 through 15.
        Q = x_pro_prime times W_Q   (query from updated prompt token)
        K = P_l times W_K           (keys from class prototype tokens)
        V = P_l times W_V           (values from class prototype tokens)
        alpha_i = Softmax_i(QK^T / sqrt(d))
        x_pro_next = sum_i alpha_i V_i
    This lets the prompt for the next level be a dynamically weighted blend
    of class-specific semantic anchors, enabling evidence-driven correction
    instead of fixed propagation of whatever the prior level decided.

    Note that this implementation additionally applies an output projection,
    a residual connection to x_pro_prime, and a LayerNorm on top of the blend.
    """

    def __init__(self, D: int, d: Optional[int] = None):
        super().__init__()
        d = d or D
        self.W_Q = nn.Linear(D, d, bias=False)
        self.W_K = nn.Linear(D, d, bias=False)
        self.W_V = nn.Linear(D, d, bias=False)
        self.out_proj = nn.Linear(d, D, bias=False)
        self.scale = math.sqrt(d)
        self.ln = nn.LayerNorm(D)

    def forward(self, x_pro_prime: torch.Tensor,
                prototypes: torch.Tensor) -> torch.Tensor:
        """
        x_pro_prime (B, 1, D) updated prompt token from PTB
        prototypes  (C_l, D)  class prototype tokens for current level
        Returns x_pro_next (B, 1, D) refined prompt token for next level
        """
        B = x_pro_prime.shape[0]
        P = prototypes.unsqueeze(0).expand(B, -1, -1)  # (B, C, D)
        Q = self.W_Q(x_pro_prime)                      # (B, 1, d)
        K = self.W_K(P)                                # (B, C, d)
        V = self.W_V(P)                                # (B, C, d)
        alpha = torch.softmax(Q @ K.transpose(-2, -1) / self.scale, dim=-1)  # (B, 1, C)
        attended = alpha @ V                           # (B, 1, d)
        x_pro_next = x_pro_prime + self.out_proj(attended)  # residual update
        return self.ln(x_pro_next)
# SECTION 5 Prototype Learning with EMA Update
class PrototypeLearning(nn.Module):
    """Maintains per-class prototype tokens at each diagnostic level.

    Prototypes update after each mini-batch using exponential moving average
    as described in Equations 16 through 17.
        x_bar_c = mean of CLS tokens for class c in current batch
        p_c_t = alpha times p_c_{t-1} + (1 - alpha) times x_bar_c
    When class c is absent from the batch its prototype retains its previous
    value unchanged. This provides stability for rare classes that may appear
    in only a fraction of all training batches.

    Parameters
        C      number of disease classes at this level
        D      token dimension (768 in the paper)
        alpha  EMA momentum coefficient (default 0.99 per paper)
    """

    def __init__(self, C: int, D: int, alpha: float = 0.99):
        super().__init__()
        self.C = C
        self.alpha = alpha
        # Prototype tokens initialized from N(0,1) as in Section 3.3
        self.register_buffer('prototypes', torch.randn(C, D))

    def get_prototypes(self) -> torch.Tensor:
        """Return current prototype embeddings of shape (C, D)."""
        return self.prototypes

    @torch.no_grad()
    def update(self, x_cls_batch: torch.Tensor, labels: torch.Tensor) -> None:
        """EMA prototype update for one training batch.

        x_cls_batch (B, D) CLS tokens for the current batch
        labels      (B, C) multi-label ground-truth at this level
                    (binary matrix where a subject can have multiple labels)
        """
        for c in range(self.C):
            mask = labels[:, c].bool()
            if mask.sum() > 0:
                x_bar_c = x_cls_batch[mask].mean(dim=0)
                self.prototypes[c] = (
                    self.alpha * self.prototypes[c]
                    + (1 - self.alpha) * x_bar_c
                )
            # If class is absent from batch the prototype holds its value
# SECTION 6 Level-Specific ViT Block
class HierarchicalViTBlock(nn.Module):
    """One complete level of the HP2L backbone combining PTB, Prompt Learning, and VTB.

    The full inference loop at level l:
        1. PTB                  process [CLS, prompt, patches] together to get updated tokens
        2. Prompt Learning      refine prompt through cross-attention over prototypes
        3. VTB                  refine CLS and patch tokens without the prompt token present
        4. Classification head  predict logits from updated CLS token
        5. Prototype update     EMA update of prototype tokens (training only)
    """

    def __init__(self, D: int, H: int, C_l: int):
        super().__init__()
        self.ptb = PromptingTransformerBlock(D, H)
        self.prompt_learn = PromptLearningModule(D)
        self.vtb = VanillaTransformerBlock(D, H)
        self.prototype_learn = PrototypeLearning(C_l, D)
        self.cls_head = nn.Linear(D, C_l)
        self.C_l = C_l

    def forward(self, x_cls, x_pro, X, labels=None) -> Dict:
        """
        x_cls  (B, 1, D) CLS token from previous level
        x_pro  (B, 1, D) prompt token from previous level
        X      (B, N, D) patch tokens
        labels (B, C_l)  ground-truth labels for EMA update (training only)
        Returns dict with next-level tokens, logits, and updated CLS.
        """
        # Step 1 Prompting Transformer Block
        x_cls_p, x_pro_p, X_p = self.ptb(x_cls, x_pro, X)
        # Step 2 Prompt Learning through prototype cross-attention
        prototypes = self.prototype_learn.get_prototypes()
        x_pro_next = self.prompt_learn(x_pro_p, prototypes)
        # Step 3 Vanilla Transformer Block to refine CLS and patches
        x_cls_next, X_next = self.vtb(x_cls_p, X_p)
        # Step 4 Classification head predicts logits from updated CLS
        logits = self.cls_head(x_cls_next.squeeze(1))  # (B, C_l)
        # Step 5 Prototype EMA update (training only)
        if labels is not None and self.training:
            self.prototype_learn.update(x_cls_next.squeeze(1).detach(), labels)
        return {'x_cls': x_cls_next, 'x_pro': x_pro_next, 'X': X_next, 'logits': logits}
# SECTION 7 3D CNN Image Feature Encoder
class BrainMRIEncoder(nn.Module):
    """3D CNN encoder that maps (B, 5, H, W, D_depth) to (B, N, token_dim).

    Performs four spatial downsampling stages as in Section 4.2. Each stage
    halves every spatial dimension, so N is the number of remaining spatial
    positions (for example, a 64 x 64 x 32 input gives 4 * 4 * 2 = 32 tokens).
    Input is 5 MRI sequences (T1, T2, FLAIR, DWI, ADC) concatenated as channels.
    Missing modalities are zero-filled before concatenation.
    """

    def __init__(self, in_channels: int = 5, D: int = 768):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(8, 32), nn.GELU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(16, 64), nn.GELU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(16, 128), nn.GELU(),
            nn.Conv3d(128, 32, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(8, 32), nn.GELU(),
        )
        # Projects the 32 CNN feature channels of each position to the token dimension
        self.token_proj = nn.Linear(32, D)
        self.D = D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x (B, 5, H, W, D_depth) multi-modal 3D brain MRI
        Returns (B, N, D) flattened spatial tokens
        """
        feat = self.cnn(x)
        B, C, H, W, Dp = feat.shape
        # Move channels last, then flatten the spatial grid into a token axis
        tokens = feat.permute(0, 2, 3, 4, 1).reshape(B, H * W * Dp, C)
        return self.token_proj(tokens)
# SECTION 8 HP2L Full Framework
class HP2L(nn.Module):
    """HP2L Hierarchical Prompt and Prototype Learning for Brain Disorder Diagnosis.

    Implements the full three-level hierarchical classification framework
    described in Sections 3.1 through 3.4 and Algorithm 1 of the paper.

    Three diagnostic levels:
        Level 1 C1 = 4  (vascular, occupying, developmental, normal)
        Level 2 C2 = 6  (hemorrhage, infarction, WMH, tumor, degeneration, deformity)
        Level 3 C3 = 16 (23 BDs represented as 15 disease subtypes plus normal)

    Single forward pass at inference produces predictions for all three levels.
    No additional post-processing is required for clinical deployment.
    """

    def __init__(self,
                 class_hierarchy: Optional[List[int]] = None,
                 D: int = 768,
                 H: int = 12,
                 in_channels: int = 5):
        super().__init__()
        # Avoid a mutable default argument; fall back to the 4/6/16 hierarchy
        if class_hierarchy is None:
            class_hierarchy = [4, 6, 16]
        self.L = len(class_hierarchy)
        self.D = D
        self.encoder = BrainMRIEncoder(in_channels, D)
        self.x_cls_init = nn.Parameter(torch.randn(1, 1, D))
        self.x_pro_init = nn.Parameter(torch.randn(1, 1, D))
        self.levels = nn.ModuleList([
            HierarchicalViTBlock(D, H, C_l)
            for C_l in class_hierarchy
        ])
        self.label_smoothing = 0.1

    def forward(self, images: torch.Tensor,
                labels_per_level: Optional[List[torch.Tensor]] = None) -> Dict:
        """
        images           (B, 5, H, W, D_depth) multi-modal 3D brain MRI
        labels_per_level list of L label tensors each (B, C_l) for EMA update
        Returns dict with logits_per_level and cls_per_level
        """
        B = images.shape[0]
        X = self.encoder(images)
        x_cls = self.x_cls_init.expand(B, -1, -1).clone()
        x_pro = self.x_pro_init.expand(B, -1, -1).clone()
        logits_per_level = []
        cls_per_level = []
        for l, level_block in enumerate(self.levels):
            labels_l = labels_per_level[l] if labels_per_level is not None else None
            out = level_block(x_cls, x_pro, X, labels=labels_l)
            # Tokens produced at this level feed the next level
            x_cls = out['x_cls']
            x_pro = out['x_pro']
            X = out['X']
            logits_per_level.append(out['logits'])
            cls_per_level.append(x_cls)
        return {'logits_per_level': logits_per_level, 'cls_per_level': cls_per_level}
# SECTION 9 Loss Functions
def level_classification_loss(logits: torch.Tensor, labels: torch.Tensor,
                              smoothing: float = 0.1) -> torch.Tensor:
    """Level-wise sigmoid binary cross-entropy with label smoothing (Eq. 18).

    Uses BCEWithLogits applied element-wise over all class labels.
    Label smoothing prevents overconfidence on noisy multi-center labels.
    """
    # Smooth binary targets toward 0.5: 1 -> 1 - s/2 and 0 -> s/2
    labels_smooth = labels.float() * (1 - smoothing) + 0.5 * smoothing
    return F.binary_cross_entropy_with_logits(logits, labels_smooth)

def hierarchy_consistency_loss(logits_per_level, labels_per_level,
                               parent_map) -> torch.Tensor:
    """Hierarchy-consistency penalty across adjacent diagnostic levels (Eq. 19).

    For each true child label c penalize max(0, p_parent minus p_child).
    This asymmetric penalty never suppresses fine-grained confidence that
    exceeds coarse confidence. It only fires when fine-grained evidence lags
    behind coarse evidence for a true positive class.

    logits_per_level list of (B, C_l) logit tensors
    labels_per_level list of (B, C_l) ground-truth label tensors
    parent_map       list of dicts mapping child class index to parent index
    """
    consist_loss = torch.tensor(0.0, device=logits_per_level[0].device)
    for l in range(len(logits_per_level) - 1):
        probs_coarse = torch.sigmoid(logits_per_level[l])
        probs_fine = torch.sigmoid(logits_per_level[l + 1])
        labels_fine = labels_per_level[l + 1].float()
        for child_idx, parent_idx in parent_map[l].items():
            # Skip mappings that point outside the class ranges of either level
            if child_idx >= probs_fine.shape[1] or parent_idx >= probs_coarse.shape[1]:
                continue
            p_child = probs_fine[:, child_idx]
            p_parent = probs_coarse[:, parent_idx]
            y_child = labels_fine[:, child_idx]
            penalty = y_child * torch.clamp(p_parent - p_child, min=0)
            consist_loss = consist_loss + penalty.mean()
    return consist_loss

def hp2l_total_loss(logits_per_level, labels_per_level, parent_map,
                    lambda_levels=(1.0, 1.0, 1.0), lambda_consist=0.5,
                    label_smoothing=0.1) -> Dict:
    """Full training objective (Eq. 20) level-wise BCE plus consistency penalty.

    Returns dict with total, cls_losses, and consist for logging.
    """
    cls_losses = [
        level_classification_loss(logits_per_level[l], labels_per_level[l], label_smoothing)
        for l in range(len(logits_per_level))
    ]
    total_cls = sum(lambda_levels[l] * cls_losses[l] for l in range(len(cls_losses)))
    consist = hierarchy_consistency_loss(logits_per_level, labels_per_level, parent_map)
    total = total_cls + lambda_consist * consist
    return {'total': total, 'cls_losses': cls_losses, 'consist': consist}
# SECTION 10 Smoke Test on Synthetic 3D Brain MRI Data
def _smoke_test():
    """End-to-end smoke test of HP2L on synthetic 3D brain MRI data.

    Verifies:
        Forward pass through the full 3-level hierarchical backbone
        EMA prototype updates during training
        Loss computation with hierarchy-consistency penalty
        Gradient flow through all components
    """
    print("=" * 65)
    print("HP2L Smoke Test Synthetic 3D Brain MRI Data")
    print("Paper Liu et al. Medical Image Analysis 112 (2026) 104063")
    print("=" * 65)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    B = 4
    H_vol, W_vol, D_vol = 64, 64, 32
    D_model = 256
    num_heads = 8
    class_hierarchy = [4, 6, 16]
    images = torch.randn(B, 5, H_vol, W_vol, D_vol, device=device)
    labels = [torch.randint(0, 2, (B, C), device=device).float() for C in class_hierarchy]
    # Synthetic child -> parent index maps for levels 1->2 and 2->3
    parent_map = [
        {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2},
        {i: i // 3 for i in range(16)},
    ]
    model = HP2L(class_hierarchy=class_hierarchy, D=D_model, H=num_heads, in_channels=5).to(device)
    model.train()
    total_params = sum(p.numel() for p in model.parameters())
    print(f"\nDevice {device}")
    print(f"Total parameters {total_params:,}")
    print(f"Input shape {list(images.shape)}")
    print(f"Class hierarchy {class_hierarchy}")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
    out = model(images, labels_per_level=labels)
    loss_dict = hp2l_total_loss(out['logits_per_level'], labels, parent_map)
    optimizer.zero_grad()
    loss_dict['total'].backward()
    optimizer.step()
    print(f"\n{'─'*45}")
    print(f"Total loss {loss_dict['total'].item():.4f}")
    print(f"Level 1 cls loss {loss_dict['cls_losses'][0].item():.4f}")
    print(f"Level 2 cls loss {loss_dict['cls_losses'][1].item():.4f}")
    print(f"Level 3 cls loss {loss_dict['cls_losses'][2].item():.4f}")
    print(f"Consistency loss {loss_dict['consist'].item():.4f}")
    print(f"Logit shapes {[list(l.shape) for l in out['logits_per_level']]}")
    print(f"{'─'*45}")
    model.eval()
    with torch.no_grad():
        out_inf = model(images)
    probs = [torch.sigmoid(l) for l in out_inf['logits_per_level']]
    print(f"\nInference level 3 mean probability {probs[2].mean().item():.3f}")
    print("Smoke test passed. HP2L forward and backward cycles OK.")
    print("=" * 65)
if __name__ == '__main__':
_smoke_test()
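To see what the asymmetric consistency penalty does without running the full model, it can help to evaluate it by hand on a toy case. The sketch below is a pure-Python restatement of the per-pair term y_child * max(0, p_parent - p_child) for a single parent and child class. The function name `consistency_penalty` and the three sample probabilities are made up for illustration and are not taken from the paper or the official code.

```python
def consistency_penalty(p_parent, p_child, y_child):
    """Batch mean of y_child * max(0, p_parent - p_child) for one parent/child pair.

    Illustrative helper (hypothetical name), mirroring the per-pair term
    used in hierarchy_consistency_loss above.
    """
    terms = [
        y * max(0.0, pp - pc)
        for pp, pc, y in zip(p_parent, p_child, y_child)
    ]
    return sum(terms) / len(terms)


# Three made-up samples for one (parent, child) pair:
#   sample 0: child is truly present but lags the parent  -> penalized (gap 0.3)
#   sample 1: child is truly present and exceeds parent   -> no penalty
#   sample 2: child is absent (y = 0)                     -> no penalty
p_parent = [0.9, 0.4, 0.9]
p_child = [0.6, 0.8, 0.2]
y_child = [1.0, 1.0, 0.0]

penalty = consistency_penalty(p_parent, p_child, y_child)
print(f"mean penalty: {penalty:.3f}")  # roughly 0.100, i.e. 0.3 averaged over 3 samples
```

Only the first sample contributes, which matches the docstring: the term fires when fine-grained evidence lags coarse evidence for a true positive class and stays at zero otherwise.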
What This Work Opens Up and Where Honest Gaps Remain
The external cohort results suggest something clinically important about HP2L’s ability to generalize. When a model trained on data from one hospital is tested on two different institutions with different scanners, different patient populations, and different disease focuses, the performance does not collapse. That kind of robustness is not guaranteed by the architecture alone. It has to be earned through a combination of multi-center training data, prototype representations that are not tied to specific scanner types, and a hierarchical structure that reflects genuine pathological relationships rather than dataset-specific patterns.
The failure modes are worth understanding too. The interpretability study points to cases where HP2L still struggles. Tiny lesions that fall below the resolution threshold of the spatial attention mechanism remain difficult. Closely related subtypes with overlapping imaging signatures continue to cause confusion. Comorbid presentations are a third challenge. HP2L handles comorbidity better than competing methods and can correctly identify both lacunar infarction and metastatic tumor in the same subject, but the problem is not fully solved, and under-detected secondary findings remain a real clinical risk.
There is also a methodological gap worth acknowledging. The paper evaluates a single expert-defined disease hierarchy. The taxonomy was built in collaboration with experienced neuroradiologists and that collaboration matters. But alternative valid hierarchies exist based on temporal staging, etiology, or treatment pathways. Whether HP2L is equally robust to different hierarchical organizations is only partially tested. A more systematic sensitivity analysis across hierarchy definitions would strengthen the conclusions considerably.
The inter-disease dependency problem is explicitly acknowledged as future work. Real neuroradiology recognizes that vascular burden increases the likelihood of subsequent degenerative findings. Certain genetic profiles co-express multiple tumor types. White matter hyperintensity and lacunar infarction frequently appear together as markers of small vessel disease. A diagnostic system that treats each disorder independently is missing structural information that could substantially improve calibration in the multi-label setting. Building label-dependency priors into the framework through a learned graph structure over the disease taxonomy is an obvious and promising extension.
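One simple starting point for a label-dependency prior is the empirical conditional co-occurrence between disorders, estimated from multi-hot training labels. The sketch below is not the paper's method (the paper leaves this as future work). The function name `conditional_cooccurrence`, the three column labels, and the five toy subjects are all hypothetical and chosen only to illustrate the statistic.

```python
def conditional_cooccurrence(labels):
    """Return M where M[i][j] estimates P(label j present | label i present).

    labels: list of multi-hot rows, one row per subject.
    Rows of M for labels that never occur are left as zeros.
    """
    n_classes = len(labels[0])
    counts = [[0] * n_classes for _ in range(n_classes)]
    for row in labels:
        present = [k for k in range(n_classes) if row[k]]
        for i in present:
            for j in present:
                counts[i][j] += 1  # counts[i][i] is the frequency of label i

    cond = []
    for i in range(n_classes):
        total = counts[i][i]
        cond.append([counts[i][j] / total if total else 0.0 for j in range(n_classes)])
    return cond


# Toy, made-up labels. Columns (hypothetical ordering):
#   0 = white matter hyperintensity, 1 = lacunar infarction, 2 = metastatic tumor
toy_labels = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
]

cond = conditional_cooccurrence(toy_labels)
print(cond[0][1])  # P(lacunar | WMH) on the toy data: 3 of 4 WMH subjects
print(cond[1][0])  # P(WMH | lacunar) on the toy data: 3 of 3 lacunar subjects
```

A matrix like this could serve as the initial edge weights of a graph over the taxonomy. Whether it improves calibration in practice is an open question that the source article does not answer.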
From a deployment perspective, the computational demands deserve attention. Training was performed on four NVIDIA L40 GPUs with 40GB of memory each. The 3D volumetric inputs, five modalities, and full hierarchical backbone combine into a model that is not trivially portable to resource-constrained clinical environments. Inference produces all three diagnostic levels in a single forward pass without post-processing and that is a genuine advantage. But the memory footprint during training would require careful management in any federated or distributed learning setup.
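A quick back-of-the-envelope calculation shows how fast 3D multi-modal inputs grow. The sketch below computes only the raw float32 input tensor size. Intermediate activations and optimizer state typically take far more memory than the inputs during training and are not modeled here. The first shape matches the synthetic smoke test above. The second, larger shape is a hypothetical resolution chosen for illustration, not a figure from the paper.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape (float32 by default)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / (1024 ** 2)


# Batch of 4 subjects, 5 modalities, 64 x 64 x 32 voxels (the smoke-test shape)
smoke_input = tensor_mib((4, 5, 64, 64, 32))

# Same batch at a hypothetical 256 x 256 x 128 resolution (assumption, not from the paper)
large_input = tensor_mib((4, 5, 256, 256, 128))

print(f"smoke-test input: {smoke_input:.1f} MiB")   # 10.0 MiB
print(f"hypothetical input: {large_input:.1f} MiB")  # 640.0 MiB
print(f"scale factor: {large_input / smoke_input:.0f}x")  # 64x
```

Quadrupling in-plane resolution and depth multiplies the voxel count by 64, and activation memory in a volumetric backbone generally scales in the same direction, which is why batch size and patch size need careful management on memory-limited hardware.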
None of these gaps reduce the core contribution. The error propagation problem in hierarchical classification is real. Its consequences for rare disease subtypes are clinically significant. The HP2L mechanism of prototype-guided prompt refinement addresses it directly and measurably. The 8.42-point improvement over the best prior method on 54360 subjects is not a marginal gain. It represents the difference between a system that performs adequately on common diseases and one that handles the full range of presentations a radiologist actually encounters day to day.
The deeper point is about what kind of structure we choose to build into medical AI. Flat classifiers are convenient. They require no knowledge of disease taxonomy and no decisions about how to organize the label space. HP2L requires both. The argument from this paper is that the difficulty is worth it. A model structured around clinically meaningful relationships between diseases will be more accurate, more interpretable, and more robust to distribution shift than one treating diagnosis as an unstructured label assignment task. That argument is supported by 54360 subjects, two independent external cohorts, and attention maps that a clinician could walk a medical student through in a teaching session. That is a strong and honest evidentiary foundation.
Read the Full Paper and Access the Official Code
The complete HP2L paper, supplementary analyses, and the official code and data release are available via the links below. The open-access paper includes the six-cohort experiments, full ablation details, and per-disorder AUC breakdowns.
Liu Y., Sun K., Wu Y., Lin X., Bai Y., Yang L., Zhou W., Yuan H., Wu X., He Y., Wu Q., Che Z., Zhan Y., Zhou S., Wu D., Shi F., Wang M., and Shen D. (2026). A hierarchical prompt and prototype learning framework for brain disorder classification. Medical Image Analysis, 112, 104063. https://doi.org/10.1016/j.media.2026.104063
This article is an independent editorial analysis of peer-reviewed research. The PyTorch implementation is an educational reproduction and may differ from the official repository in engineering details. For research use, please verify against the official code and original paper. The original research is supported by the National Natural Science Foundation of China and the China Ministry of Science and Technology.
