Bayesian Multiclass Segmentation for Remote Sensing: BCNN + VAE + User Priors Explained | AI Trend Blend

Remote Sensing AI · IEEE Transactions on Geoscience and Remote Sensing, Vol. 64, 2026 · 22 min read

The Segmentation Model That Knows What It Doesn’t Know — and Asks You About It

A team from POSTECH and Jeonbuk National University built a Bayesian segmentation framework for aerial imagery that quantifies its own uncertainty pixel by pixel, fuses user correction masks into a variational latent prior, and uses a transformer to propagate domain expertise globally — achieving 68.5% mIOU across six benchmark datasets with just three rounds of user interaction.

Bayesian CNN Remote Sensing Uncertainty Estimation VAE Prior Transformer Query Fusion Interactive Segmentation Test-Time Adaptation Land Cover Mapping DeepGlobe / LoveDA

There is a gap between what a satellite segmentation model produces and what a domain expert actually needs. The model draws crisp categorical boundaries across pixels it has never seen in training; the expert looks at the same image and immediately identifies three or four regions the model got wrong — a patch of bare soil mislabelled as road, a stand of scrubby vegetation classified as forest, a building obscured by shadow that the algorithm missed entirely. The standard response to this gap is retraining. A team from POSTECH decided there was a better answer: build a model that knows where it is uncertain, and let experts correct exactly those places.

Why Standard Segmentation Fails at the Edges of Its Training Distribution

Land cover mapping from aerial imagery is a deterministic enterprise by default. A convolutional neural network encodes an input patch, decodes a class distribution over pixels, and delivers a hard prediction map. What it does not deliver — and what practitioners need — is any honest account of where that prediction is trustworthy. A building at the centre of a high-resolution urban tile and a building half-obscured by a winter shadow at the edge of a rural scene are treated as the same inference problem. They are not.

Two further complications make the standard approach brittle in practice. Domain shift is the first: a model trained on DeepGlobe land cover images from Southeast Asia will degrade when deployed on Massachusetts aerial imagery, because sensor characteristics, acquisition seasons, and spectral responses differ. Test-time adaptation methods — entropy minimisation, self-training, batch normalisation statistics adaptation — can partially recover from this, but they apply uniform adaptation strategies and ignore the human knowledge that could accelerate recovery substantially. Interactive segmentation methods are the second class: SAM2, SAM-RS, and FCA-Net incorporate user annotations as deterministic constraints on the output. Draw a bounding box; the model updates. The problem is that these constraints are hard: they force the prediction, rather than informing it probabilistically, and they provide no mechanism for the model to communicate its residual uncertainty back to the user.

The POSTECH team — Yeongsu Kim, Haeyun Lee, Seo-Yeon Choi, and Kyungsu Lee — set out to address all three failure modes at once: domain generalization, principled user interaction, and calibrated uncertainty estimation. Their framework, published in IEEE Transactions on Geoscience and Remote Sensing, is the first aerial segmentation system to model all three components within a single joint Bayesian probabilistic framework.

Key Takeaway

The framework’s formal objective is to model the posterior distribution $p(\mathbf{Y} \mid \mathbf{I}, \mathcal{U})$ — the segmentation mask given both the image and user corrections — rather than a deterministic mapping. This shift from function to distribution is what makes uncertainty quantification and soft user guidance possible simultaneously.

The Architecture (Bayesian Multiclass Segmentation Model): Three Layers of Probabilistic Reasoning

The proposed system chains three distinct probabilistic mechanisms. Understanding each one individually is essential before seeing how they compose.

Layer 1: The Bayesian CNN Backbone

Standard CNNs treat network weights as fixed parameters estimated by maximum likelihood. The Bayesian CNN (BCNN) used here treats every weight $\mathbf{W}_l$ at layer $l$ as a random variable with an approximate posterior $q(\mathbf{W}_l | \mathcal{D})$ learned from data. The feature extraction process is therefore stochastic:

Eq. 7 — BCNN Feature Extraction $$\mathbf{F} = f_{\text{BCNN}}\!\left(\mathbf{I};\,\{\mathbf{W}_l\}\right), \quad \mathbf{W}_l \sim q(\mathbf{W}_l|\mathcal{D})\;\;\forall l$$

The predictive distribution over feature maps is then:

Eq. 8 — Marginal Feature Distribution $$p(\mathbf{F}|\mathbf{I}) = \int p(\mathbf{F}|\mathbf{I},\{\mathbf{W}_l\})\prod_l q(\mathbf{W}_l|\mathcal{D})\,d\mathbf{W}_l$$

In practice, this integral is approximated by Monte Carlo sampling: $M = 20$ forward passes are run, each time sampling a different weight realisation (implemented efficiently via Monte Carlo dropout). The empirical variance across the $M$ feature realisations gives the pixelwise epistemic uncertainty map:

Epistemic Uncertainty $$\mathbf{U}(i,j) = \frac{1}{M-1}\sum_{m=1}^{M}\!\left(\mathbf{S}^{(m)}(i,j) – \bar{\mathbf{S}}(i,j)\right)^2$$

This uncertainty map serves two roles: it identifies ambiguous regions for the user to correct, and it guides which regions get prioritised for iterative refinement during test-time adaptation.

Layer 2: User Prior Modulation and Transformer Query Fusion

When a user provides a correction mask $\mathbf{M}_{\text{user}}^{(n)} \in \{0,1\}^{H \times W}$ — a binary map highlighting misclassified or ambiguous regions — the system does not apply it as a hard output constraint. Instead, it is injected directly into the feature space:

Eq. 9 — User Prior Modulation $$\mathbf{F}_{\text{prior}}^{(n)} = \mathbf{F} \odot \mathbf{M}_{\text{user}}^{(n)}$$

This element-wise masking extracts the BCNN features at exactly the locations the user flagged, creating a spatially-localised prior signal. These masked features and the full image features are then projected into token embeddings and concatenated into a single input sequence for the transformer:

Eq. 13–14 — Token Embedding and Initial Query $$\mathbf{T}_F = h_F(\mathbf{F}), \quad \mathbf{T}_{\text{prior}}^{(n)} = h_{\text{prior\_emb}}\!\left(\mathbf{F}_{\text{prior}}^{(n)}\right)$$ $$\mathbf{Q}^0 = \text{concat}\!\left(\mathbf{T}_F,\,\left\{\mathbf{T}_{\text{prior}}^{(n)}\right\}_{n=1}^N\right) \in \mathbb{R}^{M \times d}$$

A stack of $L$ transformer layers then refines these queries through multi-head self-attention. At each layer $l$ and for each query $i$:

Eq. 15 — Multi-Head Attention $$\text{MHAttn}(\mathbf{q},\mathbf{K},\mathbf{V}) = \sum_{h=1}^{H}\text{softmax}\!\left(\frac{\mathbf{q}\left(\mathbf{K}_h\right)^\top}{\sqrt{d_k}}\right)\mathbf{V}_h$$

Eq. 18 — Attention Weights $$a_{ij}^{(l)} = \frac{\exp\!\left(\!\left(\mathbf{q}_i^{(l)}\mathbf{W}_Q\right)\!\left(\mathbf{q}_j^{(l)}\mathbf{W}_K\right)^\top\!/\sqrt{d_k}\right)}{\sum_r \exp\!\left(\!\left(\mathbf{q}_i^{(l)}\mathbf{W}_Q\right)\!\left(\mathbf{q}_r^{(l)}\mathbf{W}_K\right)^\top\!/\sqrt{d_k}\right)}$$

The accumulated update over all $L$ layers gives the final query representation:

Eq. 19 — Accumulated Query $$\mathbf{q}_i^* = \mathbf{q}_i^{(0)} + \sum_{l=0}^{L-1}\sum_{j=1}^{M}\!\left(\prod_{k=l+1}^{L-1}a_{ij}^{(k)}\right)\mathbf{W}_V\mathbf{q}_j^{(l)}$$

The output $\mathbf{Q}^*$ encodes both global image context and user-guided spatial priors in a unified representation. Crucially, the attention mechanism means user corrections at one spatial location propagate to semantically related regions across the entire image — a user correcting one rooftop can implicitly inform the treatment of other rooftops the model has not been explicitly corrected on.

Key Takeaway

The transformer query fusion is not simply concatenating image and user tokens — the attention mechanism allows every query to attend to all others, so user domain corrections at specific pixels propagate globally through the attention graph. This makes the model sensitive to partial user input in a way that hard output constraints cannot achieve.

Layer 3: VAE with User-Conditioned Prior

The final component is a variational autoencoder whose latent prior is conditioned on both the image and the user correction signals. The latent variable $z$ is regularised by:

Eq. 20 — User-Conditioned Prior $$p\!\left(z\,\Big|\,\mathbf{I},\left\{\mathbf{F}_{\text{prior}}^{(n)}\right\}\right) = \mathcal{N}\!\left(z;\,\mu_{\text{prior}},\,\Sigma_{\text{prior}}\right)$$

where $\mu_{\text{prior}}, \Sigma_{\text{prior}} = h_{\text{prior}}\!\left(\text{GP}(\mathbf{F}),\,\{\text{GP}(\mathbf{F}_{\text{prior}}^{(n)})\}\right)$ and GP(·) denotes global pooling. The ELBO objective combines the segmentation reconstruction term with the KL divergence between the variational posterior and this user-conditioned prior:

Eq. 21 — ELBO Objective $$\log p\!\left(\mathbf{Y}\,\Big|\,\mathbf{I},\left\{\mathbf{M}_{\text{user}}^{(n)}\right\}\right) \geq \mathbb{E}_{q_\phi}\!\left[\log p_\theta\!\left(\mathbf{Y}\,|\,z,\mathbf{Q}^*\right)\right]$$ $$-\,\text{KL}\!\left(q_\phi\!\left(z\,|\,\mathbf{I},\left\{\mathbf{M}_{\text{user}}^{(n)}\right\},\mathbf{Y}\right)\,\Big\|\,p\!\left(z\,|\,\mathbf{I},\left\{\mathbf{F}_{\text{prior}}^{(n)}\right\}\right)\right)$$

The total training loss balances supervised segmentation and variational regularisation:

Eq. 24 — Total Loss $$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sup}} + \lambda_{\text{ELBO}}\,\mathcal{L}_{\text{ELBO}}$$

The supervised loss $\mathcal{L}_{\text{sup}}$ is pixelwise cross-entropy or Dice loss over the Monte Carlo mean prediction. The ELBO term $\mathcal{L}_{\text{ELBO}}$ enforces that the latent representation encodes user domain knowledge and remains calibrated under the prior. The joint effect: segmentation masks that are simultaneously accuracy-maximising and uncertainty-calibrated, with uncertainty in the output directly reflecting disagreement among Monte Carlo samples conditioned on user evidence.

Structural Causal Model diagram showing nodes S, U, F, X, Y, E — Figure 1: The dual-stage framework architecture. During training all modules are jointly optimised. During inference, backbone weights are fixed; user corrections modulate only the VAE prior distribution, enabling test-time adaptation without gradient computation on the backbone.

Two Inference Scenarios: With and Without User Input

The framework supports two distinct inference strategies, which can be used independently or sequentially.

Scenario 1 — With user corrections upfront: The user provides N correction masks $\mathcal{U} = \{\mathbf{M}_{\text{user}}^{(n)}\}_{n=1}^N$ before inference. For each input image, the corrections are fused into the latent prior and $M$ posterior samples are drawn: $z^{(m)} \sim p(z | \mathbf{I}, \mathcal{U})$. The final prediction is the Monte Carlo mean, and the uncertainty map is the pixelwise variance across samples.

Scenario 2 — Autonomous first, then interactive: The model first processes the image without any user input ($\mathcal{U} = \emptyset$), using image-only latent samples. If the result is unsatisfactory — either because the model flags high uncertainty or because the user identifies specific errors — the user supplies correction masks for the next round. The iterative update runs for up to $T_{\max} = 3$ refinement steps:

Eq. 30 — Iterative Refinement $$\mathbf{Y}_{\text{pred}}^{(t+1)} = f_{\text{dec}}\!\left(\mathbf{Q}^{*(t)},\,z^{(t)}\right), \quad z^{(t)} \sim p\!\left(z\,|\,\mathbf{I},\,\mathcal{U}^{(t)}\right)$$

The adaptation loop terminates when prediction change falls below a tolerance $\varepsilon$:

Eq. 31 — Convergence Criterion $$\left\|\mathbf{Y}_{\text{pred}}^{(t+1)} – \mathbf{Y}_{\text{pred}}^{(t)}\right\|_1 < \varepsilon$$

In practice, the authors observe rapid stabilisation — essentially all gains are achieved in three steps, with negligible oscillatory behaviour thereafter. The uncertainty map remains informative throughout: even after convergence, $\mathbf{U}(i,j)$ identifies pixels where the model’s residual doubt is highest, giving users a principled signal for where further correction effort would be most productive.

“The Bayesian formulation naturally supports principled quantification of predictive uncertainty at the pixel level, enabling reliable identification of regions with high ambiguity or low model confidence.” — Kim, Lee, Choi & Lee, IEEE Transactions on Geoscience and Remote Sensing (2026)

Results: What Six Datasets and Eleven Baselines Tell Us

The ablation study (Table II in the paper) builds the model incrementally across six variants, making it unusually readable for understanding where performance actually comes from.

Variant	Components Added	DeepGlob	Inria	LoveDA	Avg mIOU
A0	Deterministic baseline	58.2	70.4	46.1	60.5
A1	+ Bayesian CNN	60.1	71.2	48.5	61.9
A2	+ Transformer Query Fusion	62.0	72.5	50.3	63.8
A3	+ VAE Prior	63.1	73.3	51.3	64.3
A4	+ User Priors	64.0	74.0	52.4	65.9
A5	BCNN+Transformer+VAE+User	66.0	75.1	54.3	66.8
A6 (Full)	+ Uncertainty-Driven Selection	68.2	76.3	56.0	68.5

Table 1: Ablation results (mIOU ↑). Each row adds one component. The full model A6 achieves 68.5% average mIOU — an 8-point gain over the deterministic baseline — with the lowest Expected Calibration Error (ECE = 0.089) of any variant.

The gains are additive and consistent. Every component contributes positively, and — critically — no single component dominates. The BCNN adds about 1.4 points by improving calibration and enabling uncertainty-guided selection. The transformer adds 1.9 points by propagating user corrections globally. The VAE adds 0.5 points through latent regularisation. User priors add 1.6 points by directly incorporating correction signals. Uncertainty-driven selection, which uses the uncertainty map to prioritise which regions to refine in subsequent iterations, adds another 1.7 points. The full model beats every baseline tested, including RS3Mamba (64.6%), all TTA methods (up to 66.0%), and all HITL approaches including FCA-Net (67.2%).

Domain Adaptation Performance

The domain adaptation results (Table VII) tell the most practically relevant story. Four source-target transfers were evaluated: Inria→Massachusetts, LoveDA Urban→Rural, DeepGlobe→OpenEarthMap, and OpenEarthMap→DeepGlobe. Source-only models average 49.1% mIOU; SOTA models trained on source data average 52.1%; TTA methods recover to 53.8%; HITL methods reach 55.9%. The full Bayesian framework achieves 57.6% average mIOU — outperforming the best HITL baseline by 1.7 points without any retraining, and without ever seeing target domain labels. The consistency of the improvement across four very different transfer scenarios, each involving different sensor types and geographic regions, suggests the framework is genuinely learning to model distribution shift rather than fitting transfer-specific patterns.

Calibration Tells the Real Story

The accuracy numbers matter, but the calibration numbers matter more for deployment. A high-accuracy model that is systematically overconfident is dangerous in operational contexts — it gives the user no signal that something might be wrong. The full model achieves ECE = 0.089, the lowest of any variant and lower than every baseline tested. The deterministic A0 baseline has ECE = 0.128; the A4-Det variant (which removes BCNN but keeps VAE and user priors) has ECE = 0.103. The improvement in calibration comes specifically from the Bayesian CNN component — the ability to express genuine uncertainty about weight values translates directly into well-calibrated predictive distributions over pixel classes.

Method	Avg mIOU (%)	ECE ↓	FPS	#Params (M)
U-Net	62.2	—	65	25
RS3Mamba	64.6	—	59	32
CertainTTA	65.5	—	53	36
FCA-Net (best HITL)	67.2	—	48	42
A4-Det (no BCNN)	65.9	0.103	50	35
A5 (no uncertainty sel.)	66.8	0.083	44	44
Ours A6 (Full)	68.5	0.089	46	44

Table 2: Overall performance comparison. The 46 FPS inference speed is competitive with all TTA and HITL baselines despite the Monte Carlo sampling overhead. The parameter count of 44M reflects the addition of transformer and VAE modules over a standard backbone.

What Makes the Uncertainty Map Useful in Practice

A model that produces an uncertainty map and then ignores it is performing a parlour trick. The framework actively uses the map in two distinct ways, and the ablation results show that both matter.

First, the uncertainty map guides which regions are presented to the user for correction. Rather than asking users to inspect the full segmentation output for errors — a cognitively expensive process — the system highlights the highest-uncertainty pixels. In the experiments, correction masks were sampled from high-uncertainty regions according to a predefined interaction budget of N ∈ {0, 1, 3, 5, 10} corrections per image. The improvement from N=0 to N=3 is substantial; beyond N=5, diminishing returns set in. This means a user providing just three targeted corrections to the most uncertain regions achieves most of the available performance gain.

Second, the uncertainty map functions as a deployment risk signal. In operational geospatial applications — urban planning, environmental monitoring, disaster response — a segmentation system needs to communicate not just what it predicts but how much the downstream decision-maker should trust each prediction. The pixel-level uncertainty map gives exactly this signal: planners can automatically flag high-uncertainty regions for human review before committing to decisions based on the segmentation output.

The User Prior Consistency Analysis: Do Humans Agree?

A subtle question underlies the whole framework: if user corrections are supposed to encode domain knowledge, do different users actually agree on where corrections are needed? If not, treating user input as a prior is statistically incoherent.

The paper addresses this directly with a prior consistency analysis. For each image, three independent annotators provided N ∈ {1, 3, 5, 10} correction masks using three different interaction modes — point clicks, small polygons, and freehand scribbles. Pairwise consistency was measured using Jaccard index (IoU), Dice coefficient, Cohen’s kappa, and symmetric Hausdorff distance between correction contours.

Eq. 37–39 — Prior Consistency Metrics $$J\!\left(P^{(a)},P^{(b)}\right) = \frac{|P^{(a)} \cap P^{(b)}|}{|P^{(a)} \cup P^{(b)}|}, \quad D\!\left(P^{(a)},P^{(b)}\right) = \frac{2|P^{(a)} \cap P^{(b)}|}{|P^{(a)}| + |P^{(b)}|}$$ $$\kappa = \frac{p_o – p_e}{1 – p_e}, \quad p_o = \frac{1}{HW}\sum_i\mathbb{1}\!\left[P_i^{(a)} = P_i^{(b)}\right]$$

The results show strong inter-annotator agreement, with Dice coefficients consistently above 0.75 across all interaction modes and correction budgets. This validates the probabilistic treatment of user input as a reliable prior: different experts, given the same image and the same task, tend to flag the same ambiguous regions. The framework’s assumption that user corrections carry genuine domain signal — rather than being idiosyncratic noise — is empirically supported.

Limitations and Honest Assessment

The Monte Carlo sampling adds computational overhead. With M=20 forward passes through the BCNN, inference is slower than a single deterministic forward pass, though the 46 FPS reported represents the full-pipeline throughput including sampling — competitive with all baselines tested. Whether that remains true on hardware without A6000 GPUs is an open question the paper does not address.

The maximum interaction budget of T_max=3 was chosen empirically and validated on the six benchmark datasets. Whether three rounds is sufficient for all operational scenarios — particularly for highly heterogeneous or novel environments not represented in the benchmarks — is uncertain. The paper notes diminishing returns beyond N=5 corrections and T=3 rounds, but does not provide a theoretical bound on when the iterative procedure is guaranteed to be useful.

Finally, the framework relies on user corrections being spatially localised binary masks. In practice, domain experts may have richer forms of knowledge — categorical constraints, spatial relationships, confidence levels — that the current architecture cannot incorporate. Extending the prior conditioning mechanism to handle structured expert knowledge beyond binary correction masks is an obvious next step.

Why This Architecture Matters Beyond Remote Sensing

The combination of Bayesian weight uncertainty, transformer-fused multi-source tokens, and VAE-conditioned latent priors is not specific to aerial imagery. Any segmentation or structured prediction task that faces three simultaneous challenges — domain shift, the need for user domain guidance, and calibrated uncertainty — is a candidate for this architecture template. Medical image segmentation faces exactly this triplet: images from different scanners shift the distribution, radiologists have domain knowledge that deterministic models ignore, and clinical deployment demands calibrated confidence scores. Autonomous driving perception in novel geographic environments faces a similar structure.

The deeper contribution of this paper is demonstrating that these three problems — usually treated as separate research threads — can be unified within a single joint probabilistic framework where each component reinforces the others. The BCNN’s uncertainty map becomes the query fusion’s correction guide. The user corrections become the VAE’s prior parameters. The VAE’s latent regularisation improves the BCNN’s calibration. The whole is meaningfully greater than the sum of its parts, as the ablation results show component by component.

Conclusion

The central contribution of this paper is a principled answer to a question that the remote sensing community has been asking informally for years: how do you build a segmentation system that is simultaneously robust to domain shift, responsive to expert guidance, and honest about what it does not know? The answer is to model all three as aspects of the same posterior distribution $p(\mathbf{Y} | \mathbf{I}, \mathcal{U})$ — to treat the user’s corrections not as hard constraints but as observations that condition a latent prior, and to treat the model’s uncertainty not as a nuisance to be minimised but as an operational signal to be communicated and exploited.

The Bayesian CNN backbone, the transformer query fusion, and the user-conditioned VAE are each individually motivated techniques. Their combination is where the architecture becomes distinctive: a closed loop in which uncertainty drives user attention, user attention improves the prior, and the prior reduces uncertainty in subsequent rounds. Three refinement steps with a handful of corrections per image are enough to achieve consistent performance gains across six datasets spanning satellites, UAVs, high-resolution aerial cameras, and both urban and rural land cover.

The calibration results are, in some ways, the most important contribution. Achieving ECE = 0.089 — the lowest of any variant and competitive with purpose-built calibration methods — without any post-hoc calibration step demonstrates that the Bayesian probabilistic framework is not merely an architectural embellishment. It is doing genuine calibration work, producing prediction confidences that correspond to empirical accuracy. For a system designed for operational deployment in planning, monitoring, and disaster response, that calibration may matter more than the raw mIOU gain.

The remaining work centres on efficiency and expressiveness. Whether the Monte Carlo sampling overhead can be reduced without sacrificing uncertainty quality — perhaps through deterministic approximations or sparse sampling strategies — will determine how widely the framework can be deployed. Extending the user prior mechanism to accept richer forms of domain knowledge beyond binary masks would broaden the range of expert interactions the system can leverage. And large-scale validation across geographic regions and sensor modalities not represented in the six benchmarks used here remains the necessary final step before operational deployment can be justified.

None of that diminishes what has been achieved. The model knows what it does not know, and it asks you — with remarkable economy of interaction — exactly where you need it to be better. That is a different relationship between algorithm and expert than the field has had before.

Complete Proposed Model Code (PyTorch)

The implementation below reproduces the full Bayesian segmentation framework end-to-end: a Monte Carlo Dropout BCNN backbone, user prior modulation via element-wise masking, transformer-based multi-head query fusion over image and user tokens, a conditional VAE with user-guided latent prior, ELBO + supervised composite loss, and both inference scenarios (with and without user corrections), including the iterative TTA loop and pixelwise uncertainty map. A runnable smoke test with synthetic data is included.

# ==============================================================================
# Bayesian Multiclass Segmentation for Remote Sensing
# Paper: https://doi.org/10.1109/TGRS.2026.3670205
# Authors: Yeongsu Kim, Haeyun Lee, Seo-Yeon Choi, Kyungsu Lee
# Journal: IEEE Trans. Geoscience and Remote Sensing, Vol. 64, 2026
# PyTorch 2.4+ implementation with CUDA support
# ==============================================================================
from __future__ import annotations
import math
from typing import List, Optional, Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


# ─── SECTION 1: Bayesian CNN Backbone (BCNN) with MC Dropout ─────────────────

class MCDropout(nn.Module):
    """
    Monte Carlo Dropout — active during both training AND inference.
    Standard nn.Dropout turns off at eval(); this module does not,
    enabling weight uncertainty sampling (Eq. 7–8 of the paper).

    Each forward pass with different random seeds samples a different
    weight realisation W_l ~ q(W_l | D), approximating Bayesian inference.
    """
    def __init__(self, p: float = 0.1) -> None:
        super().__init__()
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        # Always active — training=True forces dropout even at inference
        return F.dropout(x, p=self.p, training=True)


class BCNNBlock(nn.Module):
    """
    One Bayesian convolutional block: Conv2d → BatchNorm → ReLU → MCDropout.
    Stacking these implements f_BCNN(I; {W_l}) where W_l are stochastic.
    """
    def __init__(self, c_in: int, c_out: int,
                 k: int = 3, s: int = 1, p: int = 1,
                 drop_p: float = 0.1) -> None:
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn   = nn.BatchNorm2d(c_out)
        self.act  = nn.ReLU(inplace=True)
        self.drop = MCDropout(drop_p)

    def forward(self, x: Tensor) -> Tensor:
        return self.drop(self.act(self.bn(self.conv(x))))


class BCNNEncoder(nn.Module):
    """
    Bayesian CNN encoder backbone.

    Implements the stochastic feature extraction:
        F = f_BCNN(I; {W_l}),  W_l ~ q(W_l | D)  for all l   [Eq. 7]

    The predictive distribution p(F|I) is approximated by M forward passes:
        {F^(m)}_{m=1}^M,  W_l^(m) ~ q(W_l | D)              [Eq. 8]

    Epistemic uncertainty = pixelwise variance across M passes:
        Var_m[F^(m)]

    Parameters
    ----------
    in_channels   : number of input spectral channels (C=3 for RGB)
    feature_dim   : output feature channel dimension
    drop_p        : MC dropout probability (applied every forward pass)
    """
    def __init__(self, in_channels: int = 3,
                 feature_dim: int = 256,
                 drop_p: float = 0.1) -> None:
        super().__init__()
        self.feature_dim = feature_dim
        self.encoder = nn.Sequential(
            BCNNBlock(in_channels, 64,  drop_p=drop_p),   # 1/1
            nn.MaxPool2d(2),
            BCNNBlock(64,  128, drop_p=drop_p),           # 1/2
            nn.MaxPool2d(2),
            BCNNBlock(128, 256, drop_p=drop_p),           # 1/4
            BCNNBlock(256, feature_dim, drop_p=drop_p),   # keep spatial
        )

    def forward(self, x: Tensor) -> Tensor:
        """Single stochastic forward pass — sample one W_l realisation."""
        return self.encoder(x)

    def mc_forward(self, x: Tensor, M: int = 20) -> Tuple[Tensor, Tensor]:
        """
        Monte Carlo sampling: run M stochastic forward passes.

        Parameters
        ----------
        x : (B, C, H, W) input image batch
        M : number of MC samples (paper uses M=20)

        Returns
        -------
        F_mean : (B, feature_dim, H', W') mean feature map
        F_var  : (B, feature_dim, H', W') epistemic uncertainty (variance)
        """
        samples = torch.stack([self.forward(x) for _ in range(M)], dim=0)
        # samples: (M, B, C, H', W')
        F_mean = samples.mean(dim=0)
        F_var  = samples.var(dim=0, unbiased=True)
        return F_mean, F_var


# ─── SECTION 2: User Prior Modulation (Eq. 9) ────────────────────────────────

class UserPriorModulation(nn.Module):
    """
    Inject user correction masks as spatial priors into the feature map.

    For each user correction mask M_user^(n) in {0,1}^{H x W},
    the element-wise product selects BCNN features at correction regions:
        F_prior^(n) = F ⊙ M_user^(n)                          [Eq. 9]

    The masks are upsampled to match the feature map spatial resolution
    before masking. This mechanism allows user domain knowledge to be
    encoded directly as feature-space modulations rather than output
    constraints — a soft, probabilistic form of user guidance.

    Parameters
    ----------
    feature_dim : number of feature channels (must match BCNN output)
    """
    def __init__(self, feature_dim: int = 256) -> None:
        super().__init__()
        self.feature_dim = feature_dim

    def forward(self, F: Tensor,
                masks: Optional[List[Tensor]] = None) -> List[Tensor]:
        """
        Parameters
        ----------
        F     : (B, C, Hf, Wf) BCNN feature map
        masks : list of N binary tensors, each (B, 1, H, W) or (B, H, W)
                If None, returns empty list (no-user-prior scenario)

        Returns
        -------
        F_priors : list of N (B, C, Hf, Wf) user-prior feature maps
        """
        if masks is None or len(masks) == 0:
            return []

        B, C, Hf, Wf = F.shape
        priors = []
        for m in masks:
            if m.dim() == 2:           # (H, W) → (1, 1, H, W)
                m = m.unsqueeze(0).unsqueeze(0)
            elif m.dim() == 3:         # (B, H, W) → (B, 1, H, W)
                m = m.unsqueeze(1)
            # Resize mask to feature map resolution
            m_rs = F.interpolate(m.float(), size=(Hf, Wf), mode='nearest')
            priors.append(F * m_rs)      # Element-wise product
        return priors


# ─── SECTION 3: Token Embedding and Transformer Query Fusion ─────────────────

class TokenEmbedding(nn.Module):
    """
    Project feature maps into query token sequences for the transformer.
    Implements h_F and h_prior_emb in Eq. 13.

    Spatially flattens the feature map (H' x W' → N_pix tokens)
    after a 1x1 projection to embedding dimension d.
    """
    def __init__(self, in_dim: int, embed_dim: int = 256) -> None:
        super().__init__()
        self.proj = nn.Conv2d(in_dim, embed_dim, 1)

    def forward(self, F: Tensor) -> Tensor:
        """
        Parameters
        ----------
        F : (B, C, H, W) feature map

        Returns
        -------
        tokens : (B, H*W, embed_dim) flattened token sequence
        """
        x = self.proj(F)           # (B, embed_dim, H, W)
        B, D, H, W = x.shape
        return x.flatten(2).permute(0, 2, 1)  # (B, H*W, D)


class MultiHeadSelfAttention(nn.Module):
    """
    Multi-head self-attention over query tokens (Eq. 15 and 18).

    MHAttn(q, K, V) = Σ_h softmax(q K_h^T / √d_k) V_h

    All queries (image tokens + user-prior tokens) attend to each other,
    propagating user correction signals globally across the spatial graph.
    """
    def __init__(self, embed_dim: int = 256,
                 num_heads: int = 8, dropout: float = 0.1) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim  = embed_dim // num_heads
        self.scale     = self.head_dim ** -0.5
        self.WQ = nn.Linear(embed_dim, embed_dim, bias=False)
        self.WK = nn.Linear(embed_dim, embed_dim, bias=False)
        self.WV = nn.Linear(embed_dim, embed_dim, bias=False)
        self.out = nn.Linear(embed_dim, embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, q: Tensor) -> Tensor:
        """
        Parameters
        ----------
        q : (B, M, embed_dim) — M = N_img + N_user tokens

        Returns
        -------
        out : (B, M, embed_dim)
        """
        B, M, D = q.shape
        H, Dh = self.num_heads, self.head_dim

        def split_heads(x):
            return x.view(B, M, H, Dh).permute(0, 2, 1, 3)  # (B, H, M, Dh)

        Q = split_heads(self.WQ(q))
        K = split_heads(self.WK(q))
        V = split_heads(self.WV(q))

        attn = (Q @ K.transpose(-2, -1)) * self.scale    # (B, H, M, M)
        attn = self.drop(attn.softmax(dim=-1))
        out  = (attn @ V).permute(0, 2, 1, 3).reshape(B, M, D)
        return self.out(out)


class TransformerQueryFusionLayer(nn.Module):
    """One transformer layer: MHAttn + FFN + LayerNorm + residual."""
    def __init__(self, embed_dim: int = 256,
                 num_heads: int = 8, ffn_ratio: int = 4,
                 dropout: float = 0.1) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn  = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn   = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * ffn_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embed_dim * ffn_ratio, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, q: Tensor) -> Tensor:
        q = q + self.attn(self.norm1(q))
        q = q + self.ffn(self.norm2(q))
        return q


class TransformerQueryFusion(nn.Module):
    """
    Stack of L transformer layers that jointly refine image and user tokens.

    Implements Q^0 → Q^* via L applications of Eq. 17:
        q_i^(l+1) = q_i^(l) + Σ_j a_ij^(l) W_V q_j^(l)

    The final Q* = {q_i^*}_{i=1}^M encodes globally contextual,
    uncertainty-aware, and user-bias-aware representations.

    Parameters
    ----------
    embed_dim : token embedding dimension d
    num_heads : H attention heads
    num_layers: L transformer layers
    """
    def __init__(self, embed_dim: int = 256,
                 num_heads: int = 8, num_layers: int = 4,
                 dropout: float = 0.1) -> None:
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerQueryFusionLayer(embed_dim, num_heads, dropout=dropout)
            for _ in range(num_layers)
        ])

    def forward(self, Q: Tensor) -> Tensor:
        """
        Parameters
        ----------
        Q : (B, M, embed_dim) initial query set Q^0 (Eq. 14)

        Returns
        -------
        Q_star : (B, M, embed_dim) refined query set Q* (Eq. 16)
        """
        for layer in self.layers:
            Q = layer(Q)
        return Q


# ─── SECTION 4: Conditional VAE with User-Guided Prior ───────────────────────

class VAEPriorNet(nn.Module):
    """
    Parameterises the user-conditioned Gaussian prior (Eq. 20 and 22):
        p(z | I, {F_prior^(n)}) = N(z; μ_prior, Σ_prior)
    where [μ_prior, Σ_prior] = h_prior(GP(F), {GP(F_prior^(n))})

    GP(·) = global average pooling (sufficient statistic for the prior).
    Aggregated user features are concatenated with image features.
    """
    def __init__(self, feature_dim: int = 256,
                 latent_dim: int = 128,
                 max_user_priors: int = 10) -> None:
        super().__init__()
        self.max_n = max_user_priors
        # Input: GP(F) + GP(aggregated user priors) → concat
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim * 2, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, latent_dim * 2),  # μ + log σ
        )
        self.user_agg = nn.Linear(feature_dim, feature_dim)

    def forward(self, F: Tensor,
                F_priors: Optional[List[Tensor]] = None) -> Tuple[Tensor, Tensor]:
        """
        Parameters
        ----------
        F        : (B, C, H, W) image feature map
        F_priors : list of N (B, C, H, W) user-prior feature maps (may be empty)

        Returns
        -------
        mu_prior  : (B, latent_dim)
        log_var_prior : (B, latent_dim)
        """
        gp_F = F.mean(dim=[2, 3])                     # GP(F): (B, C)
        if F_priors and len(F_priors) > 0:
            # Aggregate user priors: mean over N corrections
            user_stack = torch.stack([fp.mean(dim=[2, 3]) for fp in F_priors], dim=1)
            gp_user = user_stack.mean(dim=1)           # (B, C)
            gp_user = self.user_agg(gp_user)
        else:
            gp_user = torch.zeros_like(gp_F)

        h = torch.cat([gp_F, gp_user], dim=1)        # (B, 2C)
        params = self.mlp(h)                           # (B, 2*latent_dim)
        mu, log_var = params.chunk(2, dim=-1)
        return mu, log_var


class VAEEncoder(nn.Module):
    """
    Variational posterior encoder: q_φ(z | I, {M_user^(n)}, Y).
    Takes the concatenation of image features and ground-truth mask embeddings
    during training to learn the approximate posterior.
    """
    def __init__(self, feature_dim: int = 256,
                 num_classes: int = 7,
                 latent_dim: int = 128) -> None:
        super().__init__()
        self.mask_embed = nn.Linear(num_classes, feature_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim * 2, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, latent_dim * 2),
        )

    def forward(self, F: Tensor, Y_onehot: Tensor) -> Tuple[Tensor, Tensor]:
        """
        Parameters
        ----------
        F        : (B, C, H, W) feature map
        Y_onehot : (B, K, H, W) one-hot ground truth mask

        Returns
        -------
        mu_q  : (B, latent_dim)
        lv_q  : (B, latent_dim)  [log variance]
        """
        gp_F = F.mean(dim=[2, 3])                         # (B, C)
        gp_Y = Y_onehot.permute(0, 2, 3, 1).mean(dim=[1, 2])  # (B, K)
        gp_Y = self.mask_embed(gp_Y)                       # (B, C)
        h = torch.cat([gp_F, gp_Y], dim=1)
        params = self.mlp(h)
        return params.chunk(2, dim=-1)


def reparameterise(mu: Tensor, log_var: Tensor) -> Tensor:
    """
    Reparameterisation trick: z = μ + σ·ε,  ε ~ N(0, I)
    Enables gradient flow through the stochastic latent variable.
    """
    std = (0.5 * log_var).exp()
    eps = torch.randn_like(std)
    return mu + std * eps


class SegmentationDecoder(nn.Module):
    """
    Generative decoder: f_dec(Q*, z) → Y_pred.          [Eq. 27]

    Takes the refined query representation Q* from the transformer
    and the sampled latent z, and decodes pixel-level class logits.
    The decoder combines global latent information with local spatial
    feature information via broadcast addition.

    Parameters
    ----------
    embed_dim    : transformer embedding dimension d
    latent_dim   : VAE latent dimension
    num_classes  : K land-cover categories
    output_hw    : (H_out, W_out) desired output resolution
    """
    def __init__(self, embed_dim: int = 256,
                 latent_dim: int = 128,
                 num_classes: int = 7,
                 feat_hw: int = 128) -> None:
        super().__init__()
        self.feat_hw = feat_hw
        # Latent → spatial broadcast vector
        self.z_proj = nn.Linear(latent_dim, embed_dim)
        # Query tokens → spatial feature map via transposed conv
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, Q_star: Tensor, z: Tensor) -> Tensor:
        """
        Parameters
        ----------
        Q_star : (B, M, embed_dim)  transformer output queries
        z      : (B, latent_dim)    sampled latent variable

        Returns
        -------
        logits : (B, K, H, W)  pixel-level class logits
        """
        B, M, D = Q_star.shape
        H = W = self.feat_hw

        # Project and reshape queries to spatial feature map
        q_feat = self.query_proj(Q_star)  # (B, M, D)
        # Take first H*W tokens as spatial map (truncate or pad as needed)
        n_spatial = H * W
        q_spatial = q_feat[:, :n_spatial, :]               # (B, H*W, D)
        q_spatial = q_spatial.permute(0, 2, 1).reshape(B, D, H, W)

        # Broadcast latent z as additive bias across all spatial positions
        z_bias = self.z_proj(z)[:, :, None, None]           # (B, D, 1, 1)
        feat = q_spatial + z_bias

        return self.head(feat)


# ─── SECTION 5: Full Bayesian Segmentation Model ─────────────────────────────

class BayesSegNet(nn.Module):
    """
    Full Bayesian Multiclass Segmentation Framework.

    Architecture (Table I of the paper):
      1. Feature Extraction : F = f_BCNN(I; {W_l})                [Eq. 7]
      2. User Prior Modulation: F_prior^(n) = F ⊙ M_user^(n)     [Eq. 9]
      3. Token Embedding : T_F = h_F(F), T_prior = h_emb(F_prior) [Eq. 13]
      4. Query Fusion    : Q* = Transformer(Q^0)                   [Eq. 16]
      5. Prior Parameterisation: [μ_prior, Σ_prior] = h_prior(...)  [Eq. 20]
      6. Latent Sampling : z ~ N(μ_prior, Σ_prior) or q_φ(z|...)
      7. Decoding        : Y_pred = f_dec(Q*, z)                   [Eq. 27]

    The Bayesian joint distribution (Eq. 11):
        p(Y, z | I, {M_user^n}) = p(Y | z, I, {M_user^n}) × p(z | I, {F_prior^n})

    Parameters
    ----------
    num_classes      : K land-cover categories
    in_channels      : C spectral channels (3 for RGB)
    feature_dim      : BCNN output channels
    embed_dim        : transformer embedding dimension d
    latent_dim       : VAE latent space dimension
    num_heads        : H attention heads
    num_tf_layers    : L transformer layers
    mc_samples       : M Monte Carlo samples for uncertainty estimation
    drop_p           : MC dropout probability
    img_size         : assumed square input resolution (H=W)
    """
    def __init__(
        self,
        num_classes: int = 7,
        in_channels: int = 3,
        feature_dim: int = 256,
        embed_dim:   int = 256,
        latent_dim:  int = 128,
        num_heads:   int = 8,
        num_tf_layers: int = 4,
        mc_samples:  int = 20,
        drop_p:      float = 0.1,
        img_size:    int = 512,
    ) -> None:
        super().__init__()
        self.num_classes = num_classes
        self.mc_samples  = mc_samples
        feat_hw = img_size // 4              # spatial resolution after BCNN

        # Modules
        self.bcnn       = BCNNEncoder(in_channels, feature_dim, drop_p)
        self.prior_mod  = UserPriorModulation(feature_dim)
        self.img_tok    = TokenEmbedding(feature_dim, embed_dim)
        self.user_tok   = TokenEmbedding(feature_dim, embed_dim)
        self.transformer = TransformerQueryFusion(embed_dim, num_heads, num_tf_layers)
        self.vae_prior  = VAEPriorNet(feature_dim, latent_dim)
        self.vae_enc    = VAEEncoder(feature_dim, num_classes, latent_dim)
        self.decoder    = SegmentationDecoder(embed_dim, latent_dim,
                                               num_classes, feat_hw)

    def forward(
        self,
        I: Tensor,
        user_masks: Optional[List[Tensor]] = None,
        Y_gt: Optional[Tensor] = None,
        training: bool = True,
    ) -> dict:
        """
        Full forward pass.

        Parameters
        ----------
        I          : (B, C, H, W) input image
        user_masks : list of N binary masks (B, H, W) [user corrections]
        Y_gt       : (B, H, W) ground truth for VAE encoder (training only)
        training   : if True, use posterior q_φ; else use prior p(z|I,U)

        Returns
        -------
        dict with keys:
          'logits'    : (B, K, H, W) single deterministic prediction
          'mu_q', 'lv_q'   : posterior parameters (training)
          'mu_p', 'lv_p'   : prior parameters
          'z'               : sampled latent
        """
        # ── Step 1: BCNN feature extraction ──────────────────────────────
        F = self.bcnn(I)                              # (B, feat_dim, H/4, W/4)

        # ── Step 2: User prior modulation ─────────────────────────────────
        F_priors = self.prior_mod(F, user_masks)       # list of N (B, C, H', W')

        # ── Step 3: Token embedding ────────────────────────────────────────
        T_img = self.img_tok(F)                       # (B, H'*W', d)
        T_usr = [self.user_tok(fp) for fp in F_priors]  # list of (B, H'*W', d)

        # ── Step 4: Concatenate and query-fuse ────────────────────────────
        Q0_parts = [T_img] + T_usr                    # Eq. 14
        Q0 = torch.cat(Q0_parts, dim=1)              # (B, M_img + M_usr, d)
        Q_star = self.transformer(Q0)                # (B, M, d)  [Eq. 16]

        # ── Step 5: Prior parameterisation ────────────────────────────────
        mu_p, lv_p = self.vae_prior(F, F_priors)     # Eq. 20

        # ── Step 6: Latent sampling ────────────────────────────────────────
        if training and Y_gt is not None:
            Y_onehot = F.one_hot(Y_gt.long(), self.num_classes)
            Y_onehot = Y_onehot.permute(0, 3, 1, 2).float()
            # Downsample to feature map resolution
            B, K, H_orig, W_orig = Y_onehot.shape
            H_feat, W_feat = F.shape[2], F.shape[3]
            Y_dn = F.interpolate(Y_onehot, size=(H_feat, W_feat), mode='nearest')
            mu_q, lv_q = self.vae_enc(F, Y_dn)       # posterior
            z = reparameterise(mu_q, lv_q)
        else:
            mu_q, lv_q = mu_p, lv_p                   # sample from prior
            z = reparameterise(mu_p, lv_p)

        # ── Step 7: Decode ─────────────────────────────────────────────────
        logits = self.decoder(Q_star, z)              # Eq. 27
        logits = F.interpolate(logits, size=I.shape[-2:],
                                mode='bilinear', align_corners=False)

        return {
            'logits': logits,
            'mu_q':  mu_q, 'lv_q': lv_q,
            'mu_p':  mu_p, 'lv_p': lv_p,
            'z':     z,
        }

    @torch.no_grad()
    def mc_predict(
        self,
        I: Tensor,
        user_masks: Optional[List[Tensor]] = None,
        M: Optional[int] = None,
    ) -> Tuple[Tensor, Tensor]:
        """
        Monte Carlo inference: M stochastic forward passes → mean + uncertainty.

        Implements Eqs. 27–28:
          S̄ = (1/M) Σ_m S^(m)
          U(i,j) = (1/(M-1)) Σ_m (S^(m)(i,j) - S̄(i,j))²

        Parameters
        ----------
        I          : (B, C, H, W) input image
        user_masks : list of N user correction masks
        M          : number of MC samples (default: self.mc_samples)

        Returns
        -------
        mean_pred  : (B, K, H, W) mean class probabilities
        uncertainty: (B, K, H, W) pixelwise epistemic uncertainty map
        """
        M = M or self.mc_samples
        samples = []
        for _ in range(M):
            out = self.forward(I, user_masks, training=False)
            prob = out['logits'].softmax(dim=1)   # (B, K, H, W)
            samples.append(prob)

        stack = torch.stack(samples, dim=0)         # (M, B, K, H, W)
        mean_pred   = stack.mean(dim=0)
        uncertainty = stack.var(dim=0, unbiased=True)
        return mean_pred, uncertainty


# ─── SECTION 6: ELBO + Supervised Composite Loss ─────────────────────────────

class BayesSegLoss(nn.Module):
    """
    Composite training loss (Eq. 24–26):
        L_total = L_sup + λ_ELBO × L_ELBO

    L_sup  = pixelwise cross-entropy (or Dice + CE) over mean prediction
    L_ELBO = KL(q_φ(z|I,M_user,Y) ‖ p(z|I,F_prior)) − E[log p_θ(Y|z,Q*)]

    The KL term is computed analytically for two Gaussians:
        KL(N(μ_q,σ_q²) ‖ N(μ_p,σ_p²)) = ½ Σ[log(σ_p²/σ_q²) + (σ_q²+(μ_q-μ_p)²)/σ_p² - 1]

    Parameters
    ----------
    num_classes : K (for class weighting)
    lambda_elbo : balance weight λ_ELBO between sup and ELBO losses
    """
    def __init__(self, num_classes: int = 7,
                 lambda_elbo: float = 0.1) -> None:
        super().__init__()
        self.ce  = nn.CrossEntropyLoss(ignore_index=255)
        self.lam = lambda_elbo

    def kl_divergence(self, mu_q: Tensor, lv_q: Tensor,
                       mu_p: Tensor, lv_p: Tensor) -> Tensor:
        """Analytical KL between two diagonal Gaussians."""
        # log(σ_p / σ_q) + (σ_q² + (μ_q - μ_p)²) / (2σ_p²) - ½
        kl = (0.5 * (
            lv_p - lv_q
            + (lv_q.exp() + (mu_q - mu_p).pow(2)) / lv_p.exp().clamp(1e-6)
            - 1
        )).sum(dim=-1).mean()
        return kl

    def dice_loss(self, pred: Tensor, target: Tensor,
                   smooth: float = 1.0) -> Tensor:
        """Soft Dice loss for class imbalance robustness."""
        B, K, H, W = pred.shape
        pred_flat   = pred.softmax(dim=1).view(B, K, -1)
        target_oh   = F.one_hot(target.long().clamp(0), K).permute(0, 3, 1, 2)
        target_flat = target_oh.float().view(B, K, -1)
        intersection = (pred_flat * target_flat).sum(-1)
        dice = 1 - (2 * intersection + smooth) / (
            pred_flat.sum(-1) + target_flat.sum(-1) + smooth
        )
        return dice.mean()

    def forward(self, out: dict, Y_gt: Tensor) -> Tuple[Tensor, dict]:
        """
        Parameters
        ----------
        out  : dict from BayesSegNet.forward() containing logits + VAE params
        Y_gt : (B, H, W) ground truth class indices

        Returns
        -------
        total_loss : scalar
        loss_dict  : breakdown for logging
        """
        logits = out['logits']

        # Supervised loss: CE + Dice (Eq. 25)
        ce_loss   = self.ce(logits, Y_gt.long())
        dice_loss = self.dice_loss(logits, Y_gt)
        L_sup = ce_loss + dice_loss

        # KL divergence term of ELBO (Eq. 26)
        kl = self.kl_divergence(
            out['mu_q'], out['lv_q'],
            out['mu_p'], out['lv_p']
        )

        # Reconstruction term: log p_θ(Y | z, Q*) ≈ -CE on current sample
        recon = F.cross_entropy(logits, Y_gt.long(), ignore_index=255)
        L_ELBO = kl - recon                            # ELBO ≈ E[log p] - KL

        total = L_sup + self.lam * L_ELBO

        return total, {
            'total': total.item(), 'ce': ce_loss.item(),
            'dice': dice_loss.item(), 'kl': kl.item(),
        }


# ─── SECTION 7: Inference Pipeline ───────────────────────────────────────────

@torch.no_grad()
def iterative_tta_inference(
    model: BayesSegNet,
    I: Tensor,
    user_masks_init: Optional[List[Tensor]] = None,
    t_max: int = 3,
    eps: float = 1e-4,
    mc_samples: int = 20,
    device: Optional[torch.device] = None,
) -> Tuple[Tensor, Tensor, List[Tensor]]:
    """
    Iterative Test-Time Adaptation inference loop (Eqs. 30–31).

    Supports both inference scenarios from the paper:
      Scenario 1: user provides correction masks upfront → single-round
      Scenario 2: user provides masks after inspecting uncertainty maps

    Algorithm:
      1. Initial prediction (with or without user masks)
      2. Compute uncertainty map U(i,j)
      3. User inspects U and provides correction masks for high-uncertainty regions
      4. Re-run prediction conditioned on new masks
      5. Repeat until ||Y_pred^(t+1) - Y_pred^(t)||_1 < ε or t = T_max

    Parameters
    ----------
    model           : fitted BayesSegNet
    I               : (1, C, H, W) test image
    user_masks_init : initial user correction masks (may be None)
    t_max           : maximum refinement rounds T_max (paper: 3)
    eps             : convergence tolerance ε
    mc_samples      : M for uncertainty estimation

    Returns
    -------
    final_pred  : (1, K, H, W) final mean probability prediction
    uncertainty : (1, K, H, W) final pixelwise uncertainty map
    history     : list of T+1 prediction tensors (for visualisation)
    """
    if device is None:
        device = next(model.parameters()).device
    I = I.to(device)

    model.eval()
    current_masks = user_masks_init
    history = []
    prev_pred = None

    for t in range(t_max + 1):
        pred, uncertainty = model.mc_predict(I, current_masks, M=mc_samples)
        history.append(pred.clone())

        if t > 0 and prev_pred is not None:
            delta = (pred - prev_pred).abs().mean()
            if delta < eps:
                print(f"  Converged at t={t} (Δ={delta:.5f} < ε={eps})")
                break

        prev_pred = pred.clone()
        if t < t_max:
            # Simulate user correction: flag top-k% uncertain pixels
            unc_map = uncertainty.mean(dim=1, keepdim=True)   # (B,1,H,W)
            threshold = unc_map.view(unc_map.shape[0], -1).quantile(0.85, dim=1)
            threshold = threshold[:, None, None, None]
            new_mask = (unc_map > threshold).float()
            current_masks = [new_mask.squeeze(1)]

    return pred, uncertainty, history


# ─── SECTION 8: Training Loop ─────────────────────────────────────────────────

def train_one_epoch(
    model: BayesSegNet,
    loader,
    optimizer: torch.optim.Optimizer,
    criterion: BayesSegLoss,
    device: torch.device,
    scaler: Optional[torch.cuda.amp.GradScaler] = None,
) -> dict:
    """
    One full training epoch.

    During training, simulated user corrections are generated by sampling
    error-prone regions from the ground truth (as in the paper's protocol:
    N ∈ {0, 1, 3, 5, 10} corrections sampled from misclassified regions).

    Parameters
    ----------
    model     : BayesSegNet
    loader    : DataLoader yielding (images, masks, [user_corrections]) tuples
    optimizer : AdamW with lr=1e-4 and polynomial decay (paper settings)
    criterion : BayesSegLoss
    device    : compute device
    scaler    : optional AMP grad scaler

    Returns
    -------
    dict with mean loss values for logging
    """
    model.train()
    totals = {'total': 0.0, 'ce': 0.0, 'dice': 0.0, 'kl': 0.0}
    n = 0

    for batch in loader:
        images = batch[0].to(device)
        masks  = batch[1].to(device)
        user_m = batch[2] if len(batch) > 2 else None
        if user_m is not None:
            user_m = [u.to(device) for u in user_m]

        optimizer.zero_grad()
        if scaler:
            with torch.cuda.amp.autocast():
                out = model(images, user_m, masks, training=True)
                loss, ld = criterion(out, masks)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            scaler.step(optimizer); scaler.update()
        else:
            out = model(images, user_m, masks, training=True)
            loss, ld = criterion(out, masks)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            optimizer.step()

        for k in totals: totals[k] += ld.get(k, 0.0)
        n += 1

    return {k: v / max(n, 1) for k, v in totals.items()}


# ─── SECTION 9: Evaluation Metrics ───────────────────────────────────────────

def compute_miou(pred: Tensor, target: Tensor,
                  num_classes: int, ignore_idx: int = 255) -> float:
    """
    Mean Intersection-over-Union across K classes (Eq. 34).
    IoU_c = TP_c / (TP_c + FP_c + FN_c)
    mIOU  = (1/K) Σ_c IoU_c

    Parameters
    ----------
    pred        : (B, K, H, W) logits or probabilities
    target      : (B, H, W)   ground truth class indices
    num_classes : K
    ignore_idx  : class index to exclude from evaluation
    """
    pred_cls = pred.argmax(dim=1).view(-1).cpu()
    true_cls = target.view(-1).cpu()
    valid = true_cls != ignore_idx
    pred_cls, true_cls = pred_cls[valid], true_cls[valid]

    ious = []
    for c in range(num_classes):
        tp = ((pred_cls == c) & (true_cls == c)).sum().float()
        fp = ((pred_cls == c) & (true_cls != c)).sum().float()
        fn = ((pred_cls != c) & (true_cls == c)).sum().float()
        denom = tp + fp + fn
        if denom > 0:
            ious.append((tp / denom).item())
    return float(sum(ious) / max(1, len(ious)))


# ─── SECTION 10: Smoke Test ───────────────────────────────────────────────────

if __name__ == '__main__':
    print("=" * 60)
    print("BayesSegNet Smoke Test (Bayesian BCNN + VAE + User Priors)")
    print("=" * 60)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Device: {device}")

    # Build model (small for smoke test)
    model = BayesSegNet(
        num_classes=7, in_channels=3, feature_dim=64,
        embed_dim=64, latent_dim=32, num_heads=4,
        num_tf_layers=2, mc_samples=5, img_size=64
    ).to(device)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Parameters: {n_params:,}")

    # Synthetic data (B=2, RGB, 64×64 tiles)
    I = torch.randn(2, 3, 64, 64, device=device)
    Y = torch.randint(0, 7, (2, 64, 64), device=device)
    masks = [torch.randint(0, 2, (2, 64, 64), device=device).float()
             for _ in range(3)]   # N=3 user corrections

    # ── Training forward pass ──────────────────────────────────────────
    model.train()
    out = model(I, masks, Y, training=True)
    print(f"Logits shape: {out['logits'].shape}")
    print(f"mu_q shape:   {out['mu_q'].shape}")
    print(f"z shape:      {out['z'].shape}")

    # ── Loss computation ──────────────────────────────────────────────
    criterion = BayesSegLoss(num_classes=7, lambda_elbo=0.1)
    loss, ld = criterion(out, Y)
    print(f"Total loss: {loss.item():.4f} | CE={ld['ce']:.4f} | KL={ld['kl']:.4f}")

    # ── MC inference + uncertainty ────────────────────────────────────
    model.eval()
    I_test = torch.randn(1, 3, 64, 64, device=device)
    mean_pred, unc = model.mc_predict(I_test, masks=None, M=5)
    print(f"MC pred shape: {mean_pred.shape}, unc shape: {unc.shape}")
    print(f"Mean uncertainty: {unc.mean():.4f}")

    # ── Iterative TTA inference ───────────────────────────────────────
    final, unc_final, hist = iterative_tta_inference(
        model, I_test, user_masks_init=None,
        t_max=3, mc_samples=5, device=device
    )
    print(f"TTA history length: {len(hist)} rounds")

    # ── mIOU ──────────────────────────────────────────────────────────
    Y_test = torch.randint(0, 7, (1, 64, 64))
    miou = compute_miou(final.cpu(), Y_test, num_classes=7)
    print(f"Smoke test mIOU (random): {miou:.4f}")

    print("\n✓  All checks passed.")

Read the Full Paper & Explore the Code

The complete framework — including experimental configurations for all six benchmark datasets, ablation supplements, and domain adaptation protocol — is published in IEEE Transactions on Geoscience and Remote Sensing. Implementation is in PyTorch 2.4 with CUDA; benchmarks were run on 8× NVIDIA RTX A6000 GPUs.

📄 Read on IEEE Xplore 🔗 IEEEXplore Page

Academic Citation:
Y. Kim, H. Lee, S.-Y. Choi and K. Lee, “Bayesian Multiclass Segmentation for Remote Sensing: Integrating User Priors and Uncertainty,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 64, 2026, Art. no. 1000515. https://doi.org/10.1109/TGRS.2026.3670205

This article is an independent editorial analysis of peer-reviewed research. The PyTorch code is a faithful reimplementation for educational purposes. All accuracy and calibration metrics cited are from the original paper. Benchmark results reflect the official evaluation protocol described therein.

Explore More on AI Trend Blend

If this article sparked your interest, here is more of what we cover across the site — from Bayesian deep learning and remote sensing to interactive AI, adversarial robustness, and precision agriculture.

HomepageLatest AI research breakdowns and trending papers

→

Machine LearningArchitecture reviews, training techniques, benchmarks

→

Computer VisionSegmentation, detection, 3D understanding

→

AI SecurityAdversarial attacks, robustness, federated defences

→

Remote Sensing AISatellite imagery, land cover, geospatial analysis

→

AboutWho we are and how we cover AI research

→