ParkDiffusion++: The What-If Prediction Framework That Taught Parking Lots to Reason About Intentions


How a team from the University of Freiburg and CARIAD SE built a two-stage diffusion system that doesn’t just predict where vehicles will go—it asks “what if the ego vehicle chose differently?” and forecasts how every surrounding agent would react.

Tags: Diffusion Models · Ego Intention Prediction · Joint Trajectory Forecasting · Automated Parking · Counterfactual Learning · Safety-Guided Denoising · Knowledge Distillation · Multi-Agent Prediction
Figure 1: Automated parking systems must reason over multiple plausible ego intentions simultaneously, predicting how surrounding agents would reactively respond to each possible maneuver. ParkDiffusion++ is the first method to jointly learn this intention prediction and conditional forecasting pipeline end-to-end.

Parking lots are, in a surprisingly deep sense, one of the hardest environments for autonomous vehicles to navigate. On highways and urban roads, lane markings impose structure and shared behavioral conventions give drivers predictable cues. Parking lots offer neither. The lane semantics are weak or entirely absent. Pedestrians cross without warning. Vehicles back out blindly. And unlike most other driving scenarios, a parking maneuver is inherently goal-oriented—the vehicle has a specific spot to reach, and every other agent in the scene needs to be understood not just as a moving obstacle, but as a reactive entity whose trajectory will shift depending on where the ego vehicle is heading.

This is exactly the gap that ParkDiffusion++ targets. Developed by researchers at the University of Freiburg and CARIAD SE, it introduces the first system to simultaneously learn a multi-modal ego intention predictor and an ego-conditioned joint trajectory predictor within a unified framework. The model doesn’t just ask “where will everyone go?” It asks the harder question: for each plausible thing the ego vehicle might do, how would the surrounding agents respond? That question—what-if prediction—is the core challenge it was built to solve.

Evaluated on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset, ParkDiffusion++ achieves state-of-the-art performance across nearly every metric, with particularly striking improvements on the safety-critical overlap rate. Its qualitative predictions show vehicles yielding, pedestrians adjusting, and entire scenes reorganizing in response to hypothetical ego maneuvers—behaviors that emerge not from hardcoded rules, but from learned counterfactual reasoning.


Why Parking Prediction Has Always Fallen Short

Most trajectory prediction progress has focused on marginal prediction—independently forecasting where each agent will go—rather than joint, scene-level reasoning. The distinction matters enormously in practice. When vehicles and pedestrians interact in close quarters, their futures are not independent. If the ego vehicle accelerates toward a parking spot, a pedestrian standing near that spot will step aside. If the ego vehicle brakes and yields, a car in the adjacent row may confidently proceed. These reactive dynamics are simply invisible to marginal predictors, which treat each agent in isolation and therefore generate trajectories that are individually plausible but collectively incoherent.

Joint prediction methods have emerged to address this. But even here, a critical dimension has been missing: these methods treat ego intent as an implicit latent variable rather than an explicit input. They can forecast what will probably happen given the current scene context, but they cannot answer the question that matters for decision-making: what would happen if the ego vehicle did something different?

This counterfactual capability is essential for any real planning system. A parking algorithm needs to evaluate multiple candidate maneuvers before committing to one—pulling forward toward Spot A versus reversing toward Spot B, yielding to a pedestrian versus proceeding. For each candidate, it needs to know not just whether the maneuver is geometrically feasible, but whether the resulting agent interactions are safe. That requires predicting the scene under each hypothetical ego intention, not just the most probable one.

Key Takeaway

Prior parking prediction methods generate marginal predictions without scene compatibility or safety constraints, and none jointly learn ego intention prediction with conditional multi-agent response modeling. ParkDiffusion++ is the first system to close both gaps simultaneously.


The Two-Stage Architecture: Intentions First, Reactions Second

Figure 2: The two-stage decomposition cleanly separates ego intention uncertainty (Stage 1: what is the ego trying to do?) from conditional reaction uncertainty (Stage 2: given that intention, how will everyone else respond?). This decomposition enables tractable counterfactual supervision.

ParkDiffusion++ decomposes the problem into two cleanly separated stages, each handling a fundamentally different type of uncertainty.

Stage 1 trains an ego intention tokenizer. Given observed histories of all agents and a vectorized map describing parking slots and static obstacles, it produces a small discrete set of plausible ego endpoints—concrete spatial locations representing where the ego vehicle might reasonably be heading. By representing intentions as endpoints rather than full trajectories, the method creates a tractable interface: six tokens rather than an infinite distribution over future paths.

Stage 2 trains a conditional joint predictor that takes one of these tokens as input and forecasts full trajectories for all agents, contingent on that specific ego intention. This stage is trained on both ground-truth intentions (where the ego actually went) and counterfactual intentions (where it could have gone instead), so the model learns to generate plausible agent responses not just for what happened, but for what might have happened.

The insight behind this decomposition is worth sitting with. Real-world datasets only provide supervision for the single intention that was actually realized. If an ego vehicle turned left, there is no ground truth for what would have happened had it turned right. The standard approach simply discards all counterfactual information. ParkDiffusion++ addresses this directly through its Counterfactual Knowledge Distillation module, which we examine in detail shortly.


Scene Encoding and the Ego Intention Tokenizer

Each agent’s position and motion history is processed by a 1-D temporal convolution followed by a GRU. A Transformer encoder attends over all agents simultaneously, producing per-agent feature vectors \(\mathbf{f}_i \in \mathbb{R}^d\). The map is processed with a notable distinction between soft constraints (drivable boundaries \(\mathcal{M}_{soft}\)) and hard constraints (physical obstacle segments \(\mathcal{M}_{hard}\))—a meaningful design choice since curbs, pillars, and parked vehicles create very different types of boundaries in parking environments.

The intention tokenization uses a learned bank of \(K_{intent}\) mode embeddings \(\{\mathbf{e}_k\}_{k=1}^{K_{intent}}\), each representing a prototypical maneuver pattern discovered from data. For each mode \(k\), the model combines the scene context with the mode embedding and regresses an endpoint token and a selection logit. A softmax over the logits yields the categorical intention distribution:

Eq. 1 — Categorical Intention Distribution $$p(z_k \mid \mathcal{X}, \mathcal{C}_{scene}) = \text{softmax}\!\left(\{s_k\}_{k=1}^{K_{intent}}\right), \quad s_k = \text{linear}\!\left(f_{mode}([\mathbf{c};\, \mathbf{e}_k])\right)$$

Stage 1 trains with a winner-takes-all loss: SmoothL1 for endpoint regression, cross-entropy for mode selection, and a diversity regularizer to prevent mode collapse. Ablations confirm that \(K_{intent}=6\) strikes the right balance—too few and the model cannot cover the distribution of real maneuvers; too many and counterfactual supervision in Stage 2 starts to confuse the joint predictor rather than enrich it. Performance peaks at six tokens and degrades noticeably at twelve, then sharply at eighteen.
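The winner-takes-all objective can be sketched compactly. The snippet below is a minimal illustration, not the paper's implementation; the loss weights and the 1 m diversity margin are assumptions:

```python
import torch
import torch.nn.functional as F

def stage1_wta_loss(endpoints, logits, gt_endpoint, w_div=0.1):
    """endpoints: (B, K, 2), logits: (B, K), gt_endpoint: (B, 2)."""
    # Winner = mode whose predicted endpoint lies closest to ground truth.
    dists = (endpoints - gt_endpoint[:, None]).norm(dim=-1)        # (B, K)
    winner = dists.argmin(dim=-1)                                  # (B,)
    # SmoothL1 regression applied only to the winning mode's endpoint.
    ep_win = endpoints.gather(
        1, winner[:, None, None].expand(-1, 1, 2)).squeeze(1)      # (B, 2)
    l_reg = F.smooth_l1_loss(ep_win, gt_endpoint)
    # Cross-entropy teaches the selection head to pick the winner.
    l_cls = F.cross_entropy(logits, winner)
    # Diversity regularizer: push endpoint pairs apart (1 m margin, assumed).
    pd = torch.cdist(endpoints, endpoints)                         # (B, K, K)
    K = endpoints.size(1)
    off = ~torch.eye(K, dtype=torch.bool, device=endpoints.device)
    l_div = F.relu(1.0 - pd[:, off]).mean()
    return l_reg + l_cls + w_div * l_div
```

The winner-takes-all selection is what keeps the K modes from averaging toward a single "mean" endpoint: only the closest mode receives regression gradient, so the bank specializes.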


Conditional Joint Prediction: Exposure Gates and Scene Selection

Once an intention token is selected, Stage 2 generates full joint trajectories conditioned on that specific ego goal via Feature-wise Linear Modulation (FiLM). A particularly clever component is the reactive exposure gate. Not every agent needs to respond to every possible ego intention—a vehicle three rows over is unlikely to react to the ego’s choice between two nearby spots, while a pedestrian directly in the ego’s intended path is highly relevant. The exposure gate learns this relevance using geometric reasoning, computing an exposure scalar \(e_i \in [0,1]\) for each agent:

Eq. 2 — Exposure Scalar for Agent i $$e_i = \sigma\!\left(\alpha\left[R_{path} - d_{line}\!\left(\mathbf{p}^{last}_i,\, \ell_k\right)\right]_+ + \beta\left[R_{end} - \left\|\mathbf{p}^{last}_i - \hat{\mathbf{g}}_k\right\|_2\right]_+\right)$$

Here \(\ell_k\) is the line segment from the ego’s last observed position to the intention token \(\hat{\mathbf{g}}_k\), and \(\alpha\), \(\beta\), \(R_{path}\), \(R_{end}\) are learnable parameters. A small gating network uses this scalar to modulate each agent’s feature vector. Unlike binary attention, this softness prevents the model from catastrophically ignoring agents that might become relevant as the scene evolves.

For generating joint scenes, the method avoids the combinatorial \(M^N\) explosion through a three-stage pipeline: generate \(M\) marginal proposals per agent; use beam search over the top-\(R\) marginals to construct a manageable set of joint candidates; then apply a learned scene selector that scores each candidate using global context with calibrated softmax probabilities to support downstream planning.
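A minimal sketch of the beam-search stage, assuming per-agent marginal log-probability scores as input; each returned tuple indexes one proposal per agent, and the learned scene selector that re-scores these candidates is omitted:

```python
import torch

def joint_candidates(scores, R=2, beam=8):
    """scores: (N, M) per-agent marginal log-probs over M proposals.
    Returns up to `beam` joint assignments (one proposal index per agent),
    built by beam search over each agent's top-R proposals instead of
    enumerating all M^N combinations."""
    N, M = scores.shape
    topR = scores.topk(min(R, M), dim=-1).indices        # (N, R) shortlist
    beams = [((), 0.0)]                                  # (partial combo, score)
    for i in range(N):
        nxt = [(combo + (j,), s + scores[i, j].item())
               for combo, s in beams for j in topR[i].tolist()]
        beams = sorted(nxt, key=lambda b: -b[1])[:beam]  # prune to beam width
    return [combo for combo, _ in beams]
```

With R = 2 and a beam of 8, ten agents yield at most 8 joint candidates instead of M^10, which is what makes the subsequent learned scene-selector scoring tractable.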


The Safety-Guided Denoiser: Where Physics Meets Learning

Raw predictions from any learned model will occasionally violate physical constraints. Most prediction systems either ignore these violations or apply post-hoc filtering, neither of which feeds safety awareness back into the learning process. ParkDiffusion++ addresses this with a safety-guided denoiser \(\epsilon_\psi\) pretrained with score-based diffusion modeling on clean trajectories, then frozen during Stage 2 to serve as a refinement oracle. The refinement alternates two steps:

Eq. 3 — Project-Then-Guide Refinement $$\mathbf{Y}^{(s+\frac{1}{2})} = \mathbf{Y}^{(s)} - \epsilon_\psi\!\left(\mathbf{Y}^{(s)}, \sigma_s;\, \mathcal{C}_{scene}, \mathbf{g}_k\right)$$ $$\mathbf{Y}^{(s+1)} = \mathbf{Y}^{(s+\frac{1}{2})} - \eta_s\,\nabla_{\mathbf{Y}}\,\mathcal{C}\!\left(\mathbf{Y}^{(s+\frac{1}{2})}\right)$$

The projection step subtracts predicted noise to move the trajectory toward the learned data manifold. The guidance step applies the gradient of a differentiable geometric potential \(\mathcal{C}(\mathbf{Y})\) that sums five key terms: Agent-Agent Overlap \(\mathcal{C}_{ov}\) penalizes intersecting safety radii; Obstacle Clearance \(\mathcal{C}_{obs}\) penalizes proximity to static obstacle segments; Ego Path-Tube \(\mathcal{C}_{tube}\) keeps the ego on its intended route; Ego Endpoint Anchoring \(\mathcal{C}_{end}\) ensures the ego reaches its token; and Motion Smoothness \(\mathcal{C}_{sm}\) penalizes kinematically implausible velocities and accelerations.

Because the denoiser is frozen, its refinements serve as non-differentiable targets during training—providing a high-quality, safety-aware supervision signal without any gradients flowing back into its parameters. This separation means the denoiser can be trained once and reused across many model configurations without ever retraining.

“We incorporate safety-guidance components into our denoising process to more effectively enforce safety constraints and enhance model performance.” — Wei, Rehr, Feist & Valada, arXiv:2602.20923v1

Counterfactual Knowledge Distillation: Supervising the Unseen

The central technical challenge is supervision. Real datasets record what actually happened—one ego intention, executed once, with one corresponding set of agent reactions. All counterfactual scenarios are absent from the training data. The solution is Counterfactual Knowledge Distillation (CKD), which adapts the teacher-student framework to this unique problem. An EMA teacher decoder \(\text{Decoder}_{\bar{\theta}}\) is maintained alongside the student decoder \(\text{Decoder}_\theta\):

Eq. 4 — EMA Teacher Update $$\bar{\theta} \leftarrow \tau\bar{\theta} + (1-\tau)\theta, \quad \tau \in [0.99,\; 0.999]$$

For counterfactual training, a non-ground-truth token \(\hat{\mathbf{g}}_{\tilde{k}}\) is sampled from the Stage 1 bank. The teacher generates a raw prediction for this hypothetical scenario, refined by the frozen denoiser to produce a safety-aware pseudo-target \(\mathbf{Y}^{teach}_{ref}\). The student then learns to match this target while staying collision-free:

Eq. 5 — Counterfactual Distillation Loss $$\mathcal{L}_{KD} = \lambda_{kd}\left\|\mathbf{Y}^{stu}_{raw} - \text{sg}\!\left[\mathbf{Y}^{teach}_{ref}\right]\right\|^2_2 + \lambda_{safe}\,\mathcal{C}^{\delta}_{ov}\!\left(\mathbf{Y}^{stu}_{raw}\right)$$

The combined Stage 2 objective runs a supervised branch on the ground-truth intention every step and—with probability \(p_{cf}=0.5\)—an additional counterfactual distillation branch. Ablation studies confirm that EMA alone stabilizes training but provides no explicit metric gain; adding the denoiser without safety guidance brings modest improvements; the full Safety-Guided Denoiser in the teacher path produces the largest gains in miss rate, overlap rate, and mAP. Balancing the two branches at 50% is critical: increasing counterfactual weight beyond \(p_{cf}=0.75\) degrades performance, as CKD starts to overwhelm the supervised signal.


Experimental Results: State-of-the-Art on Two Challenging Datasets

Evaluations were conducted on the Dragon Lake Parking (DLP) dataset—dense, unstructured parking lots with 51,750 training samples—and the Intersections Drone (inD) dataset, capturing urban intersections with complex vehicle-pedestrian dynamics across two sites (Bendplatz and Frankenburg), totalling 34,014 training and 4,859 validation scenes. Both use a 4-second prediction horizon at 0.4-second intervals. The extended overlap rate metric (OR) includes not just agent-agent collisions but also agent-obstacle contacts—a deliberate design choice reflecting parking realities. Baselines include WIMP, SceneTransformer, ScePT, MotionLM, and DTPP, all given the same vectorized-map frontend for a fair comparison.
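For concreteness, the oracle displacement metrics and a simplified overlap rate can be computed as below. This is an illustrative sketch: the paper's extended OR handles obstacle segments, whereas point obstacles and a fixed clearance radius are assumed here.

```python
import torch

def min_ade_fde(preds, gt):
    """preds: (M, N, T, 2) candidate joint scenes, gt: (N, T, 2).
    Oracle minADE / minFDE over the M candidates, in meters."""
    err = (preds - gt[None]).norm(dim=-1)     # (M, N, T) per-point error
    ade = err.mean(dim=(1, 2))                # average over agents and time
    fde = err[:, :, -1].mean(dim=1)           # endpoint error only
    return ade.min().item(), fde.min().item()

def overlap_rate(traj, radii, obstacles=None, r_obs=0.5):
    """traj: (N, T, 2), radii: (N,). Fraction of timesteps with any
    agent-agent overlap; optionally also counts contacts with point
    obstacles (P, 2), mimicking the extended OR metric."""
    N, T, _ = traj.shape
    d  = (traj[:, None] - traj[None]).norm(dim=-1)       # (N, N, T)
    rr = (radii[:, None] + radii[None])[..., None]       # (N, N, 1)
    off = ~torch.eye(N, dtype=torch.bool)
    hit = (d < rr)[off].any(dim=0)                       # (T,) any pair overlaps
    if obstacles is not None:
        do = (traj[:, :, None] - obstacles[None, None]).norm(dim=-1)  # (N, T, P)
        hit |= (do < radii[:, None, None] + r_obs).any(dim=-1).any(dim=0)
    return hit.float().mean().item()
```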

Dragon Lake Parking Dataset

| Model | minADE↓ | minFDE↓ | minMR↓ | minOR↓ | mAP↑ | f-FDE↓ | f-MR↓ | f-OR↓ | f-mAP↑ |
|---|---|---|---|---|---|---|---|---|---|
| ParkDiffusion† | 0.52 | 1.06 | 0.48 | 0.21 | 0.68 | 1.78 | 0.63 | 0.28 | 0.57 |
| WIMP | 0.62 | 1.20 | 0.46 | 0.18 | 0.70 | 1.60 | 0.55 | 0.24 | 0.62 |
| SceneTransformer | 0.36 | 0.79 | 0.28 | 0.07 | 0.92 | 1.08 | 0.41 | 0.12 | 0.84 |
| ScePT | 0.38 | 0.86 | 0.30 | 0.08 | 0.90 | 1.28 | 0.37 | 0.11 | 0.88 |
| MotionLM | 0.32 | 0.80 | 0.29 | 0.08 | 0.95 | 0.88 | 0.35 | 0.10 | 0.94 |
| DTPP | 0.32 | 0.82 | 0.27 | 0.06 | 0.94 | 1.21 | 0.33 | 0.09 | 0.92 |
| ParkDiffusion++ (Ours) | **0.29** | **0.56** | **0.23** | **0.03** | **0.97** | **0.66** | **0.29** | **0.05** | **0.95** |

Table 1: DLP validation split. Bold = best per column. ParkDiffusion++ leads on all oracle and final metrics except f-ADE (0.36 m vs. MotionLM's 0.34 m), reflecting the conservative safety trade-off discussed below.

Intersections Drone Dataset

| Model | minADE↓ | minFDE↓ | minMR↓ | minOR↓ | mAP↑ | f-FDE↓ | f-MR↓ | f-OR↓ | f-mAP↑ |
|---|---|---|---|---|---|---|---|---|---|
| ParkDiffusion† | 0.86 | 1.60 | 0.58 | 0.31 | 0.66 | 2.05 | 0.68 | 0.38 | 0.54 |
| WIMP | 0.78 | 1.48 | 0.52 | 0.28 | 0.72 | 1.82 | 0.61 | 0.33 | 0.60 |
| SceneTransformer | 0.58 | 1.10 | 0.43 | 0.19 | 0.84 | 1.65 | 0.51 | 0.22 | 0.69 |
| ScePT | **0.40** | 0.88 | 0.36 | 0.15 | 0.91 | 1.35 | 0.47 | 0.18 | 0.74 |
| MotionLM | 0.48 | 1.05 | 0.39 | 0.15 | 0.89 | 1.31 | 0.43 | 0.18 | 0.81 |
| DTPP | 0.49 | 0.96 | 0.38 | 0.13 | 0.87 | 1.22 | 0.45 | 0.16 | 0.76 |
| ParkDiffusion++ (Ours) | 0.42 | **0.78** | **0.32** | **0.07** | **0.93** | **1.18** | **0.38** | **0.09** | **0.83** |

Table 2: inD validation split. Bold = best per column. ParkDiffusion++ achieves best-in-class on minFDE, minMR, minOR, mAP, and all final metrics. ScePT achieves marginally better minADE (0.40 vs. 0.42).

The most striking result is the overlap rate. On DLP, ParkDiffusion++ achieves a final overlap rate of 0.05, nearly halving the next-best baseline at 0.09. On inD, the gap is even wider: 0.09 versus DTPP's 0.16. This reflects the combined effect of the safety-guided denoiser and counterfactual distillation, both of which explicitly penalize collision-generating trajectories. There is one honest caveat: ADE performance is slightly below the very best baseline on both datasets, because counterfactual distillation and safety guidance encourage conservative behaviors. In real-world deployment, this is almost certainly the correct trade-off.

Ablation: Every Component Earns Its Place

| Configuration | minADE↓ | minFDE↓ | minMR↓ | minOR↓ | mAP↑ |
|---|---|---|---|---|---|
| ParkDiffusion† (random joint) | 0.52 | 1.06 | 0.48 | 0.21 | 0.68 |
| + Joint Selector (JS) | 0.37 | 0.75 | 0.38 | 0.12 | 0.85 |
| + JS + Safety-Guided Denoiser | 0.32 | 0.60 | 0.26 | 0.05 | 0.92 |
| + JS + SGD + CKD (Full Model) | **0.29** | **0.56** | **0.23** | **0.03** | **0.97** |

Table 3: Ablation on main components, DLP validation split. Bold = best. The jump from random joint to learned Joint Selector is the largest single gain, confirming scene-level compatibility as the foundational bottleneck.


What This Actually Means for Automated Parking

Figure 3: The vectorized top-down scene representation ingested by ParkDiffusion++—parking slot boundaries, obstacle segments, and agent histories encoded as polylines. This structured representation enables explicit reasoning about drivable regions and physical constraints during both intention prediction and trajectory generation.

The qualitative results in the paper show genuinely impressive reactive behaviors. In one DLP scenario, a vehicle positioned to the ego’s front-left adjusts its trajectory to yield when the ego’s intention points toward that region, creating more clearance than strictly necessary. In an inD scenario, a pedestrian slows down preemptively when the ego’s intention suggests potential backing motion. These behaviors were not programmed—they emerged from learning the statistical relationship between ego intentions and agent reactions across thousands of real scenes.

The work also identifies a clear limitation. The model sometimes exhibits a “conservative tendency”—predicting agents to yield further or move slower than ground truth. This conservatism traces directly to the safety penalty terms in the denoiser and distillation loss. In one example, a yielding vehicle moves further left to create clearance for the ego, but in doing so moves closer to nearby pedestrians. The model has learned to optimize one collision type without fully accounting for externalities imposed on other agents—a natural target for future work.

Looking ahead, the two most interesting open problems the paper identifies are: learning the guidance potentials from data rather than handcrafting them, and coupling the conditional predictor with closed-loop planning. Currently, ParkDiffusion++ operates in offline prediction mode. Integrating it into a real-time planning stack would require handling distributional shift between predicted and executed trajectories—a challenge the explicit ego-conditioning framework seems well-positioned to address.

What ParkDiffusion++ has demonstrated is that framing prediction correctly—treating ego intentions as explicit inputs rather than hidden latent variables—unlocks a qualitatively different kind of reasoning. The parking lot isn’t just a scene to model. It’s a negotiation, and for the first time, there’s a system that can reason about what happens when the terms of that negotiation change.


Conceptual Framework Implementation (Python)

The code below illustrates the core mechanisms of ParkDiffusion++: the ego intention tokenizer, the reactive exposure gate, the safety potential functions, and the counterfactual knowledge distillation loop. Provided for educational purposes to illuminate the mathematical principles described in the paper.

# ─────────────────────────────────────────────────────────────────────────────
# ParkDiffusion++  ·  arXiv:2602.20923v1 [cs.RO]  ·  Feb 2026
# Wei, Rehr, Feist, Valada — University of Freiburg & CARIAD SE
# Conceptual implementation of the two-stage architecture
# ─────────────────────────────────────────────────────────────────────────────

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
import copy, random


# ─── Stage 1: Ego Intention Tokenizer ────────────────────────────────────────

class EgoIntentionTokenizer(nn.Module):
    """Predicts K_intent ego endpoint tokens from scene context.
    K=6 is the sweet spot; K=12 begins degradation; K=18 large drop
    as CKD pseudo-targets overwhelm the supervised signal."""

    def __init__(self, d_scene: int, d_e: int, K: int = 6):
        super().__init__()
        self.K       = K
        self.mode_emb = nn.Embedding(K, d_e)
        self.f_mode   = nn.Sequential(
            nn.Linear(d_scene + d_e, d_scene), nn.ReLU(),
            nn.Linear(d_scene, d_scene))
        self.ep_head  = nn.Linear(d_scene, 2)   # endpoint (x,y)
        self.lg_head  = nn.Linear(d_scene, 1)   # selection logit

    def forward(self, c: torch.Tensor):
        """c: (B, d_scene) -> endpoints (B,K,2), probs (B,K)"""
        B   = c.size(0)
        e   = self.mode_emb(torch.arange(self.K, device=c.device))
        c_x = c.unsqueeze(1).expand(-1, self.K, -1)
        e_x = e.unsqueeze(0).expand(B, -1, -1)
        h   = self.f_mode(torch.cat([c_x, e_x], dim=-1))
        ep  = self.ep_head(h)                                  # (B,K,2)
        pr  = F.softmax(self.lg_head(h).squeeze(-1), dim=-1) # (B,K)
        return ep, pr

    def select_top1(self, ep, pr):
        idx = pr.argmax(-1)
        return ep.gather(1, idx[:,None,None].expand(-1,1,2)).squeeze(1)


# ─── Reactive Exposure Gate ───────────────────────────────────────────────────

class ExposureGate(nn.Module):
    """Soft geometric attention gating agents by proximity to ego path/goal.
    e_i = sigmoid(alpha*[R_path - d_line(p_last, ell_k)]_+
                + beta* [R_end  - ||p_last - g_k||]_+ )  (Eq. 2)"""

    def __init__(self, d: int):
        super().__init__()
        self.alpha  = nn.Parameter(torch.tensor(1.0))
        self.beta   = nn.Parameter(torch.tensor(1.0))
        self.R_path = nn.Parameter(torch.tensor(3.0))
        self.R_end  = nn.Parameter(torch.tensor(2.0))
        self.net    = nn.Sequential(nn.Linear(1, d), nn.Sigmoid())

    @staticmethod
    def _d_seg(pts, p0, p1):
        d = p1 - p0
        t = ((pts - p0) * d).sum(-1) / ((d * d).sum(-1) + 1e-8)
        t = t.clamp(0, 1)
        return (pts - (p0 + t[..., None] * d)).norm(dim=-1)

    def forward(self, feats, pos, ego_last, token):
        dl = self._d_seg(pos, ego_last[:, None], token[:, None])
        de = (pos - token[:, None]).norm(dim=-1)
        e  = torch.sigmoid(self.alpha * F.relu(self.R_path - dl) +
                           self.beta  * F.relu(self.R_end  - de))
        return feats * self.net(e[..., None])


# ─── Geometric Safety Potential C(Y) ─────────────────────────────────────────

def safety_potential(Y: torch.Tensor, radii: torch.Tensor) -> torch.Tensor:
    """C(Y) = C_ov + 0.1*C_sm  (simplified; full version adds C_obs, C_tube,
    C_end). Gradient drives the guidance step in Eq. 3."""
    B, N, T, _ = Y.shape
    # Pairwise distances via broadcasting: dist[b,i,j,t] = ||Y[b,i,t]-Y[b,j,t]||
    dist = (Y[:, :, None] - Y[:, None]).norm(dim=-1)            # (B,N,N,T)
    rij  = (radii[:, None] + radii[None, :])[None, ..., None]   # (1,N,N,1)
    mask = ~torch.eye(N, dtype=torch.bool, device=Y.device)
    # Penalize intersecting safety radii over off-diagonal agent pairs
    C_ov = F.relu(rij.expand(B, N, N, T)[:, mask] - dist[:, mask]).pow(2).sum()
    # Finite-difference smoothness penalty on velocity and acceleration
    vel  = Y[:, :, 1:] - Y[:, :, :-1]
    acc  = vel[:, :, 1:] - vel[:, :, :-1]
    C_sm = vel.pow(2).mean() + acc.pow(2).mean()
    return C_ov + 0.1 * C_sm


# ─── Counterfactual Knowledge Distillation (CKD) ─────────────────────────────

class CKDTrainer:
    """EMA teacher + frozen SGD generates pseudo-targets for non-GT tokens.
    Stage 2 loss: L_S2 = L_GT + I_CF * L_KD  (p_cf=0.50 optimal)."""

    def __init__(self, student, denoiser,
                 tau=0.999, lam_kd=1.0, lam_safe=0.5, p_cf=0.5):
        self.student   = student
        self.teacher   = copy.deepcopy(student)
        self.denoiser  = denoiser
        self.tau       = tau
        self.p_cf      = p_cf
        self.lam_kd    = lam_kd
        self.lam_safe  = lam_safe
        for p in self.teacher.parameters():  p.requires_grad_(False)
        for p in self.denoiser.parameters(): p.requires_grad_(False)

    def _update_ema(self):
        with torch.no_grad():
            for tp, sp in zip(self.teacher.parameters(),
                               self.student.parameters()):
                tp.data.mul_(self.tau).add_(sp.data, alpha=1 - self.tau)

    def _refine(self, Y, token, steps=5):
        """Projection half of the project-then-guide loop (Eq. 3); the
        gradient-guidance step is omitted here for brevity. No gradients
        flow into the frozen denoiser."""
        with torch.no_grad():
            for s in range(steps):
                noise = self.denoiser(Y, sigma=1.0 / (s + 1), token=token)
                Y = Y - noise
        return Y

    def step(self, X, C_scene, Y_gt, gt_tok, cf_tok, radii, opt):
        opt.zero_grad()

        # Supervised GT branch — always runs
        Y_raw  = self.student(X, C_scene, gt_tok)
        Y_ref  = self._refine(Y_raw.detach(), gt_tok)
        L_GT   = (F.smooth_l1_loss(Y_raw, Y_gt)
                  + 0.5 * F.mse_loss(Y_raw, Y_ref)
                  + 0.1 * safety_potential(Y_raw, radii))

        # Counterfactual CKD branch — fires with probability p_cf
        L_KD = torch.zeros(1, device=Y_raw.device)
        if random.random() < self.p_cf:
            with torch.no_grad():
                Y_teach = self.teacher(X, C_scene, cf_tok)
                Y_pseudo = self._refine(Y_teach, cf_tok)
            Y_stu = self.student(X, C_scene, cf_tok)
            L_KD  = (self.lam_kd  * F.mse_loss(Y_stu, Y_pseudo)
                   + self.lam_safe * safety_potential(Y_stu, radii))

        (L_GT + L_KD).backward()
        opt.step()
        self._update_ema()
        return {"L_GT": L_GT.item(), "L_KD": L_KD.item()}


# ─── Quick demo ──────────────────────────────────────────────────────────────

if __name__ == "__main__":
    B, N, T = 4, 8, 10
    tokenizer = EgoIntentionTokenizer(d_scene=256, d_e=64, K=6)
    c         = torch.randn(B, 256)
    ep, pr    = tokenizer(c)
    best_tok  = tokenizer.select_top1(ep, pr)
    print(f"[Stage 1] endpoint bank {ep.shape}  probs {pr.shape}")
    print(f"          top-1 token {best_tok.shape}")

    gate  = ExposureGate(d=128)
    feats = torch.randn(B, N, 128)
    pos   = torch.randn(B, N, 2)
    gated = gate(feats, pos, torch.zeros(B, 2), best_tok)
    print(f"[Stage 2] exposure-gated features {gated.shape}")

    Y_test = torch.randn(B, N, T, 2)
    radii  = torch.ones(N) * 1.5
    print(f"[SGD]     C(Y) = {safety_potential(Y_test, radii).item():.3f}")
    print("Results   DLP minFDE 0.56 | f-OR 0.05 | mAP 0.97  (all best)")

Access the Paper and Resources

The full ParkDiffusion++ framework, dataset splits, and experimental protocols are available on arXiv. This research was conducted by Wei, Rehr, Feist, and Valada at the University of Freiburg and CARIAD SE, published February 2026.

Academic Citation:
Wei, J., Rehr, A., Feist, C., & Valada, A. (2026). ParkDiffusion++: Ego intention conditioned joint multi-agent trajectory prediction for automated parking using diffusion models. arXiv preprint arXiv:2602.20923.

This article is an independent editorial analysis of peer-reviewed research published on arXiv. The views and commentary expressed here reflect the editorial perspective of this site and do not represent the views of the original authors or their institutions. Code is provided for educational purposes to illustrate technical concepts. Always refer to the original publication for authoritative details.

