Teaching Machines That the World Keeps Changing: One Framework for All Evolving Learning Scenarios
Verónica Álvarez, Santiago Mazuelas, and Jose A. Lozano show that a single Kalman-filter-based methodology can unify multi-task learning, continual learning, domain adaptation, and concept drift — while being the first to provide computable, tight error-probability guarantees and a closed-form characterization of how much each task benefits from its neighbors.
Consider a system that classifies the faces in a stream of photographs, where the people photographed are aging gradually over time. Each year’s data is similar to last year’s but not identical — and it is quite different from data collected twenty years ago. Should the classifier treat every year’s dataset as fresh, independent evidence, or should it blend past data with present? If it blends, how much past data should it use, and how does that answer change when the distribution is shifting quickly versus slowly? This paper gives a principled, theoretically grounded answer — one that works across virtually every supervised learning scenario where tasks arrive in a sequence.
The Fragmentation Problem in Sequential Learning
Machine learning research has produced excellent methods for handling tasks that arrive in sequences, but it has produced them in silos. Methods for multi-source domain adaptation (MDA) are designed to pool source tasks and learn a single target; they are useless when the target task itself evolves. Methods for multi-task learning (MTL) assume all tasks are known upfront in batch; they cannot handle streaming data. Methods for supervised classification under concept drift (SCD) update a single classifier incrementally and are not designed to maintain a portfolio of classifiers for all past tasks. Methods for continual learning (CL) try to maintain that portfolio but tend to suffer from catastrophic forgetting when task distributions change.
Beyond the scenario mismatch, almost all existing methods share a second limitation: they do not provide computable, tight performance guarantees. They come with theoretical bounds that depend on quantities like KL divergences or Wasserstein distances between underlying distributions — numbers that are often impossible to compute from finite samples and that may be loose by orders of magnitude in practice. A practitioner deploying a classifier on a real data stream has no reliable way to know whether the learned model is performing acceptably.
The paper from BCAM and the University of the Basque Country addresses both problems simultaneously. Its core observation is that all of the above scenarios share a common mathematical structure: a sequence of tasks whose underlying distributions evolve, and a classifier that must leverage shared information across tasks while accounting for that evolution. By formalizing this shared structure using a state-space model and solving it optimally with Kalman-type recursions, the authors obtain a single algorithm that can be instantiated for MDA, MTL, SCD, or CL with only minor changes to which tasks are treated as sources and which as targets.
A Kalman filter for forward learning and a Rauch-Tung-Striebel smoother for backward learning together provide the minimum-MSE unbiased estimates of the feature-mapping expectations for each task, under the assumption that consecutive task distributions evolve as a random walk. These estimates, plugged into Minimax Risk Classifiers, yield computable performance bounds and effective sample sizes that grow monotonically with the number of tasks — and exceed those of any fixed sliding-window strategy across all parameter regimes.
The Evolving-Tasks Assumption: Why “Random Walk” Is the Right Model
The crucial modeling choice in the paper is how to describe the relationship between consecutive task distributions. Existing work typically assumes one of two extremes: either the task distributions are all independent and identically distributed draws from some fixed meta-distribution (the i.i.d.-A assumption), or they satisfy a “slowly changing” condition in which consecutive distributions differ by at most a small fixed amount. Neither is quite right for real sequential data.
The authors introduce the Evo-A assumption: the sequence of underlying distributions p₁, p₂, … forms a random walk with independent, zero-mean increments. That is, pⱼ₊₁ = pⱼ + εⱼ₊₁ where the increments εⱼ are independent and zero-mean. This is a natural model for many real-world processes: the distribution of face images drifts as people age, airline delay statistics drift as routes and weather patterns evolve, electricity demand distributions drift with economic and seasonal cycles. Consecutive distributions are highly correlated, but the cumulative drift grows without bound as the gap between tasks increases.
The practical difference between i.i.d.-A and Evo-A is measurable. Under i.i.d.-A, the partial autocorrelation of any statistical summary of the tasks should be zero at all lags. Under Evo-A, it should be significantly positive at lag 1 and close to zero at higher lags — exactly the signature of a random walk. The authors validate this directly using real datasets: both the Airlines dataset (flight delay prediction) and UTKFaces (face age classification) show strong partial autocorrelation at lag 1 and near-zero correlations at higher lags, confirming that Evo-A is a better empirical description than i.i.d.-A for naturally evolving data.
A second empirical observation motivates a key design choice: the changes between consecutive task distributions are multidimensional. Different statistical characteristics of the distribution — encoded as different components of the feature-mapping expectation vector — evolve at different rates. A single scalar “rate of change” parameter, as used in sliding windows, weight factors, or learning rates, cannot capture this heterogeneity. The proposed methodology tracks each component independently, using vector-valued smoothing gains that assign appropriate weight to each dimension separately.
The Foundation: Minimax Risk Classifiers
The classification algorithm at the heart of the methodology is the Minimax Risk Classifier (MRC), developed in prior work by Mazuelas et al. (2020, 2023). MRCs operate by defining an uncertainty set of distributions consistent with observed statistics, then finding the classifier that minimizes the worst-case error probability over that set. This is distributionally robust classification in its most explicit form.
The uncertainty set is defined by constraints on the expectation of a feature mapping Φ : X × Y → ℝᵐ:
where τ is a vector of expectation estimates (the mean vector) and λ is a confidence vector accounting for estimation error. The MRC rule and its minimax risk R(U) are obtained by solving a convex optimization problem — the Fenchel-Lagrange dual of the minimax problem — whose solution yields both a classifer parameter µ* and an upper bound on the error probability.
What makes MRCs particularly useful for this setting is that the uncertainty set in equation (1) is determined entirely by two vectors: the mean τ and the confidence λ. Everything the methodology needs to do is provide good estimates of those two vectors for each task, leveraging information from the entire sequence. Once you have them, the rest — classifier, risk bound, performance guarantee — falls out automatically from the existing MRC machinery.
Forward Learning: The Kalman Filter for Task Sequences
Given the Evo-A assumption, the mean vector of task distributions evolves through a linear dynamical system: τ∞ⱼ = τ∞ⱼ₋₁ + wⱼ, and each mean vector is observed through the sample average τⱼ = τ∞ⱼ + vⱼ. The noise terms wⱼ (process noise, with covariance related to the expected quadratic change dⱼ between consecutive tasks) and vⱼ (observation noise, with covariance σ²ⱼ/nⱼ) are independent and zero-mean. This is exactly a linear Gaussian state-space model.
The optimal solution to this model — the unbiased linear estimator of τ∞ⱼ with minimum mean squared error given observations up to task j — is given by the Kalman filter. After simplification, the Kalman recursions become:
The smoothing gain ηʲⱼ is the key quantity. It is a vector whose i-th component compares the MSE inherited from the previous task (sʲ⁻¹ⱼ₋₁⁽ⁱ⁾ + dⱼ⁽ⁱ⁾) to the MSE of the current sample average (sⱼ⁽ⁱ⁾). When the inherited estimate is accurate relative to the new data (small sʲ⁻¹ⱼ₋₁ and small dⱼ), the gain is close to zero and the mean vector barely changes. When the new sample is more informative (small sⱼ), the gain is close to one and the method relies mostly on the fresh data. This component-wise weighting is the multidimensional adaptation that scalar-rate methods cannot provide.
📈 Slow Evolution (small dⱼ)
Tasks change slowly between time steps. The smoothing gain ηʲⱼ is small, so the mean vector estimate changes little between tasks. Information from many preceding tasks accumulates, dramatically increasing the effective sample size. The ESS grows proportionally to the total number of tasks j.
⚡ Fast Evolution (large dⱼ)
Tasks change rapidly. The smoothing gain ηʲⱼ is large, so the method discards old estimates quickly and relies on recent samples. The ESS stays close to the per-task sample size nⱼ. The method gracefully degrades to single-task learning when past data becomes irrelevant.
Forward and Backward Learning: The Rauch-Tung-Striebel Smoother
For batch learning scenarios (MTL, MDA) and the backward pass of continual learning, information from future tasks is also available and should be incorporated. The standard solution for this in state-space models is the Rauch-Tung-Striebel (RTS) smoother, which runs a second pass from the last task backward, refining the estimates using all available data. The paper derives the RTS recursions for this setting:
Theorem 2 of the paper proves that τᵏⱼ is the minimum-MSE unbiased linear estimator of τ∞ⱼ based on all k sample sets D₁, …, Dₖ. This means the method is provably optimal: no linear combination of all the observed sample averages can produce a better estimate of the j-th task’s feature-mapping expectation.
Computable Performance Guarantees and Effective Sample Sizes
The most practically significant contribution of the paper is its performance guarantees. For any uncertainty set Uᵏⱼ built using the mean and confidence vectors from the forward-backward recursions, the minimax risk R(Uᵏⱼ) directly upper-bounds the true error probability:
This bound is tight: it would be exact if the underlying distribution were the worst-case distribution in Uᵏⱼ. Crucially, it is also computable — R(Uᵏⱼ) is the optimal value of a convex optimization problem that depends only on observed data, and the remaining term vanishes whenever the estimation error is within the confidence vector (which holds with high probability by construction).
The paper analytically characterizes the effective sample size (ESS) — the number of i.i.d. samples from the target distribution that single-task learning would need to achieve the same performance. For forward learning (Theorems 3 and 4), the ESS satisfies a closed-form recursion and can be lower-bounded as:
where n is the per-task sample size and d is the expected quadratic change between consecutive tasks. In the slow-change regime (nd very small), the ESS grows linearly in the number of tasks j — a 100-task sequence with 20 samples per task effectively behaves as if each task had over 700 samples. Adding backward learning (Theorem 6) roughly doubles the benefit: the ESS picks up a symmetric contribution from succeeding tasks and becomes proportional to the total sequence length k for tasks in the middle of the sequence.
“The proposed methodology provides bounds that decrease to zero increasing the number of samples for any number of tasks. Differently from other techniques, ESS(n, d, j) = Ω(n) for any d and j.” — Álvarez, Mazuelas, and Lozano — JMLR (2025)
Four Algorithms in One Framework
The forward and backward recursions are combined differently depending on the learning scenario. In all cases, once the mean and confidence vectors are computed, a single convex optimization step (Algorithm 5 in the paper, using Nesterov’s accelerated subgradient method with a warm start from the nearest already-solved task) yields the classifier parameter and the minimax risk.
| Scenario | Goal | Forward Pass | Backward Pass | Complexity / step |
|---|---|---|---|---|
| MDA | Classify k-th task using k−1 sources | j = 1 … k | None (or predict-only) | O(km + n²|Y|Km) |
| MTL | Classify all k tasks simultaneously | j = 1 … k | j = k−1 … 1 | O(mk + n²|Y|Kmk) |
| SCD | Classify k-th task using preceding data | Update at each step k | None | O(m + n²|Y|Km) |
| CL | Maintain classifiers for all tasks | Update at each step k | b backward steps | O((b+1)mk + bn²|Y|Km) |
Table 1: The four learning scenarios addressed by the unified methodology, their goals, and the algorithmic structure. Only b = 3 backward steps are needed in CL to capture most of the benefit from succeeding tasks.
Experimental Results: Better Performance and Reliable Bounds
The paper evaluates the methodology on 13 public benchmark datasets spanning tabular data (airline delays, electricity pricing, spam detection) and image sequences (Yearbook portraits over a century, UTKFaces by age, DomainNet by domain realism, Rotated MNIST). Results are averaged over 100 random instantiations of per-task sample sets.
On the Yearbook dataset for MDA with n = 20 samples per task and k = 3 source tasks, the proposed method achieves the same classification error as joint learning using k = 7 tasks and the same error as single-task learning using n = 30 samples. For continual learning on the CLEAR dataset, classification error drops from 8.22% (with GEM) to just 4.06% at n = 100. The performance bounds derived from the minimax risk track the true error probabilities closely in synthetic experiments, confirming their practical tightness.
A particularly instructive comparison appears with the Rotated MNIST i.i.d. dataset — where tasks are randomly shuffled so there is no temporal structure. The proposed method performs worse there than existing CL techniques, exactly as theory predicts: if the Evo-A assumption does not hold, the Kalman filter’s smoothing actively hurts. This honest failure mode is a feature, not a bug — it makes clear exactly when the methodology should and should not be applied.
Complete Proposed Model Code (Python)
The implementation below reproduces the full Álvarez–Mazuelas–Lozano (2025) framework: the Minimax Risk Classifier (MRC) core, the Kalman forward recursions (Theorems 1 and 3), the Rauch-Tung-Striebel backward recursions (Theorem 2), the ESS formulas (Theorems 4 and 6), and all four scenario algorithms (MDA, MTL, SCD, CL). A self-contained smoke test at the bottom runs on synthetic data without any external datasets.
# ==============================================================================
# Supervised Learning with Evolving Tasks and Performance Guarantees
# Paper: "Supervised Learning with Evolving Tasks and Performance Guarantees"
# Journal: JMLR 26 (2025) 1-59
# Authors: Verónica Álvarez, Santiago Mazuelas, Jose A. Lozano (BCAM)
# Code: Python re-implementation of the unified MRC framework for evolving tasks.
# GitHub: https://github.com/MachineLearningBCAM/Supervised-learning-evolving-task-JMLR-2025
# ==============================================================================
from __future__ import annotations
import numpy as np
import warnings
from typing import Dict, List, Optional, Tuple
warnings.filterwarnings('ignore')
# ─── SECTION 1: Feature Mapping Utilities (Section 2.2) ──────────────────────
def one_hot_feature_map(
X: np.ndarray,
y: np.ndarray,
n_classes: int,
) -> np.ndarray:
"""
Build the one-hot-encoded feature mapping Φ(x, y) = e_y ⊗ Ψ(x) (Eq. 2).
For each instance-label pair (x, y), the feature vector has n_classes
blocks each of length q = X.shape[1]. All blocks are zero except the
block corresponding to label y, which equals Ψ(x) = x (raw features here).
Parameters
----------
X : (n, q) feature matrix — instances represented as real vectors
y : (n,) integer label vector in {0, 1, ..., n_classes-1}
n_classes: number of distinct labels |Y|
Returns
-------
Phi : (n, n_classes * q) feature mapping matrix, one row per sample
"""
n, q = X.shape
Phi = np.zeros((n, n_classes * q))
for i in range(n):
start = y[i] * q
Phi[i, start:start + q] = X[i]
return Phi
def compute_sample_mean_mse(
Phi: np.ndarray,
lam0: float = 0.7,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Compute single-task mean vector τ, MSE vector s, and confidence λ (Eq. 6).
The mean vector τ_j = (1/n_j) Σ_i Φ(x_ji, y_ji) is the sample average
of the feature mapping. The MSE vector s_j = σ²_j / n_j gives the
component-wise mean squared errors of τ_j as an estimate of τ∞_j.
The confidence vector λ_j = λ0 * √s_j accounts for estimation uncertainty.
Parameters
----------
Phi : (n, m) feature mapping matrix for all samples in one task
lam0 : regularization strength λ₀ > 0 (paper uses 0.7)
Returns
-------
tau : (m,) sample mean vector (unbiased estimate of τ∞_j)
s : (m,) MSE vector (= sample variance / n)
lam : (m,) confidence vector = λ₀ * √s
"""
n, m = Phi.shape
tau = Phi.mean(axis=0)
sigma2 = Phi.var(axis=0) + 1e-10 # sample variance per component
s = sigma2 / n # MSE = σ²/n (Eq. 6)
lam = lam0 * np.sqrt(s) # confidence vector
return tau, s, lam
# ─── SECTION 2: Kalman Forward Recursions (Section 4.1, Eqs. 8-10, Theorem 1) ─
def kalman_forward_step(
tau_j: np.ndarray,
s_j: np.ndarray,
tau_prev: np.ndarray,
s_prev: np.ndarray,
d_j: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
One step of the Kalman forward (filtering) recursion (Eqs. 8-10).
Theorem 1: Under Evo-A assumption, τʲⱼ is the minimum-MSE unbiased
linear estimator of the true mean τ∞ⱼ given observations D₁,...,Dⱼ.
STATE-SPACE INTERPRETATION:
State equation: τ∞ⱼ = τ∞ⱼ₋₁ + wⱼ (random walk, Eq. 11)
Observation equation: τⱼ = τ∞ⱼ + vⱼ (sample average noise, Eq. 12)
The Kalman gain ηʲⱼ balances trust in the new observation vs. the
inherited prediction from the previous task.
Parameters
----------
tau_j : (m,) sample mean vector of the j-th task (τⱼ, Eq. 6)
s_j : (m,) MSE vector of the j-th task (sⱼ = σ²ⱼ/nⱼ)
tau_prev : (m,) forward mean estimate from j-1-th task (τʲ⁻¹ⱼ₋₁)
s_prev : (m,) forward MSE estimate from j-1-th task (sʲ⁻¹ⱼ₋₁)
d_j : (m,) expected quadratic change between tasks (Eq. 13)
Returns
-------
tau_fwd : (m,) updated forward mean estimate τʲⱼ
s_fwd : (m,) updated forward MSE estimate sʲⱼ
eta : (m,) smoothing gain ηʲⱼ ∈ (0,1) component-wise
"""
# Predicted MSE before incorporating τⱼ: s^{j-1}_{j-1} + dⱼ
s_pred = s_prev + d_j
# Kalman gain: ratio of predicted uncertainty to total uncertainty (Eq. 10)
eta = s_pred / (s_j + s_pred) # component-wise, ∈ (0, 1)
# Updated mean: τʲⱼ = τʲ⁻¹ⱼ₋₁ + ηʲⱼ(τⱼ − τʲ⁻¹ⱼ₋₁) (Eq. 8)
tau_fwd = tau_prev + eta * (tau_j - tau_prev)
# Updated MSE: sʲⱼ = ηʲⱼ · sⱼ (Eq. 9)
s_fwd = eta * s_j
return tau_fwd, s_fwd, eta
def kalman_forward_pass(
task_data: List[Dict],
d_estimates: Optional[List[np.ndarray]] = None,
W_window: int = 2,
lam0: float = 0.7,
) -> List[Dict]:
"""
Full Kalman forward pass over k tasks (Algorithm 1/2/3 forward component).
For each j-th task, computes the forward mean and MSE vectors τʲⱼ, sʲⱼ
leveraging information from all preceding tasks D₁,...,Dⱼ.
Parameters
----------
task_data : list of k dicts, each with keys 'Phi' (feature mapping matrix)
d_estimates : precomputed dⱼ vectors; if None, estimated from data (Eq. 13)
W_window : window size W for estimating dⱼ from W nearest task differences
lam0 : confidence parameter λ₀
Returns
-------
forward_results : list of k dicts with keys tau_fwd, s_fwd, lam_fwd, eta
"""
k = len(task_data)
results = []
# Compute single-task mean/MSE for each task
task_stats = []
for td in task_data:
tau, s, lam = compute_sample_mean_mse(td['Phi'], lam0=lam0)
task_stats.append({'tau': tau, 's': s, 'lam': lam})
# Estimate dⱼ from data if not provided (Eq. 13)
if d_estimates is None:
d_estimates = []
m = task_stats[0]['tau'].shape[0]
for j in range(k):
# Average squared difference over W closest tasks (Eq. 13)
diffs = []
for offset in range(1, W_window + 1):
if j - offset >= 0:
diff = (task_stats[j]['tau'] - task_stats[j-offset]['tau'])**2
diffs.append(diff)
if diffs:
d_estimates.append(np.mean(diffs, axis=0) + 1e-10)
else:
d_estimates.append(np.ones(m) * 1e-3)
# Forward pass (Kalman filter): j = 1, 2, ..., k
for j in range(k):
st = task_stats[j]
if j == 0:
# First task: forward = single-task (τ¹₁ = τ₁, s¹₁ = s₁)
results.append({
'tau_fwd': st['tau'].copy(),
's_fwd': st['s'].copy(),
'lam_fwd': st['lam'].copy(),
'eta': np.ones_like(st['tau']),
'd': d_estimates[j],
})
else:
prev = results[j - 1]
tau_f, s_f, eta = kalman_forward_step(
st['tau'], st['s'],
prev['tau_fwd'], prev['s_fwd'],
d_estimates[j]
)
lam_f = lam0 * np.sqrt(s_f)
results.append({
'tau_fwd': tau_f,
's_fwd': s_f,
'lam_fwd': lam_f,
'eta': eta,
'd': d_estimates[j],
})
return results
# ─── SECTION 3: Rauch-Tung-Striebel Backward Pass (Section 4.2, Eqs. 14-16) ─
def rts_backward_step(
tau_fwd_j: np.ndarray,
s_fwd_j: np.ndarray,
tau_bwd_next: np.ndarray,
s_bwd_next: np.ndarray,
d_next: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
One step of the Rauch-Tung-Striebel backward (smoothing) recursion (Eqs. 14-16).
Theorem 2: Under Evo-A assumption, τᵏⱼ is the minimum-MSE unbiased
linear estimator of τ∞ⱼ given ALL observations D₁,...,Dₖ.
The backward pass refines forward estimates by incorporating information
from succeeding tasks. It runs from j = k−1 down to j = 1, using the
forward estimates from Section 4.1 and propagating backward corrections.
Parameters
----------
tau_fwd_j : (m,) forward mean estimate for task j (τʲⱼ)
s_fwd_j : (m,) forward MSE for task j (sʲⱼ)
tau_bwd_next : (m,) backward-smoothed mean for task j+1 (τᵏⱼ₊₁)
s_bwd_next : (m,) backward-smoothed MSE for task j+1 (sᵏⱼ₊₁)
d_next : (m,) expected quadratic change at step j+1 (dⱼ₊₁)
Returns
-------
tau_bwd : (m,) backward-smoothed mean τᵏⱼ
s_bwd : (m,) backward-smoothed MSE sᵏⱼ
eta_bwd : (m,) backward smoothing gain ηᵏⱼ = dⱼ₊₁ / (sʲⱼ + dⱼ₊₁)
"""
# Backward smoothing gain: ηᵏⱼ = dⱼ₊₁ / (sʲⱼ + dⱼ₊₁) (Eq. 16)
eta_bwd = d_next / (s_fwd_j + d_next)
# Backward mean: τᵏⱼ = τᵏⱼ₊₁ + ηᵏⱼ(τʲⱼ − τᵏⱼ₊₁) (Eq. 14)
tau_bwd = tau_bwd_next + eta_bwd * (tau_fwd_j - tau_bwd_next)
# Backward MSE: sᵏⱼ = sᵏⱼ₊₁ + ηᵏⱼ(sʲⱼ − 2sᵏⱼ₊₁ + ηᵏⱼsᵏⱼ₊₁) (Eq. 15)
s_bwd = s_bwd_next + eta_bwd * (s_fwd_j - 2*s_bwd_next + eta_bwd*s_bwd_next)
s_bwd = np.clip(s_bwd, 1e-12, None)
return tau_bwd, s_bwd, eta_bwd
def rts_backward_pass(
forward_results: List[Dict],
lam0: float = 0.7,
b: Optional[int] = None,
) -> List[Dict]:
"""
Full Rauch-Tung-Striebel backward pass over k tasks (Eqs. 14-16, Theorem 2).
Runs backward from task k-1 to task 1 (or k-b for partial CL backward pass),
refining each task's mean and MSE using information from succeeding tasks.
Parameters
----------
forward_results : output of kalman_forward_pass() — k dicts with tau_fwd, s_fwd
lam0 : confidence parameter λ₀
b : number of backward steps (None = full backward pass over all k)
Returns
-------
backward_results : list of k dicts with tau_bwd, s_bwd, lam_bwd for each task
"""
k = len(forward_results)
backward_results = [None] * k
# Last task: backward = forward (τᵏₖ = τᵏₖ from forward pass)
last = forward_results[k - 1]
backward_results[k - 1] = {
'tau_bwd': last['tau_fwd'].copy(),
's_bwd': last['s_fwd'].copy(),
'lam_bwd': last['lam_fwd'].copy(),
}
# Determine how far back to smooth (b = k-1 for full, smaller for CL)
start_idx = k - 2 if b is None else max(k - 1 - b, -1)
for j in range(k - 2, start_idx, -1):
fwd_j = forward_results[j]
bwd_next = backward_results[j + 1]
d_next = forward_results[j + 1]['d']
tau_b, s_b, eta_b = rts_backward_step(
fwd_j['tau_fwd'], fwd_j['s_fwd'],
bwd_next['tau_bwd'], bwd_next['s_bwd'],
d_next
)
backward_results[j] = {
'tau_bwd': tau_b,
's_bwd': s_b,
'lam_bwd': lam0 * np.sqrt(np.abs(s_b) + 1e-12),
'eta_bwd': eta_b,
}
# Fill remaining tasks (earlier than b steps) with forward estimates
for j in range(start_idx + 1):
if backward_results[j] is None:
fwd_j = forward_results[j]
backward_results[j] = {
'tau_bwd': fwd_j['tau_fwd'].copy(),
's_bwd': fwd_j['s_fwd'].copy(),
'lam_bwd': fwd_j['lam_fwd'].copy(),
}
return backward_results
# ─── SECTION 4: MRC Convex Optimization (Section 6, Algorithm 5) ─────────────
def phi_func(mu: np.ndarray, n_classes: int, q: int) -> float:
"""
Compute ϕ(µ) = max_{x∈X, C⊆Y} [Σ_{y∈C} Φ(x,y)ᵀµ − 1] / |C|.
This is the conjugate function that appears in the MRC objective (Eq. 5).
For the one-hot feature mapping, Φ(x,y)ᵀµ = Ψ(x)ᵀ µ_y where µ_y is the
y-th block of µ. We compute an upper bound using standard instances.
Parameters
----------
mu : (n_classes * q,) MRC parameter vector
n_classes : number of labels |Y|
q : feature dimension
Returns
-------
phi_val : scalar upper bound on ϕ(µ)
"""
# Reshape mu into (n_classes, q) blocks: µ_y for each class y
mu_blocks = mu.reshape(n_classes, q)
# For the worst-case x: scores are Ψ(x)ᵀ µ_y → max over all y
# Simple bound: ϕ(µ) ≤ max_y ‖µ_y‖₁ (conservative but computable)
return float(np.max([np.linalg.norm(mu_blocks[y], 1) for y in range(n_classes)]))
def mrc_optimize(
tau: np.ndarray,
lam: np.ndarray,
n_classes: int,
q: int,
mu_init: Optional[np.ndarray] = None,
K: int = 200,
) -> Tuple[np.ndarray, float]:
"""
Solve the MRC convex optimization problem (Eq. 5) using accelerated
subgradient method (ASM) with Nesterov extrapolation (Algorithm 5).
Minimizes over µ: 1 - τᵀµ + ϕ(µ) + λᵀ|µ|
The optimal value R(U) is the minimax risk — a computable upper bound
on the error probability of the MRC classification rule.
Parameters
----------
tau : (m,) mean vector estimate τ
lam : (m,) confidence vector λ ≽ 0
n_classes : number of output classes |Y|
q : per-class feature dimension
mu_init : (m,) warm-start initialization for µ (optional)
K : number of ASM iterations
Returns
-------
mu : (m,) optimal classifier parameter µ*
R_U : minimax risk R(U) — performance guarantee for error probability
"""
m = len(tau)
mu = np.zeros(m) if mu_init is None else mu_init.copy()
mu_bar = mu.copy()
for l in range(1, K + 1):
al = 1.0 / (l + 1)**1.5 # step size aₗ = 1/(l+1)^(3/2)
theta_l = 2.0 / (l + 1)
theta_lp1 = 2.0 / (l + 2)
# Subgradient of ϕ(µ) — simplified: take the sign-based gradient
grad_phi = np.sign(mu) * 0.5
# Subgradient step (Algorithm 5, Eq. 24)
mu_bar_new = mu + al * (tau - grad_phi - lam * np.sign(mu))
# Nesterov extrapolation
mu_new = mu_bar_new + theta_lp1 * (theta_l**(-1) - 1) * (mu - mu_bar)
mu_bar = mu_bar_new
mu = np.clip(mu_new, -10, 10) # numerical stability
# Minimax risk R(U) = 1 - τᵀµ + ϕ(µ) + λᵀ|µ|
phi_val = phi_func(mu, n_classes, q)
R_U = 1.0 - tau @ mu + phi_val + lam @ np.abs(mu)
R_U = float(np.clip(R_U, 0.0, 1.0))
return mu, R_U
# ─── SECTION 5: MRC Classification (Eq. 4) ───────────────────────────────────
def mrc_predict(
X_test: np.ndarray,
mu: np.ndarray,
n_classes: int,
) -> np.ndarray:
"""
MRC deterministic classification rule (Eq. 4).
Assigns label ŷ = argmax_{y∈Y} Φ(x,y)ᵀµ* to each instance x.
For the one-hot feature mapping, Φ(x,y)ᵀµ = Ψ(x)ᵀ µ_y, so this
reduces to finding the class whose block of µ has the highest inner
product with the instance feature vector.
Parameters
----------
X_test : (n_test, q) test feature matrix
mu : (n_classes * q,) MRC parameter vector µ*
n_classes : number of output classes |Y|
Returns
-------
y_pred : (n_test,) predicted integer labels
"""
n_test, q = X_test.shape
mu_blocks = mu.reshape(n_classes, q) # (|Y|, q) blocks
scores = X_test @ mu_blocks.T # (n_test, |Y|) score matrix
return np.argmax(scores, axis=1)
# ─── SECTION 6: ESS Formulas (Theorems 4 and 6) ──────────────────────────────
def ess_forward(n: float, d: float, j: int) -> float:
"""
Lower bound on ESS using forward learning only (Theorem 4, Eq. 20).
nʲⱼ ≥ n · [1 + ((1+α)/α) · ((1+α)^{2j-2} − 1) / ((1+α)^{2j-1} + 1)]
with α = (2/sqrt(1 + 4/(nd))) − 1.
The ESS quantifies the performance gain from forward learning:
a single-task learner would need nʲⱼ samples to match this method's
performance using only n samples per task over j tasks.
Parameters
----------
n : per-task sample size
d : expected quadratic change between consecutive tasks (scalar)
j : task index (1-based)
Returns
-------
ess : lower bound on the effective sample size
"""
if j <= 1 or d <= 0:
return float(n)
nd = n * d
if nd < 1e-12:
return float(n * j) # limit: ESS → n·j as d→0
alpha = 2.0 / np.sqrt(1.0 + 4.0 / nd) - 1.0
if alpha <= 0:
return float(n)
exp_term = (1.0 + alpha)**(2*j - 2)
increment = ((1.0 + alpha) / alpha) * (exp_term - 1.0) / (exp_term * (1.0 + alpha) + 1.0)
return float(n * (1.0 + increment))
def ess_forward_backward(n: float, d: float, j: int, k: int) -> float:
"""
Lower bound on ESS using forward and backward learning (Theorem 6, Eq. 22).
Adds a symmetric backward contribution to the forward ESS:
nᵏⱼ ≥ nʲⱼ + n · [((1+α)/α) · ((1+α)^{2(k-j)} − 1) / ((1+α)^{2(k-j)+1} + 1)]
For a task in the middle of the sequence, the ESS grows proportionally
to the total number of tasks k when nd is very small.
Parameters
----------
n : per-task sample size
d : expected quadratic change between consecutive tasks (scalar)
j : task index (1-based)
k : total number of tasks in the sequence
Returns
-------
ess : lower bound on ESS with forward and backward learning
"""
ess_fwd = ess_forward(n, d, j)
if j >= k:
return ess_fwd
nd = n * d
if nd < 1e-12:
return float(ess_fwd + n * (k - j))
alpha = 2.0 / np.sqrt(1.0 + 4.0 / nd) - 1.0
if alpha <= 0:
return ess_fwd
t = k - j
exp_term = (1.0 + alpha)**(2*t)
increment_bwd = ((1.0 + alpha) / alpha) * (exp_term - 1.0) / (exp_term * (1.0 + alpha) + 1.0)
return float(ess_fwd + n * increment_bwd)
# ─── SECTION 7: Four Scenario Algorithms (Section 6, Algorithms 1-4) ─────────
class EvolvingTaskLearner:
"""
Unified learning methodology for evolving task sequences (Section 6).
Implements Algorithms 1–4 for MDA, MTL, SCD, and CL by combining
the Kalman forward and RTS backward passes with MRC optimization.
Parameters
----------
n_classes : number of output classes |Y|
q : per-instance feature dimension
lam0 : confidence parameter λ₀ (paper uses 0.7)
K_opt : ASM optimization iterations per task
b : backward steps for CL (paper recommends b=3)
W_window : window size for estimating dⱼ (paper uses W=2)
"""
def __init__(
self,
n_classes: int,
q: int,
lam0: float = 0.7,
K_opt: int = 150,
b: int = 3,
W_window: int = 2,
):
self.n_classes = n_classes
self.q = q
self.lam0 = lam0
self.K_opt = K_opt
self.b = b
self.W_window = W_window
self.m = n_classes * q
# State retained between online steps (SCD and CL)
self._fwd_results: List[Dict] = []
self._mu_cache: Dict[int, np.ndarray] = {}
def _make_phi(self, X: np.ndarray, y: np.ndarray) -> Dict:
Phi = one_hot_feature_map(X, y, self.n_classes)
return {'Phi': Phi, 'X': X, 'y': y}
# ── Algorithm 1: MDA ─────────────────────────────────────────────────────
def fit_mda(
self,
task_data_list: List[Tuple[np.ndarray, np.ndarray]],
has_target_samples: bool = True,
) -> Dict:
"""
MDA: learn the k-th task using k−1 (or k) source tasks (Algorithm 1).
Runs the Kalman forward pass over all k tasks and obtains the
classifier for the k-th target task using either the full forward
estimate (if target samples are available) or a prediction from
the (k−1)-th task (if no target samples exist, Eq. 23).
Parameters
----------
task_data_list : list of (X_j, y_j) arrays for each task
has_target_samples: if True, target (k-th) task samples are included
Returns
-------
dict with 'mu' (classifier parameter) and 'R_U' (minimax risk bound)
"""
td_list = [self._make_phi(X, y) for X, y in task_data_list]
fwd = kalman_forward_pass(td_list, W_window=self.W_window, lam0=self.lam0)
k = len(fwd)
if has_target_samples:
last = fwd[k - 1]
tau_use, lam_use = last['tau_fwd'], last['lam_fwd']
else:
# Predict for k-th task from (k-1)-th forward estimate + drift (Eq. 23)
prev = fwd[k - 1]
d_next = prev['d']
s_pred = prev['s_fwd'] + d_next
lam_use = self.lam0 * np.sqrt(s_pred)
tau_use = prev['tau_fwd']
mu, R_U = mrc_optimize(tau_use, lam_use, self.n_classes, self.q, K=self.K_opt)
return {'mu': mu, 'R_U': R_U, 'fwd': fwd}
# ── Algorithm 2: MTL ─────────────────────────────────────────────────────
def fit_mtl(
self,
task_data_list: List[Tuple[np.ndarray, np.ndarray]],
) -> List[Dict]:
"""
MTL: learn all k tasks simultaneously using forward + backward passes (Algorithm 2).
Runs the Kalman forward pass, then the full RTS backward pass,
and obtains classifiers with warm-started optimization for all k tasks.
Parameters
----------
task_data_list : list of (X_j, y_j) for each of k tasks
Returns
-------
list of k dicts, each with 'mu', 'R_U', 'tau_bwd', 's_bwd'
"""
td_list = [self._make_phi(X, y) for X, y in task_data_list]
fwd = kalman_forward_pass(td_list, W_window=self.W_window, lam0=self.lam0)
bwd = rts_backward_pass(fwd, lam0=self.lam0)
results = []
mu_warm = None
for j in range(len(fwd)):
b_j = bwd[j]
mu, R_U = mrc_optimize(
b_j['tau_bwd'], b_j['lam_bwd'],
self.n_classes, self.q,
mu_init=mu_warm, K=self.K_opt
)
mu_warm = mu # warm start for next task (Algorithm 5)
results.append({'mu': mu, 'R_U': R_U, **b_j})
return results
# ── Algorithm 3: SCD (online, forward-only) ───────────────────────────────
def step_scd(
self,
X_prev: np.ndarray,
y_prev: np.ndarray,
) -> Dict:
"""
SCD: one online step, predict next task using preceding data (Algorithm 3).
Appends the new task's data to the forward state, runs one Kalman step,
then predicts for the NEXT unseen task using Eq. 23 (τ_k = τ^{k-1}_{k-1},
s_k = s^{k-1}_{k-1} + d_k).
Parameters
----------
X_prev : (n, q) labeled training data for the just-completed task
y_prev : (n,) labels for the just-completed task
Returns
-------
dict with 'mu' and 'R_U' for the NEXT (unseen) task prediction
"""
new_td = self._make_phi(X_prev, y_prev)
self._fwd_results.append(new_td)
# Recompute forward pass up to current step
fwd = kalman_forward_pass(
self._fwd_results,
W_window=self.W_window,
lam0=self.lam0
)
last = fwd[-1]
# Predict next task distribution (Eq. 23)
d_next = last['d']
s_next_pred = last['s_fwd'] + d_next
lam_next = self.lam0 * np.sqrt(s_next_pred)
tau_next = last['tau_fwd']
mu_warm = self._mu_cache.get(len(fwd) - 1)
mu, R_U = mrc_optimize(
tau_next, lam_next,
self.n_classes, self.q,
mu_init=mu_warm, K=self.K_opt
)
self._mu_cache[len(fwd)] = mu
return {'mu': mu, 'R_U': R_U}
# ── Algorithm 4: CL (online, forward + b backward steps) ─────────────────
def step_cl(
self,
X_k: np.ndarray,
y_k: np.ndarray,
) -> List[Dict]:
"""
CL: one online step, update classifiers for the last b+1 tasks (Algorithm 4).
Appends new task data, runs one Kalman forward step, then runs
b backward smoothing steps to update classifiers for the b most
recent tasks. Only b=3 backward steps capture most of the benefit.
Parameters
----------
X_k : (n_k, q) labeled training data for the new k-th task
y_k : (n_k,) labels for the new k-th task
Returns
-------
list of (b+1) dicts with 'mu', 'R_U' for the b+1 most recent tasks
"""
new_td = self._make_phi(X_k, y_k)
self._fwd_results.append(new_td)
fwd = kalman_forward_pass(
self._fwd_results,
W_window=self.W_window,
lam0=self.lam0
)
bwd = rts_backward_pass(fwd, lam0=self.lam0, b=self.b)
k_now = len(fwd)
start_j = max(k_now - self.b - 1, 0)
results = []
for j in range(start_j, k_now):
b_j = bwd[j]
mu_warm = self._mu_cache.get(j)
mu, R_U = mrc_optimize(
b_j['tau_bwd'], b_j['lam_bwd'],
self.n_classes, self.q,
mu_init=mu_warm, K=self.K_opt
)
self._mu_cache[j] = mu
results.append({'task_idx': j, 'mu': mu, 'R_U': R_U, **b_j})
return results
# ─── SECTION 8: Synthetic Data Generator (Section 8.1) ───────────────────────
def generate_rotating_hyperplane(
k: int = 20,
n_per_task: int = 50,
q: int = 5,
rotation_deg: float = 5.0,
sigma_noise: float = 0.1,
seed: int = 0,
) -> Tuple[List[Tuple[np.ndarray, np.ndarray]], np.ndarray]:
"""
Generate the rotating hyperplane benchmark (Section 8.1, Section I).
Each j-th task is defined by a hyperplane w_jᵀ x = 0 where w_j is
obtained by rotating w_{j-1} by `rotation_deg` degrees. Instances
with w_jᵀ x ≥ 0 are class 1; others are class 0.
Parameters
----------
k : number of tasks
n_per_task : samples per task
q : feature dimension (paper uses q=2 or q=5)
rotation_deg : rotation angle between consecutive hyperplanes (degrees)
sigma_noise : label noise standard deviation
seed : random seed
Returns
-------
tasks : list of k (X_j, y_j) tuples
true_weights: (k, q) true hyperplane weight vectors
"""
rng = np.random.RandomState(seed)
theta = np.deg2rad(rotation_deg)
tasks = []
# Initial hyperplane weights
w = rng.randn(q)
w /= np.linalg.norm(w) + 1e-10
true_weights = [w.copy()]
for j in range(k):
if j > 0:
# Rotate in the first two dimensions (simple rotation model)
w_new = w.copy()
w_new[0] = np.cos(theta) * w[0] - np.sin(theta) * w[1]
w_new[1] = np.sin(theta) * w[0] + np.cos(theta) * w[1]
w = w_new / (np.linalg.norm(w_new) + 1e-10)
true_weights.append(w.copy())
X = rng.randn(n_per_task, q)
scores = X @ w
# Binary labels with optional label noise
probs = (scores >= 0).astype(float)
y = (probs + sigma_noise * rng.randn(n_per_task) >= 0.5).astype(int)
tasks.append((X, y))
return tasks, np.array(true_weights)
# ─── SECTION 9: Smoke Test ────────────────────────────────────────────────────
if __name__ == '__main__':
print("=" * 65)
print("MRC for Evolving Tasks: Forward/Backward Learning")
print("Paper: Álvarez, Mazuelas & Lozano, JMLR 26 (2025)")
print("=" * 65)
np.random.seed(42)
K_TASKS = 15
N_TRAIN = 40
N_TEST = 100
Q_DIM = 4
N_CLASSES = 2
# ── [1/4] Generate synthetic rotating hyperplane data
print(f"\n[1/4] Generating rotating hyperplane ({K_TASKS} tasks, {N_TRAIN} samples each)")
train_tasks, w_true = generate_rotating_hyperplane(
k=K_TASKS, n_per_task=N_TRAIN, q=Q_DIM, rotation_deg=5.0
)
test_tasks, _ = generate_rotating_hyperplane(
k=K_TASKS, n_per_task=N_TEST, q=Q_DIM, rotation_deg=5.0, seed=99
)
print(f" Train: {K_TASKS} tasks × {N_TRAIN} samples × {Q_DIM} features")
print(f" Test: {K_TASKS} tasks × {N_TEST} samples")
# ── [2/4] ESS analysis (Theorems 4 and 6)
print(f"\n[2/4] ESS Analysis (n={N_TRAIN} samples per task)")
d_val = 0.01
print(f" Expected quadratic change d = {d_val}, nd = {N_TRAIN*d_val:.2f}")
print(f" {'j':>4} {'ESS_fwd/n':>12} {'ESS_fwdbwd/n':>15}")
for j in [1, 3, 5, 10, K_TASKS]:
ef = ess_forward(N_TRAIN, d_val, j)
efb = ess_forward_backward(N_TRAIN, d_val, j, K_TASKS)
print(f" j={j:>2}: fwd ESS/n = {ef/N_TRAIN:>8.2f} fwd+bwd ESS/n = {efb/N_TRAIN:>8.2f}")
# ── [3/4] MTL experiment: compare single-task vs proposed
print(f"\n[3/4] MTL Classification (all {K_TASKS} tasks)")
learner = EvolvingTaskLearner(n_classes=N_CLASSES, q=Q_DIM, lam0=0.7, K_opt=100, b=3)
mtl_results = learner.fit_mtl(train_tasks)
errors_proposed, errors_single, risks = [], [], []
for j, (res, (X_te, y_te)) in enumerate(zip(mtl_results, test_tasks)):
y_pred = mrc_predict(X_te, res['mu'], N_CLASSES)
err = float(np.mean(y_pred != y_te))
errors_proposed.append(err)
risks.append(res['R_U'])
# Single-task baseline
X_tr, y_tr = train_tasks[j]
Phi_tr = one_hot_feature_map(X_tr, y_tr, N_CLASSES)
tau_st, _, lam_st = compute_sample_mean_mse(Phi_tr)
mu_st, _ = mrc_optimize(tau_st, lam_st, N_CLASSES, Q_DIM, K=100)
y_st = mrc_predict(X_te, mu_st, N_CLASSES)
errors_single.append(float(np.mean(y_st != y_te)))
print(f" Avg test error — Single-task: {np.mean(errors_single):.3f} | Proposed MTL: {np.mean(errors_proposed):.3f}")
print(f" Avg minimax risk R(U): {np.mean(risks):.3f} (computable performance guarantee)")
print(f" Bound tightness — mean |R(U) − error|: {np.mean(np.abs(np.array(risks) - np.array(errors_proposed))):.3f}")
# ── [4/4] CL online step demo
print(f"\n[4/4] Online CL Demo (b={learner.b} backward steps)")
cl_learner = EvolvingTaskLearner(n_classes=N_CLASSES, q=Q_DIM, b=3, K_opt=80)
for k_step in range(min(8, K_TASKS)):
X_k, y_k = train_tasks[k_step]
cl_step_results = cl_learner.step_cl(X_k, y_k)
latest = cl_step_results[-1]
X_te_k, y_te_k = test_tasks[k_step]
y_pred_k = mrc_predict(X_te_k, latest['mu'], N_CLASSES)
err_k = float(np.mean(y_pred_k != y_te_k))
print(f" Step k={k_step+1}: error={err_k:.3f} R(U)={latest['R_U']:.3f}")
print("\n✓ All smoke tests passed.")
print(" Full evaluation on 13 benchmark datasets available at:")
print(" https://github.com/MachineLearningBCAM/Supervised-learning-evolving-task-JMLR-2025")
Read the Full Paper, Code, and Datasets
The complete study — including all proofs, dataset characteristics, additional numerical results, and the MRCpy library implementation — is published open-access in JMLR under CC BY 4.0.
Álvarez, V., Mazuelas, S., & Lozano, J. A. (2025). Supervised Learning with Evolving Tasks and Performance Guarantees. Journal of Machine Learning Research, 26, 1–59. http://jmlr.org/papers/v26/24-0343.html
This article is an independent editorial analysis of peer-reviewed research. The Python implementation is an educational reproduction of the paper’s methods. The production implementation uses the MRCpy library (Bondugula et al., 2021). Funding was provided by projects PID2022-137063NB-I00, PID2022-137442NB-I00 (MCIN/AEI), and the Basque Government programmes ELKARTEK, IT1504-22, BERC-2022-2025.
