The ODE Method for Stochastic Approximation with Markovian Noise: Breaking the Deadly Triad in Reinforcement Learning | AI Trend Blend

The ODE Method Gets Its Markovian Upgrade — and Reinforcement Learning’s Most Stubborn Problem Finally Has a Clean Answer

A team from the University of Virginia and Scaled Foundations has extended the celebrated Borkar-Meyn theorem to handle Markovian noise, unlocking the first rigorous almost-sure convergence guarantees for GTD(λ) and ETD(λ) — the two principal algorithms for tackling the deadly triad in off-policy RL.

Stochastic Approximation Markovian Noise Borkar-Meyn Theorem ODE Method Reinforcement Learning Eligibility Traces GTD(λ) ETD(λ) Deadly Triad Off-Policy Learning Almost Sure Convergence
Diagram showing the ODE method connecting stochastic approximation iterates with continuous ODE trajectories under Markovian noise for reinforcement learning stability proofs
📐 ODE Method for Stochastic Approximation with Markovian Noise — Stability & Convergence
The ODE method connects the discrete, stochastic iterates \(\{x_n\}\) to the continuous trajectories of an ordinary differential equation. Extending this connection to Markovian noise — where the standard Martingale convergence theorem fails — required fundamentally new proof machinery built around diminishing asymptotic rates of change and a subsequence-based contradiction argument.

For roughly thirty years, reinforcement learning researchers faced an uncomfortable gap between what their algorithms could do and what they could prove. Off-policy temporal difference methods with eligibility traces work beautifully in practice, but their convergence theory kept running into a wall: the standard mathematical tools for stochastic approximation were all built for independent or Martingale difference noise, and the noise in these algorithms is emphatically neither. Shuze Daniel Liu, Shuhang Chen, and Shangtong Zhang have now closed that gap. Their paper, published in JMLR in 2025, extends the Borkar-Meyn stability theorem to the Markovian noise setting under assumptions that actually hold for the algorithms researchers care about — and the payoff is immediate: the first clean, bias-free almost-sure convergence proofs for GTD(λ) and ETD(λ).


Why This Problem Existed at All

Stochastic approximation is the umbrella framework that covers a huge swath of machine learning — stochastic gradient descent, Q-learning, temporal difference methods, and much more. The basic update rule is deceptively simple:

Stochastic Approximation Update (Eq. 1) $$x_{n+1} = x_n + \alpha(n) H(x_n, Y_{n+1})$$

You have a vector \(x_n\), a sequence of random “noise” inputs \(\{Y_n\}\), a function \(H\) that maps your current state and the noise to an increment, and a shrinking learning rate \(\alpha(n)\). The deep question is: does this sequence converge, and to what?

The answer, broadly, comes from the ODE method: treat \(\{x_n\}\) as Euler’s discretization of a continuous ODE \(\dot{x}(t) = h(x(t))\) where \(h(x) = \mathbb{E}[H(x, y)]\). If that ODE has a nice attractor, the discrete stochastic iterates should track it. But there is a catch — and it is a big one. Before you can say anything about convergence, you first need stability: you need \(\sup_n \|x_n\| < \infty\) almost surely. Without that, the ODE analogy falls apart entirely.

The Borkar-Meyn theorem, published in 2000, gave a beautiful sufficient condition for stability: if a rescaled limiting ODE (the “ODE@∞”) is globally asymptotically stable at zero, then the iterates stay bounded. The catch? It required \(\{Y_n\}\) to be i.i.d. noise. With i.i.d. noise, the centered increments \(\{H(x_n, Y_{n+1}) – h(x_n)\}\) form a Martingale difference sequence, and the Martingale convergence theorem carries the analysis home.

Reinforcement learning — especially off-policy RL with function approximation and eligibility traces — does not have i.i.d. noise. The “noise” is the state of a Markov chain that may wander through an unbounded, uncountable space. In some configurations, the eligibility trace component \(e_t\) is provably unbounded almost surely and may not even have a finite second moment. Previous attempts to handle this (Ramaswamy & Bhatnagar 2018; Borkar et al. 2021) either required assumptions that fail for the most important RL algorithms or needed the eighth moment of certain functions to be bounded — a condition that is violated precisely when things get interesting.

The Core Problem

The Borkar-Meyn stability theorem — the bedrock of stochastic approximation convergence theory — only applies to Martingale difference noise. Off-policy RL algorithms with eligibility traces produce Markovian noise that can be unbounded almost surely with potentially infinite variance, making every prior stability result inapplicable to the algorithms that need it most.


What the New Paper Actually Proves

The central result is Theorem 7: under six assumptions that the authors carefully calibrate to be verifiable in real RL settings, the iterates \(\{x_n\}\) are stable almost surely. Convergence then follows from standard arguments as Corollary 8, which says \(\{x_n\}\) converges to a bounded invariant set of the limiting ODE.

The six assumptions deserve a close look, because choosing them correctly is most of the work.

Assumptions 1 and 2 are standard: the Markov chain \(\{Y_n\}\) has a unique stationary distribution, and the learning rates are positive, decreasing, and satisfy the Robbins-Monro conditions. Nothing exotic here.

Assumption 3 controls how the rescaled update function \(H_c(x,y) = H(cx,y)/c\) converges to a limit \(H_\infty(x,y)\) as \(c \to \infty\). In nearly all RL applications, \(H(x,y)\) is linear in \(x\), so this convergence is immediate and the “remainder” function \(b(x,y)\) does not even depend on \(x\).

Assumptions 4 and 5 impose Lipschitz continuity on \(H\) and require the ODE@∞ to be globally asymptotically stable at zero — this is the structural condition that has driven the ODE method since Borkar and Meyn’s original paper. For GTD(λ), this reduces to showing that a particular block matrix is Hurwitz (all eigenvalues with negative real parts), which is a standard result.

The real innovation is Assumption 6. Rather than imposing a Lyapunov drift condition on the Markov chain (as Borkar et al. 2021 did) or requiring compactness of the state space (as Ramaswamy & Bhatnagar 2018 did), the authors ask for something much simpler: the asymptotic rate of change of certain weighted sums involving the Markov chain must diminish to zero:

Diminishing Rate of Change (Assumption 6) $$\lim_{n\to\infty} \alpha(n) \sum_{i=1}^{n} \left[g(Y_i) – \mathbb{E}_{y \sim d_Y}[g(y)]\right] = 0 \quad \text{a.s.}$$

For learning rates \(\alpha(n) = B_1/(n+B_2)\), this is implied by the strong law of large numbers. For polynomial rates with exponent in \((0.5, 1]\), it is implied by the law of the iterated logarithm. Both hold for the Markov chains arising in GTD(λ) and ETD(λ) by results of Huizhen Yu (2012, 2015).

The beauty of this framing is what it does not require. The chain need not be positive Harris. It need not satisfy any Lyapunov drift condition. Its moments need not be finite. All that matters is this one asymptotic averaging property — and for the chains that appear in off-policy RL with traces, Yu proved precisely this property years ago, without anyone having a theorem that could use it.

“We only need \(h_\infty(x)\) to behave well, i.e., we only need \(H_\infty(x, y)\) to behave well in expectation. Prior work required it to behave well almost surely — a condition that fails for essentially every important off-policy algorithm.” — Liu, Chen, and Zhang, JMLR (2025)

The Proof Architecture: Why Subsequences Are the Key

Every prior proof of the Borkar-Meyn theorem tried to show that the discretization error — the gap between the stochastic iterate and the corresponding ODE solution — diminishes along the entire sequence. This paper takes a different path, and the difference is not just technical; it reflects a genuine insight about what the Arzelà-Ascoli theorem can and cannot give you.

The proof proceeds in five stages. First, the authors show that the asymptotic rate of change of the functions in Assumption 6 is zero along any sample path where the assumptions hold. This is Lemma 9 — the technical engine that replaces the Martingale convergence argument of the original Borkar-Meyn paper.

Second, they scale the iterates by \(r_n = \max\{1, \|x(T_n)\|\}\) to obtain a sequence of bounded functions \(\hat{x}(T_n + t)\). These scaled functions, together with the corresponding ODE solutions \(z_n(t)\) and their differences \(f_n(t)\), are shown to be equicontinuous in the extended sense — meaning their oscillations over short time intervals go to zero asymptotically.

Third — and here is the key move — the paper applies the Arzelà-Ascoli theorem to extract a subsequence \(\{n_k\}\) along which all three sequences converge uniformly. Because the assumptions only guarantee convergence along subsequences, not along the entire sequence, this is genuinely necessary rather than merely convenient.

Fourth, the authors prove that the discretization error \(f_{n_k}(t) \to 0\) uniformly as \(k \to \infty\). The clever part here is using the Moore-Osgood theorem for interchanging limits — the double limit over both \(k\) (the subsequence index) and \(j\) (a rescaling parameter) can be computed as an iterated limit, and each iterated limit is tractable.

Fifth, a Corollary 3.3 of Borkar (2009) guarantees that for large enough rescaling, ODE@∞ trajectories starting in the unit ball must contract significantly by time \(T\). Combined with the diminishing discretization error, this forces \(\|x(T_{n_{k_0}+1})\| \leq \|x(T_{n_{k_0}})\|\) for some index \(k_0\) — directly contradicting the assumption that \(r_n \to \infty\) along the subsequence.

Technical Innovation

The subsequence argument via Arzelà-Ascoli is what makes the Markovian extension possible. Prior proofs required the discretization error to vanish along the full sequence — a stronger statement that demands the noise averaging to happen uniformly. By working with a subsequence, the authors only need averaging along the specific sample paths where the Markov chain is “well-behaved,” which is exactly what the strong law of large numbers and law of the iterated logarithm deliver.


The Deadly Triad and Why It Has Been So Hard to Escape

The “deadly triad” in RL refers to the notorious instability that emerges when an algorithm combines three ingredients simultaneously: bootstrapping (updating estimates from other estimates), function approximation (representing value functions with a parameterized class), and off-policy learning (learning about one policy while following another). Put all three together with a constant per-step computational budget, and the iterates can diverge to infinity — as Baird demonstrated with a concrete counterexample back in 1995.

Two algorithmic families dominate the attempts to escape this triad: gradient temporal difference learning (GTD) and emphatic temporal difference learning (ETD). GTD, introduced by Sutton et al. in 2008, performs stochastic gradient descent on an objective called the off-policy mean squared projected Bellman error. ETD, introduced by Sutton et al. in 2016, reweights the TD update by an “emphasis” factor that corrects for the distribution mismatch between behavior and target policies.

Both work. Both have clean convergence properties in expectation, and both have seen years of empirical success. The open question was convergence almost surely for their eligibility-trace variants — GTD(λ) and ETD(λ) — without adding projection operators or regularization that would bias the limit point.

GTD(λ): The First Complete Almost-Sure Convergence Proof

GTD(λ) maintains two interacting parameter vectors: \(\theta\) (the value function weights) and \(\nu\) (an auxiliary vector from the weight-duplication trick). The update involves an eligibility trace \(e_t\) evolving according to:

GTD(λ) Eligibility Trace (Eq. 43) $$e_t = \lambda\gamma\rho_{t-1}e_{t-1} + \phi_t$$

where \(\rho_t = \pi(A_t|S_t)/\mu(A_t|S_t)\) is the importance sampling ratio. The trace \(e_t\) is a product of importance sampling ratios accumulated over time — and in off-policy problems, these ratios can be large and their products can grow without bound. Proposition 3.1 of Yu (2012) showed that whenever there is a cycle in the Markov chain, \(e_t\) diverges to infinity almost surely in “essentially all” natural problems. The second moment of \(e_t\) can likewise be infinite.

This is exactly the setting where all prior stability theorems break down. Borkar et al. (2021) needed the eighth moment bounded. The V4 drift condition requires the Lyapunov function \(v(y)\) to grow polynomially along the chain — but the chain \(\{Y_t\} = \{(S_t, A_t, S_{t+1}, e_t)\}\) evolves in the unbounded space \(\mathbb{R}^K\) and has no such Lyapunov function. Prior to this paper, the only way to handle GTD(λ) rigorously was Yu (2017), which added projection operators and accepted a biased limit point.

Theorem 24 in this paper establishes that, under standard assumptions (finite irreducible MDP, full-rank features, appropriate learning rates), the iterates \(\{(\nu_t, \theta_t)\}\) generated by GTD(λ) converge almost surely to \(-A^{\prime -1}b’\), which is exactly the unbiased zero of the off-policy MSPBE. No projection. No regularization. No bias.

ETD(λ): A Cleaner Proof of a Previously Unwieldy Result

For ETD(λ), the situation is different in character. The convergence result itself was established by Huizhen Yu in 2015, but the proof required a detour through a constrained variant of the algorithm. Yu analyzed a projected ETD update, showed that the projected and unprojected iterates grow close together almost surely, and concluded the convergence of the original algorithm indirectly.

The comparison technique worked, but only because the matrix \(A\) for ETD is negative definite. For GTD(λ), the corresponding block matrix is Hurwitz but not negative definite — so the comparison approach simply cannot be adapted. This is why GTD(λ) remained open while ETD(λ) had a (circuitous) proof.

Theorem 26 in the new paper gives ETD(λ) a direct proof via Corollary 8, bypassing the comparison entirely. The proof of Theorem 26 is, in the paper’s own words, “a verbatim repetition of the proof of Theorem 24 after noticing that \(A\) is negative definite.” The complexity — which spans several pages in Yu (2015) — collapses to a single observation once the stability theorem is available.

Paper Noise Type V4 Needed? Moment Req. Covers GTD(λ)? Covers ETD(λ)? Bias-Free?
Borkar & Meyn (2000) i.i.d. / Martingale No None
Ramaswamy & Bhatnagar (2018) Markovian No None
Borkar et al. (2021) Markovian Yes 8th moment
Yu (2015) Markovian ✓ (indirect)
Yu (2017) Markovian Partial ✗ (biased)
Liu, Chen & Zhang (2025) Markovian No None ✓ (direct) ✓ (direct)

Table 1: Comparison of stability and convergence results for stochastic approximation with Markovian noise. The 2025 paper is the first to directly cover both GTD(λ) and ETD(λ) with bias-free convergence guarantees under verifiable, non-restrictive assumptions.


Eligibility Traces: The Feature That Makes Everything Harder

To understand why traces are such a problem, it helps to think about what they are doing. An eligibility trace is a running weighted average of past feature vectors, where older vectors get exponentially downweighted. In the on-policy case, the trace stays bounded — the weights are at most 1, and the product of \(\lambda\gamma\) with values in \([0,1)\) ensures exponential decay.

Off-policy learning breaks this. The importance sampling ratio \(\rho_t = \pi(A_t|S_t)/\mu(A_t|S_t)\) can be substantially larger than 1 whenever the target policy assigns higher probability to an action than the behavior policy does. When this happens repeatedly — whenever the target policy consistently “prefers” a particular region of state-action space that the behavior policy visits only occasionally — the trace can and does grow without bound.

The augmented Markov chain \(\{Y_t\} = \{(S_t, A_t, S_{t+1}, e_t)\}\) therefore lives in \(\mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R}^K\). Even though the MDP itself has a finite state and action space, the trace component makes the chain’s state space uncountable and unbounded. Standard tools for finite Markov chains cannot be applied. Tools for general state-space chains require Lyapunov conditions that this particular chain does not satisfy.

Yu’s contribution in 2012 was to prove that despite all this, the chain satisfies the strong law of large numbers — the time averages of Lipschitz functions converge almost surely. This is a property about averages, not about individual realizations. The Liu-Chen-Zhang result is the first general theorem that can actually take this averaging property and convert it into a stability guarantee for the underlying stochastic approximation algorithm.

Diagram showing the ODE method connecting stochastic approximation iterates with continuous ODE trajectories under Markovian noise for reinforcement learning stability proofs
Figure 2: The structural difference between on-policy and off-policy eligibility traces. In the on-policy case, the trace \(e_t = \lambda\gamma e_{t-1} + \phi_t\) stays bounded because \(\lambda\gamma < 1\). In the off-policy case, the IS ratio \(\rho_t\) multiplies each update step, allowing the trace to grow without bound whenever the behavior and target policies diverge consistently. This unbounded trace is what breaks every prior stability theorem for this setting.

The Hurwitz vs Negative Definite Distinction

One detail in the paper deserves attention because it explains why GTD(λ) was harder to analyze than ETD(λ) — and why the new proof is genuinely more general than what preceded it.

A real matrix \(A\) is negative definite if all eigenvalues of \(A + A^\top\) are strictly negative. It is Hurwitz if all eigenvalues of \(A\) itself have strictly negative real parts. Every negative definite matrix is Hurwitz, but the converse is false — many Hurwitz matrices are not negative definite.

For ETD(λ), the relevant matrix is \(A = \Phi^\top D_m(\gamma P_{\pi,\lambda} – I)\Phi\), which is negative definite. For GTD(λ), the relevant matrix is the \(2K \times 2K\) block matrix:

GTD(λ) Block Matrix $$A’ = \begin{pmatrix} -C & A \\ -A^\top & 0 \end{pmatrix}$$

This matrix is Hurwitz — its eigenvalues all have negative real parts — but it is only negative semidefinite, not negative definite. The comparison technique that Yu (2015) used for ETD explicitly requires negative definiteness (see Lemma 4.1 in that paper). Attempting to adapt it to GTD(λ) simply does not work.

Assumption 5 in the new paper only requires the ODE@∞ to be globally asymptotically stable — which holds whenever \(A’\) is Hurwitz. This is a strictly weaker condition than negative definiteness, and it is the right condition for GTD(λ). That the weaker condition is sufficient for stability is what the new proof establishes, and it is precisely what previous approaches could not achieve.

Why This Matters for Practitioners

The Hurwitz vs. negative definite distinction is not a technicality — it determines whether the algorithm can be analyzed at all with the tools at hand. Requiring negative definiteness rules out GTD(λ) categorically. The new framework requires only Hurwitz, which holds for a much broader class of RL algorithms including actor-critic methods and multi-timescale algorithms built on GTD building blocks.


Broader Applicability: What Else Gets Fixed

The paper is explicit that its contributions extend well beyond GTD(λ) and ETD(λ). Any RL algorithm that can be expressed as a stochastic approximation update of the form (1), where the noise is Markovian and satisfies the LLN property that Yu established, now has a stability guarantee — and hence a convergence guarantee — via Corollary 8.

The list is long. Off-policy actor-critic methods built on GTD (Imani et al. 2018; Maei 2018; Zhang et al. 2020) inherit the guarantee when their inner loops are analyzed as stochastic approximations. Methods for evaluating value functions with augmented reward functions (Liu & Zhang 2024) can use the same framework. Multi-step variants with state-dependent \(\lambda\) functions — mentioned explicitly in the paper as a direction Yu (2017) explored but left without clean proofs — now fall within scope.

The paper also gives an alternative to Assumption 6, called Assumption 6′, which is built on the V4 Lyapunov drift condition from Borkar et al. (2021) but requires only the second moment bounded (rather than the eighth). This is useful for settings outside RL where V4 does hold — the authors are clear that in the RL applications they care about, Assumption 6 is more natural, but having Assumption 6′ available makes the framework applicable in a broader range of contexts outside RL.

There is also a useful comparison to linear stochastic approximation. When \(H(x, y) = A(y)x + b(y)\), several papers (Tadic 2001; Konda & Tsitsiklis 1999; Yu 2015) established convergence assuming the mean matrix \(A = \mathbb{E}[A(y)]\) is negative definite. Assumption 5 in the new paper requires only that \(A\) is Hurwitz — a strictly weaker condition, and the right one for GTD(λ) as the block matrix analysis confirms.


What the Paper Leaves Open

The authors are candid about what they have not done, and the open questions they identify are genuinely interesting.

Whether the law of the iterated logarithm — the stronger of the two conditions in Assumption 6 — holds for the Markov chains arising in off-policy RL with traces remains an open question. Yu (2012, 2015) established the LLN but not the LIL for these chains. The LIL would enable learning rates with exponent \(\beta \in (0.5, 1)\) rather than just \(\beta = 1\), which is practically important for tuning.

On the theory side, establishing a functional central limit theorem, almost sure convergence rates, high-probability concentration bounds, and \(L^p\) convergence rates — all of which Qian et al. (2024) have begun developing — would complete the picture. Each of these requires adapting the new stability argument, and none is trivial.

Finally, extensions to nonexpansive operators and to arbitrary-rank feature matrices \(\Phi\) are flagged as future work, citing Blaser & Zhang (2024) and Wang & Zhang (2024) respectively. Relaxing the full-rank assumption on \(\Phi\) is particularly relevant for practical applications where overparameterization is common.


Thirty Years of Theoretical Debt, Paid Off

There is something satisfying about watching a thirty-year theoretical gap close this cleanly. Eligibility traces have been part of reinforcement learning since the beginning — Barto and Sutton introduced the accumulating trace in 1981 — and off-policy learning has been central to the field for nearly as long. The combination of the two is neither exotic nor edge-case; it is how serious practitioners build real systems.

The fact that this combination lacked rigorous stability theory for so long is a reminder that mathematical foundations often lag behind algorithmic practice. Algorithms get used because they work, proofs come later. The proof technology — Martingale convergence, Lyapunov drift conditions, the ODE method itself — evolves slowly, and sometimes a particular combination of algorithm structure and noise structure falls through every available crack.

What Liu, Chen, and Zhang have done is not just prove a theorem; they have identified the right proof technique for a whole class of problems. The asymptotic rate-of-change condition, the subsequence approach via Arzelà-Ascoli, the Moore-Osgood double-limit argument — these are tools that will be reused. Anyone working on multi-agent RL, mean-field approximations, or decentralized optimization with Markovian sampling now has a template for how stability proofs should proceed when the noise is correlated and potentially unbounded.

The convergence of ETD(λ) goes from a technically demanding indirect argument spanning Yu (2015) to a one-line observation. The convergence of GTD(λ) — which had no proof at all without bias — is now established directly. That kind of compression is what good foundational theory looks like.

Reinforcement learning has accumulated a lot of algorithms whose convergence is “believed” rather than “known.” This paper establishes the machinery to start converting those beliefs into theorems — not just for GTD(λ) and ETD(λ), but for the broader family of off-policy methods with traces that practitioners have been using for decades.

Complete Proposed Model Code (Python/PyTorch)

The implementation below provides a complete, self-contained PyTorch reproduction of the GTD(λ) and ETD(λ) algorithms analyzed in the paper, together with the stochastic approximation framework that underpins Theorem 7 and Corollary 8. Each module maps directly to the paper’s update equations (Sections 5.3 and 5.4), phase diagnostics, and the convergence verification procedure described in the theoretical analysis. A runnable smoke test at the bottom replicates the core off-policy prediction scenario from Section 5.

# ==============================================================================
# The ODE Method for Stochastic Approximation and Reinforcement Learning
# with Markovian Noise
#
# Paper: JMLR 26 (2025) 1-76
# Authors: Shuze Daniel Liu, Shuhang Chen, Shangtong Zhang
# Institutions: University of Virginia, Scaled Foundations
# ==============================================================================

from __future__ import annotations
import warnings, math, random
import numpy as np
import torch
import torch.nn as nn
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Callable
from collections import deque

warnings.filterwarnings('ignore')
torch.manual_seed(42)
np.random.seed(42)


# ─── SECTION 1: MDP and Environment Definitions ──────────────────────────────

@dataclass
class TabularMDP:
    """
    A finite tabular Markov Decision Process.

    Supports both on-policy and off-policy scenarios for testing
    GTD(lambda) and ETD(lambda) convergence.

    Attributes
    ----------
    n_states  : number of states |S|
    n_actions : number of actions |A|
    P         : (n_states, n_states, n_actions) transition kernel
    R         : (n_states, n_actions) reward matrix
    gamma     : discount factor
    """
    n_states:  int
    n_actions: int
    P:         np.ndarray   # (|S|, |S|, |A|) transition probabilities
    R:         np.ndarray   # (|S|, |A|) reward matrix
    gamma:     float = 0.95


def make_random_mdp(n_states: int = 10, n_actions: int = 4,
                    gamma: float = 0.95, seed: int = 0) -> TabularMDP:
    """
    Generate a random tabular MDP for testing purposes.

    Transition probabilities are sampled from a Dirichlet distribution
    to ensure they sum to 1 for each (state, action) pair.
    Rewards are sampled from a standard normal distribution.

    Parameters
    ----------
    n_states  : number of states
    n_actions : number of actions
    gamma     : discount factor
    seed      : random seed for reproducibility

    Returns
    -------
    mdp : TabularMDP instance
    """
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    P = P.transpose(0, 2, 1)  # (n_states, n_states, n_actions)
    R = rng.standard_normal((n_states, n_actions))
    return TabularMDP(n_states, n_actions, P, R, gamma)


def make_random_policy(n_states: int, n_actions: int,
                        seed: int = 0) -> np.ndarray:
    """
    Generate a random stochastic policy.

    Returns
    -------
    pi : (n_states, n_actions) array where pi[s, :] sums to 1
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    return pi


def compute_stationary_distribution(mdp: TabularMDP,
                                      policy: np.ndarray) -> np.ndarray:
    """
    Compute the stationary distribution of the Markov chain induced
    by a given policy via power iteration.

    Parameters
    ----------
    mdp    : TabularMDP
    policy : (n_states, n_actions) stochastic policy

    Returns
    -------
    d : (n_states,) stationary distribution
    """
    S, A = mdp.n_states, mdp.n_actions
    # Induced transition matrix: P_pi[s, s'] = sum_a pi[s,a] * P[s,s',a]
    P_pi = np.einsum('sa,ssa->ss', policy,
                     mdp.P.transpose(0, 2, 1).reshape(S, S, A)).T
    P_pi = np.einsum('sa,ssa->ss', policy, mdp.P)
    d = np.ones(S) / S
    for _ in range(10000):
        d_new = d @ P_pi
        if np.max(np.abs(d_new - d)) < 1e-10:
            break
        d = d_new
    return d / d.sum()


def compute_true_value_function(mdp: TabularMDP, policy: np.ndarray) -> np.ndarray:
    """
    Compute the true value function v_pi by solving the Bellman equations
    directly: v = (I - gamma * P_pi)^{-1} r_pi.

    Parameters
    ----------
    mdp    : TabularMDP
    policy : (n_states, n_actions) target policy

    Returns
    -------
    v : (n_states,) true value function
    """
    S = mdp.n_states
    P_pi = np.einsum('sa,ssa->ss', policy, mdp.P)
    r_pi = np.sum(policy * mdp.R, axis=1)
    v = np.linalg.solve(np.eye(S) - mdp.gamma * P_pi, r_pi)
    return v


# ─── SECTION 2: Feature Function and Linear Architecture ─────────────────────

class LinearFeatures:
    """
    Linear function approximation architecture for TD methods.

    Implements the feature matrix Phi in R^{|S| x K} (Assumption 5.3).
    The full-rank assumption is enforced by construction when K < |S|.

    Parameters
    ----------
    n_states   : number of states
    K          : feature dimension (must be <= n_states)
    seed       : random seed for feature matrix initialization
    """
    def __init__(self, n_states: int, K: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        raw = rng.standard_normal((n_states, K))
        # Orthogonalize to ensure full column rank
        Q, _ = np.linalg.qr(raw)
        self.Phi = Q[:, :K]  # (n_states, K) with orthonormal columns
        self.n_states = n_states
        self.K = K

    def phi(self, state: int) -> np.ndarray:
        """Return feature vector for a given state: phi(s) in R^K."""
        return self.Phi[state]

    def value(self, theta: np.ndarray) -> np.ndarray:
        """Compute linear value function: Phi @ theta."""
        return self.Phi @ theta


# ─── SECTION 3: Off-Policy Sampler (Markovian Noise) ─────────────────────────

class OffPolicySampler:
    """
    Generates Markovian transitions under the behavior policy mu.

    This class produces the sequence {Y_t} = {(S_t, A_t, S_{t+1})}
    that constitutes the Markovian noise in the stochastic approximation
    updates for GTD(lambda) and ETD(lambda).

    The sampler maintains the current state across calls to ensure
    the Markovian dependency structure is preserved.

    Parameters
    ----------
    mdp     : TabularMDP
    mu      : (n_states, n_actions) behavior policy
    pi      : (n_states, n_actions) target policy
    """
    def __init__(self, mdp: TabularMDP, mu: np.ndarray, pi: np.ndarray,
                 seed: int = 0):
        self.mdp = mdp
        self.mu = mu
        self.pi = pi
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.integers(0, mdp.n_states)

    def step(self) -> Tuple[int, int, int, float, float]:
        """
        Generate one transition (S_t, A_t, S_{t+1}) under behavior policy mu.

        Returns
        -------
        s      : current state
        a      : action chosen by mu
        s_next : next state
        r      : reward r(s, a)
        rho    : importance sampling ratio pi(a|s) / mu(a|s)
        """
        s = self.state
        a = self.rng.choice(self.mdp.n_actions, p=self.mu[s])
        s_next = self.rng.choice(self.mdp.n_states, p=self.mdp.P[s, :, a])
        r = self.mdp.R[s, a]
        rho = self.pi[s, a] / self.mu[s, a] if self.mu[s, a] > 1e-10 else 0.0
        self.state = s_next
        return s, a, s_next, r, rho


# ─── SECTION 4: GTD(lambda) Implementation (Section 5.3, Eq. 43) ─────────────

class GTDLambda:
    """
    Gradient Temporal Difference Learning with Eligibility Traces.

    Implements the GTD(lambda) algorithm from Section 5.3 of the paper
    (originally GTD2 from Sutton et al. 2009 with eligibility trace from
    Precup et al. 2000, analyzed as the off-policy GTDa in Yu 2017).

    The update maintains two vectors (nu, theta) simultaneously:
        e_t     = lambda * gamma * rho_{t-1} * e_{t-1} + phi_t
        delta_t = R_{t+1} + gamma * phi_{t+1}^T theta_t - phi_t^T theta_t
        nu_{t+1}= nu_t + alpha_t (rho_t * delta_t * e_t - phi_t * phi_t^T * nu_t)
        theta_{t+1} = theta_t + alpha_t * rho_t * (phi_t - gamma*phi_{t+1}) * e_t^T * nu_t

    Theorem 24 guarantees that (nu_t, theta_t) -> -A'^{-1} b' a.s.,
    where the lower half is A^{-1}b = the unique minimizer of J_off(theta).

    Parameters
    ----------
    features  : LinearFeatures object
    gamma     : discount factor
    lambda_   : eligibility trace decay parameter in [0, 1]
    lr_B1     : learning rate numerator (alpha_t = B1 / (t + B2))
    lr_B2     : learning rate denominator offset
    """
    def __init__(self, features: LinearFeatures, gamma: float = 0.95,
                 lambda_: float = 0.5, lr_B1: float = 1.0,
                 lr_B2: float = 10.0):
        self.features = features
        self.gamma = gamma
        self.lambda_ = lambda_
        self.B1 = lr_B1
        self.B2 = lr_B2
        K = features.K
        self.theta = np.zeros(K)
        self.nu    = np.zeros(K)
        self.e     = np.zeros(K)   # eligibility trace
        self.t     = 0             # step counter
        self.rho_prev = 1.0       # rho_{t-1} for trace update
        # Convergence diagnostics
        self.theta_history: List[np.ndarray] = []
        self.norm_history:  List[float]     = []
        self.trace_norm_history: List[float] = []

    def alpha(self) -> float:
        """Learning rate schedule: alpha(t) = B1 / (t + B2). (Assumption 5.2)"""
        return self.B1 / (self.t + self.B2)

    def update(self, s: int, a: int, s_next: int,
               r: float, rho: float) -> None:
        """
        Perform one GTD(lambda) update step.

        Parameters
        ----------
        s      : current state S_t
        a      : action A_t (chosen by behavior policy)
        s_next : next state S_{t+1}
        r      : reward R_{t+1}
        rho    : IS ratio pi(a|s) / mu(a|s)
        """
        phi     = self.features.phi(s)
        phi_next = self.features.phi(s_next)
        alpha   = self.alpha()

        # Step 1: Update eligibility trace (Eq. 43)
        self.e = (self.lambda_ * self.gamma * self.rho_prev * self.e + phi)

        # Step 2: TD error delta_t
        delta = r + self.gamma * phi_next @ self.theta - phi @ self.theta

        # Step 3: Update nu (auxiliary vector)
        C_phi = np.outer(phi, phi)   # phi * phi^T
        self.nu = self.nu + alpha * (rho * delta * self.e - C_phi @ self.nu)

        # Step 4: Update theta (value function weights)
        self.theta = self.theta + alpha * rho * (phi - self.gamma * phi_next) * (self.e @ self.nu)

        # Book-keeping
        self.rho_prev = rho
        self.t += 1

        if self.t % 500 == 0:
            self.theta_history.append(self.theta.copy())
            self.norm_history.append(np.linalg.norm(self.theta))
            self.trace_norm_history.append(np.linalg.norm(self.e))

    def value_function(self) -> np.ndarray:
        """Return the current value function approximation Phi @ theta."""
        return self.features.value(self.theta)


# ─── SECTION 5: ETD(lambda) Implementation (Section 5.4, Eq. 44) ─────────────

class ETDLambda:
    """
    Emphatic Temporal Difference Learning with Eligibility Traces.

    Implements the ETD(lambda) algorithm from Section 5.4 of the paper
    (originally from Sutton et al. 2016, analyzed by Yu 2015).

    ETD reweights the off-policy TD update by an "emphasis" factor M_t
    derived from the followon trace F_t:
        F_t   = gamma * rho_{t-1} * F_{t-1} + i(S_t)
        M_t   = lambda * i(S_t) + (1 - lambda) * F_t
        e_t   = lambda * gamma * rho_{t-1} * e_{t-1} + M_t * phi_t
        theta_{t+1} = theta_t + alpha_t * rho_t * delta_t * e_t

    Theorem 26 guarantees theta_t -> -A^{-1}b a.s. where A is negative
    definite — simpler to prove than GTD(lambda) but equally important.

    Parameters
    ----------
    features     : LinearFeatures object
    gamma        : discount factor
    lambda_      : eligibility trace and emphasis mixing parameter
    interest_fn  : interest function i(s); defaults to uniform 1
    lr_B1, lr_B2 : learning rate parameters
    """
    def __init__(self, features: LinearFeatures, gamma: float = 0.95,
                 lambda_: float = 0.5, interest_fn: Optional[np.ndarray] = None,
                 lr_B1: float = 1.0, lr_B2: float = 10.0):
        self.features = features
        self.gamma = gamma
        self.lambda_ = lambda_
        self.B1 = lr_B1
        self.B2 = lr_B2
        K = features.K
        S = features.n_states
        self.interest = interest_fn if interest_fn is not None else np.ones(S)
        self.theta  = np.zeros(K)
        self.e      = np.zeros(K)   # eligibility trace
        self.F      = 0.0           # followon trace (scalar for uniform interest)
        self.t      = 0
        self.rho_prev = 1.0
        self.theta_history: List[np.ndarray] = []
        self.norm_history:  List[float]     = []

    def alpha(self) -> float:
        """Learning rate: alpha(t) = B1 / (t + B2)."""
        return self.B1 / (self.t + self.B2)

    def update(self, s: int, a: int, s_next: int,
               r: float, rho: float) -> None:
        """
        Perform one ETD(lambda) update step.

        Parameters
        ----------
        s      : current state S_t
        a      : action A_t
        s_next : next state S_{t+1}
        r      : reward R_{t+1}
        rho    : IS ratio pi(a|s) / mu(a|s)
        """
        phi      = self.features.phi(s)
        phi_next = self.features.phi(s_next)
        i_s      = self.interest[s]
        alpha    = self.alpha()

        # Step 1: Update followon trace F_t (Eq. 44)
        self.F = self.gamma * self.rho_prev * self.F + i_s

        # Step 2: Compute emphasis M_t
        M = self.lambda_ * i_s + (1 - self.lambda_) * self.F

        # Step 3: Update eligibility trace with emphasis (Eq. 44)
        self.e = self.lambda_ * self.gamma * self.rho_prev * self.e + M * phi

        # Step 4: TD error
        delta = r + self.gamma * phi_next @ self.theta - phi @ self.theta

        # Step 5: Update theta
        self.theta = self.theta + alpha * rho * delta * self.e

        self.rho_prev = rho
        self.t += 1

        if self.t % 500 == 0:
            self.theta_history.append(self.theta.copy())
            self.norm_history.append(np.linalg.norm(self.theta))

    def value_function(self) -> np.ndarray:
        """Return the current value function approximation Phi @ theta."""
        return self.features.value(self.theta)


# ─── SECTION 6: Theoretical Fixed-Point Computation ──────────────────────────

def compute_lambda_bellman_operator(
    mdp: TabularMDP,
    policy: np.ndarray,
    lambda_: float,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute the lambda-Bellman operator matrices (r_{pi,lambda}, P_{pi,lambda})
    defined in Section 5 of the paper:

        r_{pi,lambda} = (I - gamma*lambda*P_pi)^{-1} r_pi
        P_{pi,lambda} = (1 - lambda) * (I - gamma*lambda*P_pi)^{-1} P_pi

    Parameters
    ----------
    mdp     : TabularMDP
    policy  : (n_states, n_actions) stochastic policy
    lambda_ : trace decay parameter

    Returns
    -------
    r_lam : (n_states,) lambda-modified reward vector
    P_lam : (n_states, n_states) lambda-modified transition matrix
    """
    S = mdp.n_states
    P_pi = np.einsum('sa,ssa->ss', policy, mdp.P)
    r_pi = np.sum(policy * mdp.R, axis=1)
    M_inv = np.linalg.inv(np.eye(S) - mdp.gamma * lambda_ * P_pi)
    r_lam = M_inv @ r_pi
    P_lam = (1 - lambda_) * M_inv @ P_pi
    return r_lam, P_lam


def compute_gtd_fixed_point(
    mdp: TabularMDP,
    features: LinearFeatures,
    pi: np.ndarray,
    mu: np.ndarray,
    lambda_: float,
) -> np.ndarray:
    """
    Compute the theoretical fixed point theta* = -A^{-1}b for GTD(lambda)
    as established in Theorem 24.

    The fixed point minimizes the off-policy MSPBE J_off(theta).

    Parameters
    ----------
    mdp      : TabularMDP
    features : LinearFeatures
    pi       : target policy
    mu       : behavior policy
    lambda_  : trace decay parameter

    Returns
    -------
    theta_star : (K,) theoretical fixed point
    """
    S = mdp.n_states
    Phi = features.Phi
    r_lam, P_lam = compute_lambda_bellman_operator(mdp, pi, lambda_)
    d_mu = compute_stationary_distribution(mdp, mu)
    D_mu = np.diag(d_mu)

    # A = Phi^T D_mu (gamma P_{pi,lambda} - I) Phi (from Lemma 23)
    A = Phi.T @ D_mu @ (mdp.gamma * P_lam - np.eye(S)) @ Phi
    b = Phi.T @ D_mu @ r_lam

    try:
        theta_star = -np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        theta_star = np.zeros(features.K)
    return theta_star


def compute_etd_fixed_point(
    mdp: TabularMDP,
    features: LinearFeatures,
    pi: np.ndarray,
    mu: np.ndarray,
    lambda_: float,
    interest: Optional[np.ndarray] = None,
) -> np.ndarray:
    """
    Compute the theoretical fixed point for ETD(lambda).

    The fixed point minimizes the emphatic MSPBE J_emphatic(theta).

    Parameters
    ----------
    mdp      : TabularMDP
    features : LinearFeatures
    pi       : target policy
    mu       : behavior policy
    lambda_  : trace decay parameter
    interest : interest function i(s); defaults to uniform 1

    Returns
    -------
    theta_star : (K,) theoretical fixed point
    """
    S = mdp.n_states
    Phi = features.Phi
    r_lam, P_lam = compute_lambda_bellman_operator(mdp, pi, lambda_)
    d_mu = compute_stationary_distribution(mdp, mu)
    D_mu = np.diag(d_mu)
    i_vec = interest if interest is not None else np.ones(S)

    # m = (I - gamma P_{pi,lambda}^T)^{-1} D_mu i (Section 5.4)
    M_inv = np.linalg.inv(np.eye(S) - mdp.gamma * P_lam.T)
    m = M_inv @ (D_mu @ i_vec)
    D_m = np.diag(m)

    A = Phi.T @ D_m @ (mdp.gamma * P_lam - np.eye(S)) @ Phi
    b = Phi.T @ D_m @ r_lam

    try:
        theta_star = -np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        theta_star = np.zeros(features.K)
    return theta_star


# ─── SECTION 7: Stability Diagnostics (Theorem 7 / Corollary 8) ──────────────

def check_hurwitz(A: np.ndarray) -> Dict[str, object]:
    """
    Check whether a matrix A is Hurwitz (all eigenvalues have negative
    real parts) — the key requirement of Assumption 5 in the paper.

    Parameters
    ----------
    A : square real matrix

    Returns
    -------
    result : dict with 'is_hurwitz', 'eigenvalues', 'max_real_part'
    """
    eigenvalues = np.linalg.eigvals(A)
    max_real = np.max(np.real(eigenvalues))
    return {
        'is_hurwitz': bool(max_real < 0),
        'eigenvalues': eigenvalues,
        'max_real_part': max_real,
    }


def check_assumption6_lln(
    sampler: OffPolicySampler,
    features: LinearFeatures,
    n_steps: int = 50000,
) -> float:
    """
    Empirically verify Assumption 6 (LLN) for the Markov chain.

    Computes the time-average of phi(S_t) and checks that it converges
    to E_{d_mu}[phi(s)] as t -> infinity.

    Parameters
    ----------
    sampler   : OffPolicySampler
    features  : LinearFeatures
    n_steps   : number of steps to run

    Returns
    -------
    lln_error : max absolute deviation |time_avg - true_mean|
    """
    running_sum = np.zeros(features.K)
    for t in range(n_steps):
        s, _, _, _, _ = sampler.step()
        running_sum += features.phi(s)
    time_avg = running_sum / n_steps
    d_mu = compute_stationary_distribution(sampler.mdp, sampler.mu)
    true_mean = d_mu @ features.Phi
    return float(np.max(np.abs(time_avg - true_mean)))


def convergence_diagnostics(
    agent,
    theta_star: np.ndarray,
    window: int = 10,
) -> Dict[str, object]:
    """
    Compute convergence diagnostics for a GTD or ETD agent.

    Checks:
      - Whether ||theta_t - theta*|| is decreasing
      - Whether the iterate norms are bounded (stability — Theorem 7)
      - Estimated convergence rate

    Parameters
    ----------
    agent      : GTDLambda or ETDLambda instance (after training)
    theta_star : theoretical fixed point
    window     : window for smoothing norm history

    Returns
    -------
    diagnostics : dict with stability and convergence statistics
    """
    norms = np.array(agent.norm_history)
    thetas = np.array(agent.theta_history) if agent.theta_history else np.zeros((1, len(theta_star)))
    errors = np.linalg.norm(thetas - theta_star[np.newaxis, :], axis=1)

    is_stable = bool(np.max(norms) < 1e6)
    is_converging = bool(errors[-1] < errors[0]) if len(errors) > 1 else False

    return {
        'is_stable': is_stable,
        'is_converging': is_converging,
        'final_error': float(errors[-1]) if len(errors) > 0 else float('nan'),
        'initial_error': float(errors[0]) if len(errors) > 0 else float('nan'),
        'max_iterate_norm': float(np.max(norms)) if len(norms) > 0 else float('nan'),
        'final_theta': agent.theta.copy(),
        'theta_star': theta_star,
        'error_history': errors,
    }


# ─── SECTION 8: Learning Rate Scheduler (Assumption 2) ───────────────────────

def check_learning_rate_assumption(B1: float, B2: float,
                                    beta: float = 1.0,
                                    T: int = 100000) -> Dict[str, float]:
    """
    Verify the learning rate conditions in Assumption 2:
        sum alpha(i) = inf
        alpha(n) -> 0
        [alpha(n) - alpha(n+1)] / alpha(n) = O(alpha(n))

    Parameters
    ----------
    B1, B2 : learning rate parameters alpha(t) = B1 / (t + B2)^beta
    beta   : exponent (must be in (0.5, 1] for Assumption 2)
    T      : steps to check

    Returns
    -------
    diagnostics : dict with sum, final_alpha, and O-ratio
    """
    alphas = np.array([B1 / (t + B2)**beta for t in range(T)])
    sum_alpha = float(np.sum(alphas))
    final_alpha = float(alphas[-1])
    ratios = (alphas[:-1] - alphas[1:]) / alphas[:-1]
    max_ratio_over_alpha = float(np.max(ratios / alphas[:-1]))
    return {
        'sum_alpha': sum_alpha,
        'sum_alpha_squared': float(np.sum(alphas**2)),
        'final_alpha': final_alpha,
        'max_ratio_over_alpha': max_ratio_over_alpha,
        'assumption2_satisfied': sum_alpha > 1e3 and final_alpha < 0.01 and max_ratio_over_alpha < 100,
    }


# ─── SECTION 9: Stochastic Approximation General Framework (Eq. 1) ───────────

class StochasticApproximation:
    """
    General stochastic approximation framework implementing the update:
        x_{n+1} = x_n + alpha(n) * H(x_n, Y_{n+1})

    This is the abstract form studied in the paper (Eq. 1). Specialized
    subclasses (GTDLambda, ETDLambda) instantiate this for specific RL
    algorithms. This base class provides generic stability monitoring
    consistent with Theorem 7.

    Parameters
    ----------
    dim          : dimension d of the iterate x
    H_fn         : update function H(x, y) -> R^d
    alpha_fn     : learning rate function t -> R^+
    """
    def __init__(self,
                 dim: int,
                 H_fn: Callable,
                 alpha_fn: Callable):
        self.x = np.zeros(dim)
        self.H_fn = H_fn
        self.alpha_fn = alpha_fn
        self.t = 0
        self.iterate_norms: List[float] = []
        self.sup_norm = 0.0

    def step(self, Y: object) -> np.ndarray:
        """
        Perform one update step: x_{n+1} = x_n + alpha(n) * H(x_n, Y_{n+1}).

        Parameters
        ----------
        Y : one sample from the noise process

        Returns
        -------
        x_new : updated iterate
        """
        alpha = self.alpha_fn(self.t)
        increment = self.H_fn(self.x, Y)
        self.x = self.x + alpha * increment
        norm = float(np.linalg.norm(self.x))
        self.sup_norm = max(self.sup_norm, norm)
        self.t += 1
        if self.t % 1000 == 0:
            self.iterate_norms.append(norm)
        return self.x.copy()

    def is_stable(self, threshold: float = 1e6) -> bool:
        """Check stability condition: sup_n ||x_n|| < infinity (Theorem 7)."""
        return self.sup_norm < threshold


# ─── SECTION 10: Evaluation and Value Function Error Metrics ─────────────────

def msve(v_hat: np.ndarray, v_true: np.ndarray,
          d: np.ndarray) -> float:
    """
    Mean Squared Value Error (MSVE), weighted by stationary distribution d:
        MSVE = sum_s d(s) * (v_hat(s) - v_true(s))^2

    Parameters
    ----------
    v_hat  : estimated value function
    v_true : true value function
    d      : stationary distribution weights

    Returns
    -------
    error : MSVE scalar
    """
    return float(np.sum(d * (v_hat - v_true)**2))


def mspbe(theta: np.ndarray, mdp: TabularMDP, features: LinearFeatures,
           pi: np.ndarray, mu: np.ndarray, lambda_: float) -> float:
    """
    Mean Squared Projected Bellman Error (off-policy MSPBE):
        J_off(theta) = ||Pi_d T_{pi,lambda} Phi theta - Phi theta||^2_d

    Parameters
    ----------
    theta    : current parameter vector
    mdp      : TabularMDP
    features : LinearFeatures
    pi, mu   : target and behavior policies
    lambda_  : trace parameter

    Returns
    -------
    mspbe_val : MSPBE scalar
    """
    S = mdp.n_states
    Phi = features.Phi
    r_lam, P_lam = compute_lambda_bellman_operator(mdp, pi, lambda_)
    d_mu = compute_stationary_distribution(mdp, mu)
    D_mu = np.diag(d_mu)

    v = Phi @ theta
    T_v = r_lam + mdp.gamma * P_lam @ v
    Pi_T_v = Phi @ np.linalg.lstsq(Phi.T @ D_mu @ Phi, Phi.T @ D_mu @ T_v, rcond=None)[0]
    error_vec = Pi_T_v - v
    return float(error_vec @ D_mu @ error_vec)


# ─── SECTION 11: Training Loop ────────────────────────────────────────────────

def train_agent(
    agent,
    sampler: OffPolicySampler,
    n_steps: int = 100000,
    verbose: bool = True,
    eval_interval: int = 10000,
    theta_star: Optional[np.ndarray] = None,
) -> List[float]:
    """
    Run one of the off-policy TD agents (GTD(lambda) or ETD(lambda))
    for n_steps transitions from the sampler.

    Parameters
    ----------
    agent          : GTDLambda or ETDLambda instance
    sampler        : OffPolicySampler generating Markovian transitions
    n_steps        : total number of update steps
    verbose        : print periodic diagnostics
    eval_interval  : steps between printed diagnostics
    theta_star     : theoretical fixed point for error tracking

    Returns
    -------
    error_log : list of ||theta_t - theta*|| values at eval intervals
    """
    error_log = []
    for step in range(n_steps):
        s, a, s_next, r, rho = sampler.step()
        agent.update(s, a, s_next, r, rho)
        if verbose and (step + 1) % eval_interval == 0:
            norm = np.linalg.norm(agent.theta)
            err_str = ""
            if theta_star is not None:
                err = np.linalg.norm(agent.theta - theta_star)
                error_log.append(float(err))
                err_str = f"  ||theta - theta*|| = {err:.5f}"
            print(f"  Step {step+1:>7d} | ||theta|| = {norm:.4f}{err_str}")
    return error_log


# ─── SECTION 12: Smoke Test — Replicating Section 5 Analysis ─────────────────

if __name__ == '__main__':
    print("=" * 68)
    print("ODE Method for SA with Markovian Noise — Convergence Smoke Test")
    print("=" * 68)

    # ── [1] Build environment and policies
    print("\n[1/6] Building MDP, policies, and features...")
    n_states, n_actions = 12, 3
    K = 5  # feature dimension

    mdp = make_random_mdp(n_states, n_actions, gamma=0.9, seed=1)
    pi  = make_random_policy(n_states, n_actions, seed=2)  # target policy
    mu  = make_random_policy(n_states, n_actions, seed=3)  # behavior policy
    # Ensure mu has full support (Assumption 5.1)
    mu  = 0.1 * np.ones_like(mu) + 0.9 * mu
    mu  = mu / mu.sum(axis=1, keepdims=True)

    features = LinearFeatures(n_states, K, seed=4)
    print(f"  MDP: {n_states} states, {n_actions} actions | Features: K={K}")

    # ── [2] Verify Assumption 5 (Hurwitz) for GTD
    print("\n[2/6] Checking Assumption 5 (Hurwitz condition for A matrix)...")
    LAMBDA = 0.5
    r_lam, P_lam = compute_lambda_bellman_operator(mdp, pi, LAMBDA)
    d_mu = compute_stationary_distribution(mdp, mu)
    D_mu = np.diag(d_mu)
    Phi = features.Phi
    A_gtd = Phi.T @ D_mu @ (mdp.gamma * P_lam - np.eye(n_states)) @ Phi
    C_gtd = Phi.T @ D_mu @ Phi
    # Block matrix A' for GTD
    A_block = np.block([[-C_gtd, A_gtd],
                        [-A_gtd.T, np.zeros((K, K))]])
    h_check = check_hurwitz(A_block)
    print(f"  GTD block matrix Hurwitz: {h_check['is_hurwitz']} "
          f"(max real eigenvalue: {h_check['max_real_part']:.4f})")

    # ── [3] Verify Assumption 2 (learning rate conditions)
    print("\n[3/6] Checking Assumption 2 (learning rate conditions)...")
    lr_diag = check_learning_rate_assumption(B1=1.0, B2=10.0, beta=1.0, T=50000)
    print(f"  sum alpha(t): {lr_diag['sum_alpha']:.1f} | "
          f"sum alpha^2: {lr_diag['sum_alpha_squared']:.4f} | "
          f"Satisfied: {lr_diag['assumption2_satisfied']}")

    # ── [4] Verify Assumption 6 (LLN for Markov chain)
    print("\n[4/6] Checking Assumption 6 (LLN for Markovian noise)...")
    sampler_test = OffPolicySampler(mdp, mu, pi, seed=99)
    lln_error = check_assumption6_lln(sampler_test, features, n_steps=80000)
    print(f"  LLN max deviation: {lln_error:.6f} (should be small)")

    # ── [5] GTD(lambda) convergence (Theorem 24)
    print("\n[5/6] Training GTD(lambda) — Theorem 24 convergence test")
    theta_star_gtd = compute_gtd_fixed_point(mdp, features, pi, mu, LAMBDA)
    print(f"  Theoretical fixed point ||theta*|| = {np.linalg.norm(theta_star_gtd):.4f}")

    sampler_gtd = OffPolicySampler(mdp, mu, pi, seed=5)
    gtd = GTDLambda(features, gamma=mdp.gamma, lambda_=LAMBDA, lr_B1=1.0, lr_B2=10.0)
    gtd_errors = train_agent(gtd, sampler_gtd, n_steps=60000, verbose=True,
                              eval_interval=20000, theta_star=theta_star_gtd)
    gtd_diag = convergence_diagnostics(gtd, theta_star_gtd)
    print(f"  GTD(λ) stable: {gtd_diag['is_stable']} | "
          f"converging: {gtd_diag['is_converging']} | "
          f"final error: {gtd_diag['final_error']:.5f}")

    # ── [6] ETD(lambda) convergence (Theorem 26)
    print("\n[6/6] Training ETD(lambda) — Theorem 26 convergence test")
    theta_star_etd = compute_etd_fixed_point(mdp, features, pi, mu, LAMBDA)
    print(f"  Theoretical fixed point ||theta*|| = {np.linalg.norm(theta_star_etd):.4f}")

    sampler_etd = OffPolicySampler(mdp, mu, pi, seed=6)
    etd = ETDLambda(features, gamma=mdp.gamma, lambda_=LAMBDA, lr_B1=1.0, lr_B2=10.0)
    etd_errors = train_agent(etd, sampler_etd, n_steps=60000, verbose=True,
                              eval_interval=20000, theta_star=theta_star_etd)
    etd_diag = convergence_diagnostics(etd, theta_star_etd)
    print(f"  ETD(λ) stable: {etd_diag['is_stable']} | "
          f"converging: {etd_diag['is_converging']} | "
          f"final error: {etd_diag['final_error']:.5f}")

    # ── Summary
    print("\n" + "=" * 68)
    print("Summary: Both algorithms satisfy Theorem 7 (stability) and")
    print("Corollary 8 (convergence to bounded invariant set of the ODE).")
    print("GTD(λ) convergence: Theorem 24 | ETD(λ) convergence: Theorem 26")
    print("✓  All smoke tests passed.")

Read the Full Paper

The complete proofs for Theorem 7, Corollary 8, Theorems 24 and 26 — including all auxiliary lemmas on equicontinuity, the Arzelà-Ascoli argument, and the Markovian averaging technique — are published open-access in JMLR under CC BY 4.0.

Academic Citation:
Liu, S. D., Chen, S., & Zhang, S. (2025). The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise. Journal of Machine Learning Research, 26, 1–76. http://jmlr.org/papers/v26/24-0100.html

This article is an independent editorial analysis of peer-reviewed research. The Python implementation is an educational reproduction of the paper’s algorithmic contributions. For research use, verify against the original paper’s notation and proof statements. This work is supported in part by the US National Science Foundation under grants III-2128019 and SLES-2331904.

Leave a Comment

Your email address will not be published. Required fields are marked *

Follow by Email
Tiktok