MetaClaw: The LLM Agent That Meta-Learns and Evolves in the Wild | AI Trend Blend

LLM Agents · Continual Learning · UNC-Chapel Hill · CMU · UC Santa Cruz · UC Berkeley (2026) · 25 min read

MetaClaw: The LLM Agent That Meta-Learns and Evolves in the Wild — Simply by Being Used

Researchers from UNC-Chapel Hill, Carnegie Mellon, UC Santa Cruz, and UC Berkeley built a continual meta-learning framework that gives deployed language model agents two complementary superpowers: instant skill injection from failures, and opportunistic weight-level fine-tuning while you sleep — pushing Kimi-K2.5 from 21.4% to 40.6% accuracy with an 8.25× end-to-end task completion gain.

MetaClaw Continual Meta-Learning LLM Agents Skill-Driven Adaptation LoRA Fine-Tuning GRPO OMLS OpenClaw AutoResearchClaw

Most AI agents you use today are frozen in time. They were trained once, deployed, and then served unchanged no matter how much your workflow shifts. The model helping you parse JSON files on Monday has no idea you switched to multi-agent messaging pipelines by Friday — and it never will, unless someone retrains it from scratch. MetaClaw is a direct challenge to that assumption. Built by a team spanning four universities, it asks a deceptively simple question: what if an agent could genuinely get better the more you use it, without ever going offline?

Why Static Agents Fail in the Real World

Picture an agent platform like OpenClaw, which connects to 20-plus messaging channels and handles everything from shell scripting to multi-step file operations. The tasks a user throws at it on day one of deployment are not the same tasks they throw at it on day forty. Workloads shift. Conventions change. Edge cases accumulate. A frozen model becomes progressively misaligned with actual usage — it keeps failing on task types that weren’t well-represented when it was pretrained, and it has no mechanism to notice or correct this.

The three existing categories of agent adaptation each address only a slice of this problem. Memory-based systems store raw conversation trajectories, but raw logs are verbose and redundant — they don’t extract the transferable behavioral patterns hidden inside failures. Skill-based systems compress experience into reusable instructions, but treat the skill library as a static artifact, never coordinating it with weight-level optimization. RL-based methods update model weights, but operate in offline, small-scale settings and ignore a subtle but critical problem: once skills change, older trajectories carry stale reward signals that contaminate gradient updates.

The Core Problem

Every existing approach addresses adaptation in isolation. Memory methods don’t connect to weights. Skill methods don’t connect to RL. RL methods don’t know which data is still valid. MetaClaw is the first framework to unify all three dimensions into a single coherent system — and to do it without ever taking the agent offline.

The MetaClaw Framework: Two Timescales, One System

MetaClaw’s key insight is that agent adaptation naturally operates at two distinct timescales that are not just compatible but mutually reinforcing. Behavioral heuristics — things like “always create a .bak file before modifying anything” or “use ISO 8601 with timezone offsets for all timestamps” — can be distilled from a single failed conversation in seconds and injected immediately into the prompt. Improving the model’s underlying policy across diverse task types requires gradient-based optimization over many trajectories, taking minutes to hours. MetaClaw exploits both timescales simultaneously.

The framework’s core object is the meta-model:

Eq. 1 — Meta-Model M = (θ, S)

Here θ denotes the parameters of the base LLM policy and S = {s₁, s₂, …, sK} is a library of skill instructions — concise, reusable behavioral directives injected into the system prompt at inference time. Given a task τ, the agent acts according to:

Eq. 2 — Agent Action a ~ πθ(· | τ, Retrieve(S, τ))

The Retrieve function uses embedding-based cosine similarity to pull the most relevant skills for each incoming task. The meta-model evolves continuously as the agent accumulates experience — but the two components of M evolve through separate mechanisms on different clocks.

Mechanism 1: Skill-Driven Fast Adaptation

When a trajectory fails, it joins a support set. Once enough failures accumulate, a skill evolver — itself a language model — analyzes the failures and synthesizes new behavioral instructions. The skill library updates immediately:

Eq. 3 — Skill Evolution S_g+1 = S_g ∪ E(S_g, D^sup_g)

The subscript g is the skill generation index — every time the library changes, g increments. This step modifies only S, leaving θ untouched. Because the new skills take effect through the prompt rather than model weights, fast adaptation incurs zero service downtime. The agent serves the next request with updated behavioral context before the current request is even logged.

There is a design principle worth calling out here: this mechanism is gradient-free by design, not by approximation. The skill library lives in a discrete natural-language space where gradient descent is ill-defined. LLM-based failure analysis is the natural and correct tool for this space.

Mechanism 2: Opportunistic Policy Optimization

While skills evolve continuously, weight-level policy optimization happens on a much slower clock. The RL buffer B accumulates post-adaptation trajectories — those collected after the latest skill generation has taken effect. When a training window opens, the policy updates via cloud LoRA fine-tuning using GRPO:

Eq. 4 — Policy Optimization θ_t+1 = θ_t + α ∇_θ E_{(τ,ξ,g’)~B}[R(π_θ(· | τ, S_g’))]

R here is a process reward model (PRM) score. The key phrase is “post-adaptation trajectories” — and this distinction turns out to be critical enough that the authors built an entire versioning mechanism around enforcing it.

The Virtuous Cycle

Skills and weights improve each other. Skills distilled from failures make the agent smarter immediately; those smarter behaviors produce better training data for weight updates; better weights make subsequent failures more informative for skill distillation. MetaClaw is the first framework to make this cycle explicit and design around it.

Skill Generation Versioning: Solving the Stale Reward Problem

This is the part of the paper that most practitioners will appreciate immediately, because the problem it solves is subtle enough to be missed and serious enough to cause training pathologies if ignored.

Suppose a trajectory fails, triggering skill evolution from generation g to g+1. That trajectory carries a reward signal reflecting performance under S_g — before the new skill existed. If this trajectory enters the RL buffer, policy optimization receives a gradient that penalizes θ for a failure that skill-driven adaptation has already corrected. The model gets punished for something it no longer does wrong. This is stale reward contamination.

MetaClaw enforces the support-query separation with a simple but airtight mechanism: every collected sample is stamped with its skill generation index g. Trajectories that trigger skill evolution are marked as support data and discarded from the RL buffer. Only trajectories collected after the new skills have taken effect — query data — are eligible for gradient updates. When g increments, all samples with generation index ≤ g are flushed from the training buffer.

TASK ARRIVES
     │
     ▼
Retrieve(S_g, τ)  →  Execute  →  Collect trajectory ξ_i
                                        │
                         Stamp with skill generation index g
                                        │
                    ┌───────────────────┴──────────────────┐
                    │  ξ_i is a FAILURE?                   │
                   YES                                     NO
                    │                                       │
           Add to D^sup_g                       Add to RL buffer B
           (support data)                        (query data, gen=g)
                    │
          |D^sup_g| >= threshold?
                   YES
                    │
          E(S_g, D^sup_g) → Synthesize new skills ΔS
          S_{g+1} = S_g ∪ ΔS
          Flush all samples with gen ≤ g from B  ← versioning enforcement
          g ← g + 1
                    │
OMLS detects idle + |B| ≥ batch_size?
          YES  →  GRPO RL update on θ  →  Cloud LoRA hot-swap

The Opportunistic Meta-Learning Scheduler (OMLS)

Policy optimization requires a model weight hot-swap upon completion, which briefly interrupts inference. In a deployed interactive system serving a real user, this is a genuine problem. MetaClaw’s solution is the Opportunistic Meta-Learning Scheduler (OMLS), a background daemon that defers weight updates to periods when the user is not actively engaging with the agent.

OMLS monitors three complementary idle signals:

Sleep window. The user configures a sleep schedule — for example, 23:00 to 07:00. During this window the system is guaranteed idle, providing the largest contiguous training block available each day.

System inactivity. OMLS polls the OS input device idle timer (HIDIdleTime on macOS, xprintidle on Linux). If no keyboard or mouse activity is detected for a configurable threshold — 30 minutes by default — a training window opens. The trainer supports mid-batch checkpointing, so it pauses gracefully the moment input resumes.

Calendar-aware scheduling. OMLS queries the Google Calendar API. When the current time falls inside a scheduled meeting, the user is presumed unavailable, opening an opportunistic window proactively. This is the most anticipatory of the three signals: it uses the user’s own schedule to predict idle periods before they happen.

“A better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. No existing system unifies these two forms of adaptation into a coherent framework that exploits this virtuous cycle.” — Xia, Chen, Yang, Tu et al., arXiv:2603.17187 (2026)

MetaClaw-Bench: Evaluation Designed for Continual Adaptation

Existing agent benchmarks treat tasks as independent episodes. MetaClaw-Bench is different by design: it structures evaluation as multi-workday simulations where the agent must improve from accumulated experience, not just solve individual tasks. The benchmark comprises 934 questions across 44 simulated workdays.

Part I is a 30-workday simulation with 346 questions covering real CLI tasks — file editing, JSON structuring, shell scripting. Workspace state persists across rounds within each day. Task difficulty increases monotonically with day index, with the final week requiring sophisticated multi-step reasoning.

Part II extends to 14 workdays with 588 questions and introduces five implicit preference rules that the agent must infer from feedback and internalize through skill evolution or RL training. None of the rules are stated in the system prompt.

Rule	Category	Requirement	Active From
P1	Timestamp	YYYY-MM-DDTHH:MM:SS+08:00 format	Day 01
P2	File naming	YYYYMMDD_description.ext (snake case)	Day 04
P3	Metadata	Every file: created_at, author, status	Day 06
P4	Backup	Create <file>.bak before modifying	Day 08
P5	Completion log	Append [DONE] \| timestamp \| task_id to done.log	Day 10

Table: Five implicit preference rules in Part II. The agent discovers these through failure feedback and skill distillation — not from explicit instruction.

Results: Closing the Gap With GPT-5.2 Using a Smaller Model

The headline result is striking: MetaClaw (Full) with Kimi-K2.5 reaches 40.6% accuracy on Part I — nearly matching GPT-5.2’s 41.1% baseline, despite Kimi-K2.5 starting from just 21.4%. End-to-end task completion jumps 8.25 times, from 2.0% to 16.5%.

Model	Condition	Part I Acc.	Part I Compl.	Part II Acc.	Part II Compl.
GPT-5.2	Baseline	41.1%	14.7%	44.9%	58.4%
GPT-5.2	MetaClaw (Skills)	44.0%	17.1%	49.1%	67.5%
Kimi-K2.5	Baseline	21.4%	2.0%	21.1%	18.2%
Kimi-K2.5	MetaClaw (Skills)	28.3%	2.0%	26.9%	33.8%
Kimi-K2.5	MetaClaw (Full)	40.6%	16.5%	39.6%	51.9%

Table 1: MetaClaw-Bench results. MetaClaw (Full) evaluated for Kimi-K2.5 only. Bold = best per metric.

AutoResearchClaw: Cross-Domain Generalization

AutoResearchClaw is a 23-stage autonomous research pipeline that takes a research idea and produces a conference-ready paper. MetaClaw’s skill injection was deployed into the pipeline executor with no domain-specific tuning, and the results transferred directly — demonstrating that behavioral heuristics distilled from CLI task failures generalize to open-ended research automation.

Metric	Baseline	+ MetaClaw (Skills)	Change
Stage retry rate ↓	10.5%	7.9%	↓ 24.8%
Refine cycle count ↓	2.0	1.2	↓ 40.0%
Pipeline stage completion ↑	18/19	19/19	↑ 5.3%
Composite robustness score ↑	0.714	0.845	↑ 18.3%

Table 2: AutoResearchClaw results. Skills-only adaptation, no gradient updates — all gains from prompt-level skill injection.

Practical Takeaway

Expect skill-driven gains to appear immediately — within the first session. Expect weight-level gains after roughly a week of usage, once the RL buffer has accumulated enough post-adaptation trajectories. The two mechanisms are additive, not redundant, and each has its own payoff horizon. For deployment-constrained scenarios, skills alone deliver substantial value at zero compute cost.

Complete End-to-End MetaClaw Implementation (Python)

The implementation below is a complete, self-contained Python translation of MetaClaw, structured in 11 sections that map directly to the paper. It covers every component: the meta-model (θ, S), skill-driven fast adaptation with an LLM evolver, embedding-based retrieval, opportunistic policy optimization with GRPO, the OMLS idle scheduler with all three signals, skill generation versioning with stale-data flushing, MetaClaw-Bench dataset helpers, the full training loop following Algorithm 1, and a smoke test validating all components end-to-end without requiring real API keys or GPU hardware.

# ===========================================================================
# MetaClaw: Continual Meta-Learning for Deployed LLM Agents
# Paper: arXiv:2603.17187v1
# Authors: Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu et al.
# Institutions: UNC-Chapel Hill, CMU, UC Santa Cruz, UC Berkeley
# ===========================================================================
# Sections:
#  1.  Imports & Configuration
#  2.  Skill Library (S) — storage, retrieval, versioning
#  3.  Skill Evolver — LLM-based failure analysis & skill synthesis
#  4.  Trajectory & RL Buffer — versioned storage with stale-data flushing
#  5.  Process Reward Model (PRM)
#  6.  Cloud LoRA Fine-Tuner (GRPO)
#  7.  Opportunistic Meta-Learning Scheduler (OMLS)
#  8.  Agent Executor — task execution with skill injection
#  9.  MetaClaw Core — Algorithm 1 implementation
# 10.  MetaClaw-Bench Dataset Helpers
# 11.  Smoke Test
# ===========================================================================

from __future__ import annotations
import json, math, os, random, time, threading, hashlib, logging
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

logging.basicConfig(level=logging.INFO,
    format="[%(asctime)s] %(levelname)s  %(message)s", datefmt="%H:%M:%S")
log = logging.getLogger("metaclaw")


# ── SECTION 1: Configuration ─────────────────────────────────────────────────

class DataRole(Enum):
    SUPPORT = "support"   # failure trajectories driving skill evolution
    QUERY   = "query"     # post-adaptation trajectories valid for RL


@dataclass
class MetaClawConfig:
    """All hyper-parameters for MetaClaw in one place."""
    embed_dim: int = 384
    top_k_skills: int = 3
    skill_failure_threshold: int = 5    # failures before evolver fires
    max_new_skills_per_run: int = 3
    rl_batch_size: int = 16
    rl_lr: float = 5e-5
    lora_rank: int = 16
    sleep_start_hour: int = 23          # OMLS sleep window
    sleep_end_hour: int = 7
    idle_threshold_secs: float = 1800.0 # 30 min keyboard inactivity
    skill_library_path: str = "metaclaw_skills.json"
    buffer_max_size: int = 512


# ── SECTION 2: Skill Library ──────────────────────────────────────────────────

@dataclass
class Skill:
    """A single reusable behavioral directive — the atom of S."""
    name: str
    description: str        # one-sentence trigger & purpose
    content: str            # full Markdown instruction block (6-15 lines)
    category: str
    generation: int = 0     # skill generation when created
    embedding: Optional[List[float]] = field(default=None, repr=False)

    def to_dict(self) -> Dict:
        return {"name": self.name, "description": self.description,
                "content": self.content, "category": self.category,
                "generation": self.generation}

    @classmethod
    def from_dict(cls, d: Dict) -> "Skill":
        return cls(**{k: v for k, v in d.items() if k != "embedding"})


class SkillLibrary:
    """
    S = {s1, ..., sK} — the evolving library of behavioral instructions.

    Dual role (Section 3.2):
      As meta-parameter : accumulates knowledge across the task stream
      As adaptation basis: Retrieve(S, τ) extracts task-specific subset at
                           inference time via cosine similarity over embeddings
    """
    def __init__(self, cfg: MetaClawConfig):
        self.cfg = cfg
        self.skills: List[Skill] = []
        self.generation: int = 0
        self._embedder = SimpleEmbedder(cfg.embed_dim)
        if Path(cfg.skill_library_path).exists():
            self.load()

    def add(self, skill: Skill) -> None:
        skill.generation = self.generation
        skill.embedding = self._embedder.encode(
            skill.name + " " + skill.description)
        self.skills.append(skill)
        log.info(f"  + Skill [{skill.name}] added (gen={skill.generation})")

    def retrieve(self, query: str, top_k: int = 3) -> List[Skill]:
        """Retrieve(S, τ) — top-k skills by cosine similarity (Eq. 2)."""
        if not self.skills:
            return []
        q_emb = self._embedder.encode(query)
        scored = [(cosine_sim(q_emb, s.embedding), s)
                  for s in self.skills if s.embedding is not None]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [s for _, s in scored[:top_k]]

    def format_for_prompt(self, skills: List[Skill]) -> str:
        """Format retrieved skills into an injected system-prompt block."""
        if not skills:
            return ""
        lines = ["## Active Skills\n"]
        for s in skills:
            lines += [f"### {s.name}", f"_{s.description}_\n", s.content, ""]
        return "\n".join(lines)

    def increment_generation(self) -> None:
        self.generation += 1
        log.info(f"  Skill generation → {self.generation}")

    def get_existing_names(self) -> List[str]:
        return [s.name for s in self.skills]

    def save(self) -> None:
        Path(self.cfg.skill_library_path).write_text(json.dumps(
            {"generation": self.generation,
             "skills": [s.to_dict() for s in self.skills]}, indent=2))

    def load(self) -> None:
        data = json.loads(Path(self.cfg.skill_library_path).read_text())
        self.generation = data.get("generation", 0)
        for d in data.get("skills", []):
            s = Skill.from_dict(d)
            s.embedding = self._embedder.encode(s.name + " " + s.description)
            self.skills.append(s)
        log.info(f"Loaded {len(self.skills)} skills (gen={self.generation})")


# ── SECTION 3: Skill Evolver ─────────────────────────────────────────────────

class SkillEvolver:
    """
    E(S_g, D^sup_g) — synthesizes new skills from failure trajectories.

    Gradient-free by design (Section 3.2): the skill library lives in
    discrete natural-language space where gradient descent is ill-defined.
    LLM-based failure analysis is the correct mechanism for this space.

    In production: calls an LLM API with the evolver prompt template.
    In smoke-test mode: generates heuristic skills from failure patterns.
    """

    EVOLVER_PROMPT = """You are a skill engineer for an AI assistant.
Analyze the failed conversations below and generate NEW skills that would
have prevented those failures.

Failed Conversations:
{failures}

Existing Skills (do NOT duplicate): {existing_skills}

Generate 1 to {max_skills} new skill JSON objects with fields:
  name, description (one sentence), content (6-15 line Markdown), category.

Output ONLY a valid JSON array. No preamble or explanation."""

    def __init__(self, cfg: MetaClawConfig, llm_client=None):
        self.cfg = cfg
        self.llm = llm_client

    def evolve(self, library: SkillLibrary,
               failures: List["Trajectory"]) -> List[Skill]:
        """Synthesize new skills. Returns list of Skill objects."""
        if not failures:
            return []
        existing = library.get_existing_names()
        if self.llm is not None:
            return self._evolve_with_llm(failures, existing)
        return self._evolve_heuristic(failures, existing)

    def _evolve_with_llm(self, failures: List["Trajectory"],
                        existing: List[str]) -> List[Skill]:
        prompt = self.EVOLVER_PROMPT.format(
            failures=self._format_failures(failures),
            existing_skills=json.dumps(existing),
            max_skills=self.cfg.max_new_skills_per_run,
        )
        try:
            raw = self.llm.complete(prompt).strip()
            if raw.startswith("```"):
                raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
            return [Skill.from_dict(d) for d in json.loads(raw)
                    if d.get("name") not in existing]
        except Exception as e:
            log.warning(f"LLM evolver failed: {e}; using heuristic fallback.")
            return self._evolve_heuristic(failures, existing)

    def _evolve_heuristic(self, failures, existing) -> List[Skill]:
        """
        Offline fallback: the three most common failure categories
        found in MetaClaw-Bench (temporal format, backup, naming).
        """
        templates = [
            Skill("iso8601-timezone-format",
                  "Always use ISO 8601 with +08:00 timezone offset for timestamps.",
                  "## ISO 8601 Timestamp\n\n1. Format: YYYY-MM-DDTHH:MM:SS+08:00\n"
                  "2. Correct: 2026-03-16T09:30:00+08:00\n"
                  "3. Wrong: March 16 at 3pm, or UTC Z suffix\n\n"
                  "**Anti-pattern:** Omitting timezone or using natural language.",
                  "common_mistakes"),
            Skill("backup-before-modify",
                  "Create a .bak copy before modifying any existing file.",
                  "## Backup Before Modify\n\n1. Before editing: cp <f> <f>.bak\n"
                  "2. Verify backup exists.\n3. Modify original.\n\n"
                  "**Anti-pattern:** Overwriting without backup — no recovery path.",
                  "agentic"),
            Skill("date-prefix-naming",
                  "Name output files with YYYYMMDD_ prefix in snake_case.",
                  "## Date-Prefix File Naming\n\n1. Format: YYYYMMDD_description.ext\n"
                  "2. Example: 20260316_sprint_board.json\n"
                  "3. Use underscores, not hyphens or spaces.\n\n"
                  "**Anti-pattern:** board-2026.json or 'sprint board.json'",
                  "productivity"),
            Skill("required-metadata-fields",
                  "Include created_at, author, status in every output JSON file.",
                  "## Mandatory Metadata\n\nAll output files must include:\n"
                  "- created_at: ISO 8601 timestamp\n- author: agent identifier\n"
                  "- status: 'draft' | 'final' | 'pending_review'\n\n"
                  "**Anti-pattern:** Omitting metadata fields causing checker failures.",
                  "coding"),
            Skill("verify-before-destructive",
                  "Read file state before any destructive write or delete command.",
                  "## Verify-Before-Write\n\n1. Read current state (cat / ls / status)\n"
                  "2. Confirm path is correct.\n3. Execute destructive command.\n"
                  "4. Verify output immediately.\n\n"
                  "**Anti-pattern:** Blindly writing/deleting without pre-check.",
                  "agentic"),
        ]
        new = [t for t in templates if t.name not in existing]
        return new[:self.cfg.max_new_skills_per_run]

    @staticmethod
    def _format_failures(failures) -> str:
        return "\n---\n".join(
            f"Failure {i+1} (reward={f.reward:.2f})\n"
            f"Task: {f.task_instruction[:200]}\n"
            f"Response: {f.agent_response[:300]}\n"
            f"Error: {f.error_hint or 'none'}"
            for i, f in enumerate(failures)
        )


# ── SECTION 4: Trajectory & RL Buffer ────────────────────────────────────────

@dataclass
class Trajectory:
    """
    A single agent trajectory stamped with skill_generation (g).
    The stamp is the mechanism that enforces support-query separation
    (Section 3.4): stale samples flushed from buffer when g increments.
    """
    task_id: str
    task_instruction: str
    agent_response: str
    reward: float                       # PRM score in [0, 1]
    skill_generation: int               # g at collection time
    active_skill_names: List[str]
    role: DataRole = DataRole.QUERY
    error_hint: Optional[str] = None
    timestamp: float = field(default_factory=time.time)

    @property
    def is_failure(self) -> bool:
        return self.reward < 0.5


class RLBuffer:
    """
    RL training buffer with skill-generation versioning.

    Core invariant: only QUERY-role trajectories enter.
    When g increments, all samples with gen < new_g are flushed —
    they carry rewards conditioned on the old skill context (Section 3.4).
    """
    def __init__(self, max_size: int = 512):
        self.buffer: deque = deque(maxlen=max_size)
        self._lock = threading.Lock()

    def add(self, traj: Trajectory) -> None:
        assert traj.role == DataRole.QUERY, \
            "Only query-role trajectories enter the RL buffer."
        with self._lock:
            self.buffer.append(traj)

    def flush_stale(self, current_gen: int) -> int:
        """
        Flush samples with gen < current_gen.
        Called immediately after skill library update (g → g+1).
        Prevents stale reward contamination in gradient updates.
        """
        with self._lock:
            before = len(self.buffer)
            self.buffer = deque(
                [t for t in self.buffer if t.skill_generation >= current_gen],
                maxlen=self.buffer.maxlen)
            n = before - len(self.buffer)
        if n:
            log.info(f"  Buffer: flushed {n} stale samples (gen<{current_gen}), "
                     f"{len(self.buffer)} remain")
        return n

    def sample(self, n: int) -> List[Trajectory]:
        with self._lock:
            pool = list(self.buffer)
        return random.sample(pool, min(n, len(pool)))

    def __len__(self) -> int:
        return len(self.buffer)


# ── SECTION 5: Process Reward Model ─────────────────────────────────────────

class ProcessRewardModel:
    """
    R(ξ) — scores trajectories for policy optimization (Eq. 4).

    Production: a trained PRM that scores each step in the trajectory
    (Zhang et al. 2025c). Smoke-test: heuristic scorer from known signals.
    """
    def __init__(self, model_path=None):
        self.model_path = model_path

    def score(self, traj: Trajectory) -> float:
        if self.model_path:
            return self._score_with_model(traj)
        return self._score_heuristic(traj)

    def _score_heuristic(self, traj: Trajectory) -> float:
        r = traj.reward
        resp = traj.agent_response.lower()
        if ".bak" in resp:
            r = min(1.0, r + 0.10)
        if any(k in resp for k in ["verify", "check", "inspect"]):
            r = min(1.0, r + 0.05)
        if "+08:00" in traj.agent_response:
            r = min(1.0, r + 0.05)
        return r

    def _score_with_model(self, traj: Trajectory) -> float:
        raise NotImplementedError("Load your PRM checkpoint here.")


# ── SECTION 6: Cloud LoRA Fine-Tuner (GRPO) ──────────────────────────────────

class CloudLoRAFinetuner:
    """
    Gradient-based policy update via cloud LoRA + GRPO (Shao et al. 2024).

    Implements Eq. 4: θ_{t+1} = θ_t + α ∇_θ E[(R(π_θ(· | τ, S_{g'}))]

    No local GPU needed: training is delegated to a cloud backend
    (Tinker / MinT). Smoke-test: simulates convergence with plausible dynamics.
    """
    def __init__(self, cfg: MetaClawConfig, backend=None):
        self.cfg = cfg
        self.backend = backend
        self._step = 0

    def update(self, batch: List[Trajectory],
               library: SkillLibrary) -> Dict[str, float]:
        if self.backend:
            return self._real_grpo_update(batch, library)
        return self._simulated_grpo_update(batch)

    def _simulated_grpo_update(self, batch) -> Dict[str, float]:
        """GRPO simulation: normalize rewards within group, compute surrogate."""
        rewards = [t.reward for t in batch]
        mean_r  = sum(rewards) / len(rewards)
        std_r   = (sum((r-mean_r)**2 for r in rewards) / len(rewards))**0.5
        # Simulated surrogate loss converges over steps
        loss    = max(0.0, 1.0/(1+0.1*self._step) * (1-mean_r)
                      + random.gauss(0, 0.02))
        self._step += 1
        return {"loss": loss, "mean_reward": mean_r,
                "pg_norm": abs(random.gauss(0.5, 0.1)), "step": self._step}

    def _real_grpo_update(self, batch, library) -> Dict[str, float]:
        data = [{
            "prompt": library.format_for_prompt(
                library.retrieve(t.task_instruction, top_k=3)
            ) + "\n\nTask: " + t.task_instruction,
            "completion": t.agent_response, "reward": t.reward,
        } for t in batch]
        return self.backend.grpo_step(data=data, lr=self.cfg.rl_lr,
                                      lora_rank=self.cfg.lora_rank)

    def hot_swap_weights(self) -> None:
        """Atomic model swap — zero downtime if using blue-green deployment."""
        log.info("  [LoRA] Hot-swapping updated weights → inference endpoint")
        time.sleep(0.01)
        log.info("  [LoRA] Hot-swap complete. Updated θ now live.")


# ── SECTION 7: Opportunistic Meta-Learning Scheduler (OMLS) ──────────────────

class OMLSSignal(Enum):
    SLEEP    = "sleep_window"
    INACTIVE = "keyboard_inactive"
    CALENDAR = "calendar_free"
    BUSY     = "user_busy"


class OMLS:
    """
    Opportunistic Meta-Learning Scheduler (Section 3.5).

    Monitors three complementary idle signals:
      1. Sleep window  — configurable nightly schedule (largest block)
      2. Keyboard idle — OS input timer (macOS: HIDIdleTime; Linux: xprintidle)
      3. Calendar free — Google Calendar free/busy query (proactive)

    Window opens when ANY signal indicates idle.
    Window closes when ANY signal indicates return.
    Trainer supports mid-batch checkpointing across fragmented windows.
    """
    def __init__(self, cfg: MetaClawConfig,
                 calendar_client=None, mock_idle: bool = False):
        self.cfg = cfg
        self.calendar = calendar_client
        self.mock_idle = mock_idle

    def is_idle(self) -> Tuple[bool, OMLSSignal]:
        if self.mock_idle:
            return True, OMLSSignal.INACTIVE
        if self._in_sleep_window():
            return True, OMLSSignal.SLEEP
        if self._keyboard_idle():
            return True, OMLSSignal.INACTIVE
        if self.calendar and self._calendar_free():
            return True, OMLSSignal.CALENDAR
        return False, OMLSSignal.BUSY

    def _in_sleep_window(self) -> bool:
        h = datetime.now().hour
        s, e = self.cfg.sleep_start_hour, self.cfg.sleep_end_hour
        return (h >= s or h < e) if s > e else s <= h < e

    def _keyboard_idle(self) -> bool:
        try:
            import subprocess
            out = subprocess.check_output(
                ["ioreg", "-c", "IOHIDSystem"],
                stderr=subprocess.DEVNULL, timeout=2).decode()
            for line in out.splitlines():
                if "HIDIdleTime" in line:
                    return int(line.split("=")[1].strip())/1e9 >= \
                           self.cfg.idle_threshold_secs
        except Exception:
            pass
        return False

    def _calendar_free(self) -> bool:
        try:
            now = datetime.utcnow()
            events = self.calendar.events().list(
                calendarId="primary",
                timeMin=now.isoformat()+"Z",
                timeMax=(now+timedelta(minutes=5)).isoformat()+"Z",
                singleEvents=True).execute()
            return len(events.get("items", [])) == 0
        except Exception:
            return False


# ── SECTION 8: Agent Executor ─────────────────────────────────────────────────

@dataclass
class Task:
    task_id: str
    instruction: str
    task_type: str = "file-check"
    checker_fn: Optional[Any] = None


class AgentExecutor:
    """
    Serves user tasks with skill injection (Eq. 2): a ~ π_θ(· | τ, Retrieve(S, τ))

    Production: calls LLM API (Claude, Kimi, GPT) with skill-augmented prompt.
    Smoke-test: simulates responses; accuracy increases with skill count.
    """
    def __init__(self, library: SkillLibrary, cfg: MetaClawConfig,
                 llm_client=None, simulated_base_accuracy: float = 0.3):
        self.library = library
        self.cfg = cfg
        self.llm = llm_client
        self.base_acc = simulated_base_accuracy

    def execute(self, task: Task, prm: ProcessRewardModel) -> Trajectory:
        active = self.library.retrieve(task.instruction, top_k=self.cfg.top_k_skills)
        skill_ctx = self.library.format_for_prompt(active)
        system = ("You are MetaClaw Agent. Issue one command at a time. "
                  "Verify state before destructive ops. Reply DONE when complete.\n"
                  + skill_ctx)

        response = (self._call_llm(system, task.instruction) if self.llm
                    else self._simulate(task, active))

        reward, hint = self._evaluate(task, response)
        traj = Trajectory(
            task_id=task.task_id, task_instruction=task.instruction,
            agent_response=response, reward=reward,
            skill_generation=self.library.generation,
            active_skill_names=[s.name for s in active],
            error_hint=hint,
        )
        traj.reward = prm.score(traj)
        return traj

    def _call_llm(self, system: str, user: str) -> str:
        """Call production LLM. Replace with your provider's SDK."""
        return self.llm.chat([{"role":"system","content":system},
                               {"role":"user","content":user}])

    def _simulate(self, task: Task, skills: List[Skill]) -> str:
        acc = min(0.95, self.base_acc + len(skills) * 0.12)
        if random.random() < acc:
            parts = ["I will complete this carefully."]
            if any("backup" in s.name for s in skills):
                parts.append("cp file.json file.json.bak")
            if any("iso8601" in s.name for s in skills):
                parts.append("timestamp: 2026-03-16T09:30:00+08:00")
            parts.append("DONE")
            return " | ".join(parts)
        return "I updated the file with the required changes."

    def _evaluate(self, task: Task,
                  response: str) -> Tuple[float, Optional[str]]:
        if task.checker_fn:
            passed, hint = task.checker_fn(response)
            return (1.0 if passed else 0.0), hint
        lower = response.lower()
        ok = "done" in lower and not any(
            e in lower for e in ["error", "fail", "incorrect"])
        return (1.0 if ok else 0.0), (None if ok else "Task not completed.")


# ── SECTION 9: MetaClaw Core — Algorithm 1 ───────────────────────────────────

class MetaClaw:
    """
    MetaClaw: Continual Meta-Learning for Deployed LLM Agents.

    Implements Algorithm 1 — the main serving loop that continuously
    improves M = (θ, S) through normal usage via two loops:

      Fast (seconds): skill-driven adaptation from failures → S update
      Slow (hours)  : opportunistic RL policy optimization  → θ update
    """
    def __init__(self, cfg=None, llm_client=None, training_backend=None,
                 calendar_client=None, mock_idle=False,
                 simulated_base_accuracy=0.25):
        self.cfg = cfg or MetaClawConfig()
        self.library   = SkillLibrary(self.cfg)          # S
        self.evolver   = SkillEvolver(self.cfg, llm_client)
        self.buffer    = RLBuffer(self.cfg.buffer_max_size)
        self.prm       = ProcessRewardModel()
        self.finetuner = CloudLoRAFinetuner(self.cfg, training_backend)
        self.omls      = OMLS(self.cfg, calendar_client, mock_idle=mock_idle)
        self.executor  = AgentExecutor(
            self.library, self.cfg, llm_client, simulated_base_accuracy)
        self._support_set: List[Trajectory] = []
        self.stats = {"tasks_served":0, "skills_created":0,
                      "rl_updates":0, "total_reward":0.0}

    def serve(self, task: Task) -> Trajectory:
        """
        Algorithm 1, lines 2-23.
        Execute → stamp → route → maybe adapt → maybe optimize.
        """
        # Lines 4-6: Execute and stamp with current generation
        traj = self.executor.execute(task, self.prm)
        self.stats["tasks_served"] += 1
        self.stats["total_reward"] += traj.reward
        log.info(f"Task [{task.task_id}] reward={traj.reward:.3f} "
                 f"gen={traj.skill_generation} skills={traj.active_skill_names}")

        # Lines 7-11: Route to support set or RL buffer
        if traj.is_failure:
            traj.role = DataRole.SUPPORT
            self._support_set.append(traj)
        else:
            traj.role = DataRole.QUERY
            self.buffer.add(traj)

        # Lines 13-18: Skill-driven fast adaptation
        if len(self._support_set) >= self.cfg.skill_failure_threshold:
            self._run_skill_adaptation()

        # Lines 20-23: Opportunistic policy optimization
        self._maybe_run_rl_update()
        return traj

    def _run_skill_adaptation(self) -> None:
        """Eq. 3: S_{g+1} = S_g ∪ E(S_g, D^sup_g)"""
        log.info(f"[Evolver] Analyzing {len(self._support_set)} failures...")
        new_skills = self.evolver.evolve(self.library, self._support_set)
        if new_skills:
            for s in new_skills:
                self.library.add(s)
            self.stats["skills_created"] += len(new_skills)
            self.library.increment_generation()
            self.buffer.flush_stale(self.library.generation)  # versioning
            self.library.save()
        self._support_set.clear()

    def _maybe_run_rl_update(self) -> None:
        """Lines 20-23: OMLS idle check → GRPO update → hot-swap."""
        idle, sig = self.omls.is_idle()
        if not idle or len(self.buffer) < self.cfg.rl_batch_size:
            return
        log.info(f"[OMLS] Idle ({sig.value}). Buffer={len(self.buffer)}. Running RL...")
        batch = self.buffer.sample(self.cfg.rl_batch_size)
        m = self.finetuner.update(batch, self.library)
        self.finetuner.hot_swap_weights()
        self.stats["rl_updates"] += 1
        log.info(f"[RL] step={m['step']} loss={m['loss']:.4f} "
                 f"mean_reward={m['mean_reward']:.4f}")

    def get_stats(self) -> Dict:
        n = max(1, self.stats["tasks_served"])
        return {**self.stats, "mean_reward": self.stats["total_reward"]/n,
                "skill_library_size": len(self.library.skills),
                "skill_generation": self.library.generation,
                "buffer_size": len(self.buffer)}


# ── SECTION 10: MetaClaw-Bench Dataset Helpers ───────────────────────────────

def make_task(day: int, round_: int, part: int = 1) -> Task:
    """Synthetic MetaClaw-Bench task with automated checker."""
    templates = [
        ("Update sprint board and set task statuses to done", "file-check"),
        ("Create deployment log with ISO 8601 timestamp and env fields", "file-check"),
        ("Which approach correctly uses status check before write?", "multi-choice"),
        ("Append completion record to done.log with all required fields", "file-check"),
        ("What timestamp format is required for Asia/Shanghai timezone?", "multi-choice"),
    ]
    instr, ttype = templates[(day*3+round_) % len(templates)]

    def checker(response: str) -> Tuple[bool, Optional[str]]:
        r = response.lower()
        needs_bak = any(k in instr.lower() for k in ["update","append"])
        needs_ts  = any(k in instr.lower() for k in ["timestamp","log"])
        if ttype == "file-check":
            if needs_bak and ".bak" not in r:
                return False, "Missing .bak backup (P4)"
            if needs_ts and "+08:00" not in response:
                return False, "Missing +08:00 timezone offset (P1)"
            return "done" in r, "Task not confirmed."
        return True, None

    return Task(task_id=f"P{part}_d{day:02d}_r{round_}",
                instruction=instr, task_type=ttype, checker_fn=checker)


def run_benchmark(agent: MetaClaw, n_days=5, tasks_per_day=6) -> Dict:
    day_results = {}
    for day in range(1, n_days+1):
        ok = sum(1 for r in range(tasks_per_day)
                 if not agent.serve(make_task(day, r)).is_failure)
        day_results[f"day_{day:02d}"] = {"accuracy": ok/tasks_per_day}
        log.info(f"Day {day:02d} accuracy: {ok/tasks_per_day:.1%}")
    return {**agent.get_stats(), "per_day": day_results}


# ── Utilities ─────────────────────────────────────────────────────────────────

class SimpleEmbedder:
    """
    Hash-based pseudo-embedder for smoke-test.
    Replace with: SentenceTransformer('all-MiniLM-L6-v2').encode(text)
    """
    def __init__(self, dim=384): self.dim = dim
    def encode(self, text: str) -> List[float]:
        seed = int(hashlib.sha256(text.encode()).hexdigest()[:8], 16)
        rng = random.Random(seed)
        raw = [rng.gauss(0, 1) for _ in range(self.dim)]
        n = sum(x**2 for x in raw)**0.5 + 1e-8
        return [x/n for x in raw]


def cosine_sim(a, b) -> float:
    dot = sum(x*y for x,y in zip(a,b))
    return dot / ((sum(x**2 for x in a)**0.5)*(sum(x**2 for x in b)**0.5)+1e-8)


# ── SECTION 11: Smoke Test ────────────────────────────────────────────────────

if __name__ == "__main__":
    print("="*65)
    print("  MetaClaw — Full Framework Smoke Test")
    print("="*65)

    # 1. Initialize
    print("\n[1/5] Initializing MetaClaw...")
    cfg = MetaClawConfig(skill_failure_threshold=3, rl_batch_size=4,
                         max_new_skills_per_run=2,
                         skill_library_path="/tmp/mc_smoke.json")
    agent = MetaClaw(cfg=cfg, mock_idle=True, simulated_base_accuracy=0.2)
    print(f"  Skills in library: {len(agent.library.skills)} | Gen: {agent.library.generation}")

    # 2. Verify empty retrieval
    print("\n[2/5] Testing retrieval (empty library)...")
    assert agent.library.retrieve("Update JSON file") == []
    print("  ✓ Empty library returns []")

    # 3. Serve tasks → trigger skill evolution
    print("\n[3/5] Serving 12 tasks → triggering skill evolver...")
    trajs = [agent.serve(make_task(1, r)) for r in range(12)]
    ok = sum(1 for t in trajs if not t.is_failure)
    print(f"  Successes: {ok}/12 | Skills: {len(agent.library.skills)} | Gen: {agent.library.generation}")
    assert len(agent.library.skills) > 0, "Skill evolver should have fired"

    # 4. Verify no stale samples in buffer
    print("\n[4/5] Verifying skill generation versioning...")
    gens = {t.skill_generation for t in agent.buffer.buffer}
    assert all(g >= agent.library.generation-1 for g in gens), \
        "Stale trajectories leaked into RL buffer!"
    print(f"  ✓ Buffer generations: {sorted(gens)} — no stale data.")

    # 5. Run 5-day benchmark simulation
    print("\n[5/5] 5-day MetaClaw-Bench simulation...")
    results = run_benchmark(agent, n_days=5, tasks_per_day=6)
    print("\n  Per-Day Accuracy:")
    for day, info in results["per_day"].items():
        acc = info["accuracy"]
        bar = "█"*int(acc*20) + "░"*(20-int(acc*20))
        print(f"  {day}: [{bar}] {acc:.1%}")

    print("\n  Final Statistics:")
    for k, v in results.items():
        if k != "per_day":
            print(f"  {k:25s} = {v:.4f}" if isinstance(v, float) else
                  f"  {k:25s} = {v}")

    print("\n"+"="*65)
    print("  ✓ All checks passed. MetaClaw is ready for deployment.")
    print("="*65)
    print("""
Next steps:
  1. pip install sentence-transformers  # replace SimpleEmbedder
  2. Plug in your LLM provider:
       agent = MetaClaw(cfg=cfg, llm_client=your_client)
  3. Connect cloud LoRA backend (Tinker / MinT):
       agent = MetaClaw(cfg=cfg, training_backend=tinker_client)
  4. Enable Google Calendar OMLS signal:
       agent = MetaClaw(cfg=cfg, calendar_client=gcal_service)
  5. Skills auto-persist to cfg.skill_library_path after each evolution.
""")

Read the Full Paper & Access the Code

The complete MetaClaw study — including ablation tables, per-day accuracy traces, case studies, and the AutoResearchClaw pipeline — is available on arXiv and the authors’ GitHub.

📄 Read the Paper (arXiv) 💻 GitHub Repository

Academic Citation:
Xia, P., Chen, J., Yang, X., Tu, H., Liu, J., Xiong, K., Han, S., Qiu, S., Ji, H., Zhou, Y., Zheng, Z., Xie, C., & Yao, H. (2026). MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild. arXiv:2603.17187v1.

Independent editorial analysis of peer-reviewed research. The Python implementation is an educational adaptation. In production, replace SimpleEmbedder with sentence-transformers, connect a real LLM provider, and configure a cloud LoRA backend such as Tinker or MinT. Refer to the official GitHub repository for exact configurations and pretrained components.