Turning failed agent trajectories into high-quality training data
Paper (PDF) • Quickstart • How It Works • Architecture • Usage • Related projects • Citation
In LLM agent training, failed tool-use trajectories are routinely discarded. This is wasteful: a trajectory that fails Goal A may perfectly satisfy Goal B.
AgentHER borrows the core insight from Hindsight Experience Replay (HER) in reinforcement learning: instead of discarding failures, we relabel the goal to match what was actually achieved, creating valid training data from every trajectory.
| | Original (Failed) | Hindsight (Success) |
|---|---|---|
| Prompt | "Find copper wire under $5/kg" | "Find copper wire suppliers and compare pricing" |
| Trajectory | Searched 7 suppliers, best found at $5.30/kg | (same trajectory) |
| Label | ❌ Failure | ✅ Success |
The agent's work was thorough and correct: it just didn't meet an arbitrary price constraint. AgentHER recovers this data.
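The relabeling idea can be sketched in a few lines. This is a conceptual illustration only, not the AgentHER API: `hindsight_relabel` and the dictionary fields are hypothetical.

```python
# Conceptual sketch of hindsight relabeling (NOT the AgentHER API):
# keep the trajectory, swap the goal for one it actually satisfies.

def hindsight_relabel(trajectory: dict) -> dict:
    """Relabel a failed trajectory as a success for the goal it achieved."""
    achieved = trajectory["achieved_outcome"]   # what the agent actually did
    return {
        "prompt": achieved["matching_prompt"],  # goal reverse-engineered from the outcome
        "steps": trajectory["steps"],           # trajectory is reused verbatim
        "label": "success",                     # valid training data under the new goal
    }

failed = {
    "original_prompt": "Find copper wire under $5/kg",
    "steps": ["searched 7 suppliers", "best price $5.30/kg"],
    "achieved_outcome": {
        "matching_prompt": "Find copper wire suppliers and compare pricing",
    },
}
relabeled = hindsight_relabel(failed)
print(relabeled["label"])  # success
```

The four pipeline stages below automate exactly this: deciding whether a failure is recoverable, working out what was achieved, and writing the new prompt.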
```
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ 1. Failure       │──▶│ 2. Outcome       │──▶│ 3. Prompt        │──▶│ 4. Data          │
│    Detector      │   │    Extractor     │   │    Relabeler     │   │    Augmenter     │
│                  │   │                  │   │                  │   │                  │
│ Is this really   │   │ What did the     │   │ Reverse-         │   │ Package as       │
│ a failure?       │   │ agent achieve?   │   │ engineer a new   │   │ SFT / DPO /      │
│ Recoverable?     │   │                  │   │ matching prompt  │   │ ShareGPT         │
└──────────────────┘   └──────────────────┘   └──────────────────┘   └──────────────────┘
```
Stage 1 - Failure Detector: Validates whether the trajectory truly fails, classifies the failure type (constraint violation, wrong result, tool error, etc.), and assesses recoverability. Supports rule-based (free) or LLM-judge modes.

Stage 2 - Outcome Extractor: Analyzes observations to build a factual summary of what the agent actually accomplished, ignoring the original goal entirely.

Stage 3 - Prompt Relabeler: Uses an LLM to craft a natural, human-like prompt that the trajectory perfectly satisfies. Includes confidence scoring and retry logic.

Stage 4 - Data Augmenter: Packages the new (prompt, trajectory) pair into standard training formats: SFT, DPO (with chosen/rejected pairs), or ShareGPT multi-turn.
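The rule-based mode might look like the following sketch. The rules, names, and categories here are hypothetical; see `failure_detector.py` for the real logic.

```python
# Hypothetical rule-based failure classifier in the spirit of Stage 1.
# The keyword rules and return fields are illustrative, not the actual implementation.

def classify_failure(final_answer: str, failure_reason: str) -> dict:
    text = (final_answer + " " + failure_reason).lower()
    if any(w in text for w in ("exceed", "under $", "budget", "constraint")):
        # The work was done; only an arbitrary constraint was missed -> recoverable.
        failure_type, recoverable = "constraint_violation", True
    elif any(w in text for w in ("error", "timeout", "traceback")):
        # The trajectory itself is broken -> not worth relabeling.
        failure_type, recoverable = "tool_error", False
    else:
        failure_type, recoverable = "wrong_result", False
    return {"failure_type": failure_type, "recoverable": recoverable}

print(classify_failure("No flights under $500 found.",
                       "All flights exceed $500 budget"))
# {'failure_type': 'constraint_violation', 'recoverable': True}
```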
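The three output formats roughly correspond to the record shapes below. Field names here are illustrative; `data_augmenter.py` defines the exact schema.

```python
# Illustrative record shapes for the three output formats (field names are
# assumptions; consult data_augmenter.py for the real schema).

prompt = "Search for flights to Tokyo and compare prices across airlines"
trajectory_text = "Thought: ... Action: flight_search ... Observation: ..."

# SFT: plain (prompt, completion) pair
sft_record = {"prompt": prompt, "completion": trajectory_text}

# DPO: the relabeled trajectory as the preferred response
dpo_record = {
    "prompt": prompt,
    "chosen": trajectory_text,
    "rejected": "I could not help with that.",  # illustrative weaker response
}

# ShareGPT: multi-turn conversation format
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": trajectory_text},
    ]
}
```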
```
agenther/
├── models.py             # Pydantic data models (AgentStep, FailedTrajectory, etc.)
├── constants.py          # Shared thresholds (min observation length, truncation, etc.)
├── llm_client.py         # OpenAI-compatible LLM client with structured output
├── prompts.py            # Jinja2 prompt templates + steps_for_prompt()
├── failure_detector.py   # Stage 1: rule-based + LLM failure classification
├── outcome_extractor.py  # Stage 2: extract actual achievements
├── prompt_relabeler.py   # Stage 3: reverse-engineer hindsight prompts
├── data_augmenter.py     # Stage 4: SFT/DPO/ShareGPT formatting
├── pipeline.py           # End-to-end pipeline orchestrator
└── cli.py                # Command-line interface
```
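For orientation, the core input types in `models.py` carry roughly the fields shown in the usage example. Sketched here as plain dataclasses; the real models are Pydantic and may include additional fields and validation.

```python
# Dataclass analogue of the Pydantic input models (a sketch for orientation only).
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentStep:
    thought: str        # agent's reasoning at this step
    action_name: str    # tool invoked
    action_input: dict  # tool arguments
    observation: str    # tool output


@dataclass
class FailedTrajectory:
    original_prompt: str
    steps: list                       # must contain at least one AgentStep
    final_answer: str
    failure_reason: str
    trajectory_id: Optional[str] = None  # optional identifier
```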
```
# Recommended: use a virtual environment
python -m venv .venv && source .venv/bin/activate   # Linux/macOS
# or: .venv\Scripts\activate                        # Windows

pip install -e .
# Optional, for running tests: pip install -e ".[dev]"
```

```
# Try the bundled example without an API key (rule-based mode)
python examples/run_example.py --rule-based
```

```
# Set your API key for the LLM-backed stages
export OPENAI_API_KEY="your-key"
```
```
# Process failed trajectories → SFT data
agenther run examples/example_trajectories.json -f sft -o outputs/sft_data.jsonl

# Generate DPO pairs
agenther run examples/example_trajectories.json -f dpo -o outputs/dpo_data.jsonl

# Validate input format
agenther validate examples/example_trajectories.json

# With vLLM / Ollama / any OpenAI-compatible endpoint
agenther run data.json --model "llama3" --base-url "http://localhost:8000/v1"
```

```python
from agenther import AgentHERPipeline, PipelineConfig
from agenther.models import FailedTrajectory, AgentStep, OutputFormat

# Define a failed trajectory
trajectory = FailedTrajectory(
    original_prompt="Find flights to Tokyo under $500",
    steps=[
        AgentStep(
            thought="Searching for flights",
            action_name="flight_search",
            action_input={"destination": "Tokyo", "max_price": 500},
            observation="Found: ANA $680, JAL $720, United $590",
        ),
    ],
    final_answer="No flights under $500 found.",
    failure_reason="All flights exceed $500 budget",
)

# Run the pipeline
config = PipelineConfig(model="gpt-4o", output_format=OutputFormat.SFT)
pipeline = AgentHERPipeline(config)
result = pipeline.process(trajectory)

if result.success:
    print(f"Hindsight prompt: {result.relabeled.hindsight_prompt}")
    # e.g., "Search for flights to Tokyo and compare prices across airlines"
```

Provide failed trajectories as JSON or JSONL. `steps` must contain at least one step.
```json
{
  "trajectory_id": "optional_id",
  "original_prompt": "The user's original request",
  "steps": [
    {
      "thought": "Agent's reasoning",
      "action_name": "tool_name",
      "action_input": {"key": "value"},
      "observation": "Tool output"
    }
  ],
  "final_answer": "Agent's final response",
  "failure_reason": "Why this is considered a failure"
}
```

CLI options override defaults; there is no config-file loading. For reference, `configs/default.yaml` documents the same options (use it as a template and pass values via the CLI or `PipelineConfig` in code):
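A minimal check of this schema can be done in a few lines. This is a sketch only; the `agenther validate` command may perform additional checks.

```python
# Minimal input validation matching the schema above (a sketch; the real
# `agenther validate` command may check more, e.g. per-step field types).
import json

REQUIRED = ("original_prompt", "steps", "final_answer", "failure_reason")

def validate_trajectory(raw: str) -> list:
    """Return a list of problems found in one JSON trajectory record."""
    errors = []
    obj = json.loads(raw)
    for name in REQUIRED:
        if name not in obj:
            errors.append(f"missing field: {name}")
    if not obj.get("steps"):
        errors.append("steps must contain at least one step")
    return errors

raw = '{"original_prompt": "p", "steps": [], "final_answer": "a", "failure_reason": "r"}'
print(validate_trajectory(raw))  # ['steps must contain at least one step']
```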
```yaml
llm:
  model: "gpt-4o"
  temperature: 0.3

pipeline:
  use_llm_detector: false    # Rule-based is faster and free
  use_llm_extractor: true    # An LLM gives better outcome extraction
  output_format: "sft"       # sft | dpo | sharegpt
  min_confidence: 0.5        # θ: quality threshold for relabeling
  severity_threshold: 0.3    # δ: discard trajectories with severity weight below this
  relabel_max_attempts: 3    # K: retry relabeling up to this many times
  output_dir: "outputs"      # Default output directory
```

```
pip install -e ".[dev]"
pytest -v
```

- Batch processing is sequential: no parallelism, so large batches may be slow.
- No config file: options are passed via the CLI or `PipelineConfig` in code.
- Rule-based stages are heuristics: for best quality, use the LLM detector and extractor when cost allows.
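Since there is no built-in config-file loading, one way to reuse the `configs/default.yaml` template is to flatten its sections into keyword arguments yourself. A sketch, assuming `PipelineConfig` accepts these keys (the flattening helper and the parsed-dict shape are illustrative):

```python
# Flatten a parsed config dict (e.g. from configs/default.yaml) into kwargs.
# `flatten_config` is a hypothetical helper, not part of AgentHER.

def flatten_config(cfg: dict) -> dict:
    """Merge the `llm` and `pipeline` sections into one kwargs dict."""
    return {**cfg.get("llm", {}), **cfg.get("pipeline", {})}

# What yaml.safe_load(open("configs/default.yaml")) might return (abridged):
cfg = {
    "llm": {"model": "gpt-4o", "temperature": 0.3},
    "pipeline": {"output_format": "sft", "min_confidence": 0.5},
}
kwargs = flatten_config(cfg)
# pipeline = AgentHERPipeline(PipelineConfig(**kwargs))  # assumes matching keys
print(kwargs["model"], kwargs["output_format"])  # gpt-4o sft
```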
Issues and pull requests are welcome on GitHub.
- AdaRubrics: an adaptive, dynamic rubric evaluator for agent trajectories that generates task-specific dimensions and scores runs for filtering/RLHF. Use it to score or filter relabeled data from AgentHER.
- AgentSynth: a synthetic agent data pipeline (forward + back-translation, execution-based reject sampling). AgentHER can relabel failed or low-quality synthetic runs into valid SFT/DPO data.
- trajectory_tokenization: ReAct with trajectory tokenization, which compresses long (Thought, Action, Observation) histories so long-horizon runs fit in context. It addresses context length; AgentHER addresses reuse of failed trajectories.
The full paper is available here: AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling (PDF).
```bibtex
@software{agenther2025,
  title  = {AgentHER: Hindsight Experience Replay for LLM Agents},
  author = {Ding, Liang},
  year   = {2025},
  url    = {https://github.com/alphadl/AgentHER},
}
```

Apache 2.0
