
Observability: real-time lifecycle dashboard + telemetry instrumentation for analysis runs #5

@ethenotethan


Summary

The multi-agent orchestration workflow (discovery → library analysis → application analysis → synthesis) currently has no structured telemetry or real-time visibility into its execution. There is a nascent observability/ directory with a static session_profiler.html and a live_monitor.sh script, but these are post-hoc tools that require manual wiring. This issue tracks adding first-class, always-on observability so any run can be profiled and monitored in a dashboard without extra setup.


Problem

When a full analysis run executes (e.g., 24 libraries + 10 applications across 3 language stacks), there is no way to:

  • See which subagents are currently active vs. queued vs. complete in real time
  • Measure wall-clock time per phase (depth-0 libraries, depth-N libraries, application analysis, synthesis)
  • Measure per-subagent token consumption, tool-call counts, and latency
  • Detect stalled or failed subagents without tailing raw log files
  • Compare performance across runs (e.g., a re-analysis after a diff vs. a full cold run)

The only current signals are logs/latest/tool_calls.jsonl and transcript.txt, both of which require manual parsing.


Proposed Work

1. Structured Telemetry Emission

Instrument the lead agent and subagent lifecycle with structured span events written to a telemetry.jsonl file alongside the existing tool_calls.jsonl:

// Span open
{ "event": "span_start", "span_id": "lib-encoding", "parent": "phase-depth2", "component": "encoding", "kind": "library-analysis", "ts": 1712534400.123 }

// Span close
{ "event": "span_end", "span_id": "lib-encoding", "status": "ok", "duration_ms": 14878, "tokens": 42100, "tool_uses": 31, "ts": 1712534415.001 }

// Phase boundary
{ "event": "phase", "name": "library-depth-0-complete", "libraries": 9, "wall_ms": 68400, "ts": 1712534400.999 }

Spans to instrument:

  • Full run (root span)
  • Each phase (depth-0 libs, depth-N libs, application analysis, synthesis)
  • Each subagent invocation (library / application / external-service / architecture-documenter)
  • Discovery engine execution
  • Manifest write
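One way the span events above could be emitted is with a small context manager wrapped around each unit of work. This is a minimal sketch, not the project's implementation: TelemetryWriter, its span() method, and the stats dictionary are hypothetical names, and it assumes the caller knows the token and tool-use counts by the time the span closes.

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path


class TelemetryWriter:
    """Appends one JSON object per line to a telemetry file (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def emit(self, event, **fields):
        # Wall-clock timestamp in seconds, matching the "ts" field in the samples.
        record = {"event": event, **fields, "ts": round(time.time(), 3)}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    @contextmanager
    def span(self, span_id, parent=None, **attrs):
        # A monotonic clock is used for the duration so that system clock
        # adjustments during a run do not distort it.
        start = time.monotonic()
        self.emit("span_start", span_id=span_id, parent=parent, **attrs)
        stats = {"tokens": 0, "tool_uses": 0}  # the caller fills these in
        status = "ok"
        try:
            yield stats
        except Exception:
            status = "error"
            raise
        finally:
            duration_ms = int((time.monotonic() - start) * 1000)
            self.emit("span_end", span_id=span_id, status=status,
                      duration_ms=duration_ms, **stats)
```

A subagent invocation would then look like `with telemetry.span("lib-encoding", parent="phase-depth2", component="encoding", kind="library-analysis") as stats:`, with the body setting `stats["tokens"]` and `stats["tool_uses"]` before it exits. Because span_end is written in a finally block, a failed subagent still closes its span with "status": "error".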

2. Live Dashboard (upgrade session_profiler.html)

Upgrade observability/session_profiler.html into a proper live dashboard with:

  • Auto-refresh: poll telemetry.jsonl every ~2s, or subscribe to a small SSE/WebSocket endpoint served by serve_logs.py
  • Gantt / swimlane view: one row per subagent, colored by phase, with wall-clock time on the x-axis; in-progress spans are shown with an animated fill
  • Phase summary bar at the top: total elapsed, current phase, % complete, active agent count
  • Per-agent cards: name, kind, status (queued / running / done / error), elapsed, tokens, tool calls
  • Token burn rate chart: tokens/minute across all active subagents, computed over a rolling 30s window
  • Error/warning panel: surfaces any "status": "error" spans immediately

Tech: keep it as a single-file HTML + vanilla JS + D3 (already imported); serve via the existing serve_logs.py.
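The per-agent cards amount to folding the event stream into one record per span. The dashboard itself would do this in JavaScript; the sketch below shows the same reduction in Python for clarity. fold_events and the card fields are hypothetical. Note one limitation: the event schema in this issue has no "queued" event, so this sketch can only derive running / done / error.

```python
def fold_events(events, now):
    """Reduce telemetry events to one card per span (hypothetical card shape)."""
    cards = {}
    for ev in events:
        if ev["event"] == "span_start":
            cards[ev["span_id"]] = {
                "kind": ev.get("kind"),
                "status": "running",
                "started": ev["ts"],
                "elapsed_ms": None,
                "tokens": None,
                "tool_uses": None,
            }
        elif ev["event"] == "span_end":
            # Tolerate a span_end whose span_start was missed (e.g. late connect).
            card = cards.setdefault(ev["span_id"], {"kind": None, "started": None})
            card.update(
                status="done" if ev.get("status") == "ok" else "error",
                elapsed_ms=ev.get("duration_ms"),
                tokens=ev.get("tokens"),
                tool_uses=ev.get("tool_uses"),
            )
        # Other event types (such as "phase") do not affect the cards.
    for card in cards.values():
        if card["status"] == "running":
            # Spans that are still open get a live elapsed time.
            card["elapsed_ms"] = round((now - card["started"]) * 1000)
    return cards
```

The same fold supports stall detection: any card that is still "running" with an elapsed_ms above some chosen limit is a candidate for the error/warning panel.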

3. serve_logs.py SSE endpoint

Add a /events Server-Sent Events endpoint to serve_logs.py that tails telemetry.jsonl and pushes new lines to connected browsers. This removes the need for the dashboard to poll a file and enables sub-second latency updates.
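A rough sketch of such an endpoint, using only the Python standard library. How serve_logs.py is actually structured is not described in this issue, so the handler class, the follow() helper, the file path constant, and the port are all assumptions made for illustration.

```python
import os
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

TELEMETRY_PATH = "logs/latest/telemetry.jsonl"  # assumed location


def sse_frame(line):
    """Wrap one JSONL line as a Server-Sent Events message."""
    return f"data: {line.rstrip()}\n\n".encode("utf-8")


def follow(path, poll_s=0.25, max_idle_polls=None):
    """Yield complete lines from path, then keep yielding lines as they are appended.

    max_idle_polls bounds the wait (useful in tests); None means follow forever.
    """
    idle = 0
    buffer = ""
    with open(path, "r", encoding="utf-8") as fh:
        while max_idle_polls is None or idle < max_idle_polls:
            chunk = fh.readline()
            if not chunk:
                idle += 1
                time.sleep(poll_s)
                continue
            idle = 0
            buffer += chunk
            if buffer.endswith("\n"):  # only emit fully written lines
                yield buffer
                buffer = ""


class EventsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/events":
            self.send_error(404)
            return
        if not os.path.exists(TELEMETRY_PATH):
            self.send_error(503, "telemetry not available yet")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()
        try:
            for line in follow(TELEMETRY_PATH):
                self.wfile.write(sse_frame(line))
                self.wfile.flush()
        except (BrokenPipeError, ConnectionResetError):
            pass  # the browser disconnected


def serve(port=8000):
    """Serve /events on localhost; the port is an arbitrary choice."""
    ThreadingHTTPServer(("127.0.0.1", port), EventsHandler).serve_forever()
```

Replaying the file from the beginning means a browser that connects mid-run still receives the full history before live events. A threading server is used so one long-lived stream does not block other requests.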

4. Run Summary Report

After synthesis completes, write logs/latest/run_summary.json with:

{
  "run_id": "eigenda-20260408-abc123",
  "source_repo": "https://github.com/Layr-Labs/eigenda",
  "source_commit": "61019b4",
  "total_wall_ms": 312500,
  "phases": {
    "discovery": { "wall_ms": 1200 },
    "library_depth_0": { "wall_ms": 68400, "agents": 9 },
    "library_depth_n": { "wall_ms": 112000, "agents": 15 },
    "application_analysis": { "wall_ms": 98000, "agents": 10 },
    "synthesis": { "wall_ms": 32900, "agents": 1 }
  },
  "totals": {
    "agents_spawned": 35,
    "total_tokens": 1842000,
    "total_tool_uses": 847,
    "analyses_written": 45
  }
}

This enables cross-run benchmarking and regression detection.
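As an illustration of regression detection, two summaries in the shape above can be compared phase by phase. This is a sketch only: phase_regressions is a hypothetical helper and the 20% tolerance is an arbitrary default, not a value proposed in this issue.

```python
def phase_regressions(baseline, current, tolerance=0.20):
    """Return {phase: (old_ms, new_ms)} for phases slower than baseline by more than tolerance."""
    flagged = {}
    for name, old in baseline["phases"].items():
        new = current["phases"].get(name)
        if new is None:
            continue  # phase missing from the newer run; nothing to compare
        if new["wall_ms"] > old["wall_ms"] * (1 + tolerance):
            flagged[name] = (old["wall_ms"], new["wall_ms"])
    return flagged
```

Both arguments would typically come from `json.load()` on two run_summary.json files, for example a stored full cold run and the latest run.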


Acceptance Criteria

  • telemetry.jsonl is written automatically on every run with span_start / span_end / phase events
  • observability/session_profiler.html shows a live Gantt view that updates without page refresh during an active run
  • serve_logs.py exposes a /events SSE endpoint
  • logs/latest/run_summary.json is written at the end of every run
  • Dashboard correctly reflects the EigenDA run profile (9 parallel depth-0 agents, 15 sequential depth-N agents, 10 parallel app agents, 1 synthesizer)

Context

The observability/ directory already has scaffolding (session_profiler.html, live_monitor.sh, serve_logs.py) — this issue is about making that scaffolding production-quality and always-on rather than opt-in.
