
Conversation


@theomonnom theomonnom commented Jan 17, 2026

Summary by CodeRabbit

  • New Features
    • Introduced an agent evaluation framework with specialized judges to assess agent performance across task completion, accuracy, tool usage, and safety criteria.
    • Added configuration tracking that records changes to agent instructions and tools within conversation history for improved auditability and debugging.
    • Expanded public API exports for better accessibility of evaluation and configuration management capabilities.


@chenghao-mou chenghao-mou requested a review from a team January 17, 2026 21:12
coderabbitai bot commented Jan 17, 2026

📝 Walkthrough

Walkthrough

This PR introduces a structured evaluation framework for agent conversations through a new Judge class with specialized variants (task completion, accuracy, tool use, safety) for criteria-based assessment. It also enhances configuration tracking by capturing agent updates (instructions, tools) in the chat context via AgentConfigUpdate.

Changes

Cohort / File(s) Summary
Evaluation Framework
livekit-agents/livekit/agents/evals/judge.py
New module with Judge class for evaluating agent conversations via LLM-based assessment. Includes JudgmentResult dataclass and four factory functions (task_completion_judge, accuracy_judge, tool_use_judge, safety_judge) that return specialized judges with domain-specific evaluation instructions. Judge builds composite prompts from instructions, chat context, and optional references, then streams LLM output to determine Pass/Fail verdict via response parsing.
Evals Package Export
livekit-agents/livekit/agents/evals/__init__.py
Exports public API: Judge, JudgmentResult, and all four judge factory functions from the judge module via explicit __all__.
Agent Configuration Tracking
livekit-agents/livekit/agents/voice/agent_activity.py
Enhanced to track runtime configuration changes. Computes tool diffs (added/removed) when tools are updated via get_fnc_tool_names(). Records AgentConfigUpdate items in chat context on instruction updates and tool changes, capturing full tool definitions. Creates initial AgentConfigUpdate at agent startup with instructions and tools.
Chat Context Extension
livekit-agents/livekit/agents/llm/chat_context.py
Adds AgentConfigUpdate model with fields for configuration changes (id, type, instructions, tools_added, tools_removed, created_at, _tools). Extends ChatItem union discriminator to include the new type.
Public API Exports
livekit-agents/livekit/agents/__init__.py, livekit-agents/livekit/agents/llm/__init__.py
Exposes AgentConfigUpdate and AgentHandoff through the top-level agents package and llm subpackage via imports and __all__ declarations.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Judge
    participant LLM

    Client->>Judge: evaluate(chat_ctx, reference)
    Judge->>Judge: format_chat_ctx(chat_ctx)
    Judge->>Judge: build_composite_prompt(instructions, formatted_ctx, reference)
    Judge->>LLM: stream(prompt)
    LLM-->>Judge: response chunks
    Judge->>Judge: detect PASS/FAIL in response
    Judge->>Judge: create JudgmentResult(passed, reasoning)
    Judge-->>Client: JudgmentResult

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A judge hops forth with wisdom bright,
To evaluate with LLM insight—
Tools tracked, configs stored with care,
Agent conversations judged fair and square! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 40.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title accurately summarizes the two main additions in the changeset: AgentConfigUpdate functionality and an initial judges framework for agent evaluation.



@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/llm/chat_context.py`:
- Around line 213-229: AgentConfigUpdate is missing the agent_id field, so
callers that set agent_id and the formatter _format_chat_ctx that reads it can
fail; add an agent_id: str | None = None (or appropriate type) to the
AgentConfigUpdate model declaration so the value is preserved and safe to
access, and ensure the new field is included before PrivateAttr/_tools in the
AgentConfigUpdate class so ChatItem (which unions AgentConfigUpdate) will carry
agent_id correctly.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

314-323: agent_id field missing in AgentConfigUpdate (covered in model)

This block passes agent_id; ensure the model defines it so the value isn’t lost.

livekit-agents/livekit/agents/evals/judge.py (1)

16-40: Guard agent_id access for config updates

agent_id is referenced here; ensure the model defines it (see AgentConfigUpdate).

🧹 Nitpick comments (3)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

331-349: Stabilize tool diff ordering for deterministic updates

Converting the set differences with `list(...)` produces nondeterministic ordering; `sorted(...)` keeps logs/tests stable.

🔧 Proposed fix
-        tools_added = list(new_tool_names - old_tool_names) or None
-        tools_removed = list(old_tool_names - new_tool_names) or None
+        tools_added = sorted(new_tool_names - old_tool_names) or None
+        tools_removed = sorted(old_tool_names - new_tool_names) or None
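To illustrate the fix, a quick standalone comparison (the tool names are made up for the example):

```python
old_tool_names = {"get_weather", "transfer_call", "end_call"}
new_tool_names = {"get_weather", "lookup_order", "end_call"}

# A set difference has no guaranteed iteration order; sorted() pins it down
tools_added = sorted(new_tool_names - old_tool_names) or None
tools_removed = sorted(old_tool_names - new_tool_names) or None
print(tools_added, tools_removed)  # → ['lookup_order'] ['transfer_call']

# When nothing changed, both collapse to None, matching the `or None` idiom in the diff
assert (sorted(old_tool_names - old_tool_names) or None) is None
```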
livekit-agents/livekit/agents/evals/judge.py (2)

8-13: Add a Google‑style class docstring for JudgmentResult

🔧 Proposed fix
 @dataclass
 class JudgmentResult:
+    """Result of a judge evaluation.
+
+    Attributes:
+        passed: Whether the evaluation passed.
+        reasoning: Model reasoning for the judgment.
+    """
     passed: bool
     """Whether the evaluation passed."""
     reasoning: str
     """Chain-of-thought reasoning for the judgment."""
As per coding guidelines, please add Google-style docstrings.

43-87: Make PASS/FAIL parsing deterministic

`rfind` can be tripped by "PASS" or "FAIL" appearing mid-sentence in the reasoning. Require a final verdict line and parse only that.

🔧 Proposed fix
         prompt_parts.extend(
             [
                 "",
                 "Does the conversation meet the criteria? Don't overthink it.",
-                "Explain your reasoning step by step, then answer Pass or Fail.",
+                "Provide a brief justification, then output a final line with exactly PASS or FAIL.",
             ]
         )
@@
-        response = "".join(response_chunks)
-
-        response_upper = response.upper()
-        pass_pos = response_upper.rfind("PASS")
-        fail_pos = response_upper.rfind("FAIL")
-        passed = pass_pos > fail_pos if pass_pos != -1 else False
+        response = "".join(response_chunks).strip()
+        last_line = response.splitlines()[-1].strip().upper() if response else ""
+        passed = last_line.startswith("PASS")
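The failure mode is easy to demonstrate standalone; the response text below is a made-up example of a verdict followed by a free-form tail:

```python
response = (
    "The refund was never issued, so the criteria are not met. Therefore: FAIL.\n"
    "Under a looser rubric this might pass, but the stated criteria are strict."
)

# rfind-based parsing: the last occurrence of either substring wins, even mid-sentence,
# so the trailing "might pass" flips the verdict to True — the wrong answer here
upper = response.upper()
pass_pos, fail_pos = upper.rfind("PASS"), upper.rfind("FAIL")
rfind_verdict = pass_pos > fail_pos if pass_pos != -1 else False

# last-line parsing: only a final line starting with PASS counts, so this stays False
last_line = response.strip().splitlines()[-1].strip().upper()
line_verdict = last_line.startswith("PASS")

print(rfind_verdict, line_verdict)  # → True False
```

The proposed prompt change ("output a final line with exactly PASS or FAIL") is what makes the last-line parse reliable: without it, the model may not end on a verdict line at all.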
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 853bc41 and 9af2755.

📒 Files selected for processing (6)
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/chat_context.py
🧬 Code graph analysis (6)
livekit-agents/livekit/agents/evals/__init__.py (1)
livekit-agents/livekit/agents/evals/judge.py (6)
  • Judge (43-87)
  • JudgmentResult (9-13)
  • accuracy_judge (112-128)
  • safety_judge (151-168)
  • task_completion_judge (90-109)
  • tool_use_judge (131-148)
livekit-agents/livekit/agents/llm/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (2)
  • AgentConfigUpdate (213-224)
  • AgentHandoff (205-210)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
livekit-agents/livekit/agents/llm/tool_context.py (4)
  • get_fnc_tool_names (283-292)
  • tools (44-46)
  • ToolContext (295-418)
  • flatten (320-325)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/evals/judge.py (4)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • llm (2815-2819)
livekit-agents/livekit/agents/llm/chat_context.py (7)
  • ChatContext (232-670)
  • items (241-242)
  • items (245-246)
  • text_content (164-173)
  • copy (297-354)
  • copy (690-691)
  • add_message (248-281)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • output (394-395)
livekit-agents/livekit/agents/voice/agent.py (1)
  • instructions (99-104)
livekit-agents/livekit/agents/llm/chat_context.py (2)
livekit-agents/livekit/agents/utils/misc.py (1)
  • shortuuid (21-22)
livekit-agents/livekit/agents/llm/tool_context.py (1)
  • Tool (31-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: livekit-plugins-cartesia
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: livekit-plugins-inworld
  • GitHub Check: livekit-plugins-openai
  • GitHub Check: livekit-plugins-elevenlabs
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (11)
livekit-agents/livekit/agents/llm/__init__.py (2)

1-15: LGTM: AgentConfigUpdate re-exported from llm


55-69: LGTM: __all__ updated to include AgentConfigUpdate

livekit-agents/livekit/agents/__init__.py (2)

39-49: LGTM: top-level imports updated


117-156: LGTM: __all__ export list updated

livekit-agents/livekit/agents/voice/agent_activity.py (2)

18-25: LGTM: tool diff helpers wired in


603-611: LGTM: initial config snapshot recorded

livekit-agents/livekit/agents/evals/judge.py (4)

90-109: LGTM: task completion judge instructions are clear


112-128: LGTM: accuracy judge instructions look solid


131-148: LGTM: tool-use judge instructions look solid


151-167: LGTM: safety judge instructions look solid

livekit-agents/livekit/agents/evals/__init__.py (1)

1-17: LGTM: judge APIs re-exported


Comment on lines +213 to 229

+class AgentConfigUpdate(BaseModel):
+    id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
+    type: Literal["agent_config_update"] = Field(default="agent_config_update")
+
+    instructions: str | None = None
+    tools_added: list[str] | None = None
+    tools_removed: list[str] | None = None
+
+    created_at: float = Field(default_factory=time.time)
+
+    _tools: list[Tool] = PrivateAttr(default_factory=list)
+    """Full tool definitions (in-memory only, not serialized)."""
+
 ChatItem = Annotated[
-    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff], Field(discriminator="type")
+    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff, AgentConfigUpdate],
+    Field(discriminator="type"),
 ]
⚠️ Potential issue | 🟠 Major

Add agent_id to AgentConfigUpdate to match callers and formatter

Call sites set agent_id and _format_chat_ctx reads it, but the model doesn’t declare it. Add the field so the value is preserved and attribute access is safe.

🔧 Proposed fix
 class AgentConfigUpdate(BaseModel):
     id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
     type: Literal["agent_config_update"] = Field(default="agent_config_update")
 
+    agent_id: str | None = None
     instructions: str | None = None
     tools_added: list[str] | None = None
     tools_removed: list[str] | None = None
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/llm/chat_context.py` around lines 213 - 229,
AgentConfigUpdate is missing the agent_id field, so callers that set agent_id
and the formatter _format_chat_ctx that reads it can fail; add an agent_id: str
| None = None (or appropriate type) to the AgentConfigUpdate model declaration
so the value is preserved and safe to access, and ensure the new field is
included before PrivateAttr/_tools in the AgentConfigUpdate class so ChatItem
(which unions AgentConfigUpdate) will carry agent_id correctly.
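The failure mode is plain Python attribute access, not anything pydantic-specific. A stripped-down sketch (dataclasses instead of pydantic; the formatter name and message format are illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WithoutField:
    # Model as reviewed: agent_id was never declared
    instructions: Optional[str] = None


@dataclass
class WithField:
    instructions: Optional[str] = None
    agent_id: Optional[str] = None  # the field the review asks to declare, defaulting to None


# A formatter that reads agent_id, analogous to the _format_chat_ctx mentioned above
def format_update(item) -> str:
    return f"[config update by {item.agent_id or 'unknown'}]"


print(format_update(WithField(agent_id="agent-1")))  # → [config update by agent-1]
try:
    format_update(WithoutField())  # attribute was never declared
except AttributeError as e:
    print("missing field:", e)
```

Declaring `agent_id: str | None = None` on the model makes both construction with the keyword and later attribute reads safe, which is exactly what the proposed fix does.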
