
Conversation


@theomonnom theomonnom commented Jan 17, 2026

Summary by CodeRabbit

  • New Features
    • Introduced an agent evaluation framework with specialized judges to assess agent performance across task completion, accuracy, tool usage, and safety criteria.
    • Added configuration tracking that records changes to agent instructions and tools within conversation history for improved auditability and debugging.
    • Expanded public API exports for better accessibility of evaluation and configuration management capabilities.


@chenghao-mou chenghao-mou requested a review from a team January 17, 2026 21:12
coderabbitai bot commented Jan 17, 2026

📝 Walkthrough

Walkthrough

This PR introduces a structured evaluation framework for agent conversations through a new Judge class with specialized variants (task completion, accuracy, tool use, safety) for criteria-based assessment. It also enhances configuration tracking by capturing agent updates (instructions, tools) in the chat context via AgentConfigUpdate.

Changes

Cohort / File(s) Summary
Evaluation Framework
livekit-agents/livekit/agents/evals/judge.py
New module with Judge class for evaluating agent conversations via LLM-based assessment. Includes JudgmentResult dataclass and four factory functions (task_completion_judge, accuracy_judge, tool_use_judge, safety_judge) that return specialized judges with domain-specific evaluation instructions. Judge builds composite prompts from instructions, chat context, and optional references, then streams LLM output to determine Pass/Fail verdict via response parsing.
Evals Package Export
livekit-agents/livekit/agents/evals/__init__.py
Exports public API: Judge, JudgmentResult, and all four judge factory functions from the judge module via explicit __all__.
Agent Configuration Tracking
livekit-agents/livekit/agents/voice/agent_activity.py
Enhanced to track runtime configuration changes. Computes tool diffs (added/removed) when tools are updated via get_fnc_tool_names(). Records AgentConfigUpdate items in chat context on instruction updates and tool changes, capturing full tool definitions. Creates initial AgentConfigUpdate at agent startup with instructions and tools.
Chat Context Extension
livekit-agents/livekit/agents/llm/chat_context.py
Adds AgentConfigUpdate model with fields for configuration changes (id, type, instructions, tools_added, tools_removed, created_at, _tools). Extends ChatItem union discriminator to include the new type.
Public API Exports
livekit-agents/livekit/agents/__init__.py, livekit-agents/livekit/agents/llm/__init__.py
Exposes AgentConfigUpdate and AgentHandoff through the top-level agents package and llm subpackage via imports and __all__ declarations.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Judge
    participant LLM

    Client->>Judge: evaluate(chat_ctx, reference)
    Judge->>Judge: format_chat_ctx(chat_ctx)
    Judge->>Judge: build_composite_prompt(instructions, formatted_ctx, reference)
    Judge->>LLM: stream(prompt)
    LLM-->>Judge: response chunks
    Judge->>Judge: detect PASS/FAIL in response
    Judge->>Judge: create JudgmentResult(passed, reasoning)
    Judge-->>Client: JudgmentResult

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A judge hops forth with wisdom bright,
To evaluate with LLM insight—
Tools tracked, configs stored with care,
Agent conversations judged fair and square! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 40.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title accurately summarizes the two main additions in the changeset: AgentConfigUpdate functionality and an initial judges framework for agent evaluation.



@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/llm/chat_context.py`:
- Around line 213-229: AgentConfigUpdate is missing the agent_id field, so
callers that set agent_id and the formatter _format_chat_ctx that reads it can
fail; add an agent_id: str | None = None (or appropriate type) to the
AgentConfigUpdate model declaration so the value is preserved and safe to
access, and ensure the new field is included before PrivateAttr/_tools in the
AgentConfigUpdate class so ChatItem (which unions AgentConfigUpdate) will carry
agent_id correctly.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

314-323: agent_id field missing in AgentConfigUpdate (covered in model)

This block passes agent_id; ensure the model defines it so the value isn’t lost.

livekit-agents/livekit/agents/evals/judge.py (1)

16-40: Guard agent_id access for config updates

agent_id is referenced here; ensure the model defines it (see AgentConfigUpdate).

🧹 Nitpick comments (3)
livekit-agents/livekit/agents/voice/agent_activity.py (1)

331-349: Stabilize tool diff ordering for deterministic updates

Converting the set differences with `list(...)` produces nondeterministic ordering; `sorted(...)` keeps logs/tests stable.

🔧 Proposed fix
-        tools_added = list(new_tool_names - old_tool_names) or None
-        tools_removed = list(old_tool_names - new_tool_names) or None
+        tools_added = sorted(new_tool_names - old_tool_names) or None
+        tools_removed = sorted(old_tool_names - new_tool_names) or None
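To illustrate the fix, a quick standalone comparison (the tool names are made up for the example):

```python
old_tool_names = {"get_weather", "transfer_call", "end_call"}
new_tool_names = {"get_weather", "lookup_order", "end_call"}

# A set difference has no guaranteed iteration order; sorted() pins it down
tools_added = sorted(new_tool_names - old_tool_names) or None
tools_removed = sorted(old_tool_names - new_tool_names) or None
print(tools_added, tools_removed)  # → ['lookup_order'] ['transfer_call']

# When nothing changed, both collapse to None, matching the `or None` idiom in the diff
assert (sorted(old_tool_names - old_tool_names) or None) is None
```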
livekit-agents/livekit/agents/evals/judge.py (2)

8-13: Add a Google‑style class docstring for JudgmentResult

🔧 Proposed fix
 @dataclass
 class JudgmentResult:
+    """Result of a judge evaluation.
+
+    Attributes:
+        passed: Whether the evaluation passed.
+        reasoning: Model reasoning for the judgment.
+    """
     passed: bool
     """Whether the evaluation passed."""
     reasoning: str
     """Chain-of-thought reasoning for the judgment."""
As per coding guidelines, please add Google-style docstrings.

43-87: Make PASS/FAIL parsing deterministic

`rfind` can be tripped by "PASS" or "FAIL" appearing mid-sentence in the reasoning. Require a final verdict line and parse only that.

🔧 Proposed fix
         prompt_parts.extend(
             [
                 "",
                 "Does the conversation meet the criteria? Don't overthink it.",
-                "Explain your reasoning step by step, then answer Pass or Fail.",
+                "Provide a brief justification, then output a final line with exactly PASS or FAIL.",
             ]
         )
@@
-        response = "".join(response_chunks)
-
-        response_upper = response.upper()
-        pass_pos = response_upper.rfind("PASS")
-        fail_pos = response_upper.rfind("FAIL")
-        passed = pass_pos > fail_pos if pass_pos != -1 else False
+        response = "".join(response_chunks).strip()
+        last_line = response.splitlines()[-1].strip().upper() if response else ""
+        passed = last_line.startswith("PASS")
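The failure mode is easy to demonstrate standalone; the response text below is a made-up example of a verdict followed by a free-form tail:

```python
response = (
    "The refund was never issued, so the criteria are not met. Therefore: FAIL.\n"
    "Under a looser rubric this might pass, but the stated criteria are strict."
)

# rfind-based parsing: the last occurrence of either substring wins, even mid-sentence,
# so the trailing "might pass" flips the verdict to True — the wrong answer here
upper = response.upper()
pass_pos, fail_pos = upper.rfind("PASS"), upper.rfind("FAIL")
rfind_verdict = pass_pos > fail_pos if pass_pos != -1 else False

# last-line parsing: only a final line starting with PASS counts, so this stays False
last_line = response.strip().splitlines()[-1].strip().upper()
line_verdict = last_line.startswith("PASS")

print(rfind_verdict, line_verdict)  # → True False
```

The proposed prompt change ("output a final line with exactly PASS or FAIL") is what makes the last-line parse reliable: without it, the model may not end on a verdict line at all.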
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 853bc41 and 9af2755.

📒 Files selected for processing (6)
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/llm/chat_context.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/evals/__init__.py
  • livekit-agents/livekit/agents/llm/__init__.py
  • livekit-agents/livekit/agents/__init__.py
  • livekit-agents/livekit/agents/voice/agent_activity.py
  • livekit-agents/livekit/agents/evals/judge.py
  • livekit-agents/livekit/agents/llm/chat_context.py
🧬 Code graph analysis (6)
livekit-agents/livekit/agents/evals/__init__.py (1)
livekit-agents/livekit/agents/evals/judge.py (6)
  • Judge (43-87)
  • JudgmentResult (9-13)
  • accuracy_judge (112-128)
  • safety_judge (151-168)
  • task_completion_judge (90-109)
  • tool_use_judge (131-148)
livekit-agents/livekit/agents/llm/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/__init__.py (1)
livekit-agents/livekit/agents/llm/chat_context.py (2)
  • AgentConfigUpdate (213-224)
  • AgentHandoff (205-210)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
livekit-agents/livekit/agents/llm/tool_context.py (4)
  • get_fnc_tool_names (283-292)
  • tools (44-46)
  • ToolContext (295-418)
  • flatten (320-325)
livekit-agents/livekit/agents/llm/chat_context.py (1)
  • AgentConfigUpdate (213-224)
livekit-agents/livekit/agents/evals/judge.py (4)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • llm (2815-2819)
livekit-agents/livekit/agents/llm/chat_context.py (7)
  • ChatContext (232-670)
  • items (241-242)
  • items (245-246)
  • text_content (164-173)
  • copy (297-354)
  • copy (690-691)
  • add_message (248-281)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • output (394-395)
livekit-agents/livekit/agents/voice/agent.py (1)
  • instructions (99-104)
livekit-agents/livekit/agents/llm/chat_context.py (2)
livekit-agents/livekit/agents/utils/misc.py (1)
  • shortuuid (21-22)
livekit-agents/livekit/agents/llm/tool_context.py (1)
  • Tool (31-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: livekit-plugins-cartesia
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: livekit-plugins-inworld
  • GitHub Check: livekit-plugins-openai
  • GitHub Check: livekit-plugins-elevenlabs
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (11)
livekit-agents/livekit/agents/llm/__init__.py (2)

1-15: LGTM: AgentConfigUpdate re-exported from llm


55-69: LGTM: __all__ updated to include AgentConfigUpdate

livekit-agents/livekit/agents/__init__.py (2)

39-49: LGTM: top-level imports updated


117-156: LGTM: __all__ export list updated

livekit-agents/livekit/agents/voice/agent_activity.py (2)

18-25: LGTM: tool diff helpers wired in


603-611: LGTM: initial config snapshot recorded

livekit-agents/livekit/agents/evals/judge.py (4)

90-109: LGTM: task completion judge instructions are clear


112-128: LGTM: accuracy judge instructions look solid


131-148: LGTM: tool-use judge instructions look solid


151-167: LGTM: safety judge instructions look solid

livekit-agents/livekit/agents/evals/__init__.py (1)

1-17: LGTM: judge APIs re-exported


Comment on lines +213 to 229

+class AgentConfigUpdate(BaseModel):
+    id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
+    type: Literal["agent_config_update"] = Field(default="agent_config_update")
+
+    instructions: str | None = None
+    tools_added: list[str] | None = None
+    tools_removed: list[str] | None = None
+
+    created_at: float = Field(default_factory=time.time)
+
+    _tools: list[Tool] = PrivateAttr(default_factory=list)
+    """Full tool definitions (in-memory only, not serialized)."""
+
 ChatItem = Annotated[
-    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff], Field(discriminator="type")
+    Union[ChatMessage, FunctionCall, FunctionCallOutput, AgentHandoff, AgentConfigUpdate],
+    Field(discriminator="type"),
 ]
⚠️ Potential issue | 🟠 Major

Add agent_id to AgentConfigUpdate to match callers and formatter

Call sites set agent_id and _format_chat_ctx reads it, but the model doesn’t declare it. Add the field so the value is preserved and attribute access is safe.

🔧 Proposed fix
 class AgentConfigUpdate(BaseModel):
     id: str = Field(default_factory=lambda: utils.shortuuid("item_"))
     type: Literal["agent_config_update"] = Field(default="agent_config_update")
 
+    agent_id: str | None = None
     instructions: str | None = None
     tools_added: list[str] | None = None
     tools_removed: list[str] | None = None
🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/llm/chat_context.py` around lines 213 - 229,
AgentConfigUpdate is missing the agent_id field, so callers that set agent_id
and the formatter _format_chat_ctx that reads it can fail; add an agent_id: str
| None = None (or appropriate type) to the AgentConfigUpdate model declaration
so the value is preserved and safe to access, and ensure the new field is
included before PrivateAttr/_tools in the AgentConfigUpdate class so ChatItem
(which unions AgentConfigUpdate) will carry agent_id correctly.
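The failure mode is plain Python attribute access, not anything pydantic-specific. A stripped-down sketch (dataclasses instead of pydantic; the formatter name and message format are illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WithoutField:
    # Model as reviewed: agent_id was never declared
    instructions: Optional[str] = None


@dataclass
class WithField:
    instructions: Optional[str] = None
    agent_id: Optional[str] = None  # the field the review asks to declare, defaulting to None


# A formatter that reads agent_id, analogous to the _format_chat_ctx mentioned above
def format_update(item) -> str:
    return f"[config update by {item.agent_id or 'unknown'}]"


print(format_update(WithField(agent_id="agent-1")))  # → [config update by agent-1]
try:
    format_update(WithoutField())  # attribute was never declared
except AttributeError as e:
    print("missing field:", e)
```

Declaring `agent_id: str | None = None` on the model makes both construction with the keyword and later attribute reads safe, which is exactly what the proposed fix does.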
