Add CrewAI + EvalView integration — regression testing for crews#352
Open
hidai25 wants to merge 1 commit intocrewAIInc:mainfrom
Open
Add CrewAI + EvalView integration — regression testing for crews#352hidai25 wants to merge 1 commit intocrewAIInc:mainfrom
hidai25 wants to merge 1 commit intocrewAIInc:mainfrom
Conversation
Self-contained example showing how to use EvalView to regression-test CrewAI crews. Uses the native adapter (crew.kickoff() in-process, tool call capture via event bus) — no HTTP server needed. Includes: example crew, test YAMLs, CI config, safety test with forbidden_tools, and watch mode for prompt iteration.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a self-contained integration example showing how to use EvalView for regression testing CrewAI crews.
EvalView complements
crewai test— whilecrewai testruns crews N times and shows scores, EvalView snapshots the full execution trace (which agent called which tool, with what parameters, in what order) and diffs it against a golden baseline on every change.This addresses the use case described in issue #4174 — deterministic CI regression checks for tool-using agents.
What's included
integrations/CrewAI-EvalView/README.md— setup guide, test examples, CI config, watch modecrew.py— example research + writing crew with toolsmain.py— runnable entry point using EvalView's Python APItests/research-report.yaml— tool-calling regression testtests/safety-check.yaml— forbidden tool safety testrequirements.txt— crewai + evalview dependencies.env.exampleHow it works
The native adapter calls
crew.kickoff()directly (no HTTP server), captures tool calls via CrewAI's event bus (ToolUsageFinishedEvent), and returns structured traces for diffing.Test plan
python main.pyevalview snapshotcaptures baselineevalview checkdetects regressions after changes