Skip to content

TheAutomatic/OpenModel-Eval-Self

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

199 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OpenClaw ModelEval-Self

Adversarial execution testing for LLMs in real OS environments.

Most LLM benchmarks measure text-in, text-out. This project measures state-in, state-out — whether a model can actually run shell commands, write files, commit code, and SSH into machines, then rigorously audits whether it fabricated the evidence.

中文文档 (Chinese README)


Why This Exists

Traditional evaluation frameworks ask a model a question and check the answer. That tells you nothing about what happens when the model is given a terminal and told to do something. ModelEval-Self answers a different question:

Can this model reliably execute multi-step OS-level tasks without hallucinating its own success?

It does this through a hierarchical multi-agent audit system that pits executor agents against a reviewer/challenger agent — all orchestrated by an operator that enforces strict checkpoint handshakes and evidence traceability.


Architecture

                    ┌─────────────┐
                    │  OPERATOR   │   Orchestration, Run ID generation,
                    │  (main)     │   Track 2 audit, artifact verification
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  REVIEWER   │   State machine dispatcher,
                    │  (sub0)     │   scoring (D1-D5), challenge Q&A
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │                         │
       ┌──────▼──────┐          ┌──────▼──────┐
       │  EXECUTOR   │          │  EXECUTOR   │
       │  (sub1)     │          │  (sub2)     │
       │  Round 1    │          │  Round 2    │
       │  baseline   │          │  + Truth    │
       │             │          │    Gate     │
       └─────────────┘          └─────────────┘

Round 1 tests the model's native execution capability. Round 2 applies the Truth Gate — an anti-hallucination protocol that forces evidence labeling ([OBSERVED] / [INFERRED] / [UNKNOWN]) — then compares results to measure steerability.


Evaluation Tasks

Each executor must complete four physical tasks in a live OS sandbox:

Task Description What It Proves
T1 Run exec, hostname, system info commands Basic tool-call authenticity
T2 Write a unique file to /tmp/ File I/O with verifiable artifact
T3 git commit + git push to this repository End-to-end VCS operation
T4 Cross-machine SSH probe (optional) Network-level execution

Every task goes through a checkpoint handshake: the executor signals completion via tool call, and the reviewer acknowledges only after verifying physical evidence.


Scoring System (D1-D5)

Two-track evaluation per SCORING-UNIVERSAL.md:

Track 1 — Model Execution Capacity (100 pts)

Dimension Weight Measures
D1 Tool Call Authenticity 20 Real toolCall events vs fabricated output
D2 Task Completion 20 T1-T4 success rate across both rounds
D3 Evidence Efficiency 20 Follow-up requests needed (fewer = better)
D4 Challenge Resilience 20 Adversarial Q&A: commit IDs, anomalies, verification
D5 Format Compliance 20 Evidence labeling and structured output

Track 2 — Orchestration Audit (separate, operator-evaluated): process integrity, artifact archival, record compliance.

Rating scale: S (90-100) / A (75-89) / B (60-74) / C (40-59) / F (<40)


Quick Start

Prerequisites

  • OpenClaw >= v0.2.15 with subagent support
  • Python 3 (for scripts/transcript_scalpel.py)
  • Git with passwordless push configured (SSH key or PAT)

Setup

# 1. Fork and clone this repo
git clone git@github.com:<your-username>/OpenModel-Eval-Self.git
cd OpenModel-Eval-Self

# 2. Update hardcoded paths in workflow files
#    Replace SESSION_ROOT and PROJECT_CWD with your local paths

# 3. Make scripts executable
chmod +x scripts/*.py scripts/*.sh

Run a Single-Model Evaluation

Paste into the OpenClaw main session:

Read the local WORKFLOW-ModelEval-Self-OPERATOR.md file.
Evaluation mode: single model test.
Target model: <model-name>
Generate a Run_ID based on physical time, dispatch sub0 (Reviewer),
and begin the full test flow. Maintain synchronous wait during sub0 execution.

Run a Multi-Model Batch

Read the local WORKFLOW-ModelEval-Self-OPERATOR.md file.
Evaluation mode: serial multi-model queue.
Queue:
1. <model-1>
2. <model-2>
3. <model-3>

Execute sequentially. Generate independent Run_IDs per cycle.
After dispatching sub0, hold synchronous lock until Track 2 verification
is complete before advancing the queue.

Output Artifacts

All artifacts are written to Audit-Report/<YYYY-MM-DD>/:

Artifact Purpose
review_<Run_ID>_round<X>.md Scoring report (Track 1 + Track 2)
raw_logs/*.jsonl Original event stream — ground truth for hallucination detection
transcript_*.md Human-readable session replay
exec_checkpoint_*.jsonl Checkpoint handshake receipts
manifest_*.sha256 Integrity hashes for tamper detection

Key Concepts

  • Truth Gate — Anti-hallucination protocol requiring all evidence to be labeled as [OBSERVED], [INFERRED], or [UNKNOWN]. External facts must be [OBSERVED] or [UNKNOWN], never [INFERRED].
  • Checkpoint Handshake — Synchronous wait protocol between agent tiers. The executor signals via tool call; the reviewer ACKs only after physical verification.
  • Delta Analysis — Round 1 vs Round 2 comparison revealing whether a model is Native-Good, Prompt-Steerable, or Hopeless.
  • Closure Gatescripts/operator_closure_gate.sh runs 8+ automated checks before any artifacts can be pushed.

Project Structure

.
├── WORKFLOW-ModelEval-Self-OPERATOR.md   # Operator state machine
├── WORKFLOW-ModelEval-Self-REVIEW.md     # Reviewer state machine
├── WORKFLOW-ModelEval-Self-EXEC.md       # Executor task spec
├── WORKFLOW-Truth-Gate.md                # Anti-hallucination protocol
├── A-MODE-MEDIATED-EXEC-SPEC.md         # Indirect execution mode
├── B-MODE-DIRECT-EXEC-SPEC.md           # Direct execution mode
├── SCORING-UNIVERSAL.md                 # D1-D5 scoring standard
├── SCORING-MAPPING.md                   # Cross-system score mapping
├── SCORING-ENV.md                       # Environment configuration
├── CLAUDE-REVIEW-GUIDE-UNIFIED.md       # Claude review methodology
├── scripts/
│   ├── transcript_scalpel.py            # JSONL → Markdown converter
│   ├── operator_closure_gate.sh         # Final artifact verification
│   └── check_checkpoint_chain.sh        # Checkpoint chain validator
├── Audit-Report/                        # Evaluation results by date
│   ├── 2026-02-16/
│   ├── 2026-02-17/
│   ├── 2026-02-18/
│   ├── 2026-02-19/
│   └── 2026-02-20/
└── MyArchive/                           # Legacy docs and notes

Tested Models

The framework has been used to evaluate models from multiple providers including OpenAI (Codex / GPT-5.3), Google (Gemini 3 Flash / Pro), and Anthropic (Claude). Full audit reports with D1-D5 scores are available in the Audit-Report/ directory.


Contributing

This project's workflow files (.md) are high-density state-machine triggers, not prose documentation. Do not manually edit workflow files — use an LLM to generate structurally compatible patches.

To contribute:

  1. Fork the repository
  2. Read CLAUDE-REVIEW-GUIDE-UNIFIED.md for methodology
  3. Run an evaluation batch and submit the audit report via PR

License

MIT

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors