PromptLens

Find the dead weight in your LLM prompts — and cut it safely.

PromptLens uses Shapley-value attribution to score every phrase in your system prompts by how much it actually changes model output. Phrases that score near zero are candidates for removal. Phrases that score high are load-bearing — don't touch them.

The CLI agent finds the prompts in your codebase automatically, audits them, and can compress them, validating empirically that model behaviour stays within a divergence threshold.

Live web tool (paste-and-go): https://rohityalavarthy.github.io/PromptLens

Quickstart

# 1. Install
pip install -e sdk/python
pip install -e agent

# 2. Set your API key (Together AI — free $1 credit at together.ai)
export TOGETHER_API_KEY=your_key_here

# 3. Check a single prompt file
promptlens check --file src/prompts/system.txt

# 4. Audit your whole repo
promptlens audit --repo .

# 5. Compress a bloated prompt
promptlens compress --file src/prompts/system.txt

Requires Python 3.11+.

CLI Agent

`check` — fast pre-commit scan

Runs a quick saliency analysis on a single prompt file and warns you if more than 20% looks redundant. Fast enough to run before every commit.

promptlens check --file src/prompts/system.txt

Example output:

📋 PromptLens Analysis — src/prompts/system.txt
────────────────────────────────────────────────────
  Phrases analysed : 5
  Token estimate   : 94
  Test inputs used : 3
  Confidence       : 0.88
  Est. redundancy  : 31%

  PHRASE                                       SCORE  IMPACT
  ──────────────────────────────────────────── ────── ─────────────────────────
  Always respond concisely.                     0.91  ████████████████████
  Use bullet points when listing items.         0.54  ████████████
  Format your answer in Markdown.               0.42  █████████
  Respond in the user's language.               0.18  ████
  You are a helpful assistant.                  0.08  █

  ⚠  Redundancy warning: 31% of tokens score below threshold (0.15)
  Candidates for compression: 2 phrases / 29 tokens

Flags:

Flag	Default	Description
`--file, -f`	required	Path to the prompt file
`--saliency-threshold`	`0.15`	Phrases below this score are flagged
`--semantic`	off	Use embedding similarity (requires Together AI key)
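
As a rough illustration of the pre-commit use case, the sketch below builds the `promptlens check` command for each staged prompt file. It uses only the `--file` and `--saliency-threshold` flags listed above. Everything else is an assumption: the helper names are hypothetical, the rule for what counts as a prompt file (a `.txt` or `.md` file under a `prompts` directory) is invented for the example, and the README does not say whether `check` exits non-zero on a redundancy warning.

```python
import subprocess
from pathlib import Path

# Assumption: prompt files are .txt/.md files living under a "prompts" directory.
PROMPT_SUFFIXES = {".txt", ".md"}


def staged_prompt_files(staged: list[str]) -> list[str]:
    """Keep only the staged paths that look like prompt files."""
    return [
        p for p in staged
        if Path(p).suffix in PROMPT_SUFFIXES and "prompts" in Path(p).parts
    ]


def check_command(path: str, saliency_threshold: float = 0.15) -> list[str]:
    """Build the CLI invocation using only flags documented above."""
    return [
        "promptlens", "check",
        "--file", path,
        "--saliency-threshold", str(saliency_threshold),
    ]


def run_hook() -> int:
    """Run `promptlens check` on every staged prompt file.

    Returns the highest exit code seen. Whether a redundancy warning
    produces a non-zero exit code is not documented, so treat the return
    value as informational.
    """
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    worst = 0
    for path in staged_prompt_files(staged):
        worst = max(worst, subprocess.run(check_command(path)).returncode)
    return worst
```

To wire this into Git, a hook script could call `sys.exit(run_hook())`.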

`audit` — full repo analysis

Walks your entire codebase, finds every LLM API call, extracts the system prompts, and runs a saliency analysis on each one. Produces a ranked summary showing which prompts have the most redundancy.

# Scan the whole repo
promptlens audit --repo .

# Scan a single file
promptlens audit --file src/api/openai_handler.py

# Use your own test inputs for more realistic scoring
promptlens audit --repo . --test-inputs tests/inputs.jsonl

Example output:

🔍 PromptLens Agent — Audit Complete
════════════════════════════════════════════════════
  Prompts found    : 5
  Total tokens     : 2,341
  Candidate tokens : 418 (18% of total)

  FILE                                   TOKENS  REDUNDANCY
  ─────────────────────────────────────── ────── ──────────
  src/api/openai_handler.py                  234      42%
  src/prompts/search.txt                     189      12%
  src/llm/anthropic_wrapper.py               127       8%
  src/agents/summarizer.py                    89       4%

  Run promptlens compress --file <path> to compress a specific prompt.

Flags:

Flag	Default	Description
`--repo, -r`	`.`	Repo root to scan
`--file, -f`	—	Analyse a single file instead of scanning
`--test-inputs`	—	`.jsonl` or `.txt` file of test inputs
`--m-samples`	`20`	Monte Carlo walks for Shapley (higher = more precise)
`--saliency-threshold`	`0.15`	Score below which a phrase is a compression candidate
`--semantic`	off	Use embedding similarity instead of trigram

What it finds:

The agent uses AST analysis to detect LLM API calls across all common frameworks:

Framework	Detected patterns
OpenAI	`openai.chat.completions.create`, `client.chat.completions.create`
Anthropic	`client.messages.create`, `anthropic.messages.create`
LangChain	`ChatOpenAI`, `ChatAnthropic`, `LLMChain`, `PromptTemplate`
AWS Bedrock	`bedrock_runtime.invoke_model`, `BedrockChat`

It resolves prompts from three sources:

- Literal strings: `messages=[{"role": "system", "content": "You are..."}]`
- Variable assignments: `SYSTEM_PROMPT = "..."`, then used in the call
- File reads: `open("prompts/system.txt").read()` or `Path(...).read_text()`

`compress` — analyse, rewrite, validate

The full pipeline: scores your prompt, sends low-saliency phrases to an LLM rewriter with explicit instructions, validates that the compressed prompt produces similar outputs, and retries if it doesn't.

# Basic compression
promptlens compress --file src/prompts/system.txt

# Tighter quality gate
promptlens compress --file src/prompts/system.txt --threshold 0.10

# Semantic similarity + your own test inputs
promptlens compress --file src/prompts/system.txt --semantic --test-inputs tests/inputs.jsonl

# Review output then optionally apply in-place
promptlens compress --file src/prompts/system.txt --apply

How it works, step by step:

1. Score: Runs Shapley analysis to get a 0–1 impact score for every phrase.
2. Label: Phrases below the saliency threshold are labeled [COMPRESS]; the rest are labeled [KEEP]. Each label includes the exact score so the rewriter knows how aggressively to act.
3. Rewrite: Sends the labeled prompt to a Qwen rewriter with a score-based decision table:
   - 0.00–0.05 → REMOVE (if covered elsewhere in the prompt)
   - 0.05–0.10 → REMOVE, MERGE, or REWRITE
   - 0.10–threshold → MERGE or REWRITE only
   - ≥ threshold → PARAPHRASE only (no structural change)
4. Validate: Runs both the original and compressed prompts against all test inputs, measures output divergence, and computes a verdict:
   - PASS: safe to adopt
   - MARGINAL: spot-check outputs before committing
   - REVIEW: differences are significant; read carefully
   - FAIL: do not apply without thorough manual review
5. Retry: If validation fails, the agent reinstates the modified phrase with the highest Shapley score and re-validates. Up to 3 retries.
6. Write: The result is written to <file>.suggested. Your original is never touched unless you pass --apply and confirm.
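
The labeling rule and the retry choice can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's code: the function names are hypothetical, and because the score bands above share their endpoints, the sketch treats each band as half-open (a score of exactly 0.05 falls in the second band).

```python
def allowed_actions(score: float, saliency_threshold: float = 0.15) -> tuple[str, list[str]]:
    """Map a 0-1 saliency score to a label and the rewrites permitted for it."""
    if score >= saliency_threshold:
        return "KEEP", ["PARAPHRASE"]
    if score < 0.05:
        return "COMPRESS", ["REMOVE"]
    if score < 0.10:
        return "COMPRESS", ["REMOVE", "MERGE", "REWRITE"]
    return "COMPRESS", ["MERGE", "REWRITE"]


def phrase_to_reinstate(modified_scores: dict[str, float]) -> str:
    """On a failed validation, pick the modified phrase with the highest score."""
    return max(modified_scores, key=modified_scores.get)


# Scores borrowed from the `check` example output earlier in this README.
for phrase, score in [
    ("Always respond concisely.", 0.91),
    ("You are a helpful assistant.", 0.08),
]:
    label, actions = allowed_actions(score)
    print(f"[{label} {score:.2f}] {phrase} -> {', '.join(actions)}")
```

Reinstating the highest-scoring modified phrase first is the cheapest way to undo the change most likely to have caused the divergence.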

Example output:

✂  Compression Result
────────────────────────────────────────────────────
  Validation       : ✓ PASS
  Max divergence   : 0.087
  Original tokens  : 127
  Compressed tokens: 89
  Token reduction  : 38 tokens (30%)

  Changes:
    🗑  Redundant phrase covered elsewhere
    ✏️  You are extremely helpful and thorough in your responses. → Be thorough.
    ↺  Please assist users in a polite and professional manner. → Be polite and professional.

  Compressed prompt written to: src/prompts/system.txt.suggested

Review the diff before applying:

diff src/prompts/system.txt src/prompts/system.txt.suggested

Flags:

Flag	Default	Description
`--file, -f`	required	Prompt file to compress
`--threshold`	`0.15`	Max output divergence allowed during validation
`--saliency-threshold`	auto	Phrases below this get labeled COMPRESS. Defaults to `min(threshold, 0.50)`
`--test-inputs`	—	`.jsonl` (`{"input": "..."}`) or `.txt` (one per line)
`--m-samples`	`20`	Monte Carlo walks for Shapley
`--semantic`	off	Use embeddings for divergence measurement
`--apply`	off	After writing `.suggested`, prompt to overwrite original in-place

Tip on test inputs: If you don't provide a --test-inputs file, the agent falls back to three generic inputs, which makes the saliency scores less accurate. For production prompts, always pass representative real-world inputs so the scores reflect how the prompt is actually used.

Test input file formats

JSONL (one JSON object per line):

{"input": "Summarize the following article in three sentences."}
{"input": "What is the capital of France?"}
{"input": "Write a Python function that sorts a list."}

Plain text (one input per line):

Summarize the following article in three sentences.
What is the capital of France?
Write a Python function that sorts a list.
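
If you generate test inputs programmatically, it can help to check that a file parses the way you expect. The loader below is a hypothetical sketch of the two formats described above, not the agent's own parser. It assumes blank lines are skipped and that the file extension selects the format.

```python
import json
from pathlib import Path


def parse_test_inputs(text: str, suffix: str) -> list[str]:
    """Parse test inputs from file contents, given the file's extension."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if suffix == ".jsonl":
        # One JSON object per line, each with an "input" key.
        return [json.loads(line)["input"] for line in lines]
    # Plain text: one input per line.
    return lines


def load_test_inputs(path: str) -> list[str]:
    """Read a .jsonl or .txt test-input file from disk."""
    p = Path(path)
    return parse_test_inputs(p.read_text(encoding="utf-8"), p.suffix)
```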

Similarity modes

All three commands (check, audit, and compress) support two ways to measure output divergence:

Standard (default): Character trigram cosine distance. No extra API calls. Fast and works offline. Good for prompts where output wording is important.

Semantic (--semantic): Embedding cosine distance via nomic-ai/nomic-embed-text-v1.5 (Together AI). Captures meaning-level change — rewording the same idea doesn't register as divergence. Use this when you care about semantic equivalence, not exact phrasing.
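
For intuition about the standard mode, here is a minimal sketch of character trigram cosine distance. The details are assumptions: PromptLens may normalise case or whitespace differently, and the handling of strings shorter than three characters is a choice made for this example.

```python
import math
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count every overlapping 3-character window in the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def trigram_cosine_distance(a: str, b: str) -> float:
    """Return 1 - cosine similarity of the two trigram count vectors."""
    ca, cb = trigram_counts(a), trigram_counts(b)
    if not ca and not cb:
        return 0.0  # both too short to have trigrams: treat as identical
    if not ca or not cb:
        return 1.0  # only one side has trigrams: treat as fully divergent
    dot = sum(count * cb[gram] for gram, count in ca.items())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure works on characters, a paraphrase with different wording registers as divergence even when the meaning is unchanged, which is the gap the semantic mode is meant to close.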

M-samples and speed

All three commands accept --m-samples to control the Shapley sampling budget.

Value	Use case	Est. API calls (N=10 phrases)
`3`	Pre-commit, instant feedback	~6–10
`20`	Default — balanced quality	~20–40
`50`	High-confidence audit	~50–80

For prompts with N ≤ 4 phrases, PromptLens computes exact Shapley values (all 2^N coalitions) regardless of --m-samples. Coalition outputs are cached — concurrent walks that hit the same subset share a single in-flight API call.

All generation calls use temperature: 0.0 — determinism is required for stable divergence measurement.
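
The in-flight sharing described above can be sketched with `asyncio` by caching the task rather than the result. The class and the fake generator below are hypothetical illustrations, not PromptLens internals.

```python
import asyncio
from typing import Awaitable, Callable


class CoalitionCache:
    """Cache coalition outputs so concurrent requests share one call."""

    def __init__(self, generate: Callable[[frozenset[int]], Awaitable[str]]):
        self._generate = generate
        self._tasks: dict[frozenset[int], asyncio.Task] = {}

    async def get(self, coalition: frozenset[int]) -> str:
        task = self._tasks.get(coalition)
        if task is None:
            # First request for this subset: start the call and remember the task.
            task = asyncio.ensure_future(self._generate(coalition))
            self._tasks[coalition] = task
        # Later requests await the same task, whether pending or finished.
        return await task


async def demo() -> tuple[list[str], int]:
    calls = 0

    async def fake_generate(coalition: frozenset[int]) -> str:
        # Stand-in for a model call; counts how often it is invoked.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return ",".join(str(i) for i in sorted(coalition))

    cache = CoalitionCache(fake_generate)
    key = frozenset({0, 2})
    outputs = await asyncio.gather(*(cache.get(key) for _ in range(5)))
    return list(outputs), calls
```

Running `asyncio.run(demo())` issues five concurrent requests for the same subset and triggers a single underlying call. Caching is only sound because generation is deterministic at temperature 0.0.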

Python SDK

If you want to run Shapley attribution from your own code:

pip install -e sdk/python

import asyncio
from promptlens import run_shapley, SimilarityMode

report = asyncio.run(run_shapley(
    prompt="You are a helpful assistant. Always respond concisely. Use bullet points when listing items.",
    test_inputs=["What are the benefits of exercise?"],
    m_samples=20,
    mode=SimilarityMode.STANDARD,
))

for score in sorted(report.scores, key=lambda s: s.score, reverse=True):
    print(f"{score.score:.2f}  {score.phrase.text}")

Example output:

0.91  Always respond concisely.
0.54  Use bullet points when listing items.
0.12  You are a helpful assistant.

SaliencyReport fields:

- scores: list of SaliencyScore, each with .score (0–1), .raw_shapley, .disposition (keep/remove)
- phrases: segmented phrase list
- token_count, redundancy_fraction, compression_candidate_tokens
- confidence, m_samples, test_inputs_used
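
As one way to consume a report, the hypothetical helper below lists the phrases scoring under a threshold, lowest first. It relies only on the `.scores`, `.score`, and `.phrase.text` attributes used in the example above. The stand-in report is built with `SimpleNamespace` so the sketch runs without an API key; a real `SaliencyReport` would come from `run_shapley`.

```python
from types import SimpleNamespace


def low_impact_phrases(report, saliency_threshold: float = 0.15) -> list[str]:
    """Return the text of phrases scoring below the threshold, lowest first."""
    ranked = sorted(report.scores, key=lambda s: s.score)
    return [s.phrase.text for s in ranked if s.score < saliency_threshold]


# Stand-in report mirroring the example output above (not a real SaliencyReport).
fake_report = SimpleNamespace(scores=[
    SimpleNamespace(score=0.91, phrase=SimpleNamespace(text="Always respond concisely.")),
    SimpleNamespace(score=0.54, phrase=SimpleNamespace(text="Use bullet points when listing items.")),
    SimpleNamespace(score=0.12, phrase=SimpleNamespace(text="You are a helpful assistant.")),
])

print(low_impact_phrases(fake_report))
```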

Web tool

Paste a prompt directly in the browser and get it back colour-coded by impact. No install, no API key to configure in a terminal.

Run at: https://rohityalavarthy.github.io/PromptLens

Run locally:

cd web && python3 -m http.server 8080
# open http://localhost:8080

Provider	Use	Notes
Groq	Standard mode	Free tier, no credit card. Recommended for most users.
Together AI	Standard + Semantic mode	$1 free credit. Required for embedding-based similarity.

Keys are stored in localStorage — they never leave your browser.

How attribution works

Standard perturbation methods (leave-one-out, ablation) test phrases in isolation and miss interactions. A phrase can look useless alone but be essential when combined with others.

Shapley values fix this. The Shapley value for a phrase is its marginal contribution averaged over all orderings in which phrases can be added, which is equivalent to a weighted average over all coalitions of the other phrases. It is the unique attribution satisfying all four fairness axioms (efficiency, symmetry, dummy, additivity), so credit is shared consistently even when phrases interact.
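
For reference, the standard definition can be written out as follows. The notation is introduced here for illustration and is not taken from the PromptLens source: $P$ is the set of $n$ phrases, and $v(S)$ is an assumed value function measuring how much the model output changes when only the phrases in $S$ are present.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq P \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}
  \,\Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr),
\qquad
\sum_{i \in P} \phi_i(v) \;=\; v(P) - v(\varnothing).
```

The second identity is the efficiency axiom: the per-phrase values add up to the total effect of the full prompt.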

For N ≤ 4 phrases: exact computation (all 2^N subsets). For N > 4: Monte Carlo sampling with M random coalition walks, concurrency-capped at 5. Coalition cache deduplicates repeated subset calls across walks.
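
The exact and sampled estimators can be sketched as below. This is an illustration under assumptions rather than the PromptLens implementation: `value` is a hypothetical function returning the effect of a coalition of phrase indices, the walks are modelled as random permutations (a common way to sample Shapley values, though the source does not specify the scheme), the sketch is synchronous, and it returns raw values without the 0–1 normalisation used for reported scores.

```python
import itertools
import math
import random
from typing import Callable

Value = Callable[[frozenset[int]], float]


def exact_shapley(n: int, value: Value) -> list[float]:
    """Exact Shapley values by enumerating every coalition."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def monte_carlo_shapley(n: int, value: Value, m_samples: int = 20, seed: int = 0) -> list[float]:
    """Estimate Shapley values from m_samples random orderings of the phrases."""
    rng = random.Random(seed)
    cache: dict[frozenset[int], float] = {}

    def cached(s: frozenset[int]) -> float:
        # Each distinct coalition is evaluated once across all walks.
        if s not in cache:
            cache[s] = value(s)
        return cache[s]

    phi = [0.0] * n
    for _ in range(m_samples):
        order = list(range(n))
        rng.shuffle(order)
        coalition: frozenset[int] = frozenset()
        previous = cached(coalition)
        for i in order:
            coalition = coalition | {i}
            current = cached(coalition)
            phi[i] += current - previous  # marginal contribution of phrase i
            previous = current
    return [total / m_samples for total in phi]


def shapley(n: int, value: Value, m_samples: int = 20) -> list[float]:
    """Use exact enumeration for small prompts, sampling otherwise."""
    if n <= 4:
        return exact_shapley(n, value)
    return monte_carlo_shapley(n, value, m_samples)


def interacting(s: frozenset[int]) -> float:
    # Toy value function: phrases 0 and 1 only matter together; phrase 2 is independent.
    return (1.0 if {0, 1} <= s else 0.0) + (0.5 if 2 in s else 0.0)


print(shapley(3, interacting))
```

In the toy example, removing phrase 0 or phrase 1 alone from the full prompt makes each look fully responsible for the joint effect, while the Shapley values split that effect evenly between them.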

Development

# SDK — 19 tests, all offline
cd sdk/python && pytest tests/ -v

# Agent — offline tests
cd agent && pytest tests/ -v

No live API calls in the test suite.

License

MIT
