V2 Features

New compression features added in v2. All features are opt-in with backward-compatible defaults — existing code produces identical output without changes. Zero new runtime dependencies.

Quick reference

Feature	Option	Default	Effect	Tradeoff
Quality metrics	automatic	on when compression occurs	Adds `entity_retention`, `structural_integrity`, `reference_coherence`, `quality_score` to result	~1% overhead from entity extraction
Relevance threshold	`relevanceThreshold`	off	Drops low-value messages to stubs	Higher ratio, may lose context in filler-heavy conversations
Tiered budget	`budgetStrategy: 'tiered'`	`'binary-search'`	Compresses old prose first, protects recent messages	Better quality at the same budget; slightly slower (tightening passes)
Entropy scorer	`entropyScorer`	off	Information-theoretic sentence scoring via external LM	Better sentence selection; requires a local model or API
Adaptive budgets	automatic	on	Scales summary budget with content density	Entity-dense content gets more room; sparse filler compresses harder
Conversation flow	`conversationFlow`	`false`	Groups Q&A / request→action chains	More coherent summaries; reduces ratio on conversations without clear patterns
Discourse-aware (experimental)	`discourseAware`	`false`	EDU decomposition with dependency tracking	Reduces ratio 8–28% without an ML scorer. Infrastructure only — provide your own scorer
Coreference	`coreference`	`false`	Inlines entity definitions into compressed summaries	Prevents orphaned references; adds bytes to summaries
Semantic clustering	`semanticClustering`	`false`	Groups messages by topic for cluster-aware compression	Better coherence on topic-scattered conversations; O(n²) similarity computation
Compression depth	`compressionDepth`	`'gentle'`	Controls aggressiveness: gentle/moderate/aggressive/auto	Higher depth = higher ratio but lower quality
ML token classifier	`mlTokenClassifier`	off	Per-token keep/remove via external ML model	Highest quality compression; requires a trained model (~500MB)

Quality metrics

Quality metrics are computed automatically whenever compression occurs. No option needed.

Fields

Field	Range	Meaning
`compression.entity_retention`	0–1	Fraction of technical identifiers (camelCase, snake_case, PascalCase, file paths, URLs, version numbers) preserved
`compression.structural_integrity`	0–1	Fraction of structural elements (code fences, JSON blocks, tables) preserved
`compression.reference_coherence`	0–1	Fraction of output entity references whose defining message is still present
`compression.quality_score`	0–1	Weighted composite: `0.4 × entity_retention + 0.4 × structural_integrity + 0.2 × reference_coherence`
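
As a worked check of the composite formula, here is a minimal TypeScript sketch. `QualityInputs` and `qualityScore` are hypothetical names used for illustration only; they are not library exports, and the engine may compute the score differently internally.

```typescript
// Hypothetical helper that mirrors the documented weights; not a library export.
interface QualityInputs {
  entityRetention: number; // 0-1
  structuralIntegrity: number; // 0-1
  referenceCoherence: number; // 0-1
}

function qualityScore(q: QualityInputs): number {
  return 0.4 * q.entityRetention + 0.4 * q.structuralIntegrity + 0.2 * q.referenceCoherence;
}

// Half of the references are orphaned, yet the composite only drops by 0.1
// because reference coherence carries the smallest weight.
const example = qualityScore({
  entityRetention: 0.9,
  structuralIntegrity: 1.0,
  referenceCoherence: 0.5,
});
console.log(example.toFixed(2)); // 0.4*0.9 + 0.4*1.0 + 0.2*0.5 = 0.86
```

Because of the weighting, it can be worth inspecting `reference_coherence` on its own rather than relying on the composite alone.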

Example

const result = compress(messages, { recencyWindow: 4 });

console.log(result.compression.quality_score); // 0.95
console.log(result.compression.entity_retention); // 0.92
console.log(result.compression.structural_integrity); // 1.0

Tradeoffs

Quality metrics add ~1% overhead from entity extraction on every compression
entity_retention only tracks identifiers (camelCase, snake_case, PascalCase, file paths, URLs, version numbers). Plain English nouns are not tracked
reference_coherence checks if defining messages survived, not whether the definition text survived — a message can be compressed (losing the definition prose) and still count as "present" if its ID is in the output
Scores of 1.0 do not mean lossless — they mean no tracked entities/structures were lost

Relevance threshold

Drops low-value messages to compact stubs instead of producing low-quality summaries.

Usage

const result = compress(messages, {
  relevanceThreshold: 5, // sentence score threshold
});

How it works

Before summarizing a group of compressible messages, the engine scores each sentence using the heuristic scorer. If the best sentence score in the group falls below relevanceThreshold, the entire group is replaced with [N messages of general discussion omitted]. Consecutive dropped messages are grouped into a single stub.

Original content is still stored in `verbatim`, so round-trip integrity is preserved.
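
The group-level rule above can be sketched as follows. This is an illustration under stated assumptions: `MessageGroup`, `toyScore`, and `applyRelevanceThreshold` are hypothetical names, and `toyScore` is a stand-in for the engine's heuristic scorer, whose actual weights are not described here.

```typescript
// Hypothetical shapes; the engine's internal types are not shown in this document.
interface MessageGroup {
  ids: string[];
  sentences: string[];
}

// Stand-in scorer (an assumption, not the engine's heuristic):
// 4 points per code-style identifier plus 1 point per 10 words.
function toyScore(sentence: string): number {
  const identifiers = sentence.match(/\b(?:[a-z]+[A-Z]\w*|[a-z]+_[a-z_]+)\b/g) ?? [];
  const words = sentence.split(/\s+/).filter(Boolean).length;
  return identifiers.length * 4 + Math.floor(words / 10);
}

function applyRelevanceThreshold(
  groups: MessageGroup[],
  threshold: number,
  score: (sentence: string) => number = toyScore,
): string[] {
  const out: string[] = [];
  let dropped = 0;
  const flush = (): void => {
    if (dropped > 0) {
      out.push(`[${dropped} messages of general discussion omitted]`);
      dropped = 0;
    }
  };
  for (const group of groups) {
    // The decision uses the best sentence in the group, not the average.
    const best = Math.max(0, ...group.sentences.map((s) => score(s)));
    if (best < threshold) {
      dropped += group.ids.length; // consecutive drops merge into one stub
    } else {
      flush();
      out.push(`[summary of ${group.ids.join(', ')}]`);
    }
  }
  flush();
  return out;
}

const stubs = applyRelevanceThreshold(
  [
    { ids: ['m1'], sentences: ['Sounds good to me.'] },
    { ids: ['m2', 'm3'], sentences: ['Yeah, agreed.', 'Let us move on.'] },
    { ids: ['m4'], sentences: ['The fetchData helper needs a retry_count option.'] },
  ],
  5,
);
console.log(stubs);
// [ '[3 messages of general discussion omitted]', '[summary of m4]' ]
```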

Tradeoffs

Higher values = more aggressive dropping. Values around 3–5 catch most filler. Values above 8 will drop messages containing some technical content
Lower values = only pure filler is dropped
Messages with any code identifiers (camelCase, snake_case) tend to score above 3, so they survive
The threshold operates on the best sentence in a group — a message with one technical sentence among filler will be preserved
messages_relevance_dropped stat tracks how many messages were stubbed

Tiered budget strategy

An alternative to binary search that keeps the recency window fixed and progressively compresses older content.

Usage

const result = compress(messages, {
  tokenBudget: 4000,
  budgetStrategy: 'tiered',
  forceConverge: true, // recommended with tiered
});

How it works

1. Run standard compress with the user's recencyWindow
2. If result fits budget → done
3. Pass 2a: Tighten older summaries (re-summarize at 40% budget)
4. Pass 2b: Stub low-value older messages (score < 3 → "[message omitted]")
5. Pass 3: forceConverge as last resort (if enabled)
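
The pass order above can be modeled on token counts alone. The sketch below is a simplification, not the engine's implementation: `SizedMessage`, `tieredFit`, and `STUB_TOKENS` are hypothetical, "40% budget" is read as shrinking each older message to 40% of its current size, and the `forceConverge` pass is left out.

```typescript
// Hypothetical message model; the engine's real passes operate on full message content.
interface SizedMessage {
  id: string;
  tokens: number; // current size after the standard compress pass
  score: number; // heuristic relevance score
  recent: boolean; // true when inside the recency window
}

const STUB_TOKENS = 3; // assumed cost of "[message omitted]"

function totalTokens(msgs: SizedMessage[]): number {
  return msgs.reduce((sum, m) => sum + m.tokens, 0);
}

function tieredFit(
  msgs: SizedMessage[],
  budget: number,
): { msgs: SizedMessage[]; fits: boolean } {
  let current = msgs.map((m) => ({ ...m }));
  if (totalTokens(current) <= budget) return { msgs: current, fits: true };

  // Pass 2a: tighten older messages to 40% of their size (one reading of "40% budget").
  current = current.map((m) =>
    m.recent ? m : { ...m, tokens: Math.round(m.tokens * 0.4) },
  );
  if (totalTokens(current) <= budget) return { msgs: current, fits: true };

  // Pass 2b: stub older messages whose score is below 3.
  current = current.map((m) =>
    !m.recent && m.score < 3 ? { ...m, tokens: Math.min(m.tokens, STUB_TOKENS) } : m,
  );

  // Pass 3 (forceConverge) is not modeled here.
  return { msgs: current, fits: totalTokens(current) <= budget };
}

const fitted = tieredFit(
  [
    { id: 'a', tokens: 100, score: 1, recent: false },
    { id: 'b', tokens: 200, score: 6, recent: false },
    { id: 'c', tokens: 150, score: 2, recent: true },
  ],
  250,
);
console.log(fitted.fits, fitted.msgs.map((m) => `${m.id}:${m.tokens}`));
// true [ 'a:3', 'b:80', 'c:150' ]
```

In this run the recent message `c` keeps its full size even though its score is low, which is the property that distinguishes tiered from binary search.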

Tradeoffs

	Binary search (default)	Tiered
Recency window	Shrinks to fit budget	Fixed — recent messages always preserved
Older messages	Compressed uniformly	Progressively tightened by priority
Speed	O(log n) compress iterations	Single compress + tightening passes
Best for	General use, simple budgets	Conversations where recent context matters most

Tiered is strictly better at preserving recent context but may produce lower quality on older messages (tighter budgets)
Without forceConverge, tiered may fail to meet very tight budgets
Works with both sync and async paths

Entropy scorer

Plug in a small causal language model for information-theoretic sentence scoring. Based on Selective Context (EMNLP 2023).

Usage

// Sync scorer (e.g., local model via llama.cpp bindings)
const result = compress(messages, {
  entropyScorer: (sentences) => sentences.map((s) => myLocalModel.selfInformation(s)),
  entropyScorerMode: 'augment', // combine with heuristic (default)
});

// Async scorer (e.g., remote inference)
const asyncResult = await compress(messages, {
  entropyScorer: async (sentences) => myApi.scoreSentences(sentences),
  summarizer: mySummarizer, // required to enable async path
});

Modes

Mode	Behavior
`'augment'` (default)	Weighted average of heuristic + entropy scores (60% entropy, 40% heuristic)
`'replace'`	Entropy scores only, heuristic skipped
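
A minimal sketch of how the two modes could combine per-sentence scores. `combineScores` is a hypothetical helper, and it assumes both score arrays are already on a comparable scale; how the engine normalizes them is not described here.

```typescript
type ScorerMode = 'augment' | 'replace';

// Assumes heuristic and entropy scores are aligned by sentence index and
// already normalized to a comparable range (an assumption of this sketch).
function combineScores(
  heuristic: number[],
  entropy: number[],
  mode: ScorerMode = 'augment',
): number[] {
  if (heuristic.length !== entropy.length) {
    throw new Error('score arrays must have the same length');
  }
  if (mode === 'replace') return entropy.slice();
  // 'augment': 60% entropy, 40% heuristic
  return entropy.map((e, i) => 0.6 * e + 0.4 * (heuristic[i] ?? 0));
}

const blended = combineScores([0.9, 0.2], [0.3, 0.8]);
console.log(blended.map((x) => x.toFixed(2))); // [ '0.54', '0.56' ]
```

Here the heuristic prefers the first sentence and the entropy scorer prefers the second; in `'augment'` mode the second sentence ends up slightly ahead.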

Tradeoffs

'augment' is safer — heuristic catches structural patterns (code identifiers, status words) that entropy might miss in short sentences
'replace' gives the entropy scorer full control — use when your model is well-calibrated
Async scorers throw in sync mode (no summarizer/classifier provided). Use a sync scorer or add a summarizer to enable async
The engine stays zero-dependency — the scorer function is user-provided

Adaptive summary budgets

Summary budgets now scale with content density. This is automatic — no option needed.

How it works

The computeBudget function measures entity density (identifiers per character):

Dense content (many identifiers): up to 45% of content length as budget, max 800 chars
Sparse content (general discussion): down to 15% of content length, min 100 chars
Default (no density signal): 30% of content length, 200–600 chars (backward compatible)
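
To make the three ranges concrete, here is a rough sketch. `sketchBudget`, `LOW_DENSITY`, and `HIGH_DENSITY` are hypothetical: the document does not state the density cutoffs or the interpolation used by `computeBudget`, so the linear mapping below is an assumption.

```typescript
// Illustrative cutoffs in identifiers per character; the engine's real values are not documented here.
const LOW_DENSITY = 0.005;
const HIGH_DENSITY = 0.03;

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Returns a summary budget in characters.
function sketchBudget(contentLength: number, density?: number): number {
  if (density === undefined) {
    // Default: 30% of content length, clamped to 200-600 chars.
    return clamp(Math.round(contentLength * 0.3), 200, 600);
  }
  // Assumed linear interpolation between the sparse (15%) and dense (45%) ratios.
  const t = clamp((density - LOW_DENSITY) / (HIGH_DENSITY - LOW_DENSITY), 0, 1);
  const ratio = 0.15 + t * (0.45 - 0.15);
  return clamp(Math.round(contentLength * ratio), 100, 800);
}

console.log(sketchBudget(1000)); // 300 (default)
console.log(sketchBudget(1000, 0.05)); // 450 (dense)
console.log(sketchBudget(1000, 0)); // 150 (sparse)
console.log(sketchBudget(5000, 0.05)); // 800 (dense, capped)
```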

Tradeoffs

Entity-dense messages (e.g., architecture discussions with many function names) get longer summaries, preserving more identifiers. This improves entity_retention but slightly reduces compression ratio on those messages
Sparse filler messages get tighter summaries, improving ratio where it matters most
Messages near the 120-char short-content threshold that previously escaped compression may now be compressed, since the lower budget minimum (100 chars vs. 200) allows shorter summaries

Conversation flow

Groups common conversation patterns into compression units that produce more coherent summaries.

Usage

const result = compress(messages, {
  conversationFlow: true,
});

Detected patterns

Pattern	Detection	Summary format
Q&A	User question (has `?`) → assistant answer	`Q: {question} → A: {answer}`
Request → action	User request (`can you`, `please`, `add`) → assistant action (`done`, `added`)	`Request: {request} → {action}`
Correction	`actually`, `wait`, `no,` followed by same-topic content	`Correction: {correction text}`
Acknowledgment	Substantive message (>200 chars) → short confirmation (`great`, `thanks`)	`{substance} (acknowledged)`

Follow-up confirmations (perfect, thanks) are included in Q&A and request chains when detected within 2 messages.
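
As an illustration of the Q&A row, a simplified detector might look like the sketch below. `ChatMessage`, `QAChain`, `detectQAChains`, and the `CONFIRMATION` pattern are hypothetical, "within 2 messages" is read as the two messages after the answer, and the code-fence exclusion is not modeled.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

interface QAChain {
  start: number; // index of the question
  end: number; // index of the last message in the chain
  summary: string;
}

// Assumed confirmation vocabulary, taken from the examples in this section.
const CONFIRMATION = /^(perfect|thanks|great)\b/i;

function detectQAChains(msgs: ChatMessage[]): QAChain[] {
  const chains: QAChain[] = [];
  let i = 0;
  while (i + 1 < msgs.length) {
    const q = msgs[i];
    const a = msgs[i + 1];
    if (q && a && q.role === 'user' && q.content.includes('?') && a.role === 'assistant') {
      let end = i + 1;
      // One reading of "within 2 messages": inspect the two messages after the answer.
      for (let j = i + 2; j <= i + 3 && j < msgs.length; j++) {
        const m = msgs[j];
        if (m && m.role === 'user' && CONFIRMATION.test(m.content.trim())) {
          end = j;
          break;
        }
      }
      chains.push({ start: i, end, summary: `Q: ${q.content} → A: ${a.content}` });
      i = end + 1;
    } else {
      i += 1;
    }
  }
  return chains;
}

const chains = detectQAChains([
  { role: 'user', content: 'How do I set the timeout?' },
  { role: 'assistant', content: 'Pass timeoutMs in the options.' },
  { role: 'user', content: 'Perfect, thanks' },
  { role: 'user', content: 'Next topic: logging.' },
]);
console.log(chains);
// one chain covering indices 0-2:
// "Q: How do I set the timeout? → A: Pass timeoutMs in the options."
```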

Tradeoffs

Flow chains produce more coherent summaries than independent compression — a Q&A pair as Q: ... → A: ... preserves the relationship between question and answer
Messages with code fences are excluded from flow chains to prevent code loss — they use the code-split path instead
Conversations without clear patterns (e.g., multi-party discussions, brainstorming) see no benefit
Flow chains can override soft preservation (recency, short content) but not hard blocks (system roles, dedup, tool_calls)
The detection is conservative — only well-established patterns are matched. Ambiguous exchanges fall through to normal compression

Discourse-aware summarization (experimental)

Status: experimental. The infrastructure is in place (EDU segmentation, dependency graph, greedy selector) but the built-in rule-based scorer reduces compression ratio by 8–28% with no measurable quality gain over the default sentence scorer. The dependency tracking inherently fights compression — pulling in parent EDUs when selecting children keeps more text than necessary. This feature needs an ML-backed scorer to identify which dependencies are actually load-bearing. Until then, leave it off unless you provide a custom scorer.

Breaks content into Elementary Discourse Units (EDUs) with dependency tracking. Based on From Context to EDUs (arXiv 2025).

Usage

// Not recommended without a custom scorer — reduces ratio
const result = compress(messages, {
  discourseAware: true,
});

// With a custom scorer (e.g., backed by an ML model) — the intended use
import { segmentEDUs, scoreEDUs, selectEDUs } from 'context-compression-engine';

const edus = segmentEDUs(text);
const scored = scoreEDUs(edus, (text) => myModel.importance(text));
const selected = selectEDUs(scored, budget);

How it works

1. Segment text into EDUs at clause boundaries (discourse markers: then, because, which, however, etc.)
2. Build dependency edges: pronoun references (it, this) → preceding EDU; temporal chains (first...then...finally); causal chains (because...therefore)
3. Score EDUs (information-density heuristic by default, or custom scorer)
4. Greedy selection: highest-scored EDUs first, pulling in dependency parents (up to 2 levels)
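
The greedy selection step can be sketched as follows. This is not the exported `selectEDUs`; `SketchEDU` and `selectGreedy` are hypothetical, each EDU is assumed to have at most one parent, and the budget is measured in characters.

```typescript
// Hypothetical EDU shape with a single optional parent dependency.
interface SketchEDU {
  id: number;
  text: string;
  score: number;
  parent?: number;
}

function selectGreedy(edus: SketchEDU[], budgetChars: number): SketchEDU[] {
  const byId = new Map<number, SketchEDU>();
  for (const e of edus) byId.set(e.id, e);

  const chosen = new Set<number>();
  let used = 0;
  const ranked = edus.slice().sort((x, y) => y.score - x.score);

  for (const edu of ranked) {
    if (chosen.has(edu.id)) continue;
    // Bundle the EDU with up to 2 levels of parents that are not yet selected.
    const bundle: SketchEDU[] = [edu];
    let parentId: number | undefined = edu.parent;
    for (let depth = 0; depth < 2 && parentId !== undefined; depth++) {
      const parent: SketchEDU | undefined = byId.get(parentId);
      if (!parent) break;
      if (!chosen.has(parent.id)) bundle.push(parent);
      parentId = parent.parent;
    }
    const cost = bundle.reduce((sum, e) => sum + e.text.length, 0);
    if (used + cost > budgetChars) continue; // the whole bundle must fit
    for (const e of bundle) chosen.add(e.id);
    used += cost;
  }
  return edus.filter((e) => chosen.has(e.id)); // restore original order
}

const selected = selectGreedy(
  [
    { id: 0, text: 'The cache layer was slow', score: 1 },
    { id: 1, text: 'because it serialized every entry', score: 5, parent: 0 },
    { id: 2, text: 'Anyway, lunch was good', score: 0 },
  ],
  60,
);
console.log(selected.map((e) => e.id)); // [ 0, 1 ]
```

Selecting EDU 1 pulls in its low-scoring parent, EDU 0, which shows how dependency tracking spends budget on text the scorer alone would not have picked.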

Why it underperforms without an ML scorer

The rule-based scorer rewards technical identifiers and penalizes filler — the same signals as the default sentence scorer. But the dependency tracking adds a tax: selecting one high-value EDU forces inclusion of its parent EDUs, which may be low-value. The default scorer can't distinguish load-bearing dependencies (removing the parent makes the child meaningless) from decorative ones (the parent adds context but the child stands alone). An ML scorer trained on discourse coherence would solve this.

Tradeoffs

Prevents incoherent summaries where removing a sentence orphans a pronoun reference — in theory, but the ratio cost currently outweighs the coherence benefit
The EDU segmenter, dependency builder, and selector are fully functional and exported — use them directly with a custom scorer via segmentEDUs, scoreEDUs, selectEDUs
Mutually exclusive with entropyScorer — when both are set, discourseAware takes priority

Cross-message coreference

Tracks entity references across messages to prevent orphaned references when source messages are compressed.

Usage

const result = compress(messages, {
  coreference: true,
});

How it works

1. Build coreference map: for each identifier (camelCase, snake_case, PascalCase), track where it first appears and which later messages reference it
2. After compression: check if any preserved message references an entity defined only in a compressed message
3. If so: prepend [context: {defining sentence}] to the compressed message's summary
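
A simplified sketch of these steps is shown below. `IDENTIFIER`, `buildCorefMap`, and `addContext` are hypothetical names; the regex is a rough approximation of code-style identifiers, the sentence splitter is naive, and "defined" is treated as "first appears in", which may differ from the engine's rule.

```typescript
// Code-style identifiers only: camelCase, PascalCase, snake_case (a simplified pattern).
const IDENTIFIER =
  /\b(?:[a-z]+(?:[A-Z][a-z0-9]*)+|[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]*)+|[a-z0-9]+(?:_[a-z0-9]+)+)\b/g;

interface Definition {
  messageIndex: number;
  sentence: string; // first sentence containing the identifier, max 80 chars
}

function buildCorefMap(messages: string[]): Map<string, Definition> {
  const defs = new Map<string, Definition>();
  messages.forEach((content, messageIndex) => {
    const sentences = content.match(/[^.!?]+[.!?]?/g) ?? []; // naive sentence split
    for (const raw of sentences) {
      const sentence = raw.trim();
      for (const id of sentence.match(IDENTIFIER) ?? []) {
        if (!defs.has(id)) defs.set(id, { messageIndex, sentence: sentence.slice(0, 80) });
      }
    }
  });
  return defs;
}

// summaries maps the index of each compressed message to its summary text.
function addContext(messages: string[], summaries: Map<number, string>): Map<number, string> {
  const defs = buildCorefMap(messages);
  const out = new Map<number, string>(summaries);
  messages.forEach((content, index) => {
    if (summaries.has(index)) return; // only preserved messages are scanned for references
    for (const id of content.match(IDENTIFIER) ?? []) {
      const def = defs.get(id);
      if (!def || !summaries.has(def.messageIndex)) continue;
      const current = out.get(def.messageIndex) ?? '';
      const prefix = `[context: ${def.sentence}]`;
      if (!current.includes(prefix)) out.set(def.messageIndex, `${prefix} ${current}`);
    }
  });
  return out;
}

const withContext = addContext(
  [
    'The fetchData function handles API calls. It handles retries with backoff.',
    'Make sure fetchData uses a 30s timeout',
  ],
  new Map<number, string>([[0, '[summary: handles retries with backoff | entities: fetchData]']]),
);
console.log(withContext.get(0));
// [context: The fetchData function handles API calls.] [summary: handles retries with backoff | entities: fetchData]
```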

Example

Without coreference:

Message 3 (compressed): [summary: handles retries with backoff | entities: fetchData]
Message 7 (preserved):  "Make sure fetchData uses a 30s timeout"

With coreference:

Message 3 (compressed): [context: The fetchData function handles API calls.] [summary: handles retries with backoff | entities: fetchData]
Message 7 (preserved):  "Make sure fetchData uses a 30s timeout"

Tradeoffs

Prevents the common failure mode where compressing an early definition message makes later references meaningless
Adds bytes to compressed summaries (the [context: ...] prefix). This slightly reduces compression ratio
Only tracks code-style identifiers (camelCase, snake_case, PascalCase) — not plain English nouns. This avoids false positives but misses some references
The inline definition is the first sentence containing the entity, truncated to 80 chars. Complex multi-sentence definitions are only partially captured

Semantic clustering

Groups messages by topic using lightweight TF-IDF and entity overlap, then compresses each cluster as a unit.

Usage

const result = compress(messages, {
  semanticClustering: true,
  clusterThreshold: 0.15, // similarity threshold (default)
});

How it works

1. Compute TF-IDF vectors per message (content words, stopwords removed)
2. Compute entity overlap (Jaccard similarity on extracted identifiers)
3. Combined similarity: 0.7 × cosine(TF-IDF) + 0.3 × jaccard(entities)
4. Agglomerative clustering with average linkage until similarity drops below threshold
5. Multi-message clusters compressed as a unit with topic label
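
The combined similarity can be sketched as below. The helper names (`cosine`, `jaccard`, `combinedSimilarity`) are hypothetical, the TF-IDF vectors are assumed to be precomputed as term-to-weight maps, and the clustering loop itself is not shown.

```typescript
// Cosine similarity between two sparse term-weight vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((weight, term) => {
    normA += weight * weight;
    dot += weight * (b.get(term) ?? 0);
  });
  b.forEach((weight) => {
    normB += weight * weight;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Jaccard similarity between two identifier sets; two empty sets count as 0 here (an assumption).
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

function combinedSimilarity(
  tfidfA: Map<string, number>,
  tfidfB: Map<string, number>,
  entitiesA: Set<string>,
  entitiesB: Set<string>,
): number {
  return 0.7 * cosine(tfidfA, tfidfB) + 0.3 * jaccard(entitiesA, entitiesB);
}

const similarity = combinedSimilarity(
  new Map<string, number>([['retry', 1], ['timeout', 1]]),
  new Map<string, number>([['retry', 1], ['cache', 1]]),
  new Set<string>(['fetchData']),
  new Set<string>(['fetchData', 'retryConfig']),
);
console.log(similarity.toFixed(2)); // 0.7*0.5 + 0.3*0.5 = 0.50
```

With the default `clusterThreshold` of 0.15, a pair scoring 0.50 would be a candidate for the same cluster.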

Tradeoffs

Long conversations that drift across topics benefit most — scattered messages about fetchData in messages 3, 7, 12, 19 get merged into one compressed block
O(n²) similarity computation. For conversations under 50 messages this is negligible. For 500+ messages, consider whether the coherence benefit justifies the cost
clusterThreshold controls sensitivity: lower values (0.05–0.10) create larger clusters; higher values (0.20–0.30) require stronger topic similarity
Messages already claimed by flow chains are excluded from clustering — the two features cooperate without overlap
Messages with fewer than 80 chars are excluded (not enough content for meaningful similarity)

Compression depth

Controls how aggressively the summarizer compresses content.

Usage

// Fixed depth
const result = compress(messages, {
  compressionDepth: 'moderate',
});

// Auto: progressively tries gentle → moderate → aggressive until budget fits
const autoResult = compress(messages, {
  tokenBudget: 2000,
  compressionDepth: 'auto',
  forceConverge: true,
});

Depth levels

Level	Summary budget	Strategy	Typical ratio
`'gentle'` (default)	30% of content	Sentence selection	~2x
`'moderate'`	15% of content	Tighter sentence selection	~3–4x
`'aggressive'`	Entity-only stubs	Key identifiers only	~6–8x
`'auto'`	Progressive	Tries each level until `tokenBudget` fits	Adapts to budget

Auto mode quality gate

In 'auto' mode, the engine stops escalating if quality_score drops below 0.60 (unless forced by a very tight budget). This prevents aggressive compression from destroying too much context.
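
One way to picture the escalation with its quality gate is the sketch below. `pickDepth` and `Attempt` are hypothetical, `tryDepth` stands in for a full compression run at a fixed depth, and the tight-budget override mentioned above is not modeled.

```typescript
type FixedDepth = 'gentle' | 'moderate' | 'aggressive';

interface Attempt {
  tokens: number; // size of the compressed output
  qualityScore: number; // 0-1
}

function pickDepth(
  tryDepth: (depth: FixedDepth) => Attempt, // hypothetical: one compression run per depth
  tokenBudget: number,
  qualityFloor = 0.6,
): FixedDepth {
  const levels: FixedDepth[] = ['gentle', 'moderate', 'aggressive'];
  let chosen: FixedDepth = 'gentle';
  for (const depth of levels) {
    const attempt = tryDepth(depth);
    // Quality gate: refuse to escalate past a level that drops below the floor.
    if (depth !== 'gentle' && attempt.qualityScore < qualityFloor) break;
    chosen = depth;
    if (attempt.tokens <= tokenBudget) break; // budget met, stop escalating
  }
  return chosen;
}

const simulated: Record<FixedDepth, Attempt> = {
  gentle: { tokens: 3000, qualityScore: 0.95 },
  moderate: { tokens: 2200, qualityScore: 0.8 },
  aggressive: { tokens: 900, qualityScore: 0.5 },
};
const picked = pickDepth((depth) => simulated[depth], 2000);
console.log(picked); // "moderate": aggressive would fit, but its quality is below 0.60
```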

Tradeoffs

'gentle' is the safest — identical to default behavior. Start here
'moderate' halves the summary budget. Entity-dense content keeps identifiers; sparse content gets very short summaries. Good for conversations with lots of boilerplate
'aggressive' produces entity-only stubs (fetchData, getUserProfile, retryConfig). Use for archival compression where only the topics matter, not the details
'auto' with tokenBudget is the most practical — it finds the minimum aggressiveness needed to fit. Without a budget, 'auto' is equivalent to 'gentle'

ML token classifier

Per-token keep/remove classification via a user-provided ML model. Based on LLMLingua-2 (ACL 2024).

Usage

import { compress, createMockTokenClassifier } from 'context-compression-engine';

// Mock classifier for testing
const classifier = createMockTokenClassifier([/fetch/i, /retry/i, /config/i]);
const result = compress(messages, { mlTokenClassifier: classifier });

// Real classifier (e.g., ONNX model)
const realResult = compress(messages, {
  mlTokenClassifier: (content) => {
    const tokens = myTokenizer.tokenize(content);
    const predictions = myModel.predict(tokens);
    return tokens.map((token, i) => ({
      token,
      keep: predictions[i] > 0.5,
      confidence: predictions[i],
    }));
  },
});

Types

type TokenClassification = {
  token: string;
  keep: boolean;
  confidence: number; // 0–1
};

type MLTokenClassifier = (
  content: string,
) => TokenClassification[] | Promise<TokenClassification[]>;

Tradeoffs

Highest potential compression quality — a well-trained encoder model (XLM-RoBERTa, ~500MB) can achieve 2–5x compression at 95–98% accuracy retention
T0 classification rules still override for code/structured content — the ML classifier only handles T2 prose
Falls back to deterministic summarization if the ML-compressed output is longer than the original
Async classifiers throw in sync mode — provide a summarizer or classifier to enable async
The engine stays zero-dependency — you provide the model and tokenizer
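
To show how classifier output could turn into compressed text, including the length-based fallback, here is a small sketch. `keywordClassifier` and `applyClassification` are hypothetical helpers (not the library's `createMockTokenClassifier`), and rejoining kept tokens with single spaces is an assumption about reassembly.

```typescript
type TokenLabel = { token: string; keep: boolean; confidence: number };

// Hypothetical stand-in for a real model: keeps tokens that match any pattern.
// Patterns should not use the global flag, since RegExp.test is stateful with it.
function keywordClassifier(patterns: RegExp[]): (content: string) => TokenLabel[] {
  return (content) =>
    content
      .split(/\s+/)
      .filter(Boolean)
      .map((token) => {
        const keep = patterns.some((p) => p.test(token));
        return { token, keep, confidence: keep ? 0.9 : 0.1 };
      });
}

// Keeps the tokens marked keep=true; falls back when the result is longer than the input.
function applyClassification(
  original: string,
  labels: TokenLabel[],
  fallback: (content: string) => string,
): string {
  const kept = labels
    .filter((label) => label.keep)
    .map((label) => label.token)
    .join(' ');
  return kept.length > original.length ? fallback(original) : kept;
}

const prose = 'Please make sure the fetchData call uses retry logic';
const classify = keywordClassifier([/fetch/i, /retry/i]);
const squeezed = applyClassification(prose, classify(prose), (content) => content.slice(0, 20));
console.log(squeezed); // "fetchData retry"
```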

Helper utilities

import { whitespaceTokenize, createMockTokenClassifier } from 'context-compression-engine';

// Simple whitespace tokenizer
const tokens = whitespaceTokenize('The fetchData function'); // ['The', 'fetchData', 'function']

// Mock classifier for testing — keeps tokens matching any pattern
const mock = createMockTokenClassifier([/fetch/i, /retry/i], 0.9);

Combining features

Most features can be combined. The exceptions (scorer priority and mutual exclusions) are listed under Feature interaction notes. Here are recommended combinations:

Quality-focused (preserve context, moderate compression)

const result = compress(messages, {
  recencyWindow: 6,
  importanceScoring: true,
  contradictionDetection: true,
  coreference: true,
  conversationFlow: true,
});

Ratio-focused (maximum compression, acceptable quality loss)

const result = compress(messages, {
  tokenBudget: 2000,
  compressionDepth: 'auto',
  budgetStrategy: 'tiered',
  relevanceThreshold: 3,
  semanticClustering: true,
  forceConverge: true,
});

Balanced (good ratio + quality)

const result = compress(messages, {
  tokenBudget: 4000,
  conversationFlow: true,
  importanceScoring: true,
  coreference: true,
});

Feature interaction notes

conversationFlow and semanticClustering cooperate — flow chains are detected first, remaining messages are clustered
discourseAware is experimental and not included in any recommended combination — it reduces ratio without a custom ML scorer
mlTokenClassifier takes priority over discourseAware and entropyScorer
relevanceThreshold applies after flow/cluster detection — messages already grouped into chains/clusters are not individually threshold-checked
compressionDepth affects all summarization (groups, code-split prose, contradictions) — not just the main group path
