New compression features added in v2. All features are opt-in with backward-compatible defaults — existing code produces identical output without changes. Zero new runtime dependencies.
| Feature | Option | Default | Effect | Tradeoff |
|---|---|---|---|---|
| Quality metrics | automatic | on when compression occurs | Adds `entity_retention`, `structural_integrity`, `reference_coherence`, `quality_score` to result | ~1% overhead from entity extraction |
| Relevance threshold | `relevanceThreshold` | off | Drops low-value messages to stubs | Higher ratio; may lose context in filler-heavy conversations |
| Tiered budget | `budgetStrategy: 'tiered'` | `'binary-search'` | Compresses old prose first, protects recent messages | Better quality at the same budget; slightly slower (tightening passes) |
| Entropy scorer | `entropyScorer` | off | Information-theoretic sentence scoring via external LM | Better sentence selection; requires a local model or API |
| Adaptive budgets | automatic | on | Scales summary budget with content density | Entity-dense content gets more room; sparse filler compresses harder |
| Conversation flow | `conversationFlow` | `false` | Groups Q&A / request→action chains | More coherent summaries; reduces ratio on conversations without clear patterns |
| Discourse-aware (experimental) | `discourseAware` | `false` | EDU decomposition with dependency tracking | Reduces ratio 8–28% without an ML scorer; infrastructure only — provide your own scorer |
| Coreference | `coreference` | `false` | Inlines entity definitions into compressed summaries | Prevents orphaned references; adds bytes to summaries |
| Semantic clustering | `semanticClustering` | `false` | Groups messages by topic for cluster-aware compression | Better coherence on topic-scattered conversations; O(n²) similarity computation |
| Compression depth | `compressionDepth` | `'gentle'` | Controls aggressiveness: gentle/moderate/aggressive/auto | Higher depth = higher ratio but lower quality |
| ML token classifier | `mlTokenClassifier` | off | Per-token keep/remove via external ML model | Highest quality compression; requires a trained model (~500MB) |
Quality metrics are computed automatically whenever compression occurs. No option needed.
| Field | Range | Meaning |
|---|---|---|
| `compression.entity_retention` | 0–1 | Fraction of technical identifiers (camelCase, snake_case, file paths, URLs, version numbers) preserved |
| `compression.structural_integrity` | 0–1 | Fraction of structural elements (code fences, JSON blocks, tables) preserved |
| `compression.reference_coherence` | 0–1 | Fraction of output entity references whose defining message is still present |
| `compression.quality_score` | 0–1 | Weighted composite: 0.4 × entity_retention + 0.4 × structural_integrity + 0.2 × reference_coherence |
```ts
const result = compress(messages, { recencyWindow: 4 });
console.log(result.compression.quality_score); // 0.95
console.log(result.compression.entity_retention); // 0.92
console.log(result.compression.structural_integrity); // 1.0
```

- Quality metrics add ~1% overhead from entity extraction on every compression
- `entity_retention` only tracks identifiers (camelCase, snake_case, PascalCase, file paths, URLs, version numbers). Plain English nouns are not tracked
- `reference_coherence` checks if defining messages survived, not whether the definition text survived — a message can be compressed (losing the definition prose) and still count as "present" if its ID is in the output
- Scores of 1.0 do not mean lossless — they mean no tracked entities/structures were lost
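The weighted composite can be sketched as a plain function. This is an illustrative helper, not a library export; the weights (0.4 / 0.4 / 0.2) are the ones documented above.

```ts
// Sketch of the quality_score composite described above.
// computeQualityScore is a hypothetical helper, not part of the library API.
function computeQualityScore(
  entityRetention: number,
  structuralIntegrity: number,
  referenceCoherence: number,
): number {
  return (
    0.4 * entityRetention +
    0.4 * structuralIntegrity +
    0.2 * referenceCoherence
  );
}

// All tracked entities kept, structure intact, half the references orphaned:
console.log(computeQualityScore(1.0, 1.0, 0.5)); // 0.9
```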
Drops low-value messages to compact stubs instead of producing low-quality summaries.
```ts
const result = compress(messages, {
  relevanceThreshold: 5, // sentence score threshold
});
```

Before summarizing a group of compressible messages, the engine scores each sentence using the heuristic scorer. If the best sentence score in the group falls below `relevanceThreshold`, the entire group is replaced with `[N messages of general discussion omitted]`. Consecutive dropped messages are grouped into a single stub.
Original content is still stored in verbatim — round-trip integrity is preserved.
- Higher values = more aggressive dropping. Values around 3–5 catch most filler. Values above 8 will drop messages containing some technical content
- Lower values = only pure filler is dropped
- Messages with any code identifiers (camelCase, snake_case) tend to score above 3, so they survive
- The threshold operates on the best sentence in a group — a message with one technical sentence among filler will be preserved
- The `messages_relevance_dropped` stat tracks how many messages were stubbed
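The best-sentence rule above can be sketched in a few lines. The names are hypothetical stand-ins for engine internals, which may differ.

```ts
// Sketch of the group-drop decision: a group is stubbed only when even
// its BEST sentence falls below relevanceThreshold. ScoredSentence and
// shouldStubGroup are illustrative, not library exports.
type ScoredSentence = { text: string; score: number };

function shouldStubGroup(
  sentences: ScoredSentence[],
  relevanceThreshold: number,
): boolean {
  const best = Math.max(...sentences.map((s) => s.score));
  return best < relevanceThreshold;
}

const filler: ScoredSentence[] = [
  { text: "sounds good", score: 1 },
  { text: "ok great", score: 0.5 },
];
console.log(shouldStubGroup(filler, 5)); // true — whole group becomes a stub
```

One technical sentence is enough to rescue a group: a message scoring above the threshold keeps its neighbors out of the stub.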
An alternative to binary search that keeps the recency window fixed and progressively compresses older content.
```ts
const result = compress(messages, {
  tokenBudget: 4000,
  budgetStrategy: 'tiered',
  forceConverge: true, // recommended with tiered
});
```

1. Run standard compress with the user's `recencyWindow`
2. If the result fits the budget → done
3. Pass 2a: tighten older summaries (re-summarize at 40% budget)
4. Pass 2b: stub low-value older messages (score < 3 → `[message omitted]`)
5. Pass 3: `forceConverge` as a last resort (if enabled)
| | Binary search (default) | Tiered |
|---|---|---|
| Recency window | Shrinks to fit budget | Fixed — recent messages always preserved |
| Older messages | Compressed uniformly | Progressively tightened by priority |
| Speed | O(log n) compress iterations | Single compress + tightening passes |
| Best for | General use, simple budgets | Conversations where recent context matters most |
- Tiered is strictly better at preserving recent context but may produce lower quality on older messages (tighter budgets)
- Without `forceConverge`, tiered may fail to meet very tight budgets
- Works with both sync and async paths
Plug in a small causal language model for information-theoretic sentence scoring. Based on Selective Context (EMNLP 2023).
```ts
// Sync scorer (e.g., local model via llama.cpp bindings)
const result = compress(messages, {
  entropyScorer: (sentences) => sentences.map((s) => myLocalModel.selfInformation(s)),
  entropyScorerMode: 'augment', // combine with heuristic (default)
});
```

```ts
// Async scorer (e.g., remote inference)
const result = await compress(messages, {
  entropyScorer: async (sentences) => myApi.scoreSentences(sentences),
  summarizer: mySummarizer, // required to enable async path
});
```

| Mode | Behavior |
|---|---|
| `'augment'` (default) | Weighted average of heuristic + entropy scores (60% entropy, 40% heuristic) |
| `'replace'` | Entropy scores only, heuristic skipped |
- `'augment'` is safer — the heuristic catches structural patterns (code identifiers, status words) that entropy might miss in short sentences
- `'replace'` gives the entropy scorer full control — use when your model is well-calibrated
- Async scorers throw in sync mode (no `summarizer`/`classifier` provided). Use a sync scorer or add a summarizer to enable async
- The engine stays zero-dependency — the scorer function is user-provided
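The `'augment'` weighting is simple enough to sketch directly. `blendScores` is a hypothetical helper, not a library export; only the 60/40 split comes from the table above.

```ts
// Sketch of 'augment' mode: per-sentence weighted average of the
// heuristic and entropy scores (60% entropy, 40% heuristic).
function blendScores(heuristic: number[], entropy: number[]): number[] {
  return heuristic.map((h, i) => 0.6 * entropy[i] + 0.4 * h);
}

// First sentence: 0.6*10 + 0.4*2 = 6.8; second: 0.6*1 + 0.4*8 = 3.8
const blended = blendScores([2, 8], [10, 1]);
```

A sentence the heuristic rates highly (code identifiers) but the model finds predictable still keeps a moderate score, which is the point of augmenting rather than replacing.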
Summary budgets now scale with content density. This is automatic — no option needed.
The `computeBudget` function measures entity density (identifiers per character):
- Dense content (many identifiers): up to 45% of content length as budget, max 800 chars
- Sparse content (general discussion): down to 15% of content length, min 100 chars
- Default (no density signal): 30% of content length, 200–600 chars (backward compatible)
- Entity-dense messages (e.g., architecture discussions with many function names) get longer summaries, preserving more identifiers. This improves `entity_retention` but slightly reduces compression ratio on those messages
- Sparse filler messages get tighter summaries, improving ratio where it matters most
- Messages near the 120-char short-content threshold that previously escaped compression may now be compressed, since the lower budget minimum (100 chars vs. 200) allows shorter summaries
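The three budget tiers can be sketched as a clamped piecewise rule. The density cutoffs (0.02 and 0.005 identifiers per character) are illustrative assumptions; the engine's actual thresholds are not specified here, only the percentages and clamps.

```ts
// Sketch of an adaptive budget rule matching the tiers above.
// computeBudgetSketch is hypothetical; cutoffs are assumed.
function computeBudgetSketch(contentLength: number, identifierCount: number): number {
  const density = identifierCount / contentLength; // identifiers per character
  if (density > 0.02) {
    // Dense: up to 45% of content length, capped at 800 chars
    return Math.min(Math.round(contentLength * 0.45), 800);
  }
  if (density < 0.005) {
    // Sparse: down to 15% of content length, floored at 100 chars
    return Math.max(Math.round(contentLength * 0.15), 100);
  }
  // Default: 30% of content length, clamped to 200–600 chars
  return Math.min(Math.max(Math.round(contentLength * 0.3), 200), 600);
}

console.log(computeBudgetSketch(2000, 60)); // dense: min(900, 800) = 800
console.log(computeBudgetSketch(2000, 2));  // sparse: max(300, 100) = 300
```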
Groups common conversation patterns into compression units that produce more coherent summaries.
```ts
const result = compress(messages, {
  conversationFlow: true,
});
```

| Pattern | Detection | Summary format |
|---|---|---|
| Q&A | User question (has `?`) → assistant answer | `Q: {question} → A: {answer}` |
| Request → action | User request (can you, please, add) → assistant action (done, added) | `Request: {request} → {action}` |
| Correction | actually, wait, no, followed by same-topic content | `Correction: {correction text}` |
| Acknowledgment | Substantive message (>200 chars) → short confirmation (great, thanks) | `{substance} (acknowledged)` |
Follow-up confirmations (perfect, thanks) are included in Q&A and request chains when detected within 2 messages.
- Flow chains produce more coherent summaries than independent compression — a Q&A pair as `Q: ... → A: ...` preserves the relationship between question and answer
- Messages with code fences are excluded from flow chains to prevent code loss — they use the code-split path instead
- Conversations without clear patterns (e.g., multi-party discussions, brainstorming) see no benefit
- Flow chains can override soft preservation (recency, short content) but not hard blocks (system roles, dedup, tool_calls)
- The detection is conservative — only well-established patterns are matched. Ambiguous exchanges fall through to normal compression
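The Q&A row of the table can be sketched as a tiny detector that emits the documented summary format. `Msg` and `qaSummary` are hypothetical helpers, not library exports; the real detector is more conservative than this.

```ts
// Sketch of Q&A chain detection: a user message containing "?" followed
// by an assistant reply collapses to "Q: {question} → A: {answer}".
type Msg = { role: 'user' | 'assistant'; content: string };

function qaSummary(user: Msg, assistant: Msg): string | null {
  if (user.role !== 'user' || !user.content.includes('?')) return null;
  if (assistant.role !== 'assistant') return null;
  return `Q: ${user.content} → A: ${assistant.content}`;
}

const s = qaSummary(
  { role: 'user', content: 'Does fetchData retry on 500s?' },
  { role: 'assistant', content: 'Yes, three times with backoff.' },
);
console.log(s); // "Q: Does fetchData retry on 500s? → A: Yes, three times with backoff."
```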
Status: experimental. The infrastructure is in place (EDU segmentation, dependency graph, greedy selector) but the built-in rule-based scorer reduces compression ratio by 8–28% with no measurable quality gain over the default sentence scorer. The dependency tracking inherently fights compression — pulling in parent EDUs when selecting children keeps more text than necessary. This feature needs an ML-backed scorer to identify which dependencies are actually load-bearing. Until then, leave it off unless you provide a custom scorer.
Breaks content into Elementary Discourse Units (EDUs) with dependency tracking. Based on From Context to EDUs (arXiv 2025).
```ts
// Not recommended without a custom scorer — reduces ratio
const result = compress(messages, {
  discourseAware: true,
});
```

```ts
// With a custom scorer (e.g., backed by an ML model) — the intended use
import { segmentEDUs, scoreEDUs, selectEDUs } from 'context-compression-engine';

const edus = segmentEDUs(text);
const scored = scoreEDUs(edus, (text) => myModel.importance(text));
const selected = selectEDUs(scored, budget);
```

- Segment text into EDUs at clause boundaries (discourse markers: `then`, `because`, `which`, `however`, etc.)
- Build dependency edges: pronoun references (`it`, `this`) → preceding EDU; temporal chains (first...then...finally); causal chains (because...therefore)
- Score EDUs (information-density heuristic by default, or custom scorer)
- Greedy selection: highest-scored EDUs first, pulling in dependency parents (up to 2 levels)
The rule-based scorer rewards technical identifiers and penalizes filler — the same signals as the default sentence scorer. But the dependency tracking adds a tax: selecting one high-value EDU forces inclusion of its parent EDUs, which may be low-value. The default scorer can't distinguish load-bearing dependencies (removing the parent makes the child meaningless) from decorative ones (the parent adds context but the child stands alone). An ML scorer trained on discourse coherence would solve this.
- Prevents incoherent summaries where removing a sentence orphans a pronoun reference — in theory, but the ratio cost currently outweighs the coherence benefit
- The EDU segmenter, dependency builder, and selector are fully functional and exported — use them directly with a custom scorer via `segmentEDUs`, `scoreEDUs`, `selectEDUs`
- Mutually exclusive with `entropyScorer` — when both are set, `discourseAware` takes priority
Tracks entity references across messages to prevent orphaned references when source messages are compressed.
```ts
const result = compress(messages, {
  coreference: true,
});
```

- Build coreference map: for each identifier (camelCase, snake_case, PascalCase), track where it first appears and which later messages reference it
- After compression: check if any preserved message references an entity defined only in a compressed message
- If so: prepend `[context: {defining sentence}]` to the compressed message's summary
Without coreference:

```
Message 3 (compressed): [summary: handles retries with backoff | entities: fetchData]
Message 7 (preserved): "Make sure fetchData uses a 30s timeout"
```

With coreference:

```
Message 3 (compressed): [context: The fetchData function handles API calls.] [summary: handles retries with backoff | entities: fetchData]
Message 7 (preserved): "Make sure fetchData uses a 30s timeout"
```
- Prevents the common failure mode where compressing an early definition message makes later references meaningless
- Adds bytes to compressed summaries (the `[context: ...]` prefix). This slightly reduces compression ratio
- Only tracks code-style identifiers (camelCase, snake_case, PascalCase) — not plain English nouns. This avoids false positives but misses some references
- The inline definition is the first sentence containing the entity, truncated to 80 chars. Complex multi-sentence definitions are only partially captured
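The coreference map build can be sketched as a single pass over messages. The identifier regex here is a simplified assumption; the engine's real extraction rules may differ.

```ts
// Sketch of the coreference map: for each code-style identifier, record
// where it first appears and which later messages reference it.
// buildCoreferenceMap and the regex are illustrative, not library exports.
function buildCoreferenceMap(
  messages: string[],
): Map<string, { firstSeen: number; referencedBy: number[] }> {
  // camelCase | snake_case | PascalCase (simplified)
  const idPattern = /\b(?:[a-z]+[A-Z]\w*|[a-z]+_[a-z_]+|[A-Z][a-z]+[A-Z]\w*)\b/g;
  const map = new Map<string, { firstSeen: number; referencedBy: number[] }>();
  messages.forEach((content, i) => {
    for (const id of content.match(idPattern) ?? []) {
      const entry = map.get(id);
      if (!entry) map.set(id, { firstSeen: i, referencedBy: [] });
      else if (entry.firstSeen !== i) entry.referencedBy.push(i);
    }
  });
  return map;
}

const m = buildCoreferenceMap([
  'The fetchData function handles API calls.',
  'Make sure fetchData uses a 30s timeout',
]);
console.log(m.get('fetchData')); // { firstSeen: 0, referencedBy: [1] }
```

If message 0 is later compressed, the `firstSeen` entry tells the engine which defining sentence to inline as the `[context: ...]` prefix.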
Groups messages by topic using lightweight TF-IDF and entity overlap, then compresses each cluster as a unit.
```ts
const result = compress(messages, {
  semanticClustering: true,
  clusterThreshold: 0.15, // similarity threshold (default)
});
```

- Compute TF-IDF vectors per message (content words, stopwords removed)
- Compute entity overlap (Jaccard similarity on extracted identifiers)
- Combined similarity: `0.7 × cosine(TF-IDF) + 0.3 × jaccard(entities)`
- Agglomerative clustering with average linkage until similarity drops below threshold
- Multi-message clusters compressed as a unit with topic label
- Long conversations that drift across topics benefit most — scattered messages about `fetchData` in messages 3, 7, 12, 19 get merged into one compressed block
- O(n²) similarity computation. For conversations under 50 messages this is negligible. For 500+ messages, consider whether the coherence benefit justifies the cost
- `clusterThreshold` controls sensitivity: lower values (0.05–0.10) create larger clusters; higher values (0.20–0.30) require stronger topic similarity
- Messages already claimed by flow chains are excluded from clustering — the two features cooperate without overlap
- Messages with fewer than 80 chars are excluded (not enough content for meaningful similarity)
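The combined similarity formula can be sketched end to end. Only the `0.7 × cosine + 0.3 × jaccard` weighting comes from the steps above; the helper names and the raw term-count vectors (standing in for real TF-IDF weights) are illustrative.

```ts
// Sketch of combined similarity: 0.7 × cosine(TF-IDF) + 0.3 × jaccard(entities).
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  a.forEach((w, t) => { dot += w * (b.get(t) ?? 0); na += w * w; });
  b.forEach((w) => { nb += w * w; });
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = Array.from(a).filter((x) => b.has(x)).length;
  const union = new Set(Array.from(a).concat(Array.from(b))).size;
  return union ? inter / union : 0;
}

function combinedSimilarity(
  tfidfA: Map<string, number>, tfidfB: Map<string, number>,
  entitiesA: Set<string>, entitiesB: Set<string>,
): number {
  return 0.7 * cosine(tfidfA, tfidfB) + 0.3 * jaccard(entitiesA, entitiesB);
}

// Identical vectors and identical entity sets → similarity ≈ 1.0
const v = new Map([['retry', 1], ['timeout', 2]]);
const e = new Set(['fetchData']);
console.log(combinedSimilarity(v, v, e, e)); // ≈ 1 (up to floating-point rounding)
```

Two messages with no shared terms and no shared identifiers score 0 and are never merged, whatever the threshold.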
Controls how aggressively the summarizer compresses content.
```ts
// Fixed depth
const result = compress(messages, {
  compressionDepth: 'moderate',
});
```

```ts
// Auto: progressively tries gentle → moderate → aggressive until budget fits
const result = compress(messages, {
  tokenBudget: 2000,
  compressionDepth: 'auto',
  forceConverge: true,
});
```

| Level | Summary budget | Strategy | Typical ratio |
|---|---|---|---|
| `'gentle'` (default) | 30% of content | Sentence selection | ~2x |
| `'moderate'` | 15% of content | Tighter sentence selection | ~3–4x |
| `'aggressive'` | Entity-only stubs | Key identifiers only | ~6–8x |
| `'auto'` | Progressive | Tries each level until `tokenBudget` fits | Adapts to budget |
In `'auto'` mode, the engine stops escalating if `quality_score` drops below 0.60 (unless forced by a very tight budget). This prevents aggressive compression from destroying too much context.
- `'gentle'` is the safest — identical to default behavior. Start here
- `'moderate'` halves the summary budget. Entity-dense content keeps identifiers; sparse content gets very short summaries. Good for conversations with lots of boilerplate
- `'aggressive'` produces entity-only stubs (`fetchData`, `getUserProfile`, `retryConfig`). Use for archival compression where only the topics matter, not the details
- `'auto'` with `tokenBudget` is the most practical — it finds the minimum aggressiveness needed to fit. Without a budget, `'auto'` is equivalent to `'gentle'`
Per-token keep/remove classification via a user-provided ML model. Based on LLMLingua-2 (ACL 2024).
```ts
import { compress, createMockTokenClassifier } from 'context-compression-engine';

// Mock classifier for testing
const classifier = createMockTokenClassifier([/fetch/i, /retry/i, /config/i]);
const result = compress(messages, { mlTokenClassifier: classifier });
```

```ts
// Real classifier (e.g., ONNX model)
const result = compress(messages, {
  mlTokenClassifier: (content) => {
    const tokens = myTokenizer.tokenize(content);
    const predictions = myModel.predict(tokens);
    return tokens.map((token, i) => ({
      token,
      keep: predictions[i] > 0.5,
      confidence: predictions[i],
    }));
  },
});
```

```ts
type TokenClassification = {
  token: string;
  keep: boolean;
  confidence: number; // 0–1
};

type MLTokenClassifier = (
  content: string,
) => TokenClassification[] | Promise<TokenClassification[]>;
```

- Highest potential compression quality — a well-trained encoder model (XLM-RoBERTa, ~500MB) can achieve 2–5x compression at 95–98% accuracy retention
- T0 classification rules still override for code/structured content — the ML classifier only handles T2 prose
- Falls back to deterministic summarization if the ML-compressed output is longer than the original
- Async classifiers throw in sync mode — provide a `summarizer` or `classifier` to enable async
- The engine stays zero-dependency — you provide the model and tokenizer
```ts
import { whitespaceTokenize, createMockTokenClassifier } from 'context-compression-engine';

// Simple whitespace tokenizer
const tokens = whitespaceTokenize('The fetchData function'); // ['The', 'fetchData', 'function']

// Mock classifier for testing — keeps tokens matching any pattern
const mock = createMockTokenClassifier([/fetch/i, /retry/i], 0.9);
```

Features can be combined freely. Here are recommended combinations:
```ts
const result = compress(messages, {
  recencyWindow: 6,
  importanceScoring: true,
  contradictionDetection: true,
  coreference: true,
  conversationFlow: true,
});
```

```ts
const result = compress(messages, {
  tokenBudget: 2000,
  compressionDepth: 'auto',
  budgetStrategy: 'tiered',
  relevanceThreshold: 3,
  semanticClustering: true,
  forceConverge: true,
});
```

```ts
const result = compress(messages, {
  tokenBudget: 4000,
  conversationFlow: true,
  importanceScoring: true,
  coreference: true,
});
```

- `conversationFlow` and `semanticClustering` cooperate — flow chains are detected first, remaining messages are clustered
- `discourseAware` is experimental and not included in any recommended combination — it reduces ratio without a custom ML scorer
- `mlTokenClassifier` takes priority over `discourseAware` and `entropyScorer`
- `relevanceThreshold` applies after flow/cluster detection — messages already grouped into chains/clusters are not individually threshold-checked
- `compressionDepth` affects all summarization (groups, code-split prose, contradictions) — not just the main group path
- API reference — all options and result fields
- Token budget — `budgetStrategy`, `compressionDepth: 'auto'`
- Compression pipeline — how features fit into the pipeline
- Benchmark results — quality metrics per scenario