Tool Name
@agent-tools/text-splitter
Description
Framework-agnostic text chunking library providing multiple splitting strategies (recursive character, token-based, semantic similarity, and format-aware) for preparing documents for embedding and retrieval pipelines.
Why It's Useful for Agents
AI agents building RAG pipelines need to split documents into appropriately sized chunks that preserve semantic coherence. The current ecosystem is fragmented: semantic-chunking (138 stars, MIT, JS, v2.6.0 Apr 2026) provides similarity-based chunking but is JavaScript-only and coupled to ONNX models; @langchain/textsplitters (part of LangChain.js, 17.6k stars) offers recursive, token, and markdown splitters but is tightly coupled to the LangChain Document abstraction. No standalone TypeScript-first library combines all of these strategies with pluggable tokenizers and zero framework lock-in.
Proposed API
import { RecursiveCharacterSplitter, TokenSplitter, SemanticSplitter, MarkdownSplitter, CodeSplitter } from '@agent-tools/text-splitter';
// Recursive character splitting (LangChain-style)
const recursiveChunks = await RecursiveCharacterSplitter.split(text, {
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', '. ', ' '],
});
// Token-based splitting with pluggable tokenizer
// (`tiktoken` stands for a caller-supplied tokenizer adapter; tokenizers are not bundled)
const tokenChunks = await TokenSplitter.split(text, {
  chunkSize: 512,
  chunkOverlap: 64,
  tokenizer: tiktoken('cl100k_base'), // or custom tokenizer fn
});
// Semantic splitting using embedding similarity
// (`embedBatch` is a placeholder for whatever embedding call the caller provides)
const semanticChunks = await SemanticSplitter.split(text, {
  embedFn: async (sentences) => embedBatch(sentences), // pluggable embedding function
  similarityThreshold: 0.75,
  maxChunkSize: 1500,
});
// Format-aware: Markdown headers/sections
const markdownChunks = await MarkdownSplitter.split(markdown, {
  chunkSize: 1000,
  headingDepth: 2, // split on ## and above
  preserveMetadata: true, // attach heading path to chunks
});
// Format-aware: Code (respects function/class boundaries)
const codeChunks = await CodeSplitter.split(code, {
  language: 'typescript',
  chunkSize: 1500,
});
// Chunk metadata
// { text: string, startIndex: number, endIndex: number, metadata?: Record<string, unknown> }
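To make the recursive strategy concrete, here is a minimal sketch of one way the RecursiveCharacterSplitter call above could behave. It is not the library's implementation. It tries each separator in order, falls back to a hard cut when no separators remain, then greedily merges the pieces into chunks that share up to chunkOverlap characters. The names splitSpan and recursiveSplit are hypothetical, and chunkSize and chunkOverlap are assumed to be measured in characters.

```typescript
interface Chunk { text: string; startIndex: number; endIndex: number; }
interface Span { start: number; end: number; }

// Cut text[start, end) into contiguous pieces of at most chunkSize characters,
// trying separators in order. A separator stays attached to the piece on its left.
function splitSpan(text: string, span: Span, separators: string[], chunkSize: number): Span[] {
  if (span.end - span.start <= chunkSize) return [span];
  const [sep, ...rest] = separators;
  if (sep === undefined || sep === '') {
    // No usable separator left: fall back to fixed-width cuts.
    const cuts: Span[] = [];
    for (let i = span.start; i < span.end; i += chunkSize) {
      cuts.push({ start: i, end: Math.min(i + chunkSize, span.end) });
    }
    return cuts;
  }
  const pieces: Span[] = [];
  let cursor = span.start;
  while (cursor < span.end) {
    const idx = text.indexOf(sep, cursor);
    const next = idx === -1 || idx + sep.length > span.end ? span.end : idx + sep.length;
    // Pieces that are still too long are split again with the finer separators.
    pieces.push(...splitSpan(text, { start: cursor, end: next }, rest, chunkSize));
    cursor = next;
  }
  return pieces;
}

// Hypothetical stand-in for RecursiveCharacterSplitter.split (synchronous for brevity).
function recursiveSplit(
  text: string,
  opts: { chunkSize: number; chunkOverlap: number; separators: string[] },
): Chunk[] {
  const { chunkSize, chunkOverlap, separators } = opts;
  const pieces = splitSpan(text, { start: 0, end: text.length }, separators, chunkSize);
  const chunks: Chunk[] = [];
  let first = 0; // index of the first piece in the current chunk
  while (first < pieces.length) {
    let last = first;
    // Grow the chunk while the next piece still fits.
    while (last + 1 < pieces.length && pieces[last + 1].end - pieces[first].start <= chunkSize) last++;
    const start = pieces[first].start;
    const end = pieces[last].end;
    chunks.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
    if (last === pieces.length - 1) break;
    // Step back over trailing pieces that fit in the overlap budget, as long as
    // the next chunk can still take in at least one new piece.
    let next = last + 1;
    while (
      next - 1 > first &&
      end - pieces[next - 1].start <= chunkOverlap &&
      pieces[last + 1].end - pieces[next - 1].start <= chunkSize
    ) next--;
    first = next;
  }
  return chunks;
}
```

Working on spans rather than substrings keeps startIndex and endIndex exact, so every chunk satisfies text.slice(startIndex, endIndex) === chunk.text. That is the property the chunk metadata shape above relies on.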
Scope
In scope:
- Recursive character splitting with configurable separators and overlap
- Token-based splitting with pluggable tokenizer interface (tiktoken, gpt-tokenizer, custom)
- Semantic splitting with pluggable embedding function and similarity threshold
- Markdown-aware splitting preserving section hierarchy
- Code-aware splitting respecting AST boundaries (TypeScript, Python, Go, etc.)
- Chunk overlap configuration for all strategies
- Rich chunk metadata (source position, heading path, token count)
- Streaming/async iterator output for large documents
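As an illustration of the semantic strategy listed above, the sketch below groups consecutive sentences while neighbouring embeddings stay similar and the chunk stays under a size cap. It rests on several assumptions: the sentence splitter is a naive regex, similarity is cosine similarity between adjacent sentences only, maxChunkSize is counted in characters, and the names splitSentences, groupBySimilarity, and semanticSplit are hypothetical. A real implementation might compare against a running chunk centroid instead.

```typescript
type EmbedFn = (sentences: string[]) => Promise<number[][]>;

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Naive sentence segmentation: text up to ., ! or ?, plus trailing whitespace.
function splitSentences(text: string): string[] {
  return text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
}

// Start a new chunk whenever adjacent sentences fall below the threshold
// or adding the next sentence would exceed maxChunkSize characters.
function groupBySimilarity(
  sentences: string[],
  vectors: number[][],
  similarityThreshold: number,
  maxChunkSize: number,
): string[] {
  if (sentences.length === 0) return [];
  const chunks: string[] = [];
  let current = sentences[0];
  for (let i = 1; i < sentences.length; i++) {
    const similar = cosine(vectors[i - 1], vectors[i]) >= similarityThreshold;
    const fits = current.length + sentences[i].length <= maxChunkSize;
    if (similar && fits) {
      current += sentences[i];
    } else {
      chunks.push(current);
      current = sentences[i];
    }
  }
  chunks.push(current);
  return chunks;
}

// Hypothetical stand-in for SemanticSplitter.split: one batched embedFn call, then grouping.
async function semanticSplit(
  text: string,
  opts: { embedFn: EmbedFn; similarityThreshold: number; maxChunkSize: number },
): Promise<string[]> {
  const sentences = splitSentences(text);
  const vectors = await opts.embedFn(sentences);
  return groupBySimilarity(sentences, vectors, opts.similarityThreshold, opts.maxChunkSize);
}
```

Keeping the grouping step synchronous and separate from the embedding call means the only provider-specific code is embedFn, which matches the pluggable design described in the scope list.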
Out of scope:
- Built-in embedding models (use @agent-tools/embeddings or any provider)
- Built-in tokenizers (use @agent-tools/tokenizer or any provider)
- Document loading/parsing (use @agent-tools/doc-reader)
- Vector storage (use @agent-tools/vector)
- Full RAG pipeline orchestration
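Because tokenizers are left to the caller, the token-based splitter only needs a narrow contract from them. The sketch below shows one possible shape for that contract and a sliding-window split built on it. The Tokenizer interface, splitByTokens, and charTokenizer are all hypothetical names. charTokenizer is a toy that maps each UTF-16 code unit to one token so the example runs without any dependency; an adapter around tiktoken or gpt-tokenizer would be expected to expose the same two methods.

```typescript
// Assumed minimal contract a pluggable tokenizer would satisfy.
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

interface TokenChunk { text: string; tokenCount: number; }

// Sliding window over the token sequence: each chunk holds at most chunkSize
// tokens and repeats the last chunkOverlap tokens of the previous chunk.
function splitByTokens(
  text: string,
  tokenizer: Tokenizer,
  chunkSize: number,
  chunkOverlap: number,
): TokenChunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('require 0 <= chunkOverlap < chunkSize');
  }
  const tokens = tokenizer.encode(text);
  const stride = chunkSize - chunkOverlap;
  const chunks: TokenChunk[] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    const window = tokens.slice(start, start + chunkSize);
    chunks.push({ text: tokenizer.decode(window), tokenCount: window.length });
    // Stop once this window has reached the end, so no overlap-only tail is emitted.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Toy tokenizer for demonstration only: one token per UTF-16 code unit.
const charTokenizer: Tokenizer = {
  encode: (text) => text.split('').map((ch) => ch.charCodeAt(0)),
  decode: (tokens) => String.fromCharCode(...tokens),
};
```

With this toy tokenizer, splitByTokens('abcdefghij', charTokenizer, 4, 1) yields the chunks 'abcd', 'defg', and 'ghij'. Decoding each window separately assumes the tokenizer can decode arbitrary token slices, which holds for the toy but may produce partial characters at window edges with byte-level tokenizers.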