[new tool] @agent-tools/text-splitter — Document chunking with recursive, token, semantic, and format-aware strategies

## Tool Name
`@agent-tools/text-splitter`

## Description
Framework-agnostic text chunking library providing multiple splitting strategies (recursive character, token-based, semantic similarity, and format-aware) for preparing documents for embedding and retrieval pipelines.

## Why It's Useful for Agents
AI agents building RAG pipelines need to split documents into appropriately sized chunks that preserve semantic coherence. The current ecosystem is fragmented. [semantic-chunking](https://github.com/jparkerweb/semantic-chunking) (138 stars, MIT, JS, v2.6.0 Apr 2026) provides similarity-based chunking but is JavaScript-only and coupled to ONNX models. `@langchain/textsplitters`, part of [LangChain.js](https://github.com/langchain-ai/langchainjs) (17.6k stars), offers recursive, token, and markdown splitters but is tightly coupled to the LangChain Document abstraction. No standalone TypeScript-first library combines all of these strategies with pluggable tokenizers and zero framework lock-in.

## Proposed API
```typescript
import {
  RecursiveCharacterSplitter,
  TokenSplitter,
  SemanticSplitter,
  MarkdownSplitter,
  CodeSplitter,
} from '@agent-tools/text-splitter';

// `text`, `markdown`, and `code` are input strings supplied by the caller.

// Recursive character splitting (LangChain-style)
const recursiveChunks = await RecursiveCharacterSplitter.split(text, {
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', '. ', ' '], // tried in order, coarsest first
});

// Token-based splitting with pluggable tokenizer
const tokenChunks = await TokenSplitter.split(text, {
  chunkSize: 512,
  chunkOverlap: 64,
  tokenizer: tiktoken('cl100k_base'), // placeholder adapter, or a custom tokenizer fn
});

// Semantic splitting using embedding similarity
const semanticChunks = await SemanticSplitter.split(text, {
  // `myEmbeddings` is a placeholder for any caller-supplied embedding function
  embedFn: async (sentences) => myEmbeddings(sentences),
  similarityThreshold: 0.75,
  maxChunkSize: 1500,
});

// Format-aware: Markdown headers/sections
const markdownChunks = await MarkdownSplitter.split(markdown, {
  chunkSize: 1000,
  headingDepth: 2, // split on headings of level 2 and above (# and ##)
  preserveMetadata: true, // attach heading path to chunks
});

// Format-aware: Code (respects function/class boundaries)
const codeChunks = await CodeSplitter.split(code, {
  language: 'typescript',
  chunkSize: 1500,
});

// Shape of each chunk:
// { text: string, startIndex: number, endIndex: number, metadata?: Record<string, unknown> }
```
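To make the first call above more concrete, here is a minimal sketch of how recursive character splitting could work while still producing the `startIndex`/`endIndex` offsets from the chunk shape. It is not the proposed implementation: `recursiveSplit` and `atomicSpans` are hypothetical names, `chunkOverlap` is left out for brevity, and separators are assumed to stay attached to the piece they terminate so that offsets remain contiguous.

```typescript
// Chunk shape copied from the "Chunk metadata" comment in the proposal.
interface Chunk {
  text: string;
  startIndex: number;
  endIndex: number;
  metadata?: Record<string, unknown>;
}

type Span = [number, number]; // half-open [start, end) offsets into the source text

// Hypothetical helper: cut [start, end) into contiguous spans of at most
// chunkSize characters, trying separators in order (coarsest first).
function atomicSpans(
  text: string, start: number, end: number,
  chunkSize: number, separators: string[], level: number,
): Span[] {
  if (end - start <= chunkSize) return [[start, end]];
  const sep = separators[level] ?? '';
  const spans: Span[] = [];
  if (sep === '') {
    // No usable separator left: fall back to fixed-width cuts.
    for (let i = start; i < end; i += chunkSize) {
      spans.push([i, Math.min(i + chunkSize, end)]);
    }
    return spans;
  }
  let cursor = start;
  while (cursor < end) {
    const hit = text.indexOf(sep, cursor);
    // Keep the separator with the piece it terminates so spans stay contiguous.
    const pieceEnd = hit === -1 || hit + sep.length > end ? end : hit + sep.length;
    spans.push(...atomicSpans(text, cursor, pieceEnd, chunkSize, separators, level + 1));
    cursor = pieceEnd;
  }
  return spans;
}

// Hypothetical entry point: greedily merge neighbouring spans up to chunkSize.
function recursiveSplit(text: string, chunkSize: number, separators: string[]): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  let end = 0;
  for (const [s, e] of atomicSpans(text, 0, text.length, chunkSize, separators, 0)) {
    if (e - start > chunkSize && end > start) {
      chunks.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
      start = s;
    }
    end = e;
  }
  if (end > start) {
    chunks.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
  }
  return chunks;
}

const demo = recursiveSplit('aaaa\n\nbbbb\n\ncccc cccc', 10, ['\n\n', ' ']);
console.log(demo.map((c) => [c.startIndex, c.endIndex])); // [[0, 6], [6, 12], [12, 21]]
```

Because the spans are contiguous, concatenating the chunk texts reproduces the input exactly. A real implementation would probably also trim whitespace and apply `chunkOverlap`.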

## Scope
**In scope:**
- Recursive character splitting with configurable separators and overlap
- Token-based splitting with pluggable tokenizer interface (tiktoken, gpt-tokenizer, custom)
- Semantic splitting with pluggable embedding function and similarity threshold
- Markdown-aware splitting preserving section hierarchy
- Code-aware splitting respecting AST boundaries (TypeScript, Python, Go, etc.)
- Chunk overlap configuration for all strategies
- Rich chunk metadata (source position, heading path, token count)
- Streaming/async iterator output for large documents
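As an illustration of the pluggable tokenizer and streaming items above, the sketch below slides a fixed-size token window with overlap and yields chunks lazily. Everything here is an assumption, not the proposed API: the `Tokenizer` contract (`encode`/`decode`), the `TokenChunk` fields, and `splitByTokens` are hypothetical, and a synchronous generator stands in for the async iterator to keep the example short.

```typescript
// Assumed tokenizer contract; the proposal only says the tokenizer is pluggable.
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Hypothetical chunk record carrying the token count mentioned under "Rich chunk metadata".
interface TokenChunk {
  text: string;
  tokenCount: number;
  startToken: number;
}

// Yields windows of chunkSize tokens; consecutive windows share chunkOverlap tokens.
function* splitByTokens(
  text: string, tokenizer: Tokenizer, chunkSize: number, chunkOverlap: number,
): Generator<TokenChunk> {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const tokens = tokenizer.encode(text);
  const stride = chunkSize - chunkOverlap;
  for (let start = 0; start < tokens.length; start += stride) {
    const window = tokens.slice(start, start + chunkSize);
    yield { text: tokenizer.decode(window), tokenCount: window.length, startToken: start };
    if (start + chunkSize >= tokens.length) break; // this window reached the end
  }
}

// Toy tokenizer for demonstration only: one token per whitespace-separated word.
function makeWordTokenizer(): Tokenizer {
  const vocab: string[] = [];
  return {
    encode(text: string): number[] {
      return text.split(/\s+/).filter((w) => w.length > 0).map((w) => {
        let id = vocab.indexOf(w);
        if (id === -1) {
          id = vocab.length;
          vocab.push(w);
        }
        return id;
      });
    },
    decode(tokens: number[]): string {
      return tokens.map((t) => vocab[t]).join(' ');
    },
  };
}

for (const chunk of splitByTokens('a b c d e f g', makeWordTokenizer(), 3, 1)) {
  console.log(chunk.startToken, chunk.text); // 0 "a b c", then 2 "c d e", then 4 "e f g"
}
```

A real tokenizer adapter (for example one wrapping tiktoken or gpt-tokenizer) would only need to satisfy the same two-method contract, and an `async function*` variant could yield chunks as a large document is read.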

**Out of scope:**
- Built-in embedding models (use `@agent-tools/embeddings` or any provider)
- Built-in tokenizers (use `@agent-tools/tokenizer` or any provider)
- Document loading/parsing (use `@agent-tools/doc-reader`)
- Vector storage (use `@agent-tools/vector`)
- Full RAG pipeline orchestration
