[new tool] @agent-tools/text-splitter — Document chunking with recursive, token, semantic, and format-aware strategies #220

@burner-agent

Tool Name

@agent-tools/text-splitter

Description

Framework-agnostic text chunking library providing multiple splitting strategies (recursive character, token-based, semantic similarity, and format-aware) for preparing documents for embedding and retrieval pipelines.

Why It's Useful for Agents

AI agents building RAG pipelines need to split documents into appropriately sized chunks that preserve semantic coherence. The current ecosystem is fragmented. The semantic-chunking package (138 stars, MIT, JS, v2.6.0, Apr 2026) provides similarity-based chunking but is JavaScript-only and coupled to ONNX models. @langchain/textsplitters (part of LangChain.js, 17.6k stars) offers recursive, token, and Markdown splitters but is tightly coupled to the LangChain Document abstraction. No standalone TypeScript-first library combines all of these strategies with pluggable tokenizers and zero framework lock-in.

Proposed API

import {
  RecursiveCharacterSplitter,
  TokenSplitter,
  SemanticSplitter,
  MarkdownSplitter,
  CodeSplitter,
} from '@agent-tools/text-splitter';

// `text`, `markdown`, and `code` are input strings supplied by the caller.

// Recursive character splitting (LangChain-style)
const recursiveChunks = await RecursiveCharacterSplitter.split(text, {
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', '. ', ' '],
});

// Token-based splitting with pluggable tokenizer
const tokenChunks = await TokenSplitter.split(text, {
  chunkSize: 512,
  chunkOverlap: 64,
  tokenizer: tiktoken('cl100k_base'), // placeholder tokenizer adapter, or a custom tokenizer fn
});

// Semantic splitting using embedding similarity
const semanticChunks = await SemanticSplitter.split(text, {
  // Pluggable embedding function; `embed` is a placeholder that returns one vector per sentence
  embedFn: async (sentences) => embed(sentences),
  similarityThreshold: 0.75,
  maxChunkSize: 1500,
});

// Format-aware: Markdown headers/sections
const markdownChunks = await MarkdownSplitter.split(markdown, {
  chunkSize: 1000,
  headingDepth: 2, // split on ## and above
  preserveMetadata: true, // attach heading path to chunks
});

// Format-aware: Code (respects function/class boundaries)
const codeChunks = await CodeSplitter.split(code, {
  language: 'typescript',
  chunkSize: 1500,
});

// Shape of each returned chunk
// { text: string, startIndex: number, endIndex: number, metadata?: Record<string, unknown> }
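To make the recursive strategy and the chunk shape concrete, here is a minimal sketch of how such a splitter could work. It is not the proposed implementation: `Chunk`, `RecursiveOptions`, `splitSpans`, and `recursiveSplit` are hypothetical names, sizes are measured in characters, `endIndex` is assumed to be exclusive, and overlap is applied at the granularity of whole pieces rather than exact character counts.

```typescript
// Hypothetical chunk shape, mirroring the chunk metadata comment in the proposal.
interface Chunk {
  text: string;
  startIndex: number;
  endIndex: number; // assumed exclusive
  metadata?: Record<string, unknown>;
}

interface RecursiveOptions {
  chunkSize: number;
  chunkOverlap: number;
  separators: string[];
}

type Span = [number, number];

// Break text[start, end) into spans of at most `max` characters, trying each
// separator in order and falling back to a hard cut when none are left.
// Separators stay attached to the preceding piece so spans remain contiguous.
function splitSpans(text: string, start: number, end: number, seps: string[], max: number): Span[] {
  if (end - start <= max) return [[start, end]];
  const out: Span[] = [];
  if (seps.length === 0 || seps[0] === '') {
    for (let i = start; i < end; i += max) out.push([i, Math.min(i + max, end)]);
    return out;
  }
  const sep = seps[0];
  const rest = seps.slice(1);
  let pos = start;
  while (pos < end) {
    const hit = text.indexOf(sep, pos);
    const pieceEnd = hit === -1 || hit + sep.length > end ? end : hit + sep.length;
    out.push(...splitSpans(text, pos, pieceEnd, rest, max));
    pos = pieceEnd;
  }
  return out;
}

function recursiveSplit(text: string, opts: RecursiveOptions): Chunk[] {
  const { chunkSize, chunkOverlap, separators } = opts;
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  if (text.length === 0) return [];
  const spans = splitSpans(text, 0, text.length, separators, chunkSize);
  const chunks: Chunk[] = [];
  let i = 0;
  while (i < spans.length) {
    const start = spans[i][0];
    // Greedily merge following pieces while the chunk still fits.
    let j = i;
    while (j + 1 < spans.length && spans[j + 1][1] - start <= chunkSize) j++;
    const end = spans[j][1];
    chunks.push({ text: text.slice(start, end), startIndex: start, endIndex: end });
    if (j === spans.length - 1) break;
    // Step back over trailing pieces that fit in the overlap budget,
    // always advancing by at least one piece so the loop terminates.
    let next = j + 1;
    while (next - 1 > i && end - spans[next - 1][0] <= chunkOverlap) next--;
    i = next;
  }
  return chunks;
}
```

Tracking index spans instead of copied strings keeps `startIndex` and `endIndex` exact, so `text.slice(startIndex, endIndex)` always reproduces the chunk text.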

Scope

In scope:

  • Recursive character splitting with configurable separators and overlap
  • Token-based splitting with pluggable tokenizer interface (tiktoken, gpt-tokenizer, custom)
  • Semantic splitting with pluggable embedding function and similarity threshold
  • Markdown-aware splitting preserving section hierarchy
  • Code-aware splitting respecting AST boundaries (TypeScript, Python, Go, etc.)
  • Chunk overlap configuration for all strategies
  • Rich chunk metadata (source position, heading path, token count)
  • Streaming/async iterator output for large documents
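As an illustration of the semantic strategy listed above, the sketch below groups adjacent sentences and starts a new chunk when the cosine similarity between neighbouring sentence embeddings falls below the threshold. This is one plausible reading of `similarityThreshold`, not a confirmed design. `EmbedFn`, `cosineSimilarity`, `groupBySimilarity`, and `semanticSplit` are hypothetical names, the sentence segmentation is a naive regex, and `maxChunkSize` is assumed to be in characters.

```typescript
// Assumed contract: one embedding vector per input sentence, in order.
type EmbedFn = (sentences: string[]) => Promise<number[][]>;

interface SemanticOptions {
  embedFn: EmbedFn;
  similarityThreshold: number;
  maxChunkSize: number; // characters (assumption)
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Synchronous core: merge consecutive sentences until similarity to the
// previous sentence drops below the threshold or the chunk would grow too long.
// A single sentence longer than maxChunkSize is kept as its own oversized chunk.
function groupBySimilarity(
  sentences: string[],
  vectors: number[][],
  threshold: number,
  maxChunkSize: number,
): string[] {
  const chunks: string[] = [];
  let current = '';
  for (let i = 0; i < sentences.length; i++) {
    const topicShift = i > 0 && cosineSimilarity(vectors[i - 1], vectors[i]) < threshold;
    const tooLong = current.length + 1 + sentences[i].length > maxChunkSize;
    if (current !== '' && (topicShift || tooLong)) {
      chunks.push(current);
      current = '';
    }
    current = current === '' ? sentences[i] : `${current} ${sentences[i]}`;
  }
  if (current !== '') chunks.push(current);
  return chunks;
}

async function semanticSplit(text: string, opts: SemanticOptions): Promise<string[]> {
  // Naive segmentation: break after ., !, or ? followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.trim() !== '');
  if (sentences.length === 0) return [];
  const vectors = await opts.embedFn(sentences);
  if (vectors.length !== sentences.length) {
    throw new Error('embedFn must return one vector per sentence');
  }
  return groupBySimilarity(sentences, vectors, opts.similarityThreshold, opts.maxChunkSize);
}
```

Keeping the grouping logic separate from the asynchronous embedding call means the boundary detection can be tested with fixed vectors, without any embedding provider.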

Out of scope:

  • Built-in embedding models (use @agent-tools/embeddings or any provider)
  • Built-in tokenizers (use @agent-tools/tokenizer or any provider)
  • Document loading/parsing (use @agent-tools/doc-reader)
  • Vector storage (use @agent-tools/vector)
  • Full RAG pipeline orchestration
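Since tokenizers are out of scope, the package would only need to define the contract a provider must satisfy. The sketch below shows one possible shape for that contract and a sliding-window token splitter built on it. The `Tokenizer` interface, `tokenSplit`, and `codePointTokenizer` are hypothetical; the code-point tokenizer is a toy stand-in for illustration, and a real BPE tokenizer would be wrapped in an adapter exposing the same two methods.

```typescript
// Hypothetical tokenizer contract: any provider that can round-trip
// text through integer token ids can be plugged in.
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

interface TokenOptions {
  chunkSize: number; // in tokens
  chunkOverlap: number; // in tokens
  tokenizer: Tokenizer;
}

// Toy stand-in: one token per Unicode code point. Not a real tokenizer.
const codePointTokenizer: Tokenizer = {
  encode: (text) => Array.from(text, (ch) => ch.codePointAt(0) as number),
  decode: (tokens) => String.fromCodePoint(...tokens),
};

// Slide a window of `chunkSize` tokens over the input, advancing by
// `chunkSize - chunkOverlap` tokens each step.
function tokenSplit(text: string, opts: TokenOptions): string[] {
  const { chunkSize, chunkOverlap, tokenizer } = opts;
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const tokens = tokenizer.encode(text);
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokenizer.decode(tokens.slice(start, start + chunkSize)));
    // Stop once the window has reached the end, to avoid a trailing
    // chunk that only repeats the overlap.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

With the toy tokenizer, `tokenSplit('abcdefghij', { chunkSize: 4, chunkOverlap: 1, tokenizer: codePointTokenizer })` yields `['abcd', 'defg', 'ghij']`, with each chunk sharing one token with its predecessor.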

    Labels

    help wanted: Extra attention is needed
    infrastructure: CI, workflows, build tooling
    new-tool: Proposal for a new tool package
    tier:autonomy: Tier 3 (self-extension, shell, orchestration)
