text-prep-lite

Lightweight text preprocessing utilities for Natural Language Processing (NLP), written in TypeScript.

text-prep-lite provides two core helpers:

normalizeText – clean & normalise raw text into a predictable representation.
tokenize – break text into lowercase word tokens.

The library is intentionally dependency-free and suitable for browsers, Node.js, and serverless environments.

Why?

Natural-language data is messy. Before tokenising text or feeding it into an NLP model, you often need to:

normalise case & whitespace
expand contractions ("can't" → "cannot")
strip punctuation / emojis

text-prep-lite does those common steps with zero runtime dependencies.

Installation

npm install text-prep-lite
# or
yarn add text-prep-lite

Usage

import { normalizeText, tokenize } from "text-prep-lite";

const raw = "  I can't believe it's not butter! 🧈  ";

const cleaned = normalizeText(raw, {
  expandContractions: true,
  removePunctuation: true,
  removeEmojis: true,
});
// → "i cannot believe it is not butter"

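// tokenize does not expand contractions: the apostrophe is stripped,
// so "can't" comes out as the two tokens "can" and "t".
const tokens = tokenize(raw);
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]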

API

`normalizeText(input: string, options?: NormalizeOptions): string`

Returns a cleaned version of input.

NormalizeOptions:

Option	Default	Description
`expandContractions`	`false`	Expand contractions for the selected `locale`.
`removePunctuation`	`false`	Strip punctuation characters.
`removeEmojis`	`false`	Remove Unicode emoji characters.
`locale`	`'en'`	BCP-47 language tag for locale-specific rules (see Supported locales).
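To make the effect of each option concrete, here is a minimal sketch of how a function with these options could behave. It is not the library's source: `normalizeSketch`, `SketchOptions`, and the two-entry contraction table are hypothetical names invented for illustration, and lowercasing plus whitespace collapsing are assumed to always apply, as the examples in this README suggest.

```typescript
// Hypothetical stand-in for NormalizeOptions (locale omitted for brevity).
interface SketchOptions {
  expandContractions?: boolean;
  removePunctuation?: boolean;
  removeEmojis?: boolean;
}

// Tiny illustrative table; the library's real table is not shown in this README.
// Keys use the ASCII apostrophe, so curly apostrophes would not match.
const SKETCH_CONTRACTIONS: Record<string, string> = {
  "can't": "cannot",
  "it's": "it is",
};

// Built with the RegExp constructor so the Unicode property escapes are
// resolved at runtime, independent of the TypeScript compile target.
const SKETCH_EMOJI_RE = new RegExp("\\p{Extended_Pictographic}", "gu");
const SKETCH_PUNCT_RE = new RegExp("\\p{P}", "gu"); // Unicode punctuation category

function normalizeSketch(input: string, options: SketchOptions = {}): string {
  let text = input.toLowerCase();

  // Expand before stripping punctuation: the apostrophe is needed for matching.
  if (options.expandContractions) {
    for (const key of Object.keys(SKETCH_CONTRACTIONS)) {
      // Naive substring replacement; a real implementation would respect word boundaries.
      text = text.split(key).join(SKETCH_CONTRACTIONS[key]);
    }
  }
  if (options.removeEmojis) {
    text = text.replace(SKETCH_EMOJI_RE, " ");
  }
  if (options.removePunctuation) {
    text = text.replace(SKETCH_PUNCT_RE, " ");
  }

  // Collapse runs of whitespace and trim the ends.
  return text.replace(/\s+/g, " ").trim();
}
```

With all three options enabled, `normalizeSketch("  I can't believe it's not butter! 🧈  ", { ... })` returns `"i cannot believe it is not butter"`, the same result the Usage section shows for `normalizeText`.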

Supported locales

en – English (default)
sq – Albanian
fr – French
de – German
he – Hebrew
es – Spanish
zh – Chinese (Mandarin)
yue – Chinese (Cantonese)

// French example
normalizeText("C'est incroyable!", { expandContractions: true, locale: "fr" });
// → "ce est incroyable!"  (punctuation kept in this call)
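One plausible way to organise locale-specific rules is a contraction table per language, selected by the primary subtag of the BCP-47 tag. The sketch below illustrates that idea only; `LOCALE_TABLES_SKETCH` and `expandForLocaleSketch` are hypothetical names, the table entries are examples, and the fallback to English for unknown locales is an assumption rather than documented behaviour.

```typescript
type ContractionTableSketch = Record<string, string>;

// Illustrative entries only; real tables would be far larger.
const LOCALE_TABLES_SKETCH: Record<string, ContractionTableSketch> = {
  en: { "can't": "cannot", "it's": "it is" },
  fr: { "c'est": "ce est" },
};

function expandForLocaleSketch(input: string, locale: string = "en"): string {
  // Reduce a tag such as "fr-CA" to its primary language subtag "fr".
  const primary = locale.toLowerCase().split("-")[0];

  // Assumed behaviour: unknown locales fall back to the English table.
  const hasTable = Object.prototype.hasOwnProperty.call(LOCALE_TABLES_SKETCH, primary);
  const table = hasTable ? LOCALE_TABLES_SKETCH[primary] : LOCALE_TABLES_SKETCH["en"];

  let text = input.toLowerCase();
  for (const key of Object.keys(table)) {
    // Naive substring replacement, kept simple for illustration.
    text = text.split(key).join(table[key]);
  }
  return text;
}
```

Under these assumptions, `expandForLocaleSketch("C'est incroyable!", "fr")` returns `"ce est incroyable!"`, matching the French example above.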

`tokenize(input: string): string[]`

Converts text to lowercase.
Removes punctuation & emojis.
Splits by whitespace / word boundaries.

Returns an array of tokens.

tokenize takes no options: it always lowercases, strips punctuation & emojis, and splits the result into word tokens.
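The three steps can be sketched in a few lines. This is a hypothetical re-implementation for illustration (`tokenizeSketch` is not part of the package), and it assumes that punctuation and emojis are replaced by a space rather than deleted, which is what the `"can", "t"` output in the Usage section suggests.

```typescript
// Anything that is not a letter, a number, or whitespace (punctuation, emojis, symbols).
// Built with the RegExp constructor so the property escapes are resolved at runtime.
const TOKEN_STRIP_RE = new RegExp("[^\\p{L}\\p{N}\\s]", "gu");

function tokenizeSketch(input: string): string[] {
  const cleaned = input
    .toLowerCase()                 // 1. lowercase
    .replace(TOKEN_STRIP_RE, " "); // 2. strip punctuation & emojis (replaced by a space)

  // 3. split on whitespace and drop empty strings from leading/trailing gaps
  return cleaned.split(/\s+/).filter((token) => token.length > 0);
}
```

For the input from the Usage section, `tokenizeSketch("  I can't believe it's not butter! 🧈  ")` returns `["i", "can", "t", "believe", "it", "s", "not", "butter"]`.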

🔗 Related

👉 Need word embeddings for semantic analysis?
Check out wink-embeddings-small-en-50d
👉 Need a simple and robust PDF text extraction utility with a quality interface? Check out [pdf-worker-package](https://www.npmjs.com/package/pdf-worker-package)

Development

# run tests
npm test

# build library
npm run build

License

Contributing

Fork & clone the repo
npm i
npm test – run lint & unit tests
Submit a pull request 🚀

Please add tests for any new feature or bug-fix.

