Lightweight text preprocessing utilities for Natural Language Processing (NLP) written in TypeScript.
text-prep-lite provides two core helpers:
- normalizeText – clean & normalise raw text into a predictable representation.
- tokenize – break text into lowercase word tokens.
The library is intentionally dependency-free and suitable for browsers, Node.js, and serverless environments.
Natural-language data is messy. Before tokenisation or feeding text into an NLP model you often need to:
- normalise case & whitespace
- expand contractions ("can't" → "cannot")
- strip punctuation / emojis
text-prep-lite does those common steps with zero runtime dependencies.
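To make the three steps above concrete, here is a minimal sketch of what such a pipeline can look like in plain TypeScript. It is not the library's source code: the function name `sketchNormalize` and the two-entry contraction table are assumptions made for illustration only.

```typescript
// Illustrative only: NOT text-prep-lite's implementation.
// Hypothetical, deliberately tiny contraction table (English only).
const SKETCH_CONTRACTIONS: { [key: string]: string } = {
  "can't": "cannot",
  "it's": "it is",
};

// Built from a string so the Unicode property escapes are resolved at runtime.
// Matches punctuation and pictographic emoji; emoji modifiers such as
// variation selectors are not handled in this sketch.
const SKETCH_STRIP = new RegExp("[\\p{P}\\p{Extended_Pictographic}]", "gu");

function sketchNormalize(text: string): string {
  // 1. Normalise case and whitespace.
  const lowered = text.toLowerCase().replace(/\s+/g, " ").trim();

  // 2. Expand contractions word by word, using the table above.
  const expanded = lowered
    .split(" ")
    .map((word) =>
      Object.prototype.hasOwnProperty.call(SKETCH_CONTRACTIONS, word)
        ? SKETCH_CONTRACTIONS[word]
        : word
    )
    .join(" ");

  // 3. Strip punctuation and emojis, then tidy any leftover spaces.
  return expanded.replace(SKETCH_STRIP, "").replace(/\s+/g, " ").trim();
}

console.log(sketchNormalize(" I can't believe it's not butter! 🧈 "));
// → "i cannot believe it is not butter"
```

Contractions are expanded before punctuation is stripped, because the apostrophe is what identifies a contraction in the first place.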
npm install text-prep-lite
# or
yarn add text-prep-lite

import { normalizeText, tokenize } from "text-prep-lite";
const raw = " I can't believe it's not butter! 🧈 ";
const cleaned = normalizeText(raw, {
expandContractions: true,
removePunctuation: true,
removeEmojis: true,
});
// → "i cannot believe it is not butter"
const tokens = tokenize(raw);
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]

normalizeText returns a cleaned version of input.
NormalizeOptions:
| Option | Default | Description |
|---|---|---|
| expandContractions | false | Expand contractions for the selected locale. |
| removePunctuation | false | Strip punctuation characters. |
| removeEmojis | false | Remove Unicode emoji characters. |
| locale | 'en' | BCP-47 language tag for locale-specific rules (currently: en, sq, fr, de, he). |
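
The option names and defaults in the table can be read as a typed options object. The sketch below shows one way such defaults might be merged with caller overrides. The names `SketchNormalizeOptions`, `SKETCH_DEFAULTS`, and `resolveOptions` are hypothetical, and the type actually exported by text-prep-lite may differ.

```typescript
// Hypothetical shape inferred from the options table above.
interface SketchNormalizeOptions {
  expandContractions: boolean;
  removePunctuation: boolean;
  removeEmojis: boolean;
  locale: string;
}

// Defaults copied from the "Default" column of the table.
const SKETCH_DEFAULTS: SketchNormalizeOptions = {
  expandContractions: false,
  removePunctuation: false,
  removeEmojis: false,
  locale: "en",
};

// Fill in every option the caller left out with its documented default.
function resolveOptions(
  overrides: Partial<SketchNormalizeOptions> = {}
): SketchNormalizeOptions {
  return {
    expandContractions:
      overrides.expandContractions !== undefined
        ? overrides.expandContractions
        : SKETCH_DEFAULTS.expandContractions,
    removePunctuation:
      overrides.removePunctuation !== undefined
        ? overrides.removePunctuation
        : SKETCH_DEFAULTS.removePunctuation,
    removeEmojis:
      overrides.removeEmojis !== undefined
        ? overrides.removeEmojis
        : SKETCH_DEFAULTS.removeEmojis,
    locale:
      overrides.locale !== undefined ? overrides.locale : SKETCH_DEFAULTS.locale,
  };
}

const frenchOptions = resolveOptions({ expandContractions: true, locale: "fr" });
console.log(frenchOptions.removePunctuation); // → false
console.log(frenchOptions.locale); // → "fr"
```

Because every flag defaults to false, a call that passes no options leaves all of the optional steps switched off.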
Supported locales
- en – English (default)
- sq – Albanian
- fr – French
- de – German
- he – Hebrew
- es – Spanish
- zh – Chinese (Mandarin)
- yue – Chinese (Cantonese)
// French example
normalizeText("C'est incroyable!", { expandContractions: true, locale: "fr" });
// → "ce est incroyable!" (punctuation kept in this call)

tokenize:
- Converts text to lowercase.
- Removes punctuation & emojis.
- Splits by whitespace / word boundaries.
Returns an array of tokens.
tokenize has no options – it always lowercases, strips punctuation & emojis, and splits on whitespace.
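
The behaviour described above can be approximated in a few lines. This is a sketch under stated assumptions, not the library's tokenizer: `sketchTokenize` is a hypothetical name, and treating every run of non-letter, non-digit characters as a separator is one possible reading of "strips punctuation & emojis, and splits on whitespace".

```typescript
// Illustrative only: NOT text-prep-lite's implementation.
// Any run of characters that is neither a letter nor a number is treated as a
// separator. Built from a string so the Unicode escapes are resolved at runtime.
// Combining marks are not covered by \p{L}, so decomposed accents would split words.
const SKETCH_SEPARATORS = new RegExp("[^\\p{L}\\p{N}]+", "gu");

function sketchTokenize(text: string): string[] {
  return text
    .toLowerCase() // always lowercase
    .replace(SKETCH_SEPARATORS, " ") // punctuation and emojis become spaces
    .trim()
    .split(" ")
    .filter((token) => token.length > 0); // drop the empty token from "" input
}

console.log(sketchTokenize(" I can't believe it's not butter! 🧈 "));
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]
```

Replacing punctuation with a space instead of deleting it is what splits "can't" into "can" and "t", which matches the tokens shown in the usage example.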
- 👉 Need word embeddings for semantic analysis? Check out wink-embeddings-small-en-50d
- 👉 Need a simple and robust PDF text extraction utility with a quality interface? Check out [pdf-worker-package](https://www.npmjs.com/package/pdf-worker-package)
# run tests
npm test
# build library
npm run build

MIT © Cavani21/thegreatbey
- Fork & clone the repo
- npm i
- npm test – run lint & unit tests
- Submit pull-request 🚀
Please add tests for any new feature or bug-fix.