# Sanskrit Tokenizer

This project implements a custom Byte Pair Encoding (BPE) tokenizer designed specifically for the Sanskrit language. It includes scripts to download a large Sanskrit corpus, train a BPE tokenizer, and perform encoding/decoding operations.
## Features

- Data Acquisition: Automatically downloads and processes the `chronbmm/sanskrit-monolingual-pretraining` dataset from Hugging Face.
- Custom BPE Training: Trains a byte-level BPE tokenizer from scratch on the Sanskrit corpus.
- Efficient Tokenization: Supports encoding text into tokens and decoding tokens back to text.
- Unicode Support: Handles UTF-8 encoding natively, making it suitable for Sanskrit and other languages.
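Byte-level means the tokenizer's base alphabet is the 256 possible byte values rather than characters, so Devanagari needs no special casing. A quick standalone illustration (plain Python, not part of the repository code):

```python
# Byte-level BPE operates on raw UTF-8 bytes, so any script,
# including Devanagari, reduces to ids in the range 0-255.
text = "नमस्ते"
byte_ids = list(text.encode("utf-8"))

# Each Devanagari codepoint occupies 3 bytes in UTF-8:
# 6 characters -> 18 base tokens before any merges.
print(len(text), len(byte_ids))  # 6 18

# Decoding is lossless: the bytes round-trip back to the text.
assert bytes(byte_ids).decode("utf-8") == text
```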
## Project Structure

```
SANSKRIT TOKENIZER/
├── data/                    # Directory for storing the downloaded corpus
├── tokenizer/               # Tokenizer logic and artifacts
│   ├── train_bpe.py         # Script to train the BPE model
│   ├── encode_decode.py     # Functions for encoding and decoding text
│   └── bpe_8k.json          # Trained tokenizer vocabulary (generated)
├── scripts/                 # Utility scripts
│   └── preview_hf_dataset.py
├── download_corpus.py       # Script to download the dataset
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
```
## Installation

1. Prerequisites: Ensure you have Python installed (Python 3.8+ recommended).
2. Install Dependencies: Install the required Python packages using pip:

   ```bash
   pip install -r requirements.txt
   ```
## Usage
### 1. Download the Corpus
First, download the Sanskrit dataset from Hugging Face. This script will save the data to `data/sanskrit51M.txt`.
```bash
python download_corpus.py
```

Note: This uses the `chronbmm/sanskrit-monolingual-pretraining` dataset. The data file is not included in this repository due to size constraints and must be downloaded using the script above.
### 2. Train the Tokenizer

Train the BPE tokenizer on the downloaded corpus. This will generate the `tokenizer/bpe_8k.json` file containing the merge rules.

```bash
python tokenizer/train_bpe.py
```

You can adjust the vocabulary size in `tokenizer/train_bpe.py` by modifying the `DEFAULT_VOCAB` variable (default is 8,000).
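For intuition, the core training loop of byte-level BPE repeatedly merges the most frequent adjacent pair of token ids until the target vocabulary size is reached. The sketch below is illustrative only; the function names are ours and it does not reproduce the actual contents of `train_bpe.py`:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn BPE merge rules until the vocabulary reaches `vocab_size`."""
    ids = list(text.encode("utf-8"))
    merges = {}        # (id, id) -> new token id
    next_id = 256      # ids 0-255 are reserved for raw bytes
    while next_id < vocab_size:
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges

merges = train_bpe("नमस्ते नमस्ते", vocab_size=260)
print(len(merges))  # 4 merge rules learned (260 - 256)
```

In the real script the corpus is far larger and the vocabulary target is the 8,000 mentioned above, but the merge logic follows this same pattern.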
### 3. Encode and Decode Text

You can use the `tokenizer/encode_decode.py` module to tokenize Sanskrit text.

Example usage (create a new Python script or run in an interactive shell):

```python
from tokenizer.encode_decode import encode, decode

text = "नमस्ते"
print(f"Original: {text}")

tokens = encode(text)
print(f"Tokens: {tokens}")

decoded_text = decode(tokens)
print(f"Decoded: {decoded_text}")
```

## Dependencies

- `datasets`
- `huggingface_hub`
- `tqdm`
- `torch` (if used in future extensions)
- `numpy`
- `streamlit` (if a UI is added)
See `requirements.txt` for the full list.