# Sanskrit Tokenizer

This project implements a custom Byte Pair Encoding (BPE) tokenizer designed specifically for the Sanskrit language. It includes scripts to download a large Sanskrit corpus, train a BPE tokenizer, and perform encoding/decoding operations.
## Features

- Data Acquisition: Automatically downloads and processes the `chronbmm/sanskrit-monolingual-pretraining` dataset from Hugging Face.
- Custom BPE Training: Trains a byte-level BPE tokenizer from scratch on the Sanskrit corpus.
- Efficient Tokenization: Supports encoding text into tokens and decoding tokens back to text.
- Unicode Support: Handles UTF-8 encoding natively, making it suitable for Sanskrit and other languages.
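Byte-level means the tokenizer's base alphabet is the 256 possible byte values rather than characters, so Devanagari needs no special casing. A quick standalone illustration (plain Python, not part of the repository code):

```python
# Byte-level BPE operates on raw UTF-8 bytes, so any script,
# including Devanagari, reduces to ids in the range 0-255.
text = "नमस्ते"
byte_ids = list(text.encode("utf-8"))

# Each Devanagari codepoint occupies 3 bytes in UTF-8:
# 6 characters -> 18 base tokens before any merges.
print(len(text), len(byte_ids))  # 6 18

# Decoding is lossless: the bytes round-trip back to the text.
assert bytes(byte_ids).decode("utf-8") == text
```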
## Project Structure

```
SANSKRIT TOKENIZER/
├── data/                    # Directory for storing the downloaded corpus
├── tokenizer/               # Tokenizer logic and artifacts
│   ├── train_bpe.py         # Script to train the BPE model
│   ├── encode_decode.py     # Functions for encoding and decoding text
│   └── bpe_8k.json          # Trained tokenizer vocabulary (generated)
├── scripts/                 # Utility scripts
│   └── preview_hf_dataset.py
├── download_corpus.py       # Script to download the dataset
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
```
## Installation

1. Prerequisites: Ensure you have Python installed (Python 3.8+ recommended).
2. Install Dependencies: Install the required Python packages using pip:

   ```bash
   pip install -r requirements.txt
   ```
## Usage
### 1. Download the Corpus
First, download the Sanskrit dataset from Hugging Face. This script will save the data to `data/sanskrit51M.txt`.
```bash
python download_corpus.py
```

Note: This uses the `chronbmm/sanskrit-monolingual-pretraining` dataset. The data file is not included in this repository due to size constraints and must be downloaded using the script above.
### 2. Train the Tokenizer

Train the BPE tokenizer on the downloaded corpus. This will generate the `tokenizer/bpe_8k.json` file containing the merge rules.

```bash
python tokenizer/train_bpe.py
```

You can adjust the vocabulary size in `tokenizer/train_bpe.py` by modifying the `DEFAULT_VOCAB` variable (default is 8,000).
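For intuition, the core training loop of byte-level BPE repeatedly merges the most frequent adjacent pair of token ids until the target vocabulary size is reached. The sketch below is illustrative only; the function names are ours and it does not reproduce the actual contents of `train_bpe.py`:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn BPE merge rules until the vocabulary reaches `vocab_size`."""
    ids = list(text.encode("utf-8"))
    merges = {}        # (id, id) -> new token id
    next_id = 256      # ids 0-255 are reserved for raw bytes
    while next_id < vocab_size:
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges

merges = train_bpe("नमस्ते नमस्ते", vocab_size=260)
print(len(merges))  # 4 merge rules learned (260 - 256)
```

In the real script the corpus is far larger and the vocabulary target is the 8,000 mentioned above, but the merge logic follows this same pattern.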
### 3. Encode and Decode Text

You can use the `tokenizer/encode_decode.py` module to tokenize Sanskrit text.

Example usage (create a new Python script or run in an interactive shell):

```python
from tokenizer.encode_decode import encode, decode

text = "नमस्ते"
print(f"Original: {text}")

tokens = encode(text)
print(f"Tokens: {tokens}")

decoded_text = decode(tokens)
print(f"Decoded: {decoded_text}")
```

## Dependencies

- `datasets`
- `huggingface_hub`
- `tqdm`
- `torch` (if used in future extensions)
- `numpy`
- `streamlit` (if a UI is added)
See `requirements.txt` for the full list.