# Sanskrit Tokenizer

This project implements a custom Byte Pair Encoding (BPE) tokenizer specifically designed for the Sanskrit language. It includes scripts to download a large Sanskrit corpus, train a BPE tokenizer, and perform encoding/decoding operations.

## Features

- **Data Acquisition:** Automatically downloads and processes the `chronbmm/sanskrit-monolingual-pretraining` dataset from Hugging Face.
- **Custom BPE Training:** Trains a byte-level BPE tokenizer from scratch on the Sanskrit corpus.
- **Efficient Tokenization:** Supports encoding text into tokens and decoding tokens back to text.
- **Unicode Support:** Handles UTF-8 encoding natively, making it suitable for Sanskrit and other languages.
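Because the tokenizer is byte-level, Devanagari needs no special preprocessing: text is first mapped to raw UTF-8 bytes (token ids 0–255), and BPE merges are learned on top of those. A quick illustration of the byte layer:

```python
text = "नमस्ते"
ids = list(text.encode("utf-8"))  # each Devanagari code point is 3 UTF-8 bytes

# 6 code points -> 18 base tokens before any BPE merges are applied
assert len(ids) == 18 and all(0 <= i < 256 for i in ids)

# Decoding reverses the mapping losslessly
decoded = bytes(ids).decode("utf-8")
assert decoded == text
```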

## Project Structure

```
SANSKRIT-TOKENIZER/
├── data/                   # Directory for storing the downloaded corpus
├── tokenizer/              # Tokenizer logic and artifacts
│   ├── train_bpe.py        # Script to train the BPE model
│   ├── encode_decode.py    # Functions for encoding and decoding text
│   └── bpe_8k.json         # Trained tokenizer vocabulary (generated)
├── scripts/                # Utility scripts
│   └── preview_hf_dataset.py
├── download_corpus.py      # Script to download the dataset
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```

## Setup & Installation

1. **Prerequisites:** Ensure you have Python installed (Python 3.8+ recommended).

2. **Install Dependencies:** Install the required Python packages with pip:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### 1. Download the Corpus

First, download the Sanskrit dataset from Hugging Face. This script will save the data to `data/sanskrit51M.txt`.

```bash
python download_corpus.py
```

**Note:** This uses the `chronbmm/sanskrit-monolingual-pretraining` dataset. The data file is not included in this repository due to its size and must be downloaded with the script above.
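The script's internals aren't shown here, but the corpus-writing step might look like the sketch below. The `text` column name is an assumption; check the dataset schema before relying on it.

```python
def save_corpus(rows, path, field="text"):
    """Write one record per line to a UTF-8 text file.

    `field` is an assumed column name, not confirmed by the project.
    """
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(row[field] + "\n")

# The real script would stream the Hugging Face dataset, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("chronbmm/sanskrit-monolingual-pretraining", split="train")
#   save_corpus(ds, "data/sanskrit51M.txt")
```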

### 2. Train the Tokenizer

Train the BPE tokenizer on the downloaded corpus. This generates `tokenizer/bpe_8k.json`, which contains the learned merge rules.

```bash
python tokenizer/train_bpe.py
```

You can adjust the vocabulary size by modifying the `DEFAULT_VOCAB` variable in `tokenizer/train_bpe.py` (default: 8,000).
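The training loop follows the standard BPE recipe: count adjacent token pairs, merge the most frequent pair into a new token id, and repeat until the vocabulary is full. A self-contained toy version (an illustrative sketch, not the project's actual implementation) looks like this:

```python
from collections import Counter

def train_bpe(text, vocab_size):
    """Toy byte-level BPE trainer: ids 0-255 are raw bytes; each merge
    adds one new token id until the vocabulary reaches `vocab_size`."""
    ids = list(text.encode("utf-8"))
    merges = {}           # (left_id, right_id) -> new token id
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges[best] = next_id
        out, i = [], 0    # rewrite the corpus with the merged token
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids, next_id = out, next_id + 1
    return merges

merges = train_bpe("abab abab", vocab_size=258)
# learns two merges: ('a','b') -> 256, then (256, 256) -> 257
```

With a byte-level base vocabulary of 256 and `DEFAULT_VOCAB = 8000`, the real trainer would learn on the order of 8000 − 256 = 7744 merges over the full corpus.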

### 3. Encode and Decode Text

Use the `tokenizer/encode_decode.py` module to tokenize Sanskrit text.

Example usage (in a new Python script or an interactive shell):

```python
from tokenizer.encode_decode import encode, decode

text = "नमस्ते"
print(f"Original: {text}")

tokens = encode(text)
print(f"Tokens: {tokens}")

decoded_text = decode(tokens)
print(f"Decoded: {decoded_text}")
```

## Requirements

- `datasets`
- `huggingface_hub`
- `tqdm`
- `torch` (if used in future extensions)
- `numpy`
- `streamlit` (if a UI is added)

See `requirements.txt` for the full list.
