feat(python-ext): add Python bindings for pplx-unigram#14

Open
mvanhorn wants to merge 1 commit into
perplexityai:mainfrom
mvanhorn:feat/unigram-py-bindings

@mvanhorn

Summary

pplx-unigram is now reachable from Python via pplx_garden.unigram.Engine, so the tokenizer the blog post benchmarks can be called from the same Python serving code that already consumes the rest of pplx_garden. The Rust kernel and its token output are unchanged.

Why this matters

The README presents pplx-unigram alongside fabric-lib as one of two flagship components, and the blog post (Improving Unigram Tokenizer CPU Performance) markets it as a fast CPU tokenizer. Today the crate is only callable from Rust: python-ext/src/lib.rs registers py_cumem, py_p2p_all_to_all, and py_fabric_lib but no py_unigram. Python inference stacks, the dominant consumers of HuggingFace-style tokenizers, have to drop down to Rust to use it, which most users won't do.

For comparison, huggingface/tokenizers (the de facto Rust+Python tokenizer crate) and sentencepiece both ship Python bindings as a first-class entry point. This PR fills that gap for pplx-unigram.

Demo

Simulated demo (the wheel build requires CUDA via the existing cuda-lib/p2p/fabric workspace deps, so end-to-end Python validation cannot run on macOS and is left to upstream):

[Image: simulated demo]

The token list in the Python "after" frame matches the Rust example's output for the same input, because the wrapper is a 1:1 pass-through to pplx_unigram::Engine::encode.

Changes

  • python-ext/src/py_unigram.rs (new) - UnigramEngine and UnigramEncodeState PyO3 classes. from_hf_json(path) / from_hf_json_bytes(bytes) constructors mirror the Rust API. encode(text) allocates a fresh state per call; encode_into(text, state) reuses a pre-allocated state for hot loops.
  • python-ext/src/lib.rs - registers py_unigram::init(m) alongside the existing three.
  • python-ext/Cargo.toml and root Cargo.toml - add pplx-unigram to workspace deps and pull it into python-ext.
  • python/pplx_garden/unigram.py (new) - thin Python facade re-exporting Engine and EncodeState from pplx_garden._rust.
  • tests/unigram/test_python_bindings.py (new) - pytest suite covering vocab_size, XLM-R encoding parity with the Rust example, CJK input, state-reuse, and error mapping. Skips gracefully when UNIGRAM_TOKENIZER_JSON isn't set.
  • docs/unigram.md - adds a "Use from Python" section.

Error mapping: UnsupportedConfig and InvalidConfig from the Rust crate surface as ValueError; other variants surface as RuntimeError.
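
The mapping rule above can be sketched in plain Python. This is an illustration only, not the PyO3 code in py_unigram.rs: the function names `to_python_exception` and `describe_failure` are hypothetical, and only the two variant names and the two exception types come from this PR.

```python
# Illustrative sketch of the error-mapping rule; the real mapping lives in the
# Rust wrapper. Function names here are hypothetical.

# Variants that the PR description says surface as ValueError.
VALUE_ERROR_VARIANTS = frozenset({"UnsupportedConfig", "InvalidConfig"})


def to_python_exception(variant: str, message: str) -> Exception:
    """Return the Python exception a given Rust error variant would map to."""
    if variant in VALUE_ERROR_VARIANTS:
        return ValueError(message)
    # Every other variant falls through to RuntimeError.
    return RuntimeError(message)


def describe_failure(exc: Exception) -> str:
    """Show how a caller might tell the two classes of failure apart."""
    if isinstance(exc, ValueError):
        return "bad tokenizer config: " + str(exc)
    return "tokenizer failure: " + str(exc)
```

A caller would typically treat ValueError as a problem with the tokenizer.json it supplied and RuntimeError as anything else, which is the distinction `describe_failure` draws.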

Testing

The pplx-unigram crate builds clean and the existing encode example still returns the expected token list for "The quick brown fox jumps over the lazy dog." ([581, 63773, 119455, 6, 147797, 88203, 7, 645, 70, 21, 3285, 10269, 5]). The new PyO3 wrapper compiles against pplx-unigram and pyo3 0.27.1 in a standalone check (the full python-ext wheel build needs CUDA, so end-to-end Python import was not run on the contributor side).
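
Once a wheel is available, the parity claim could be checked from Python along these lines. This is a hedged sketch rather than the contents of tests/unigram/test_python_bindings.py: the helper names `load_encoder` and `matches_rust_example` are hypothetical, and it assumes UNIGRAM_TOKENIZER_JSON points at the same tokenizer file the Rust example uses and that encode returns a sequence of integer token IDs.

```python
import os
from typing import Callable, Optional, Sequence

# Input and expected token IDs quoted from the Rust encode example above.
TEXT = "The quick brown fox jumps over the lazy dog."
EXPECTED = [581, 63773, 119455, 6, 147797, 88203, 7, 645, 70, 21, 3285, 10269, 5]


def load_encoder() -> Optional[Callable[[str], Sequence[int]]]:
    """Return Engine.encode if the bindings and a tokenizer file are available."""
    path = os.environ.get("UNIGRAM_TOKENIZER_JSON")
    if not path:
        return None  # nothing to test against, mirroring the graceful skip
    try:
        from pplx_garden.unigram import Engine  # needs the built wheel
    except ImportError:
        return None
    return Engine.from_hf_json(path).encode


def matches_rust_example(encode: Callable[[str], Sequence[int]]) -> bool:
    """Compare an encoder's output for TEXT with the Rust example's token list."""
    return list(encode(TEXT)) == EXPECTED


if __name__ == "__main__":
    encoder = load_encoder()
    if encoder is None:
        print("skipped: bindings or UNIGRAM_TOKENIZER_JSON not available")
    else:
        print("parity with Rust example:", matches_rust_example(encoder))
```

Taking the encoder as a plain callable keeps the comparison usable against either encode or a wrapper around encode_into, without assuming anything further about the binding's surface.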

AI was used for assistance.

Wires the pplx-unigram crate into python-ext alongside the existing
PyO3 modules so Python serving stacks can call the tokenizer directly:

    from pplx_garden.unigram import Engine
    engine = Engine.from_hf_json('tokenizer.json')
    tokens = engine.encode('The quick brown fox jumps over the lazy dog.')

UnigramEngine exposes from_hf_json, from_hf_json_bytes, vocab_size,
encode, and encode_into (state-reuse for hot loops). UnsupportedConfig
and InvalidConfig from the Rust crate surface as ValueError; other
variants surface as RuntimeError.