feat(python-ext): add Python bindings for pplx-unigram #14
Open
mvanhorn wants to merge 1 commit into
Wires the pplx-unigram crate into python-ext alongside the existing
PyO3 modules so Python serving stacks can call the tokenizer directly:
    from pplx_garden.unigram import Engine
    engine = Engine.from_hf_json('tokenizer.json')
    tokens = engine.encode('The quick brown fox jumps over the lazy dog.')
UnigramEngine exposes from_hf_json, from_hf_json_bytes, vocab_size,
encode, and encode_into (state-reuse for hot loops). UnsupportedConfig
and InvalidConfig from the Rust crate surface as ValueError; other
variants surface as RuntimeError.
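A caller-side sketch of the two behaviours described above: reusing one state across a hot loop, and telling configuration errors apart from other failures. Both helper names (encode_many, classify_failure) are hypothetical and not part of the PR. The sketch also assumes that encode_into(text, state) returns the token ids for the text, which the commit message does not confirm.

```python
from typing import Any, Iterable, List


def encode_many(engine: Any, state: Any, texts: Iterable[str]) -> List[List[int]]:
    """Encode each text while reusing a single pre-allocated state.

    ASSUMPTION (not confirmed by the PR): engine.encode_into(text, state)
    returns the token ids for `text` as an iterable of ints.
    """
    batches: List[List[int]] = []
    for text in texts:
        tokens = engine.encode_into(text, state)
        # Copy, in case the binding hands back a buffer it will overwrite.
        batches.append(list(tokens))
    return batches


def classify_failure(exc: BaseException) -> str:
    """Hypothetical helper: bucket an exception by the documented mapping."""
    if isinstance(exc, ValueError):
        return "config"   # UnsupportedConfig / InvalidConfig
    if isinstance(exc, RuntimeError):
        return "runtime"  # any other Rust error variant
    return "unexpected"
```

With the real binding this would presumably be called with an Engine and an EncodeState instance; how the state is constructed is not shown in the PR, so it is passed in rather than created here.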
Summary
pplx-unigram is now reachable from Python via pplx_garden.unigram.Engine, so the tokenizer the blog post benchmarks can be called from the same Python serving code that already consumes the rest of pplx_garden. The Rust kernel and its token output are unchanged.
Why this matters
The README ships pplx-unigram alongside fabric-lib as one of two flagship components, and the blog post (Improving Unigram Tokenizer CPU Performance) markets it as a fast CPU tokenizer. Today the crate is only callable from Rust:
python-ext/src/lib.rs registers py_cumem, py_p2p_all_to_all, and py_fabric_lib but no py_unigram. A Python inference stack - which is the dominant consumer of HuggingFace-style tokenizers - has to drop down to Rust to use it, which most users won't do.
For comparison, huggingface/tokenizers (the de-facto Rust+Python tokenizer crate) and sentencepiece both ship Python bindings as a first-class entry point. This PR fills that gap for pplx-unigram.
Demo
Simulated demo (the wheel build requires CUDA via the existing cuda-lib/p2p/fabric workspace deps, so end-to-end Python validation runs upstream of macOS):
The token list in the Python "after" frame matches the Rust example's output for the same input - the wrapper is a 1:1 pass-through to pplx_unigram::Engine::encode.
Changes
python-ext/src/py_unigram.rs (new) - UnigramEngine and UnigramEncodeState PyO3 classes. from_hf_json(path) / from_hf_json_bytes(bytes) constructors mirror the Rust API. encode(text) allocates a fresh state per call; encode_into(text, state) reuses a pre-allocated state for hot loops.
python-ext/src/lib.rs - registers py_unigram::init(m) alongside the existing three.
python-ext/Cargo.toml and root Cargo.toml - add pplx-unigram to workspace deps and pull it into python-ext.
python/pplx_garden/unigram.py (new) - thin Python facade re-exporting Engine and EncodeState from pplx_garden._rust.
tests/unigram/test_python_bindings.py (new) - pytest suite covering vocab_size, XLM-R encoding parity with the Rust example, CJK input, state-reuse, and error mapping. Skips gracefully when UNIGRAM_TOKENIZER_JSON isn't set.
docs/unigram.md - adds a "Use from Python" section.
Error mapping: UnsupportedConfig and InvalidConfig from the Rust crate surface as ValueError; other variants surface as RuntimeError.
Testing
The pplx-unigram crate builds clean and the existing encode example still returns the expected token list for "The quick brown fox jumps over the lazy dog." ([581, 63773, 119455, 6, 147797, 88203, 7, 645, 70, 21, 3285, 10269, 5]). The new PyO3 wrapper compiles against pplx-unigram and pyo3 0.27.1 in a standalone check (the full python-ext wheel build needs CUDA, so end-to-end Python import was not run on the contributor side).
AI was used for assistance.
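For reviewers who can build the wheel, the parity check and the skip-when-unset behaviour described under Changes could be sketched roughly as below. The PR's own suite uses pytest; this stand-in uses the standard library's unittest so that it runs without extra dependencies. The helper load_engine and the class name ParityTest are hypothetical, and the sketch assumes encode returns an iterable of token ids.

```python
import os
import unittest

# Environment variable named in the PR description.
TOKENIZER_ENV = "UNIGRAM_TOKENIZER_JSON"

# Expected ids quoted in the Testing section for the Rust encode example.
EXPECTED = [581, 63773, 119455, 6, 147797, 88203, 7, 645, 70, 21, 3285, 10269, 5]


def load_engine():
    """Hypothetical helper: build an Engine or skip if prerequisites are missing."""
    path = os.environ.get(TOKENIZER_ENV)
    if not path:
        raise unittest.SkipTest(f"{TOKENIZER_ENV} is not set")
    try:
        from pplx_garden.unigram import Engine
    except ImportError as exc:
        raise unittest.SkipTest(f"pplx_garden is not importable: {exc}")
    return Engine.from_hf_json(path)


class ParityTest(unittest.TestCase):
    def test_matches_rust_example(self):
        engine = load_engine()
        tokens = engine.encode("The quick brown fox jumps over the lazy dog.")
        # ASSUMPTION: encode returns an iterable of ints.
        self.assertEqual(list(tokens), EXPECTED)
```

Run with UNIGRAM_TOKENIZER_JSON pointing at the same tokenizer.json the Rust example uses; without it, the test is reported as skipped rather than failed.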