pmole

Compress an entire codebase into a single binary .pm file. Works on text and binary files. Supports encryption, in-archive search, integrity verification, and single-file extraction without full decompression.

Installation

pip install git+https://github.com/ramsy0dev/pmole

or from source:

git clone https://github.com/ramsy0dev/pmole --depth=1
cd pmole
pip install -e .

Library usage

Compress

import pmole

# Auto-selects the best algorithm per file (default)
archive = pmole.compress("./myproject")                        # → "myproject.pm"
archive = pmole.compress("./myproject", output="out.pm")

# Force a specific algorithm for every file
archive = pmole.compress("./myproject", algo="lzma")           # best ratio, slower
archive = pmole.compress("./myproject", algo="zlib")           # fast, good ratio
archive = pmole.compress("./myproject", algo="lzw")            # pure LZW

# Control parallelism (default threads=3)
archive = pmole.compress("./myproject", threads=8)

# Encrypt with AES-256-GCM
archive = pmole.compress("./myproject", password="s3cr3t")
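The `threads` option suggests that files are compressed independently on a worker pool. The following is only a sketch of that idea using the standard library, not pmole's actual implementation; `compress_many` and its in-memory `blobs` argument are hypothetical names chosen for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_many(blobs: dict[str, bytes], threads: int = 3) -> dict[str, bytes]:
    """Compress each blob with zlib on a thread pool (hypothetical helper).

    Keys are file paths, values are raw file contents.
    """
    paths = list(blobs)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so zip() pairs each path
        # with its own compressed payload.
        payloads = pool.map(zlib.compress, (blobs[p] for p in paths))
        return dict(zip(paths, payloads))
```

zlib releases the GIL while compressing, so a thread pool can give a real speed-up here even in CPython.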

List contents

entries = pmole.list_files(archive)
for e in entries:
    print(f"{e.path:40s}  {e.original_size:>8} B  "
          f"ratio={e.compression_ratio:.2f}  [{e.algo_name}]  "
          f"{'binary' if e.is_binary else 'text'}")

# Encrypted archive
entries = pmole.list_files(archive, password="s3cr3t")

Extract and decompress

# Extract one file
out = pmole.extract(archive, "myproject/main.py", output_dir="/tmp/out")

# Decompress entire archive; returns list of paths written
paths = pmole.decompress(archive, output_dir="/tmp/restored")
paths = pmole.decompress(archive, output_dir="/tmp/restored", threads=8)

# Encrypted archive
paths = pmole.decompress(archive, output_dir="/tmp/restored", password="s3cr3t")

Codebase tools

# Search for a regex pattern across all text files (nothing written to disk)
matches = pmole.search(archive, r"def \w+\(")
for m in matches:
    print(f"{m['path']}:{m['line_no']}: {m['line']}")

# Verify every file decompresses correctly (nothing written to disk)
results = pmole.verify(archive)
failures = [r for r in results if not r["ok"]]

# Compression statistics grouped by file extension
s = pmole.stats(archive)
print(f"{s['total_files']} files, "
      f"{s['total_original']} B → {s['total_compressed']} B")
for ext, bucket in s["by_extension"].items():
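    # Guard against buckets that contain only empty files (original == 0)
    ratio = bucket["compressed"] / bucket["original"] if bucket["original"] else 0.0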
    print(f"  {ext}: {bucket['files']} files, ratio={ratio:.2f}")

# Check whether an archive is password-protected
pmole.is_encrypted(archive)  # → True / False
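To make the shape of the search results concrete, here is a minimal sketch of a line-by-line regex search over text held in memory. It assumes the result keys shown above (`path`, `line_no`, `line`); the function name `search_texts` and the dict-of-strings input are hypothetical and are not part of pmole.

```python
import re


def search_texts(files: dict[str, str], pattern: str) -> list[dict]:
    """Return one dict per matching line across all in-memory text files."""
    regex = re.compile(pattern)
    matches = []
    for path, text in files.items():
        # Line numbers are 1-based, as in most grep-like tools.
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                matches.append({"path": path, "line_no": line_no, "line": line})
    return matches
```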

Raw algorithm access

# LZW only
codes = pmole.lzw_compress(b"hello world hello world")
data  = pmole.lzw_decompress(codes)

# Any algorithm by constant
from pmole import ALGO_LZMA
from pmole.compression import compress_with_algo, decompress_with_algo

compressed = compress_with_algo(b"hello world", ALGO_LZMA)
original   = decompress_with_algo(compressed, ALGO_LZMA)
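For readers unfamiliar with LZW, the sketch below shows a textbook byte-oriented LZW codec whose codes are packed as little-endian uint16 values. It is an illustration under stated assumptions, not pmole's code: the names `lzw_encode` and `lzw_decode` are hypothetical, and freezing the dictionary once it holds 65 536 entries is one common way to stay within 16-bit codes.

```python
import struct

MAX_CODES = 1 << 16  # codes must fit in a uint16


def lzw_encode(data: bytes) -> bytes:
    """Compress bytes with LZW and pack the codes as little-endian uint16."""
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            if len(table) < MAX_CODES:  # stop growing once the table is full
                table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return struct.pack(f"<{len(codes)}H", *codes)


def lzw_decode(packed: bytes) -> bytes:
    """Invert lzw_encode."""
    codes = struct.unpack(f"<{len(packed) // 2}H", packed)
    if not codes:
        return b""
    table = [bytes([i]) for i in range(256)]
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        if len(table) < MAX_CODES:
            table.append(previous + entry[:1])
        previous = entry
    return b"".join(out)
```

Because every code costs two bytes, inputs with little repetition grow under plain LZW, which is consistent with pairing it with zlib or preferring other algorithms for such data.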

Encryption

compress(..., password="...") encrypts the archive with AES-256-GCM.

Property	Value
Cipher	AES-256-GCM (authenticated encryption)
Key derivation	PBKDF2-HMAC-SHA256, 260 000 iterations
Salt	16 random bytes (unique per archive)
Nonce	12 random bytes (unique per archive)
Envelope magic	`PME\x01` (distinct from plain `.pm`)
Overhead	48 bytes over the compressed archive size

An incorrect password raises ValueError immediately — the GCM authentication tag fails before any data is returned. All read operations (decompress, list_files, extract, verify, search, stats) accept the same password parameter.
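The key-derivation parameters and the 48-byte overhead in the table can be checked with the standard library alone. This is a sketch under assumptions: UTF-8 password encoding and a 16-byte GCM tag (the usual default) are not stated above, and `derive_key` is a hypothetical helper. The AES-GCM step itself is omitted because it needs a third-party library.

```python
import hashlib
import os

MAGIC = b"PME\x01"   # envelope magic from the table above
SALT_LEN = 16
NONCE_LEN = 12
GCM_TAG_LEN = 16     # ASSUMPTION: standard 128-bit GCM tag
ITERATIONS = 260_000


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte AES-256 key with PBKDF2-HMAC-SHA256.

    ASSUMPTION: the password is encoded as UTF-8.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=32
    )


salt = os.urandom(SALT_LEN)
key = derive_key("s3cr3t", salt)

# magic + salt + nonce form the envelope header; the tag follows the ciphertext.
header_len = len(MAGIC) + SALT_LEN + NONCE_LEN
overhead = header_len + GCM_TAG_LEN
print(len(key), header_len, overhead)
```

Under these assumptions the header comes to 32 bytes and the total overhead to 48 bytes, which agrees with the figures quoted in this README.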

The CLI falls back to the PMOLE_PASSWORD environment variable when no password option is given, so the password does not have to appear as a command-line argument.

Compression algorithms

Four algorithms are available. Use algo="auto" (the default) to let pmole benchmark each one per file and keep the smallest result.

Constant	Name	Description	Best for
`ALGO_LZW`	`"lzw"`	Pure LZW, codes packed as uint16 LE	Highly repetitive data
`ALGO_LZW_ZLIB`	`"lzw+zlib"`	LZW uint16 stream further compressed with zlib	Repetitive text with patterns
`ALGO_ZLIB`	`"zlib"`	zlib / DEFLATE (LZ77 + Huffman), Python built-in	Source code, mixed content
`ALGO_LZMA`	`"lzma"`	LZMA, Python built-in	Best ratio, any content

Note: LZW-based algorithms (lzw, lzw+zlib) are automatically skipped for files larger than 10 MB; only zlib and lzma are tried for large files.
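The "try each algorithm and keep the smallest result" idea can be sketched as follows. Only the two standard-library candidates are included, so this is a simplified illustration rather than pmole's auto mode; `pick_smallest`, `restore`, and `CANDIDATES` are hypothetical names.

```python
import lzma
import zlib

# name -> (compress function, decompress function)
CANDIDATES = {
    "zlib": (zlib.compress, zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def pick_smallest(data: bytes) -> tuple[str, bytes]:
    """Compress with every candidate and return (name, payload) of the smallest."""
    results = {name: fn(data) for name, (fn, _) in CANDIDATES.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


def restore(name: str, payload: bytes) -> bytes:
    """Decompress a payload using the algorithm recorded next to it."""
    return CANDIDATES[name][1](payload)
```

The chosen name has to be stored alongside the payload, which matches the per-file `algo` field described in the next section.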

`ArchiveEntry`

list_files() returns a list of ArchiveEntry dataclass instances:

Field	Type	Description
`path`	`str`	Archived path (forward slashes)
`original_size`	`int`	Uncompressed size in bytes
`compressed_size`	`int`	Compressed size in bytes
`is_binary`	`bool`	Detected as binary (metadata only)
`algo`	`int`	Algorithm ID used for this file
`data_offset`	`int`	Byte offset of this file's section in archive
`compression_ratio`	`float`	`compressed_size / original_size` (property)
`algo_name`	`str`	Human-readable algorithm name (property)
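A dataclass with the fields and derived properties from the table could look like the sketch below. It is not pmole's definition: the class name `EntrySketch` is hypothetical, the numeric IDs in `ALGO_NAMES` are invented for illustration, and returning 0.0 for empty files is an assumption.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative ID-to-name mapping; the real numeric IDs may differ.
ALGO_NAMES = {0: "lzw", 1: "lzw+zlib", 2: "zlib", 3: "lzma"}


@dataclass(frozen=True)
class EntrySketch:
    path: str
    original_size: int
    compressed_size: int
    is_binary: bool
    algo: int
    data_offset: int

    @property
    def compression_ratio(self) -> float:
        # ASSUMPTION: report 0.0 for empty files instead of dividing by zero.
        if self.original_size == 0:
            return 0.0
        return self.compressed_size / self.original_size

    @property
    def algo_name(self) -> str:
        return ALGO_NAMES.get(self.algo, f"unknown({self.algo})")
```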

Auto-exclusion

compress() applies five independent exclusion layers before touching any file. All defaults are exported from pmole and can be overridden per call.

`.pmignore` / `.gitignore`

When compressing a directory, pmole looks for .pmignore in the root first, then .gitignore. If found, its patterns are applied using full gitignore semantics (**, negation !, directory anchoring /). .pmignore itself is always excluded from the archive.

Create a .pmignore to override or extend .gitignore patterns for archiving:

# .pmignore
*.log
scratch/
local_settings.py
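The lookup order described above (.pmignore first, .gitignore as fallback) can be sketched like this. The helpers `find_ignore_file` and `read_patterns` are hypothetical, and the pattern reader only strips blank lines and comments; it does not implement gitignore matching.

```python
from pathlib import Path
from typing import Optional


def find_ignore_file(root: Path) -> Optional[Path]:
    """Return the first ignore file found in root, preferring .pmignore."""
    for name in (".pmignore", ".gitignore"):
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None


def read_patterns(ignore_file: Path) -> list[str]:
    """Read non-empty, non-comment lines from an ignore file (simplified)."""
    patterns = []
    for raw in ignore_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns
```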

File extensions — `EXCLUDE_EXTENSIONS`

Files whose suffix (case-insensitive, no leading dot) matches are skipped.

Category	Extensions
Compiled / executable	`exe` `dll` `so` `dylib` `ko` `sys` `efi` `lib` `a` `o` `obj` `out` `elf` `bin` `wasm` `app` `msi` `bat` `cmd`
Python bytecode	`pyc` `pyo` `pyd`
JVM / .NET	`class` `jar` `pdb`
Mobile / embedded	`apk` `ipa` `xex` `xbe` `3dsx`
Game engine	`pak` `gdc` `pck` `uasset`
Raw / firmware	`img` `hex` `srec` `rom` `bios` `bootloader` `boot`
Archives	`zip` `gz` `bz2` `xz` `zst` `lz4` `7z` `rar` `tar` `tgz` `tbz2` `txz`
Images	`png` `jpg` `jpeg` `gif` `bmp` `tiff` `tif` `webp` `avif` `heic` `ico`
Audio / video	`mp3` `mp4` `wav` `flac` `ogg` `aac` `m4a` `avi` `mkv` `mov` `wmv` `flv` `webm`
Fonts	`ttf` `otf` `woff` `woff2` `eot`
Databases	`sqlite` `sqlite3` `db` `mdb` `accdb`
Lock files	`lock`
Source maps	`map`
Log / temp	`log` `tmp` `bak` `swp` `swo`

Directories — `EXCLUDE_DIRECTORIES`

A file is skipped if any directory component of its path (between the root and the file) matches one of these names, case-insensitively.

Category	Names
Version control	`.git` `.svn` `.hg` `.bzr`
Python	`__pycache__` `venv` `.venv` `env` `.env` `.tox` `.mypy_cache` `.pytest_cache` `.ruff_cache` `.pytype` `.pyre` `htmlcov` `.eggs` `.egg-info`
Node / JS	`node_modules` `.next` `.nuxt` `.svelte-kit` `.turbo` `.parcel-cache`
Build outputs	`build` `dist` `out` `bin` `obj` `target` `.gradle`
IDEs / editors	`.idea` `.vscode` `.vs` `.eclipse` `.fleet`
Package caches	`.m2` `.bundle` `vendor`
Test / coverage	`coverage` `cov` `.coverage`
Misc generated	`lib` `libs` `assets` `res` `resources` `static` `public` `cache` `.cache` `tmp` `temp` `fonts` `media` `data` `.terraform` `.docker`

Filenames — `EXCLUDE_FILENAMES`

Exact filename match, case-insensitive.

File	Reason
`.DS_Store`	macOS directory metadata
`Thumbs.db`	Windows thumbnail cache
`desktop.ini`	Windows folder settings
`.env`	Runtime secrets (archive `.env.example` instead)
`.swp` `.swo`	Vim swap files
`.pmignore`	pmole ignore file (meta — not archived)
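The three name-based layers (directories, filenames, extensions) can be combined into one check, sketched below. The sets are small samples taken from the tables above, the function name `exclusion_reason` is hypothetical, and the order in which the layers are tested is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

# Small samples of the defaults listed above, lower-cased for comparison.
SAMPLE_EXTENSIONS = {"pyc", "log", "png"}
SAMPLE_DIRECTORIES = {".git", "node_modules", "__pycache__"}
SAMPLE_FILENAMES = {".ds_store", "thumbs.db", ".env"}


def exclusion_reason(rel_path: str) -> Optional[str]:
    """Return why a forward-slash relative path is excluded, or None if kept."""
    path = PurePosixPath(rel_path)
    # Directory layer: every component except the file name itself.
    for part in path.parts[:-1]:
        if part.lower() in SAMPLE_DIRECTORIES:
            return f"directory: {part}"
    # Filename layer: exact, case-insensitive match.
    if path.name.lower() in SAMPLE_FILENAMES:
        return f"filename: {path.name}"
    # Extension layer: suffix without the leading dot.
    suffix = path.suffix.lstrip(".").lower()
    if suffix and suffix in SAMPLE_EXTENSIONS:
        return f"extension: {suffix}"
    return None
```

Dotfiles such as `.env` have an empty `suffix` in `pathlib`, which is one reason to match them by filename rather than by extension.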

File size — `MAX_FILE_SIZE_BYTES`

Files larger than 50 MB (default) are skipped with an info log. Override per call:

pmole.compress("./data", max_file_size_bytes=10 * 1024 * 1024)  # 10 MB cap

Customising filters

The three exclusion lists and the size limit are exported at the top level. Pass replacements directly to compress():

import pmole

# Extend the defaults
pmole.compress(
    "./myproject",
    exclude_extensions=[*pmole.EXCLUDE_EXTENSIONS, "csv", "parquet"],
    exclude_directories=[*pmole.EXCLUDE_DIRECTORIES, "migrations"],
    exclude_filenames=[*pmole.EXCLUDE_FILENAMES, "secrets.json"],
    max_file_size_bytes=5 * 1024 * 1024,
)

# Or replace them entirely
pmole.compress("./myproject", exclude_extensions=[], exclude_directories=[])

CLI usage

# Enable debug output (timestamps + source location on every log line)
pmole --debug compress ./myproject

# Compress a directory (auto algorithm, 3 threads)
pmole compress ./myproject

# Force a specific algorithm
pmole compress ./myproject --algo lzma
pmole compress ./myproject -a zlib

# Compress a single file
pmole compress ./README.md

# Custom output path, more threads
pmole compress ./myproject -o out.pm -t 8

# Encrypt the archive
pmole compress ./myproject -p mysecret
# or via environment variable; reading it with `read -s` keeps it out of shell history
read -rs PMOLE_PASSWORD && export PMOLE_PASSWORD
pmole compress ./myproject

# Add extra exclusions
pmole compress ./myproject \
    --exclude-dir ".cache,scratch" \
    --exclude-ext "csv,parquet" \
    --exclude-name "local_settings.py"

# List archive contents (no decompression)
pmole list myproject.pm
pmole list myproject.pm -p mysecret            # encrypted

# Show directory tree structure
pmole tree myproject.pm

# Extract one file (optional output dir, default = current dir)
pmole extract myproject.pm myproject/main.py /tmp/out

# Decompress full archive (optional output dir, default = current dir)
pmole decompress myproject.pm /tmp/restored
pmole decompress myproject.pm -t 8             # more threads
pmole decompress myproject.pm -p mysecret      # encrypted

# Verify archive integrity (no disk writes)
pmole verify myproject.pm

# Search for a regex pattern in all text files (no disk writes)
pmole search myproject.pm "def \w+"
pmole search myproject.pm "TODO|FIXME"

# Show compression statistics grouped by file extension
pmole stats myproject.pm

# Diff two files
pmole diff a.py b.py

Debug output

--debug is a global flag that must come before the sub-command name. It switches the log format from the terse default to a timestamped form that includes the source file and line number — useful when diagnosing slow archives or unexpected exclusions.

# Normal output (default)
INFO      scan  42 source files
INFO      pack  src/utils.py  8120 B → 1943 B  [lzma]
INFO      wrote  myproject.pm

# Debug output  (pmole --debug compress …)
14:23:01  DEBUG     [utils.py:215]   Excluded 'src/vendor/jquery.min.js' (extension: js)
14:23:01  DEBUG     [compression.py:173]  compress_auto  lzma  1823/4096 B  ratio=0.44
14:23:01  INFO      scan  42 source files

Per-file exclusion lines (Excluded '…') are only shown in debug mode; they are suppressed at the default INFO level to keep normal output clean.
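The two log shapes shown above can be approximated with the standard `logging` module. This is a sketch, not pmole's logging setup: the format strings are inferred from the sample lines, and `configure` and the logger name `pmole_sketch` are hypothetical.

```python
import logging

# Terse default: padded level name followed by the message.
TERSE = logging.Formatter("%(levelname)-8s  %(message)s")
# Debug form: time of day, level, source file and line, then the message.
VERBOSE = logging.Formatter(
    "%(asctime)s  %(levelname)-8s  [%(filename)s:%(lineno)d]  %(message)s",
    datefmt="%H:%M:%S",
)


def configure(debug: bool = False) -> logging.Logger:
    """Return a logger using the terse or the verbose format."""
    logger = logging.getLogger("pmole_sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(VERBOSE if debug else TERSE)
    logger.addHandler(handler)
    # At INFO level, logger.debug(...) calls are dropped, which is how
    # per-file exclusion lines would stay hidden by default.
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.propagate = False
    return logger
```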

`.pm` format overview

A .pm file (magic PM\x03\x00) has three sections:

Header (16 bytes) — magic, file count, byte offset to the index table.
File sections — one per stored file: an 18-byte header (original_size, compressed_size, is_binary, algo) followed by the compressed payload.
Index table (at end) — one entry per file storing its path, sizes, algorithm, and data_offset, enabling listing and single-file extraction by seeking directly to the right section.
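As a rough illustration of how fixed-size headers like these are usually handled, the sketch below packs and parses them with `struct`. The field widths and byte order are assumptions chosen only so that the sizes come out at 16 and 18 bytes; the authoritative layout is in ARCHITECTURE.md, and all helper names here are hypothetical.

```python
import struct

MAGIC = b"PM\x03\x00"

# ASSUMPTION: little-endian, 4-byte magic + uint32 file count + uint64 index offset.
ARCHIVE_HEADER = struct.Struct("<4sIQ")
# ASSUMPTION: uint64 original_size + uint64 compressed_size + uint8 is_binary + uint8 algo.
FILE_HEADER = struct.Struct("<QQBB")


def pack_archive_header(file_count: int, index_offset: int) -> bytes:
    return ARCHIVE_HEADER.pack(MAGIC, file_count, index_offset)


def unpack_archive_header(raw: bytes) -> tuple[int, int]:
    """Return (file_count, index_offset) after checking the magic bytes."""
    magic, file_count, index_offset = ARCHIVE_HEADER.unpack(raw[: ARCHIVE_HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a .pm archive")
    return file_count, index_offset


print(ARCHIVE_HEADER.size, FILE_HEADER.size)
```

Storing the index offset in the header is what lets a reader jump straight to the index table and then seek to a single file's section.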

When encrypted, the entire .pm content is wrapped in an encrypted envelope (magic PME\x01, 32-byte header containing salt and nonce, followed by the AES-256-GCM ciphertext).

See ARCHITECTURE.md for the full byte-level specification and design rationale.

Testing

# Run all tests
pytest

# Run only the fast correctness + credibility suite
pytest tests/test_api.py tests/test_credibility.py tests/test_algo_lzw.py

# Run benchmarks (single iteration each, no stats)
pytest tests/test_benchmark.py -v

# Run benchmarks with full statistics (requires pytest-benchmark)
pip install pytest-benchmark
pytest tests/test_benchmark.py -v --benchmark-sort=mean

# Print the compression ratio table
pytest -s tests/test_credibility.py::test_ratio_summary_table

Test files

File	Purpose
`tests/test_api.py`	Integration smoke tests for the public API (compress, decompress, encrypt, verify, search, stats, exclusions)
`tests/test_algo_lzw.py`	Unit tests for the raw LZW compressor / decompressor
`tests/test_credibility.py`	Quantitative correctness and ratio-bound tests — edge-case roundtrips (empty, single byte, all-zeros, all-`0xFF`, 200 KB repetitive, 500 KB random), ratio thresholds per algorithm, auto-selection invariants, archive metadata accuracy, double-roundtrip identity, unicode content, corrupt-archive handling
`tests/test_benchmark.py`	Performance benchmarks for raw compress/decompress throughput per algorithm, auto-select overhead, full archive creation/restoration, thread-scaling (1 / 2 / 4 threads), and verify/search timing
`tests/conftest.py`	Provides a minimal `benchmark` fixture when `pytest-benchmark` is not installed so the benchmark tests always run

LICENSE

MIT
