CLI Reference

Command	Purpose
`protspace prepare`	Full pipeline: embed → reduce → annotate → bundle
`protspace embed`	Generate embeddings from FASTA via Biocentral API
`protspace project`	Dimensionality reduction on HDF5 embeddings
`protspace annotate`	Fetch protein annotations from databases
`protspace bundle`	Combine projections + annotations into .parquetbundle
`protspace serve`	Launch interactive Dash web frontend
`protspace style`	Add/inspect annotation styles in existing files

Run `protspace <command> -h` for detailed help.

`protspace prepare`

Full pipeline: load protein embeddings (from HDF5, FASTA, or UniProt query), run dimensionality reduction, fetch biological annotations, and create a .parquetbundle for visualization at protspace.app.

Accepts three input types:

HDF5 files (`-i`): pre-computed embeddings from any protein language model (pLM)
FASTA files (`-i` + `-e`): sequences are embedded on the fly via the Biocentral API
UniProt queries (`-q` + `-e`): sequences are fetched from UniProt, then embedded

# From HDF5 embeddings
protspace prepare -i embeddings.h5 -m pca2,umap2 -o output

# From FASTA — auto-embed with two models
protspace prepare -i sequences.fasta -e prot_t5,esm2_650m -m pca2,umap2 -o output

# From UniProt query
protspace prepare -q "(family:phosphatase) AND (reviewed:true)" -e prot_t5 -m pca2 -o output

# With sequence similarity (MMseqs2)
protspace prepare -i emb.h5 -f seq.fasta -s -m pca2,mds2 -o output

# External HDF5 without model_name attribute — use colon syntax
protspace prepare -i external.h5:prot_t5 -m pca2 -o output

# Compare UMAP with different parameters in a single run
protspace prepare -i emb.h5 -m "umap2:n_neighbors=15" -m "umap2:n_neighbors=50" -m pca2 -o output

# Inline params with semicolons, comma-separated methods
protspace prepare -i emb.h5 -m "pca2,umap2:n_neighbors=50;min_dist=0.3,tsne2" -o output

Options

Input

Flag	Description	Default
`-i, --input`	HDF5 or FASTA file(s). Repeat for multi-embedding or to combine datasets. Use `-i file.h5:name` for external HDF5 files (see Model Name Resolution).	—
`-q, --query`	UniProt search query (alternative to -i).	—
`-f, --fasta`	FASTA for similarity computation (with -s when input is HDF5).	—

Embedding

Flag	Description	Default
`-e, --embedder`	Biocentral model shortcut (comma-separated for multi-model).	`prot_t5`
`--batch-size`	Sequences per API call.	`1000`

Available embedders: prot_t5, prost_t5, esm2_8m, esm2_35m, esm2_150m, esm2_650m, esm2_3b, ankh_base, ankh_large, ankh3_large, esmc_300m, esmc_600m

Licensing: ankh_base, ankh_large, ankh3_large (CC-BY-NC-SA-4.0), esmc_600m (Cambrian Non-Commercial). All others are permissively licensed.

Projection

Flag	Description	Default
`-m, --methods`	DR methods. Repeat the flag or use commas to combine methods (`-m pca2,umap2`); use semicolons to inline parameter overrides for one method (`-m 'umap2:n_neighbors=50;min_dist=0.1'`). See Overridable parameters for the supported keys. Methods: `pca2`, `umap2`, `tsne2`, `pacmap2`, `mds2`, `localmap2`.	`pca2`
`-s, --similarity`	Also compute sequence similarity DR from FASTA.	off
`--metric`	Distance metric (`euclidean`, `cosine`, `manhattan`).	`euclidean`
`--random-state`	Random seed.	`42`
`--n-neighbors`	UMAP/PaCMAP/LocalMAP neighbors.	`25`
`--min-dist`	UMAP min distance (0.0–0.99).	`0.1`
`--perplexity`	t-SNE perplexity.	`30`
`--learning-rate`	t-SNE learning rate.	`200`
`--mn-ratio`	PaCMAP/LocalMAP mid-near pair ratio.	`0.5`
`--fp-ratio`	PaCMAP/LocalMAP further pair ratio.	`2.0`
`--n-init`	MDS initializations.	`4`
`--max-iter`	MDS max iterations.	`300`
`--eps`	MDS convergence tolerance.	`1e-3`

Overridable parameters (with `-m`)

The `-m` flag accepts inline overrides per method as `key=value` pairs separated by semicolons. The same keys are also available as the global flags above; an inline override affects only that method's projection.

Key	Abbrev	Type	Used by
`n_neighbors`	`n`	int	UMAP, PaCMAP, LocalMAP
`min_dist`	`d`	float	UMAP
`perplexity`	`p`	int	t-SNE
`learning_rate`	`lr`	int	t-SNE
`mn_ratio`	`mn`	float	PaCMAP, LocalMAP
`fp_ratio`	`fp`	float	PaCMAP, LocalMAP
`metric`	`m`	str	All (`euclidean`, `cosine`, `manhattan`)
`random_state`	`rs`	int	All
`n_init`	`ni`	int	MDS
`max_iter`	`mi`	int	MDS
`eps`	`e`	float	MDS

The abbreviation is what appears in projection names when the same method and dimension count are requested with different overrides (see Projection Naming).
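The `-m` syntax described above (commas between methods, a colon before overrides, semicolons between `key=value` pairs) can be sketched as a small parser. This is an illustration of the documented syntax only, not protspace's actual implementation; the function name `parse_method_specs` and the error wording are assumptions. The value types are taken from the table above.

```python
# Value types copied from the "Overridable parameters" table.
PARAM_TYPES = {
    "n_neighbors": int, "min_dist": float, "perplexity": int,
    "learning_rate": int, "mn_ratio": float, "fp_ratio": float,
    "metric": str, "random_state": int, "n_init": int,
    "max_iter": int, "eps": float,
}


def parse_method_specs(values):
    """Expand repeated -m values into (method, overrides) pairs.

    Hypothetical helper: commas separate methods, a colon introduces
    overrides for one method, semicolons separate key=value pairs.
    """
    specs = []
    for value in values:
        for item in value.split(","):
            method, _, raw = item.strip().partition(":")
            if not method:
                continue  # tolerate stray commas
            overrides = {}
            for pair in filter(None, raw.split(";")):
                key, sep, text = pair.partition("=")
                key = key.strip()
                if not sep or key not in PARAM_TYPES:
                    raise ValueError(f"bad override {pair!r} for {method}")
                # Convert the text to the type listed in the table.
                overrides[key] = PARAM_TYPES[key](text.strip())
            specs.append((method, overrides))
    return specs


if __name__ == "__main__":
    for spec in parse_method_specs(["pca2,umap2:n_neighbors=50;min_dist=0.3,tsne2"]):
        print(spec)
```

Because each override dictionary is attached to a single method, a value such as `n_neighbors=50` never leaks into the neighbouring `pca2` or `tsne2` entries, which matches the rule that an inline override affects only its own projection.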

Example:

protspace prepare -i emb.h5 \
  -m 'umap2:n_neighbors=15' \
  -m 'umap2:n_neighbors=50;min_dist=0.05' \
  -m pca2 \
  -o output

Assuming emb.h5 was embedded with ProtT5, this produces three projections: ProtT5 — PCA 2, ProtT5 — UMAP 2 (n=15), and ProtT5 — UMAP 2 (d=0.05, n=50).

Annotations

Flag	Description	Default
`-a, --annotations`	Annotation sources: groups, individual names, or a CSV/TSV file path. See Annotation Reference.	`default`
`--scores / --no-scores`	Include annotation confidence scores.	on
`--refetch STAGES`	Recompute specific stages (comma-separated): query, embed, similarity, projections, uniprot, taxonomy, interpro, ted, biocentral. Shorthands: `all`, `annotations`.	off

Output

Flag	Description	Default
`-o, --output`	Output directory.	`.`
`--bundled / --no-bundled`	Bundle into single `.parquetbundle`.	bundled
`--keep-tmp`	Cache intermediates for resumability.	on
`--no-log`	Skip writing `run.log`.	off
`--dump-cache`	Print cached annotations and exit.	off

`protspace embed`

Generate HDF5 embeddings from FASTA via the Biocentral API.

protspace embed -i sequences.fasta -e prot_t5 -e esm2_3b -o embeddings/

`protspace project`

Run dimensionality reduction on HDF5 embeddings.

protspace project -i embeddings/prot_t5.h5 -i embeddings/esm2_3b.h5 -m pca2,umap2 -o projections/

`protspace annotate`

Fetch protein annotations from UniProt, InterPro, and taxonomy databases.

protspace annotate -i embeddings/prot_t5.h5 -a default -o annotations.parquet

`protspace bundle`

Combine projection and annotation parquet files into a .parquetbundle.

protspace bundle -p projections/ -a annotations.parquet -o output.parquetbundle

`protspace serve`

Launch the Dash web frontend for interactive visualization.

protspace serve output.parquetbundle

`protspace style`

Add custom colors, shapes, and display settings. See Annotation Styling.

protspace style data.parquetbundle --generate-template > styles.json
protspace style input.parquetbundle output.parquetbundle --annotation-styles styles.json
protspace style data.parquetbundle --dump-settings

Combining Multiple Inputs (`-i`)

When multiple -i inputs are provided, behavior depends on whether they share the same embedding name:

Same embedding name → proteins are unioned (concatenated). Use this to combine datasets (e.g., two species both embedded with ProtT5).
Different embedding names → proteins are intersected. Use this for multi-embedding comparison (e.g., ProtT5 vs ESM2 on the same proteins).

# Union: combine two species into one visualization
protspace prepare -i human.h5:prot_t5 -i drosophila.h5:prot_t5 -m umap2 -o output

# Intersection: compare embeddings on shared proteins
protspace prepare -i prot_t5.h5 -i esm2_650m.h5 -m pca2 -o output

Duplicate proteins across same-name inputs are deduplicated if their embeddings match (within tolerance). Conflicting embeddings for the same protein ID raise an error.
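The union, intersection, and deduplication rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not protspace's code: the function name `combine_inputs`, the in-memory representation (dictionaries of float lists instead of HDF5 datasets), and the tolerance value `1e-6` are all hypothetical.

```python
import math


def combine_inputs(inputs, tol=1e-6):
    """Combine (embedding_name, {protein_id: vector}) pairs.

    Same name      -> union of proteins (duplicates must match within tol).
    Different name -> only proteins present under every name are kept.
    The tolerance default is an assumption, not a documented value.
    """
    by_name = {}
    for name, vectors in inputs:
        merged = by_name.setdefault(name, {})
        for pid, vec in vectors.items():
            if pid not in merged:
                merged[pid] = list(vec)
                continue
            known = merged[pid]
            same = len(vec) == len(known) and all(
                math.isclose(a, b, abs_tol=tol) for a, b in zip(vec, known)
            )
            if not same:
                raise ValueError(f"conflicting embeddings for {pid!r} in {name!r}")
    if not by_name:
        return {}
    # Intersect protein IDs across the distinct embedding names.
    shared = set.intersection(*(set(v) for v in by_name.values()))
    return {
        name: {pid: vecs[pid] for pid in sorted(shared)}
        for name, vecs in by_name.items()
    }


if __name__ == "__main__":
    human = ("prot_t5", {"P1": [0.1, 0.2]})
    fly = ("prot_t5", {"P2": [0.3, 0.4], "P1": [0.1, 0.2]})
    print(combine_inputs([human, fly]))  # union of P1 and P2 under one name
```

With a single embedding name the intersection step is a no-op, so the result is the plain union; adding an input under a second name narrows every embedding to the proteins they have in common.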

Projection Naming

Projections are prefixed with the embedding source: ESM2-650M — PCA 2, ProtT5 — UMAP 2, MMseqs2 — MDS 2.

When the same method and dimension count are requested with different inline parameter overrides (a parameter sweep), the differing parameters are appended in parentheses using their abbreviated names. For example, umap2:n_neighbors=50 running alongside another umap2 variant is named ProtT5 — UMAP 2 (n=50). A plain umap2 (no overrides) keeps the unsuffixed name. See Overridable parameters for the abbreviation table.
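The suffix rule can be sketched as follows. This is a guess at the behaviour based only on the examples in this document, not the actual naming code: the function name `sweep_suffixes` is hypothetical, and the alphabetical ordering of the abbreviations is inferred from the single example ProtT5 — UMAP 2 (d=0.05, n=50).

```python
# Abbreviations copied from the "Overridable parameters" table.
ABBREV = {
    "n_neighbors": "n", "min_dist": "d", "perplexity": "p",
    "learning_rate": "lr", "mn_ratio": "mn", "fp_ratio": "fp",
    "metric": "m", "random_state": "rs", "n_init": "ni",
    "max_iter": "mi", "eps": "e",
}


def sweep_suffixes(variants):
    """Return one name suffix per variant of the same method and dimension.

    variants: list of override dicts, e.g. [{"n_neighbors": 15}, {}].
    Only keys whose values differ between variants are shown (assumption).
    """
    keys = {key for overrides in variants for key in overrides}
    differing = {
        key for key in keys
        if len({repr(overrides.get(key)) for overrides in variants}) > 1
    }
    suffixes = []
    for overrides in variants:
        parts = sorted(
            f"{ABBREV[key]}={overrides[key]}"
            for key in differing if key in overrides
        )
        suffixes.append(f" ({', '.join(parts)})" if parts else "")
    return suffixes


if __name__ == "__main__":
    sweep = [{"n_neighbors": 15}, {"n_neighbors": 50, "min_dist": 0.05}]
    for suffix in sweep_suffixes(sweep):
        print("ProtT5 \u2014 UMAP 2" + suffix)
```

A variant with no overrides contributes no parts and therefore keeps the unsuffixed name, which is the behaviour described for a plain `umap2`.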

Model Name Resolution (`-i file.h5:name`)

HDF5 files need a model name for projection labels. The name is resolved in this order:

1. Colon syntax: `-i file.h5:prot_t5` (highest priority)
2. HDF5 attribute: `model_name` in the root attrs (set automatically by protspace embed/prepare)
3. Error: exits with a copy-pasteable fix command

Use the colon syntax for HDF5 files created outside protspace (bio_embeddings, custom scripts, Colab). Files from protspace embed/prepare already have the attribute.

# External files — need colon syntax
protspace prepare -i my_embeddings.h5:prot_t5 -m pca2 -o output
protspace prepare -i esm2.h5:esm2_650m -i prott5.h5:prot_t5 -m pca2 -o output

# Combine datasets — same name → union proteins
protspace prepare -i species_a.h5:prot_t5 -i species_b.h5:prot_t5 -m umap2 -o output

# Protspace-generated files — just work
protspace prepare -i embeddings/prot_t5.h5 -m pca2 -o output

Check whether an HDF5 file has the attribute: `python -c "import h5py; print(dict(h5py.File('file.h5','r').attrs))"`
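The resolution order described in this section can be sketched as a small function. It is an illustration only: `split_spec` and `resolve_model_name` are hypothetical names, the root attributes are passed in as a plain dictionary so the sketch runs without h5py, and the error text is invented rather than copied from protspace.

```python
def split_spec(spec):
    """Split 'file.h5:name' into (path, name); name is None if absent.

    The last colon is used, and a tail containing a path separator is
    treated as part of the path (so a Windows drive letter is not
    mistaken for a model name). This heuristic is an assumption.
    """
    head, sep, tail = spec.rpartition(":")
    if sep and head and tail and "/" not in tail and "\\" not in tail:
        return head, tail
    return spec, None


def resolve_model_name(spec, root_attrs):
    """Apply the documented order: colon syntax, then attribute, then error."""
    path, name = split_spec(spec)
    if name:
        return path, name  # 1. colon syntax has the highest priority
    attr = root_attrs.get("model_name")
    if isinstance(attr, bytes):
        attr = attr.decode("utf-8")  # HDF5 string attrs may arrive as bytes
    if attr:
        return path, str(attr)  # 2. attribute stored in the file
    # 3. no name available: fail with a hint (hypothetical wording)
    raise ValueError(f"{path} has no model_name attribute; try -i {path}:<model_name>")


if __name__ == "__main__":
    print(resolve_model_name("my_embeddings.h5:prot_t5", {}))
    print(resolve_model_name("embeddings/prot_t5.h5", {"model_name": "prot_t5"}))
```

In real use the second argument would come from the file itself, for example `dict(h5py.File(path, "r").attrs)` as in the check command above.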

Intermediate Caching (`--keep-tmp`)

With --keep-tmp (default), all intermediate results are cached in {output}/tmp/ and reused on subsequent runs:

Cached item	File	Reuse behavior
FASTA sequences	`sequences.fasta`	Skip UniProt query download
Embeddings	`{embedder}.h5`	Skip already-embedded proteins
Annotations	`all_annotations.parquet`	Fetch only missing annotation sources
Similarity matrix	`similarity_matrix.npy`	Skip MMseqs2 recomputation
DR projections	`proj_{name}_{method}_{hash}.npz`	Skip dimensionality reduction

Annotation cache always includes scores regardless of --no-scores
DR projection caches are keyed by embedding name, method, dimensions, and all parameters — changing any parameter creates a new cache entry
Use --refetch all to bypass all caches, or --refetch <stages> selectively (e.g., --refetch ted,biocentral)
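One way to obtain cache entries that change whenever any parameter changes is to hash a canonical serialization of the key fields. The sketch below illustrates that idea for the `proj_{name}_{method}_{hash}.npz` pattern; the function name, the choice of SHA-256, and the 12-character digest length are assumptions, not details taken from protspace.

```python
import hashlib
import json


def projection_cache_name(embedding, method, dims, params):
    """Build a cache file name keyed by embedding, method, dims and params.

    Hypothetical sketch: the key fields are serialized with sorted keys so
    that the same parameters always produce the same hash.
    """
    payload = json.dumps(
        {"embedding": embedding, "method": method, "dims": dims, "params": params},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"proj_{embedding}_{method}_{digest}.npz"


if __name__ == "__main__":
    base = {"n_neighbors": 25, "min_dist": 0.1, "metric": "euclidean", "random_state": 42}
    print(projection_cache_name("prot_t5", "umap", 2, base))
    # Changing a single parameter yields a different file name.
    print(projection_cache_name("prot_t5", "umap", 2, {**base, "n_neighbors": 50}))
```

Sorting the keys before hashing means the order in which parameters were supplied on the command line does not affect the cache lookup.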
