| Command | Purpose |
|---|---|
protspace prepare |
Full pipeline: embed → reduce → annotate → bundle |
protspace embed |
Generate embeddings from FASTA via Biocentral API |
protspace project |
Dimensionality reduction on HDF5 embeddings |
protspace annotate |
Fetch protein annotations from databases |
protspace bundle |
Combine projections + annotations into .parquetbundle |
protspace serve |
Launch interactive Dash web frontend |
protspace style |
Add/inspect annotation styles in existing files |
Run protspace <command> -h for detailed help.
Full pipeline: load protein embeddings (from HDF5, FASTA, or UniProt query), run dimensionality reduction, fetch biological annotations, and create a .parquetbundle for visualization at protspace.app.
Accepts three input types:
- HDF5 files (
-i) — pre-computed embeddings from any pLM - FASTA files (
-i+-e) — sequences are embedded on-the-fly via the Biocentral API - UniProt queries (
-q+-e) — sequences are fetched from UniProt, then embedded
# From HDF5 embeddings
protspace prepare -i embeddings.h5 -m pca2,umap2 -o output
# From FASTA — auto-embed with two models
protspace prepare -i sequences.fasta -e prot_t5,esm2_650m -m pca2,umap2 -o output
# From UniProt query
protspace prepare -q "(family:phosphatase) AND (reviewed:true)" -e prot_t5 -m pca2 -o output
# With sequence similarity (MMseqs2)
protspace prepare -i emb.h5 -f seq.fasta -s -m pca2,mds2 -o output
# External HDF5 without model_name attribute — use colon syntax
protspace prepare -i external.h5:prot_t5 -m pca2 -o output
# Compare UMAP with different parameters in a single run
protspace prepare -i emb.h5 -m "umap2:n_neighbors=15" -m "umap2:n_neighbors=50" -m pca2 -o output
# Inline params with semicolons, comma-separated methods
protspace prepare -i emb.h5 -m "pca2,umap2:n_neighbors=50;min_dist=0.3,tsne2" -o output| Flag | Description | Default |
|---|---|---|
-i, --input |
HDF5 or FASTA file(s). Repeat for multi-embedding or to combine datasets. Use -i file.h5:name for external HDF5 files (see Model Name Resolution). |
— |
-q, --query |
UniProt search query (alternative to -i). | — |
-f, --fasta |
FASTA for similarity computation (with -s when input is HDF5). | — |
| Flag | Description | Default |
|---|---|---|
-e, --embedder |
Biocentral model shortcut (comma-separated for multi-model). | prot_t5 |
--batch-size |
Sequences per API call. | 1000 |
Available embedders: prot_t5, prost_t5, esm2_8m, esm2_35m, esm2_150m, esm2_650m, esm2_3b, ankh_base, ankh_large, ankh3_large, esmc_300m, esmc_600m
Licensing:
ankh_base,ankh_large,ankh3_large(CC-BY-NC-SA-4.0),esmc_600m(Cambrian Non-Commercial). All others are permissively licensed.
| Flag | Description | Default |
|---|---|---|
-m, --methods |
DR methods. Repeat the flag or use commas to combine methods (-m pca2,umap2); use semicolons to inline parameter overrides for one method (-m 'umap2:n_neighbors=50;min_dist=0.1'). See Overridable parameters for the supported keys. Methods: pca2, umap2, tsne2, pacmap2, mds2, localmap2. |
pca2 |
-s, --similarity |
Also compute sequence similarity DR from FASTA. | off |
--metric |
Distance metric (euclidean, cosine, manhattan). |
euclidean |
--random-state |
Random seed. | 42 |
--n-neighbors |
UMAP/PaCMAP/LocalMAP neighbors. | 25 |
--min-dist |
UMAP min distance (0.0–0.99). | 0.1 |
--perplexity |
t-SNE perplexity. | 30 |
--learning-rate |
t-SNE learning rate. | 200 |
--mn-ratio |
PaCMAP/LocalMAP mid-near ratio. | 0.5 |
--fp-ratio |
PaCMAP/LocalMAP further ratio. | 2.0 |
--n-init |
MDS initializations. | 4 |
--max-iter |
MDS max iterations. | 300 |
--eps |
MDS convergence tolerance. | 1e-3 |
-m accepts inline overrides per method using key=value pairs (semicolon-separated). The same keys are also available as global flags above; an inline override only affects that method's projection.
| Key | Abbrev | Type | Used by |
|---|---|---|---|
n_neighbors |
n |
int | UMAP, PaCMAP, LocalMAP |
min_dist |
d |
float | UMAP |
perplexity |
p |
int | t-SNE |
learning_rate |
lr |
int | t-SNE |
mn_ratio |
mn |
float | PaCMAP, LocalMAP |
fp_ratio |
fp |
float | PaCMAP, LocalMAP |
metric |
m |
str | All (euclidean, cosine, manhattan) |
random_state |
rs |
int | All |
n_init |
ni |
int | MDS |
max_iter |
mi |
int | MDS |
eps |
e |
float | MDS |
The abbreviation is what appears in projection names when the same method and dimension count is requested with different overrides — see Projection Naming.
Example:
protspace prepare -i emb.h5 \
-m 'umap2:n_neighbors=15' \
-m 'umap2:n_neighbors=50;min_dist=0.05' \
-m pca2 \
-o outputThis produces three projections: ProtT5 — PCA 2, ProtT5 — UMAP 2 (n=15), and ProtT5 — UMAP 2 (d=0.05, n=50).
| Flag | Description | Default |
|---|---|---|
-a, --annotations |
Annotation sources: groups, individual names, or a CSV/TSV file path. See Annotation Reference. | default |
--scores / --no-scores |
Include annotation confidence scores. | on |
--refetch STAGES |
Recompute specific stages (comma-separated): query, embed, similarity, projections, uniprot, taxonomy, interpro, ted, biocentral. Shorthands: all, annotations. |
off |
| Flag | Description | Default |
|---|---|---|
-o, --output |
Output directory. | . |
--bundled / --no-bundled |
Bundle into single .parquetbundle. |
bundled |
--keep-tmp |
Cache intermediates for resumability. | on |
--no-log |
Skip writing run.log. |
off |
--dump-cache |
Print cached annotations and exit. | off |
Generate HDF5 embeddings from FASTA via the Biocentral API.
protspace embed -i sequences.fasta -e prot_t5 -e esm2_3b -o embeddings/Run dimensionality reduction on HDF5 embeddings.
protspace project -i embeddings/prot_t5.h5 -i embeddings/esm2_3b.h5 -m pca2,umap2 -o projections/Fetch protein annotations from UniProt, InterPro, and taxonomy databases.
protspace annotate -i embeddings/prot_t5.h5 -a default -o annotations.parquetCombine projection and annotation parquet files into a .parquetbundle.
protspace bundle -p projections/ -a annotations.parquet -o output.parquetbundleLaunch the Dash web frontend for interactive visualization.
protspace serve output.parquetbundleAdd custom colors, shapes, and display settings. See Annotation Styling.
protspace style data.parquetbundle --generate-template > styles.json
protspace style input.parquetbundle output.parquetbundle --annotation-styles styles.json
protspace style data.parquetbundle --dump-settingsWhen multiple -i inputs are provided, behavior depends on whether they share the same embedding name:
- Same embedding name → proteins are unioned (concatenated). Use this to combine datasets (e.g., two species both embedded with ProtT5).
- Different embedding names → proteins are intersected. Use this for multi-embedding comparison (e.g., ProtT5 vs ESM2 on the same proteins).
# Union: combine two species into one visualization
protspace prepare -i human.h5:prot_t5 -i drosophila.h5:prot_t5 -m umap2 -o output
# Intersection: compare embeddings on shared proteins
protspace prepare -i prot_t5.h5 -i esm2_650m.h5 -m pca2 -o outputDuplicate proteins across same-name inputs are deduplicated if their embeddings match (within tolerance). Conflicting embeddings for the same protein ID raise an error.
Projections are prefixed with the embedding source: ESM2-650M — PCA 2, ProtT5 — UMAP 2, MMseqs2 — MDS 2.
When the same method and dimension count is requested with different inline parameter overrides (a parameter sweep), the differing parameters are appended in parentheses using their abbreviated names — for example, ProtT5 — UMAP 2 (n=50) for umap2:n_neighbors=50 running alongside another umap2 variant. A plain umap2 (no overrides) keeps the unsuffixed name. See Overridable parameters for the abbreviation table.
HDF5 files need a model name for projection labels. Resolved in order:
- Colon syntax —
-i file.h5:prot_t5(highest priority) - HDF5 attribute —
model_namein root attrs (auto-set byprotspace embed/prepare) - Error — exits with a copy-pasteable fix command
Use the colon syntax for HDF5 files created outside protspace (bio_embeddings, custom scripts, Colab). Files from protspace embed/prepare already have the attribute.
# External files — need colon syntax
protspace prepare -i my_embeddings.h5:prot_t5 -m pca2 -o output
protspace prepare -i esm2.h5:esm2_650m -i prott5.h5:prot_t5 -m pca2 -o output
# Combine datasets — same name → union proteins
protspace prepare -i species_a.h5:prot_t5 -i species_b.h5:prot_t5 -m umap2 -o output
# Protspace-generated files — just work
protspace prepare -i embeddings/prot_t5.h5 -m pca2 -o outputCheck if an HDF5 file has the attribute: python -c "import h5py; print(dict(h5py.File('file.h5','r').attrs))"
With --keep-tmp (default), all intermediate results are cached in {output}/tmp/ and reused on subsequent runs:
| Cached item | File | Reuse behavior |
|---|---|---|
| FASTA sequences | sequences.fasta |
Skip UniProt query download |
| Embeddings | {embedder}.h5 |
Skip already-embedded proteins |
| Annotations | all_annotations.parquet |
Fetch only missing annotation sources |
| Similarity matrix | similarity_matrix.npy |
Skip MMseqs2 recomputation |
| DR projections | proj_{name}_{method}_{hash}.npz |
Skip dimensionality reduction |
- Annotation cache always includes scores regardless of
--no-scores - DR projection caches are keyed by embedding name, method, dimensions, and all parameters — changing any parameter creates a new cache entry
- Use
--refetch allto bypass all caches, or--refetch <stages>selectively (e.g.,--refetch ted,biocentral)
See also: Annotation Reference | Annotation Styling