feat: replace f5c + in-tree CUDA DTW with the krill engine (v0.4.0)#2
Merged
Conversation
krill ships the same cuDTW++ kernels py-baleen maintained as the _cuda_dtw C-extension. Delegate to krill.dtw_* via a thin baleen/_dtw.py shim (use_cuda -> use_gpu adapter, retaining the pure-Python GPU memory-planning helpers the pipeline's chunking relies on) and delete the vendored CUDA sources. Gate A (benchmarks/krill_exp/verify_dtw.py): the production GPU path is bit-identical to _cuda_dtw (max|Δ|=0) across dtw_distance, dtw_pairwise_varlen and dtw_multi_position_pairwise; CPU per-pair primitives identical. krill's batched CPU path resamples to fixed buckets (GPU-consistent), differing from the old tslearn fallback only on CPU-only installs. The GPU image now installs krill from the cu122 index (built_with_cuda).
Address two P1 review findings on the _cuda_dtw -> krill swap: - krill is now a required engine but not on PyPI: _dtw.py raised a bare ModuleNotFoundError on import. Replace with a clear, actionable error pointing at the cu122/CPU project indexes and Docker, and document the install in pyproject (krill intentionally not in dependencies so a plain `pip install .` does not fail on a PyPI lookup). - Remove all lingering baleen._cuda_dtw consumers left dangling by the module deletion: rewire scripts/profile_pipeline.py and benchmarks/bench.py to baleen._dtw; rewrite tests/test_dtw.py against the krill shim (CPU/GPU, validation, backend reporting, memory helpers); delete the obsolete test_cudtw_migration.py, test_setup.py, setup.py (pure-Python now), and the one-shot Gate A script verify_dtw.py. Bump version to 0.4.0, drop tslearn.
…dex review)
Second-pass review fixes after the krill swap:
- _pipeline._compute_pairwise_batch no longer imports tslearn; it delegates
to the krill shim (_dtw.dtw_pairwise_varlen, use_cuda=False), so CPU/test
paths work with krill installed and tslearn absent.
- profile_pipeline.py: drop unsupported use_open_start/use_open_end kwargs
from the dtw_multi_position_pairwise call (would TypeError on first contig).
- bench.py: narrow the DTW cProfile matcher to krill's specific DTW method
names so _signal.extract_signals_for_dtw_padded is no longer miscounted as
DTW; report dtw_backend from _dtw.backend() ('gpu'/'cpu') instead of the
stale 'tslearn' label.
test_standard_dtw_variable_length_matches_pairwise_loop cross-checked the CPU pairwise matrix against tslearn, which is no longer a dependency (the CPU path now delegates to krill). Compare against _dtw.dtw_distance instead so the test passes in a krill-only environment. Verified with tslearn uninstalled.
…-dense) Swap the external f5c eventalign CLI for krill's in-process aligner. krill is constructed with hmm_confidence=False, keep_kmer_skips=False — i.e. its alignment HMM is OFF and every event is forced onto the mapped reference subsequence (dense, no read-vs-ref skips), in contrast to f5c's HMM which skips deletions. Output stays in f5c's 16-column --samples TSV format, so group_signals_by_position and everything downstream is unchanged. (This is distinct from the downstream V3 run_hmm p_mod smoothing, which is untouched.) - New baleen/eventalign/_eventalign.py: run_eventalign (drop-in for the old f5c wrapper, with a `pore` arg), index_blow5 (slow5tools), check_krill. - _pipeline.py: route through _eventalign; thread `pore` through run_pipeline / run_pipeline_streaming / _process_contig[_streaming]; drop the f5c FASTQ-index step (krill reads BLOW5 directly) and the extra_f5c_args plumbing; rename PipelineMetadata.f5c_version -> eventalign_version. - cli.py: add --pore (default rna002); remove --f5c-threads and the f5c -t thread injection. - Delete baleen/eventalign/_f5c.py and tests/test_f5c.py; rewire test_pipeline, test_cli, scripts/profile_pipeline.py and bench.py profile buckets. Verified: full suite 304 passed/1 skipped with tslearn absent; real end-to-end stoich=1.0 run produces 4221 sites across 2 contigs on GPU.
…y (codex review) Address the Phase 2 review findings: - [P1] Add pyfastx to runtime dependencies. _eventalign imports it at module import time and _pipeline imports _eventalign unconditionally, so a fresh install was failing with ModuleNotFoundError before check_krill could report the intended setup error. - [P2] Include `pore` in the resume fingerprint (schema_version 1 -> 2). A resumed run with a different --pore would otherwise reuse per-contig slices generated by the old pore model. Updated test_resume / test_read_ids callers. - [P2] Honor --no-primary-only in krill eventalign: run_eventalign now takes primary_only and only filters secondary/supplementary when it is set, so the CLI flag actually includes those alignments instead of silently dropping them.
The Phase 2 fix let --no-primary-only produce eventalign rows for secondary/supplementary alignments, but _read_bam._scan_and_write still unconditionally skipped them, so those calls reached the site TSV yet were omitted from read_results.bam. Thread primary_only through flush_contig_to_bam -> _scan_and_write (default True preserves behavior) and pass it from the streaming worker so both paths agree.
…odex review) With non-primary alignments included, the contig-wide seen_reads set keyed only by query_name dropped all but the first-fetched alignment of a multi-aligning read, so --no-primary-only still omitted secondary/supplementary records from read_results.bam. Key the dedup on (name, flag, reference_start) in that mode (name-only still applies in the default primary-only mode, one record per read).
The 17dcd48/8f9df9a attempts to thread primary_only into the mod-BAM writer opened a path that emits non-primary alignments without reapplying the mapq/reverse/subsample filters the analysis used. That inconsistency is pre-existing: the original f5c pipeline also processed secondary/supplementary in eventalign (via the split BAM) while read_results.bam was always primary-only. Restore _read_bam to that original, behavior-preserving state. _eventalign still honors primary_only (consistent with the primary_only-filtered split BAM it reads), so the krill swap preserves end-to-end behavior.
_validate_resume_compatibility compared only inputs/params and never checked schema_version, so the field that exists to invalidate incompatible partial runs did nothing. Reject a resume when the stored schema_version differs from the current one, so format/writer changes (and the pore bump to schema 2) properly force a fresh run instead of silently mixing old and new outputs. Add a regression test for the schema-mismatch rejection.
Phase 3 of the krill migration — packaging and docs cleanup: - Dockerfile.cpu / Dockerfile.gpu: rewrite for krill. Drop f5c, the nvcc/CUDA C-extension build, and the multi-stage builder; install baleen (pure Python) plus the krill wheel (plain CPU index for cpu, cu122 index for gpu) and slow5tools. The gpu image upgrades pip/setuptools first (Ubuntu 22.04 ships a pre-PEP621 setuptools that would otherwise build an empty UNKNOWN package). - .docker/Dockerfile.docs: install krill (CPU) so mkdocstrings can import baleen; drop BALEEN_NO_CUDA. - CI: docker.yml and docs.yml install krill from the project index instead of BALEEN_NO_CUDA=1; remove the build-time CUDA flags. - docs/ + CLAUDE.md: describe the krill engine (GPU/CPU by wheel+device, not a compile flag), the _dtw shim (backend() -> "gpu"/"cpu"), krill HMM-free forced-dense eventalign, slow5tools indexing (no f5c .readdb step), the --pore flag, and removal of --f5c-threads / BALEEN_NO_CUDA / BALEEN_CUDA_ARCHS. Add a v0.4.0 changelog entry. - Add benchmarks/krill_exp/gate_b.py (the end-to-end AUROC/AUPRC harness used to validate the eventalign swap). Verified: both production images build and run (cpu -> backend cpu, gpu -> backend gpu, krill 0.1.5/+cu122, slow5tools present); full suite 305 passed / 1 skipped in a krill-only env; mkdocs build --strict succeeds.
…k (codex review) - [P1] krill publishes cp310/311/312 wheels only (no cp39), so the Python 3.9 CI matrix job could never install it. Bump requires-python to >=3.10, swap the CI matrix to 3.10/3.11/3.12, update the classifiers, and fix the "3.9 - 3.11 tested" docs. - [P2] read_ids_from_fastq_with_readdb still prefers a legacy <fastq>.index.readdb when present. Correct the inputs.md read-ID table: krill no longer creates a readdb, but an existing one (from an old f5c run) is still honored — so the doc no longer claims "no .readdb dependency".
There was a problem hiding this comment.
Pull request overview
This PR replaces baleen’s in-repo CUDA DTW extension and external f5c eventalign CLI integration with the external krill engine (v0.4.0), making baleen a pure-Python package (Python ≥3.10) while keeping downstream TSV formats stable.
Changes:
- Introduces
baleen/_dtw.pyas a shim over krill DTW (GPU/CPU) and updates the pipeline to use it. - Replaces the
f5csubprocess wrapper with an in-process krill-backed eventalign implementation (baleen/eventalign/_eventalign.py) and wires a new--poreCLI option into resume fingerprints. - Updates packaging/CI/Docker/docs/benchmarks to reflect the new engine + install model (krill via a project index, slow5tools required).
Reviewed changes
Copilot reviewed 51 out of 51 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_setup.py | Removes setup/import tests tied to the deleted _cuda_dtw extension and tslearn fallback. |
| tests/test_resume.py | Updates resume fingerprint expectations (schema v2) and adds schema mismatch rejection test; includes pore. |
| tests/test_read_ids.py | Updates fingerprint calls to include pore. |
| tests/test_pipeline.py | Updates pipeline tests to patch krill eventalign wrapper and _dtw shim; renames metadata field to eventalign_version. |
| tests/test_f5c.py | Removes tests for the deleted baleen.eventalign._f5c wrapper. |
| tests/test_dtw.py | Replaces CUDA/tslearn DTW tests with tests for the krill-backed _dtw shim (CPU, optional GPU agreement). |
| tests/test_cudtw_migration.py | Removes migration tests specific to the deleted in-tree CUDA DTW implementation. |
| tests/test_cli.py | Updates CLI arg fixture to replace f5c_threads with pore. |
| setup.py | Removes legacy setuptools/nvcc-based extension build script. |
| scripts/profile_pipeline.py | Updates profiling harness to use krill eventalign + _dtw shim and renames f5c timing fields. |
| pyproject.toml | Bumps version to 0.4.0, requires Python ≥3.10, drops tslearn, adds pyfastx; documents krill’s non-PyPI install. |
| docs/installation.md | Updates install instructions to be pure-Python + explicit krill install from project index; adds slow5tools requirement. |
| docs/index.md | Updates overview to reference krill engine and removes tslearn mention. |
| docs/guide/performance.md | Updates backend selection semantics to krill wheel/device-based GPU vs CPU. |
| docs/guide/overview.md | Updates architecture overview to krill eventalign and krill-backed DTW. |
| docs/guide/inputs.md | Removes f5c FASTQ indexing step; documents slow5tools-only indexing and legacy readdb behavior. |
| docs/guide/docker.md | Updates Docker guidance to krill+slow5tools bundling and runtime GPU verification via _dtw. |
| docs/guide/cli.md | Updates CLI docs from f5c options to eventalign options (--pore). |
| docs/contributing.md | Updates contributor setup and architecture notes for krill-based engine. |
| docs/changelog.md | Adds v0.4.0 changelog entry describing the engine swap and breaking changes. |
| Dockerfile.gpu | Rebuilds GPU image to install baleen + krill cu122 wheel + slow5tools (no nvcc build, no f5c). |
| Dockerfile.cpu | Rebuilds CPU image to install baleen + krill CPU wheel + slow5tools (no f5c). |
| CLAUDE.md | Updates project architecture and setup notes to reflect krill engine and _dtw shim. |
| benchmarks/krill_exp/gate_b.py | Adds Gate B end-to-end benchmark script for comparing eventalign engines. |
| benchmarks/bench.py | Updates environment snapshot and profiler bucketing to attribute DTW to krill and eventalign to _eventalign.py. |
| baleen/eventalign/_pipeline.py | Replaces _f5c calls with _eventalign, switches DTW calls to _dtw, adds pore to fingerprint schema v2, renames metadata field. |
| baleen/eventalign/_f5c.py | Removes legacy f5c/slow5tools subprocess wrapper module. |
| baleen/eventalign/_eventalign.py | Adds krill-backed eventalign implementation that emits f5c-format TSV. |
| baleen/cli.py | Replaces f5c threading options with --pore; routes DTW device parsing to _dtw. |
| baleen/_dtw.py | Adds krill DTW shim and GPU memory planning helpers (nvidia-smi-based introspection). |
| baleen/_cuda_dtw/multithreading.h | Removes vendored CUDA helper code (part of deleted extension). |
| baleen/_cuda_dtw/multithreading.cpp | Removes vendored CUDA helper code (part of deleted extension). |
| baleen/_cuda_dtw/limits.hpp | Removes vendored CUDA helper code (part of deleted extension). |
| baleen/_cuda_dtw/dtw.hpp | Removes vendored CUDA kernel code (part of deleted extension). |
| baleen/_cuda_dtw/dtw_api.h | Removes vendored CUDA C API header (part of deleted extension). |
| baleen/_cuda_dtw/cudtw/LICENSE | Removes vendored cuDTW++ license file (part of deleted extension subtree). |
| baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_511.cuh | Removes vendored cuDTW++ kernel (part of deleted extension subtree). |
| baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_255.cuh | Removes vendored cuDTW++ kernel (part of deleted extension subtree). |
| baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_127.cuh | Removes vendored cuDTW++ kernel (part of deleted extension subtree). |
| baleen/_cuda_dtw/cudtw/include/DTW.hpp | Removes vendored cuDTW++ dispatcher header (part of deleted extension subtree). |
| baleen/_cuda_dtw/cuda_utils.hpp | Removes vendored CUDA utils header (part of deleted extension). |
| baleen/_cuda_dtw/init.py | Removes legacy DTW public API wrapper module (_cuda_dtw). |
| .github/workflows/docs.yml | Updates docs CI to install krill CPU wheel (mkdocstrings import requirement). |
| .github/workflows/docker.yml | Updates test matrix to 3.10–3.12 and installs krill CPU wheel before running pytest. |
| .docker/Dockerfile.docs | Updates docs build image to include krill CPU wheel for mkdocstrings imports. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+44
to
+48
| _HEADER = ( | ||
| "contig\tposition\treference_kmer\tread_name\tstrand\tevent_index\t" | ||
| "event_level_mean\tevent_stdv\tevent_length\tmodel_kmer\tmodel_mean\t" | ||
| "model_stdv\tstandardized_level\tstart_idx\tend_idx\tsamples" | ||
| ) |
Comment on lines
+176
to
179
| eventalign.add_argument( | ||
| "--no-rna", action="store_true", default=False, | ||
| help="Disable RNA mode for f5c eventalign", | ||
| help="Disable RNA mode for eventalign", | ||
| ) |
| | `--no-rna` | off | Disable RNA mode for f5c. | | ||
| | `--kmer-model` | — | Custom k-mer model for f5c. | | ||
| | `--pore` | `rna002` | krill pore model for eventalign. | | ||
| | `--no-rna` | off | Disable RNA mode for eventalign. | |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the two pieces of engine code baleen maintained — f5c eventalign (external CLI) and the self-written CUDA DTW (
_cuda_dtwC extension) — with the krill package, which provides both. baleen is now pure Python (no nvcc build, no tslearn). Validated by two acceptance gates with no performance loss.What changed
DTW (Phase 1) —
baleen/_dtw.pyshim over krill's DTW; deletedbaleen/_cuda_dtw/; dropped tslearn.Eventalign (Phase 2) —
baleen/eventalign/_eventalign.py(krill, HMM-free / forced-dense: krill's alignment HMM is OFF, every event forced onto the mapped reference, no read-vs-ref skips); deleted_f5c.py. Emits f5c-format TSV so everything downstream is unchanged. Added--pore; removed--f5c-threads. Dropped the f5c FASTQ-index step (krill reads BLOW5 directly; needsslow5tools index).mod_ratioand 5/6 AUROCs are within −0.02 (only −log10(padj) off by 0.004). Accepted.Packaging/docs (Phase 3) — rewrote
Dockerfile.cpu/Dockerfile.gpufor krill (no f5c, no CUDA build); updated CI, docs,CLAUDE.md, changelog (v0.4.0); bumped to Python ≥3.10 (krill ships cp310+ wheels only).Engine install
krill is not on PyPI — install from the project index (bundled in the Docker images):
Verification
backend cpu, gpu→backend gpu+gpu:True, slow5tools present).mkdocs build --strictsucceeds.Notes
BALEEN_NO_CUDA/BALEEN_CUDA_ARCHS/--f5c-threads; f5c no longer required.requires-pythonbumped to>=3.10; version →0.4.0.Test plan