feat: replace f5c + in-tree CUDA DTW with the krill engine (v0.4.0) by loganylchen · Pull Request #2 · loganylchen/py-baleen

loganylchen · 2026-06-11T04:28:22Z

Summary

Replaces the two pieces of engine code baleen maintained — f5c eventalign (external CLI) and the self-written CUDA DTW (_cuda_dtw C extension) — with the krill package, which provides both. baleen is now pure Python (no nvcc build, no tslearn). Validated by two acceptance gates with no performance loss.

What changed

DTW (Phase 1) — baleen/_dtw.py shim over krill's DTW; deleted baleen/_cuda_dtw/; dropped tslearn.

Gate A: production GPU DTW is bit-identical to the old kernel (max|Δ| = 0) across all entry points.

Eventalign (Phase 2) — baleen/eventalign/_eventalign.py (krill, HMM-free / forced-dense: krill's alignment HMM is OFF, every event forced onto the mapped reference, no read-vs-ref skips); deleted _f5c.py. Emits f5c-format TSV so everything downstream is unchanged. Added --pore; removed --f5c-threads. Dropped the f5c FASTQ-index step (krill reads BLOW5 directly; needs slow5tools index).

Gate B (krill vs f5c eventalign, identical GPU DTW): krill wins every metric at stoich=0.5; at stoich=1.0 the headline mod_ratio and 5/6 AUROCs are within −0.02 (only −log10(padj) off by 0.004). Accepted.

Packaging/docs (Phase 3) — rewrote Dockerfile.cpu/Dockerfile.gpu for krill (no f5c, no CUDA build); updated CI, docs, CLAUDE.md, changelog (v0.4.0); bumped to Python ≥3.10 (krill ships cp310+ wheels only).

Engine install

krill is not on PyPI — install from the project index (bundled in the Docker images):

pip install krill --no-deps --index-url https://loganylchen.github.io/krill-dist/cu122/simple/   # GPU
pip install krill --no-deps --index-url https://loganylchen.github.io/krill-dist/simple/         # CPU

Verification

Full suite: 305 passed / 1 skipped in a krill-only env (tslearn absent).
Both production images build and run (cpu→backend cpu, gpu→backend gpu+gpu:True, slow5tools present).
mkdocs build --strict succeeds.
Every commit passed a Codex review before merge.

Notes

Breaking: drops Python 3.9 (krill has no cp39 wheel); removes BALEEN_NO_CUDA / BALEEN_CUDA_ARCHS / --f5c-threads; f5c no longer required.
requires-python bumped to >=3.10; version → 0.4.0.

Test plan

CI test matrix (3.10/3.11/3.12) green with krill installed from the project index
CI builds + pushes the cpu/gpu images
Pull the published dev image and re-run the suite + a stoich benchmark on GPU

krill ships the same cuDTW++ kernels py-baleen maintained as the _cuda_dtw C-extension. Delegate to krill.dtw_* via a thin baleen/_dtw.py shim (use_cuda -> use_gpu adapter, retaining the pure-Python GPU memory-planning helpers the pipeline's chunking relies on) and delete the vendored CUDA sources. Gate A (benchmarks/krill_exp/verify_dtw.py): the production GPU path is bit-identical to _cuda_dtw (max|Δ|=0) across dtw_distance, dtw_pairwise_varlen and dtw_multi_position_pairwise; CPU per-pair primitives identical. krill's batched CPU path resamples to fixed buckets (GPU-consistent), differing from the old tslearn fallback only on CPU-only installs. The GPU image now installs krill from the cu122 index (built_with_cuda).

Address two P1 review findings on the _cuda_dtw -> krill swap: - krill is now a required engine but not on PyPI: _dtw.py raised a bare ModuleNotFoundError on import. Replace with a clear, actionable error pointing at the cu122/CPU project indexes and Docker, and document the install in pyproject (krill intentionally not in dependencies so a plain `pip install .` does not fail on a PyPI lookup). - Remove all lingering baleen._cuda_dtw consumers left dangling by the module deletion: rewire scripts/profile_pipeline.py and benchmarks/bench.py to baleen._dtw; rewrite tests/test_dtw.py against the krill shim (CPU/GPU, validation, backend reporting, memory helpers); delete the obsolete test_cudtw_migration.py, test_setup.py, setup.py (pure-Python now), and the one-shot Gate A script verify_dtw.py. Bump version to 0.4.0, drop tslearn.

…dex review) Second-pass review fixes after the krill swap: - _pipeline._compute_pairwise_batch no longer imports tslearn; it delegates to the krill shim (_dtw.dtw_pairwise_varlen, use_cuda=False), so CPU/test paths work with krill installed and tslearn absent. - profile_pipeline.py: drop unsupported use_open_start/use_open_end kwargs from the dtw_multi_position_pairwise call (would TypeError on first contig). - bench.py: narrow the DTW cProfile matcher to krill's specific DTW method names so _signal.extract_signals_for_dtw_padded is no longer miscounted as DTW; report dtw_backend from _dtw.backend() ('gpu'/'cpu') instead of the stale 'tslearn' label.

test_standard_dtw_variable_length_matches_pairwise_loop cross-checked the CPU pairwise matrix against tslearn, which is no longer a dependency (the CPU path now delegates to krill). Compare against _dtw.dtw_distance instead so the test passes in a krill-only environment. Verified with tslearn uninstalled.

…-dense) Swap the external f5c eventalign CLI for krill's in-process aligner. krill is constructed with hmm_confidence=False, keep_kmer_skips=False — i.e. its alignment HMM is OFF and every event is forced onto the mapped reference subsequence (dense, no read-vs-ref skips), in contrast to f5c's HMM which skips deletions. Output stays in f5c's 16-column --samples TSV format, so group_signals_by_position and everything downstream is unchanged. (This is distinct from the downstream V3 run_hmm p_mod smoothing, which is untouched.) - New baleen/eventalign/_eventalign.py: run_eventalign (drop-in for the old f5c wrapper, with a `pore` arg), index_blow5 (slow5tools), check_krill. - _pipeline.py: route through _eventalign; thread `pore` through run_pipeline / run_pipeline_streaming / _process_contig[_streaming]; drop the f5c FASTQ-index step (krill reads BLOW5 directly) and the extra_f5c_args plumbing; rename PipelineMetadata.f5c_version -> eventalign_version. - cli.py: add --pore (default rna002); remove --f5c-threads and the f5c -t thread injection. - Delete baleen/eventalign/_f5c.py and tests/test_f5c.py; rewire test_pipeline, test_cli, scripts/profile_pipeline.py and bench.py profile buckets. Verified: full suite 304 passed/1 skipped with tslearn absent; real end-to-end stoich=1.0 run produces 4221 sites across 2 contigs on GPU.

…y (codex review) Address the Phase 2 review findings: - [P1] Add pyfastx to runtime dependencies. _eventalign imports it at module import time and _pipeline imports _eventalign unconditionally, so a fresh install was failing with ModuleNotFoundError before check_krill could report the intended setup error. - [P2] Include `pore` in the resume fingerprint (schema_version 1 -> 2). A resumed run with a different --pore would otherwise reuse per-contig slices generated by the old pore model. Updated test_resume / test_read_ids callers. - [P2] Honor --no-primary-only in krill eventalign: run_eventalign now takes primary_only and only filters secondary/supplementary when it is set, so the CLI flag actually includes those alignments instead of silently dropping them.

The Phase 2 fix let --no-primary-only produce eventalign rows for secondary/supplementary alignments, but _read_bam._scan_and_write still unconditionally skipped them, so those calls reached the site TSV yet were omitted from read_results.bam. Thread primary_only through flush_contig_to_bam -> _scan_and_write (default True preserves behavior) and pass it from the streaming worker so both paths agree.

…odex review) With non-primary alignments included, the contig-wide seen_reads set keyed only by query_name dropped all but the first-fetched alignment of a multi-aligning read, so --no-primary-only still omitted secondary/supplementary records from read_results.bam. Key the dedup on (name, flag, reference_start) in that mode (name-only still applies in the default primary-only mode, one record per read).

The 17dcd48/8f9df9a attempts to thread primary_only into the mod-BAM writer opened a path that emits non-primary alignments without reapplying the mapq/reverse/subsample filters the analysis used. That inconsistency is pre-existing: the original f5c pipeline also processed secondary/supplementary in eventalign (via the split BAM) while read_results.bam was always primary-only. Restore _read_bam to that original, behavior-preserving state. _eventalign still honors primary_only (consistent with the primary_only-filtered split BAM it reads), so the krill swap preserves end-to-end behavior.

_validate_resume_compatibility compared only inputs/params and never checked schema_version, so the field that exists to invalidate incompatible partial runs did nothing. Reject a resume when the stored schema_version differs from the current one, so format/writer changes (and the pore bump to schema 2) properly force a fresh run instead of silently mixing old and new outputs. Add a regression test for the schema-mismatch rejection.

Phase 3 of the krill migration — packaging and docs cleanup: - Dockerfile.cpu / Dockerfile.gpu: rewrite for krill. Drop f5c, the nvcc/CUDA C-extension build, and the multi-stage builder; install baleen (pure Python) plus the krill wheel (plain CPU index for cpu, cu122 index for gpu) and slow5tools. The gpu image upgrades pip/setuptools first (Ubuntu 22.04 ships a pre-PEP621 setuptools that would otherwise build an empty UNKNOWN package). - .docker/Dockerfile.docs: install krill (CPU) so mkdocstrings can import baleen; drop BALEEN_NO_CUDA. - CI: docker.yml and docs.yml install krill from the project index instead of BALEEN_NO_CUDA=1; remove the build-time CUDA flags. - docs/ + CLAUDE.md: describe the krill engine (GPU/CPU by wheel+device, not a compile flag), the _dtw shim (backend() -> "gpu"/"cpu"), krill HMM-free forced-dense eventalign, slow5tools indexing (no f5c .readdb step), the --pore flag, and removal of --f5c-threads / BALEEN_NO_CUDA / BALEEN_CUDA_ARCHS. Add a v0.4.0 changelog entry. - Add benchmarks/krill_exp/gate_b.py (the end-to-end AUROC/AUPRC harness used to validate the eventalign swap). Verified: both production images build and run (cpu -> backend cpu, gpu -> backend gpu, krill 0.1.5/+cu122, slow5tools present); full suite 305 passed / 1 skipped in a krill-only env; mkdocs build --strict succeeds.

…k (codex review) - [P1] krill publishes cp310/311/312 wheels only (no cp39), so the Python 3.9 CI matrix job could never install it. Bump requires-python to >=3.10, swap the CI matrix to 3.10/3.11/3.12, update the classifiers, and fix the "3.9 - 3.11 tested" docs. - [P2] read_ids_from_fastq_with_readdb still prefers a legacy <fastq>.index.readdb when present. Correct the inputs.md read-ID table: krill no longer creates a readdb, but an existing one (from an old f5c run) is still honored — so the doc no longer claims "no .readdb dependency".

Copilot

Pull request overview

This PR replaces baleen’s in-repo CUDA DTW extension and external f5c eventalign CLI integration with the external krill engine (v0.4.0), making baleen a pure-Python package (Python ≥3.10) while keeping downstream TSV formats stable.

Changes:

Introduces baleen/_dtw.py as a shim over krill DTW (GPU/CPU) and updates the pipeline to use it.
Replaces the f5c subprocess wrapper with an in-process krill-backed eventalign implementation (baleen/eventalign/_eventalign.py) and wires a new --pore CLI option into resume fingerprints.
Updates packaging/CI/Docker/docs/benchmarks to reflect the new engine + install model (krill via a project index, slow5tools required).

Reviewed changes

Copilot reviewed 51 out of 51 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_setup.py	Removes setup/import tests tied to the deleted `_cuda_dtw` extension and tslearn fallback.
tests/test_resume.py	Updates resume fingerprint expectations (schema v2) and adds schema mismatch rejection test; includes `pore`.
tests/test_read_ids.py	Updates fingerprint calls to include `pore`.
tests/test_pipeline.py	Updates pipeline tests to patch krill eventalign wrapper and `_dtw` shim; renames metadata field to `eventalign_version`.
tests/test_f5c.py	Removes tests for the deleted `baleen.eventalign._f5c` wrapper.
tests/test_dtw.py	Replaces CUDA/tslearn DTW tests with tests for the krill-backed `_dtw` shim (CPU, optional GPU agreement).
tests/test_cudtw_migration.py	Removes migration tests specific to the deleted in-tree CUDA DTW implementation.
tests/test_cli.py	Updates CLI arg fixture to replace `f5c_threads` with `pore`.
setup.py	Removes legacy setuptools/nvcc-based extension build script.
scripts/profile_pipeline.py	Updates profiling harness to use krill eventalign + `_dtw` shim and renames f5c timing fields.
pyproject.toml	Bumps version to 0.4.0, requires Python ≥3.10, drops tslearn, adds pyfastx; documents krill’s non-PyPI install.
docs/installation.md	Updates install instructions to be pure-Python + explicit krill install from project index; adds slow5tools requirement.
docs/index.md	Updates overview to reference krill engine and removes tslearn mention.
docs/guide/performance.md	Updates backend selection semantics to krill wheel/device-based GPU vs CPU.
docs/guide/overview.md	Updates architecture overview to krill eventalign and krill-backed DTW.
docs/guide/inputs.md	Removes f5c FASTQ indexing step; documents slow5tools-only indexing and legacy readdb behavior.
docs/guide/docker.md	Updates Docker guidance to krill+slow5tools bundling and runtime GPU verification via `_dtw`.
docs/guide/cli.md	Updates CLI docs from f5c options to eventalign options (`--pore`).
docs/contributing.md	Updates contributor setup and architecture notes for krill-based engine.
docs/changelog.md	Adds v0.4.0 changelog entry describing the engine swap and breaking changes.
Dockerfile.gpu	Rebuilds GPU image to install baleen + krill cu122 wheel + slow5tools (no nvcc build, no f5c).
Dockerfile.cpu	Rebuilds CPU image to install baleen + krill CPU wheel + slow5tools (no f5c).
CLAUDE.md	Updates project architecture and setup notes to reflect krill engine and `_dtw` shim.
benchmarks/krill_exp/gate_b.py	Adds Gate B end-to-end benchmark script for comparing eventalign engines.
benchmarks/bench.py	Updates environment snapshot and profiler bucketing to attribute DTW to krill and eventalign to `_eventalign.py`.
baleen/eventalign/_pipeline.py	Replaces `_f5c` calls with `_eventalign`, switches DTW calls to `_dtw`, adds `pore` to fingerprint schema v2, renames metadata field.
baleen/eventalign/_f5c.py	Removes legacy f5c/slow5tools subprocess wrapper module.
baleen/eventalign/_eventalign.py	Adds krill-backed eventalign implementation that emits f5c-format TSV.
baleen/cli.py	Replaces f5c threading options with `--pore`; routes DTW device parsing to `_dtw`.
baleen/_dtw.py	Adds krill DTW shim and GPU memory planning helpers (nvidia-smi-based introspection).
baleen/_cuda_dtw/multithreading.h	Removes vendored CUDA helper code (part of deleted extension).
baleen/_cuda_dtw/multithreading.cpp	Removes vendored CUDA helper code (part of deleted extension).
baleen/_cuda_dtw/limits.hpp	Removes vendored CUDA helper code (part of deleted extension).
baleen/_cuda_dtw/dtw.hpp	Removes vendored CUDA kernel code (part of deleted extension).
baleen/_cuda_dtw/dtw_api.h	Removes vendored CUDA C API header (part of deleted extension).
baleen/_cuda_dtw/cudtw/LICENSE	Removes vendored cuDTW++ license file (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_511.cuh	Removes vendored cuDTW++ kernel (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_255.cuh	Removes vendored cuDTW++ kernel (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_127.cuh	Removes vendored cuDTW++ kernel (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/DTW.hpp	Removes vendored cuDTW++ dispatcher header (part of deleted extension subtree).
baleen/_cuda_dtw/cuda_utils.hpp	Removes vendored CUDA utils header (part of deleted extension).
baleen/_cuda_dtw/init.py	Removes legacy DTW public API wrapper module (`_cuda_dtw`).
.github/workflows/docs.yml	Updates docs CI to install krill CPU wheel (mkdocstrings import requirement).
.github/workflows/docker.yml	Updates test matrix to 3.10–3.12 and installs krill CPU wheel before running pytest.
.docker/Dockerfile.docs	Updates docs build image to include krill CPU wheel for mkdocstrings imports.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+_HEADER = (
+    "contig\tposition\treference_kmer\tread_name\tstrand\tevent_index\t"
+    "event_level_mean\tevent_stdv\tevent_length\tmodel_kmer\tmodel_mean\t"
+    "model_stdv\tstandardized_level\tstart_idx\tend_idx\tsamples"
+)


+    eventalign.add_argument(
        "--no-rna", action="store_true", default=False,
-        help="Disable RNA mode for f5c eventalign",
+        help="Disable RNA mode for eventalign",
    )


-| `--no-rna` | off | Disable RNA mode for f5c. |
-| `--kmer-model` | — | Custom k-mer model for f5c. |
+| `--pore` | `rna002` | krill pore model for eventalign. |
+| `--no-rna` | off | Disable RNA mode for eventalign. |


loganylchen added 12 commits June 11, 2026 04:25

Copilot AI review requested due to automatic review settings June 11, 2026 04:28

Copilot started reviewing on behalf of loganylchen June 11, 2026 04:28 View session

loganylchen merged commit 3eb65b0 into dev Jun 11, 2026
6 checks passed

loganylchen deleted the pr/krill-engine branch June 11, 2026 04:30

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: replace f5c + in-tree CUDA DTW with the krill engine (v0.4.0)#2

feat: replace f5c + in-tree CUDA DTW with the krill engine (v0.4.0)#2
loganylchen merged 12 commits into
devfrom
pr/krill-engine

loganylchen commented Jun 11, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loganylchen commented Jun 11, 2026

Summary

What changed

Engine install

Verification

Notes

Test plan

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants