Skip to content

feat: replace f5c + in-tree CUDA DTW with the krill engine (v0.4.0)#2

Merged
loganylchen merged 12 commits into
devfrom
pr/krill-engine
Jun 11, 2026
Merged

feat: replace f5c + in-tree CUDA DTW with the krill engine (v0.4.0)#2
loganylchen merged 12 commits into
devfrom
pr/krill-engine

Conversation

@loganylchen

Copy link
Copy Markdown
Owner

Summary

Replaces the two pieces of engine code baleen maintained — f5c eventalign (external CLI) and the self-written CUDA DTW (_cuda_dtw C extension) — with the krill package, which provides both. baleen is now pure Python (no nvcc build, no tslearn). Validated by two acceptance gates with no performance loss.

What changed

DTW (Phase 1)baleen/_dtw.py shim over krill's DTW; deleted baleen/_cuda_dtw/; dropped tslearn.

  • Gate A: production GPU DTW is bit-identical to the old kernel (max|Δ| = 0) across all entry points.

Eventalign (Phase 2)baleen/eventalign/_eventalign.py (krill, HMM-free / forced-dense: krill's alignment HMM is OFF, every event forced onto the mapped reference, no read-vs-ref skips); deleted _f5c.py. Emits f5c-format TSV so everything downstream is unchanged. Added --pore; removed --f5c-threads. Dropped the f5c FASTQ-index step (krill reads BLOW5 directly; needs slow5tools index).

  • Gate B (krill vs f5c eventalign, identical GPU DTW): krill wins every metric at stoich=0.5; at stoich=1.0 the headline mod_ratio and 5/6 AUROCs are within −0.02 (only −log10(padj) off by 0.004). Accepted.

Packaging/docs (Phase 3) — rewrote Dockerfile.cpu/Dockerfile.gpu for krill (no f5c, no CUDA build); updated CI, docs, CLAUDE.md, changelog (v0.4.0); bumped to Python ≥3.10 (krill ships cp310+ wheels only).

Engine install

krill is not on PyPI — install from the project index (bundled in the Docker images):

pip install krill --no-deps --index-url https://loganylchen.github.io/krill-dist/cu122/simple/   # GPU
pip install krill --no-deps --index-url https://loganylchen.github.io/krill-dist/simple/         # CPU

Verification

  • Full suite: 305 passed / 1 skipped in a krill-only env (tslearn absent).
  • Both production images build and run (cpu→backend cpu, gpu→backend gpu+gpu:True, slow5tools present).
  • mkdocs build --strict succeeds.
  • Every commit passed a Codex review before merge.

Notes

  • Breaking: drops Python 3.9 (krill has no cp39 wheel); removes BALEEN_NO_CUDA / BALEEN_CUDA_ARCHS / --f5c-threads; f5c no longer required.
  • requires-python bumped to >=3.10; version → 0.4.0.

Test plan

  • CI test matrix (3.10/3.11/3.12) green with krill installed from the project index
  • CI builds + pushes the cpu/gpu images
  • Pull the published dev image and re-run the suite + a stoich benchmark on GPU

krill ships the same cuDTW++ kernels py-baleen maintained as the _cuda_dtw
C-extension. Delegate to krill.dtw_* via a thin baleen/_dtw.py shim (use_cuda
-> use_gpu adapter, retaining the pure-Python GPU memory-planning helpers the
pipeline's chunking relies on) and delete the vendored CUDA sources.

Gate A (benchmarks/krill_exp/verify_dtw.py): the production GPU path is
bit-identical to _cuda_dtw (max|Δ|=0) across dtw_distance, dtw_pairwise_varlen
and dtw_multi_position_pairwise; CPU per-pair primitives identical. krill's
batched CPU path resamples to fixed buckets (GPU-consistent), differing from
the old tslearn fallback only on CPU-only installs.

The GPU image now installs krill from the cu122 index (built_with_cuda).
Address two P1 review findings on the _cuda_dtw -> krill swap:

- krill is now a required engine but not on PyPI: _dtw.py raised a bare
  ModuleNotFoundError on import. Replace with a clear, actionable error
  pointing at the cu122/CPU project indexes and Docker, and document the
  install in pyproject (krill intentionally not in dependencies so a plain
  `pip install .` does not fail on a PyPI lookup).
- Remove all lingering baleen._cuda_dtw consumers left dangling by the
  module deletion: rewire scripts/profile_pipeline.py and benchmarks/bench.py
  to baleen._dtw; rewrite tests/test_dtw.py against the krill shim (CPU/GPU,
  validation, backend reporting, memory helpers); delete the obsolete
  test_cudtw_migration.py, test_setup.py, setup.py (pure-Python now), and the
  one-shot Gate A script verify_dtw.py. Bump version to 0.4.0, drop tslearn.
…dex review)

Second-pass review fixes after the krill swap:

- _pipeline._compute_pairwise_batch no longer imports tslearn; it delegates
  to the krill shim (_dtw.dtw_pairwise_varlen, use_cuda=False), so CPU/test
  paths work with krill installed and tslearn absent.
- profile_pipeline.py: drop unsupported use_open_start/use_open_end kwargs
  from the dtw_multi_position_pairwise call (would TypeError on first contig).
- bench.py: narrow the DTW cProfile matcher to krill's specific DTW method
  names so _signal.extract_signals_for_dtw_padded is no longer miscounted as
  DTW; report dtw_backend from _dtw.backend() ('gpu'/'cpu') instead of the
  stale 'tslearn' label.
test_standard_dtw_variable_length_matches_pairwise_loop cross-checked the CPU
pairwise matrix against tslearn, which is no longer a dependency (the CPU path
now delegates to krill). Compare against _dtw.dtw_distance instead so the test
passes in a krill-only environment. Verified with tslearn uninstalled.
…-dense)

Swap the external f5c eventalign CLI for krill's in-process aligner. krill is
constructed with hmm_confidence=False, keep_kmer_skips=False — i.e. its
alignment HMM is OFF and every event is forced onto the mapped reference
subsequence (dense, no read-vs-ref skips), in contrast to f5c's HMM which
skips deletions. Output stays in f5c's 16-column --samples TSV format, so
group_signals_by_position and everything downstream is unchanged. (This is
distinct from the downstream V3 run_hmm p_mod smoothing, which is untouched.)

- New baleen/eventalign/_eventalign.py: run_eventalign (drop-in for the old
  f5c wrapper, with a `pore` arg), index_blow5 (slow5tools), check_krill.
- _pipeline.py: route through _eventalign; thread `pore` through
  run_pipeline / run_pipeline_streaming / _process_contig[_streaming]; drop
  the f5c FASTQ-index step (krill reads BLOW5 directly) and the extra_f5c_args
  plumbing; rename PipelineMetadata.f5c_version -> eventalign_version.
- cli.py: add --pore (default rna002); remove --f5c-threads and the f5c -t
  thread injection.
- Delete baleen/eventalign/_f5c.py and tests/test_f5c.py; rewire test_pipeline,
  test_cli, scripts/profile_pipeline.py and bench.py profile buckets.

Verified: full suite 304 passed/1 skipped with tslearn absent; real
end-to-end stoich=1.0 run produces 4221 sites across 2 contigs on GPU.
…y (codex review)

Address the Phase 2 review findings:

- [P1] Add pyfastx to runtime dependencies. _eventalign imports it at module
  import time and _pipeline imports _eventalign unconditionally, so a fresh
  install was failing with ModuleNotFoundError before check_krill could report
  the intended setup error.
- [P2] Include `pore` in the resume fingerprint (schema_version 1 -> 2). A
  resumed run with a different --pore would otherwise reuse per-contig slices
  generated by the old pore model. Updated test_resume / test_read_ids callers.
- [P2] Honor --no-primary-only in krill eventalign: run_eventalign now takes
  primary_only and only filters secondary/supplementary when it is set, so the
  CLI flag actually includes those alignments instead of silently dropping them.
The Phase 2 fix let --no-primary-only produce eventalign rows for
secondary/supplementary alignments, but _read_bam._scan_and_write still
unconditionally skipped them, so those calls reached the site TSV yet were
omitted from read_results.bam. Thread primary_only through
flush_contig_to_bam -> _scan_and_write (default True preserves behavior) and
pass it from the streaming worker so both paths agree.
…odex review)

With non-primary alignments included, the contig-wide seen_reads set keyed only
by query_name dropped all but the first-fetched alignment of a multi-aligning
read, so --no-primary-only still omitted secondary/supplementary records from
read_results.bam. Key the dedup on (name, flag, reference_start) in that mode
(name-only still applies in the default primary-only mode, one record per read).
The 17dcd48/8f9df9a attempts to thread primary_only into the mod-BAM writer
opened a path that emits non-primary alignments without reapplying the
mapq/reverse/subsample filters the analysis used. That inconsistency is
pre-existing: the original f5c pipeline also processed secondary/supplementary
in eventalign (via the split BAM) while read_results.bam was always
primary-only. Restore _read_bam to that original, behavior-preserving state.
_eventalign still honors primary_only (consistent with the primary_only-filtered
split BAM it reads), so the krill swap preserves end-to-end behavior.
_validate_resume_compatibility compared only inputs/params and never checked
schema_version, so the field that exists to invalidate incompatible partial
runs did nothing. Reject a resume when the stored schema_version differs from
the current one, so format/writer changes (and the pore bump to schema 2)
properly force a fresh run instead of silently mixing old and new outputs.
Add a regression test for the schema-mismatch rejection.
Phase 3 of the krill migration — packaging and docs cleanup:

- Dockerfile.cpu / Dockerfile.gpu: rewrite for krill. Drop f5c, the nvcc/CUDA
  C-extension build, and the multi-stage builder; install baleen (pure Python)
  plus the krill wheel (plain CPU index for cpu, cu122 index for gpu) and
  slow5tools. The gpu image upgrades pip/setuptools first (Ubuntu 22.04 ships a
  pre-PEP621 setuptools that would otherwise build an empty UNKNOWN package).
- .docker/Dockerfile.docs: install krill (CPU) so mkdocstrings can import
  baleen; drop BALEEN_NO_CUDA.
- CI: docker.yml and docs.yml install krill from the project index instead of
  BALEEN_NO_CUDA=1; remove the build-time CUDA flags.
- docs/ + CLAUDE.md: describe the krill engine (GPU/CPU by wheel+device, not a
  compile flag), the _dtw shim (backend() -> "gpu"/"cpu"), krill HMM-free
  forced-dense eventalign, slow5tools indexing (no f5c .readdb step), the
  --pore flag, and removal of --f5c-threads / BALEEN_NO_CUDA / BALEEN_CUDA_ARCHS.
  Add a v0.4.0 changelog entry.
- Add benchmarks/krill_exp/gate_b.py (the end-to-end AUROC/AUPRC harness used
  to validate the eventalign swap).

Verified: both production images build and run (cpu -> backend cpu, gpu ->
backend gpu, krill 0.1.5/+cu122, slow5tools present); full suite 305 passed /
1 skipped in a krill-only env; mkdocs build --strict succeeds.
…k (codex review)

- [P1] krill publishes cp310/311/312 wheels only (no cp39), so the Python 3.9
  CI matrix job could never install it. Bump requires-python to >=3.10, swap
  the CI matrix to 3.10/3.11/3.12, update the classifiers, and fix the
  "3.9 - 3.11 tested" docs.
- [P2] read_ids_from_fastq_with_readdb still prefers a legacy
  <fastq>.index.readdb when present. Correct the inputs.md read-ID table: krill
  no longer creates a readdb, but an existing one (from an old f5c run) is still
  honored — so the doc no longer claims "no .readdb dependency".
Copilot AI review requested due to automatic review settings June 11, 2026 04:28
@loganylchen loganylchen merged commit 3eb65b0 into dev Jun 11, 2026
6 checks passed
@loganylchen loganylchen deleted the pr/krill-engine branch June 11, 2026 04:30

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces baleen’s in-repo CUDA DTW extension and external f5c eventalign CLI integration with the external krill engine (v0.4.0), making baleen a pure-Python package (Python ≥3.10) while keeping downstream TSV formats stable.

Changes:

  • Introduces baleen/_dtw.py as a shim over krill DTW (GPU/CPU) and updates the pipeline to use it.
  • Replaces the f5c subprocess wrapper with an in-process krill-backed eventalign implementation (baleen/eventalign/_eventalign.py) and wires a new --pore CLI option into resume fingerprints.
  • Updates packaging/CI/Docker/docs/benchmarks to reflect the new engine + install model (krill via a project index, slow5tools required).

Reviewed changes

Copilot reviewed 51 out of 51 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_setup.py Removes setup/import tests tied to the deleted _cuda_dtw extension and tslearn fallback.
tests/test_resume.py Updates resume fingerprint expectations (schema v2) and adds schema mismatch rejection test; includes pore.
tests/test_read_ids.py Updates fingerprint calls to include pore.
tests/test_pipeline.py Updates pipeline tests to patch krill eventalign wrapper and _dtw shim; renames metadata field to eventalign_version.
tests/test_f5c.py Removes tests for the deleted baleen.eventalign._f5c wrapper.
tests/test_dtw.py Replaces CUDA/tslearn DTW tests with tests for the krill-backed _dtw shim (CPU, optional GPU agreement).
tests/test_cudtw_migration.py Removes migration tests specific to the deleted in-tree CUDA DTW implementation.
tests/test_cli.py Updates CLI arg fixture to replace f5c_threads with pore.
setup.py Removes legacy setuptools/nvcc-based extension build script.
scripts/profile_pipeline.py Updates profiling harness to use krill eventalign + _dtw shim and renames f5c timing fields.
pyproject.toml Bumps version to 0.4.0, requires Python ≥3.10, drops tslearn, adds pyfastx; documents krill’s non-PyPI install.
docs/installation.md Updates install instructions to be pure-Python + explicit krill install from project index; adds slow5tools requirement.
docs/index.md Updates overview to reference krill engine and removes tslearn mention.
docs/guide/performance.md Updates backend selection semantics to krill wheel/device-based GPU vs CPU.
docs/guide/overview.md Updates architecture overview to krill eventalign and krill-backed DTW.
docs/guide/inputs.md Removes f5c FASTQ indexing step; documents slow5tools-only indexing and legacy readdb behavior.
docs/guide/docker.md Updates Docker guidance to krill+slow5tools bundling and runtime GPU verification via _dtw.
docs/guide/cli.md Updates CLI docs from f5c options to eventalign options (--pore).
docs/contributing.md Updates contributor setup and architecture notes for krill-based engine.
docs/changelog.md Adds v0.4.0 changelog entry describing the engine swap and breaking changes.
Dockerfile.gpu Rebuilds GPU image to install baleen + krill cu122 wheel + slow5tools (no nvcc build, no f5c).
Dockerfile.cpu Rebuilds CPU image to install baleen + krill CPU wheel + slow5tools (no f5c).
CLAUDE.md Updates project architecture and setup notes to reflect krill engine and _dtw shim.
benchmarks/krill_exp/gate_b.py Adds Gate B end-to-end benchmark script for comparing eventalign engines.
benchmarks/bench.py Updates environment snapshot and profiler bucketing to attribute DTW to krill and eventalign to _eventalign.py.
baleen/eventalign/_pipeline.py Replaces _f5c calls with _eventalign, switches DTW calls to _dtw, adds pore to fingerprint schema v2, renames metadata field.
baleen/eventalign/_f5c.py Removes legacy f5c/slow5tools subprocess wrapper module.
baleen/eventalign/_eventalign.py Adds krill-backed eventalign implementation that emits f5c-format TSV.
baleen/cli.py Replaces f5c threading options with --pore; routes DTW device parsing to _dtw.
baleen/_dtw.py Adds krill DTW shim and GPU memory planning helpers (nvidia-smi-based introspection).
baleen/_cuda_dtw/multithreading.h Removes vendored CUDA helper code (part of deleted extension).
baleen/_cuda_dtw/multithreading.cpp Removes vendored CUDA helper code (part of deleted extension).
baleen/_cuda_dtw/limits.hpp Removes vendored CUDA helper code (part of deleted extension).
baleen/_cuda_dtw/dtw.hpp Removes vendored CUDA kernel code (part of deleted extension).
baleen/_cuda_dtw/dtw_api.h Removes vendored CUDA C API header (part of deleted extension).
baleen/_cuda_dtw/cudtw/LICENSE Removes vendored cuDTW++ license file (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_511.cuh Removes vendored cuDTW++ kernel (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_255.cuh Removes vendored cuDTW++ kernel (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/kernels/SHFL_FULLDTW_127.cuh Removes vendored cuDTW++ kernel (part of deleted extension subtree).
baleen/_cuda_dtw/cudtw/include/DTW.hpp Removes vendored cuDTW++ dispatcher header (part of deleted extension subtree).
baleen/_cuda_dtw/cuda_utils.hpp Removes vendored CUDA utils header (part of deleted extension).
baleen/_cuda_dtw/init.py Removes legacy DTW public API wrapper module (_cuda_dtw).
.github/workflows/docs.yml Updates docs CI to install krill CPU wheel (mkdocstrings import requirement).
.github/workflows/docker.yml Updates test matrix to 3.10–3.12 and installs krill CPU wheel before running pytest.
.docker/Dockerfile.docs Updates docs build image to include krill CPU wheel for mkdocstrings imports.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to +48
_HEADER = (
"contig\tposition\treference_kmer\tread_name\tstrand\tevent_index\t"
"event_level_mean\tevent_stdv\tevent_length\tmodel_kmer\tmodel_mean\t"
"model_stdv\tstandardized_level\tstart_idx\tend_idx\tsamples"
)
Comment thread baleen/cli.py
Comment on lines +176 to 179
eventalign.add_argument(
"--no-rna", action="store_true", default=False,
help="Disable RNA mode for f5c eventalign",
help="Disable RNA mode for eventalign",
)
Comment thread docs/guide/cli.md
| `--no-rna` | off | Disable RNA mode for f5c. |
| `--kmer-model` | — | Custom k-mer model for f5c. |
| `--pore` | `rna002` | krill pore model for eventalign. |
| `--no-rna` | off | Disable RNA mode for eventalign. |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants