|
| 1 | +# Plan: Auto-Algorithm Selection + Memory-Mapped Distance Matrix |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +Target: 100K+ time series. At N=100K the packed triangular distance matrix is N*(N+1)/2 doubles = ~40GB — doesn't fit in RAM. CLARA already avoids computing the full matrix by subsampling, but currently all series data must be loaded into RAM and there's no way to know N before loading. The user must also manually pick `--method`. |
| 6 | + |
| 7 | +**Goal**: Count series before loading, auto-select algorithm, and support mmap-backed distance matrix for when it IS needed (e.g., FastPAM on moderate N, or checkpoint/resume). |
| 8 | + |
| 9 | +## 5 Incremental Steps (each independently testable) |
| 10 | + |
| 11 | +### Step 1: Count-Before-Load |
| 12 | + |
| 13 | +Add `DataLoader::count()` — returns number of series without loading data. |
| 14 | + |
| 15 | +**File**: `dtwc/DataLoader.hpp` |
| 16 | + |
| 17 | +- Directory mode: count entries via `fs::directory_iterator` (no file reads) |
| 18 | +- Batch file mode: count lines (skip `start_row` header lines), respects `Ndata` limit |
| 19 | +- ~20 lines of code |
| 20 | + |
| 21 | +**Test**: Unit test with temp folder/file, assert count matches. |
| 22 | + |
| 23 | +### Step 2: Auto Method Selection |
| 24 | + |
| 25 | +Add `"auto"` to `--method` (make it the default). After loading data, pick algorithm based on `prob.size()`. |
| 26 | + |
| 27 | +**File**: `dtwc/dtwc_cl.cpp` |
| 28 | + |
| 29 | +1. Change default: `std::string method = "auto";` |
| 30 | +2. Add `{"auto", "auto"}` to the CheckedTransformer map (line 118) |
| 31 | +3. After `Problem prob{prob_name, dl};` (line 346), insert: |
| 32 | + |
| 33 | +```cpp |
| 34 | +if (method == "auto") { |
| 35 | + const size_t N = prob.size(); |
| 36 | + method = (N <= 5000) ? "pam" : "clara"; |
| 37 | + if (verbose) |
| 38 | + std::cout << "Auto-selected method: " << method << " (N=" << N << ")\n"; |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +4. For large N CLARA, scale sample size with sqrt(N): |
| 43 | + |
| 44 | +```cpp |
| 45 | +if (clara_opts.sample_size < 0 && prob.size() > 50000) |
| 46 | + clara_opts.sample_size = static_cast<int>(std::sqrt(prob.size()) * n_clusters); |
| 47 | +``` |
| 48 | + |
| 49 | +**Test**: Run CLI with `--method auto` on Coffee dataset (28 series → should pick pam). |
| 50 | + |
| 51 | +### Step 3: Memory-Mapped Distance Matrix |
| 52 | + |
| 53 | +Cross-platform mmap wrapper + `MmapDistanceMatrix` class with same API as `DenseDistanceMatrix`. |
| 54 | + |
| 55 | +**New file**: `dtwc/core/mmap_distance_matrix.hpp` |
| 56 | + |
| 57 | +**Dependency**: [llfio](https://github.com/mandreyel/llfio) (MIT, header-only, C++11, no deps). |
| 58 | +Add via CPM in `cmake/Dependencies.cmake`. Fits the project pattern (header-only deps like rapidcsv, Eigen). |
| 59 | + |
| 60 | +**Binary file layout** (internal, not user-facing): |
| 61 | + |
| 62 | +``` |
| 63 | +bytes 0-7: uint64_t magic ("DTWMAT01") |
| 64 | +bytes 8-15: uint64_t n |
| 65 | +bytes 16...: double[n*(n+1)/2] (NaN = uncomputed) |
| 66 | +``` |
| 67 | + |
| 68 | +**Implementation** using llfio (handles mmap, file locking, pre-allocation cross-platform): |
| 69 | + |
| 70 | +```cpp |
| 71 | +class MmapDistanceMatrix { |
| 72 | + llfio::mapped_file_handle mfh_; // RAII: mmap + file handle |
| 73 | + double* data_ = nullptr; // points past header |
| 74 | + size_t n_ = 0; |
| 75 | + |
| 76 | + // Same API as DenseDistanceMatrix |
| 77 | + double get(size_t i, size_t j) const; |
| 78 | + void set(size_t i, size_t j, double v); |
| 79 | + bool is_computed(size_t i, size_t j) const; |
| 80 | + size_t size() const { return n_; } |
| 81 | + |
| 82 | + // Periodic flush (mmap pages auto-flush, but explicit sync for safety) |
| 83 | + void sync(); |
| 84 | +}; |
| 85 | +``` |
| 86 | +
|
| 87 | +**Binary header** (24 bytes, from adversarial review): |
| 88 | +
|
| 89 | +| Offset | Size | Field | Purpose | |
| 90 | +|--------|------|-------|---------| |
| 91 | +| 0 | 4 | magic `"DTWM"` | Format identification | |
| 92 | +| 4 | 2 | version (1) | Future format evolution | |
| 93 | +| 6 | 4 | endian marker `0x01020304` | Detect byte-order mismatch | |
| 94 | +| 10 | 1 | elem_size (8) | float vs double detection | |
| 95 | +| 11 | 1 | reserved | Padding | |
| 96 | +| 12 | 8 | N (uint64) | Number of series | |
| 97 | +| 20 | 4 | data CRC32 | Corruption detection | |
| 98 | +
|
| 99 | +**Key features**: |
| 100 | +- **Warmstart**: Open existing file → mmap → NaN entries are uncomputed, resume from there |
| 101 | +- **Periodic save**: mmap pages auto-flush to disk by OS. Add explicit `sync()` every N pairs for crash safety |
| 102 | +- **File locking**: llfio provides byte-range locking — prevent two processes corrupting same cache |
| 103 | +- **Pre-allocation**: llfio uses `posix_fallocate`/`SetEndOfFile` — no sparse file SIGBUS risk |
| 104 | +- **Header written last**: On create, write data (NaN-filled) first, header last. Crash during init → no valid header → detected on reopen |
| 105 | +
|
| 106 | +**Test**: Create, write, close, reopen (warmstart), verify values persist. Test NaN sentinel. Test concurrent open (lock check). Test N=0, 1, 1000. |
| 107 | +
|
| 108 | +### Step 4: Integrate into Problem via std::variant |
| 109 | +
|
| 110 | +**Adversarial review finding**: Do NOT extend `DenseDistanceMatrix` with a raw pointer — vector reallocation invalidates it (P0 bug). Use a **separate class** behind `std::variant`. |
| 111 | +
|
| 112 | +**File**: `dtwc/Problem.hpp` |
| 113 | +
|
| 114 | +```cpp |
| 115 | +using DistMat = std::variant<core::DenseDistanceMatrix, core::MmapDistanceMatrix>; |
| 116 | +DistMat distMat_; |
| 117 | +
|
| 118 | +// Helper to dispatch get/set through the variant |
| 119 | +double distMat_get(size_t i, size_t j) const { |
| 120 | + return std::visit([&](auto& m) { return m.get(i, j); }, distMat_); |
| 121 | +} |
| 122 | +void distMat_set(size_t i, size_t j, double v) { |
| 123 | + std::visit([&](auto& m) { m.set(i, j, v); }, distMat_); |
| 124 | +} |
| 125 | +``` |
| 126 | + |
| 127 | +**File**: `dtwc/Problem.cpp` |
| 128 | + |
| 129 | +In `fillDistanceMatrix()` and lazy-alloc in `distByInd()`: |
| 130 | + |
| 131 | +```cpp |
| 132 | +constexpr size_t MMAP_THRESHOLD = 50000; |
| 133 | +if (N > MMAP_THRESHOLD) { |
| 134 | + auto cache = output_folder / (name + "_distmat.cache"); |
| 135 | + distMat_ = core::MmapDistanceMatrix(cache, N); // create or warmstart |
| 136 | +} else { |
| 137 | + distMat_ = core::DenseDistanceMatrix(N); |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +Add periodic sync during fill (every 10000 pairs for mmap path). |
| 142 | + |
| 143 | +**Test**: Existing 63 tests pass (N < 50K → in-memory path). Add test forcing mmap with small N. |
| 144 | + |
| 145 | +### Step 5: Clustering Checkpoint + Restart (resolves GitHub issue #22) |
| 146 | + |
| 147 | +Save clustering state alongside the distance matrix so both distance computation AND clustering can resume after crash/exit. |
| 148 | + |
| 149 | +**What gets saved** (alongside the mmap distance matrix cache file): |
| 150 | + |
| 151 | +- `name_checkpoint.bin` — small binary file: |
| 152 | + - Current medoid indices (`vector<int>`, k values) |
| 153 | + - Current labels (`vector<int>`, N values) |
| 154 | + - Best cost so far |
| 155 | + - Current iteration number |
| 156 | + - Algorithm identifier (pam/clara/lloyd) |
| 157 | + |
| 158 | +**Warmstart flow** (CLI `--restart` flag): |
| 159 | + |
| 160 | +1. Open existing distance matrix cache → mmap → NaN entries are uncomputed |
| 161 | +2. Load medoid checkpoint if present |
| 162 | +3. Resume: fill remaining NaN distance entries, then continue clustering from saved medoids |
| 163 | +4. For CLARA: save best-so-far result after each subsample iteration |
| 164 | + |
| 165 | +**Integration**: |
| 166 | + |
| 167 | +- `Problem::save_checkpoint()` — writes medoids + labels + cost |
| 168 | +- `Problem::load_checkpoint()` — reads and restores state |
| 169 | +- FastPAM: checkpoint after each SWAP iteration |
| 170 | +- CLARA: checkpoint after each subsample evaluation |
| 171 | +- Lloyd: checkpoint after each iteration |
| 172 | + |
| 173 | +**Files**: `dtwc/checkpoint.hpp`, `dtwc/checkpoint.cpp`, `dtwc/dtwc_cl.cpp` (add `--restart` flag) |
| 174 | + |
| 175 | +**This resolves [Battery-Intelligence-Lab/dtw-cpp#22](https://github.com/Battery-Intelligence-Lab/dtw-cpp/issues/22)** — the request for a restart option after crash/exit on HPC systems. |
| 176 | + |
| 177 | +### Step 6: Streaming CLARA (deferred — after Steps 1-5 proven) |
| 178 | + |
| 179 | +For N > 50K, avoid loading all series into RAM. This is the most complex step. |
| 180 | + |
| 181 | +**File**: `dtwc/DataLoader.hpp` — Add `load_subset(vector<int> indices)` |
| 182 | +**File**: `dtwc/fileOperations.hpp` — Add `load_folder_subset`, `load_batch_file_subset` |
| 183 | +**File**: `dtwc/algorithms/fast_clara.cpp` — Add `fast_clara_streaming(DataLoader&, N, opts)` |
| 184 | +**File**: `dtwc/dtwc_cl.cpp` — Branch: if clara and N > 50K, use streaming path |
| 185 | + |
| 186 | +The streaming path: |
| 187 | + |
| 188 | +1. `dl.count()` → get N |
| 189 | +2. For each subsample: `dl.load_subset(sample_indices)` → load only s series |
| 190 | +3. Build sub-Problem, run FastPAM → get medoid indices |
| 191 | +4. Assignment: iterate all N series one-by-one, compute distance to k medoids, assign label, discard series |
| 192 | +5. Memory: O(s² + k * L) instead of O(N * L + N²) |
| 193 | + |
| 194 | +**Defer this step** until Steps 1-4 are proven in practice. |
| 195 | + |
| 196 | +## Implementation Order |
| 197 | + |
| 198 | +``` |
| 199 | +Step 1 (count) ─── independent ──┐ |
| 200 | +Step 3 (mmap) ─── independent ──┤ |
| 201 | + ├── Step 2 (auto-select, needs Step 1) |
| 202 | + ├── Step 4 (integrate mmap, needs Step 3) |
| 203 | + ├── Step 5 (checkpoint + restart, needs Step 3+4, resolves #22) |
| 204 | + └── Step 6 (streaming CLARA, needs 1+2+3+4+5) |
| 205 | +``` |
| 206 | + |
| 207 | +**Steps 1 and 3 can be implemented in parallel** (different files, no overlap). |
| 208 | + |
| 209 | +## Files to Modify |
| 210 | + |
| 211 | +| File | Steps | Change | |
| 212 | +|------|-------|--------| |
| 213 | +| `dtwc/DataLoader.hpp` | 1, 6 | Add `count()`, later `load_subset()` | |
| 214 | +| `dtwc/dtwc_cl.cpp` | 2, 5, 6 | Add "auto" method, `--restart` flag, streaming | |
| 215 | +| `dtwc/core/mmap_distance_matrix.hpp` (NEW) | 3 | MmapDistanceMatrix using llfio | |
| 216 | +| `cmake/Dependencies.cmake` | 3 | Add llfio via CPM | |
| 217 | +| `dtwc/Problem.hpp` | 4 | `std::variant` distMat, checkpoint methods | |
| 218 | +| `dtwc/Problem.cpp` | 4, 5 | Auto-select mmap, save/load checkpoint | |
| 219 | +| `dtwc/checkpoint.{hpp,cpp}` | 5 | Extend with medoid/label/cost checkpoint | |
| 220 | +| `dtwc/algorithms/fast_pam.cpp` | 5 | Checkpoint after each SWAP iteration | |
| 221 | +| `dtwc/algorithms/fast_clara.{hpp,cpp}` | 5, 6 | Checkpoint per subsample, streaming variant | |
| 222 | +| `dtwc/fileOperations.hpp` | 6 | Subset loading functions | |
| 223 | + |
| 224 | +## Verification |
| 225 | + |
| 226 | +- All 63 existing unit tests must pass after each step |
| 227 | +- Stress test (`tests/integration/stress_test_cli.sh`) must pass |
| 228 | +- Step 2: `dtwc_cl --method auto -k 3 -i data/dummy` should print "Auto-selected method: pam (N=18)" |
| 229 | +- Step 3: Unit test creates 1000x1000 mmap matrix, writes/reads, reopens from file |
| 230 | +- Step 4: `dtwc_cl -k 3 -i large_dataset/` with N > 50K should create `.cache` file |
| 231 | + |
| 232 | +## Open Decisions (from adversarial reviews) |
| 233 | + |
| 234 | +### llfio vs mio vs custom wrapper |
| 235 | +- **llfio**: has file locking + pre-allocation built in, active. But heavy CMake, slow configure (~30s), complex internals. |
| 236 | +- **mio**: clean 3-line API, header-only, trivial CPM. But dormant, no file locking, no pre-allocation. |
| 237 | +- **custom ~80-line wrapper**: zero deps, full control. But maintenance burden, platform-specific bugs. |
| 238 | +- **Decision needed at implementation time**: try llfio first. If CMake integration is painful, fall back to mio + manual `flock`/`LockFileEx` (~20 extra lines). |
| 239 | + |
| 240 | +### std::visit overhead on hot path |
| 241 | +- `distByInd` is called millions of times. `std::visit` on variant adds indirect call overhead per access. |
| 242 | +- **Fix**: Resolve variant once at algorithm entry (e.g., in `fillDistanceMatrix`, `fast_pam`, `cluster_by_kMedoidsLloyd`). Pass concrete `DenseDistanceMatrix&` or `MmapDistanceMatrix&` via template to inner loops. The variant lives in Problem but hot paths don't go through it. |
| 243 | + |
| 244 | +### CLARA + mmap interaction |
| 245 | +- Auto-selecting CLARA for N>50K but also allocating full mmap matrix is contradictory — CLARA doesn't need the full N²/2 matrix. |
| 246 | +- **Fix**: mmap distance matrix ONLY when method explicitly requires full matrix (pam, kmedoids, hierarchical, mip). CLARA skips it entirely (as it already does). The auto-select logic should NOT trigger mmap when it picks CLARA. |
| 247 | + |
| 248 | +### Checkpoint format unification |
| 249 | +- Existing CSV checkpoint (`checkpoint.hpp`) vs new binary checkpoint coexist. |
| 250 | +- **Fix**: Keep CSV checkpoint for small N / human-readable use. Binary checkpoint is a superset (mmap cache + medoid state). `--restart` uses binary. Old `--checkpoint` flag uses CSV. Document both. |
| 251 | + |
| 252 | +### Binary header alignment |
| 253 | +- Pad header to **32 bytes** (not 24). Doubles start at byte 32 = 8-byte aligned. Safe on ARM. |
| 254 | + |
| 255 | +### Stale cache detection |
| 256 | +- Store hash of input filenames + sizes + modification times in the binary header. |
| 257 | +- On `--restart`, compare. Warn if stale, don't silently use wrong cache. |
| 258 | + |
| 259 | +### CRC32 scope |
| 260 | +- CRC covers header only (20 bytes), not data. Data integrity is OS responsibility (page cache + fsync). Computing CRC over multi-GB mmap is expensive and pointless for a cache file. |
| 261 | + |
| 262 | +### CLARA sample size formula |
| 263 | +- `clara_sample_size = max(40 + 2*k, min(N, sqrt(N) * k))` for N > 50K |
| 264 | +- Number of CLARA iterations: default 5 (existing `CLARAOptions::n_samples = 5`) |
| 265 | + |
| 266 | +## What NOT to Do |
| 267 | + |
| 268 | +- Don't add HDF5 C++ dependency yet — Python has it, C++ can wait |
| 269 | +- Don't abstract DataSource interface yet — YAGNI until streaming CLARA is proven |
| 270 | +- Don't change the Method enum — FastPAM/FastCLARA work fine as standalone functions |
| 271 | +- Don't CRC the entire data region — OS handles data integrity for mmap'd files |
| 272 | +- Don't allocate full mmap matrix when auto-selecting CLARA — CLARA doesn't need it |
0 commit comments