Skip to content

Commit c4fb30c

Browse files
ElektrikAkarclaude
andcommitted
Add lessons learned, mmap/auto-select implementation plan, session handoff
- LESSONS.md: NaN sentinel rationale, tri_index benchmark result (slower), llfio vs mio evaluation for memory-mapped I/O - Plan: 6-step incremental plan for auto-algorithm selection + mmap distance matrix + checkpoint/restart (resolves #22). Reviewed by 4 adversarial agents. - Handoff: detailed session state for continuity - .gitignore: add .cache/ directory Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4255c16 commit c4fb30c

File tree

4 files changed

+414
-0
lines changed

4 files changed

+414
-0
lines changed

.claude/LESSONS.md

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,17 @@ Critical knowledge to avoid repeating mistakes.
2727
- **Lesson**: Any batch SIMD kernel loading from multiple pointers must pre-transpose to SoA layout;
2828
scatter/gather in a recurrence inner loop kills perf.
2929

30+
### tri_index precomputed row offsets: no benefit (benchmarked 2026-04-08)
31+
- Replaced `i*(i+1)/2 + j` with `row_offsets_[i] + j` lookup table in DenseDistanceMatrix.
32+
- Also tried `schedule(guided)` instead of `schedule(dynamic,1)` in fillDistanceMatrix.
33+
- **Result**: random_get 500K lookups: 2.61ms baseline → 2.75ms optimized (+5% slower).
34+
Fill benchmarks: within noise (±10%), no consistent improvement.
35+
- **Why**: Modern CPUs compute `i*(i+1)/2` in 1-2 cycles (imul+shr). Table lookup adds
36+
an L1 cache load (~4 cycles) plus the table pollutes L1. The conditional swap `if(i<j)`
37+
already compiles to branchless cmov. The multiplication is faster than the indirection.
38+
- **Lesson**: Don't replace simple integer arithmetic with lookup tables on modern CPUs.
39+
imul is cheap; L1 loads are not free. Profile before "optimizing".
40+
3041
### std::min with initializer_list is catastrophically slow
3142
- `std::min({a,b,c})` creates a temporary on every call. 2.5-3x speedup from `std::min(a, std::min(b,c))`.
3243

@@ -41,6 +52,23 @@ Critical knowledge to avoid repeating mistakes.
4152
- WDTW weights depend on (i,j) position. ADTW penalizes non-diagonal steps.
4253
- DDTW is a preprocessing step. Soft-DTW replaces min with softmin.
4354

55+
### NaN is the ONLY safe sentinel for distance matrix uncomputed entries
56+
- Soft-DTW returns negative distances (softmin with gamma). Any fixed negative sentinel (-1, -10000) can collide.
57+
- Previous session used -1 sentinel → broke Soft-DTW caching. Changed to NaN (2026-04-07 bugfix).
58+
- In binary files: NaN is just 8 bytes (IEEE 754 bit pattern), trivial to write via pointer assignment.
59+
- In CSV: write empty field (`,,`), load as NaN. Already implemented in `matrix_io.hpp`.
60+
- Don't use NaN payload bits for status encoding — compilers may canonicalize, SIMD strips them. Use separate status array if needed.
61+
62+
### Don't replace simple integer arithmetic with lookup tables (benchmarked 2026-04-08)
63+
- `tri_index: i*(i+1)/2 + j` is 1-2 cycles (imul). Precomputed row-offset table adds L1 load (~4 cycles) + cache pollution.
64+
- Benchmarked: table version was 5% SLOWER on 500K random lookups. Modern CPUs do integer multiply faster than memory indirection.
65+
66+
### llfio > mio for memory-mapped I/O (evaluated 2026-04-08)
67+
- mio (MIT, header-only): clean 3-line API but dormant since 2020, no file locking, no pre-allocation.
68+
- llfio (Apache 2.0, header-only option): active (Jan 2025), has file locking, `posix_fallocate`, sparse files, production-proven (SEC MIDAS).
69+
- llfio solves all P0/P1 issues from adversarial review (pre-allocation, file locking) without custom platform code.
70+
- Both have zero external dependencies. llfio requires C++17 (which we already require).
71+
4472
## C++ Implementation
4573

4674
### NaN for missing data
Lines changed: 272 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,272 @@
1+
# Plan: Auto-Algorithm Selection + Memory-Mapped Distance Matrix
2+
3+
## Context
4+
5+
Target: 100K+ time series. At N=100K the packed triangular distance matrix is N*(N+1)/2 doubles = ~40GB — doesn't fit in RAM. CLARA already avoids computing the full matrix by subsampling, but currently all series data must be loaded into RAM and there's no way to know N before loading. The user must also manually pick `--method`.
6+
7+
**Goal**: Count series before loading, auto-select algorithm, and support mmap-backed distance matrix for when it IS needed (e.g., FastPAM on moderate N, or checkpoint/resume).
8+
9+
## 5 Incremental Steps (each independently testable)
10+
11+
### Step 1: Count-Before-Load
12+
13+
Add `DataLoader::count()` — returns number of series without loading data.
14+
15+
**File**: `dtwc/DataLoader.hpp`
16+
17+
- Directory mode: count entries via `fs::directory_iterator` (no file reads)
18+
- Batch file mode: count lines (skip `start_row` header lines), respects `Ndata` limit
19+
- ~20 lines of code
20+
21+
**Test**: Unit test with temp folder/file, assert count matches.
22+
23+
### Step 2: Auto Method Selection
24+
25+
Add `"auto"` to `--method` (make it the default). After loading data, pick algorithm based on `prob.size()`.
26+
27+
**File**: `dtwc/dtwc_cl.cpp`
28+
29+
1. Change default: `std::string method = "auto";`
30+
2. Add `{"auto", "auto"}` to the CheckedTransformer map (line 118)
31+
3. After `Problem prob{prob_name, dl};` (line 346), insert:
32+
33+
```cpp
34+
if (method == "auto") {
35+
const size_t N = prob.size();
36+
method = (N <= 5000) ? "pam" : "clara";
37+
if (verbose)
38+
std::cout << "Auto-selected method: " << method << " (N=" << N << ")\n";
39+
}
40+
```
41+
42+
4. For large N CLARA, scale sample size with sqrt(N):
43+
44+
```cpp
45+
if (clara_opts.sample_size < 0 && prob.size() > 50000)
46+
clara_opts.sample_size = static_cast<int>(std::sqrt(prob.size()) * n_clusters);
47+
```
48+
49+
**Test**: Run CLI with `--method auto` on Coffee dataset (28 series → should pick pam).
50+
51+
### Step 3: Memory-Mapped Distance Matrix
52+
53+
Cross-platform mmap wrapper + `MmapDistanceMatrix` class with same API as `DenseDistanceMatrix`.
54+
55+
**New file**: `dtwc/core/mmap_distance_matrix.hpp`
56+
57+
**Dependency**: [llfio](https://github.com/mandreyel/llfio) (MIT, header-only, C++11, no deps).
58+
Add via CPM in `cmake/Dependencies.cmake`. Fits the project pattern (header-only deps like rapidcsv, Eigen).
59+
60+
**Binary file layout** (internal, not user-facing):
61+
62+
```
63+
bytes 0-7: uint64_t magic ("DTWMAT01")
64+
bytes 8-15: uint64_t n
65+
bytes 16...: double[n*(n+1)/2] (NaN = uncomputed)
66+
```
67+
68+
**Implementation** using llfio (handles mmap, file locking, pre-allocation cross-platform):
69+
70+
```cpp
71+
class MmapDistanceMatrix {
72+
llfio::mapped_file_handle mfh_; // RAII: mmap + file handle
73+
double* data_ = nullptr; // points past header
74+
size_t n_ = 0;
75+
76+
// Same API as DenseDistanceMatrix
77+
double get(size_t i, size_t j) const;
78+
void set(size_t i, size_t j, double v);
79+
bool is_computed(size_t i, size_t j) const;
80+
size_t size() const { return n_; }
81+
82+
// Periodic flush (mmap pages auto-flush, but explicit sync for safety)
83+
void sync();
84+
};
85+
```
86+
87+
**Binary header** (24 bytes, from adversarial review):
88+
89+
| Offset | Size | Field | Purpose |
90+
|--------|------|-------|---------|
91+
| 0 | 4 | magic `"DTWM"` | Format identification |
92+
| 4 | 2 | version (1) | Future format evolution |
93+
| 6 | 4 | endian marker `0x01020304` | Detect byte-order mismatch |
94+
| 10 | 1 | elem_size (8) | float vs double detection |
95+
| 11 | 1 | reserved | Padding |
96+
| 12 | 8 | N (uint64) | Number of series |
97+
| 20 | 4 | data CRC32 | Corruption detection |
98+
99+
**Key features**:
100+
- **Warmstart**: Open existing file → mmap → NaN entries are uncomputed, resume from there
101+
- **Periodic save**: mmap pages auto-flush to disk by OS. Add explicit `sync()` every N pairs for crash safety
102+
- **File locking**: llfio provides byte-range locking — prevent two processes corrupting same cache
103+
- **Pre-allocation**: llfio uses `posix_fallocate`/`SetEndOfFile` — no sparse file SIGBUS risk
104+
- **Header written last**: On create, write data (NaN-filled) first, header last. Crash during init → no valid header → detected on reopen
105+
106+
**Test**: Create, write, close, reopen (warmstart), verify values persist. Test NaN sentinel. Test concurrent open (lock check). Test N=0, 1, 1000.
107+
108+
### Step 4: Integrate into Problem via std::variant
109+
110+
**Adversarial review finding**: Do NOT extend `DenseDistanceMatrix` with a raw pointer — vector reallocation invalidates it (P0 bug). Use a **separate class** behind `std::variant`.
111+
112+
**File**: `dtwc/Problem.hpp`
113+
114+
```cpp
115+
using DistMat = std::variant<core::DenseDistanceMatrix, core::MmapDistanceMatrix>;
116+
DistMat distMat_;
117+
118+
// Helper to dispatch get/set through the variant
119+
double distMat_get(size_t i, size_t j) const {
120+
return std::visit([&](auto& m) { return m.get(i, j); }, distMat_);
121+
}
122+
void distMat_set(size_t i, size_t j, double v) {
123+
std::visit([&](auto& m) { m.set(i, j, v); }, distMat_);
124+
}
125+
```
126+
127+
**File**: `dtwc/Problem.cpp`
128+
129+
In `fillDistanceMatrix()` and lazy-alloc in `distByInd()`:
130+
131+
```cpp
132+
constexpr size_t MMAP_THRESHOLD = 50000;
133+
if (N > MMAP_THRESHOLD) {
134+
auto cache = output_folder / (name + "_distmat.cache");
135+
distMat_ = core::MmapDistanceMatrix(cache, N); // create or warmstart
136+
} else {
137+
distMat_ = core::DenseDistanceMatrix(N);
138+
}
139+
```
140+
141+
Add periodic sync during fill (every 10000 pairs for mmap path).
142+
143+
**Test**: Existing 63 tests pass (N < 50K → in-memory path). Add test forcing mmap with small N.
144+
145+
### Step 5: Clustering Checkpoint + Restart (resolves GitHub issue #22)
146+
147+
Save clustering state alongside the distance matrix so both distance computation AND clustering can resume after crash/exit.
148+
149+
**What gets saved** (alongside the mmap distance matrix cache file):
150+
151+
- `name_checkpoint.bin` — small binary file:
152+
- Current medoid indices (`vector<int>`, k values)
153+
- Current labels (`vector<int>`, N values)
154+
- Best cost so far
155+
- Current iteration number
156+
- Algorithm identifier (pam/clara/lloyd)
157+
158+
**Warmstart flow** (CLI `--restart` flag):
159+
160+
1. Open existing distance matrix cache → mmap → NaN entries are uncomputed
161+
2. Load medoid checkpoint if present
162+
3. Resume: fill remaining NaN distance entries, then continue clustering from saved medoids
163+
4. For CLARA: save best-so-far result after each subsample iteration
164+
165+
**Integration**:
166+
167+
- `Problem::save_checkpoint()` — writes medoids + labels + cost
168+
- `Problem::load_checkpoint()` — reads and restores state
169+
- FastPAM: checkpoint after each SWAP iteration
170+
- CLARA: checkpoint after each subsample evaluation
171+
- Lloyd: checkpoint after each iteration
172+
173+
**Files**: `dtwc/checkpoint.hpp`, `dtwc/checkpoint.cpp`, `dtwc/dtwc_cl.cpp` (add `--restart` flag)
174+
175+
**This resolves [Battery-Intelligence-Lab/dtw-cpp#22](https://github.com/Battery-Intelligence-Lab/dtw-cpp/issues/22)** — the request for a restart option after crash/exit on HPC systems.
176+
177+
### Step 6: Streaming CLARA (deferred — after Steps 1-5 proven)
178+
179+
For N > 50K, avoid loading all series into RAM. This is the most complex step.
180+
181+
**File**: `dtwc/DataLoader.hpp` — Add `load_subset(vector<int> indices)`
182+
**File**: `dtwc/fileOperations.hpp` — Add `load_folder_subset`, `load_batch_file_subset`
183+
**File**: `dtwc/algorithms/fast_clara.cpp` — Add `fast_clara_streaming(DataLoader&, N, opts)`
184+
**File**: `dtwc/dtwc_cl.cpp` — Branch: if clara and N > 50K, use streaming path
185+
186+
The streaming path:
187+
188+
1. `dl.count()` → get N
189+
2. For each subsample: `dl.load_subset(sample_indices)` → load only s series
190+
3. Build sub-Problem, run FastPAM → get medoid indices
191+
4. Assignment: iterate all N series one-by-one, compute distance to k medoids, assign label, discard series
192+
5. Memory: O(s² + k * L) instead of O(N * L + N²)
193+
194+
**Defer this step** until Steps 1-4 are proven in practice.
195+
196+
## Implementation Order
197+
198+
```
199+
Step 1 (count) ─── independent ──┐
200+
Step 3 (mmap) ─── independent ──┤
201+
├── Step 2 (auto-select, needs Step 1)
202+
├── Step 4 (integrate mmap, needs Step 3)
203+
├── Step 5 (checkpoint + restart, needs Step 3+4, resolves #22)
204+
└── Step 6 (streaming CLARA, needs 1+2+3+4+5)
205+
```
206+
207+
**Steps 1 and 3 can be implemented in parallel** (different files, no overlap).
208+
209+
## Files to Modify
210+
211+
| File | Steps | Change |
212+
|------|-------|--------|
213+
| `dtwc/DataLoader.hpp` | 1, 6 | Add `count()`, later `load_subset()` |
214+
| `dtwc/dtwc_cl.cpp` | 2, 5, 6 | Add "auto" method, `--restart` flag, streaming |
215+
| `dtwc/core/mmap_distance_matrix.hpp` (NEW) | 3 | MmapDistanceMatrix using llfio |
216+
| `cmake/Dependencies.cmake` | 3 | Add llfio via CPM |
217+
| `dtwc/Problem.hpp` | 4 | `std::variant` distMat, checkpoint methods |
218+
| `dtwc/Problem.cpp` | 4, 5 | Auto-select mmap, save/load checkpoint |
219+
| `dtwc/checkpoint.{hpp,cpp}` | 5 | Extend with medoid/label/cost checkpoint |
220+
| `dtwc/algorithms/fast_pam.cpp` | 5 | Checkpoint after each SWAP iteration |
221+
| `dtwc/algorithms/fast_clara.{hpp,cpp}` | 5, 6 | Checkpoint per subsample, streaming variant |
222+
| `dtwc/fileOperations.hpp` | 6 | Subset loading functions |
223+
224+
## Verification
225+
226+
- All 63 existing unit tests must pass after each step
227+
- Stress test (`tests/integration/stress_test_cli.sh`) must pass
228+
- Step 2: `dtwc_cl --method auto -k 3 -i data/dummy` should print "Auto-selected method: pam (N=18)"
229+
- Step 3: Unit test creates 1000x1000 mmap matrix, writes/reads, reopens from file
230+
- Step 4: `dtwc_cl -k 3 -i large_dataset/` with N > 50K should create `.cache` file
231+
232+
## Open Decisions (from adversarial reviews)
233+
234+
### llfio vs mio vs custom wrapper
235+
- **llfio**: has file locking + pre-allocation built in, active. But heavy CMake, slow configure (~30s), complex internals.
236+
- **mio**: clean 3-line API, header-only, trivial CPM. But dormant, no file locking, no pre-allocation.
237+
- **custom ~80-line wrapper**: zero deps, full control. But maintenance burden, platform-specific bugs.
238+
- **Decision needed at implementation time**: try llfio first. If CMake integration is painful, fall back to mio + manual `flock`/`LockFileEx` (~20 extra lines).
239+
240+
### std::visit overhead on hot path
241+
- `distByInd` is called millions of times. `std::visit` on variant adds indirect call overhead per access.
242+
- **Fix**: Resolve variant once at algorithm entry (e.g., in `fillDistanceMatrix`, `fast_pam`, `cluster_by_kMedoidsLloyd`). Pass concrete `DenseDistanceMatrix&` or `MmapDistanceMatrix&` via template to inner loops. The variant lives in Problem but hot paths don't go through it.
243+
244+
### CLARA + mmap interaction
245+
- Auto-selecting CLARA for N>50K but also allocating full mmap matrix is contradictory — CLARA doesn't need the full N²/2 matrix.
246+
- **Fix**: mmap distance matrix ONLY when method explicitly requires full matrix (pam, kmedoids, hierarchical, mip). CLARA skips it entirely (as it already does). The auto-select logic should NOT trigger mmap when it picks CLARA.
247+
248+
### Checkpoint format unification
249+
- Existing CSV checkpoint (`checkpoint.hpp`) vs new binary checkpoint coexist.
250+
- **Fix**: Keep CSV checkpoint for small N / human-readable use. Binary checkpoint is a superset (mmap cache + medoid state). `--restart` uses binary. Old `--checkpoint` flag uses CSV. Document both.
251+
252+
### Binary header alignment
253+
- Pad header to **32 bytes** (not 24). Doubles start at byte 32 = 8-byte aligned. Safe on ARM.
254+
255+
### Stale cache detection
256+
- Store hash of input filenames + sizes + modification times in the binary header.
257+
- On `--restart`, compare. Warn if stale, don't silently use wrong cache.
258+
259+
### CRC32 scope
260+
- CRC covers header only (20 bytes), not data. Data integrity is OS responsibility (page cache + fsync). Computing CRC over multi-GB mmap is expensive and pointless for a cache file.
261+
262+
### CLARA sample size formula
263+
- `clara_sample_size = max(40 + 2*k, min(N, sqrt(N) * k))` for N > 50K
264+
- Number of CLARA iterations: default 5 (existing `CLARAOptions::n_samples = 5`)
265+
266+
## What NOT to Do
267+
268+
- Don't add HDF5 C++ dependency yet — Python has it, C++ can wait
269+
- Don't abstract DataSource interface yet — YAGNI until streaming CLARA is proven
270+
- Don't change the Method enum — FastPAM/FastCLARA work fine as standalone functions
271+
- Don't CRC the entire data region — OS handles data integrity for mmap'd files
272+
- Don't allocate full mmap matrix when auto-selecting CLARA — CLARA doesn't need it

0 commit comments

Comments
 (0)