Skip to content

feat(l2g): add hyperparameter grid search with file-based CV output#1222

Draft
addramir wants to merge 23 commits into
devfrom
feat/l2g-hyperparameter-grid-cv-output
Draft

feat(l2g): add hyperparameter grid search with file-based CV output#1222
addramir wants to merge 23 commits into
devfrom
feat/l2g-hyperparameter-grid-cv-output

Conversation

@addramir
Copy link
Copy Markdown
Contributor

@addramir addramir commented May 13, 2026

Summary

Bug fix

  • Fix silent bug: `set_params(**config)` was called after `fit()` in `cross_validate_single_fold`, meaning the hyperparameter grid values were never actually applied during training. All CV folds were silently trained with the base hyperparameters regardless of the sweep config.

New features

Hyperparameter grid search (`hyperparameter_grid`)
W&B-style grid (`{"param": {"values": [v1, v2, ...]}}`) now accepted by `LocusToGeneStep`, `LocusToGeneConfig`, and `LocusToGeneTrainer`. When provided without W&B, the grid is iterated explicitly using `itertools.product`. When `train_on_full_dataset=True` is also set, it is automatically ignored (with a warning) because no single winning config has been chosen yet.

File-based CV output (`cv_results_dir`)
When set (and `wandb_run_name` is not set), writes the following after all folds complete:

  • `cv_results.json` — timestamp, hyperparameters, per-fold metrics, mean ± std per config
  • `cv_folds.csv` — flat tabular view (one row per fold × config)
  • `config_N/roc.png` — per-fold ROC curves with mean AUC in title
  • `config_N/pr.png` — per-fold Precision-Recall curves with mean AP in title
  • `config_N/confusion_matrix.png` — aggregated across all folds at threshold 0.5

Each config's raw arrays are released from driver memory as soon as its plots are written, so memory usage is O(n_samples × n_folds × 1 config) rather than O(n_samples × n_folds × n_configs).

GCS support: `cv_results_dir` accepts both local paths and `gs://` URIs; GCS output is written to a local temp dir then uploaded via `google-cloud-storage` (already a dependency).

Dark matter loci filter (`filter_dark_matter`)
Removes loci from the training set where every gold-standard positive has no molecular evidence linking it to the locus. A positive is considered "dark matter" when:

  • All QTL colocalisation features and E2G are zero (or NULL), and VEP score is below the protein-altering threshold (0.6)
  • The gene is not the nearest to the sentinel (all neighbourhood distance features < 1.0)

A locus is dropped only when ALL its positives qualify; loci with at least one signal-carrying or nearest-gene positive are kept. The entire locus (positives and negatives) is dropped to preserve class balance within retained loci. Filtering stats are written as `{model_path}_dark_matter_stats.json`.

Defaults to `False` — existing training runs are unaffected.

Refactor: `cross_validate_single_fold` extracted as `_run_cv_fold` private method; `_save_cv_results` split into `_summarise_and_plot_config` (per-config, releases raw arrays) and `_write_cv_files` (final JSON + CSV).

Behaviour matrix

`wandb_run_name` `cv_results_dir` `hyperparameter_grid` Effect
set any W&B grid sweep (existing behaviour, unaffected)
unset set multi-value Explicit grid iteration → files written
unset set None / single-value Single config × 5 folds → files written
unset unset any Terminal logging only (existing behaviour)

Test plan

  • Run training locally with `cv_results_dir=/tmp/cv_test` and a small `hyperparameter_grid` (2–3 values per param), verify directory structure and file contents
  • Verify `cv_results.json` contains correct mean/std across folds
  • Verify plots are generated for each config subdirectory
  • Run with `cv_results_dir=gs://...` on Dataproc, verify files appear in GCS
  • Run with no `cv_results_dir` to confirm existing terminal-logging path is unaffected
  • Run with `wandb_run_name` set to confirm existing W&B sweep path is unaffected
  • Enable `filter_dark_matter=True` and verify stats JSON is written next to model path

🤖 Generated with Claude Code

- Fix bug in cross_validate where set_params was called after fit, meaning
  hyperparameter grid values were never actually used during training
- Add hyperparameter_grid and cv_results_dir params to LocusToGeneStep,
  LocusToGeneConfig, and LocusToGeneTrainer.train/cross_validate
- When cv_results_dir is set (without W&B), iterate the full grid explicitly
  and write per-config results: cv_results.json, cv_folds.csv, and per-config
  plots (ROC, Precision-Recall, confusion matrix at threshold 0.5)
- Support both local and GCS (gs://) output paths; GCS writes via a local
  temp dir uploaded with google-cloud-storage
- Extract fold-level training logic into _run_cv_fold to reduce cross_validate
  complexity

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
addramir and others added 8 commits May 20, 2026 10:52
Introduces a `filter_dark_matter` option to `LocusToGeneStep` that removes
uninformative loci from the training set before model fitting.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix filter_dark_matter_loci docstring: "at least one" → "every positive"
- Return early (no-op + warning) when no neighbourhood distance features
  are present, avoiding incorrect removal of loci with undetermined nearness
- Consolidate 9 Spark count() actions into 3 aggregations to reduce job overhead
- Fix filter_dark_matter parameter docstring in LocusToGeneStep to match
  the ALL-positives-must-be-dark-matter removal rule
- Replace transitive fsspec dependency with google-cloud-storage (direct dep)
  for GCS writes; fall back to open() for local paths
- Add 5 unit tests covering: all-dark-matter locus removed, mixed locus kept,
  nearest-gene positive protected, no-nearest-features no-op, no-GS raises

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Persist dark_matter_positives_per_locus to avoid Spark recomputing it
  when collecting dm_row stats (comment #8)
- Replace google.cloud.storage GCS special-case with fsspec.open for
  unified local/GCS write in _write_dark_matter_stats (comment #7)
- Add explicit warning in run_train when filter_dark_matter=True but
  filter is a no-op due to missing features in features_list (comment #6)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
fsspec is only a transitive dep and deptry rejects undeclared imports.
Revert _write_dark_matter_stats to use google.cloud.storage (already a
direct dependency) for GCS writes and plain open() for local paths.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Coalesce NULL signal features to 0.0 before comparison so that
  positives with missing coloc/VEP evidence (from inner joins) are
  correctly identified as dark matter instead of being silently excluded
- Persist self._df after left_anti join to avoid recomputing the filtered
  matrix for after-stats and downstream training
- Create parent directory before writing stats JSON to local path to
  prevent FileNotFoundError when model_path parent doesn't exist yet
- Guard f.sum(f.when(...)) results against NULL on empty DataFrames
- Add regression test for NULL signal features treated as no signal

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
vepMaximum can be NULL when VEP annotation is absent. The coalesce fix
treats NULL as 0.0 (below threshold), so the locus is correctly dark matter.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Resolves conflicts in config.py and l2g.py by combining both sets of new
parameters: hyperparameter_grid/cv_results_dir from this branch and
hf_credentials_path/wandb_credentials_path from dev.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Combines dark matter loci filter (filter_dark_matter param) from PR #1224
with hyperparameter grid search and CV output from this branch.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@addramir addramir marked this pull request as ready for review May 20, 2026 12:55
Copilot AI review requested due to automatic review settings May 20, 2026 12:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the L2G training pipeline to support non-W&B hyperparameter grid search with persisted cross-validation artifacts, and adds an optional training-time “dark matter loci” filter for gold-standard feature matrices.

Changes:

  • Fix CV hyperparameters not being applied during fold training by setting params before fit(), and refactor per-fold logic into a private helper.
  • Add explicit (non-W&B) hyperparameter grid iteration with optional file/GCS output of CV metrics, CSV summaries, and diagnostic plots.
  • Add filter_dark_matter option + implementation to drop loci where all positives lack functional signal and are not nearest-gene, plus stats output and unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/gentropy/dataset/test_l2g_feature_matrix.py Adds unit tests covering dark matter locus filtering behavior and edge cases.
src/gentropy/method/l2g/trainer.py Fixes CV param application timing; adds explicit grid expansion + file/GCS CV outputs and plotting helpers.
src/gentropy/l2g.py Wires new training options (hyperparameter_grid, cv_results_dir, filter_dark_matter) into the step and writes dark-matter stats.
src/gentropy/dataset/l2g_feature_matrix.py Implements filter_dark_matter_loci() with stats and logging.
src/gentropy/config.py Exposes new configuration parameters for the step.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

f.sum("dark_matter_count").alias("dm_positives"),
).collect()[0]

self._df = self._df.join(dark_matter_loci, "studyLocusId", "left_anti").persist()
Comment thread src/gentropy/method/l2g/trainer.py Outdated
"config_id": config_id,
"hyperparameters": data["config"],
"fold_metrics": [
{"fold": f["fold"], **f["metrics"]} for f in folds
Comment on lines +551 to +560
if collect_for_file:
fold_results.append(
{
"config": dict(config) if config else {},
"fold": fold_index,
"metrics": metrics,
"y_true": y_fold_val,
"y_pred_proba": y_pred_proba[:, 1],
}
)
Comment thread src/gentropy/method/l2g/trainer.py Outdated
Comment on lines +468 to +479
elif cv_results_dir:
for config in self._expand_grid(parameter_grid):
run_all_folds(run_config=config)
else:
# Evaluate with cross validation to the terminal
run_all_folds()

if collect_for_file and fold_results:
self._save_cv_results(
cv_results_dir=cv_results_dir, # type: ignore[arg-type]
fold_results=fold_results,
n_splits=n_splits,
)
Comment thread src/gentropy/l2g.py
Comment on lines +477 to +481
if self.filter_dark_matter:
feature_matrix, dark_matter_stats = feature_matrix.filter_dark_matter_loci()
if dark_matter_stats:
if self.model_path:
self._write_dark_matter_stats(dark_matter_stats)
addramir and others added 8 commits May 20, 2026 14:09
- Warn instead of silently dropping dark matter stats when model_path is None
- Raise ValueError with offending keys when hyperparameter_grid entry lacks 'values'
- Replace type: ignore with assert in cross_validate
- Move filter_dark_matter next to other training flags in LocusToGeneConfig
- Ignore train_on_full_dataset when grid search produces more than one config

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Unpersist old self._df in filter_dark_matter_loci to avoid executor memory leak
- Fix NumPy scalar JSON serialisation in fold_metrics (float() cast)
- Process each config's results incrementally to bound driver memory to
  O(n_samples x n_folds x 1 config) instead of accumulating all configs
- Replace _save_cv_results with _summarise_and_plot_config + _write_cv_files
- Add tests for _expand_grid (single-param, cartesian product, missing key)
- Add test for cv_results_dir file output asserting JSON/CSV structure and
  that all metric values are plain Python floats
- Update PR description to document filter_dark_matter feature

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Allows setting hf_hub_repo_id=null (e.g. during grid search runs) without
Hydra/OmegaConf rejecting the override. Upload logic already guards with
`if self.hf_hub_repo_id` so None naturally skips it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…gaConf merge bug

OmegaConf cannot merge a dict value into an Optional[Dict] field that is
currently None — it calls .get() on the None node and raises AttributeError.
Typing as Any bypasses the typed merge and allows the override to work.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Adds confusion matrix counts to evaluate() so cv_folds.csv includes
per-fold false positive counts alongside the existing aggregate metrics.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Cast count metrics as int (not float) in fold_metrics JSON/CSV
- Update test_evaluate_perfect_predictions to expect TP/FP/TN/FN keys

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Adds n_loci, n_loci_one_gene_above, and n_loci_no_gene_above to the
per-fold metrics dict so the CSV output captures how well each config
resolves study loci to a single confident gene (L2G >= 0.5).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size-XL and removed size-L labels May 22, 2026
addramir and others added 5 commits May 22, 2026 13:00
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
XGBClassifier's sklearn wrapper rejects float labels. When soft labels
are present, train via raw xgboost DMatrix API and inject the booster
back into a cloned XGBClassifier to preserve predict_proba.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
XGBClassifier() stores n_estimators=None when not explicitly set;
default to 100 (XGBoost's own default) when None is returned.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n matrix

- Remove classes_ and n_features_in_ assignments in _fit_with_soft_labels:
  both are read-only properties derived from the booster in this XGBoost version
- Pass labels=[0, 1] to confusion_matrix to guarantee a 2x2 matrix when
  all predictions fall into a single class (e.g. tiny test sets)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@addramir addramir marked this pull request as draft May 22, 2026 13:43
wandb.plot_classifier passes y_true directly to sklearn metrics which
reject a mix of continuous (soft labels) and binary (y_pred) targets.
Binarize at 0.5 — same threshold used in evaluate().

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants