feat(l2g): add hyperparameter grid search with file-based CV output by addramir · Pull Request #1222 · opentargets/gentropy

addramir · 2026-05-13T11:21:12Z

Summary

Bug fix

Fix silent bug: `set_params(**config)` was called after `fit()` in `cross_validate_single_fold`, meaning the hyperparameter grid values were never actually applied during training. All CV folds were silently trained with the base hyperparameters regardless of the sweep config.

New features

Hyperparameter grid search (`hyperparameter_grid`)
W&B-style grid (`{"param": {"values": [v1, v2, ...]}}`) now accepted by `LocusToGeneStep`, `LocusToGeneConfig`, and `LocusToGeneTrainer`. When provided without W&B, the grid is iterated explicitly using `itertools.product`. When `train_on_full_dataset=True` is also set, it is automatically ignored (with a warning) because no single winning config has been chosen yet.

File-based CV output (`cv_results_dir`)
When set (and `wandb_run_name` is not set), writes the following after all folds complete:

`cv_results.json` — timestamp, hyperparameters, per-fold metrics, mean ± std per config
`cv_folds.csv` — flat tabular view (one row per fold × config)
`config_N/roc.png` — per-fold ROC curves with mean AUC in title
`config_N/pr.png` — per-fold Precision-Recall curves with mean AP in title
`config_N/confusion_matrix.png` — aggregated across all folds at threshold 0.5

Each config's raw arrays are released from driver memory as soon as its plots are written, so memory usage is O(n_samples × n_folds × 1 config) rather than O(n_samples × n_folds × n_configs).

GCS support: `cv_results_dir` accepts both local paths and `gs://` URIs; GCS output is written to a local temp dir then uploaded via `google-cloud-storage` (already a dependency).

Dark matter loci filter (`filter_dark_matter`)
Removes loci from the training set where every gold-standard positive has no molecular evidence linking it to the locus. A positive is considered "dark matter" when:

All QTL colocalisation features and E2G are zero (or NULL), and VEP score is below the protein-altering threshold (0.6)
The gene is not the nearest to the sentinel (all neighbourhood distance features < 1.0)

A locus is dropped only when ALL its positives qualify; loci with at least one signal-carrying or nearest-gene positive are kept. The entire locus (positives and negatives) is dropped to preserve class balance within retained loci. Filtering stats are written as `{model_path}_dark_matter_stats.json`.

Defaults to `False` — existing training runs are unaffected.

Refactor: `cross_validate_single_fold` extracted as `_run_cv_fold` private method; `_save_cv_results` split into `_summarise_and_plot_config` (per-config, releases raw arrays) and `_write_cv_files` (final JSON + CSV).

Behaviour matrix

`wandb_run_name`	`cv_results_dir`	`hyperparameter_grid`	Effect
set	—	any	W&B grid sweep (existing behaviour, unaffected)
unset	set	multi-value	Explicit grid iteration → files written
unset	set	None / single-value	Single config × 5 folds → files written
unset	unset	any	Terminal logging only (existing behaviour)

Test plan

Run training locally with `cv_results_dir=/tmp/cv_test` and a small `hyperparameter_grid` (2–3 values per param), verify directory structure and file contents
Verify `cv_results.json` contains correct mean/std across folds
Verify plots are generated for each config subdirectory
Run with `cv_results_dir=gs://...` on Dataproc, verify files appear in GCS
Run with no `cv_results_dir` to confirm existing terminal-logging path is unaffected
Run with `wandb_run_name` set to confirm existing W&B sweep path is unaffected
Enable `filter_dark_matter=True` and verify stats JSON is written next to model path

🤖 Generated with Claude Code

- Fix bug in cross_validate where set_params was called after fit, meaning hyperparameter grid values were never actually used during training - Add hyperparameter_grid and cv_results_dir params to LocusToGeneStep, LocusToGeneConfig, and LocusToGeneTrainer.train/cross_validate - When cv_results_dir is set (without W&B), iterate the full grid explicitly and write per-config results: cv_results.json, cv_folds.csv, and per-config plots (ROC, Precision-Recall, confusion matrix at threshold 0.5) - Support both local and GCS (gs://) output paths; GCS writes via a local temp dir uploaded with google-cloud-storage - Extract fold-level training logic into _run_cv_fold to reduce cross_validate complexity Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Introduces a `filter_dark_matter` option to `LocusToGeneStep` that removes uninformative loci from the training set before model fitting. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Fix filter_dark_matter_loci docstring: "at least one" → "every positive" - Return early (no-op + warning) when no neighbourhood distance features are present, avoiding incorrect removal of loci with undetermined nearness - Consolidate 9 Spark count() actions into 3 aggregations to reduce job overhead - Fix filter_dark_matter parameter docstring in LocusToGeneStep to match the ALL-positives-must-be-dark-matter removal rule - Replace transitive fsspec dependency with google-cloud-storage (direct dep) for GCS writes; fall back to open() for local paths - Add 5 unit tests covering: all-dark-matter locus removed, mixed locus kept, nearest-gene positive protected, no-nearest-features no-op, no-GS raises Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Persist dark_matter_positives_per_locus to avoid Spark recomputing it when collecting dm_row stats (comment #8) - Replace google.cloud.storage GCS special-case with fsspec.open for unified local/GCS write in _write_dark_matter_stats (comment #7) - Add explicit warning in run_train when filter_dark_matter=True but filter is a no-op due to missing features in features_list (comment #6) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

fsspec is only a transitive dep and deptry rejects undeclared imports. Revert _write_dark_matter_stats to use google.cloud.storage (already a direct dependency) for GCS writes and plain open() for local paths. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Coalesce NULL signal features to 0.0 before comparison so that positives with missing coloc/VEP evidence (from inner joins) are correctly identified as dark matter instead of being silently excluded - Persist self._df after left_anti join to avoid recomputing the filtered matrix for after-stats and downstream training - Create parent directory before writing stats JSON to local path to prevent FileNotFoundError when model_path parent doesn't exist yet - Guard f.sum(f.when(...)) results against NULL on empty DataFrames - Add regression test for NULL signal features treated as no signal Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

vepMaximum can be NULL when VEP annotation is absent. The coalesce fix treats NULL as 0.0 (below threshold), so the locus is correctly dark matter. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Resolves conflicts in config.py and l2g.py by combining both sets of new parameters: hyperparameter_grid/cv_results_dir from this branch and hf_credentials_path/wandb_credentials_path from dev. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Combines dark matter loci filter (filter_dark_matter param) from PR #1224 with hyperparameter grid search and CV output from this branch. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR extends the L2G training pipeline to support non-W&B hyperparameter grid search with persisted cross-validation artifacts, and adds an optional training-time “dark matter loci” filter for gold-standard feature matrices.

Changes:

Fix CV hyperparameters not being applied during fold training by setting params before fit(), and refactor per-fold logic into a private helper.
Add explicit (non-W&B) hyperparameter grid iteration with optional file/GCS output of CV metrics, CSV summaries, and diagnostic plots.
Add filter_dark_matter option + implementation to drop loci where all positives lack functional signal and are not nearest-gene, plus stats output and unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/gentropy/dataset/test_l2g_feature_matrix.py	Adds unit tests covering dark matter locus filtering behavior and edge cases.
src/gentropy/method/l2g/trainer.py	Fixes CV param application timing; adds explicit grid expansion + file/GCS CV outputs and plotting helpers.
src/gentropy/l2g.py	Wires new training options (`hyperparameter_grid`, `cv_results_dir`, `filter_dark_matter`) into the step and writes dark-matter stats.
src/gentropy/dataset/l2g_feature_matrix.py	Implements `filter_dark_matter_loci()` with stats and logging.
src/gentropy/config.py	Exposes new configuration parameters for the step.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            f.sum("dark_matter_count").alias("dm_positives"),
+        ).collect()[0]
+
+        self._df = self._df.join(dark_matter_loci, "studyLocusId", "left_anti").persist()


+                        "config_id": config_id,
+                        "hyperparameters": data["config"],
+                        "fold_metrics": [
+                            {"fold": f["fold"], **f["metrics"]} for f in folds


+        if collect_for_file:
+            fold_results.append(
+                {
+                    "config": dict(config) if config else {},
+                    "fold": fold_index,
+                    "metrics": metrics,
+                    "y_true": y_fold_val,
+                    "y_pred_proba": y_pred_proba[:, 1],
+                }
+            )


+        elif cv_results_dir:
+            for config in self._expand_grid(parameter_grid):
+                run_all_folds(run_config=config)
        else:
-            # Evaluate with cross validation to the terminal
            run_all_folds()

+        if collect_for_file and fold_results:
+            self._save_cv_results(
+                cv_results_dir=cv_results_dir,  # type: ignore[arg-type]
+                fold_results=fold_results,
+                n_splits=n_splits,
+            )


+        if self.filter_dark_matter:
+            feature_matrix, dark_matter_stats = feature_matrix.filter_dark_matter_loci()
+            if dark_matter_stats:
+                if self.model_path:
+                    self._write_dark_matter_stats(dark_matter_stats)


- Warn instead of silently dropping dark matter stats when model_path is None - Raise ValueError with offending keys when hyperparameter_grid entry lacks 'values' - Replace type: ignore with assert in cross_validate - Move filter_dark_matter next to other training flags in LocusToGeneConfig - Ignore train_on_full_dataset when grid search produces more than one config Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Unpersist old self._df in filter_dark_matter_loci to avoid executor memory leak - Fix NumPy scalar JSON serialisation in fold_metrics (float() cast) - Process each config's results incrementally to bound driver memory to O(n_samples x n_folds x 1 config) instead of accumulating all configs - Replace _save_cv_results with _summarise_and_plot_config + _write_cv_files - Add tests for _expand_grid (single-param, cartesian product, missing key) - Add test for cv_results_dir file output asserting JSON/CSV structure and that all metric values are plain Python floats - Update PR description to document filter_dark_matter feature Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Allows setting hf_hub_repo_id=null (e.g. during grid search runs) without Hydra/OmegaConf rejecting the override. Upload logic already guards with `if self.hf_hub_repo_id` so None naturally skips it. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…gaConf merge bug OmegaConf cannot merge a dict value into an Optional[Dict] field that is currently None — it calls .get() on the None node and raises AttributeError. Typing as Any bypasses the typed merge and allows the override to work. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Adds confusion matrix counts to evaluate() so cv_folds.csv includes per-fold false positive counts alongside the existing aggregate metrics. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Cast count metrics as int (not float) in fold_metrics JSON/CSV - Update test_evaluate_perfect_predictions to expect TP/FP/TN/FN keys Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Adds n_loci, n_loci_one_gene_above, and n_loci_no_gene_above to the per-fold metrics dict so the CSV output captures how well each config resolves study loci to a single confident gene (L2G >= 0.5). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

XGBClassifier's sklearn wrapper rejects float labels. When soft labels are present, train via raw xgboost DMatrix API and inject the booster back into a cloned XGBClassifier to preserve predict_proba. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

XGBClassifier() stores n_estimators=None when not explicitly set; default to 100 (XGBoost's own default) when None is returned. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…n matrix - Remove classes_ and n_features_in_ assignments in _fit_with_soft_labels: both are read-only properties derived from the booster in this XGBoost version - Pass labels=[0, 1] to confusion_matrix to guarantee a 2x2 matrix when all predictions fall into a single class (e.g. tiny test sets) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

wandb.plot_classifier passes y_true directly to sklearn metrics which reject a mix of continuous (soft labels) and binary (y_pred) targets. Binarize at 0.5 — same threshold used in evaluate(). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot added size-M Method Step Feature labels May 13, 2026

addramir and others added 8 commits May 20, 2026 10:52

feat(l2g): add dark matter loci filter for training set

8f7ab0e

Introduces a `filter_dark_matter` option to `LocusToGeneStep` that removes uninformative loci from the training set before model fitting. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Merge yt_dark_matter into feat/l2g-hyperparameter-grid-cv-output

27bd445

Combines dark matter loci filter (filter_dark_matter param) from PR #1224 with hyperparameter grid search and CV output from this branch. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot added size-L Dataset and removed size-M labels May 20, 2026

addramir marked this pull request as ready for review May 20, 2026 12:55

Copilot AI review requested due to automatic review settings May 20, 2026 12:55

Copilot started reviewing on behalf of addramir May 20, 2026 12:56 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

addramir and others added 8 commits May 20, 2026 14:09

feat(l2g): add TP/FP/TN/FN counts to CV fold metrics output

f3ec9ae

Adds confusion matrix counts to evaluate() so cv_folds.csv includes per-fold false positive counts alongside the existing aggregate metrics. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

fix(l2g): preserve int type for TP/FP/TN/FN in CV output and fix test

927bf4c

- Cast count metrics as int (not float) in fold_metrics JSON/CSV - Update test_evaluate_perfect_predictions to expect TP/FP/TN/FN keys Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

feat: add soft label training

19ff25e

github-actions Bot added size-XL and removed size-L labels May 22, 2026

addramir and others added 5 commits May 22, 2026 13:00

fix(l2g): rename apply_soft_labels params and add missing docstring

3385fdc

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

test(l2g): add test for trainer.train with soft labels

e888d4d

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

fix(l2g): handle None n_estimators in _fit_with_soft_labels

a3e4594

XGBClassifier() stores n_estimators=None when not explicitly set; default to 100 (XGBoost's own default) when None is returned. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

addramir marked this pull request as draft May 22, 2026 13:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(l2g): add hyperparameter grid search with file-based CV output#1222

feat(l2g): add hyperparameter grid search with file-based CV output#1222
addramir wants to merge 23 commits into
devfrom
feat/l2g-hyperparameter-grid-cv-output

addramir commented May 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

addramir commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug fix

New features

Behaviour matrix

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

addramir commented May 13, 2026 •

edited

Loading