feat(l2g): add hyperparameter grid search with file-based CV output#1222
Draft
addramir wants to merge 23 commits into
Draft
feat(l2g): add hyperparameter grid search with file-based CV output#1222addramir wants to merge 23 commits into
addramir wants to merge 23 commits into
Conversation
- Fix bug in cross_validate where set_params was called after fit, meaning hyperparameter grid values were never actually used during training - Add hyperparameter_grid and cv_results_dir params to LocusToGeneStep, LocusToGeneConfig, and LocusToGeneTrainer.train/cross_validate - When cv_results_dir is set (without W&B), iterate the full grid explicitly and write per-config results: cv_results.json, cv_folds.csv, and per-config plots (ROC, Precision-Recall, confusion matrix at threshold 0.5) - Support both local and GCS (gs://) output paths; GCS writes via a local temp dir uploaded with google-cloud-storage - Extract fold-level training logic into _run_cv_fold to reduce cross_validate complexity Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Introduces a `filter_dark_matter` option to `LocusToGeneStep` that removes uninformative loci from the training set before model fitting. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix filter_dark_matter_loci docstring: "at least one" → "every positive" - Return early (no-op + warning) when no neighbourhood distance features are present, avoiding incorrect removal of loci with undetermined nearness - Consolidate 9 Spark count() actions into 3 aggregations to reduce job overhead - Fix filter_dark_matter parameter docstring in LocusToGeneStep to match the ALL-positives-must-be-dark-matter removal rule - Replace transitive fsspec dependency with google-cloud-storage (direct dep) for GCS writes; fall back to open() for local paths - Add 5 unit tests covering: all-dark-matter locus removed, mixed locus kept, nearest-gene positive protected, no-nearest-features no-op, no-GS raises Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Persist dark_matter_positives_per_locus to avoid Spark recomputing it when collecting dm_row stats (comment #8) - Replace google.cloud.storage GCS special-case with fsspec.open for unified local/GCS write in _write_dark_matter_stats (comment #7) - Add explicit warning in run_train when filter_dark_matter=True but filter is a no-op due to missing features in features_list (comment #6) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
fsspec is only a transitive dep and deptry rejects undeclared imports. Revert _write_dark_matter_stats to use google.cloud.storage (already a direct dependency) for GCS writes and plain open() for local paths. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Coalesce NULL signal features to 0.0 before comparison so that positives with missing coloc/VEP evidence (from inner joins) are correctly identified as dark matter instead of being silently excluded - Persist self._df after left_anti join to avoid recomputing the filtered matrix for after-stats and downstream training - Create parent directory before writing stats JSON to local path to prevent FileNotFoundError when model_path parent doesn't exist yet - Guard f.sum(f.when(...)) results against NULL on empty DataFrames - Add regression test for NULL signal features treated as no signal Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
vepMaximum can be NULL when VEP annotation is absent. The coalesce fix treats NULL as 0.0 (below threshold), so the locus is correctly dark matter. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Resolves conflicts in config.py and l2g.py by combining both sets of new parameters: hyperparameter_grid/cv_results_dir from this branch and hf_credentials_path/wandb_credentials_path from dev. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Combines dark matter loci filter (filter_dark_matter param) from PR #1224 with hyperparameter grid search and CV output from this branch. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR extends the L2G training pipeline to support non-W&B hyperparameter grid search with persisted cross-validation artifacts, and adds an optional training-time “dark matter loci” filter for gold-standard feature matrices.
Changes:
- Fix CV hyperparameters not being applied during fold training by setting params before
fit(), and refactor per-fold logic into a private helper. - Add explicit (non-W&B) hyperparameter grid iteration with optional file/GCS output of CV metrics, CSV summaries, and diagnostic plots.
- Add
filter_dark_matteroption + implementation to drop loci where all positives lack functional signal and are not nearest-gene, plus stats output and unit tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/gentropy/dataset/test_l2g_feature_matrix.py | Adds unit tests covering dark matter locus filtering behavior and edge cases. |
| src/gentropy/method/l2g/trainer.py | Fixes CV param application timing; adds explicit grid expansion + file/GCS CV outputs and plotting helpers. |
| src/gentropy/l2g.py | Wires new training options (hyperparameter_grid, cv_results_dir, filter_dark_matter) into the step and writes dark-matter stats. |
| src/gentropy/dataset/l2g_feature_matrix.py | Implements filter_dark_matter_loci() with stats and logging. |
| src/gentropy/config.py | Exposes new configuration parameters for the step. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| f.sum("dark_matter_count").alias("dm_positives"), | ||
| ).collect()[0] | ||
|
|
||
| self._df = self._df.join(dark_matter_loci, "studyLocusId", "left_anti").persist() |
| "config_id": config_id, | ||
| "hyperparameters": data["config"], | ||
| "fold_metrics": [ | ||
| {"fold": f["fold"], **f["metrics"]} for f in folds |
Comment on lines
+551
to
+560
| if collect_for_file: | ||
| fold_results.append( | ||
| { | ||
| "config": dict(config) if config else {}, | ||
| "fold": fold_index, | ||
| "metrics": metrics, | ||
| "y_true": y_fold_val, | ||
| "y_pred_proba": y_pred_proba[:, 1], | ||
| } | ||
| ) |
Comment on lines
+468
to
+479
| elif cv_results_dir: | ||
| for config in self._expand_grid(parameter_grid): | ||
| run_all_folds(run_config=config) | ||
| else: | ||
| # Evaluate with cross validation to the terminal | ||
| run_all_folds() | ||
|
|
||
| if collect_for_file and fold_results: | ||
| self._save_cv_results( | ||
| cv_results_dir=cv_results_dir, # type: ignore[arg-type] | ||
| fold_results=fold_results, | ||
| n_splits=n_splits, | ||
| ) |
Comment on lines
+477
to
+481
| if self.filter_dark_matter: | ||
| feature_matrix, dark_matter_stats = feature_matrix.filter_dark_matter_loci() | ||
| if dark_matter_stats: | ||
| if self.model_path: | ||
| self._write_dark_matter_stats(dark_matter_stats) |
- Warn instead of silently dropping dark matter stats when model_path is None - Raise ValueError with offending keys when hyperparameter_grid entry lacks 'values' - Replace type: ignore with assert in cross_validate - Move filter_dark_matter next to other training flags in LocusToGeneConfig - Ignore train_on_full_dataset when grid search produces more than one config Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Unpersist old self._df in filter_dark_matter_loci to avoid executor memory leak - Fix NumPy scalar JSON serialisation in fold_metrics (float() cast) - Process each config's results incrementally to bound driver memory to O(n_samples x n_folds x 1 config) instead of accumulating all configs - Replace _save_cv_results with _summarise_and_plot_config + _write_cv_files - Add tests for _expand_grid (single-param, cartesian product, missing key) - Add test for cv_results_dir file output asserting JSON/CSV structure and that all metric values are plain Python floats - Update PR description to document filter_dark_matter feature Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Allows setting hf_hub_repo_id=null (e.g. during grid search runs) without Hydra/OmegaConf rejecting the override. Upload logic already guards with `if self.hf_hub_repo_id` so None naturally skips it. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…gaConf merge bug OmegaConf cannot merge a dict value into an Optional[Dict] field that is currently None — it calls .get() on the None node and raises AttributeError. Typing as Any bypasses the typed merge and allows the override to work. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Adds confusion matrix counts to evaluate() so cv_folds.csv includes per-fold false positive counts alongside the existing aggregate metrics. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Cast count metrics as int (not float) in fold_metrics JSON/CSV - Update test_evaluate_perfect_predictions to expect TP/FP/TN/FN keys Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Adds n_loci, n_loci_one_gene_above, and n_loci_no_gene_above to the per-fold metrics dict so the CSV output captures how well each config resolves study loci to a single confident gene (L2G >= 0.5). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
XGBClassifier's sklearn wrapper rejects float labels. When soft labels are present, train via raw xgboost DMatrix API and inject the booster back into a cloned XGBClassifier to preserve predict_proba. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
XGBClassifier() stores n_estimators=None when not explicitly set; default to 100 (XGBoost's own default) when None is returned. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n matrix - Remove classes_ and n_features_in_ assignments in _fit_with_soft_labels: both are read-only properties derived from the booster in this XGBoost version - Pass labels=[0, 1] to confusion_matrix to guarantee a 2x2 matrix when all predictions fall into a single class (e.g. tiny test sets) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
wandb.plot_classifier passes y_true directly to sklearn metrics which reject a mix of continuous (soft labels) and binary (y_pred) targets. Binarize at 0.5 — same threshold used in evaluate(). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bug fix
New features
Hyperparameter grid search (`hyperparameter_grid`)
W&B-style grid (`{"param": {"values": [v1, v2, ...]}}`) now accepted by `LocusToGeneStep`, `LocusToGeneConfig`, and `LocusToGeneTrainer`. When provided without W&B, the grid is iterated explicitly using `itertools.product`. When `train_on_full_dataset=True` is also set, it is automatically ignored (with a warning) because no single winning config has been chosen yet.
File-based CV output (`cv_results_dir`)
When set (and `wandb_run_name` is not set), writes the following after all folds complete:
Each config's raw arrays are released from driver memory as soon as its plots are written, so memory usage is O(n_samples × n_folds × 1 config) rather than O(n_samples × n_folds × n_configs).
GCS support: `cv_results_dir` accepts both local paths and `gs://` URIs; GCS output is written to a local temp dir then uploaded via `google-cloud-storage` (already a dependency).
Dark matter loci filter (`filter_dark_matter`)
Removes loci from the training set where every gold-standard positive has no molecular evidence linking it to the locus. A positive is considered "dark matter" when:
A locus is dropped only when ALL its positives qualify; loci with at least one signal-carrying or nearest-gene positive are kept. The entire locus (positives and negatives) is dropped to preserve class balance within retained loci. Filtering stats are written as `{model_path}_dark_matter_stats.json`.
Defaults to `False` — existing training runs are unaffected.
Refactor: `cross_validate_single_fold` extracted as `_run_cv_fold` private method; `_save_cv_results` split into `_summarise_and_plot_config` (per-config, releases raw arrays) and `_write_cv_files` (final JSON + CSV).
Behaviour matrix
Test plan
🤖 Generated with Claude Code