Explore integration with clinical data #51

@eprifti

Context

gpredomics produces sparse, interpretable models (binary/ternary/ratio/pow2) from omics data. In clinical settings, combining omics-derived predictions with clinical variables (age, BMI, lab values, comorbidities, medications) could significantly improve prediction accuracy.

Implementation target: gpredomicspy — Python bindings layer, leveraging scikit-learn and the Python ecosystem for the integration models while keeping the core Rust engine untouched.

Approaches to explore

Quick wins (in gpredomicspy, no engine changes)

1. Late fusion / Stacking

  • Train gpredomics on omics data → produce score S_omics
  • Use S_omics as a feature alongside clinical variables in a second-stage model (logistic regression, etc.)
  • Use out-of-fold predictions to avoid leakage
  • Pros: simplest, fully preserves interpretability, well-established in metagenomics literature
  • Cons: cannot capture interactions between individual omics features and clinical variables
  • Refs: Pasolli et al. 2016, Topçuoğlu et al. 2020
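
The stacking steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the gpredomicspy API: the data are synthetic, and a plain `LogisticRegression` stands in for the gpredomics model (an assumption made only so the example runs on its own). The point it shows is that the second-stage model is trained on out-of-fold omics scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not real omics or clinical measurements).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X_omics = rng.normal(size=(n, 30)) + 0.8 * y[:, None] * (np.arange(30) < 3)
X_clin = np.column_stack([
    rng.normal(60, 10, size=n) + 3 * y,  # age
    rng.normal(27, 4, size=n),           # BMI
])

# Placeholder for the omics model; in practice this would be a gpredomics fit.
omics_model = LogisticRegression(max_iter=1000)

# Out-of-fold scores: every sample is scored by a model that never saw it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
s_omics_oof = cross_val_predict(
    omics_model, X_omics, y, cv=cv, method="decision_function"
)

# Second stage: S_omics alongside the clinical variables.
Z_train = np.column_stack([s_omics_oof, X_clin])
stacker = make_pipeline(StandardScaler(), LogisticRegression())
stacker.fit(Z_train, y)

# For new patients, refit the omics model on all training data, then stack.
omics_model.fit(X_omics, y)

def predict_combined(X_omics_new, X_clin_new):
    s_new = omics_model.decision_function(X_omics_new)
    Z_new = np.column_stack([s_new, X_clin_new])
    return stacker.predict_proba(Z_new)[:, 1]
```

An unbiased performance estimate of the stacked model would still need an outer cross-validation loop around both stages; the sketch only covers the leakage-free construction of `S_omics`.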

2. Score calibration + combination

  • Calibrate gpredomics score into proper probability via Platt scaling or isotonic regression
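  • Combine with clinical risk scores in log-odds space (naive Bayes combination: add the two log-odds and subtract the prior log-odds once, so the baseline risk is not counted twice)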
  • Pros: very simple, each score independently interpretable, mirrors clinical reasoning
  • Cons: independence assumption may be violated; calibration needs held-out data
  • Refs: Platt 1999, Wirbel et al. 2019
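
A possible shape for this approach is sketched below. The raw scores are simulated, the function name `combine_log_odds` is hypothetical, and the calibration step is a plain logistic fit on held-out scores (Platt's original method additionally smooths the targets). The combination assumes both probabilities were calibrated against the same prevalence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p):
    return np.log(p / (1.0 - p))

# Simulated held-out calibration set: raw omics scores and true labels.
rng = np.random.default_rng(1)
y_cal = rng.integers(0, 2, size=300)
raw_score = rng.normal(loc=1.5 * y_cal, scale=1.0)

# Platt-style scaling: a one-feature logistic regression on the raw score.
# A large C makes the fit close to unregularized.
platt = LogisticRegression(C=1e6)
platt.fit(raw_score.reshape(-1, 1), y_cal)
p_omics = platt.predict_proba(raw_score.reshape(-1, 1))[:, 1]

def combine_log_odds(p_omics, p_clin, prevalence, eps=1e-6):
    """Naive Bayes combination of two calibrated probabilities.

    Adds the two log-odds and subtracts the prior log-odds once,
    assuming the two sources are conditionally independent given disease.
    """
    p_omics = np.clip(p_omics, eps, 1 - eps)
    p_clin = np.clip(p_clin, eps, 1 - eps)
    z = logit(p_omics) + logit(p_clin) - logit(prevalence)
    return 1.0 / (1.0 + np.exp(-z))

# Example: combine with a (made-up) constant clinical risk of 0.3.
prevalence = y_cal.mean()
p_combined = combine_log_odds(p_omics, 0.3, prevalence)
```

When one source is uninformative (its probability equals the prevalence), the combination returns the other source's probability unchanged, which is a quick sanity check for any implementation.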

3. Stratified approaches

  • Use clinical variables (e.g., metformin use, BMI category) to define patient strata
  • Train separate gpredomics models per stratum
  • Pros: directly addresses known confounders (e.g., metformin alters gut microbiome), per-stratum models fully interpretable
  • Cons: reduces sample size per stratum; choice of strata is subjective
  • Refs: Forslund et al. 2015, Vujkovic-Cvijin et al. 2020
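
As a rough sketch of the stratified variant, the loop below fits one model per stratum and skips strata that are too small. Everything here is illustrative: the data are synthetic, `LogisticRegression` is a placeholder for a gpredomics fit, and the `MIN_PER_STRATUM` threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data where the omics signal flips sign with metformin use.
rng = np.random.default_rng(2)
n = 240
metformin = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)
direction = np.where(metformin == 1, 1.0, -1.0)
X_omics = rng.normal(size=(n, 10))
X_omics[:, 0] += y * direction

MIN_PER_STRATUM = 50  # arbitrary guard against tiny strata

# One placeholder model per stratum (a gpredomics fit in practice).
models = {}
for stratum in np.unique(metformin):
    mask = metformin == stratum
    if mask.sum() < MIN_PER_STRATUM:
        continue
    models[int(stratum)] = LogisticRegression().fit(X_omics[mask], y[mask])

def predict_stratified(X, strata):
    """Route each sample to its stratum's model; NaN if no model exists."""
    out = np.full(len(X), np.nan)
    for s, model in models.items():
        mask = strata == s
        if mask.any():
            out[mask] = model.predict_proba(X[mask])[:, 1]
    return out

p = predict_stratified(X_omics, metformin)
```

A single pooled model would see the two opposite effects cancel out on feature 0, which is the confounding scenario the stratified approach is meant to address.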

4. Feature engineering (interaction terms)

  • Create interaction features: S_omics × age, S_omics × BMI, etc.
  • Feed into a second-stage logistic regression
  • Pros: captures effect modification, interpretable interaction coefficients
  • Cons: risk of overfitting with many interactions
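
One way to build the interaction features is shown below, on synthetic data with a simulated `s_omics` score (in practice this would be an out-of-fold gpredomics score). Standardizing before multiplying is a design choice assumed here so that main-effect and interaction coefficients stay comparable; the default L2 penalty in `LogisticRegression` gives some protection against overfitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
s_omics = rng.normal(size=n)       # simulated omics score
age = rng.normal(60, 10, size=n)
bmi = rng.normal(27, 4, size=n)

# Standardize main effects, then form S_omics x clinical products.
main = StandardScaler().fit_transform(np.column_stack([s_omics, age, bmi]))
interactions = main[:, [0]] * main[:, 1:]  # columns: S*age, S*BMI
Z = np.column_stack([main, interactions])
names = ["S_omics", "age", "BMI", "S_omics:age", "S_omics:BMI"]

# Simulated outcome in which age modifies the effect of the omics score.
true_logit = 1.0 * main[:, 0] + 0.8 * main[:, 0] * main[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

model = LogisticRegression(C=1.0).fit(Z, y)
coefs = dict(zip(names, model.coef_[0]))
```

The `coefs` dictionary maps each term to its fitted coefficient, so an effect-modification claim such as "the omics score matters more in older patients" can be read directly from the `S_omics:age` entry.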

Medium-term (minor engine enhancements exposed via gpredomicspy)

5. Bayesian integration

  • Clinical variables define a prior P(disease | clinical); omics model provides likelihood
  • Simple Bayesian updating: P(disease | omics, clinical) ∝ P(omics | disease) × P(disease | clinical)
  • Requires calibrated probability outputs from gpredomics
  • Pros: scientifically natural, clinician-friendly, excellent interpretability
  • Cons: assumes omics and clinical variables are conditionally independent given disease status; needs score calibration
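
The update is easiest to apply in odds form. The notation below is introduced here for illustration: $D$ is disease, $o$ the omics profile, $c$ the clinical variables, $p_o$ the calibrated gpredomics probability $P(D \mid o)$, and $\pi$ the disease prevalence in the data used for calibration. The likelihood ratio is recovered from the calibrated score by dividing out that prevalence.

```math
\begin{aligned}
\frac{P(D \mid o, c)}{P(\bar{D} \mid o, c)}
  &= \frac{P(o \mid D)}{P(o \mid \bar{D})} \times \frac{P(D \mid c)}{P(\bar{D} \mid c)}, \\
\frac{P(o \mid D)}{P(o \mid \bar{D})}
  &= \frac{p_o / (1 - p_o)}{\pi / (1 - \pi)}.
\end{aligned}
```

As a worked example with made-up numbers: if $\pi = 0.4$ and $p_o = 0.7$, the likelihood ratio is $(0.7/0.3)/(0.4/0.6) = 3.5$. With a clinical prior $P(D \mid c) = 0.2$ (prior odds $0.25$), the posterior odds are $3.5 \times 0.25 = 0.875$, which gives $P(D \mid o, c) = 0.875 / 1.875 \approx 0.47$.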

6. Cooperative learning (Ding et al. 2022, Tibshirani group)

  • Each view's model is trained with a penalty encouraging agreement with the other view's predictions
  • gpredomics model stays sparse/discrete; clinical model can be logistic regression
  • Pros: principled multi-view approach compatible with gpredomics's design
  • Cons: requires engine to accept external soft labels during training
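
For reference, the squared-error form of the cooperative learning objective from Ding et al. 2022 can be written as below, with $f_O$ the omics model, $f_C$ the clinical model, and $\rho \ge 0$ the agreement penalty (the subscripts are notation chosen here). How this would carry over to gpredomics's discrete models and classification losses is an open question, not something the formula settles.

```math
\begin{aligned}
\min_{f_O,\, f_C}\;
  & \tfrac{1}{2}\,\bigl\| y - f_O(X_O) - f_C(X_C) \bigr\|^2
    + \tfrac{\rho}{2}\,\bigl\| f_O(X_O) - f_C(X_C) \bigr\|^2, \\
\text{with } f_C \text{ fixed:}\quad
  & f_O \text{ is fit to the target } \;
    \frac{y}{1+\rho} - \frac{(1-\rho)\, f_C(X_C)}{1+\rho}.
\end{aligned}
```

The second line follows from setting the derivative of the objective with respect to $f_O$ to zero. It shows why the engine would need to accept soft labels: at each alternating step the omics model is trained on an adjusted continuous target rather than on the original labels.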

Not recommended for gpredomics

  • Early fusion (concatenating omics + clinical features): conflicts with discrete coefficient languages; scale mismatch
  • Multi-kernel learning / Autoencoders: requires abandoning the gpredomics paradigm
  • Intermediate fusion: designed for deep networks, not sparse linear models

Implementation plan

Implement in gpredomicspy as a Python-level API, e.g.:

```python
import gpredomicspy as gp

# Train omics model
param = gp.Param()
param.load("param.yaml")
exp = gp.fit(param)

# Clinical integration (stacking)
integrator = gp.ClinicalIntegrator(method="stacking")
integrator.fit(exp, clinical_df, y)
combined_pred = integrator.predict(clinical_df_test)

# Or calibration + combination
integrator = gp.ClinicalIntegrator(method="calibrated_combination")
integrator.fit(exp, clinical_risk_scores, y)
```

This keeps the Rust engine focused on omics modeling while Python handles the integration layer with scikit-learn under the hood.
