Emma Strand Meeting:

Elastic net has been a good basic start, but doesn’t handle non-linear relationships or context-dependence
https://shap.readthedocs.io/en/latest/index.html
^ Designed to have more inputs with different variables
Emma’s biggest worry with elastic net was that it isn’t capturing any interactions among the inputs
Agrees we’re strapped on sample size for the really big/complicated analyses
What are our questions? Do we want to know what the integrations among layers are? Do we just want to know what feature(s) are most important?
Recommends being very purposeful about what we’re trying
Elastic net: can be a very helpful tool if we know what type of outcome we expect, but maybe not the best fit here? Especially with DNA methylation, since we pretty much know it’s not a linear relationship between methylation and expression
With an epigenetic clock, you already know the relationship; you’re just trying to get the smallest number of features that are still predictive
Maybe Elastic net is more of a step 2? Like once we know what features have relationships
One way to add non-linearity to Elastic Net would be to throw in inputs that are non-linearized (e.g., with inputs X and Y, add in the input of X*Y). Maybe this could be useful? Like try a bunch of combinations to find the most predictive combination?
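A minimal sketch of the interaction-term idea above, in plain Python. The function name add_pairwise_interactions and the toy feature names/values are hypothetical; the expanded matrix would then go into whatever elastic net fit is being used. Worth noting: with p inputs this adds p(p-1)/2 product columns, which gets big fast given the sample-size concern.

```python
from itertools import combinations

def add_pairwise_interactions(names, rows):
    """Append the product of every pair of input columns to each row.

    names: list of feature names, e.g. ["X", "Y"]
    rows:  list of samples, each a list of floats in the same order as names
    Returns (new_names, new_rows) with columns such as "X*Y" appended.
    """
    pairs = list(combinations(range(len(names)), 2))
    new_names = list(names) + [f"{names[i]}*{names[j]}" for i, j in pairs]
    new_rows = [list(row) + [row[i] * row[j] for i, j in pairs] for row in rows]
    return new_names, new_rows

# Toy example (made-up values): two samples, three inputs.
names = ["meth", "miRNA", "lncRNA"]
rows = [[0.5, 2.0, 4.0],
        [1.0, 3.0, 0.0]]
new_names, new_rows = add_pairwise_interactions(names, rows)
print(new_names)
print(new_rows[0])
```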
Maybe I can talk with Steve, a volunteer @ URI who’s been working with Shelly and who’s been studying ML applications. Emma or Shelley can connect me. He’s super helpful
Paper from Steve Vollmer (?) using a program, think it’s multi-omic. Trying to predict disease susceptibility from SNPs. But program could still be useful. Could maybe just replace disease susceptibility (numerical score) with expression? Paper: “Genomic signatures of disease resistance in endangered staghorn corals” (Science)
How can we decide how to input the methylation data? Could try median % level at the gene level and methylation density over the gene?
Maybe try to figure this out in Elastic Net first. Median % gene methylation, median methylation density. Could then easily compare methylation level v expression and methylation % v expression. Could also try a couple variants with just intronic methylation, exonic methylation, etc.
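A rough sketch of how the two candidate gene-level inputs could be computed from CpG-level calls, in plain Python. Assumptions (not settled in the meeting): "methylation density" is taken here to mean methylated CpGs per kb of gene length, and a CpG counts as methylated at or above a 50% cutoff. The function name, cutoff, and toy data are all hypothetical placeholders.

```python
from collections import defaultdict
from statistics import median

def summarize_gene_methylation(cpgs, gene_lengths_bp, methylated_cutoff=50.0):
    """Collapse CpG-level calls to one summary per gene.

    cpgs: iterable of (gene_id, percent_methylation) tuples, one per CpG
    gene_lengths_bp: dict gene_id -> gene length in bp
    methylated_cutoff: % methylation at or above which a CpG counts as
        methylated (ASSUMPTION: 50% is an arbitrary placeholder)
    Returns dict gene_id -> {"median_pct": ..., "density_per_kb": ...}
    """
    by_gene = defaultdict(list)
    for gene_id, pct in cpgs:
        by_gene[gene_id].append(pct)

    summary = {}
    for gene_id, pcts in by_gene.items():
        n_methylated = sum(1 for p in pcts if p >= methylated_cutoff)
        summary[gene_id] = {
            "median_pct": median(pcts),
            # ASSUMPTION: density = methylated CpGs per kb of gene length
            "density_per_kb": n_methylated / (gene_lengths_bp[gene_id] / 1000),
        }
    return summary

# Toy data (made up): gene A has 4 CpGs over 2 kb, gene B has 3 CpGs over 1 kb.
cpgs = [("A", 80.0), ("A", 60.0), ("A", 10.0), ("A", 90.0),
        ("B", 0.0), ("B", 5.0), ("B", 55.0)]
gene_lengths_bp = {"A": 2000, "B": 1000}
summary = summarize_gene_methylation(cpgs, gene_lengths_bp)
print(summary)
```

The same function could be run separately on intronic-only or exonic-only CpGs to get the variants mentioned above.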
Recommends being really mindful of each step to make sure I don’t have to go back and rerun things. Start at the basic, figure out which type of methylation input to use first (which is most correlated with expression level). Doesn’t necessarily take into account interactions, but can’t keep up with spider webbing from multiple routes. After I have the inputs decided, then can start looking into other models to try.

AI summary:

Prompt:

“I have a large multi-omics dataset, which I want to use to do the following: Predict gene expression from input epigenetic features (CpG methylation and/or regional methylation, miRNA expression, lncRNA expression), and/or determine the relative importance of different epigenetic features to determining expression. Please suggest several ML models that might be useful for doing this, or which have been used for similar applications in biology/ecology contexts (provide all sources). For each, please explain the basics of the model, its strengths/limitations, assumptions, etc. Keep in mind my data sets will have multi-colinearity, and I want to account for the possibility of non-linear/context-dependent interactions. I also want to be able to implement the model in either R or command-line.”

1) Penalized regression (Elastic Net / Lasso / Ridge)

What it is

Linear regression with penalties on coefficients (L1 = Lasso, L2 = Ridge, Elastic Net = mixture). Elastic Net is very popular for genomics because it does variable selection and handles correlated predictors better than Lasso alone. OUP Academic+1
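To make the penalty mixture concrete, here is a small sketch of the loss that Elastic Net minimizes, written in plain Python using the glmnet-style parameterization (penalty strength lam, mixing alpha). The intercept is left out for brevity and the toy numbers are made up; this only evaluates the objective and does not fit anything.

```python
def elastic_net_objective(X, y, beta, lam, alpha):
    """Penalized loss minimized by elastic net (glmnet-style, no intercept).

    loss = (1 / 2n) * sum((y_i - x_i . beta)^2)
           + lam * (alpha * sum|beta_j| + (1 - alpha) / 2 * sum beta_j^2)
    alpha = 1 gives Lasso, alpha = 0 gives Ridge.
    """
    n = len(y)
    residuals = [yi - sum(xij * bj for xij, bj in zip(xi, beta))
                 for xi, yi in zip(X, y)]
    fit = sum(r * r for r in residuals) / (2 * n)
    l1 = sum(abs(b) for b in beta)   # Lasso part
    l2 = sum(b * b for b in beta)    # Ridge part
    return fit + lam * (alpha * l1 + (1 - alpha) / 2 * l2)

# Toy values (made up): 2 samples, 2 features.
X = [[1.0, 0.0],
     [0.0, 1.0]]
y = [1.0, 2.0]
beta = [1.0, 1.0]
lasso = elastic_net_objective(X, y, beta, lam=0.5, alpha=1.0)
ridge = elastic_net_objective(X, y, beta, lam=0.5, alpha=0.0)
mixed = elastic_net_objective(X, y, beta, lam=0.5, alpha=0.5)
print(lasso, ridge, mixed)
```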

Why use it for your data

Good baseline for continuous gene expression prediction. Handles high-dimensional data (p ≫ n). Where Lasso tends to keep one predictor from a correlated group arbitrarily, Elastic Net tends to keep the group together and shrink its coefficients toward each other. Coefficients are directly interpretable as (shrunken) effect sizes.

Strengths

Fast, well understood, and easy to tune (cross-validated penalty λ, mixing α).
Produces sparse model (variable selection), helping identify important CpGs/miRNAs/lncRNAs.
Robust to multicollinearity compared to plain OLS.

Limitations/assumptions

Linear model — won't capture complex nonlinear or context-dependent interactions unless you add engineered interaction terms / basis expansions (splines, polynomial features).
If many nonlinear interactions are important, performance will be limited.

R / CLI implementations

R: glmnet (very widely used).
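CLI: you can use Rscript to run glmnet, or Python scikit-learn's ElasticNet if you prefer the command line. Note the naming clash: scikit-learn's alpha is the penalty strength (glmnet's λ), and its l1_ratio is the mixing parameter (glmnet's α).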

Papers: robust meta-analysis and applications of Elastic Net in genomic contexts. OUP Academic+1

When to pick it

Use it as a first model and for feature selection. Good when you want interpretable coefficients and to reduce dimensionality prior to more complex models.

2) Random Forests / Ranger (ensemble tree models)

What it is

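Ensemble of decision trees (bagging): each tree is fit to a bootstrap sample, with a random subset of predictors considered at each split, and predictions are averaged across trees; captures nonlinearities and interactions automatically. Widely used in epigenetics and methylation biomarker discovery. BioMed Central+1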

Why use it for your data

Handles mixed predictor types, nonlinear relationships, and interactions without explicit feature engineering. Not strongly affected by multicollinearity in terms of predictive power, though variable importance measures can be biased if predictors are correlated.

Strengths

Good off-the-shelf performance, robust, few hyperparameters.
Variable importance (impurity-based, i.e. Gini for classification or variance reduction for regression; or permutation) gives a sense of feature contributions.
R: randomForest, faster ranger; CLI: ranger can be run from command line or via R.

Limitations/assumptions

Interpretation is less direct than linear models.
Standard importance metrics can be biased with correlated features; permutation importance or conditional importance methods are preferred.
Can overfit if not tuned and if sample size is small relative to noise.
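For reference, a minimal sketch of plain (marginal) permutation importance in Python: shuffle one column, see how much the error increases. This is the simple version, not the conditional variant recommended for correlated features. The toy_model function is a hypothetical stand-in for any fitted model's predict function, and the data are made up.

```python
import random

def mse(predict, X, y):
    """Mean squared error of predict(row) over all rows."""
    return sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Increase in MSE when one column is shuffled, averaged over repeats.

    predict: any fitted model's prediction function, row -> float
    Returns a list with one importance value per column of X.
    """
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy stand-in for a fitted model (hypothetical): uses feature 0 only.
def toy_model(row):
    return 3.0 * row[0]

X = [[float(i), float((7 * i) % 20)] for i in range(20)]
y = [3.0 * row[0] for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the unused feature leaves the error unchanged, so its importance is zero; shuffling the used one raises the error.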

Notes on collinearity & interactions

RF handles interactions, but if you need reliable ranking among highly correlated CpGs, pair RF with conditional permutation importance or follow up with Elastic Net on RF-selected features. BioMed Central

3) Gradient boosting (XGBoost / LightGBM / CatBoost)

What it is

Sequential ensemble of trees built to correct residuals (boosting). Often better accuracy than RF when tuned. Frequently used in multi-omics predictive models. ScienceDirect
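The "correct residuals" idea can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions (one feature, one-split stumps, squared error, no regularization beyond the learning rate), not how XGBoost/LightGBM/CatBoost are implemented; all names and data are made up.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    # Exclude the largest value so the right side of a split is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Boosting for squared error: each stump is fit to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for pi, xi in zip(predictions, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mean_squared_error(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy nonlinear relationship (made up): y = x^2 on a small grid.
x = [i / 2 for i in range(-6, 7)]
y = [xi ** 2 for xi in x]
y_mean = sum(y) / len(y)
mean_only = lambda xi: y_mean
model = boost(x, y)
print(mean_squared_error(mean_only, x, y), mean_squared_error(model, x, y))
```

The learning rate and number of rounds here are the same kind of hyperparameters that need tuning in the real libraries.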

Strengths

Strong predictive performance, handles nonlinearities and interactions.
Feature importance, SHAP values available for interpretability.
Scales well (LightGBM / XGBoost) and has CLI and R bindings.

Limitations

More hyperparameters to tune (learning rate, tree depth).
Like RF, correlated features complicate feature-importance interpretation. Use SHAP and careful conditional analyses.

When to pick

If you need the best off-the-shelf predictive accuracy and can invest time in tuning. Good candidate if nonlinear/context dependencies are expected.

4) Bayesian Additive Regression Trees (BART)

What it is

Bayesian ensemble of trees (sum of trees) that gives probabilistic predictions and naturally regularizes to avoid overfitting. Useful for flexible regression with uncertainty estimates. BART has seen use in regulatory genomics and TF prediction contexts (and more broadly in biology). PMC+1

Strengths

Captures nonlinearities and interactions without explicit specification.
Provides full posterior uncertainty (credible intervals) for predictions and effects.
Often performs competitively with boosting while offering principled uncertainty.

Limitations

More computationally intensive than RF/XGBoost.
Interpretation is harder; partial dependence and variable inclusion proportions can help.

R implementations

R packages: BART, dbarts, bartMachine (note license/Java dependency). Good choice when uncertainty quantification matters.

5) Kernel methods / Gaussian Processes / Support Vector Regression (SVR)

What it is

Kernel regression (e.g., SVR with radial basis kernel, Gaussian Processes) maps inputs to high-dimensional feature spaces to capture nonlinear relations.

Strengths

Powerful for smooth nonlinear relationships and small to moderate datasets. Gaussian Processes give uncertainty estimates and work well when you need calibrated intervals.

Limitations

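Struggles with very large feature spaces (CpG arrays with hundreds of thousands of features) without dimensionality reduction: when p ≫ n, a generic kernel computes similarity over all features at once, so the many irrelevant ones can swamp the signal unless you preselect or aggregate features. Separately, exact Gaussian Processes scale cubically with the number of samples, which is where approximate kernels help.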

When to pick

Useful if you have reduced feature sets (selected CpGs / aggregated regional methylation) and want flexible, probabilistic modeling.

6) Deep learning (fully connected networks, CNNs, ResNets for methylation → expression)

What it is

Neural networks (MLPs), convolutional networks (if using sequence or spatially ordered CpGs), and specialized architectures (e.g., DeepMethyGene) trained end-to-end to learn nonlinear mapping from epigenetic features to expression. Recent work shows deep nets can predict expression from methylation patterns. BioMed Central

Strengths

Can learn complex, high-order interactions and hierarchical features; CNNs can exploit local genomic structure.
Flexible architectures let you combine inputs (miRNA, CpG, lncRNA) in multi-branch models.

Limitations

Data hungry — needs many samples (or strong regularization / transfer learning). Risk of overfitting with limited sample size.
Harder to interpret (but can use integrated gradients, DeepLIFT, SHAP for NN).
Requires more compute (GPUs helpful).

R / CLI implementations

R: keras + tensorflow (R interface). CLI: Python tensorflow/pytorch scripts.

Advice

If sample size is modest, consider pretraining, transfer learning or using deep models to learn features and then feed them into simpler models.

7) Multi-omics integrative and multivariate methods (DIABLO / mixOmics, MOFA, multi-view methods)

What it is

Methods tuned for multi-omics integration: DIABLO (supervised sparse PLS for multi-omics discriminant analysis / prediction and feature selection); MOFA (unsupervised factor analysis for multi-omics latent factors). These identify components/latent factors that combine signals across omics and can be used for downstream regression/prediction. DIABLO has been used for integrating mRNA, miRNA, methylation, proteomics in predictive frameworks. PMC+1

Strengths

Explicitly built to integrate heterogeneous omics and find shared signals and cross-omics associations.
DIABLO provides sparse multi-omics signatures and is implemented in mixOmics (R). MOFA gives latent factors useful as inputs to regression or for interpretation.

Limitations

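DIABLO proper (block.splsda in mixOmics) is a supervised classifier for categorical outcomes; for a continuous response such as expression, the regression counterpart in mixOmics is block.spls. Both need careful tuning and assume linear latent relationships (though they can capture nonlinearities indirectly via components).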
MOFA is unsupervised — useful for exploratory factor discovery but needs a downstream supervised model for prediction.

When to pick

When you want to combine all your omics coherently and identify cross-modal predictors (e.g., miRNA + CpG factors that jointly explain expression variance). Great for feature reduction before complex models. mixOmics

8) Hybrid / stacked approaches and feature-selection pipelines

Practical strategy (recommended)

Step 1: Preprocess & reduce dimensionality:
- For methylation: consider aggregating CpGs to biologically plausible regions (promoters, gene bodies, enhancers) or using principal components / variance filtering / domain knowledge to reduce p.
- For miRNA/lncRNA: variance/abundance filtering and perhaps correlation clustering.
Step 2: Baseline models & feature selection:
- Fit Elastic Net for a sparse, interpretable baseline.
- Fit Random Forest / XGBoost for nonlinear baseline and feature importance.
Step 3: Integrative modeling:
- Use DIABLO or MOFA to create joint components fed to regression/ensemble models.
Step 4: Final modeling:
- If sample size is large and you need best predictive accuracy, try XGBoost or a small NN; otherwise BART or RF plus Elastic Net on top of selected features.
Step 5: Interpretation:
- Use SHAP values (for tree/boosting) or permutation importance, partial dependence plots, interaction detection (SHAP interaction values), and compare with Elastic Net coefficients.

Why this pipeline helps

Combines the interpretability of penalized regression with the flexibility of tree/boosting methods, addresses multicollinearity (regularization + aggregation), and gives ways to probe interactions.

Practical notes on multicollinearity, interactions, and interpretation

Multicollinearity: use Elastic Net, or aggregate CpGs into regional summaries, or use dimensionality reduction (PCA, MOFA). For variable ranking from tree models with correlated features, prefer conditional permutation importance or SHAP values and follow up with regression on selected features. BioMed Central
Nonlinear/context-dependent interactions: tree ensembles, BART, and neural nets naturally capture them. Use interaction detection tools (H statistic, SHAP interaction values) to highlight important interactions.
Interpretability: combine sparse linear models (effect sizes) with model-agnostic explainers (SHAP, partial dependence).
Uncertainty: BART and Gaussian Processes provide principled uncertainty; bootstrap or Bayesian methods can give intervals for other models.
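As an illustration of the bootstrap option, here is a minimal percentile-bootstrap sketch in plain Python. It resamples samples with replacement and recomputes a statistic each time; the statistic used here is a simple methylation vs. expression correlation, but the same loop would apply to refitting a model and recording a coefficient or prediction. Function names and the toy data are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def bootstrap_interval(x, y, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample samples with replacement."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = statistic([x[i] for i in idx], [y[i] for i in idx])
        if value is not None:  # skip degenerate resamples
            values.append(value)
    values.sort()
    lower_index = int((1 - level) / 2 * len(values))
    upper_index = int((1 + level) / 2 * len(values)) - 1
    return values[lower_index], values[upper_index]

# Toy data (made up): gene-level methylation % vs. expression for 12 samples.
methylation = [5, 10, 12, 20, 25, 33, 40, 48, 55, 60, 72, 80]
expression = [9.1, 8.7, 8.9, 7.5, 7.9, 6.8, 6.1, 6.4, 5.0, 5.2, 4.1, 3.9]
point = pearson(methylation, expression)
low, high = bootstrap_interval(methylation, expression, pearson)
print(point, low, high)
```

With very small sample sizes the percentile interval can be too narrow, so treat it as a rough guide.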

Implementation pointers & R / CLI packages (quick list)

Elastic Net: glmnet (R). OUP Academic
Random Forest: randomForest, ranger (R / CLI). BioMed Central
Gradient boosting: xgboost, lightgbm, catboost (R + CLI). ScienceDirect
BART: BART, dbarts, bartMachine (R). PMC
Deep learning: keras/tensorflow (R interface) or Python tensorflow/pytorch. Example architecture papers: DeepMethyGene for methylation→expression. BioMed Central
Multi-omics integrative: mixOmics (DIABLO), MOFA2 (R / python). PMC+1
Interpretation: SHAPforxgboost / shapper / iml (R), shap (Python).
Pipelines / tuning: caret, tidymodels, mlr3 (R).

Key references (start here)

mixOmics / DIABLO (multi-omics supervised integration & prediction). PMC+1
Elastic Net in genomics/meta-analysis (robust penalized regression). OUP Academic+1
Random Forest applications to methylation / biomarker modeling. BioMed Central+1
DeepMethyGene — example of deep CNN to predict expression from methylation (recent). BioMed Central
BART for genomic inference & TF prediction (Bayesian tree ensemble; uncertainty). PMC+1