Emma Strand Meeting:
Elastic net has been a good basic start, but doesn’t handle non-linear relationships or context-dependence
^ Designed to have more inputs with different variables
Emma’s biggest worry with elastic net was that it isn’t capturing any interactions among the inputs
Agrees we’re strapped on sample size for the really big/complicated analyses
What are our questions? Do we want to know what the integrations among layers are? Do we just want to know what feature(s) are most important?
Recommends being very purposeful about what we’re trying
Elastic net: can be a very helpful tool if we know what type of outcome we expect, but maybe not the best fit here? Especially with DNA methylation, since we pretty much know it’s not a linear relationship between methylation and expression
With epigenetic clock, you already know the relationship, you’re just trying to get the smallest number of features that are still predictive
Maybe Elastic net is more of a step 2? Like once we know what features have relationships
One way to add non-linearity to Elastic Net would be to throw in inputs that are non-linearized (e.g., with inputs X and Y, add in the input of X*Y). Maybe this could be useful? Like try a bunch of combinations to find the most predictive combination?
Maybe I can talk with Steve, volunteer @ URI who’s been working with Shelly, who’s been studying ML applications. Emma or Shelley can connect me. He’s super helpful
Paper from Steve Volmer (?) using a program, think it’s multi-omic. Trying to predict disease susceptibility from SNPs. But program could still be useful. Could maybe just replace disease susceptibility (numerical score) with expression? Genomic signatures of disease resistance in endangered staghorn corals | Science
How can we decide how to input the methylation data? Could try median % level at the gene level and methylation density over the gene?
Maybe try to figure this out in Elastic Net first. Median % gene methylation, median methylation density. Could then easily compare methylation density v expression and methylation % v expression. Could also try a couple variants with just intronic methylation, exonic methylation, etc.
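A rough sketch of how the two gene-level inputs could be computed and compared, assuming Python with pandas. The table layout, the column names, the 50% cutoff, the gene lengths, and the expression values are all hypothetical. "Density" here is defined as methylated CpGs per kb of gene, which is only one possible definition:

```python
import pandas as pd

# Hypothetical long-format CpG table: one row per CpG per sample
cpg = pd.DataFrame({
    "sample":   ["s1"] * 5 + ["s2"] * 5,
    "gene":     ["g1", "g1", "g1", "g2", "g2"] * 2,
    "pct_meth": [80, 90, 10, 0, 20, 60, 70, 50, 5, 15],
})
gene_length_kb = pd.Series({"g1": 2.0, "g2": 4.0})  # hypothetical gene lengths
METH_THRESHOLD = 50  # assumed cutoff for calling a CpG "methylated"

cpg["is_meth"] = cpg["pct_meth"] >= METH_THRESHOLD
summary = (
    cpg.groupby(["sample", "gene"])
       .agg(median_pct=("pct_meth", "median"), n_meth=("is_meth", "sum"))
       .reset_index()
)
# Methylated CpGs per kb of gene (one possible definition of density)
summary["meth_density"] = summary["n_meth"] / summary["gene"].map(gene_length_kb)

# Hypothetical expression values, one per (sample, gene) row of `summary`
summary["expression"] = [5.0, 1.0, 6.0, 1.5]

# Spearman correlation computed as Pearson correlation of ranks
inputs = ["median_pct", "meth_density"]
rank_corr = summary[inputs].rank().corrwith(summary["expression"].rank())
print(summary)
print(rank_corr)
```

Intronic-only or exonic-only variants would just mean filtering the CpG table by an annotation column before the groupby.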
Recommends being really mindful of each step to make sure I don’t have to go back and rerun things. Start at the basic, figure out which type of methylation input to use first (which is most correlated with expression level). Doesn’t necessarily take into account interactions, but can’t keep up with spider webbing from multiple routes. After I have the inputs decided, then can start looking into other models to try.
AI summary:
Prompt:
“I have a large multi-omics dataset, which I want to use to do the following: Predict gene expression from input epigenetic features (CpG methylation and/or regional methylation, miRNA expression, lncRNA expression), and/or determine the relative importance of different epigenetic features to determining expression. Please suggest several ML models that might be useful for doing this, or which have been used for similar applications in biology/ecology contexts (provide all sources). For each, please explain the basics of the model, its strengths/limitations, assumptions, etc. Keep in mind my data sets will have multi-colinearity, and I want to account for the possibility of non-linear/context-dependent interactions. I also want to be able to implement the model in either R or command-line.”
1) Penalized regression (Elastic Net / Lasso / Ridge)
What it is
- Linear regression with penalties on coefficients (L1 = Lasso, L2 = Ridge, Elastic Net = mixture). Elastic Net is very popular for genomics because it does variable selection and handles correlated predictors better than Lasso alone. OUP Academic+1
Why use it for your data
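- Good baseline for continuous gene expression prediction. Handles high-dimensional data (p ≫ n). Elastic Net tends to keep or drop correlated predictors as a group, sharing weight among them, rather than arbitrarily keeping one as Lasso does. Coefficients are directly interpretable as the direction and relative size of each effect, though they are shrunk toward zero by the penalty.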
Strengths
Fast, well understood, and easy to tune (cross-validated penalty λ, mixing α).
Produces sparse model (variable selection), helping identify important CpGs/miRNAs/lncRNAs.
Robust to multicollinearity compared to plain OLS.
Limitations/assumptions
Linear model — won't capture complex nonlinear or context-dependent interactions unless you add engineered interaction terms / basis expansions (splines, polynomial features).
If many nonlinear interactions are important, performance will be limited.
R / CLI implementations
R: glmnet (very widely used).
CLI: you can use Rscript to run glmnet, or Python scikit-learn's ElasticNet if you prefer command line.
Papers: robust meta-analysis and applications of Elastic Net in genomic contexts. OUP Academic+1
When to pick it
- Use it as a first model and for feature selection. Good when you want interpretable coefficients and to reduce dimensionality prior to more complex models.
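A small sketch of the cross-validated tuning mentioned above, assuming Python with scikit-learn and simulated p ≫ n data. Note the naming clash: glmnet's mixing α is `l1_ratio` in scikit-learn, and glmnet's penalty λ is `alpha`.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_samples, n_features = 40, 500  # p >> n, simulated
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 2.0, -3.0, 2.5]  # only 5 features carry signal
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Standardize so the penalty treats all features on the same scale
X_std = StandardScaler().fit_transform(X)

# Cross-validate over the mixing parameter and the penalty strength together
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000)
enet.fit(X_std, y)

selected = np.flatnonzero(enet.coef_)  # indices of features kept in the model
print("chosen l1_ratio (glmnet alpha):", enet.l1_ratio_)
print("chosen alpha (glmnet lambda):", enet.alpha_)
print("number of features kept:", selected.size)
```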
2) Random Forests / Ranger (ensemble tree models)
What it is
- Ensemble of decision trees (bagging) that averages many bootstrap trees; captures nonlinearities and interactions automatically. Widely used in epigenetics and methylation biomarker discovery. BioMed Central+1
Why use it for your data
- Handles mixed predictor types, nonlinear relationships, and interactions without explicit feature engineering. Not strongly affected by multicollinearity in terms of predictive power, though variable importance measures can be biased if predictors are correlated.
Strengths
Good off-the-shelf performance, robust, few hyperparameters.
Variable importance (Gini, permutation) gives a sense of feature contributions.
R: randomForest, or the faster ranger; CLI: ranger can be run from command line or via R.
Limitations/assumptions
Interpretation is less direct than linear models.
Standard importance metrics can be biased with correlated features; permutation importance or conditional importance methods are preferred.
Can overfit if not tuned and if sample size is small relative to noise.
Notes on collinearity & interactions
- RF handles interactions, but if you need reliable ranking among highly correlated CpGs, pair RF with conditional permutation importance or follow up with Elastic Net on RF-selected features. BioMed Central
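A sketch of a random forest with permutation importance on held-out samples, assuming Python with scikit-learn and simulated data. scikit-learn's `permutation_importance` is the ordinary (marginal) version, not the conditional variant recommended above for correlated features, so treat it as a stand-in:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 6))  # 6 simulated features; 3-5 are pure noise
# Nonlinear signal: a threshold effect of x0 plus an x1*x2 interaction
y = (
    np.where(X[:, 0] > 0, 2.0, -2.0)
    + X[:, 1] * X[:, 2]
    + rng.normal(scale=0.3, size=n)
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in R^2
perm = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
print("features ranked by permutation importance:", ranking)
```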
3) Gradient boosting (XGBoost / LightGBM / CatBoost)
What it is
- Sequential ensemble of trees built to correct residuals (boosting). Often better accuracy than RF when tuned. Frequently used in multi-omics predictive models. ScienceDirect
Strengths
Strong predictive performance, handles nonlinearities and interactions.
Feature importance, SHAP values available for interpretability.
Scales well (LightGBM / XGBoost) and has CLI and R bindings.
Limitations
More hyperparameters to tune (learning rate, tree depth).
Like RF, correlated features complicate feature-importance interpretation. Use SHAP and careful conditional analyses.
When to pick
- If you need the best off-the-shelf predictive accuracy and can invest time in tuning. Good candidate if nonlinear/context dependencies are expected.
4) Bayesian Additive Regression Trees (BART)
What it is
- Bayesian ensemble of trees (sum of trees) that gives probabilistic predictions and naturally regularizes to avoid overfitting. Useful for flexible regression with uncertainty estimates. BART has seen use in regulatory genomics and TF prediction contexts (and more broadly in biology). PMC+1
Strengths
Captures nonlinearities and interactions without explicit specification.
Provides full posterior uncertainty (credible intervals) for predictions and effects.
Often performs competitively with boosting while offering principled uncertainty.
Limitations
More computationally intensive than RF/XGBoost.
Interpretation is harder; partial dependence and variable inclusion proportions can help.
R implementations
- R packages: BART, dbarts, bartMachine (note license/Java dependency). Good choice when uncertainty quantification matters.
5) Kernel methods / Gaussian Processes / Support Vector Regression (SVR)
What it is
- Kernel regression (e.g., SVR with radial basis kernel, Gaussian Processes) maps inputs to high-dimensional feature spaces to capture nonlinear relations.
Strengths
- Powerful for smooth nonlinear relationships and small to moderate datasets. Gaussian Processes give uncertainty estimates and work well when you need calibrated intervals.
Limitations
- Training cost grows quickly with the number of samples, and distance-based kernels become less informative in very large feature spaces (CpG arrays with hundreds of thousands of features) without dimensionality reduction. Kernels can struggle when p ≫ n unless you preselect features or use approximate kernels.
When to pick
- Useful if you have reduced feature sets (selected CpGs / aggregated regional methylation) and want flexible, probabilistic modeling.
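A sketch of Gaussian Process regression on a reduced feature set, assuming Python with scikit-learn. The two features are simulated stand-ins for something like a regional methylation summary and a scaled miRNA abundance. The point is that each prediction comes with a standard deviation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(50, 2))  # two hypothetical, pre-selected features
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# RBF with one length scale per feature, plus a noise term fitted from the data
kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X, y)

X_new = np.array([[0.2, 0.5], [0.8, 0.5]])
mean, std = gp.predict(X_new, return_std=True)  # prediction and its uncertainty
print(mean, std)
```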
6) Deep learning (fully connected networks, CNNs, ResNets for methylation → expression)
What it is
- Neural networks (MLPs), convolutional networks (if using sequence or spatially ordered CpGs), and specialized architectures (e.g., DeepMethyGene) trained end-to-end to learn nonlinear mapping from epigenetic features to expression. Recent work shows deep nets can predict expression from methylation patterns. BioMed Central
Strengths
Can learn complex, high-order interactions and hierarchical features; CNNs can exploit local genomic structure.
Flexible architectures let you combine inputs (miRNA, CpG, lncRNA) in multi-branch models.
Limitations
Data hungry — needs many samples (or strong regularization / transfer learning). Risk of overfitting with limited sample size.
Harder to interpret (but can use integrated gradients, DeepLIFT, SHAP for NN).
Requires more compute (GPUs helpful).
R / CLI implementations
- R: keras + tensorflow (R interface). CLI: Python tensorflow / pytorch scripts.
Advice
- If sample size is modest, consider pretraining, transfer learning or using deep models to learn features and then feed them into simpler models.
7) Multi-omics integrative and multivariate methods (DIABLO / mixOmics, MOFA, multi-view methods)
What it is
- Methods tuned for multi-omics integration: DIABLO (supervised sparse PLS for multi-omics discriminant analysis / prediction and feature selection); MOFA (unsupervised factor analysis for multi-omics latent factors). These identify components/latent factors that combine signals across omics and can be used for downstream regression/prediction. DIABLO has been used for integrating mRNA, miRNA, methylation, proteomics in predictive frameworks. PMC+1
Strengths
Explicitly built to integrate heterogeneous omics and find shared signals and cross-omics associations.
DIABLO provides sparse multi-omics signatures and is implemented in mixOmics (R). MOFA gives latent factors useful as inputs to regression or for interpretation.
Limitations
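DIABLO is supervised and built around discriminant analysis (categorical outcomes); for a continuous response such as expression, the related sparse multiblock PLS regression in mixOmics (block.spls) is the closer fit. Both need careful tuning and assume linear latent relationships (though components can capture some nonlinearity indirectly).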
MOFA is unsupervised — useful for exploratory factor discovery but needs a downstream supervised model for prediction.
When to pick
- When you want to combine all your omics coherently and identify cross-modal predictors (e.g., miRNA + CpG factors that jointly explain expression variance). Great for feature reduction before complex models. mixOmics
8) Hybrid / stacked approaches and feature-selection pipelines
Practical strategy (recommended)
Step 1: Preprocess & reduce dimensionality:
For methylation: consider aggregating CpGs to biologically plausible regions (promoters, gene bodies, enhancers) or using principal components / variance filtering / domain knowledge to reduce p.
For miRNA/lncRNA: variance/abundance filtering and perhaps correlation clustering.
Step 2: Baseline models & feature selection:
Fit Elastic Net for a sparse, interpretable baseline.
Fit Random Forest / XGBoost for nonlinear baseline and feature importance.
Step 3: Integrative modeling:
- Use DIABLO or MOFA to create joint components fed to regression/ensemble models.
Step 4: Final modeling:
- If sample size is large and you need best predictive accuracy, try XGBoost or a small NN; otherwise BART or RF plus Elastic Net on top of selected features.
Step 5: Interpretation:
- Use SHAP values (for tree/boosting) or permutation importance, partial dependence plots, interaction detection (SHAP interaction values), and compare with Elastic Net coefficients.
Why this pipeline helps
- Combines the interpretability of penalized regression with the flexibility of tree/boosting methods, addresses multicollinearity (regularization + aggregation), and gives ways to probe interactions.
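One way the "RF plus Elastic Net on top of selected features" step could look, assuming Python with scikit-learn and simulated data. The feature selection sits inside the pipeline so it is redone within every cross-validation fold; selecting on the full data first would leak information into the score. The cap of 20 features is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n, p = 80, 200  # simulated: few samples, many features
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n)

pipe = make_pipeline(
    # Keep the 20 features with the highest random forest importance
    SelectFromModel(
        RandomForestRegressor(n_estimators=200, random_state=0),
        max_features=20,
        threshold=-np.inf,
    ),
    StandardScaler(),
    # Sparse, interpretable model fitted on the surviving features only
    ElasticNetCV(l1_ratio=0.5, cv=5),
)

scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print("cross-validated R^2 per fold:", scores.round(2))
```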
Practical notes on multicollinearity, interactions, and interpretation
Multicollinearity: use Elastic Net, or aggregate CpGs into regional summaries, or use dimensionality reduction (PCA, MOFA). For variable ranking from tree models with correlated features, prefer conditional permutation importance or SHAP values and follow up with regression on selected features. BioMed Central
Nonlinear/context-dependent interactions: tree ensembles, BART, and neural nets naturally capture them. Use interaction detection tools (H statistic, SHAP interaction values) to highlight important interactions.
Interpretability: combine sparse linear models (effect sizes) with model-agnostic explainers (SHAP, partial dependence).
Uncertainty: BART and Gaussian Processes provide principled uncertainty; bootstrap or Bayesian methods can give intervals for other models.
Implementation pointers & R / CLI packages (quick list)
Elastic Net: glmnet (R). OUP Academic
Random Forest: randomForest, ranger (R / CLI). BioMed Central
Gradient boosting: xgboost, lightgbm, catboost (R + CLI). ScienceDirect
BART: BART, dbarts, bartMachine (R). PMC
Deep learning: keras / tensorflow (R interface) or Python tensorflow / pytorch. Example architecture papers: DeepMethyGene for methylation→expression. BioMed Central
Multi-omics integrative: mixOmics (DIABLO), MOFA2 (R / python). PMC+1
Interpretation: SHAPforxgboost / shapper / iml (R), shap (Python).
Pipelines / tuning: caret, tidymodels, mlr3 (R).
Key references (start here)
mixOmics / DIABLO (multi-omics supervised integration & prediction). PMC+1
Elastic Net in genomics/meta-analysis (robust penalized regression). OUP Academic+1
Random Forest applications to methylation / biomarker modeling. BioMed Central+1
DeepMethyGene — example of deep CNN to predict expression from methylation (recent). BioMed Central
BART for genomic inference & TF prediction (Bayesian tree ensemble; uncertainty). PMC+1