X_all <- as.matrix(predictor_features)
folds <- unique(colony)
genes <- colnames(response_features)
oof_one_gene <- function(g) {
y <- response_features[[g]]
pred <- rep(NA_real_, length(y))
for (cf in folds) {
te <- which(colony == cf); tr <- which(colony != cf)
if (length(unique(y[tr])) < 3) next
m <- tryCatch(cv.glmnet(X_all[tr, , drop = FALSE], y[tr],
alpha = alpha, nfolds = nfolds_inner),
error = function(e) NULL)
if (is.null(m)) next
pred[te] <- as.numeric(predict(m, X_all[te, , drop = FALSE], s = "lambda.min"))
}
ok <- !is.na(pred)
r2 <- 1 - sum((y[ok] - pred[ok])^2) / sum((y[ok] - mean(y[ok]))^2)
rc <- suppressWarnings(cor(y[ok], pred[ok]))The current Elastic Net pipeline uses bootstrapping with an 80/20 train/test data split (Monte Carlo CV) to estimate general model performance (how well is a single gene’s expression predicted by the model). The 80/20 split allowed a comparison of model performance by applying a trained model to unseen data, and the bootstrapping accounted for the random split.
However, it’s come to my attention that, due to the structure of our data, using a random 80/20 test split is actually permitting leakage, which would be leading to overfitting/inflated estimates of model performance. Our data was collected using a repeated-measures design, in which each colony was sampled 4 times over the course of a year. When randomly splitting this dataset 80/20 its most likely that a given colony’s timepoints will also be split up (i.e., that 3/4 of a colony’s samples will be placed in the training set, and 1 in the test set). This means that, when the model is predicting this held-out timepoint as part of the test set, it’s actually already seen the same colony’s other timepoints during training. Both the methylation (predictor) and the gene expression (response) features will carry strong colony-level signature, because they are intrinsically tied to genotype. This means there’s leakage between training and testing at the colony level, which inflates R^2.
To fix this, I’ll be replacing the 80/20 train/test split + bootstrapping with something called leave-one-out cross validation, with a colony-level split. If I have C colonies (so \(N = 4*C\) total samples), I will perform C fittings. In each, the model will be trained on the data from C-1 colonies, then tested on the single remaining colony’s samples. This way, there is no colony-level leakage between the training and testing datasets. Performance is estimated the same way – an R^2 value calculated by comparing model performance using test data (or “out-of-fold”, since it is produced using the colonies not included in the training fold) to the actual response.
The basic code change:
For each gene, iterate through every unique colony. For each colony, hold out all of that colony’s samples (all 4 timepoints) as the test set, then fit cv.glmnet on the remaining colonies, and predict the held-out colony. The pooled out-of-fold predictions across all colonies produce two R² metrics:
OOF_R2= 1 − SS_res/SS_tot — the standard coefficient of determination. This can go negative, which would the model predicts worse than always guessing the mean.OOF_R2_cor= cor(obs, pred)² — squared Pearson correlation. This is mostly for comparability with 26.6’s bootstrap R² (which was also cor²)
I also ended up dropping the model feature of doing an initial run, and then re-running on only well-predicted genes (r^2 > threshold). I originally incorporated this because of the high computational intensity required for bootstrapping. If my input gene set was large, it would be extremely intensive/timeconsuming to run the script with an appropriately high bootstrapping reps. Introducing a pre-filter to reduce the number of genes carried forward helped handle that. However, with the leave-one-out approach bootstrapping isn’t as necessary.
Finally, I also added an option to turn off plotting, and an option to filter to the top n most variable features for both lncRNA and methylation features. This is both statistically defensible, as lowly-variable features are less informative and can produce inflated correlations, and computationally valuable, since it reduces the computational intensity of the run (reduces down from the full ~30,000 feature predictor set). Since there are so few miRNA (~50/species), those are left un-filtered. If included, filtering will happen after variance stabilization, but before scaling (puts all variance at 1).
Basic code:
if (var_filter_n > 0) {
top_var_rows <- function(mat, n) {
if (nrow(mat) <= n) return(mat)
v <- apply(mat, 1, var)
mat[order(v, decreasing = TRUE)[seq_len(n)], , drop = FALSE]
}
vsd_lncRNA <- top_var_rows(vsd_lncRNA, var_filter_n)
vsd_WGBS <- top_var_rows(vsd_WGBS, var_filter_n)
}Used Claude Opus 4.8 to help integrate added code into the 26.6 script structure
Following the modifications, ran two side-by-side comparisons of the old 26.6 script (EN using Monte Carlo CV + stability selection) against the updated 26.7 (EN using leave-one-out CV + stability selection)
Rscript 26.6-ElasticNet-stability-selection.R \
"../output/26.5-ElasticNet-permutation-sig/SUBSET50_apul_biomin_counts.csv" \
"../output/10.1-format-miRNA-counts-updated/Apul_miRNA_counts_formatted_updated.txt" \
"../../D-Apul/output/31.5-Apul-lncRNA-discovery/lncRNA_counts.clean.filtered.txt" \
"../../D-Apul/output/40-Apul-Gene-Methylation/Apul-gene-methylation.tsv" \
"../../M-multi-species/data/rna_metadata.csv" \
"../output/26.6-ElasticNet-stability-selection-TEST/Apul" \
"ACR-225-TP1" \
0.2 \ # alpha
10 \ # bootstrap1_reps
10 \ # bootstrap2_reps
0.5 \ # r2_threshold
10 \ # n_subsamples (for stability selection)
4 \ # n_cores
0.6 # pi_thr (for stability selection)Rscript 26.7-ElasticNet-stabsel-varfilter-OOF.R \
"../output/26.5-ElasticNet-permutation-sig/SUBSET50_apul_biomin_counts.csv" \
"../output/10.1-format-miRNA-counts-updated/Apul_miRNA_counts_formatted_updated.txt" \
"../../D-Apul/output/31.5-Apul-lncRNA-discovery/lncRNA_counts.clean.filtered.txt" \
"../../D-Apul/output/40-Apul-Gene-Methylation/Apul-gene-methylation.tsv" \
"../../M-multi-species/data/rna_metadata.csv" \
"../output/26.7-ElasticNet-stabsel-varfilter-OOF-TEST/Apul" \
"ACR-225-TP1" \
0.2 \ # alpha
10 \ # n_subsamples
4 \ # n_cores
0.6 \ # pi_thr
0 \ # var_filter (0=off)
0 # plot_enabled (0=off)Note that the input arguments have changed
26.6 retained and produced outputs for 18 genes post-filter (R^2 > 0.5). That step is removed in 26.7, so all 48 genes were evaluated – only 14/48 had OOF R^2 values above 0.5, which makes sense given the expectation that Monte Carlo CV may inflate performance.
Per-Gene Comparison (for the 18-gene overlap)
| Gene | 26.6 N_stable | 26.7 N_stable | OOF R² |
|---|---|---|---|
| OG_01317 | 67 | 59 | 0.797 |
| OG_01117 | 64 | 57 | 0.715 |
| OG_00171 | 61 | 60 | 0.733 |
| OG_13792 | 59 | 45 | 0.690 |
| OG_10355 | 57 | 42 | 0.670 |
| OG_01023 | 52 | 52 | 0.690 |
| OG_10523 | 50 | 47 | 0.753 |
| OG_10516 | 45 | 41 | 0.715 |
| OG_00843 | 44 | 36 | 0.391 |
| OG_01277 | 44 | 45 | 0.700 |
| OG_01147 | 41 | 46 | 0.804 |
| OG_00297 | 38 | 44 | 0.561 |
| OG_01155 | 38 | 46 | 0.530 |
| OG_10509 | 38 | 50 | 0.103 |
| OG_10681 | 38 | 32 | 0.309 |
| OG_00460 | 36 | 32 | 0.584 |
| OG_00430 | 29 | 36 | 0.529 |
| OG_10450 | 25 | 35 | 0.268 |
Very similar counts of stable predictors across these genes – good sign of general agreement between the versions! Note, however, that several genes well-predicted by 26.6 (R^2 > 0.5) are actually much worse-performing under 26.7 (low OOF R^2, bolded), suggesting colony-level leakage was previously inflating their prediction.
Stable Predictor Type Breakdown
| Predictor Type | 26.6 N_stable | 26.6 Unique | 26.6 Genes | 26.7 N_stable | 26.7 Unique | 26.7 Genes |
|---|---|---|---|---|---|---|
| Gene Body Meth | 64 | 61 | 16 | 154 | 152 | 39 |
| lncRNA | 758 | 683 | 18 | 1,205 | 1,099 | 44 |
| miRNA | 4 | 3 | 3 | 5 | 4 | 4 |
26.7 finds more stable associations across all types, because it tests more genes, but the proportions are similar: lncRNA dominates (~88-90% of stable associations), Gene Body Meth ~8-11%, miRNA ~0.4-0.5%.
Of the 826 predictors that are stably selected in 26.6, 611 (74%) are also stably selected in 26.7. This is also a great sign of concordance! Its also probably deflated due to the small number of subsamples during stability selection (n_subsamples = 10), because predictors near the 0.6 threshold could flip between stable and unstable based on a single run. A normal run with a higher number of subsamples may lead to even greater concordance.
Finally, compare runtime
| 26.6 | 26.7 | |
|---|---|---|
| Phase 1 (bootstrap / OOF) | 2.96 min | 2.15 min |
| Phase 2 (stability selection) | 0.52 min (18 genes) | 1.35 min (48 genes) |
| Total | 3.48 min | 3.50 min |
All-in-all I’m very happy with this! Generally high concordance, with the updated version being less prone t oover-estimation and capable of modeling a higher number of genes, all without a time tradeoff.