Timeseries: EN switching in OOF – Kathleen’s Lab Notebook

The current Elastic Net pipeline uses bootstrapping with an 80/20 train/test data split (Monte Carlo CV) to estimate general model performance (how well is a single gene’s expression predicted by the model). The 80/20 split allowed a comparison of model performance by applying a trained model to unseen data, and the bootstrapping accounted for the random split.

However, it’s come to my attention that, due to the structure of our data, using a random 80/20 test split is actually permitting leakage, which would be leading to overfitting/inflated estimates of model performance. Our data was collected using a repeated-measures design, in which each colony was sampled 4 times over the course of a year. When randomly splitting this dataset 80/20 its most likely that a given colony’s timepoints will also be split up (i.e., that 3/4 of a colony’s samples will be placed in the training set, and 1 in the test set). This means that, when the model is predicting this held-out timepoint as part of the test set, it’s actually already seen the same colony’s other timepoints during training. Both the methylation (predictor) and the gene expression (response) features will carry strong colony-level signature, because they are intrinsically tied to genotype. This means there’s leakage between training and testing at the colony level, which inflates R^2.

To fix this, I’ll be replacing the 80/20 train/test split + bootstrapping with something called leave-one-out cross validation, with a colony-level split. If I have C colonies (so \(N = 4*C\) total samples), I will perform C fittings. In each, the model will be trained on the data from C-1 colonies, then tested on the single remaining colony’s samples. This way, there is no colony-level leakage between the training and testing datasets. Performance is estimated the same way – an R^2 value calculated by comparing model performance using test data (or “out-of-fold”, since it is produced using the colonies not included in the training fold) to the actual response.

The basic code change:

X_all <- as.matrix(predictor_features)
  folds <- unique(colony)
  genes <- colnames(response_features)

  oof_one_gene <- function(g) {
    y <- response_features[[g]]
    pred <- rep(NA_real_, length(y))
    for (cf in folds) {
      te <- which(colony == cf); tr <- which(colony != cf)
      if (length(unique(y[tr])) < 3) next
      m <- tryCatch(cv.glmnet(X_all[tr, , drop = FALSE], y[tr],
                              alpha = alpha, nfolds = nfolds_inner),
                    error = function(e) NULL)
      if (is.null(m)) next
      pred[te] <- as.numeric(predict(m, X_all[te, , drop = FALSE], s = "lambda.min"))
    }
    ok <- !is.na(pred)
   
    r2  <- 1 - sum((y[ok] - pred[ok])^2) / sum((y[ok] - mean(y[ok]))^2)
    rc  <- suppressWarnings(cor(y[ok], pred[ok]))

For each gene, iterate through every unique colony. For each colony, hold out all of that colony’s samples (all 4 timepoints) as the test set, then fit cv.glmnet on the remaining colonies, and predict the held-out colony. The pooled out-of-fold predictions across all colonies produce two R² metrics:

OOF_R2 = 1 − SS_res/SS_tot — the standard coefficient of determination. This can go negative, which would the model predicts worse than always guessing the mean.
OOF_R2_cor = cor(obs, pred)² — squared Pearson correlation. This is mostly for comparability with 26.6’s bootstrap R² (which was also cor²)

I also ended up dropping the model feature of doing an initial run, and then re-running on only well-predicted genes (r^2 > threshold). I originally incorporated this because of the high computational intensity required for bootstrapping. If my input gene set was large, it would be extremely intensive/timeconsuming to run the script with an appropriately high bootstrapping reps. Introducing a pre-filter to reduce the number of genes carried forward helped handle that. However, with the leave-one-out approach bootstrapping isn’t as necessary.

Finally, I also added an option to turn off plotting, and an option to filter to the top n most variable features for both lncRNA and methylation features. This is both statistically defensible, as lowly-variable features are less informative and can produce inflated correlations, and computationally valuable, since it reduces the computational intensity of the run (reduces down from the full ~30,000 feature predictor set). Since there are so few miRNA (~50/species), those are left un-filtered. If included, filtering will happen after variance stabilization, but before scaling (puts all variance at 1).

Basic code:

if (var_filter_n > 0) {
  top_var_rows <- function(mat, n) {
    if (nrow(mat) <= n) return(mat)
    v <- apply(mat, 1, var)
    mat[order(v, decreasing = TRUE)[seq_len(n)], , drop = FALSE]
  }
  vsd_lncRNA <- top_var_rows(vsd_lncRNA, var_filter_n)
  vsd_WGBS   <- top_var_rows(vsd_WGBS,   var_filter_n)
}

Used Claude Opus 4.8 to help integrate added code into the 26.6 script structure

Following the modifications, ran two side-by-side comparisons of the old 26.6 script (EN using Monte Carlo CV + stability selection) against the updated 26.7 (EN using leave-one-out CV + stability selection)

Rscript 26.6-ElasticNet-stability-selection.R \
"../output/26.5-ElasticNet-permutation-sig/SUBSET50_apul_biomin_counts.csv" \
"../output/10.1-format-miRNA-counts-updated/Apul_miRNA_counts_formatted_updated.txt" \
"../../D-Apul/output/31.5-Apul-lncRNA-discovery/lncRNA_counts.clean.filtered.txt" \
"../../D-Apul/output/40-Apul-Gene-Methylation/Apul-gene-methylation.tsv" \
"../../M-multi-species/data/rna_metadata.csv" \
"../output/26.6-ElasticNet-stability-selection-TEST/Apul" \
"ACR-225-TP1" \
0.2 \     # alpha
10 \      # bootstrap1_reps
10 \      # bootstrap2_reps
0.5 \     # r2_threshold
10 \      # n_subsamples (for stability selection)
4 \       # n_cores
0.6       # pi_thr (for stability selection)

Rscript 26.7-ElasticNet-stabsel-varfilter-OOF.R \
"../output/26.5-ElasticNet-permutation-sig/SUBSET50_apul_biomin_counts.csv" \
"../output/10.1-format-miRNA-counts-updated/Apul_miRNA_counts_formatted_updated.txt" \
"../../D-Apul/output/31.5-Apul-lncRNA-discovery/lncRNA_counts.clean.filtered.txt" \
"../../D-Apul/output/40-Apul-Gene-Methylation/Apul-gene-methylation.tsv" \
"../../M-multi-species/data/rna_metadata.csv" \
"../output/26.7-ElasticNet-stabsel-varfilter-OOF-TEST/Apul" \
"ACR-225-TP1" \
0.2 \     # alpha
10 \      # n_subsamples
4 \       # n_cores
0.6 \     # pi_thr
0 \       # var_filter (0=off)
0         # plot_enabled (0=off)

Note that the input arguments have changed

26.6 retained and produced outputs for 18 genes post-filter (R^2 > 0.5). That step is removed in 26.7, so all 48 genes were evaluated – only 14/48 had OOF R^2 values above 0.5, which makes sense given the expectation that Monte Carlo CV may inflate performance.

Per-Gene Comparison (for the 18-gene overlap)

Gene	26.6 N_stable	26.7 N_stable	OOF R²
OG_01317	67	59	0.797
OG_01117	64	57	0.715
OG_00171	61	60	0.733
OG_13792	59	45	0.690
OG_10355	57	42	0.670
OG_01023	52	52	0.690
OG_10523	50	47	0.753
OG_10516	45	41	0.715
OG_00843	44	36	0.391
OG_01277	44	45	0.700
OG_01147	41	46	0.804
OG_00297	38	44	0.561
OG_01155	38	46	0.530
OG_10509	38	50	0.103
OG_10681	38	32	0.309
OG_00460	36	32	0.584
OG_00430	29	36	0.529
OG_10450	25	35	0.268

Very similar counts of stable predictors across these genes – good sign of general agreement between the versions! Note, however, that several genes well-predicted by 26.6 (R^2 > 0.5) are actually much worse-performing under 26.7 (low OOF R^2, bolded), suggesting colony-level leakage was previously inflating their prediction.

Stable Predictor Type Breakdown

Predictor Type	26.6 N_stable	26.6 Unique	26.6 Genes	26.7 N_stable	26.7 Unique	26.7 Genes
Gene Body Meth	64	61	16	154	152	39
lncRNA	758	683	18	1,205	1,099	44
miRNA	4	3	3	5	4	4

26.7 finds more stable associations across all types, because it tests more genes, but the proportions are similar: lncRNA dominates (~88-90% of stable associations), Gene Body Meth ~8-11%, miRNA ~0.4-0.5%.

Of the 826 predictors that are stably selected in 26.6, 611 (74%) are also stably selected in 26.7. This is also a great sign of concordance! Its also probably deflated due to the small number of subsamples during stability selection (n_subsamples = 10), because predictors near the 0.6 threshold could flip between stable and unstable based on a single run. A normal run with a higher number of subsamples may lead to even greater concordance.

Finally, compare runtime

	26.6	26.7
Phase 1 (bootstrap / OOF)	2.96 min	2.15 min
Phase 2 (stability selection)	0.52 min (18 genes)	1.35 min (48 genes)
Total	3.48 min	3.50 min

All-in-all I’m very happy with this! Generally high concordance, with the updated version being less prone t oover-estimation and capable of modeling a higher number of genes, all without a time tradeoff.