Steven has been trying out Barnacle tensor decomposition to generate meaningfully grouped orthologous genes across all three species. He’s asked me to try running the Elastic Net pipeline on the top 100 gene members of one of the output components from his most recent run.
I’ve also been meaning to make some data handling tweaks to the EN pipeline, so I’ll show multiple iterations here. Since one of the data tweaks (I think the scaling portion) is causing a weird error I haven’t been able to fix for Peve and Ptuh, I’ll only show the Apul results for now.
Existing EN pipeline
Briefly, the pipeline applies Elastic Net regression to predict gene expression (in this case, the expression of the top 100 genes from Barnacle rank35 decomposition, component 24) using multi-omic epigenetic predictors (miRNA, lncRNA, CpG methylation). The workflow cleans and variance-stabilizes the data, merges the predictors into a single dataset, then applies EN with bootstrapping.
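As a rough illustration of the Elastic Net plus bootstrapping step, here is a minimal Python sketch on simulated data. It is not the pipeline's actual code: the coordinate-descent solver, the function names (`elastic_net_cd`, `bootstrap_selection`), the penalty settings, and the toy data are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t (the L1 part of the penalty)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, alpha=0.1, l1_ratio=0.5, n_iter=100):
    """Elastic Net by cyclic coordinate descent (hypothetical helper).

    Minimizes (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
              + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    Assumes X and y are already centered (no intercept is fit).
    """
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # residual for w = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue  # constant column carries no information
            resid += X[:, j] * w[j]  # remove feature j's current contribution
            rho = X[:, j] @ resid / n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1 - l1_ratio)
            )
            resid -= X[:, j] * w[j]  # add the updated contribution back
    return w

def bootstrap_selection(X, y, n_boot=50, seed=0, **kwargs):
    """Fraction of bootstrap resamples in which each predictor is nonzero."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample samples with replacement
        Xb = X[idx] - X[idx].mean(axis=0)
        yb = y[idx] - y[idx].mean()
        counts += elastic_net_cd(Xb, yb, **kwargs) != 0
    return counts / n_boot

# Toy data: 40 samples, 10 predictors; only predictors 0 and 3 drive expression.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=40)

freq = bootstrap_selection(X, y, n_boot=50, alpha=0.1, l1_ratio=0.5)
print(np.round(freq, 2))
```

Predictors that are retained across most resamples are the ones worth taking seriously; a predictor selected in only a handful of resamples is more likely noise.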
Introducing M-values and scaling
This option has one major change from the previous iteration (26.1): scale standardization. Regression techniques require inputs to be on similar scales, and ours are not, for two reasons. First, RNA counts are unbounded (a single RNA could have thousands of transcripts), while CpG site methylation is bounded on a 0-100% scale. To address this, I'm transforming methylation from beta values (0%-100%) to M-values (the log2 of the methylated/unmethylated ratio, i.e. log2(beta / (1 - beta)) with beta expressed as a proportion). M-values are unbounded (like expression counts) and are also more appropriate for regression analyses, because they largely remove heteroscedasticity without needing to run methylation values through a DESeq vst transformation, which may not be appropriate for data other than gene counts. Second, to get all predictor blocks (miRNA, lncRNA, and WGBS M-values) on the same scale, I'll be scaling each block before combining them into the full predictor set.
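A minimal Python sketch of the two steps described above, using made-up numbers. The function names (`beta_to_m`, `zscore_columns`), the small offset used to keep 0% and 100% sites finite, and the choice of per-feature z-scoring as the scaling method are assumptions for illustration, not necessarily what the pipeline does.

```python
import numpy as np

def beta_to_m(beta_percent, offset=1e-3):
    """Convert methylation percentages (0-100) to M-values.

    M = log2(beta / (1 - beta)), with beta as a proportion. The offset
    (an assumed value) clips beta away from exactly 0 and 1, where the
    log ratio would be infinite.
    """
    beta = np.asarray(beta_percent, dtype=float) / 100.0
    beta = np.clip(beta, offset, 1.0 - offset)
    return np.log2(beta / (1.0 - beta))

def zscore_columns(block):
    """Center each feature (column) to mean 0 and scale to unit SD."""
    block = np.asarray(block, dtype=float)
    mu = block.mean(axis=0)
    sd = block.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # guard: a constant feature would otherwise divide by zero
    return (block - mu) / sd

# Toy blocks: 6 samples, 3 miRNA features and 2 CpG sites (simulated values).
rng = np.random.default_rng(0)
mirna = rng.poisson(100, size=(6, 3)).astype(float)
meth_percent = rng.uniform(0, 100, size=(6, 2))

m_values = beta_to_m(meth_percent)

# Scale each block separately, then merge into one predictor matrix.
combined = np.hstack([zscore_columns(mirna), zscore_columns(m_values)])
print(combined.shape)
```

A site at 50% methylation maps to M = 0, and the transform is symmetric around that point, so 75% and 25% map to M-values of equal magnitude and opposite sign.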
Introducing dimensionality reduction
This has one major change from the previous iteration (26.2): dimensionality reduction to address feature imbalance. The predictor blocks differ dramatically in size (e.g. in A. pulchra, ~50 miRNA, compared to ~15,000 lncRNA and CpG sites). To address this, I'm adding a dimensionality reduction step via PCA for the lncRNA and CpG sites. It's important to note, though, that this is not equivalent to grouping similar features discretely into bins. Instead, each PC is a weighted (linear) combination of all the features. That means that instead of seeing, for example, which lncRNAs comprise PC1, we have to look at which lncRNAs contribute most to PC1 (since all lncRNAs are technically part of it).
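To make the point about PCs concrete, here is a small Python sketch that runs PCA via SVD on a simulated lncRNA block and then ranks features by their contribution to PC1. The data, the dimensions, the number of PCs kept, and the feature names (`lncRNA_0001`, ...) are all hypothetical.

```python
import numpy as np

# Simulated block: 12 samples x 200 lncRNAs (values are random, for illustration).
rng = np.random.default_rng(1)
n_samples, n_lnc = 12, 200
X = rng.normal(size=(n_samples, n_lnc))
feature_names = [f"lncRNA_{i:04d}" for i in range(n_lnc)]  # hypothetical IDs

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_pcs = 5                            # assumed number of PCs to keep
scores = U[:, :n_pcs] * S[:n_pcs]    # samples x PCs: the reduced predictors
loadings = Vt[:n_pcs]                # PCs x features: weight of each lncRNA
explained = S ** 2 / (S ** 2).sum()  # proportion of variance per PC

# Every lncRNA has a weight on PC1, so rank by absolute loading
# rather than asking which lncRNAs are "in" the component.
top_idx = np.argsort(np.abs(loadings[0]))[::-1][:10]
for i in top_idx:
    print(feature_names[i], round(float(loadings[0, i]), 3))
```

The `scores` matrix is what would replace the raw lncRNA (or CpG) block as input to the regression, while `loadings` is what has to be inspected afterwards to trace a selected PC back to individual features.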
Summary: I don’t think this is an appropriate change. All gene predictor sets end up dominated by miRNA, presumably because the PCs are too “noisy” to be useful predictors of expression.