library(tidyverse)
library(ggplot2)
library(DESeq2)
library(igraph)
library(psych)
library(tidygraph)
library(ggraph)
library(WGCNA)
library(edgeR)
library(reshape2)
library(ggcorrplot)
library(corrplot)
library(rvest)
library(purrr)
library(pheatmap)
library(glmnet)
library(caret)
library(factoextra)
library(vegan)
library(ggfortify)
I’d like to see whether phenotype can predict gene and/or miRNA expression, will be testing this using the ML approach Ariana has been trialing (see her mRNA-WGBS ML post here: https://ahuffmyer.github.io/ASH_Putnam_Lab_Notebook/E5-timeseries-molecular-mRNA-WGBS-machine-learning-analysis-Part-1/)
Results
I filtered, variance stabilized, and combined the raw gene counts and miRNA counts. Then, due to the high number, I reduced this combined expression data to PCs of similar expression. For phenotype, I used host biomass, host protein, and respiration, all of which are indicators of host energy storage and expenditure. I ran the ML model using phenotypes as the predictor and expression PCs as the response.
Protein was most predictive of expression, with respiration also highly predictive. Interestingly, biomass did not contribute much!
None of the gene PCs were well-predicted by phenotype, with the highest R2 value sitting at 0.177 (~18% of variance in the PCs explained by phenotype).
Also, most of the expression PCs (21/31) had NA R2 values, which I believe is related to low variance in the phenotype metrics, particularly the protein phenotype
> apply(merged_pcs, 2, var)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17
5801.1357 2808.7631 2370.6702 1990.1140 1760.0861 1627.7522 1317.6404 1274.2452 1078.4045 1034.5725 976.4311 937.5277 891.5255 796.3490 779.2450 699.0609 676.0976
PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31
650.9711 620.9733 606.8573 591.2129 575.7182 550.6815 528.1735 519.2895 506.3591 474.2105 467.2894 454.2832 436.9429 423.9563
> apply(phys_selection, 2, var)
Host_AFDW.mg.cm2 prot_mg.mgafdw Rd
0.33627626 0.01063323 0.03545622
We can look at model performance for specific expression PCs. For example, here’s PC9, which is the best-predicted PC (highest R2 value)
We can also look at the top miRNA/genes associated with it. This was interesting because there is an miRNA that pops out (Cluster_4034)!
## # A tibble: 20 × 3
## # Groups: Merged_PC [1]
## gene_miRNA Merged_PC Loading
## <chr> <chr> <dbl>
## 1 FUN_023262 PC9 -0.0227
## 2 FUN_016918 PC9 0.0198
## 3 FUN_007014 PC9 -0.0194
## 4 FUN_031632 PC9 0.0192
## 5 FUN_035523 PC9 0.0188
## 6 FUN_041996 PC9 0.0188
## 7 FUN_006731 PC9 -0.0186
## 8 FUN_022897 PC9 -0.0185
## 9 FUN_043835 PC9 0.0184
## 10 FUN_004465 PC9 0.0183
## 11 FUN_014188 PC9 -0.0183
## 12 FUN_034114 PC9 -0.0181
## 13 FUN_024741 PC9 -0.0181
## 14 FUN_006446 PC9 -0.0179
## 15 FUN_018629 PC9 -0.0178
## 16 FUN_013857 PC9 -0.0178
## 17 Cluster_4034 PC9 0.0176
## 18 FUN_039308 PC9 -0.0175
## 19 FUN_012117 PC9 0.0175
## 20 FUN_000459 PC9 -0.0174
Code copied below in case of file path changes:
Inputs:
RNA counts matrix (raw):
../output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv
sRNA/miRNA counts matrix (raw):
../output/03.10-D-Apul-sRNAseq-expression-DESeq2/Apul_miRNA_ShortStack_counts_formatted.txt
sample metadata:
../../M-multi-species/data/rna_metadata.csv
physiological data: https://github.com/urol-e5/timeseries/raw/refs/heads/master/time_series_analysis/Output/master_timeseries.csv
Note that I’ll start by using phenotype (e.g. biomass, respiration) as the predictor, which is suitable for understanding how external factors drive gene expression changes.
If, instead, we wanted to build some sort of predictive model, where gene expression could be used to predict phenotype, we could switch so that gene counts are used as the predictors.
Load libraries
Load and prep data
Load in count matrices for RNAseq.
# raw gene counts data (will filter and variance stabilize)
<- read_csv("../output/02.20-D-Apul-RNAseq-alignment-HiSat2/apul-gene_count_matrix.csv")
Apul_genes <- as.data.frame(Apul_genes)
Apul_genes
# format gene IDs as rownames (instead of a column)
rownames(Apul_genes) <- Apul_genes$gene_id
<- Apul_genes%>%select(!gene_id)
Apul_genes
# load and format metadata
<- read_csv("../../M-multi-species/data/rna_metadata.csv")%>%select(AzentaSampleName, ColonyID, Timepoint) %>%
metadata filter(grepl("ACR", ColonyID))
$Sample <- paste(metadata$AzentaSampleName, metadata$ColonyID, metadata$Timepoint, sep = "_")
metadata
<- unique(metadata$ColonyID)
colonies
# Load physiological data
<-read_csv("https://github.com/urol-e5/timeseries/raw/refs/heads/master/time_series_analysis/Output/master_timeseries.csv")%>%filter(colony_id_corr %in% colonies)%>%
physselect(colony_id_corr, species, timepoint, site, Host_AFDW.mg.cm2, Sym_AFDW.mg.cm2, Am, AQY, Rd, Ik, Ic, calc.umol.cm2.hr, cells.mgAFDW, prot_mg.mgafdw, Ratio_AFDW.mg.cm2, Total_Chl, Total_Chl_cell, cre.umol.mgafdw)
# format timepoint
$timepoint <- gsub("timepoint", "TP", phys$timepoint)
phys#add column with full sample info
<- merge(phys, metadata, by.x = c("colony_id_corr", "timepoint"), by.y = c("ColonyID", "Timepoint")) %>%
phys select(-AzentaSampleName)
#add site information into metadata
$Site<-phys$site[match(metadata$ColonyID, phys$colony_id_corr)]
metadata
# Rename gene column names to include full sample info (as in miRNA table)
colnames(Apul_genes) <- metadata$Sample[match(colnames(Apul_genes), metadata$AzentaSampleName)]
# raw miRNA counts (will filter and variance stabilize)
<- read.table(file = "../output/03.10-D-Apul-sRNAseq-expression-DESeq2/Apul_miRNA_ShortStack_counts_formatted.txt", header = TRUE, sep = "\t", check.names = FALSE) Apul_miRNA
Counts filtering
Ensure there are no genes or miRNAs with 0 counts across all samples.
nrow(Apul_genes)
<-Apul_genes %>%
Apul_genes_filtmutate(Total = rowSums(.[, 1:40]))%>%
filter(!Total==0)%>%
::select(!Total)
dplyr
nrow(Apul_genes_filt)
# miRNAs
nrow(Apul_miRNA)
<-Apul_miRNA %>%
Apul_miRNA_filtmutate(Total = rowSums(.[, 1:40]))%>%
filter(!Total==0)%>%
::select(!Total)
dplyr
nrow(Apul_miRNA_filt)
Removing genes with only 0 counts reduced number from 44371 to 35869. Retained all 51 miRNAs.
Will not be performing pOverA filtering for now, since LM should presumabily incorporate sample representation
Physiology filtering
Run PCA on physiology data to see if there are phys outliers
Export data for PERMANOVA test.
<-as.data.frame(phys)
test<-test[complete.cases(test), ] test
Build PERMANOVA model.
<-prcomp(test%>%select(where(is.numeric)), scale=TRUE, center=TRUE)
scaled_test fviz_eig(scaled_test)
# scale data
<- scale(test%>%select(where(is.numeric)))
vegan
# PerMANOVA
<-adonis2(vegan ~ timepoint*site, data = test, method='eu')
permanova permanova
<-ggplot2::autoplot(scaled_test, data=test, frame.colour="timepoint", loadings=FALSE, colour="timepoint", shape="site", loadings.label.colour="black", loadings.colour="black", loadings.label=FALSE, frame=FALSE, loadings.label.size=5, loadings.label.vjust=-1, size=5) +
pca1geom_text(aes(x = PC1, y = PC2, label = paste(colony_id_corr, timepoint)), vjust = -0.5)+
theme_classic()+
theme(legend.text = element_text(size=18),
legend.position="right",
plot.background = element_blank(),
legend.title = element_text(size=18, face="bold"),
axis.text = element_text(size=18),
axis.title = element_text(size=18, face="bold"));pca1
Remove ACR-173, timepoint 3 sample from analysis. This is Azenta sample 1B2.
<- Apul_genes_filt %>%
Apul_genes_filt select(!`1B2_ACR-173_TP3`)
<- Apul_miRNA_filt %>%
Apul_miRNA_filt select(!`1B2_ACR-173_TP3`)
<- metadata %>%
metadata filter(Sample != "1B2_ACR-173_TP3")
We also do not have phys data for colony 1B9 ACR-265 at TP4, so I’ll remove that here as well.
<- Apul_genes_filt%>%
Apul_genes_filt select(!`1B9_ACR-265_TP4`)
<- Apul_miRNA_filt%>%
Apul_miRNA_filt select(!`1B9_ACR-265_TP4`)
<- metadata %>%
metadata filter(Sample != "1B9_ACR-265_TP4")
Assign metadata and arrange order of columns
Order metadata the same as the column order in the gene matrix.
<-colnames(Apul_genes_filt)
list<-as.factor(list)
list
$Sample<-as.factor(metadata$Sample)
metadata
# Re-order the levels
$Sample <- factor(as.character(metadata$Sample), levels=list)
metadata# Re-order the data.frame
<- metadata[order(metadata$Sample),]
metadata_ordered $Sample
metadata_ordered
# Make sure the miRNA colnames are also in the same order as the gene colnames
<- Apul_miRNA_filt[, colnames(Apul_genes_filt)] Apul_miRNA_filt
Metadata and gene count matrix are now ordered the same.
Conduct variance stabilized transformation
VST should be performed on our two input datasets (gene counts and miRNA counts) separately
Genes:
#Set DESeq2 design
<- DESeqDataSetFromMatrix(countData = Apul_genes_filt,
dds_genes colData = metadata_ordered,
design = ~Timepoint+ColonyID)
Check size factors.
<- estimateSizeFactors(dds_genes) #estimate size factors to determine if we can use vst to transform our data. Size factors should be less than 4 for us to use vst
SF.dds_genes print(sizeFactors(SF.dds_genes)) #View size factors
all(sizeFactors(SF.dds_genes)) < 4
All size factors are less than 4, so we can use VST transformation.
<- vst(dds_genes, blind=TRUE) #apply a variance stabilizing transformation to minimize effects of small counts and normalize with respect to library size
vsd_genes <- assay(vsd_genes)
vsd_genes head(vsd_genes, 3) #view transformed gene count data for the first three genes in the dataset.
miRNA:
#Set DESeq2 design
<- DESeqDataSetFromMatrix(countData = Apul_miRNA_filt,
dds_miRNA colData = metadata_ordered,
design = ~Timepoint+ColonyID)
Check size factors.
<- estimateSizeFactors(dds_miRNA) #estimate size factors to determine if we can use vst to transform our data. Size factors should be less than 4 for us to use vst
SF.dds_miRNA print(sizeFactors(SF.dds_miRNA)) #View size factors
all(sizeFactors(SF.dds_miRNA)) < 4
All size factors are less than 4, so we can use VST transformation.
<- varianceStabilizingTransformation(dds_miRNA, blind=TRUE) #apply a variance stabilizing transformation to minimize effects of small counts and normalize with respect to library size. Using varianceStabilizingTransformation() instead of vst() because few input genes
vsd_miRNA <- assay(vsd_miRNA)
vsd_miRNA head(vsd_miRNA, 3) #view transformed gene count data for the first three genes in the dataset.
Combine counts data
# Extract variance stabilized counts as dataframes
# want samples in rows, genes/miRNAs in columns
<- as.data.frame(t(vsd_genes))
vsd_genes <- as.data.frame(t(vsd_miRNA))
vsd_miRNA
# Double check the row names (sample names) are in same order
rownames(vsd_genes) == rownames(vsd_miRNA)
# Combine vst gene counts and vst miRNA counts by rows (sample names)
<- cbind(vsd_genes, vsd_miRNA) vsd_merged
Feature selection
Genes + miRNA
We have a large number of genes, so we’ll reduce dimensionality using PCA. Note that, since we only have a few phenotypes of interest, we don’t need to reduce this dataset
First need to remove any genes/miRNA that are invariant
<- vsd_merged[, apply(vsd_merged, 2, var) > 0]
vsd_merged_filt
ncol(vsd_merged)
ncol(vsd_merged_filt)
colnames(vsd_merged[, apply(vsd_merged, 2, var) == 0])
Removed 74 invariant genes. I was worried we lost miRNA, but it looks like everything removed was a gene (prefix “FUN”)!
Reduce dimensionality (genes+miRNA)
# Perform PCA on gene+miRNA expression matrix
<- prcomp(vsd_merged_filt, scale. = TRUE)
pca_merged
# Select top PCs that explain most variance (e.g., top 50 PCs)
<- summary(pca_merged)$importance[2, ] # Cumulative variance explained
explained_var <- min(which(cumsum(explained_var) > 0.95)) # Keep PCs that explain 95% variance
num_pcs
<- as.data.frame(pca_merged$x[, 1:num_pcs]) # Extract selected PCs
merged_pcs
dim(merged_pcs)
We have 27 gene/miRNA expression PCs
Genes only
To investigate gene expression separately from miRNA expression, reduce dimensionality of genes alone.
Remove any genes that are invariant
<- vsd_genes[, apply(vsd_genes, 2, var) > 0]
vsd_genes_filt
ncol(vsd_genes)
ncol(vsd_genes_filt)
Removed 74 invariant genes.
Reduce dimensionality
# Perform PCA on gene expression matrix
<- prcomp(vsd_genes_filt, scale. = TRUE)
pca_genes
# Select top PCs that explain most variance (e.g., top 50 PCs)
<- summary(pca_genes)$importance[2, ] # Cumulative variance explained
explained_var_genes <- min(which(cumsum(explained_var_genes) > 0.95)) # Keep PCs that explain 95% variance
num_pcs_genes
<- as.data.frame(pca_genes$x[, 1:num_pcs_genes]) # Extract selected PCs
genes_pcs
dim(genes_pcs)
Physiological metrics
Select physiological metrics of interest. For now we’ll focus on biomass (“Host_AFDW.mg.cm2”), protein (“prot_mg.mgafdw”), and respiration (“Rd”). These are all metrics of host energy storage and expenditure.
# Assign sample IDs to row names
rownames(phys) <- phys$Sample
# Select metrics
<- phys %>% select(Host_AFDW.mg.cm2, prot_mg.mgafdw, Rd)
phys_selection
# Make sure the phys rownames are in the same order as the gene/miRNA rownames
<- phys_selection[rownames(merged_pcs),] phys_selection
Phenotype to predict gene/miRNA expression
The model
Train elastic models to predict gene expression PCs from phys data.
<- function(response_pcs, predictor_pcs) {
train_models <- list()
models
for (pc in colnames(response_pcs)) {
<- response_pcs[[pc]] # Gene expression PC
y <- as.matrix(predictor_pcs) # Phys as predictors
X
# Train elastic net model (alpha = 0.5 for mix of LASSO & Ridge)
<- cv.glmnet(X, y, alpha = 0.5)
model
<- model
models[[pc]]
}
return(models)
}
# Train models predicting gene expression PCs from phys data
<- train_models(merged_pcs, phys_selection) models
Extract feature importance.
<- function(models) {
get_feature_importance <- lapply(models, function(model) {
importance_list <- as.matrix(coef(model, s = "lambda.min"))[-1, , drop = FALSE] # Convert to regular matrix & remove intercept
coefs
# Convert to data frame
<- data.frame(Feature = rownames(coefs), Importance = as.numeric(coefs))
coefs_df
return(coefs_df)
})
# Combine feature importance across all predicted gene PCs
<- bind_rows(importance_list) %>%
importance_df group_by(Feature) %>%
summarize(MeanImportance = mean(abs(Importance)), .groups = "drop") %>%
arrange(desc(MeanImportance))
return(importance_df)
}
<- get_feature_importance(models)
feature_importance head(feature_importance, 20) # Top predictive phys features
Evaluate performance.
<- function(models, response_pcs, predictor_pcs) {
evaluate_model_performance <- data.frame(PC = colnames(response_pcs), R2 = NA)
results
for (pc in colnames(response_pcs)) {
<- response_pcs[[pc]]
y <- as.matrix(predictor_pcs)
X
<- models[[pc]]
model <- predict(model, X, s = "lambda.min")
preds
<- cor(y, preds)^2 # R-squared metric
R2 $PC == pc, "R2"] <- R2
results[results
}
return(results)
}
<- evaluate_model_performance(models, merged_pcs, phys_selection)
performance_results summary(performance_results$R2)
Results
Plot results.
# Select top 20 predictive phys features
<- feature_importance %>% top_n(20, MeanImportance)
top_features
# Plot
ggplot(top_features, aes(x = reorder(Feature, MeanImportance), y = MeanImportance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # Flip for readability
theme_minimal() +
labs(title = "Top 20 Predictive Phys Features",
x = "Physiological Metric",
y = "Mean Importance")
ggplot(performance_results, aes(x = PC, y = R2)) +
geom_point(color = "darkred", size = 3) +
geom_hline(yintercept = mean(performance_results$R2, na.rm = TRUE), linetype = "dashed", color = "blue") +
theme_minimal() +
labs(title = "Model Performance Across Gene Expression PCs",
x = "Gene Expression PC",
y = "R² (Variance Explained)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate labels
Keep in mind that, while we ran the model with physiological predictors, we’re really interested in the genes/miRNA associated with these predictors
View components associated with gene/miRNA PCs
# Get the PCA rotation (loadings) matrix from the original gene/miRNA PCA
<- pca_merged$rotation # Each column corresponds to a PC
merged_loadings
# Convert to data frame and reshape for plotting
<- as.data.frame(merged_loadings) %>%
merged_loadings_df rownames_to_column(var = "gene_miRNA") %>%
pivot_longer(-gene_miRNA, names_to = "Merged_PC", values_to = "Loading")
# View top CpGs contributing most to each PC
<- merged_loadings_df %>%
top_genes group_by(Merged_PC) %>%
arrange(desc(abs(Loading))) %>%
slice_head(n = 20) # Select top 10 CpGs per PC
print(top_genes)
View top 20 miRNA/genes associated with PC9 (the PC with the highest R^2)
print(top_genes%>%filter(Merged_PC=="PC9"))
Interesting, there’s an miRNA in there!
View predicted vs actual gene expression values to evaluate model.
# Choose a gene expression PC to visualize (e.g., the most predictable one)
<- performance_results$PC[which.max(performance_results$R2)]
best_pc
# Extract actual and predicted values for that PC
<- merged_pcs[[best_pc]]
actual_values <- predict(models[[best_pc]], as.matrix(phys_selection), s = "lambda.min")
predicted_values
# Create data frame
<- data.frame(
prediction_df Actual = actual_values,
Predicted = predicted_values
)
# Scatter plot with regression line
ggplot(prediction_df, aes(x = Actual, y = lambda.min)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_minimal() +
labs(title = paste("Predicted vs. Actual for", best_pc),
x = "Actual Gene Expression PC",
y = "Predicted Gene Expression PC") +
annotate("text", x = min(actual_values), y = max(predicted_values),
label = paste("R² =", round(max(performance_results$R2, na.rm=TRUE), 3)),
hjust = 0, color = "black", size = 5)
## `geom_smooth()` using formula = 'y ~ x'