3 Case study: gene expression data
Similarly, we load the tidyverse package to help with data manipulation and visualization.
library(tidyverse)
In this demonstration, we use the curatedOvarianData data (Ganzfried et al. 2013) to demonstrate (1) the common gene co-expression network implied by \(\Sigma_\Phi\) and (2) the fit of the models, assessed by mean squared error (MSE). The data contain gene expression and clinical outcomes for 2970 patients collected from 23 studies. The studies differ in sequencing platform, sample size, tumor stage/subtype, and survival and censoring information.
3.1 Loading and previewing the data
We load the curatedOvarianData package to get the data. A description of each study can be found with data(package="curatedOvarianData").
library(curatedOvarianData)
#data(package="curatedOvarianData")
data(GSE13876_eset)
data(GSE26712_eset)
data(GSE9891_eset)
data(PMID17290060_eset)
The four datasets are of similar size. All of them have the majority of patients in the late stage of the cancer, and the histological subtype observed in the tissue samples is mostly serous carcinoma. The datasets differ in sequencing platform: Operon v3 two-color, Affymetrix HG-U133A, Affymetrix HG-U133 Plus 2, and Affymetrix HG-U133A, respectively.
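The sample sizes and platforms quoted above can be double-checked directly from the ExpressionSet objects with standard Biobase accessors. This is a quick sanity check rather than part of the original workflow; note that annotation() returns Bioconductor platform identifiers, which are labeled differently from the platform names above.
# Quick sanity check of study dimensions and platform annotations
esets <- list(GSE13876 = GSE13876_eset, GSE26712 = GSE26712_eset,
              GSE9891 = GSE9891_eset, PMID17290060 = PMID17290060_eset)
sapply(esets, dim)         # features x samples for each study
sapply(esets, annotation)  # Bioconductor platform identifier stored in each ExpressionSet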
3.2 Data pre-processing
First we find the intersection of genes that exist in all four studies. We use featureNames to extract the gene names and exprs to extract the expression matrices. More operations on ExpressionSet objects can be found in Falcon, Morgan, and Gentleman (2007).
inter_genes <- Reduce(intersect, list(featureNames(GSE13876_eset),
                                      featureNames(GSE26712_eset),
                                      featureNames(GSE9891_eset),
                                      featureNames(PMID17290060_eset)))

GSE13876_eset <- GSE13876_eset[inter_genes, ]
GSE26712_eset <- GSE26712_eset[inter_genes, ]
GSE9891_eset <- GSE9891_eset[inter_genes, ]
PMID17290060_eset <- PMID17290060_eset[inter_genes, ]

study1 <- t(exprs(GSE13876_eset))
study2 <- t(exprs(GSE26712_eset))
study3 <- t(exprs(GSE9891_eset))
study4 <- t(exprs(PMID17290060_eset))
We then filter the genes so that only high-variance genes are kept for later analysis.
# Calculate the coefficient of variation (CV) of each gene
cv1 <- apply(study1, 2, sd) / apply(study1, 2, mean)
cv2 <- apply(study2, 2, sd) / apply(study2, 2, mean)
cv3 <- apply(study3, 2, sd) / apply(study3, 2, mean)
cv4 <- apply(study4, 2, sd) / apply(study4, 2, mean)
cv_matrix <- rbind(cv1, cv2, cv3, cv4)

# Find genes with CV >= threshold in at least one study
threshold <- 0.16
genes_to_keep <- apply(cv_matrix, 2, function(cv) any(cv >= threshold))
sum(genes_to_keep)
[1] 1060
# Keep only the filtered genes in each study
study1 <- GSE13876_eset[genes_to_keep, ]
study2 <- GSE26712_eset[genes_to_keep, ]
study3 <- GSE9891_eset[genes_to_keep, ]
study4 <- PMID17290060_eset[genes_to_keep, ]
Next, we log-transform the data (to improve normality) and store them in four \(N_s\times P\) matrices.
df1 <- study1 %>% exprs() %>% t() %>% log() %>% as.data.frame()
df2 <- study2 %>% exprs() %>% t() %>% log() %>% as.data.frame()
df3 <- study3 %>% exprs() %>% t() %>% log() %>% as.data.frame()
df4 <- study4 %>% exprs() %>% t() %>% log() %>% as.data.frame()

list_gene <- list(df1, df2, df3, df4)
saveRDS(list(df1, df2, df3, df4), "./Data/CuratedOvarian_processed.rds")
Now we can see the dimensions of each study:
sapply(list_gene, dim)
[,1] [,2] [,3] [,4]
[1,] 157 195 285 117
[2,] 1060 1060 1060 1060
The final data for analysis have \(N_s=(157, 195, 285, 117)\) for \(s=1, 2, 3, 4\), and \(P=1060\).
We scale the data so that we focus on the correlations between the genes. When the data are scaled, each gene has variance 1, so the entries of the estimated covariance matrices are on a correlation scale, which facilitates a more interpretable network analysis.
Y_list <- readRDS("./Data/CuratedOvarian_processed.rds")
Y_list_scaled <- lapply(
  Y_list,
  function(x) scale(x, center = TRUE, scale = TRUE)
)
Y_mat_scaled <- Y_list_scaled %>% do.call(rbind, .) %>% as.matrix()
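As a minimal check of this relationship (not part of the original workflow), the sample covariance of a scaled study equals the correlation matrix of the unscaled data:
# After scaling, covariance and correlation coincide
all.equal(cov(Y_list_scaled[[1]]), cor(as.matrix(Y_list[[1]])))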
3.3 Fitting the models
For models fitted with MSFA::sp_fa, the function provides scaling and centering arguments, so we do not need to scale the data before fitting (here we use Mavis's version of MSFA).
Stack FA and Ind FA:
Y_mat <- Y_list %>% do.call(rbind, .) %>% as.matrix()
fit_stackFA <- MSFA::sp_fa(Y_mat, k = 6, scaling = FALSE, centering = TRUE,
                           control = list(nrun = 10000, burn = 8000))

fit_indFA <- lapply(1:4, function(s){
  j_s <- c(8, 8, 8, 8)   # one latent-dimension guess per study (four studies)
  MSFA::sp_fa(Y_list[[s]], k = j_s[s], scaling = FALSE, centering = TRUE,
              control = list(nrun = 10000, burn = 8000))
})
MOM-SS:
# Construct the membership matrix (block-diagonal; bdiag() is from the Matrix package)
N_s <- sapply(Y_list, nrow)
M_list <- list()
for(s in 1:4){
  M_list[[s]] <- matrix(1, nrow = N_s[s], ncol = 1)
}
M <- as.matrix(Matrix::bdiag(M_list))

fit_MOMSS <- BFR.BE::BFR.BE.EM.CV(x = Y_mat, v = NULL, b = M, q = 20, scaling = TRUE)
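The resulting M stacks one indicator column per study, so each column marks the samples belonging to that study. A quick structural check (not part of the original workflow):
# M should be sum(N_s) x 4, with column sums equal to the study sample sizes
dim(M)
colSums(M)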
PFA (we do not recommend running it; it takes 4 days or more):
N_s <- sapply(Y_list, nrow)
fit_PFA <- PFA(Y = t(Y_mat_scaled),
               latentdim = 20,
               grpind = rep(1:4, times = N_s),
               Thin = 5,
               Cutoff = 0.001,
               Total_itr = 10000, burn = 8000)
SUFA (takes about 10 hours):
fit_SUFA <- SUFA::fit_SUFA(Y_list_scaled, qmax = 20, nrun = 10000)
BMSFA:
fit_BMSFA <- MSFA::sp_msfa(Y_list, k = 20, j_s = c(4, 4, 4, 4),
                           outputlevel = 1, scaling = TRUE, centering = TRUE,
                           control = list(nrun = 10000, burn = 8000))
Tetris (we do not recommend running it; it could take more than 10 days):
set_alpha <- ceiling(1.25 * 4)
fit_Tetris <- tetris(Y_list_scaled, alpha = set_alpha, beta = 1, nprint = 200, nrun = 10000, burn = 8000)
print("Start to choose big_T. It might take a long time.")
big_T <- choose.A(fit_Tetris, alpha_IBP = set_alpha, S = 4)
run_fixed <- tetris(Y_list_scaled, alpha = set_alpha, beta = 1,
                    fixed = TRUE, A_fixed = big_T, nprint = 200, nrun = 10000, burn = 8000)
3.4 Post-processing
We use the post_xxx() functions defined in the nutrition applications chapter to post-process the MCMC output into point estimates of the factor loadings, common factors, and covariance matrices.
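For reference, the idea behind these helpers is to average a rotation-invariant function of the posterior loading draws rather than the raw loadings. The sketch below is only an illustration: the array name Phi_draws and its P-by-K-by-draws layout are assumptions, and the actual post_xxx() functions from the nutrition chapter should be used in practice.
# Sketch only (not the post_stackFA() used here): estimate Sigma_Phi = Phi t(Phi)
# by averaging the rotation-invariant product Phi t(Phi) over posterior draws.
post_SigmaPhi_sketch <- function(Phi_draws) {      # assumed layout: P x K x n_draws array
  P <- dim(Phi_draws)[1]
  n_draws <- dim(Phi_draws)[3]
  Sigma <- matrix(0, P, P)
  for (t in seq_len(n_draws)) {
    Sigma <- Sigma + tcrossprod(Phi_draws[, , t])  # Phi_t %*% t(Phi_t)
  }
  Sigma / n_draws
}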
For Stack FA, Ind FA and BMSFA, we determine the number of common factors \(K\) and the number of study-specific latent dimensions \(J_s\) after post-processing, by counting the eigenvalues of the estimated covariance matrices that each explain more than 5% of the total variance, and then re-run those models with the chosen values.
Here we start with Stack FA, Ind FA and BMSFA:
# Stack FA
res_stackFA <- post_stackFA(fit_stackFA, S = 4)
# Ind FA
res_IndFA <- post_IndFA(fit_indFA)
# BMSFA
res_BMSFA <- post_BMSFA(fit_BMSFA)
# Eigenvalue decomposition: count the eigenvalues that each explain > 5% of total variance
fun_eigen <- function(Sig_mean) {
  val_eigen <- eigen(Sig_mean)$values
  prop_var <- val_eigen / sum(val_eigen)
  choose_K <- length(which(prop_var > 0.05))
  return(choose_K)
}

SigmaPhi_StackFA <- res_stackFA$SigmaPhi
K_StackFA <- fun_eigen(SigmaPhi_StackFA)

SigmaLambda_IndFA <- res_IndFA$SigmaLambda
Js_IndFA <- lapply(SigmaLambda_IndFA, fun_eigen)

SigmaPhi_BMSFA <- res_BMSFA$SigmaPhi
K_BMSFA <- fun_eigen(SigmaPhi_BMSFA)
SigmaLambda_BMSFA <- res_BMSFA$SigmaLambda
Js_BMSFA <- lapply(SigmaLambda_BMSFA, fun_eigen)
# We print the results:
print(paste0("Stack FA: K = ", K_StackFA))
[1] "Stack FA: K = 2"
print(paste0("Ind FA: J_s = ", Js_IndFA))
[1] "Ind FA: J_s = 4" "Ind FA: J_s = 7" "Ind FA: J_s = 6" "Ind FA: J_s = 6"
print(paste0("BMSFA: K = ", K_BMSFA))
[1] "BMSFA: K = 6"
print(paste0("BMSFA: J_s = ", Js_BMSFA))
[1] "BMSFA: J_s = 4" "BMSFA: J_s = 4" "BMSFA: J_s = 4" "BMSFA: J_s = 4"
We then fit the models again with the determined number of factors and latent dimensions.
# Stack FA
fit_stackFA2 <- MSFA::sp_fa(Y_mat, k = K_StackFA, scaling = FALSE, centering = TRUE,
                            control = list(nrun = 10000, burn = 8000))
# Ind FA
fit_indFA2 <- lapply(1:4, function(s){
  j_s <- Js_IndFA[[s]]
  MSFA::sp_fa(Y_list[[s]], k = j_s, scaling = FALSE, centering = TRUE,
              control = list(nrun = 10000, burn = 8000))
})
# BMSFA
fit_BMSFA2 <- MSFA::sp_msfa(Y_list, k = K_BMSFA, j_s = unlist(Js_BMSFA),  # j_s as a numeric vector
                            outputlevel = 1, scaling = TRUE, centering = TRUE,
                            control = list(nrun = 10000, burn = 8000))
# Again we post-process the MCMC chains.
res_stackFA2 <- post_stackFA(fit_stackFA2, S = 4)
res_IndFA2 <- post_IndFA(fit_indFA2)
res_BMSFA2 <- post_BMSFA(fit_BMSFA2)
For the rest of the methods, we apply the same post-processing functions as in the nutrition applications chapter.
# PFA
res_PFA <- post_PFA(fit_PFA)
# MOM-SS
res_MOMSS <- post_MOMSS(fit_MOMSS)
# SUFA
res_SUFA <- post_SUFA(fit_SUFA)
# Tetris
#res_Tetris <- post_Tetris(fit_Tetris)
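The second goal stated at the beginning of this case study is to compare the fit of the models via MSE. A minimal sketch of such a criterion is given below: the element-wise mean squared difference between a model-implied covariance for a study and that study's sample covariance. How the model-implied covariance is assembled from the post-processed pieces follows the nutrition applications chapter and is not repeated here.
# Sketch of an MSE criterion for covariance fit
mse_cov <- function(Sigma_model, Y_s) {
  S_hat <- cov(as.matrix(Y_s))      # sample covariance of study s
  mean((Sigma_model - S_hat)^2)     # element-wise mean squared error
}
# Pseudo-usage: mse_cov(<model-implied covariance for study s>, Y_list_scaled[[s]])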
3.5 Visualization
We can visualize the common gene co-expression network using the estimated common covariance matrices; we use Gephi to draw the networks. The SigmaPhi matrix is the common covariance matrix, and SigmaLambda is the study-specific covariance matrix.
Before that, we filter the genes so that only genes with at least one high correlation are kept (leaving around 200 genes in each plot). We set the threshold for each model as follows:
- Stack FA: 0.85
- MOM-SS: 0.95
- BMSFA: 0.5
- SUFA: 0.28
- PFA: 0.55
# Filtering genes for visualization in Gephi
genenames <- colnames(list_gene[[1]])

# Stack FA
SigmaPhi_StackFA_curated <- res_stackFA2$SigmaPhi
colnames(SigmaPhi_StackFA_curated) <- rownames(SigmaPhi_StackFA_curated) <- genenames
diag(SigmaPhi_StackFA_curated) <- NA
above_thresh <- SigmaPhi_StackFA_curated > 0.85
keep_genes <- apply(above_thresh, 1, function(x) any(x, na.rm = TRUE))
sum(keep_genes)
[1] 214
SigmaPhi_StackFA_curated[abs(SigmaPhi_StackFA_curated) < 0.85] <- 0
SigmaPhi_StackFA_curated_filtered <- SigmaPhi_StackFA_curated[keep_genes, keep_genes]
#write.csv(SigmaPhi_StackFA_curated_filtered, "SigmaPhi_StackFA_curated_filtered.csv")
# MOM-SS
SigmaPhiMOMSS <- res_MOMSS$SigmaPhi
colnames(SigmaPhiMOMSS) <- rownames(SigmaPhiMOMSS) <- genenames
diag(SigmaPhiMOMSS) <- NA
above_thresh <- abs(SigmaPhiMOMSS) > 0.95
keep_genes <- apply(above_thresh, 1, function(x) any(x, na.rm = TRUE))
sum(keep_genes)
[1] 223
SigmaPhiMOMSS[abs(SigmaPhiMOMSS) < 0.95] <- 0
SigmaPhiMOMSS_filtered <- SigmaPhiMOMSS[keep_genes, keep_genes]
#write.csv(SigmaPhiMOMSS_filtered, "SigmaPhiMOMSS_filtered.csv")
# BMSFA
SigmaPhiBMSFA <- res_BMSFA2$SigmaPhi
colnames(SigmaPhiBMSFA) <- rownames(SigmaPhiBMSFA) <- genenames
diag(SigmaPhiBMSFA) <- NA
above_thresh <- abs(SigmaPhiBMSFA) > 0.5
keep_genes <- apply(above_thresh, 1, function(x) any(x, na.rm = TRUE))
sum(keep_genes)
[1] 192
SigmaPhiBMSFA[abs(SigmaPhiBMSFA) < 0.5] <- 0
SigmaPhiBMSFA_filtered <- SigmaPhiBMSFA[keep_genes, keep_genes]
#write.csv(SigmaPhiBMSFA_filtered, "SigmaPhiBMSFA_filtered.csv")
# SUFA
SigmaPhiSUFA <- res_SUFA$SigmaPhi
colnames(SigmaPhiSUFA) <- rownames(SigmaPhiSUFA) <- genenames
diag(SigmaPhiSUFA) <- NA
above_thresh <- abs(SigmaPhiSUFA) > 0.28
keep_genes <- apply(above_thresh, 1, function(x) any(x, na.rm = TRUE))
sum(keep_genes)
[1] 191
SigmaPhiSUFA[abs(SigmaPhiSUFA) < 0.28] <- 0
SigmaPhiSUFA_filtered <- SigmaPhiSUFA[keep_genes, keep_genes]
#write.csv(SigmaPhiSUFA_filtered, "SigmaPhiSUFA_filtered.csv")
# PFA
SigmaPhiPFA <- res_PFA$SigmaPhi
colnames(SigmaPhiPFA) <- rownames(SigmaPhiPFA) <- genenames
diag(SigmaPhiPFA) <- NA
above_thresh <- abs(SigmaPhiPFA) > 0.55
keep_genes <- apply(above_thresh, 1, function(x) any(x, na.rm = TRUE))
sum(keep_genes)
[1] 203
SigmaPhiPFA[abs(SigmaPhiPFA) < 0.55] <- 0
SigmaPhiPFA_filtered <- SigmaPhiPFA[keep_genes, keep_genes]
#write.csv(SigmaPhiPFA_filtered, "SigmaPhiPFA_filtered.csv")
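Gephi also accepts a weighted edge list, which can be more convenient than an adjacency-matrix CSV. As an optional alternative to the commented write.csv() calls above, a filtered matrix can be converted with the igraph package (a sketch, not part of the original workflow):
# Optional: export a filtered matrix as a weighted edge list for Gephi
to_edge_list <- function(Sigma_filtered) {
  A <- Sigma_filtered
  A[is.na(A)] <- 0                                      # the diagonal was set to NA above
  g <- igraph::graph_from_adjacency_matrix(A, mode = "undirected",
                                           weighted = TRUE, diag = FALSE)
  igraph::as_data_frame(g, what = "edges")              # columns: from, to, weight
}
#write.csv(to_edge_list(SigmaPhi_StackFA_curated_filtered), "edges_StackFA.csv", row.names = FALSE)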
Attached are the final figures made with Gephi. The nodes are the genes and the edges are the correlations between the genes. The thickness of an edge indicates the strength of the correlation, the color of a node indicates the study it belongs to, and the size of a node indicates the number of connections it has.
Stack FA:
PFA:
MOM-SS:
SUFA:
BMSFA: