Title: | A complete analysis of single cell subpopulations, from identifying subpopulations to analysing their relationship (scGPS = single cell Global Predictions of Subpopulation) |
---|---|
Description: | The package implements two main algorithms to answer two key questions: a SCORE (Stable Clustering at Optimal REsolution) to find subpopulations, followed by scGPS to investigate the relationships between subpopulations. |
Authors: | Quan Nguyen [aut, cre], Michael Thompson [aut], Anne Senabouth [aut] |
Maintainer: | Quan Nguyen <[email protected]> |
License: | GPL-3 |
Version: | 1.21.0 |
Built: | 2024-12-30 04:43:30 UTC |
Source: | https://github.com/bioc/scGPS |
often we need to label clusters with unique biological characters. One of the common approach to annotate a cluster is to perform functional enrichment analysis. The annotate implements ReactomePA and clusterProfiler for this analysis type in R. The function require installation of several databases as described below.
annotate_clusters( DEgeneList, pvalueCutoff = 0.05, gene_symbol = TRUE, species = "human" )
annotate_clusters( DEgeneList, pvalueCutoff = 0.05, gene_symbol = TRUE, species = "human" )
DEgeneList |
is a vector of gene symbols, convertable to ENTREZID |
pvalueCutoff |
is a numeric of the cutoff p value |
gene_symbol |
logical of whether the geneList is a gene symbol |
species |
is the selection of 'human' or 'mouse', default to 'human' genes |
write enrichment test output to a file and an enrichment test object for plotting
genes <-training_gene_sample genes <-genes$Merged_unique[seq_len(50)] enrichment_test <- annotate_clusters(genes, pvalueCutoff=0.05, gene_symbol=TRUE, species = 'human') clusterProfiler::dotplot(enrichment_test, showCategory=15)
genes <-training_gene_sample genes <-genes$Merged_unique[seq_len(50)] enrichment_test <- annotate_clusters(genes, pvalueCutoff=0.05, gene_symbol=TRUE, species = 'human') clusterProfiler::dotplot(enrichment_test, showCategory=15)
same as bootstrap_prediction, but with an multicore option
bootstrap_parallel( ncores = 4, nboots = 1, genes = genes, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID, listData = list(), cluster_mixedpop1 = NULL, cluster_mixedpop2 = NULL )
bootstrap_parallel( ncores = 4, nboots = 1, genes = genes, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID, listData = list(), cluster_mixedpop1 = NULL, cluster_mixedpop2 = NULL )
ncores |
a number specifying how many cpus to be used for running |
nboots |
a number specifying how many bootstraps to be run |
genes |
a gene list to build the model |
mixedpop1 |
a SingleCellExperiment object from a mixed population for training |
mixedpop2 |
a SingleCellExperiment object from a target mixed population for prediction |
c_selectID |
the root cluster in mixedpop1 to becompared to clusters in mixedpop2 |
listData |
a |
cluster_mixedpop1 |
a vector of cluster assignment for mixedpop1 |
cluster_mixedpop2 |
a vector of cluster assignment for mixedpop2 |
a list
with prediction results written in to the index
out_idx
Quan Nguyen, 2017-11-25
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique #prl_boots <- bootstrap_parallel(ncores = 4, nboots = 1, genes=genes, # mixedpop1 = mixedpop2, mixedpop2 = mixedpop2, c_selectID=1, # listData =list()) #prl_boots[[1]]$ElasticNetPredict #prl_boots[[1]]$LDAPredict
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique #prl_boots <- bootstrap_parallel(ncores = 4, nboots = 1, genes=genes, # mixedpop1 = mixedpop2, mixedpop2 = mixedpop2, c_selectID=1, # listData =list()) #prl_boots[[1]]$ElasticNetPredict #prl_boots[[1]]$LDAPredict
ElasticNet and LDA prediction for each of all the subpopulations in the new mixed population after training the model for a subpopulation in the first mixed population. The number of bootstraps to be run can be specified.
bootstrap_prediction( nboots = 1, genes = genes, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID = NULL, listData = list(), cluster_mixedpop1 = NULL, cluster_mixedpop2 = NULL, trainset_ratio = 0.5, LDA_run = TRUE, verbose = FALSE, log_transform = FALSE )
bootstrap_prediction( nboots = 1, genes = genes, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID = NULL, listData = list(), cluster_mixedpop1 = NULL, cluster_mixedpop2 = NULL, trainset_ratio = 0.5, LDA_run = TRUE, verbose = FALSE, log_transform = FALSE )
nboots |
a number specifying how many bootstraps to be run |
genes |
a gene list to build the model |
mixedpop1 |
a SingleCellExperiment object from a mixed population for training |
mixedpop2 |
a SingleCellExperiment object from a target mixed population for prediction |
c_selectID |
the root cluster in mixedpop1 to becompared to clusters in mixedpop2 |
listData |
a |
cluster_mixedpop1 |
a vector of cluster assignment for mixedpop1 |
cluster_mixedpop2 |
a vector of cluster assignment for mixedpop2 |
trainset_ratio |
a number specifying the proportion of cells to be part of the training subpopulation |
LDA_run |
logical, if the LDA prediction is added to compare to ElasticNet |
verbose |
a logical whether to display additional messages |
log_transform |
boolean whether log transform should be computed |
a list
with prediction results written in to the index
out_idx
Quan Nguyen, 2017-11-25
bootstrap_parallel
for parallel options
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique cluster_mixedpop1 <- colData(mixedpop1)[,1] cluster_mixedpop2 <- colData(mixedpop2)[,1] c_selectID <- 2 test <- bootstrap_prediction(nboots = 1, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, listData =list(), cluster_mixedpop1 = cluster_mixedpop1, cluster_mixedpop2 = cluster_mixedpop2, c_selectID = c_selectID) names(test) test$ElasticNetPredict test$LDAPredict
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique cluster_mixedpop1 <- colData(mixedpop1)[,1] cluster_mixedpop2 <- colData(mixedpop2)[,1] c_selectID <- 2 test <- bootstrap_prediction(nboots = 1, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, listData =list(), cluster_mixedpop1 = cluster_mixedpop1, cluster_mixedpop2 = cluster_mixedpop2, c_selectID = c_selectID) names(test) test$ElasticNetPredict test$LDAPredict
Compute Euclidean distance matrix by rows
calcDist(x)
calcDist(x)
x |
A numeric matrix |
a distance matrix
mat_test <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) calcDist(mat_test)
mat_test <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) calcDist(mat_test)
Compute Euclidean distance matrix by rows
calcDistArma(x)
calcDistArma(x)
x |
A numeric matrix |
a distance matrix
mat_test <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) #library(microbenchmark) #microbenchmark(calcDistArma(mat_test), dist(mat_test), times=3)
mat_test <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) #library(microbenchmark) #microbenchmark(calcDistArma(mat_test), dist(mat_test), times=3)
performs 40 clustering runs or more depending on windows
clustering( object = NULL, ngenes = 1500, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, verbose = FALSE, log_transform = FALSE )
clustering( object = NULL, ngenes = 1500, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, verbose = FALSE, log_transform = FALSE )
object |
is a SingleCellExperiment object from the train mixed population |
ngenes |
number of top variable genes to be used |
windows |
a numeric specifying the number of windows to test |
remove_outlier |
a vector containing IDs for clusters to be removed the default vector contains 0, as 0 is the cluster with singletons |
nRounds |
number of iterations to remove a selected clusters |
PCA |
logical specifying if PCA is used before calculating distance matrix |
nPCs |
number of principal components from PCA dimensional reduction to be used |
verbose |
a logical whether to display additional messages |
log_transform |
boolean whether log transform should be computed |
clustering results
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <-clustering(mixedpop2, remove_outlier = c(0))
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <-clustering(mixedpop2, remove_outlier = c(0))
subsamples cells for each bagging run and performs 40 clustering runs or more depending on windows.
clustering_bagging( object = NULL, ngenes = 1500, bagging_run = 20, subsample_proportion = 0.8, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, log_transform = FALSE )
clustering_bagging( object = NULL, ngenes = 1500, bagging_run = 20, subsample_proportion = 0.8, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, log_transform = FALSE )
object |
is a SingleCellExperiment object from the train mixed population. |
ngenes |
number of genes used for clustering calculations. |
bagging_run |
an integer specifying the number of bagging runs to be computed. |
subsample_proportion |
a numeric specifying the proportion of the tree to be chosen in subsampling. |
windows |
a numeric vector specifying the rages of each window. |
remove_outlier |
a vector containing IDs for clusters to be removed the default vector contains 0, as 0 is the cluster with singletons. |
nRounds |
a integer specifying the number rounds to attempt to remove outliers. |
PCA |
logical specifying if PCA is used before calculating distance matrix. |
nPCs |
an integer specifying the number of principal components to use. |
log_transform |
boolean whether log transform should be computed |
a list of clustering results containing each bagging run as well as the clustering of the original tree and the tree itself.
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <-clustering_bagging(mixedpop2, remove_outlier = c(0), bagging_run = 2, subsample_proportion = .7)
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <-clustering_bagging(mixedpop2, remove_outlier = c(0), bagging_run = 2, subsample_proportion = .7)
CORE is an algorithm to generate reproduciable clustering, CORE is first implemented in ascend R package. Here, CORE V2.0 uses bagging analysis to find a stable clustering result and detect rare clusters mixed population.
CORE_bagging( mixedpop = NULL, bagging_run = 20, subsample_proportion = 0.8, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, ngenes = 1500, log_transform = FALSE )
CORE_bagging( mixedpop = NULL, bagging_run = 20, subsample_proportion = 0.8, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, ngenes = 1500, log_transform = FALSE )
mixedpop |
is a SingleCellExperiment object from the train mixed population. |
bagging_run |
an integer specifying the number of bagging runs to be computed. |
subsample_proportion |
a numeric specifying the proportion of the tree to be chosen in subsampling. |
windows |
a numeric vector specifying the ranges of each window. |
remove_outlier |
a vector containing IDs for clusters to be removed the default vector contains 0, as 0 is the cluster with singletons. |
nRounds |
an integer specifying the number rounds to attempt to remove outliers. |
PCA |
logical specifying if PCA is used before calculating distance matrix. |
nPCs |
an integer specifying the number of principal components to use. |
ngenes |
number of genes used for clustering calculations. |
log_transform |
boolean whether log transform should be computed |
a list
with clustering results of all iterations, and a
selected
optimal resolution
Quan Nguyen, 2018-05-11
day5 <- day_5_cardio_cell_sample cellnames<-colnames(day5$dat5_counts) cluster <-day5$dat5_clusters cellnames <- data.frame('cluster' = cluster, 'cellBarcodes' = cellnames) #day5$dat5_counts needs to be in a matrix format mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <- CORE_bagging(mixedpop2, remove_outlier = c(0), PCA=FALSE, bagging_run = 2, subsample_proportion = .7)
day5 <- day_5_cardio_cell_sample cellnames<-colnames(day5$dat5_counts) cluster <-day5$dat5_clusters cellnames <- data.frame('cluster' = cluster, 'cellBarcodes' = cellnames) #day5$dat5_counts needs to be in a matrix format mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <- CORE_bagging(mixedpop2, remove_outlier = c(0), PCA=FALSE, bagging_run = 2, subsample_proportion = .7)
CORE is an algorithm to generate reproduciable clustering, CORE is first implemented in ascend R package. Here, CORE V2.0 introduces several new functionalities, including three key features: fast (and more memory efficient) implementation with C++ and paralellisation options allowing clustering of hundreds of thousands of cells (ongoing development), outlier revomal important if singletons exist (done), a number of dimensionality reduction methods including the imputation implementation (CIDR) for confirming clustering results (done), and an option to select the number of optimisation tree height windows for increasing resolution
CORE_clustering( mixedpop = NULL, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, ngenes = 1500, verbose = FALSE, log_transform = FALSE )
CORE_clustering( mixedpop = NULL, windows = seq(from = 0.025, to = 1, by = 0.025), remove_outlier = c(0), nRounds = 1, PCA = FALSE, nPCs = 20, ngenes = 1500, verbose = FALSE, log_transform = FALSE )
mixedpop |
is a SingleCellExperiment object from the train mixed population |
windows |
a numeric specifying the number of windows to test |
remove_outlier |
a vector containing IDs for clusters to be removed the default vector contains 0, as 0 is the cluster with singletons. |
nRounds |
an integer specifying the number rounds to attempt to remove outliers. |
PCA |
logical specifying if PCA is used before calculating distance matrix |
nPCs |
an integer specifying the number of principal components to use. |
ngenes |
number of genes used for clustering calculations. |
verbose |
a logical whether to display additional messages |
log_transform |
boolean whether log transform should be computed |
a list
with clustering results of all iterations, and a
selected optimal resolution
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample #day5$dat5_counts needs to be in a matrix format cellnames <- colnames(day5$dat5_counts) cluster <-day5$dat5_clusters cellnames <-data.frame('Cluster'=cluster, 'cellBarcodes' = cellnames) mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <- CORE_clustering(mixedpop2, remove_outlier = c(0), PCA=FALSE, nPCs=20, ngenes=1500)
day5 <- day_5_cardio_cell_sample #day5$dat5_counts needs to be in a matrix format cellnames <- colnames(day5$dat5_counts) cluster <-day5$dat5_clusters cellnames <-data.frame('Cluster'=cluster, 'cellBarcodes' = cellnames) mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <- CORE_clustering(mixedpop2, remove_outlier = c(0), PCA=FALSE, nPCs=20, ngenes=1500)
CORE_subcluster allows re-cluster the CORE clustering result
CORE_subcluster( mixedpop = NULL, windows = seq(from = 0.025, to = 1, by = 0.025), select_cell_index = NULL, ngenes = 1500 )
CORE_subcluster( mixedpop = NULL, windows = seq(from = 0.025, to = 1, by = 0.025), select_cell_index = NULL, ngenes = 1500 )
mixedpop |
is a SingleCellExperiment object from the train mixed population |
windows |
a numeric specifying the number of windows to test |
select_cell_index |
a vector containing indexes for cells in selected clusters to be reclustered |
ngenes |
number of genes used for clustering calculations. |
a list
with clustering results of all iterations, and a
selected optimal resolution
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <- CORE_clustering(mixedpop2,remove_outlier= c(0))
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test <- CORE_clustering(mixedpop2,remove_outlier= c(0))
The count data set contains counts for 16990 genes for 590 cells randomly subsampled from day-2 cardio-differentiation population. The vector of clustering information contains corresponding to cells in the count matrix
day_2_cardio_cell_sample
day_2_cardio_cell_sample
a list instance, containing a count matrix and a vector with clustering information
a list, with the name day2
Quan Nguyen, 2017-11-25
Dr Joseph Powell's laboratory, IMB, UQ
The count data set contains counts for 17402 genes for 983 cells (1 row per gene) randomly subsampled from day-5 cardio-differentiation population The vector of clustering information contains corresponding to cells in the count matrix
day_5_cardio_cell_sample
day_5_cardio_cell_sample
a list instance, containing a count matrix and a vector with clustering information.
a list, with the name day5
Quan Nguyen, 2017-11-25
Dr Joseph Powell's laboratory, IMB, UQ
Compute Distance between two vectors
distvec(x, y)
distvec(x, y)
x |
A numeric vector |
y |
A numeric vector |
a numeric distance
x <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) x <- x[1,] y <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) y <- y[1,] distvec(x, y)
x <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) x <- x[1,] y <-matrix(rnbinom(1000,mu=0.01, size=10),nrow=1000) y <- y[1,] distvec(x, y)
Find DE genes from comparing one clust vs remaining
find_markers( expression_matrix = NULL, cluster = NULL, selected_cluster = NULL, fitType = "local", dispersion_method = "per-condition", sharing_Mode = "maximum" )
find_markers( expression_matrix = NULL, cluster = NULL, selected_cluster = NULL, fitType = "local", dispersion_method = "per-condition", sharing_Mode = "maximum" )
expression_matrix |
is a normalised expression matrix. |
cluster |
corresponding cluster information in the expression_matrix by running CORE clustering or using other methods. |
selected_cluster |
a vector of unique cluster ids to calculate |
fitType |
string specifying 'local' or 'parametric' for DEseq dispersion estimation |
dispersion_method |
one of the options c( 'pooled', 'pooled-CR', per-condition', 'blind' ) |
sharing_Mode |
one of the options c("maximum", "fit-only", "gene-est-only") |
a list
containing sorted DESeq analysis results
Quan Nguyen, 2017-11-25
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) # depending on the data, the DESeq::estimateDispersions function requires # suitable fitType # and dispersion_method options DEgenes <- find_markers(expression_matrix=assay(mixedpop1), cluster = colData(mixedpop1)[,1], selected_cluster=c(1), #can also run for more #than one clusters, e.g.selected_cluster = c(1,2) fitType = "parametric", dispersion_method = "blind", sharing_Mode="fit-only" ) names(DEgenes)
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) # depending on the data, the DESeq::estimateDispersions function requires # suitable fitType # and dispersion_method options DEgenes <- find_markers(expression_matrix=assay(mixedpop1), cluster = colData(mixedpop1)[,1], selected_cluster=c(1), #can also run for more #than one clusters, e.g.selected_cluster = c(1,2) fitType = "parametric", dispersion_method = "blind", sharing_Mode="fit-only" ) names(DEgenes)
from calculated stability based on Rand indexes for consecutive clustering run, find the resolution (window), where the stability is the highest
find_optimal_stability( list_clusters, run_RandIdx, bagging = FALSE, windows = seq(from = 0.025, to = 1, by = 0.025) )
find_optimal_stability( list_clusters, run_RandIdx, bagging = FALSE, windows = seq(from = 0.025, to = 1, by = 0.025) )
list_clusters |
is a |
run_RandIdx |
is a |
bagging |
is a logical that is true if bagging is to be performed, changes return |
windows |
a numeric vector specifying the ranges of each window. |
bagging == FALSE => a list
with optimal stability, cluster
count and summary stats bagging == TRUE => a list
with high res
cluster count, optimal cluster count and keystats
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) cluster_all <-clustering(object=mixedpop2) stab_df <- find_stability(list_clusters=cluster_all$list_clusters, cluster_ref = cluster_all$cluster_ref) optimal_stab <- find_optimal_stability(list_clusters = cluster_all$list_clusters, stab_df, bagging = FALSE)
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) cluster_all <-clustering(object=mixedpop2) stab_df <- find_stability(list_clusters=cluster_all$list_clusters, cluster_ref = cluster_all$cluster_ref) optimal_stab <- find_optimal_stability(list_clusters = cluster_all$list_clusters, stab_df, bagging = FALSE)
from clustering results, compare similarity between clusters by adjusted Randindex
find_stability(list_clusters = NULL, cluster_ref = NULL)
find_stability(list_clusters = NULL, cluster_ref = NULL)
list_clusters |
is a object from the iterative clustering runs |
cluster_ref |
is a object from the reference cluster |
a data frame
with stability scores and rand_index results
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) cluster_all <-clustering(object=mixedpop2) stab_df <- find_stability(list_clusters=cluster_all$list_clusters, cluster_ref = cluster_all$cluster_ref)
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) cluster_all <-clustering(object=mixedpop2) stab_df <- find_stability(list_clusters=cluster_all$list_clusters, cluster_ref = cluster_all$cluster_ref)
Calculate mean
mean_cpp(x)
mean_cpp(x)
x |
integer. |
a scalar value
mean_cpp(seq_len(10^6))
mean_cpp(seq_len(10^6))
new_scGPS_object
generates a scGPS object in the
SingleCellExperiment class for use with the scGPS package. This
object contains an expression matrix, associated metadata (cells, genes,
clusters). The data are expected to be normalised counts.
new_scGPS_object( ExpressionMatrix = NULL, GeneMetadata = NULL, CellMetadata = NULL, LogMatrix = NULL )
new_scGPS_object( ExpressionMatrix = NULL, GeneMetadata = NULL, CellMetadata = NULL, LogMatrix = NULL )
ExpressionMatrix |
An expression matrix in data.frame or matrix format. Rows should represent a transcript and its normalised counts, while columns should represent individual cells. |
GeneMetadata |
A data frame or vector containing gene identifiers used in the expression matrix. The first column should hold the gene identifiers you are using in the expression matrix. Other columns contain information about the genes, such as their corresponding ENSEMBL transcript identifiers. |
CellMetadata |
A data frame containing cell identifiers (usually barcodes) and an integer representing which batch they belong to. The column containing clustering information needs to be the first column in the CellMetadata dataframe If clustering information is not available, users can run CORE function and add the information to the scGPS before running scGPS prediction |
LogMatrix |
optional input for a log matrix of the data. If no log matrix is supplied one will be created for the object |
This function generates an scGPS object belonging to the SingleCellExperiment.
Quan Nguyen, 2018-04-06
day2 <- day_2_cardio_cell_sample t <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) colData(t); show(t); colnames(t)
day2 <- day_2_cardio_cell_sample t <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) colData(t); show(t); colnames(t)
new_scGPS_object
generates a scGPS object in the
SingleCellExperiment class for use with the scGPS package. This
object contains an expression matrix, associated metadata (cells, genes,
clusters). The data are expected to be normalised counts.
new_summarized_scGPS_object( ExpressionMatrix = NULL, GeneMetadata = NULL, CellMetadata = NULL )
new_summarized_scGPS_object( ExpressionMatrix = NULL, GeneMetadata = NULL, CellMetadata = NULL )
ExpressionMatrix |
An expression dataset in matrix format. Rows should represent a transcript and its normalised counts, while columns should represent individual cells. |
GeneMetadata |
A data frame or vector containing gene identifiers used in the expression matrix. The first column should hold the cell identifiers you are using in the expression matrix. Other columns contain information about the genes, such as their corresponding ENSEMBL transcript identifiers. |
CellMetadata |
A data frame containing cell identifiers (usually barcodes) and clustering information (the first column of the data frame contains clustering information). The column containing clustering information needs to be named as 'Cluster'. If clustering information is not available, users can run CORE function and add the information to the scGPS before running scGPS prediction |
This function generates an scGPS object belonging to the SingleCellExperiment.
Quan Nguyen, 2017-11-25
day2 <- day_2_cardio_cell_sample t <-new_summarized_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) colData(t); show(t); colnames(t)
day2 <- day_2_cardio_cell_sample t <-new_summarized_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) colData(t); show(t); colnames(t)
Select top variable genes and perform prcomp
PCA(expression.matrix = NULL, ngenes = 1500, scaling = TRUE, npcs = 50)
PCA(expression.matrix = NULL, ngenes = 1500, scaling = TRUE, npcs = 50)
expression.matrix |
An expression matrix, with genes in rows |
ngenes |
number of genes used for clustering calculations. |
scaling |
a logical of whether we want to scale the matrix |
npcs |
an integer specifying the number of principal components to use. |
a list containing PCA results and variance explained
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) t <-PCA(expression.matrix=assay(mixedpop1))
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) t <-PCA(expression.matrix=assay(mixedpop1))
This function plots CORE and all clustering results underneath
plot_CORE(original.tree, list_clusters = NULL, color_branch = NULL)
plot_CORE(original.tree, list_clusters = NULL, color_branch = NULL)
original.tree |
the original dendrogram before clustering |
list_clusters |
a list containing clustering results for each of the |
color_branch |
is a vector containing user-specified colors (the number of unique colors should be equal or larger than the number of clusters). This parameter allows better selection of colors for the display. |
a plot with clustering bars underneath the tree
day5 <- day_5_cardio_cell_sample cellnames <- colnames(day5$dat5_counts) cluster <-day5$dat5_clusters cellnames <-data.frame('Cluster'=cluster, 'cellBarcodes' = cellnames) mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = cellnames) CORE_cluster <- CORE_clustering(mixedpop2, remove_outlier = c(0)) plot_CORE(CORE_cluster$tree, CORE_cluster$Cluster)
day5 <- day_5_cardio_cell_sample cellnames <- colnames(day5$dat5_counts) cluster <-day5$dat5_clusters cellnames <-data.frame('Cluster'=cluster, 'cellBarcodes' = cellnames) mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = cellnames) CORE_cluster <- CORE_clustering(mixedpop2, remove_outlier = c(0)) plot_CORE(CORE_cluster$tree, CORE_cluster$Cluster)
after an optimal cluster has been identified, users may use this function to plot the resulting dendrogram with the branch colors represent clutering results
plot_optimal_CORE( original_tree, optimal_cluster = NULL, shift = -100, values = NULL )
plot_optimal_CORE( original_tree, optimal_cluster = NULL, shift = -100, values = NULL )
original_tree |
a dendrogram object |
optimal_cluster |
a vector of cluster IDs for cells in the dendrogram |
shift |
a numer specifying the gap between the dendrogram and the colored |
values |
a vector containing color values of the branches and the colored bar underneath the tree bar underneath the dendrogram. This parameter allows better selection of colors for the display. |
a plot with colored braches and bars for the optimal clustering result
Quan Nguyen, 2017-11-25
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) CORE_cluster <- CORE_clustering(mixedpop2, remove_outlier = c(0)) key_height <- CORE_cluster$optimalClust$KeyStats$Height optimal_res <- CORE_cluster$optimalClust$OptimalRes optimal_index = which(key_height == optimal_res) plot_optimal_CORE(original_tree= CORE_cluster$tree, optimal_cluster = unlist(CORE_cluster$Cluster[optimal_index]), shift = -2000)
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) CORE_cluster <- CORE_clustering(mixedpop2, remove_outlier = c(0)) key_height <- CORE_cluster$optimalClust$KeyStats$Height optimal_res <- CORE_cluster$optimalClust$OptimalRes optimal_index = which(key_height == optimal_res) plot_optimal_CORE(original_tree= CORE_cluster$tree, optimal_cluster = unlist(CORE_cluster$Cluster[optimal_index]), shift = -2000)
plot PCA, tSNE, and CIDR reduced datasets
plot_reduced( reduced_dat, color_fac = NULL, dims = c(1, 2), dimNames = c("Dim1", "Dim2"), palletes = NULL, legend_title = "Cluster" )
plot_reduced( reduced_dat, color_fac = NULL, dims = c(1, 2), dimNames = c("Dim1", "Dim2"), palletes = NULL, legend_title = "Cluster" )
reduced_dat |
is a matrix with genes in rows and cells in columns |
color_fac |
is a vector of colors corresponding to clusters to determine colors of scattered plots |
dims |
an integer of the number of dimestions |
dimNames |
a vector of the names of the dimensions |
palletes |
can be a customised color pallete that determine colors for density plots, if NULL it will use RColorBrewer colorRampPalette(RColorBrewer::brewer.pal(sample_num, 'Set1'))(sample_num) |
legend_title |
title of the plot's legend |
a matrix with the top 20 CIDR dimensions
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) #CIDR_dim <-CIDR(expression.matrix=assay(mixedpop1)) #p <- plot_reduced(CIDR_dim, color_fac = factor(colData(mixedpop1)[,1]), # palletes = seq_len(length(unique(colData(mixedpop1)[,1])))) #plot(p) tSNE_dim <-tSNE(expression.mat=assay(mixedpop1)) p2 <- plot_reduced(tSNE_dim, color_fac = factor(colData(mixedpop1)[,1]), palletes = seq_len(length(unique(colData(mixedpop1)[,1])))) plot(p2)
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) #CIDR_dim <-CIDR(expression.matrix=assay(mixedpop1)) #p <- plot_reduced(CIDR_dim, color_fac = factor(colData(mixedpop1)[,1]), # palletes = seq_len(length(unique(colData(mixedpop1)[,1])))) #plot(p) tSNE_dim <-tSNE(expression.mat=assay(mixedpop1)) p2 <- plot_reduced(tSNE_dim, color_fac = factor(colData(mixedpop1)[,1]), palletes = seq_len(length(unique(colData(mixedpop1)[,1])))) plot(p2)
Predict a new mixed population after training the model for a subpopulation in the first mixed population. All subpopulations in the new target mixed population will be predicted, where each targeted subpopulation will have a transition score from the orginal subpopulation to the new subpopulation.
predicting( listData = NULL, cluster_mixedpop2 = NULL, mixedpop2 = NULL, out_idx = NULL, standardize = TRUE, LDA_run = FALSE, c_selectID = NULL, log_transform = FALSE )
predicting( listData = NULL, cluster_mixedpop2 = NULL, mixedpop2 = NULL, out_idx = NULL, standardize = TRUE, LDA_run = FALSE, c_selectID = NULL, log_transform = FALSE )
listData |
a |
cluster_mixedpop2 |
a vector of cluster assignment for mixedpop2 |
mixedpop2 |
a SingleCellExperiment object from the target mixed population of importance, e.g. differentially expressed genes that are most significant |
out_idx |
a number to specify index to write results into the list output. This is needed for running bootstrap. |
standardize |
a logical of whether to standardize the data |
LDA_run |
logical, if the LDA prediction is added to compare to ElasticNet, the LDA model needs to be trained from the training before inputting to this prediction step |
c_selectID |
a number to specify the trained cluster used for prediction |
log_transform |
boolean whether log transform should be computed |
a list
with prediction results written in to the index
out_idx
Quan Nguyen, 2017-11-25
c_selectID<-1 out_idx<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique listData <- training(genes, cluster_mixedpop1 = colData(mixedpop1)[, 1], mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID, listData =list(), out_idx=out_idx) listData <- predicting(listData =listData, mixedpop2 = mixedpop2, out_idx=out_idx, cluster_mixedpop2 = colData(mixedpop2)[, 1], c_selectID = c_selectID)
c_selectID<-1 out_idx<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique listData <- training(genes, cluster_mixedpop1 = colData(mixedpop1)[, 1], mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID, listData =list(), out_idx=out_idx) listData <- predicting(listData =listData, mixedpop2 = mixedpop2, out_idx=out_idx, cluster_mixedpop2 = colData(mixedpop2)[, 1], c_selectID = c_selectID)
This function provides significant speed gain if the input matrix is big
PrinComp_cpp(X)
PrinComp_cpp(X)
X |
an R matrix (expression matrix), rows are genes, columns are cells |
a list with three list pca lists
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=1000) #library(microbenchmark) #microbenchmark(PrinComp_cpp(mat_test), prcomp(mat_test), times=3)
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=1000) #library(microbenchmark) #microbenchmark(PrinComp_cpp(mat_test), prcomp(mat_test), times=3)
Comparing clustering results Function for calculating randindex (adapted from the function by Steve Horvath and Luohua Jiang, UCLA, 2003)
rand_index(tab, adjust = TRUE)
rand_index(tab, adjust = TRUE)
tab |
a table containing different clustering results in rows |
adjust |
a logical of whether to use the adjusted rand index |
a rand_index value
Quan Nguyen and Michael Thompson, 2018-05-11
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) cluster_all <-clustering(object=mixedpop2) rand_index(table(unlist(cluster_all$list_clusters[[1]]), cluster_all$cluster_ref))
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) cluster_all <-clustering(object=mixedpop2) rand_index(table(unlist(cluster_all$list_clusters[[1]]), cluster_all$cluster_ref))
Function to calculate Eucledean distance matrix without parallelisation
rcpp_Eucl_distance_NotPar(mat)
rcpp_Eucl_distance_NotPar(mat)
mat |
an R matrix (expression matrix), with cells in rows and genes in columns |
a distance matrix
mat_test <-matrix(rnbinom(100000,mu=0.01, size=10),nrow=1000) #library(microbenchmark) #microbenchmark(rcpp_Eucl_distance_NotPar(mat_test), dist(mat_test), times=3)
mat_test <-matrix(rnbinom(100000,mu=0.01, size=10),nrow=1000) #library(microbenchmark) #microbenchmark(rcpp_Eucl_distance_NotPar(mat_test), dist(mat_test), times=3)
This function provides fast and memory efficient distance matrix calculation
rcpp_parallel_distance(mat)
rcpp_parallel_distance(mat)
mat |
an R matrix (expression matrix), rows are genes, columns are cells |
a distance matrix
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=10000) #library(microbenchmark) #microbenchmark(rcpp_parallel_distance(mat_test), dist(mat_test), times=3)
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=10000) #library(microbenchmark) #microbenchmark(rcpp_parallel_distance(mat_test), dist(mat_test), times=3)
n
bootstrapsthe training and prediction results from bootstrap
were written to the object LSOLDA_dat
, the reformat_LASSO
summarises prediction for n
bootstrap runs
reformat_LASSO( c_selectID = NULL, mp_selectID = NULL, LSOLDA_dat = NULL, nPredSubpop = NULL, Nodes_group = "#7570b3", nboots = 2 )
reformat_LASSO( c_selectID = NULL, mp_selectID = NULL, LSOLDA_dat = NULL, nPredSubpop = NULL, Nodes_group = "#7570b3", nboots = 2 )
c_selectID |
is the original cluster to be projected |
mp_selectID |
is the target mixedpop to project to |
LSOLDA_dat |
is the results from the bootstrap |
nPredSubpop |
is the number of clusters in the target mixedpop
|
Nodes_group |
string representation of hexidecimal color code for node |
nboots |
is an integer for how many bootstraps are run |
a dataframe containg information for the Lasso prediction results, each column contains prediction results for all subpopulations from each bootstrap run
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 2, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) reformat_LASSO(LSOLDA_dat=LSOLDA_dat, nPredSubpop=length(unique(colData(mixedpop2)[,1])), c_selectID = 1, mp_selectID =2, nboots = 2)
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 2, mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) reformat_LASSO(LSOLDA_dat=LSOLDA_dat, nPredSubpop=length(unique(colData(mixedpop2)[,1])), c_selectID = 1, mp_selectID =2, nboots = 2)
performs 40 clustering runs or more depending on windows
sub_clustering( object = NULL, ngenes = 1500, windows = seq(from = 0.025, to = 1, by = 0.025), select_cell_index = NULL )
sub_clustering( object = NULL, ngenes = 1500, windows = seq(from = 0.025, to = 1, by = 0.025), select_cell_index = NULL )
object |
is a SingleCellExperiment object from the train mixed population |
ngenes |
number of genes used for clustering calculations. |
windows |
a numeric vector specifying the ranges of each window. |
select_cell_index |
a vector containing indexes for cells in selected clusters to be reclustered |
clustering results
Quan Nguyen, 2018-01-31
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test_sub_clustering <-sub_clustering(mixedpop2, select_cell_index = c(seq_len(100)))
day5 <- day_5_cardio_cell_sample mixedpop2 <-new_summarized_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) test_sub_clustering <-sub_clustering(mixedpop2, select_cell_index = c(seq_len(100)))
Subset a matrix
subset_cpp(m1in, rowidx_in, colidx_in)
subset_cpp(m1in, rowidx_in, colidx_in)
m1in |
an R matrix (expression matrix) |
rowidx_in |
a numeric vector of rows to keep |
colidx_in |
a numeric vector of columns to keep |
a subsetted matrix
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=100) subset_mat <- subset_cpp(mat_test, rowidx_in=c(1:10), colidx_in=c(100:500)) dim(subset_mat)
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=100) subset_mat <- subset_cpp(mat_test, rowidx_in=c(1:10), colidx_in=c(100:500)) dim(subset_mat)
n
bootstrapsThe training results from training
were written to
summary_accuracy(object = NULL)
summary_accuracy(object = NULL)
object |
is a list containing the training results from
the |
a vector of percent accuracy for the selected subpopulation
Quan Nguyen, 2017-11-25
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 1,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_accuracy(LSOLDA_dat) summary_deviance(LSOLDA_dat)
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 1,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_accuracy(LSOLDA_dat) summary_deviance(LSOLDA_dat)
n
bootstrapsthe training results from training
were written to
the object LSOLDA_dat
, the summary_devidance
summarises
deviance explained for n
bootstrap runs and also returns the best
deviance matrix for plotting, as well as the best matrix with Lasso genes
and coefficients
summary_deviance(object = NULL)
summary_deviance(object = NULL)
object |
is a list containing the training results from
|
a list
containing three elements, with a vector of percent
maximum deviance explained, a dataframe containg information for the full
deviance, and a dataframe containing gene names and coefficients of the best
model
Quan Nguyen, 2017-11-25
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 2,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_deviance(LSOLDA_dat)
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 2,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_deviance(LSOLDA_dat)
n
bootstrapsthe training results from training
were written to
the object LSOLDA_dat
, the summary_prediction
summarises
prediction for n
bootstrap runs
summary_prediction_lasso(LSOLDA_dat = NULL, nPredSubpop = NULL)
summary_prediction_lasso(LSOLDA_dat = NULL, nPredSubpop = NULL)
LSOLDA_dat |
is a list containing the training results from
|
nPredSubpop |
is the number of subpopulations in the target mixed population |
a dataframe containg information for the Lasso prediction results, each column contains prediction results for all subpopulations from each bootstrap run
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 1,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_prediction_lasso(LSOLDA_dat=LSOLDA_dat, nPredSubpop=4)
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 1,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_prediction_lasso(LSOLDA_dat=LSOLDA_dat, nPredSubpop=4)
n
bootstrapsthe training results from training
were written to
the object LSOLDA_dat
, the summary_prediction
summarises
prediction explained for n
bootstrap runs and also returns the best
deviance matrix for plotting, as well as the best matrix with Lasso genes
and coefficients
summary_prediction_lda(LSOLDA_dat = NULL, nPredSubpop = NULL)
summary_prediction_lda(LSOLDA_dat = NULL, nPredSubpop = NULL)
LSOLDA_dat |
is a list containing the training results from
|
nPredSubpop |
is the number of subpopulations in the target mixed population |
a dataframe containg information for the LDA prediction results, each column contains prediction results for all subpopulations from each bootstrap run
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 1,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_prediction_lda(LSOLDA_dat=LSOLDA_dat, nPredSubpop=4)
c_selectID<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique LSOLDA_dat <- bootstrap_prediction(nboots = 1,mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, genes=genes, c_selectID, listData =list(), cluster_mixedpop1 = colData(mixedpop1)[,1], cluster_mixedpop2=colData(mixedpop2)[,1]) summary_prediction_lda(LSOLDA_dat=LSOLDA_dat, nPredSubpop=4)
subset a matrix by top variable genes
top_var(expression.matrix = NULL, ngenes = 1500)
top_var(expression.matrix = NULL, ngenes = 1500)
expression.matrix |
is a matrix with genes in rows and cells in columns |
ngenes |
number of genes used for clustering calculations. |
a subsetted expression matrix with the top n most variable genes
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) SortedExprsMat <-top_var(expression.matrix=assay(mixedpop1))
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) SortedExprsMat <-top_var(expression.matrix=assay(mixedpop1))
Transpose a matrix
tp_cpp(X)
tp_cpp(X)
X |
an R matrix (expression matrix) |
a transposed matrix
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=100) tp_mat <- tp_cpp(mat_test)
mat_test <-matrix(rnbinom(1000000,mu=0.01, size=10),nrow=100) tp_mat <- tp_cpp(mat_test)
Training a haft of all cells to find optimal ElasticNet and LDA models to predict a subpopulation
training( genes = NULL, cluster_mixedpop1 = NULL, mixedpop1 = NULL, mixedpop2 = NULL, c_selectID = NULL, listData = list(), out_idx = 1, standardize = TRUE, trainset_ratio = 0.5, LDA_run = FALSE, log_transform = FALSE )
training( genes = NULL, cluster_mixedpop1 = NULL, mixedpop1 = NULL, mixedpop2 = NULL, c_selectID = NULL, listData = list(), out_idx = 1, standardize = TRUE, trainset_ratio = 0.5, LDA_run = FALSE, log_transform = FALSE )
genes |
a vector of gene names (for ElasticNet shrinkage); gene symbols must be in the same format with gene names in subpop2. Note that genes are listed by the order of importance, e.g. differentially expressed genes that are most significan, so that if the gene list contains too many genes, only the top 500 genes are used. |
cluster_mixedpop1 |
a vector of cluster assignment in mixedpop1 |
mixedpop1 |
is a SingleCellExperiment object from the train mixed population |
mixedpop2 |
is a SingleCellExperiment object from the target mixed population |
c_selectID |
a selected number to specify which subpopulation to be used for training |
listData |
list to store output in |
out_idx |
a number to specify index to write results into the list output. This is needed for running bootstrap. |
standardize |
a logical value specifying whether or not to standardize the train matrix |
trainset_ratio |
a number specifying the proportion of cells to be part of the training subpopulation |
LDA_run |
logical, if the LDA run is added to compare to ElasticNet |
log_transform |
boolean whether log transform should be computed |
a list
with prediction results written in to the indexed
out_idx
Quan Nguyen, 2017-11-25
c_selectID<-1 out_idx<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique listData <- training(genes, cluster_mixedpop1 = colData(mixedpop1)[, 1], mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID, listData =list(), out_idx=out_idx, trainset_ratio = 0.5) names(listData) listData$Accuracy
c_selectID<-1 out_idx<-1 day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) day5 <- day_5_cardio_cell_sample mixedpop2 <-new_scGPS_object(ExpressionMatrix = day5$dat5_counts, GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters) genes <-training_gene_sample genes <-genes$Merged_unique listData <- training(genes, cluster_mixedpop1 = colData(mixedpop1)[, 1], mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID, listData =list(), out_idx=out_idx, trainset_ratio = 0.5) names(listData) listData$Accuracy
Genes should be ordered from most to least important genes (1 row per gene)
training_gene_sample
training_gene_sample
a list instance, containing a count matrix and a vector with clustering information.
a vector containing gene symbols
Quan Nguyen, 2017-11-25
Dr Joseph Powell's laboratory, IMB, UQ
calculate tSNE from top variable genes
tSNE( expression.mat = NULL, topgenes = 1500, scale = TRUE, thet = 0.5, perp = 30 )
tSNE( expression.mat = NULL, topgenes = 1500, scale = TRUE, thet = 0.5, perp = 30 )
expression.mat |
An expression matrix, with genes in rows |
topgenes |
number of genes used for clustering calculations. |
scale |
a logical of whether we want to scale the matrix |
thet |
numeric; Speed/accuracy trade-off (increase for less accuracy) |
perp |
numeric; Perplexity parameter (should not be bigger than 3 * perplexity < nrow(X) - 1, see details for interpretation) |
a tSNE reduced matrix containing three tSNE dimensions
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) t <-tSNE(expression.mat = assay(mixedpop1))
day2 <- day_2_cardio_cell_sample mixedpop1 <-new_scGPS_object(ExpressionMatrix = day2$dat2_counts, GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters) t <-tSNE(expression.mat = assay(mixedpop1))
Calculate variance
var_cpp(x, bias = TRUE)
var_cpp(x, bias = TRUE)
x |
a vector of gene expression. |
bias |
degree of freedom |
a variance value
var_cpp(seq_len(10^6))
var_cpp(seq_len(10^6))