scTensor
This vignette has been changed in BioC 3.14, when each data package (LRBase.XXX.eg.db) is deprecated and the way to provide LRBase data has changed to AnnotationHub-style.
We explain the way to create a matrix, in which the row names are NCBI Gene ID (ENTREZID), for specifying an input of scTensor. Typical scTensor workflow can be described as below.
library("scTensor")
library("AnnotationHub")
library("LRBaseDbi")
# Input matrix
input <- ...
sce <- SingleCellExperiment(assays=list(counts = input))
# Celltype vector
label <- ...
# LRBase.XXX.eg.db
ah <- AnnotationHub()
dbfile <- query(ah, c("LRBaseDb", "Homo sapiens", "v002"))[[1]]
LRBase.Hsa.eg.db <- LRBaseDbi::LRBaseDb(dbfile)
# Setting
cellCellSetting(sce, LRBase.Hsa.eg.db, label)
In scTensor, the row names of the input matrix is limited only NCBI Gene ID to cooperate with the other R packages (cf. data(“Germline”)). Since the user has many different types of the data matrix, here we introduce some situations and the way to convert the row names of the user’s matrix as NCBI Gene ID.
First of all, we have to prepare the expression matrix (gene × cell). There are many types of single-cell RNA-Seq (scRNA-Seq) technologies and the situation will be changed by the used experimental methods and quantification tools described below.
In Plate-based scRNA-Seq experiment (i.e. Smart-Seq2, Quart-Seq2, CEL-Seq2, MARS-Seq,…etc), the FASTQ file is generated in each cell. After the mapping of reads in each FASTQ file to the reference genome, the same number of BAM files will be generated. By using some quantification tools such as featureCounts, or HTSeq-count, user can get the expression matrix and used as the input of scTensor. These tools simply count the number of reads in union-exon in each gene. One downside of these tools is that such tools do not take “multimap” of not unique reads into consideration and the quantification is not accurate. Therefore, we recommend the transcript-level quantification and gene-level summarization explained below.
Some quantification tools such as RSEM, Sailfish,
Salmon, Kallisto, and StringTie
use the reference transcriptome instead of genome, and quantify the
expression level in each transcript. After the quantification, the
transcript-level expression can be summarized to gene-level by using
summarizeToGene
function of tximport.
The paper of
tximport showed that the transcript-level expression summalized to
gene-level is more accurate than the gene-level expression calculated by
featureCounts.
Note that if you use the reference transcriptome of GENCODE, this step becomes slightly complicated. Although the number of transcripts of GENCODE and that of Ensembl is almost the same, and actually, most of the transcript is duplicated in these two databases, the gene identifier used in GENCODE looks complicated like “ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|”. In such a case, firstly only Ensembl Transcript ID should be extracted (e.g. ENST00000456328.2), removed the version (e.g. ENST00000456328), summarized to Ensembl Gene ID by tximport (e.g. ENSG00000223972), and then converted to NCBI Gene ID (e.g. 100287102) by each organism package such as Homo.sapiens.
In the droplet-based scRNA-Seq experiment (i.e. Drop-Seq, inDrop RNA-Seq, 10X Chromium), unique molecular identifier (UMI) is introduced for avoiding the bias of PCR amplification, and after multiplexing by cellular barcode, digital gene expression (DGE) matrix is generated by counting the number of types of UMI mapped in each gene.
When user perform Drop-seq, Drop-Seq tool can generate the DGE matrix.
Another tool Alevin, which is a subcommand of Salmon is also applicable to Drop-seq data. In such case [tximport] with option “type = ‘alevin’” can import the result of Alevin into R and summarize the DGE matrix.
When the user performs 10X Chromium, using Cell Ranger developed by 10X Genomics is straightforward.
Although Cell Ranger is implemented by Python, starting from the files generated by Cell Ranger (e.g. filtered_gene_bc_matrices/{hg19,mm10}/{barcodes.tsv,genes.tsv,matrix.mtx}), Seurat can import these files to an R object.
For example, according to the tutorial of Seurat (Seurat -
Guided Clustering Tutorial), PBMC data of Cell Ranger can be
imported by the Read10X
function and DGE matrix of
UMI-count is available by the output of CreateSeuratObject
function.
if(!require(Seurat)){
BiocManager::install("Seurat")
library(Seurat)
}
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data,
project = "pbmc3k", min.cells = 3, min.features = 200)
Note that the matrix is formatted as a sparse matrix of
Matrix
package (MM: Matrix market), but the scTensor
assumes dense matrix for now. By using as.matrix
function, the sparse matrix is easily converted to a dense matrix as
follows.
# Sparse matrix to dense matrix
for_sc <- as.matrix(pbmc.data)
Even after creating the gene-level expression matrix in Step.1, many
kinds of gene-level gene identifiers can be assigned as row names of the
matrix such as Ensembl Gene ID, RefSeq, or Gene Symbol. Again, only NCBI
Gene ID can be used as row names of the input matrix of scTensor.
To do such a task, we originally implemented a function
convertRowID
function of scTGIF.
The only user has to prepare for using this function is the 1. input
matrix (or data.frame) filled with only numbers, 2. current gene-level
gene identifier in each row of the input matrix, and 3. corresponding
table containing current gene-level gene identifier (left) and
corresponding NCBI Gene ID (right). The usage of this function is
explained below.
In addition to 1. and 2., the user has to prepare the 3. corresponding table. Here we introduce two approaches to assign the user’s Ensembl Gene ID to NCBI Gene ID. First approarch is using Organism DB packages such as Homo.sapiens, Mus.musculus, and Rattus.norvegicus.
Using the select
function of Organism DB, the
corresponding table can be retrieved like below.
suppressPackageStartupMessages(library("scTensor"))
if(!require(Homo.sapiens)){
BiocManager::install("Homo.sapiens")
suppressPackageStartupMessages(library(Homo.sapiens))
}
## Loading required package: Homo.sapiens
## Loading required package: AnnotationDbi
## Loading required package: OrganismDbi
## Loading required package: GenomicFeatures
## Loading required package: GO.db
## Loading required package: org.Hs.eg.db
##
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
if(!require(scTGIF)){
BiocManager::install("scTGIF")
suppressPackageStartupMessages(library(scTGIF))
}
## Loading required package: scTGIF
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("ENSG00000204531", "ENSG00000181449",
"ENSG00000136997", "ENSG00000136826")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
column=c("ENSEMBL", "ENTREZID"),
keytype="ENSEMBL", keys=rowID)
## 'select()' returned 1:1 mapping between keys and columns
## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
## by sum rule
## $output
## [,1] [,2] [,3] [,4] [,5]
## 5460 1 5 9 13 17
## 6657 2 6 10 14 18
## 4609 3 7 11 15 19
## 9314 4 8 12 16 20
##
## $ctable
## Left Right
## [1,] "ENSG00000204531" "5460"
## [2,] "ENSG00000181449" "6657"
## [3,] "ENSG00000136997" "4609"
## [4,] "ENSG00000136826" "9314"
Second approarch is using AnnotationHub package.
Although only three Organism DB packages are explicitly developed,
even if the data is generated from other species (e.g. Zebrafish,
Arabidopsis thaliana), similar database is also available from AnnotationHub,
and select
function can be performed like below.
suppressPackageStartupMessages(library("AnnotationHub"))
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## 'select()' returned 1:1 mapping between keys and columns
## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
## by sum rule
## $output
## [,1] [,2] [,3] [,4] [,5]
## 5460 1 5 9 13 17
## 6657 2 6 10 14 18
## 4609 3 7 11 15 19
## 9314 4 8 12 16 20
##
## $ctable
## Left Right
## [1,] "ENSG00000204531" "5460"
## [2,] "ENSG00000181449" "6657"
## [3,] "ENSG00000136997" "4609"
## [4,] "ENSG00000136826" "9314"
When using cellranger or Seurat to quantify UMI-count (cf. Step1, Case III), the row names of the input matrix might be Gene Symbol, and have to be converted to NCBI Gene ID. As well as the Case I described above, Organism DB and AnnotationHub will support such a task like below.
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("POU5F1", "SOX2", "MYC", "KLF4")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
column=c("SYMBOL", "ENTREZID"),
keytype="SYMBOL", keys=rowID)
## 'select()' returned 1:1 mapping between keys and columns
## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
## by sum rule
## $output
## [,1] [,2] [,3] [,4] [,5]
## 5460 1 5 9 13 17
## 6657 2 6 10 14 18
## 4609 3 7 11 15 19
## 9314 4 8 12 16 20
##
## $ctable
## Left Right
## [1,] "POU5F1" "5460"
## [2,] "SOX2" "6657"
## [3,] "MYC" "4609"
## [4,] "KLF4" "9314"
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
## loading from cache
## 'select()' returned 1:1 mapping between keys and columns
## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
## by sum rule
## $output
## [,1] [,2] [,3] [,4] [,5]
## 5460 1 5 9 13 17
## 6657 2 6 10 14 18
## 4609 3 7 11 15 19
## 9314 4 8 12 16 20
##
## $ctable
## Left Right
## [1,] "POU5F1" "5460"
## [2,] "SOX2" "6657"
## [3,] "MYC" "4609"
## [4,] "KLF4" "9314"
Finally, we introduce some situations to perform some normalization methods of gene expression matrix.
If a user converts a Seurat object to a SingleCellExperient object by
using as.SingleCellExperiment
, the result of the
NormalizeData
function (log counts) is inherited to the
SingleCellExperient object as follows;
pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize",
scale.factor = 10000)
sce <- as.SingleCellExperiment(pbmc2)
assayNames(sce) # counts, logcounts
If the user want to use scater
package, calculateCPM
or normalize
function
can calculate the normalized expression values as follows; (see also the
vignette of scater).
if(!require(scater)){
BiocManager::install("scater")
library(scater)
}
sce <- SingleCellExperiment(assays=list(counts = input))
cpm(sce) <- calculateCPM(sce)
sce <- normalize(sce)
assayNames(sce) # counts, normcounts, logcounts, cpm
Any original normalization can be stored in the sce. For example, we can calculate the value of count per median (CPMED) as follows;
# User's Original Normalization Function
CPMED <- function(input){
libsize <- colSums(input)
median(libsize) * t(t(input) / libsize)
}
# Normalization
normcounts(sce) <- log10(CPMED(counts(sce)) + 1)
We recommend using the normcounts slot to save such original
normalization values. After the normalization, such values can be
specified by assayNames
option in
cellCellRanks
cellCellDecomp
and
cellCellReport
functions.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scTGIF_1.21.0
## [2] Homo.sapiens_1.3.1
## [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [4] org.Hs.eg.db_3.20.0
## [5] GO.db_3.20.0
## [6] OrganismDbi_1.49.0
## [7] GenomicFeatures_1.59.1
## [8] AnnotationDbi_1.69.0
## [9] SingleCellExperiment_1.29.1
## [10] SummarizedExperiment_1.37.0
## [11] Biobase_2.67.0
## [12] GenomicRanges_1.59.1
## [13] GenomeInfoDb_1.43.2
## [14] IRanges_2.41.2
## [15] S4Vectors_0.45.2
## [16] MatrixGenerics_1.19.0
## [17] matrixStats_1.4.1
## [18] scTensor_2.17.0
## [19] RSQLite_2.3.9
## [20] LRBaseDbi_2.17.0
## [21] AnnotationHub_3.15.0
## [22] BiocFileCache_2.15.0
## [23] dbplyr_2.5.0
## [24] BiocGenerics_0.53.3
## [25] generics_0.1.3
## [26] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.5 bitops_1.0-9 enrichplot_1.27.3
## [4] httr_1.4.7 webshot_0.5.5 RColorBrewer_1.1-3
## [7] Rgraphviz_2.51.0 tools_4.4.2 backports_1.5.0
## [10] R6_2.5.1 lazyeval_0.2.2 withr_3.0.2
## [13] prettyunits_1.2.0 graphite_1.53.0 gridExtra_2.3
## [16] schex_1.21.0 fdrtool_1.2.18 cli_3.6.3
## [19] TSP_1.2-4 entropy_1.3.1 sass_0.4.9
## [22] genefilter_1.89.0 meshr_2.13.0 Rsamtools_2.23.1
## [25] yulab.utils_0.1.8 txdbmaker_1.3.1 gson_0.1.0
## [28] DOSE_4.1.0 R.utils_2.12.3 MeSHDbi_1.43.0
## [31] AnnotationForge_1.49.0 nnTensor_1.3.0 plotrix_3.8-4
## [34] maps_3.4.2.1 visNetwork_2.1.2 gridGraphics_0.5-1
## [37] GOstats_2.73.0 BiocIO_1.17.1 dplyr_1.1.4
## [40] dendextend_1.19.0 Matrix_1.7-1 abind_1.4-8
## [43] R.methodsS3_1.8.2 lifecycle_1.0.4 yaml_2.3.10
## [46] qvalue_2.39.0 SparseArray_1.7.2 grid_4.4.2
## [49] blob_1.2.4 misc3d_0.9-1 crayon_1.5.3
## [52] ggtangle_0.0.6 lattice_0.22-6 msigdbr_7.5.1
## [55] cowplot_1.1.3 annotate_1.85.0 KEGGREST_1.47.0
## [58] sys_3.4.3 maketools_1.3.1 pillar_1.10.0
## [61] knitr_1.49 fgsea_1.33.2 tcltk_4.4.2
## [64] rjson_0.2.23 codetools_0.2-20 fastmatch_1.1-6
## [67] glue_1.8.0 outliers_0.15 ggfun_0.1.8
## [70] data.table_1.16.4 vctrs_0.6.5 png_0.1-8
## [73] treeio_1.31.0 spam_2.11-0 rTensor_1.4.8
## [76] gtable_0.3.6 assertthat_0.2.1 cachem_1.1.0
## [79] xfun_0.49 S4Arrays_1.7.1 mime_0.12
## [82] tidygraph_1.3.1 survival_3.8-3 seriation_1.5.7
## [85] iterators_1.0.14 fields_16.3 nlme_3.1-166
## [88] Category_2.73.0 ggtree_3.15.0 bit64_4.5.2
## [91] progress_1.2.3 filelock_1.0.3 bslib_0.8.0
## [94] colorspace_2.1-1 DBI_1.2.3 tidyselect_1.2.1
## [97] bit_4.5.0.1 compiler_4.4.2 curl_6.0.1
## [100] httr2_1.0.7 graph_1.85.0 xml2_1.3.6
## [103] DelayedArray_0.33.3 plotly_4.10.4 rtracklayer_1.67.0
## [106] checkmate_2.3.2 scales_1.3.0 hexbin_1.28.5
## [109] RBGL_1.83.0 plot3D_1.4.1 rappdirs_0.3.3
## [112] stringr_1.5.1 digest_0.6.37 rmarkdown_2.29
## [115] ca_0.71.1 XVector_0.47.1 htmltools_0.5.8.1
## [118] pkgconfig_2.0.3 fastmap_1.2.0 rlang_1.1.4
## [121] htmlwidgets_1.6.4 UCSC.utils_1.3.0 farver_2.1.2
## [124] jquerylib_0.1.4 jsonlite_1.8.9 BiocParallel_1.41.0
## [127] GOSemSim_2.33.0 R.oo_1.27.0 RCurl_1.98-1.16
## [130] magrittr_2.0.3 GenomeInfoDbData_1.2.13 ggplotify_0.1.2
## [133] dotCall64_1.2 patchwork_1.3.0 munsell_0.5.1
## [136] Rcpp_1.0.13-1 babelgene_22.9 ape_5.8-1
## [139] viridis_0.6.5 stringi_1.8.4 tagcloud_0.6
## [142] ggraph_2.2.1 zlibbioc_1.52.0 MASS_7.3-61
## [145] plyr_1.8.9 parallel_4.4.2 ggrepel_0.9.6
## [148] Biostrings_2.75.3 graphlayouts_1.2.1 splines_4.4.2
## [151] hms_1.1.3 igraph_2.1.2 buildtools_1.0.0
## [154] biomaRt_2.63.0 reshape2_1.4.4 BiocVersion_3.21.1
## [157] XML_3.99-0.17 evaluate_1.0.1 BiocManager_1.30.25
## [160] foreach_1.5.2 tweenr_2.0.3 tidyr_1.3.1
## [163] purrr_1.0.2 polyclip_1.10-7 heatmaply_1.5.0
## [166] ggplot2_3.5.1 ReactomePA_1.51.0 ggforce_0.4.2
## [169] xtable_1.8-4 restfulr_0.0.15 reactome.db_1.89.0
## [172] tidytree_0.4.6 viridisLite_0.4.2 tibble_3.2.1
## [175] aplot_0.2.4 ccTensor_1.0.2 memoise_2.0.1
## [178] registry_0.5-1 GenomicAlignments_1.43.0 cluster_2.1.8
## [181] concaveman_1.1.0 GSEABase_1.69.0