Roadmap to prepare the input matrix for scTensor

This vignette has been changed in BioC 3.14, when each data package (LRBase.XXX.eg.db) is deprecated and the way to provide LRBase data has changed to AnnotationHub-style.

Introduction

We explain the way to create a matrix, in which the row names are NCBI Gene ID (ENTREZID), for specifying an input of scTensor. Typical scTensor workflow can be described as below.

library("scTensor")
library("AnnotationHub")
library("LRBaseDbi")

# Input matrix
input <- ...
sce <- SingleCellExperiment(assays=list(counts = input))
# Celltype vector
label <- ...
# LRBase.XXX.eg.db
ah <- AnnotationHub()
dbfile <- query(ah, c("LRBaseDb", "Homo sapiens", "v002"))[[1]]
LRBase.Hsa.eg.db <- LRBaseDbi::LRBaseDb(dbfile)
# Setting
cellCellSetting(sce, LRBase.Hsa.eg.db, label)

In scTensor, the row names of the input matrix is limited only NCBI Gene ID to cooperate with the other R packages (cf. data(“Germline”)). Since the user has many different types of the data matrix, here we introduce some situations and the way to convert the row names of the user’s matrix as NCBI Gene ID.

Step.1: Create a gene-level expression matrix

First of all, we have to prepare the expression matrix (gene × cell). There are many types of single-cell RNA-Seq (scRNA-Seq) technologies and the situation will be changed by the used experimental methods and quantification tools described below.

Case I: Gene-level quantification

In Plate-based scRNA-Seq experiment (i.e. Smart-Seq2, Quart-Seq2, CEL-Seq2, MARS-Seq,…etc), the FASTQ file is generated in each cell. After the mapping of reads in each FASTQ file to the reference genome, the same number of BAM files will be generated. By using some quantification tools such as featureCounts, or HTSeq-count, user can get the expression matrix and used as the input of scTensor. These tools simply count the number of reads in union-exon in each gene. One downside of these tools is that such tools do not take “multimap” of not unique reads into consideration and the quantification is not accurate. Therefore, we recommend the transcript-level quantification and gene-level summarization explained below.

Case II: Transcript-level quantification

Some quantification tools such as RSEM, Sailfish, Salmon, Kallisto, and StringTie use the reference transcriptome instead of genome, and quantify the expression level in each transcript. After the quantification, the transcript-level expression can be summarized to gene-level by using summarizeToGene function of tximport. The paper of tximport showed that the transcript-level expression summalized to gene-level is more accurate than the gene-level expression calculated by featureCounts.

Note that if you use the reference transcriptome of GENCODE, this step becomes slightly complicated. Although the number of transcripts of GENCODE and that of Ensembl is almost the same, and actually, most of the transcript is duplicated in these two databases, the gene identifier used in GENCODE looks complicated like “ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|”. In such a case, firstly only Ensembl Transcript ID should be extracted (e.g. ENST00000456328.2), removed the version (e.g. ENST00000456328), summarized to Ensembl Gene ID by tximport (e.g. ENSG00000223972), and then converted to NCBI Gene ID (e.g. 100287102) by each organism package such as Homo.sapiens.

Case III: UMI-count

In the droplet-based scRNA-Seq experiment (i.e. Drop-Seq, inDrop RNA-Seq, 10X Chromium), unique molecular identifier (UMI) is introduced for avoiding the bias of PCR amplification, and after multiplexing by cellular barcode, digital gene expression (DGE) matrix is generated by counting the number of types of UMI mapped in each gene.

When user perform Drop-seq, Drop-Seq tool can generate the DGE matrix.

Another tool Alevin, which is a subcommand of Salmon is also applicable to Drop-seq data. In such case [tximport] with option “type = ‘alevin’” can import the result of Alevin into R and summarize the DGE matrix.

When the user performs 10X Chromium, using Cell Ranger developed by 10X Genomics is straightforward.

Although Cell Ranger is implemented by Python, starting from the files generated by Cell Ranger (e.g. filtered_gene_bc_matrices/{hg19,mm10}/{barcodes.tsv,genes.tsv,matrix.mtx}), Seurat can import these files to an R object.

For example, according to the tutorial of Seurat (Seurat - Guided Clustering Tutorial), PBMC data of Cell Ranger can be imported by the Read10X function and DGE matrix of UMI-count is available by the output of CreateSeuratObject function.

if(!require(Seurat)){
    BiocManager::install("Seurat")
    library(Seurat)
}

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data,
  project = "pbmc3k", min.cells = 3, min.features = 200)

Note that the matrix is formatted as a sparse matrix of Matrix package (MM: Matrix market), but the scTensor assumes dense matrix for now. By using as.matrix function, the sparse matrix is easily converted to a dense matrix as follows.

# Sparse matrix to dense matrix
for_sc <- as.matrix(pbmc.data)

Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)

Even after creating the gene-level expression matrix in Step.1, many kinds of gene-level gene identifiers can be assigned as row names of the matrix such as Ensembl Gene ID, RefSeq, or Gene Symbol. Again, only NCBI Gene ID can be used as row names of the input matrix of scTensor. To do such a task, we originally implemented a function convertRowID function of scTGIF. The only user has to prepare for using this function is the 1. input matrix (or data.frame) filled with only numbers, 2. current gene-level gene identifier in each row of the input matrix, and 3. corresponding table containing current gene-level gene identifier (left) and corresponding NCBI Gene ID (right). The usage of this function is explained below.

Case I: Ensembl Gene ID to NCBI Gene ID

In addition to 1. and 2., the user has to prepare the 3. corresponding table. Here we introduce two approaches to assign the user’s Ensembl Gene ID to NCBI Gene ID. First approarch is using Organism DB packages such as Homo.sapiens, Mus.musculus, and Rattus.norvegicus.

Using the select function of Organism DB, the corresponding table can be retrieved like below.

suppressPackageStartupMessages(library("scTensor"))
if(!require(Homo.sapiens)){
    BiocManager::install("Homo.sapiens")
    suppressPackageStartupMessages(library(Homo.sapiens))
}
## Loading required package: Homo.sapiens
## Loading required package: AnnotationDbi
## Loading required package: OrganismDbi
## Loading required package: GenomicFeatures
## Loading required package: GO.db
## Loading required package: org.Hs.eg.db
## 
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
if(!require(scTGIF)){
    BiocManager::install("scTGIF")
    suppressPackageStartupMessages(library(scTGIF))
}
## Loading required package: scTGIF
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("ENSG00000204531", "ENSG00000181449",
  "ENSG00000136997", "ENSG00000136826")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)
## 'select()' returned 1:1 mapping between keys and columns
# ID conversion
(input <- convertRowID(input, rowID, LefttoRight))
##   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
##   by sum rule
## $output
##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20
## 
## $ctable
##      Left              Right 
## [1,] "ENSG00000204531" "5460"
## [2,] "ENSG00000181449" "6657"
## [3,] "ENSG00000136997" "4609"
## [4,] "ENSG00000136826" "9314"

Second approarch is using AnnotationHub package.

Although only three Organism DB packages are explicitly developed, even if the data is generated from other species (e.g. Zebrafish, Arabidopsis thaliana), similar database is also available from AnnotationHub, and select function can be performed like below.

suppressPackageStartupMessages(library("AnnotationHub"))

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
## downloading 1 resources
## retrieving 1 resource
## loading from cache
LefttoRight <- select(hs,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)
## 'select()' returned 1:1 mapping between keys and columns
(input <- convertRowID(input, rowID, LefttoRight))
##   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
##   by sum rule
## $output
##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20
## 
## $ctable
##      Left              Right 
## [1,] "ENSG00000204531" "5460"
## [2,] "ENSG00000181449" "6657"
## [3,] "ENSG00000136997" "4609"
## [4,] "ENSG00000136826" "9314"

Case II: Gene Symbol to NCBI Gene ID

When using cellranger or Seurat to quantify UMI-count (cf. Step1, Case III), the row names of the input matrix might be Gene Symbol, and have to be converted to NCBI Gene ID. As well as the Case I described above, Organism DB and AnnotationHub will support such a task like below.

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("POU5F1", "SOX2", "MYC", "KLF4")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)
## 'select()' returned 1:1 mapping between keys and columns
# ID conversion
(input <- convertRowID(input, rowID, LefttoRight))
##   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
##   by sum rule
## $output
##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20
## 
## $ctable
##      Left     Right 
## [1,] "POU5F1" "5460"
## [2,] "SOX2"   "6657"
## [3,] "MYC"    "4609"
## [4,] "KLF4"   "9314"
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
## loading from cache
LefttoRight <- select(hs,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)
## 'select()' returned 1:1 mapping between keys and columns
(input <- convertRowID(input, rowID, LefttoRight))
##   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
## Input matrix: 4x5
## Output matrix: 4x5
## Some gene expression vectors were collapsed into single vector
##   by sum rule
## $output
##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20
## 
## $ctable
##      Left     Right 
## [1,] "POU5F1" "5460"
## [2,] "SOX2"   "6657"
## [3,] "MYC"    "4609"
## [4,] "KLF4"   "9314"

Step.3: Normalize the count matrix

Finally, we introduce some situations to perform some normalization methods of gene expression matrix.

If a user converts a Seurat object to a SingleCellExperient object by using as.SingleCellExperiment, the result of the NormalizeData function (log counts) is inherited to the SingleCellExperient object as follows;

pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize",
    scale.factor = 10000)
sce <- as.SingleCellExperiment(pbmc2)
assayNames(sce) # counts, logcounts

If the user want to use scater package, calculateCPM or normalize function can calculate the normalized expression values as follows; (see also the vignette of scater).

if(!require(scater)){
    BiocManager::install("scater")
    library(scater)
}
sce <- SingleCellExperiment(assays=list(counts = input))
cpm(sce) <- calculateCPM(sce)
sce <- normalize(sce)
assayNames(sce) # counts, normcounts, logcounts, cpm

Any original normalization can be stored in the sce. For example, we can calculate the value of count per median (CPMED) as follows;

# User's Original Normalization Function
CPMED <- function(input){
    libsize <- colSums(input)
    median(libsize) * t(t(input) / libsize)
}
# Normalization
normcounts(sce) <- log10(CPMED(counts(sce)) + 1)

We recommend using the normcounts slot to save such original normalization values. After the normalization, such values can be specified by assayNames option in cellCellRanks cellCellDecomp and cellCellReport functions.

Session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scTGIF_1.20.0                          
##  [2] Homo.sapiens_1.3.1                     
##  [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [4] org.Hs.eg.db_3.20.0                    
##  [5] GO.db_3.20.0                           
##  [6] OrganismDbi_1.49.0                     
##  [7] GenomicFeatures_1.59.0                 
##  [8] AnnotationDbi_1.69.0                   
##  [9] SingleCellExperiment_1.28.0            
## [10] SummarizedExperiment_1.36.0            
## [11] Biobase_2.67.0                         
## [12] GenomicRanges_1.59.0                   
## [13] GenomeInfoDb_1.43.0                    
## [14] IRanges_2.41.0                         
## [15] S4Vectors_0.44.0                       
## [16] MatrixGenerics_1.19.0                  
## [17] matrixStats_1.4.1                      
## [18] scTensor_2.17.0                        
## [19] RSQLite_2.3.7                          
## [20] LRBaseDbi_2.17.0                       
## [21] AnnotationHub_3.15.0                   
## [22] BiocFileCache_2.15.0                   
## [23] dbplyr_2.5.0                           
## [24] BiocGenerics_0.53.0                    
## [25] BiocStyle_2.35.0                       
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.5                 bitops_1.0-9             enrichplot_1.27.1       
##   [4] httr_1.4.7               webshot_0.5.5            RColorBrewer_1.1-3      
##   [7] Rgraphviz_2.50.0         tools_4.4.1              backports_1.5.0         
##  [10] utf8_1.2.4               R6_2.5.1                 lazyeval_0.2.2          
##  [13] withr_3.0.2              prettyunits_1.2.0        graphite_1.53.0         
##  [16] gridExtra_2.3            schex_1.20.0             fdrtool_1.2.18          
##  [19] cli_3.6.3                TSP_1.2-4                entropy_1.3.1           
##  [22] sass_0.4.9               genefilter_1.89.0        meshr_2.13.0            
##  [25] Rsamtools_2.22.0         yulab.utils_0.1.7        txdbmaker_1.2.0         
##  [28] gson_0.1.0               DOSE_4.1.0               R.utils_2.12.3          
##  [31] MeSHDbi_1.43.0           AnnotationForge_1.49.0   nnTensor_1.3.0          
##  [34] plotrix_3.8-4            maps_3.4.2               visNetwork_2.1.2        
##  [37] generics_0.1.3           gridGraphics_0.5-1       GOstats_2.73.0          
##  [40] BiocIO_1.17.0            dplyr_1.1.4              dendextend_1.18.1       
##  [43] Matrix_1.7-1             fansi_1.0.6              abind_1.4-8             
##  [46] R.methodsS3_1.8.2        lifecycle_1.0.4          yaml_2.3.10             
##  [49] qvalue_2.38.0            SparseArray_1.6.0        grid_4.4.1              
##  [52] blob_1.2.4               misc3d_0.9-1             crayon_1.5.3            
##  [55] ggtangle_0.0.4           lattice_0.22-6           msigdbr_7.5.1           
##  [58] cowplot_1.1.3            annotate_1.85.0          KEGGREST_1.47.0         
##  [61] sys_3.4.3                maketools_1.3.1          pillar_1.9.0            
##  [64] knitr_1.48               fgsea_1.33.0             tcltk_4.4.1             
##  [67] rjson_0.2.23             codetools_0.2-20         fastmatch_1.1-4         
##  [70] glue_1.8.0               outliers_0.15            ggfun_0.1.7             
##  [73] data.table_1.16.2        vctrs_0.6.5              png_0.1-8               
##  [76] treeio_1.30.0            spam_2.11-0              rTensor_1.4.8           
##  [79] gtable_0.3.6             assertthat_0.2.1         cachem_1.1.0            
##  [82] xfun_0.48                S4Arrays_1.6.0           mime_0.12               
##  [85] tidygraph_1.3.1          survival_3.7-0           seriation_1.5.6         
##  [88] iterators_1.0.14         fields_16.3              nlme_3.1-166            
##  [91] Category_2.73.0          ggtree_3.15.0            bit64_4.5.2             
##  [94] progress_1.2.3           filelock_1.0.3           bslib_0.8.0             
##  [97] colorspace_2.1-1         DBI_1.2.3                tidyselect_1.2.1        
## [100] bit_4.5.0                compiler_4.4.1           curl_5.2.3              
## [103] httr2_1.0.5              graph_1.85.0             xml2_1.3.6              
## [106] DelayedArray_0.33.1      plotly_4.10.4            rtracklayer_1.66.0      
## [109] checkmate_2.3.2          scales_1.3.0             hexbin_1.28.4           
## [112] RBGL_1.82.0              plot3D_1.4.1             rappdirs_0.3.3          
## [115] stringr_1.5.1            digest_0.6.37            rmarkdown_2.28          
## [118] ca_0.71.1                XVector_0.46.0           htmltools_0.5.8.1       
## [121] pkgconfig_2.0.3          highr_0.11               fastmap_1.2.0           
## [124] rlang_1.1.4              htmlwidgets_1.6.4        UCSC.utils_1.2.0        
## [127] farver_2.1.2             jquerylib_0.1.4          jsonlite_1.8.9          
## [130] BiocParallel_1.41.0      GOSemSim_2.33.0          R.oo_1.26.0             
## [133] RCurl_1.98-1.16          magrittr_2.0.3           GenomeInfoDbData_1.2.13 
## [136] ggplotify_0.1.2          dotCall64_1.2            patchwork_1.3.0         
## [139] munsell_0.5.1            Rcpp_1.0.13              babelgene_22.9          
## [142] ape_5.8                  viridis_0.6.5            stringi_1.8.4           
## [145] tagcloud_0.6             ggraph_2.2.1             zlibbioc_1.52.0         
## [148] MASS_7.3-61              plyr_1.8.9               parallel_4.4.1          
## [151] ggrepel_0.9.6            Biostrings_2.75.0        graphlayouts_1.2.0      
## [154] splines_4.4.1            hms_1.1.3                igraph_2.1.1            
## [157] buildtools_1.0.0         biomaRt_2.63.0           reshape2_1.4.4          
## [160] BiocVersion_3.21.1       XML_3.99-0.17            evaluate_1.0.1          
## [163] BiocManager_1.30.25      foreach_1.5.2            tweenr_2.0.3            
## [166] tidyr_1.3.1              purrr_1.0.2              polyclip_1.10-7         
## [169] heatmaply_1.5.0          ggplot2_3.5.1            ReactomePA_1.50.0       
## [172] ggforce_0.4.2            xtable_1.8-4             restfulr_0.0.15         
## [175] reactome.db_1.89.0       tidytree_0.4.6           viridisLite_0.4.2       
## [178] tibble_3.2.1             aplot_0.2.3              ccTensor_1.0.2          
## [181] GenomicAlignments_1.43.0 memoise_2.0.1            registry_0.5-1          
## [184] cluster_2.1.6            concaveman_1.1.0         GSEABase_1.69.0