This package provides a feature selection method for single-cell RNA-seq data. It encodes the category number and calculate the spearman correlation coefficient.
The encoding the category number (CAEN) method is developed for selecting feature genes, which is differentially expressed between classes. We have implemented the CAEN method via a set of R functions. We make a R package named CAEN, and give a tutorial for the package. The method consist three steps.
Step 1: Data Pre-processing;
Step 2: Encoding the category number; Calculating the spearman correlation coefficient between the gene and category number;
Step 3: Calculate the classification error using the genes selected.
We employ a simulation dataset to illustrate the usage of the CAEN package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 4.0.
To install the CAEN package into your R environment, start R and enter:
Then, the CAEN package is ready to load.
In order to show the CAEN workflow, the package includes the example
data sets, which is a real dataset from Woo, Y., Kim, D., Koo, N.,
Kim, Y., Lee, S., Ko, J., et al. (2017). Profiling of mirnas and target
genes related to cystogenesis in adpkd mouse models. scientific reports
7, 14151.
The next we will show the dataset in the package.
The dataset is in the data subdirectory of CAEN package, to consistent with standard Bioconductor representations, we transform the format of dataset as SummarizedExperiment, please refer R package * SummarizedExperiment* for details. In the dataset, gene expression is compared between two mouse models, Pkd1f/f: HoxB7-cre mice and Pkd2f/f: HoxB7-cre mice. Each group includes 18 samples, and there are total 29996 transcripts in this dataset. The next we will load the data and use CAEN function to find the most differentially expressed genes.
## class: SummarizedExperiment
## dim: 29996 36
## metadata(0):
## assays(1): ''
## rownames(29996): NM_001001144 NM_001001152 ... NM_213730 NM_213733
## rowData names(0):
## colnames(36): Pkd1f_P1_KO1 Pkd1f_P1_KO2 ... Pkd2f_P7_WT2 Pkd2f_P7_WT3
## colData names(0):
readData includes 2 groups, each group includes 18 samples, and there are total 29996 transcripts. * the columns are samples of dataset; * the rows are transcripts of dataset.
x <- realData
y <- c(rep(1,18),rep(2,18))
xte <- realData
prob <- estimatep(x = x, y = y, xte = x, beta = 1, type = "mle", prior = NULL)
prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1, type = "mle",
prior = NULL)
myscore <- CAEN(dataTable = x, y = y, K = 2, gene_no_list = 100)
The output of the function CAEN is: A list of computed correlation coefficient and the first some differentially expressed genes , where “r” represents correlation coefficient between gene and category number, and “np” represents the top differential feature label.
Getting the important gene, we Calculate classification error rate using genes selected. The step is as follows:
ddd <- myscore$np
datx <- t(assay(x)[ddd,])
datxte <- t(assay(xte)[ddd,])
probb <- prob[ddd,]
probb0 <- prob0[ddd,]
zipldacv.out <- ZIPLDA.cv(x = datx, y = y, prob0 = t(probb))
## 12345
ZIPLDA.out <- ZIPLDA(x = datx, y = y, xte = datxte, transform = FALSE,
prob0 = t(probb0),rho = zipldacv.out$bestrho)
classResult <- ZIPLDA.out$ytehat
In order to reproduce the presented CAEN workflow, the package also includes the simulated data sets, which is generated by function newCountDataSet(). The next we will give an example for how to generate simulation dataset:
The output of the function newCountDataSet() includes: *
“sim_train_data” represents training data of qn data matrix.
“sim_test_data” represents test data of qn data matrix.
The colnames of this two matrix are class labels for the n observations.
May have q<p because features with 0 total counts are removed. The q
features are those with >0 total counts in dataset. So q <=
p.
“truesf” denotes size factors for training observations.
* “isDE” represnts the differential gene label.
Calculate the spearman correlation coefficient for data For the category number, we need to consider not only the difference between class but also the Intra-category difference. Therefore, we propose CAEN method, by encoding the category number within class, it get the optimal category number and select the most important genes used for classification.
x <- t(assay(dat$sim_train_data))
y <- as.numeric(colnames(dat$sim_train_data))
xte <- t(assay(dat$sim_test_data))
prob <- estimatep(x = x, y = y, xte = x, beta = 1,
type = c("mle","deseq","quantile"),
prior = NULL)
prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1,
type = c("mle","deseq","quantile"),
prior = NULL)
myscore <- CAEN(dataTable = assay(dat$sim_train_data),
y = as.numeric(colnames(dat$sim_train_data)), K = 4,
gene_no_list = 100)
The output of the function CAEN is: A list of computed correlation coefficient and the first some differentially expressed genes , where “r” represents correlation coefficient between gene and category number, and “np” represents the top differential feature label.
Calculate classification error rate using genes selected with CAEN method Getting the important gene, we Calculate classification error rate using genes selected. The step is as follows:
ddd <- myscore$np
datx <- x[,ddd]
datxte <- xte[,ddd]
probb <- prob[ddd,]
probb0 <- prob0[ddd,]
zipldacv.out <- ZIPLDA.cv(x = datx, y = y, prob0 = t(probb))
## 12345
ZIPLDA.out <- ZIPLDA(x = datx, y = y,
xte = datxte, transform = FALSE, prob0 = t(probb0),
rho = zipldacv.out$bestrho)
classResult <- ZIPLDA.out$ytehat
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [3] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [5] IRanges_2.41.1 S4Vectors_0.45.2
## [7] BiocGenerics_0.53.3 generics_0.1.3
## [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [11] CAEN_1.15.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9 crayon_1.5.3
## [4] compiler_4.4.2 BiocManager_1.30.25 jquerylib_0.1.4
## [7] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [10] R6_2.5.1 XVector_0.47.0 S4Arrays_1.7.1
## [13] knitr_1.49 DelayedArray_0.33.2 maketools_1.3.1
## [16] GenomeInfoDbData_1.2.13 bslib_0.8.0 rlang_1.1.4
## [19] cachem_1.1.0 xfun_0.49 sass_0.4.9
## [22] sys_3.4.3 PoiClaClu_1.0.2.1 SparseArray_1.7.2
## [25] cli_3.6.3 zlibbioc_1.52.0 grid_4.4.2
## [28] digest_0.6.37 lifecycle_1.0.4 evaluate_1.0.1
## [31] buildtools_1.0.0 abind_1.4-8 rmarkdown_2.29
## [34] httr_1.4.7 tools_4.4.2 htmltools_0.5.8.1
## [37] UCSC.utils_1.3.0