scFeatures is a tool for generating multi-view representations of samples in a single-cell dataset. This vignette provides an overview of scFeatures. It uses the main function to generate features and then illustrates case studies of using the generated features for classification, survival analysis and association study.
scFeatures can be run using one line of code, scfeatures_result <- scFeatures(data), which generates a list of dataframes containing all feature types in the form of samples x features.
data("example_scrnaseq", package = "scFeatures")
data <- example_scrnaseq
scfeatures_result <- scFeatures(
    data = data@assays$RNA@data, sample = data$sample, celltype = data$celltype,
    feature_types = "gene_mean_celltype",
    type = "scrna",
    ncores = 1,
    species = "Homo sapiens"
)
By default, the above function generates all feature types. To reduce the computational time for this demonstration, here we generate only the selected feature type “gene mean celltype”. More information on customising the function can be obtained by typing ?scFeatures()
To build a disease prediction model from the generated features, we utilise ClassifyR.
The output from scFeatures is a matrix of samples x features, i.e., each row corresponds to a sample and each column corresponds to a feature, and it can be used directly as the feature matrix X. The rows are in the order of unique(data$sample).
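Because the rows follow unique(data$sample), any per-sample label has to be aligned to the row names before modelling. The base R sketch below illustrates this alignment on toy data; the objects cell_meta, toy_X and toy_y are hypothetical and are not part of scFeatures.

```r
# Hypothetical per-cell metadata: one row per cell
cell_meta <- data.frame(
    sample = c("P2", "P2", "P1", "P3", "P1", "P3"),
    condition = c(
        "Responder", "Responder", "Non-responder",
        "Responder", "Non-responder", "Responder"
    )
)

# Hypothetical feature matrix whose rows follow unique(cell_meta$sample)
toy_X <- matrix(
    1:6,
    nrow = 3,
    dimnames = list(unique(cell_meta$sample), c("feat_a", "feat_b"))
)

# Collapse to one metadata row per sample, then align to the matrix rows
sample_meta <- cell_meta[!duplicated(cell_meta$sample), ]
toy_y <- sample_meta$condition[match(rownames(toy_X), sample_meta$sample)]

rownames(toy_X)
toy_y
```

Using match() on the row names keeps the labels correct even if the metadata happens to be stored in a different order from the feature matrix.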
Here we use the feature type gene mean celltype as an example to build a classification model on the disease condition.
feature_gene_mean_celltype <- scfeatures_result$gene_mean_celltype
# inspect the first 5 rows and first 5 columns
feature_gene_mean_celltype[1:5, 1:5]
#> Naive T Cells--SNRPD2 Naive T Cells--NOSIP Naive T Cells--IL2RG
#> Pre_P8 1.6774074 1.856023 1.834704
#> Pre_P6 3.4815573 3.217231 3.583760
#> Pre_P27 0.0000000 1.486346 1.742069
#> Pre_P7 1.1761242 1.282083 2.050024
#> Pre_P20 0.6999054 1.315393 2.781787
#> Naive T Cells--ALDOA Naive T Cells--LUC7L3
#> Pre_P8 1.598921 1.8056758
#> Pre_P6 3.493135 0.0000000
#> Pre_P27 3.518687 1.5080699
#> Pre_P7 2.575957 0.9163035
#> Pre_P20 1.626403 1.2659969
# inspect the dimension of the matrix
dim(feature_gene_mean_celltype)
#> [1] 19 2217
We recommend using ClassifyR::crossValidate to perform cross-validated classification with the extracted features.
library(ClassifyR)
# X is the feature type generated
# y is the condition for classification
X <- feature_gene_mean_celltype
y <- data@meta.data[!duplicated(data$sample), ]
y <- y[match(rownames(X), y$sample), ]$condition
# run the classification model using random forest
result <- ClassifyR::crossValidate(
    X, y,
    classifier = "randomForest", nCores = 2,
    nFolds = 3, nRepeats = 5
)
ClassifyR::performancePlot(results = result)
It is expected that the classification accuracy is low. This is because we are using a small subset of data containing only 3523 genes and 519 cells. The dataset is unlikely to contain enough information to distinguish responders and non-responders.
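For intuition about the nFolds = 3 and nRepeats = 5 settings used above, the following base R sketch shows how repeated k-fold cross-validation partitions 19 samples. It is only an illustration under simple assumptions (plain random assignment, no stratification) and is not ClassifyR's internal implementation; all object names are hypothetical.

```r
set.seed(1)

n_samples <- 19 # number of samples in the feature matrix above
n_folds <- 3
n_repeats <- 5

# For each repeat, randomly assign every sample to one of the folds
fold_assignments <- lapply(seq_len(n_repeats), function(r) {
    sample(rep(seq_len(n_folds), length.out = n_samples))
})

# Each sample is in exactly one test fold per repeat, so it receives
# n_repeats predictions in total across the whole procedure
fold_sizes <- table(fold_assignments[[1]])
fold_sizes
```

With only 19 samples, each test fold holds 6 or 7 samples, so accuracy estimates from any single fold are coarse; repeating the partition helps stabilise the overall estimate.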
Suppose we want to use the features to perform survival analysis. Since the patient outcomes are responder and non-responder and do not contain survival information, we randomly “generate” the survival outcome for the purpose of demonstration.
We use a standard hierarchical clustering to split the patients into 2 groups based on the generated features.
library(survival)
library(survminer)
X <- feature_gene_mean_celltype
X <- t(X)
# run hierarchical clustering
hclust_res <- hclust(
    as.dist(1 - cor(X, method = "pearson")),
    method = "ward.D2"
)
set.seed(1)
# generate some survival outcome, including the survival days and the censoring outcome
survival_day <- sample(1:100, ncol(X))
censoring <- sample(0:1, ncol(X), replace = TRUE)
cluster_res <- cutree(hclust_res, k = 2)
metadata <- data.frame(
    cluster = factor(cluster_res),
    survival_day = survival_day,
    censoring = censoring
)
# plot survival curve
fit <- survfit(
    Surv(survival_day, censoring) ~ cluster,
    data = metadata
)
ggsurv <- ggsurvplot(fit,
    conf.int = FALSE, risk.table = TRUE,
    risk.table.col = "strata", pval = TRUE
)
ggsurv
The p-value is very high, indicating that there is not enough evidence to claim a survival difference between the two groups. This is as expected, because we randomly assigned survival status to each patient.
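To obtain such a p-value as a number rather than reading it off the plot, a log-rank test can be run directly with survival::survdiff. The sketch below repeats the same steps on a toy random matrix (all toy_* objects are hypothetical stand-ins, not scFeatures output). We assume here that the p-value shown by ggsurvplot with pval = TRUE is a log-rank p-value, which to our understanding is the survminer default.

```r
library(survival)

set.seed(1)

# Toy stand-in for a samples x features matrix (19 samples, 50 features)
toy_X <- matrix(
    rnorm(19 * 50),
    nrow = 19,
    dimnames = list(paste0("S", 1:19), paste0("F", 1:50))
)

# cor() correlates columns, so transpose to correlate samples with samples
sample_cor <- cor(t(toy_X), method = "pearson")
toy_hclust <- hclust(as.dist(1 - sample_cor), method = "ward.D2")
toy_cluster <- cutree(toy_hclust, k = 2)

# Randomly generated survival outcome, for demonstration only
toy_meta <- data.frame(
    cluster = factor(toy_cluster),
    survival_day = sample(1:100, nrow(toy_X)),
    censoring = sample(0:1, nrow(toy_X), replace = TRUE)
)

# Log-rank test comparing the two clusters
lr <- survdiff(Surv(survival_day, censoring) ~ cluster, data = toy_meta)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```

Since both the features and the outcomes are random in this sketch, there is no true survival difference between the clusters to detect.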
scFeatures provides a function that automatically runs an association study of the features with the conditions and produces an HTML file with visualisations of the features and the association results.
For this, we would first need to generate the features using scFeatures and then store the result in a named list format.
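As a rough illustration of what a named list of features looks like, the sketch below builds one by hand from toy values. The values and the toy_* object names are hypothetical; the list name gene_mean_celltype and the sample and column names are taken from the output shown earlier. The report function may place further requirements on the list and on the sample names, so its documentation should be checked before using a hand-built list.

```r
set.seed(1)

# Toy samples x features table for one feature type
toy_gene_mean_celltype <- data.frame(
    feature_1 = runif(3),
    feature_2 = runif(3),
    row.names = c("Pre_P8", "Pre_P6", "Pre_P27")
)

# Named list: one element per feature type, named after that feature type
toy_result <- list(gene_mean_celltype = toy_gene_mean_celltype)

names(toy_result)
dim(toy_result$gene_mean_celltype)
```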
For demonstration purposes, we provide an example of this features list. The code below shows the steps for generating the HTML output from the features list.
# here we use the demo data from the package
data("scfeatures_result", package = "scFeatures")
# here we use a temporary directory to save the html output
# modify this to save the html file to another directory
output_folder <- tempdir()
run_association_study_report(scfeatures_result, output_folder)
Inside the directory defined in output_folder, you will see the HTML report output with the name output_report.html.
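A small sketch for locating the report from within R is given below. It assumes the report was written to tempdir() under the file name stated above; report_path is a hypothetical variable name, and the browseURL() call is left commented out so that the snippet has no side effects.

```r
output_folder <- tempdir()
report_path <- file.path(output_folder, "output_report.html")

if (file.exists(report_path)) {
    message("Report found at: ", report_path)
    # browseURL(report_path) # uncomment to open the report in a browser
} else {
    message("No report found yet at: ", report_path)
}
```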
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] grid stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] data.table_1.16.2 enrichplot_1.27.0
#> [3] DOSE_4.1.0 clusterProfiler_4.15.0
#> [5] msigdbr_7.5.1 EnsDb.Hsapiens.v79_2.99.0
#> [7] ensembldb_2.31.0 AnnotationFilter_1.31.0
#> [9] GenomicFeatures_1.59.0 org.Hs.eg.db_3.20.0
#> [11] AnnotationDbi_1.69.0 plotly_4.10.4
#> [13] igraph_2.1.1 tidyr_1.3.1
#> [15] DT_0.33 limma_3.63.0
#> [17] pheatmap_1.0.12 dplyr_1.1.4
#> [19] reshape2_1.4.4 survminer_0.5.0
#> [21] ggpubr_0.6.0 ggplot2_3.5.1
#> [23] ClassifyR_3.11.0 survival_3.7-0
#> [25] BiocParallel_1.41.0 MultiAssayExperiment_1.33.0
#> [27] SummarizedExperiment_1.36.0 Biobase_2.67.0
#> [29] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [31] IRanges_2.41.0 MatrixGenerics_1.19.0
#> [33] matrixStats_1.4.1 generics_0.1.3
#> [35] SeuratObject_5.0.2 sp_2.1-4
#> [37] scFeatures_1.7.0 S4Vectors_0.44.0
#> [39] BiocGenerics_0.53.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] SpatialExperiment_1.16.0 R.methodsS3_1.8.2
#> [3] GSEABase_1.69.0 EnsDb.Mmusculus.v79_2.99.0
#> [5] goftest_1.2-3 Biostrings_2.75.0
#> [7] HDF5Array_1.35.0 vctrs_0.6.5
#> [9] ggtangle_0.0.4 spatstat.random_3.3-2
#> [11] digest_0.6.37 png_0.1-8
#> [13] shape_1.4.6.1 ggrepel_0.9.6
#> [15] deldir_2.0-4 parallelly_1.38.0
#> [17] magick_2.8.5 MASS_7.3-61
#> [19] foreach_1.5.2 qvalue_2.38.0
#> [21] withr_3.0.2 xfun_0.48
#> [23] ggfun_0.1.7 memoise_2.0.1
#> [25] proxyC_0.4.1 commonmark_1.9.2
#> [27] gson_0.1.0 tidytree_0.4.6
#> [29] zoo_1.8-12 GlobalOptions_0.1.2
#> [31] gtools_3.9.5 SingleCellSignalR_1.18.0
#> [33] R.oo_1.26.0 Formula_1.2-5
#> [35] sys_3.4.3 KEGGREST_1.47.0
#> [37] httr_1.4.7 rstatix_0.7.2
#> [39] restfulr_0.0.15 globals_0.16.3
#> [41] rhdf5filters_1.18.0 rhdf5_2.50.0
#> [43] UCSC.utils_1.2.0 babelgene_22.9
#> [45] curl_5.2.3 zlibbioc_1.52.0
#> [47] ScaledMatrix_1.14.0 polyclip_1.10-7
#> [49] GenomeInfoDbData_1.2.13 SparseArray_1.6.0
#> [51] xtable_1.8-4 stringr_1.5.1
#> [53] evaluate_1.0.1 S4Arrays_1.6.0
#> [55] irlba_2.3.5.1 colorspace_2.1-1
#> [57] spatstat.data_3.1-2 magrittr_2.0.3
#> [59] buildtools_1.0.0 ggtree_3.15.0
#> [61] lattice_0.22-6 spatstat.geom_3.3-3
#> [63] future.apply_1.11.3 genefilter_1.89.0
#> [65] XML_3.99-0.17 scuttle_1.16.0
#> [67] cowplot_1.1.3 maketools_1.3.1
#> [69] ggupset_0.4.0 pillar_1.9.0
#> [71] nlme_3.1-166 iterators_1.0.14
#> [73] caTools_1.18.3 compiler_4.4.1
#> [75] beachmat_2.23.0 stringi_1.8.4
#> [77] tensor_1.5 GenomicAlignments_1.43.0
#> [79] plyr_1.8.9 crayon_1.5.3
#> [81] abind_1.4-8 BiocIO_1.17.0
#> [83] gridGraphics_0.5-1 ggtext_0.1.2
#> [85] locfit_1.5-9.10 bit_4.5.0
#> [87] fastmatch_1.1-4 codetools_0.2-20
#> [89] BiocSingular_1.23.0 crosstalk_1.2.1
#> [91] bslib_0.8.0 multtest_2.63.0
#> [93] splines_4.4.1 markdown_1.13
#> [95] circlize_0.4.16 Rcpp_1.0.13
#> [97] sparseMatrixStats_1.18.0 gridtext_0.1.5
#> [99] knitr_1.48 blob_1.2.4
#> [101] utf8_1.2.4 fs_1.6.5
#> [103] listenv_0.9.1 DelayedMatrixStats_1.29.0
#> [105] GSVA_2.1.0 ggsignif_0.6.4
#> [107] ggplotify_0.1.2 tibble_3.2.1
#> [109] Matrix_1.7-1 statmod_1.5.0
#> [111] pkgconfig_2.0.3 tools_4.4.1
#> [113] cachem_1.1.0 RSQLite_2.3.7
#> [115] viridisLite_0.4.2 DBI_1.2.3
#> [117] fastmap_1.2.0 rmarkdown_2.28
#> [119] scales_1.3.0 Rsamtools_2.22.0
#> [121] broom_1.0.7 sass_0.4.9
#> [123] patchwork_1.3.0 BiocManager_1.30.25
#> [125] dotCall64_1.2 graph_1.85.0
#> [127] carData_3.0-5 farver_2.1.2
#> [129] yaml_2.3.10 rtracklayer_1.66.0
#> [131] cli_3.6.3 purrr_1.0.2
#> [133] lifecycle_1.0.4 bluster_1.17.0
#> [135] backports_1.5.0 annotate_1.85.0
#> [137] gtable_0.3.6 rjson_0.2.23
#> [139] progressr_0.15.0 parallel_4.4.1
#> [141] ape_5.8 jsonlite_1.8.9
#> [143] edgeR_4.4.0 bitops_1.0-9
#> [145] bit64_4.5.2 Rtsne_0.17
#> [147] yulab.utils_0.1.7 spatstat.utils_3.1-0
#> [149] BiocNeighbors_2.1.0 ranger_0.16.0
#> [151] jquerylib_0.1.4 highr_0.11
#> [153] metapod_1.14.0 GOSemSim_2.33.0
#> [155] dqrng_0.4.1 survMisc_0.5.6
#> [157] spatstat.univar_3.0-1 R.utils_2.12.3
#> [159] lazyeval_0.2.2 htmltools_0.5.8.1
#> [161] KMsurv_0.1-5 GO.db_3.20.0
#> [163] glue_1.8.0 spam_2.11-0
#> [165] XVector_0.46.0 RCurl_1.98-1.16
#> [167] treeio_1.30.0 scran_1.34.0
#> [169] gridExtra_2.3 AUCell_1.29.0
#> [171] R6_2.5.1 SingleCellExperiment_1.28.0
#> [173] gplots_3.2.0 km.ci_0.5-6
#> [175] labeling_0.4.3 cluster_2.1.6
#> [177] Rhdf5lib_1.28.0 aplot_0.2.3
#> [179] DelayedArray_0.33.1 tidyselect_1.2.1
#> [181] ProtGenerics_1.38.0 xml2_1.3.6
#> [183] car_3.1-3 future_1.34.0
#> [185] rsvd_1.0.5 munsell_0.5.1
#> [187] KernSmooth_2.23-24 htmlwidgets_1.6.4
#> [189] fgsea_1.33.0 RColorBrewer_1.1-3
#> [191] rlang_1.1.4 spatstat.sparse_3.1-0
#> [193] spatstat.explore_3.3-3 fansi_1.0.6