A detailed explanation of scFeatures’ features

Introduction

scFeatures is a tool for generating multi-view representations of samples in a single-cell dataset. This vignette provides an overview of scFeatures: it uses the main function to generate features, then illustrates case studies that use the generated features for classification, survival analysis and association studies.

library(scFeatures)

Running scFeatures

scFeatures can be run with one line of code, scfeatures_result <- scFeatures(data), which generates a list of data frames containing all feature types, each in the form of samples x features.

# load the example dataset shipped with the package
data("example_scrnaseq", package = "scFeatures")
data <- example_scrnaseq

# generate only the "gene_mean_celltype" feature type
scfeatures_result <- scFeatures(
    data = data@assays$RNA@data,
    sample = data$sample,
    celltype = data$celltype,
    feature_types = "gene_mean_celltype",
    type = "scrna",
    ncores = 1,
    species = "Homo sapiens"
)

By default, the function generates all feature types. To reduce the computational time for this demonstration, here we generate only the selected feature type “gene mean celltype”. More information on customising the function can be obtained by typing ?scFeatures
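To make the shape of this feature type concrete, the following is a minimal base R sketch of a per-sample, per-cell-type gene mean on toy data. It is an illustration only, not scFeatures' implementation: the helper gene_mean_by_celltype, the toy values, and the choice to fill missing cell types with 0 are all assumptions. The "celltype--gene" column naming mirrors the column names printed later in this vignette.

```r
# Toy expression matrix: 3 genes x 6 cells (values are made up for illustration)
expr <- matrix(
    c(1, 2, 3, 4, 5, 6,
      0, 0, 2, 2, 4, 4,
      1, 1, 1, 3, 3, 3),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)
sample_id <- c("S1", "S1", "S1", "S2", "S2", "S2")
celltype  <- c("T", "B", "T", "T", "B", "B")

# Hypothetical helper: one row per sample, one column per "celltype--gene" pair
gene_mean_by_celltype <- function(expr, sample_id, celltype) {
    samples   <- unique(sample_id)
    celltypes <- unique(celltype)
    out <- lapply(samples, function(s) {
        vals <- lapply(celltypes, function(ct) {
            keep <- sample_id == s & celltype == ct
            # a sample with no cells of this type gets 0 (an assumption of this sketch)
            m <- if (any(keep)) {
                rowMeans(expr[, keep, drop = FALSE])
            } else {
                rep(0, nrow(expr))
            }
            names(m) <- paste(ct, rownames(expr), sep = "--")
            m
        })
        unlist(vals)
    })
    res <- do.call(rbind, out)
    rownames(res) <- samples
    res
}

toy_features <- gene_mean_by_celltype(expr, sample_id, celltype)
dim(toy_features)       # samples x (cell types * genes)
colnames(toy_features)
```

The result has one row per sample and one column per cell type and gene combination, which is why the number of columns grows quickly with the number of cell types.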

Classification of conditions using the generated features

To build a disease prediction model from the generated features, we use ClassifyR.

The output from scFeatures is a matrix of samples x features, i.e., each row corresponds to a sample and each column to a feature, so it can be used directly as X. The rows are in the order of unique(data$sample).
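Because the labels must line up with the rows of the feature matrix, it can help to see the alignment step in isolation. The sketch below uses toy per-cell metadata and a toy feature matrix (all names and values are made up) and reorders one label per sample with match(); the final stopifnot() is an optional guard against silent misalignment.

```r
# Toy per-cell metadata: sample and condition are repeated for every cell
meta <- data.frame(
    sample    = c("P2", "P2", "P1", "P3", "P1", "P3"),
    condition = c("Responder", "Responder", "Non-responder",
                  "Responder", "Non-responder", "Responder"),
    stringsAsFactors = FALSE
)

# Toy feature matrix whose rows follow unique(meta$sample)
X_toy <- matrix(
    rnorm(6), nrow = 3,
    dimnames = list(unique(meta$sample), c("f1", "f2"))
)

# Keep one metadata row per sample, then reorder to match the rows of X_toy
per_sample <- meta[!duplicated(meta$sample), ]
y_toy <- per_sample$condition[match(rownames(X_toy), per_sample$sample)]

# Guard against silent misalignment before fitting any model
stopifnot(!anyNA(y_toy), length(y_toy) == nrow(X_toy))
```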

Here we use the feature type gene mean celltype as an example to build a classification model on the disease condition.

 
feature_gene_mean_celltype <- scfeatures_result$gene_mean_celltype

# inspect the first 5 rows and first 5 columns
feature_gene_mean_celltype[1:5, 1:5]
#>         Naive T Cells--SNRPD2 Naive T Cells--NOSIP Naive T Cells--IL2RG
#> Pre_P8              1.6774074             1.856023             1.834704
#> Pre_P6              3.4815573             3.217231             3.583760
#> Pre_P27             0.0000000             1.486346             1.742069
#> Pre_P7              1.1761242             1.282083             2.050024
#> Pre_P20             0.6999054             1.315393             2.781787
#>         Naive T Cells--ALDOA Naive T Cells--LUC7L3
#> Pre_P8              1.598921             1.8056758
#> Pre_P6              3.493135             0.0000000
#> Pre_P27             3.518687             1.5080699
#> Pre_P7              2.575957             0.9163035
#> Pre_P20             1.626403             1.2659969

# inspect the dimension of the matrix
dim(feature_gene_mean_celltype)
#> [1]   19 2217

We recommend using ClassifyR::crossValidate to perform cross-validated classification with the extracted features.

library(ClassifyR)

# X is the feature type generated
# y is the condition for classification
X <- feature_gene_mean_celltype
y <- data@meta.data[!duplicated(data$sample), ]
y <- y[match(rownames(X), y$sample), ]$condition

# run the classification model using random forest
result <- ClassifyR::crossValidate(
    X, y,
    classifier = "randomForest", nCores = 2,
    nFolds = 3, nRepeats = 5
)

ClassifyR::performancePlot(results = result)

The classification accuracy is expected to be low, because we are using a small subset of the data containing only 3523 genes and 519 cells. This subset is unlikely to contain enough information to distinguish responders from non-responders.
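The small number of samples also makes each accuracy estimate noisy. As a rough illustration, the base R sketch below deals 19 samples into 3 folds, repeated 5 times, and reports the test-set size of each fold. This is a generic repeated k-fold partition written for this example, not ClassifyR's internal procedure, and the seed is arbitrary.

```r
set.seed(42)
n_samples <- 19   # number of rows in the feature matrix
n_folds   <- 3
n_repeats <- 5

# For each repeat, shuffle the fold labels so every sample lands in one fold
fold_assignments <- lapply(seq_len(n_repeats), function(r) {
    sample(rep(seq_len(n_folds), length.out = n_samples))
})

# Test-set size of every fold (rows) in every repeat (columns)
fold_sizes <- sapply(fold_assignments, tabulate, nbins = n_folds)
fold_sizes
```

With only six or seven samples in each test fold, a single misclassified sample shifts the fold accuracy by roughly 15 percentage points.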

Survival analysis using the generated features

Suppose we want to use the features to perform survival analysis. Since the patient outcomes here are responder and non-responder and do not contain survival information, we randomly “generate” the survival outcome for the purpose of demonstration.

We use standard hierarchical clustering to split the patients into two groups based on the generated features.

library(survival)
library(survminer)
 

X <- feature_gene_mean_celltype
X <- t(X)

# run hierarchical clustering
hclust_res <- hclust(
    as.dist(1 - cor(X, method = "pearson")),
    method = "ward.D2"
)

set.seed(1)
# generate some survival outcome, including the survival days and the censoring outcome
survival_day <- sample(1:100, ncol(X))
censoring <- sample(0:1, ncol(X), replace = TRUE)

cluster_res <- cutree(hclust_res, k = 2)
metadata <- data.frame( cluster = factor(cluster_res),
                        survival_day = survival_day,
                        censoring = censoring)

# plot survival curve
fit <- survfit(
    Surv(survival_day, censoring) ~ cluster,
    data = metadata
)
ggsurv <- ggsurvplot(fit,
    conf.int = FALSE, risk.table = TRUE,
    risk.table.col = "strata", pval = TRUE
)
ggsurv

The p-value is very high, indicating that there is not enough evidence to claim a survival difference between the two groups. This is as expected, because we randomly assigned the survival outcome to each patient.
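If the p-value is needed as a number rather than as a plot annotation, one option is a log-rank test with survival::survdiff, which, to our understanding, is the test ggsurvplot reports by default when pval = TRUE. The sketch below runs it on a self-contained toy data frame (randomly generated, with an arbitrary alternating cluster assignment that is an assumption of this example) and derives the p-value from the chi-squared statistic.

```r
library(survival)

set.seed(1)
n <- 19
toy <- data.frame(
    cluster      = factor(rep(1:2, length.out = n)),   # arbitrary two groups
    survival_day = sample(1:100, n),                   # random survival times
    censoring    = sample(0:1, n, replace = TRUE)      # random event indicator
)

# Log-rank test comparing the two groups
lr <- survdiff(Surv(survival_day, censoring) ~ cluster, data = toy)

# p-value from the chi-squared statistic, with (number of groups - 1) degrees of freedom
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```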

Association study of the features with the conditions

scFeatures provides a function that automatically runs an association study of the features with the conditions and produces an HTML file with visualisations of the features and the association results.

For this, we first need to generate the features using scFeatures and then store the result in a named list.
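As a rough picture of what a named list of features looks like, the sketch below builds one from toy matrices. The element name gene_mean_celltype matches the feature type used earlier in this vignette; proportion_raw is assumed here to be another valid feature type name, and the sample names, values, and the helper is_consistent_feature_list are made up for illustration. Any further requirements of run_association_study_report (for example, how conditions are encoded) are not covered here; consult ?run_association_study_report.

```r
sample_names <- c("sample1", "sample2", "sample3")

# Toy sample x feature matrices (values are made up)
toy_result <- list(
    gene_mean_celltype = matrix(
        runif(6), nrow = 3,
        dimnames = list(sample_names, c("T--geneA", "T--geneB"))
    ),
    proportion_raw = matrix(
        c(0.6, 0.4, 0.5, 0.5, 0.3, 0.7), nrow = 3, byrow = TRUE,
        dimnames = list(sample_names, c("T", "B"))
    )
)

# Hypothetical sanity check: every element is named and shares the same sample rows
is_consistent_feature_list <- function(x) {
    has_names <- !is.null(names(x)) && all(nzchar(names(x)))
    same_rows <- length(unique(lapply(x, rownames))) == 1
    has_names && same_rows
}

names(toy_result)
is_consistent_feature_list(toy_result)
```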

For demonstration purposes, we provide an example of such a features list. The code below shows the steps for generating the HTML output from the features list.

# here we use the demo data from the package
data("scfeatures_result", package = "scFeatures")

# here we save the html output to a temporary directory
# modify this to save the html file to another directory
output_folder <- tempdir()

run_association_study_report(scfeatures_result, output_folder)

Inside the directory defined by output_folder, you will find the HTML report, named output_report.html.
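A quick way to confirm that the report was written, and to open it, is sketched below in base R. The helper report_path is hypothetical and simply joins the folder with the file name stated above; the browser is only opened in an interactive session when the file exists.

```r
# Hypothetical helper: build the expected report path inside a folder
report_path <- function(folder) file.path(folder, "output_report.html")

output_folder <- tempdir()
report_file <- report_path(output_folder)

if (file.exists(report_file) && interactive()) {
    # open the report in the default browser
    utils::browseURL(report_file)
} else {
    message("Report not found (or non-interactive session): ", report_file)
}
```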

sessionInfo()

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] grid      stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] data.table_1.16.4           enrichplot_1.27.3          
#>  [3] DOSE_4.1.0                  clusterProfiler_4.15.1     
#>  [5] msigdbr_7.5.1               EnsDb.Hsapiens.v79_2.99.0  
#>  [7] ensembldb_2.31.0            AnnotationFilter_1.31.0    
#>  [9] GenomicFeatures_1.59.1      org.Hs.eg.db_3.20.0        
#> [11] AnnotationDbi_1.69.0        plotly_4.10.4              
#> [13] igraph_2.1.2                tidyr_1.3.1                
#> [15] DT_0.33                     limma_3.63.2               
#> [17] pheatmap_1.0.12             dplyr_1.1.4                
#> [19] reshape2_1.4.4              survminer_0.5.0            
#> [21] ggpubr_0.6.0                ggplot2_3.5.1              
#> [23] ClassifyR_3.11.4            survival_3.8-3             
#> [25] BiocParallel_1.41.0         MultiAssayExperiment_1.33.4
#> [27] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#> [29] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
#> [31] IRanges_2.41.2              MatrixGenerics_1.19.0      
#> [33] matrixStats_1.4.1           SeuratObject_5.0.2         
#> [35] sp_2.1-4                    scFeatures_1.7.0           
#> [37] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [39] generics_0.1.3              BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] ggtext_0.1.2                fs_1.6.5                   
#>   [3] ProtGenerics_1.39.1         GSVA_2.1.4                 
#>   [5] spatstat.sparse_3.1-0       bitops_1.0-9               
#>   [7] httr_1.4.7                  RColorBrewer_1.1-3         
#>   [9] tools_4.4.2                 backports_1.5.0            
#>  [11] R6_2.5.1                    HDF5Array_1.35.2           
#>  [13] lazyeval_0.2.2              rhdf5filters_1.19.0        
#>  [15] withr_3.0.2                 gridExtra_2.3              
#>  [17] progressr_0.15.1            cli_3.6.3                  
#>  [19] spatstat.explore_3.3-3      labeling_0.4.3             
#>  [21] sass_0.4.9                  survMisc_0.5.6             
#>  [23] spatstat.data_3.1-4         genefilter_1.89.0          
#>  [25] SingleCellSignalR_1.19.0    yulab.utils_0.1.8          
#>  [27] commonmark_1.9.2            Rsamtools_2.23.1           
#>  [29] gson_0.1.0                  ggupset_0.4.0              
#>  [31] R.utils_2.12.3              parallelly_1.41.0          
#>  [33] RSQLite_2.3.9               gridGraphics_0.5-1         
#>  [35] shape_1.4.6.1               BiocIO_1.17.1              
#>  [37] crosstalk_1.2.1             gtools_3.9.5               
#>  [39] spatstat.random_3.3-2       car_3.1-3                  
#>  [41] GO.db_3.20.0                Matrix_1.7-1               
#>  [43] abind_1.4-8                 R.methodsS3_1.8.2          
#>  [45] lifecycle_1.0.4             yaml_2.3.10                
#>  [47] edgeR_4.5.1                 carData_3.0-5              
#>  [49] qvalue_2.39.0               gplots_3.2.0               
#>  [51] rhdf5_2.51.1                SparseArray_1.7.2          
#>  [53] Rtsne_0.17                  blob_1.2.4                 
#>  [55] dqrng_0.4.1                 crayon_1.5.3               
#>  [57] ggtangle_0.0.6              lattice_0.22-6             
#>  [59] cowplot_1.1.3               beachmat_2.23.5            
#>  [61] annotate_1.85.0             KEGGREST_1.47.0            
#>  [63] magick_2.8.5                sys_3.4.3                  
#>  [65] maketools_1.3.1             pillar_1.10.0              
#>  [67] knitr_1.49                  metapod_1.15.0             
#>  [69] fgsea_1.33.2                rjson_0.2.23               
#>  [71] future.apply_1.11.3         codetools_0.2-20           
#>  [73] fastmatch_1.1-6             glue_1.8.0                 
#>  [75] ggfun_0.1.8                 spatstat.univar_3.1-1      
#>  [77] treeio_1.31.0               vctrs_0.6.5                
#>  [79] png_0.1-8                   spam_2.11-0                
#>  [81] gtable_0.3.6                cachem_1.1.0               
#>  [83] xfun_0.49                   S4Arrays_1.7.1             
#>  [85] SingleCellExperiment_1.29.1 iterators_1.0.14           
#>  [87] KMsurv_0.1-5                statmod_1.5.0              
#>  [89] bluster_1.17.0              nlme_3.1-166               
#>  [91] ggtree_3.15.0               bit64_4.5.2                
#>  [93] bslib_0.8.0                 irlba_2.3.5.1              
#>  [95] KernSmooth_2.23-24          colorspace_2.1-1           
#>  [97] DBI_1.2.3                   tidyselect_1.2.1           
#>  [99] proxyC_0.4.1                bit_4.5.0.1                
#> [101] compiler_4.4.2              curl_6.0.1                 
#> [103] AUCell_1.29.0               graph_1.85.0               
#> [105] BiocNeighbors_2.1.2         xml2_1.3.6                 
#> [107] DelayedArray_0.33.3         rtracklayer_1.67.0         
#> [109] scales_1.3.0                caTools_1.18.3             
#> [111] stringr_1.5.1               SpatialExperiment_1.17.0   
#> [113] digest_0.6.37               goftest_1.2-3              
#> [115] spatstat.utils_3.1-1        rmarkdown_2.29             
#> [117] XVector_0.47.1              htmltools_0.5.8.1          
#> [119] pkgconfig_2.0.3             sparseMatrixStats_1.19.0   
#> [121] fastmap_1.2.0               htmlwidgets_1.6.4          
#> [123] rlang_1.1.4                 GlobalOptions_0.1.2        
#> [125] UCSC.utils_1.3.0            DelayedMatrixStats_1.29.0  
#> [127] farver_2.1.2                jquerylib_0.1.4            
#> [129] zoo_1.8-12                  jsonlite_1.8.9             
#> [131] GOSemSim_2.33.0             R.oo_1.27.0                
#> [133] BiocSingular_1.23.0         RCurl_1.98-1.16            
#> [135] magrittr_2.0.3              ggplotify_0.1.2            
#> [137] Formula_1.2-5               scuttle_1.17.0             
#> [139] GenomeInfoDbData_1.2.13     dotCall64_1.2              
#> [141] patchwork_1.3.0             Rhdf5lib_1.29.0            
#> [143] munsell_0.5.1               Rcpp_1.0.13-1              
#> [145] ape_5.8-1                   babelgene_22.9             
#> [147] stringi_1.8.4               zlibbioc_1.52.0            
#> [149] MASS_7.3-61                 plyr_1.8.9                 
#> [151] ggrepel_0.9.6               parallel_4.4.2             
#> [153] listenv_0.9.1               deldir_2.0-4               
#> [155] Biostrings_2.75.3           splines_4.4.2              
#> [157] gridtext_0.1.5              tensor_1.5                 
#> [159] multtest_2.63.0             circlize_0.4.16            
#> [161] locfit_1.5-9.10             ranger_0.17.0              
#> [163] spatstat.geom_3.3-4         markdown_1.13              
#> [165] ggsignif_0.6.4              buildtools_1.0.0           
#> [167] ScaledMatrix_1.15.0         XML_3.99-0.17              
#> [169] evaluate_1.0.1              scran_1.35.0               
#> [171] BiocManager_1.30.25         foreach_1.5.2              
#> [173] EnsDb.Mmusculus.v79_2.99.0  purrr_1.0.2                
#> [175] polyclip_1.10-7             km.ci_0.5-6                
#> [177] future_1.34.0               rsvd_1.0.5                 
#> [179] broom_1.0.7                 xtable_1.8-4               
#> [181] restfulr_0.0.15             tidytree_0.4.6             
#> [183] rstatix_0.7.2               viridisLite_0.4.2          
#> [185] tibble_3.2.1                aplot_0.2.4                
#> [187] memoise_2.0.1               GenomicAlignments_1.43.0   
#> [189] cluster_2.1.8               globals_0.16.3             
#> [191] GSEABase_1.69.0