phantasusLite tutorial

Introduction

The phantasusLite package contains a set functions that facilitate working with public gene expression datasets originally developed for phantasus package. Unlike phantasus, phantasusLite aims to limit the amount of dependencies.

The current functionality includes:

  • Providing an interface to precomputed RNA-seq gene counts from ARCHS4 and DEE2 projects stored at remote HSDS repositories.
  • Inferring sample groups for cases when they are not provided in the original metadata.
  • Saving and loading expression matrices from GCT format.

Installation

It is recommended to install the release version of the package from Bioconductor using the following commands:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("phantasusLite")

Alternatively, the most recent version of the package can be installed from the GitHub repository:

library(devtools)
install_github("ctlab/phantasusLite")

Loading precomputed RNA-seq counts

library(GEOquery)
library(phantasusLite)

Let’s load dataset GSE53053 from GEO using GEOquery package:

ess <- getGEO("GSE53053")
es <- ess[[1]]

RNA-seq dataset from GEO do not contain the expression matrix, thus exprs(es) is empty:

head(exprs(es))
##      GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304 GSM1281305
##      GSM1281306 GSM1281307

However, a number of precomputed gene count tables are available at HSDS server ‘https://alserglab.wustl.edu/hsds/’. It features HDF5 files with counts from ARCHS4 and DEE2 projects:

url <- 'https://alserglab.wustl.edu/hsds/?domain=/counts'
getHSDSFileList(url)
##  [1] "/counts/archs4/Arabidopsis_thaliana_count_matrix.h5"    
##  [2] "/counts/archs4/Bos_taurus_count_matrix.h5"              
##  [3] "/counts/archs4/Caenorhabditis_elegans_count_matrix.h5"  
##  [4] "/counts/archs4/Danio_rerio_count_matrix.h5"             
##  [5] "/counts/archs4/Drosophila_melanogaster_count_matrix.h5" 
##  [6] "/counts/archs4/Gallus_gallus_count_matrix.h5"           
##  [7] "/counts/archs4/Rattus_norvegicus_count_matrix.h5"       
##  [8] "/counts/archs4/Saccharomyces_cerevisiae_count_matrix.h5"
##  [9] "/counts/archs4/human_gene_v2.3.h5"                      
## [10] "/counts/archs4/meta.h5"                                 
## [11] "/counts/archs4/mouse_gene_v2.3.h5"                      
## [12] "/counts/dee2/athaliana_star_matrix_20240409.h5"         
## [13] "/counts/dee2/celegans_star_matrix_20240409.h5"          
## [14] "/counts/dee2/dmelanogaster_star_matrix_20240409.h5"     
## [15] "/counts/dee2/drerio_star_matrix_20240409.h5"            
## [16] "/counts/dee2/ecoli_star_matrix_20240409.h5"             
## [17] "/counts/dee2/hsapiens_star_matrix_20240409.h5"          
## [18] "/counts/dee2/meta.h5"                                   
## [19] "/counts/dee2/mmusculus_star_matrix_20240409.h5"         
## [20] "/counts/dee2/rnorvegicus_star_matrix_20240409.h5"       
## [21] "/counts/dee2/scerevisiae_star_matrix_20240409.h5"       
## [22] "/counts/index.h5"                                       
## [23] "/counts/priority.h5"

GSE53053 dataset was sequenced from Mus musculus and we can get an expression matrix from the corresponding HDF5-file with DEE2 data:

file <- "dee2/mmusculus_star_matrix_20240409.h5"
es <- loadCountsFromH5FileHSDS(es, url, file)
head(exprs(es))
##                    GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304
## ENSMUSG00000102693          0          0          0          0          0
## ENSMUSG00000064842          0          0          0          0          0
## ENSMUSG00000051951          0          0        135        120        132
## ENSMUSG00000102851          0          0          0          0          0
## ENSMUSG00000103377          0          0          0          0          1
## ENSMUSG00000104017          0          0          0          0          0
##                    GSM1281305 GSM1281306 GSM1281307
## ENSMUSG00000102693          0          0          0
## ENSMUSG00000064842          0          0          0
## ENSMUSG00000051951          0          0          0
## ENSMUSG00000102851          0          0          0
## ENSMUSG00000103377          0          0          0
## ENSMUSG00000104017          0          0          0

Normally loadCountsFromHSDS can be used to automatically select the HDF5-file with the largest number of quantified samples:

es <- ess[[1]]
es <- loadCountsFromHSDS(es, url)
head(exprs(es))
##                    GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304
## ENSMUSG00000000001       1015        603        561        549        425
## ENSMUSG00000000003          0          0          0          0          0
## ENSMUSG00000000028        109         34          0         14          9
## ENSMUSG00000000031          0         18          0          0          0
## ENSMUSG00000000037          0          0          0          0          0
## ENSMUSG00000000049          0          0          0          0          0
##                    GSM1281305 GSM1281306 GSM1281307
## ENSMUSG00000000001        853        407        479
## ENSMUSG00000000003          0          0          0
## ENSMUSG00000000028        165          0         15
## ENSMUSG00000000031          0          0          0
## ENSMUSG00000000037          0          0          0
## ENSMUSG00000000049          0          0          0

The counts are different from the previous values as ARCHS4 counts were used – ARCHS4 is prioritized when there are several files with the same number of samples:

preproc(experimentData(es))$gene_counts_source
## [1] "archs4/mouse_gene_v2.3.h5"

Further, gene symbols are also imported from ARCHS4 database and are available as feature data:

head(fData(es))
##                    Gene symbol          ENSEMBLID
## ENSMUSG00000000001       Gnai3 ENSMUSG00000000001
## ENSMUSG00000000003        Pbsn ENSMUSG00000000003
## ENSMUSG00000000028       Cdc45 ENSMUSG00000000028
## ENSMUSG00000000031         H19 ENSMUSG00000000031
## ENSMUSG00000000037       Scml2 ENSMUSG00000000037
## ENSMUSG00000000049        Apoh ENSMUSG00000000049

Inferring sample groups

For some of the GEO datasets, such as GSE53053, the sample annotation is not fully available. However, frequently sample titles are structured in a way that allows to infer the groups. For example, for GSE53053 we can see there are three groups: Ctrl, MandIL4, MandLPSandIFNg, with up to 3 replicates:

es$title
## [1] "Ctrl_1"           "Ctrl_2"           "MandIL4_1"        "MandIL4_2"       
## [5] "MandIL4_3"        "MandLPSandIFNg_1" "MandLPSandIFNg_2" "MandLPSandIFNg_3"

For such well-structured titles, inferCondition function can be used to automatically identify the sample conditions and replicates:

es <- inferCondition(es)
print(es$condition)
## [1] "Ctrl"           "Ctrl"           "MandIL4"        "MandIL4"       
## [5] "MandIL4"        "MandLPSandIFNg" "MandLPSandIFNg" "MandLPSandIFNg"
print(es$replicate)
## [1] "1" "2" "1" "2" "3" "1" "2" "3"

Working with GCT files

GCT text format can be used to store annotated gene expression matrices and load them in software such as Morpheus or Phantasus.

For example, we can save the ExpressionSet object that we defined previously:

f <- file.path(tempdir(), "GSE53053.gct")
writeGct(es, f)

And the load the file back:

es2 <- readGct(f)
## Warning in readGct(f): duplicated row IDs: Gm16364 Pakap Pakap A530058N18Rik
## Snora43 1700030C10Rik; they were renamed
print(es2)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 53511 features, 8 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: Ctrl_1 Ctrl_2 ... MandLPSandIFNg_3 (8 total)
##   varLabels: title geo_accession ... replicate (43 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Gnai3 Pbsn ... ENSMUSG00002076992 (53511 total)
##   fvarLabels: Gene symbol ENSEMBLID
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phantasusLite_1.5.0 GEOquery_2.75.0     Biobase_2.67.0     
## [4] BiocGenerics_0.53.3 generics_0.1.3      BiocStyle_2.35.0   
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.37.0 httr2_1.0.7                
##  [3] rjson_0.2.23                xfun_0.49                  
##  [5] bslib_0.8.0                 lattice_0.22-6             
##  [7] tzdb_0.4.0                  vctrs_0.6.5                
##  [9] tools_4.4.2                 curl_6.0.1                 
## [11] stats4_4.4.2                tibble_3.2.1               
## [13] R.oo_1.27.0                 pkgconfig_2.0.3            
## [15] Matrix_1.7-1                data.table_1.16.4          
## [17] rentrez_1.2.3               S4Vectors_0.45.2           
## [19] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
## [21] stringr_1.5.1               compiler_4.4.2             
## [23] statmod_1.5.0               GenomeInfoDb_1.43.2        
## [25] htmltools_0.5.8.1           sys_3.4.3                  
## [27] buildtools_1.0.0            sass_0.4.9                 
## [29] yaml_2.3.10                 pillar_1.10.0              
## [31] crayon_1.5.3                jquerylib_0.1.4            
## [33] tidyr_1.3.1                 R.utils_2.12.3             
## [35] rhdf5client_1.29.0          DelayedArray_0.33.3        
## [37] cachem_1.1.0                limma_3.63.2               
## [39] abind_1.4-8                 tidyselect_1.2.1           
## [41] digest_0.6.37               stringi_1.8.4              
## [43] dplyr_1.1.4                 purrr_1.0.2                
## [45] maketools_1.3.1             fastmap_1.2.0              
## [47] grid_4.4.2                  cli_3.6.3                  
## [49] SparseArray_1.7.2           magrittr_2.0.3             
## [51] S4Arrays_1.7.1              XML_3.99-0.17              
## [53] withr_3.0.2                 readr_2.1.5                
## [55] rappdirs_0.3.3              UCSC.utils_1.3.0           
## [57] rmarkdown_2.29              XVector_0.47.0             
## [59] httr_1.4.7                  matrixStats_1.4.1          
## [61] R.methodsS3_1.8.2           hms_1.1.3                  
## [63] evaluate_1.0.1              knitr_1.49                 
## [65] GenomicRanges_1.59.1        IRanges_2.41.2             
## [67] rlang_1.1.4                 glue_1.8.0                 
## [69] BiocManager_1.30.25         xml2_1.3.6                 
## [71] jsonlite_1.8.9              R6_2.5.1                   
## [73] MatrixGenerics_1.19.0       zlibbioc_1.52.0

```