YARN: Robust Multi-Tissue RNA-Seq Preprocessing and Normalization

YARN - Yet Another RNa-seq package

The goal of yarn is to expedite large RNA-seq analyses using a combination of previously developed tools. yarn is meant to make it easier for the user to compare conditions accurately by leveraging many Bioconductor tools and various statistical and normalization techniques, while accounting for the large heterogeneity and sparsity found in very large RNA-seq experiments.

Installation

You can install yarn from Bioconductor with:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("yarn")

Quick Introduction

If you’re here to grab the GTEx version 6.0 data then look no further than this gist that uses yarn to download all the data and preprocess it for you.

Preprocessing

Below are a few of the functions we can use to preprocess a large RNA-seq experiment. We follow a particular procedure where we:

  1. Filter poor quality samples
  2. Merge samples of similar conditions for increased power
  3. Filter genes while preserving tissue or group specificity
  4. Normalize while accounting for global differences in tissue distribution

We will make use of the skin dataset for examples. The skin dataset is a small sample of the full GTEx data that can be downloaded using the downloadGTEx function. The skin dataset looks like this:

skin
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 40824 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   rowNames: GTEX-OHPL-0008-SM-4E3I9 GTEX-145MN-1526-SM-5SI9T ...
##     GTEX-144FL-0626-SM-5LU43 (20 total)
##   varLabels: SAMPID SMATSSCR ... DTHHRDY (65 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 48350 48365 ... 7565 (40824 total)
##   fvarLabels: ensembl_gene_id hgnc_symbol ... gene_biotype (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

This is a basic workflow. Details will be fleshed out:

  1. First, always remember to load the library.
library(yarn)
  2. Download the GTEx gene count data as an ExpressionSet object or load the sample skin dataset.

For computational reasons we load the sample skin data instead of having the user download the full GTEx data.

library(yarn)
data(skin)
  3. Check for mis-annotation of gender or other phenotypes using group-specific genes
checkMisAnnotation(skin,"GENDER",controlGenes="Y",legendPosition="topleft")
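
The idea behind this kind of check can be illustrated in base R on toy data. The sketch below is not yarn's implementation: the expression values, sample names, and the majority-vote rule are all assumptions made for illustration. It restricts the data to a few Y-linked genes, projects the samples with classical multidimensional scaling, and flags any sample whose recorded gender disagrees with the cluster it falls into.

```r
# Toy log2 expression for three hypothetical Y-linked genes in six samples.
# Samples m1-m3 are truly male, f1-f3 truly female.
y_expr <- rbind(
  geneY1 = c(8.0, 8.3, 7.9, 0.1, 0.0, 0.2),
  geneY2 = c(7.5, 7.2, 7.8, 0.0, 0.3, 0.1),
  geneY3 = c(9.1, 8.8, 9.0, 0.2, 0.1, 0.0)
)
colnames(y_expr) <- c("m1", "m2", "m3", "f1", "f2", "f3")

# Recorded annotation with one deliberate error: m3 is labelled "female".
recorded <- c(m1 = "male", m2 = "male", m3 = "female",
              f1 = "female", f2 = "female", f3 = "female")

# Classical MDS on the sample-to-sample distances of the Y-linked genes.
coords <- cmdscale(dist(t(y_expr)), k = 2)

# Split samples by the sign of the first coordinate (the sign itself is
# arbitrary, only the grouping matters).
side <- ifelse(coords[, 1] > 0, "pos", "neg")

# Majority recorded label on each side (illustrative rule, an assumption).
majority <- vapply(split(recorded, side),
                   function(x) names(which.max(table(x))),
                   character(1))

# Samples whose recorded label disagrees with their cluster's majority.
suspect <- names(recorded)[recorded != majority[side]]
print(suspect)
```

In practice the scatter plot itself is usually inspected by eye; the majority-vote step here only makes the toy example checkable.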

  4. Decide what sub-groups should be merged
checkTissuesToMerge(skin,"SMTS","SMTSD")

  5. Filter lowly expressed genes
skin_filtered = filterLowGenes(skin,"SMTSD")
dim(skin)
## Features  Samples 
##    40824       20
dim(skin_filtered)
## Features  Samples 
##    19933       20
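
To make the notion of group-aware low-expression filtering concrete, here is a minimal base R sketch on a toy count matrix. It is an illustration only: the CPM threshold of 1 and the rule "expressed in at least half as many samples as the smallest group" are assumptions chosen for the example, not values confirmed by the text above.

```r
# Toy count matrix: 4 genes x 6 samples (all values are made up).
counts <- rbind(
  housekeeping    = c(2e6, 2e6, 2e6, 2e6, 2e6, 2e6),
  silent          = c(0, 1, 0, 0, 0, 1),
  groupB_specific = c(0, 0, 0, 0, 300, 280),
  undetected      = c(0, 0, 0, 0, 0, 0)
)
colnames(counts) <- paste0("s", 1:6)
groups <- c("A", "A", "A", "A", "B", "B")

# Simple counts-per-million: scale each sample by its library size.
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Assumed rule: a gene is kept if its CPM exceeds the threshold in at
# least half as many samples as the smallest group contains.
threshold <- 1
min_samples <- min(table(groups)) / 2
keep <- rowSums(cpm > threshold) >= min_samples

filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Tying the sample requirement to the smallest group is what lets a gene expressed only in the two group-B samples survive, even though it is absent from most of the dataset.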

Or filter group-specific genes:

# Remove genes on the X, Y, and MT chromosomes
tmp = filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name")
# Keep only genes on the X, Y, and MT chromosomes
tmp = filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name",keepOnly=TRUE)
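
The difference between the two calls can be mimicked with plain data-frame indexing. The sketch below uses a hypothetical five-gene annotation table (the gene names are invented) and shows the two complementary selections: dropping genes whose annotation matches the labels, and keeping only those genes.

```r
# Hypothetical feature annotation table; gene names are invented.
feature_data <- data.frame(
  gene = c("g1", "g2", "g3", "g4", "g5"),
  chromosome_name = c("1", "X", "Y", "MT", "7"),
  stringsAsFactors = FALSE
)

labels <- c("X", "Y", "MT")
on_listed <- feature_data$chromosome_name %in% labels

# Analogous to the first call: remove genes on the listed chromosomes.
without_listed <- feature_data[!on_listed, ]

# Analogous to keepOnly=TRUE: retain only genes on the listed chromosomes.
only_listed <- feature_data[on_listed, ]

print(without_listed$gene)
print(only_listed$gene)
```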
  6. Normalize in a tissue or group-aware manner
plotDensity(skin_filtered,"SMTSD",main=expression('log'[2]*' raw expression'))

skin_filtered = normalizeTissueAware(skin_filtered,"SMTSD")
plotDensity(skin_filtered,"SMTSD",normalized=TRUE,main="Normalized")
## normalizedMatrix is assumed to already be log-transformed
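
As a rough illustration of what "group-aware" normalization means, the base R sketch below applies plain quantile normalization separately within each group, so that samples share a distribution with their own group rather than with the whole dataset. This is a swapped-in, simplified technique on toy data and is not presented as what normalizeTissueAware does internally; the helper names quantile_normalize and normalize_by_group are hypothetical.

```r
# Plain quantile normalization: every column is mapped onto the mean of
# the sorted columns (hypothetical helper, for illustration only).
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Apply the normalization within each group of samples separately.
normalize_by_group <- function(x, groups) {
  out <- x
  for (g in unique(groups)) {
    cols <- groups == g
    out[, cols] <- quantile_normalize(x[, cols, drop = FALSE])
  }
  out
}

# Toy expression values: two "skin" samples and two "blood" samples.
x <- cbind(
  s1 = c(1, 2, 3, 4),
  s2 = c(2, 4, 6, 8),
  b1 = c(10, 20, 30, 40),
  b2 = c(40, 30, 20, 10)
)
groups <- c("skin", "skin", "blood", "blood")

normalized <- normalize_by_group(x, groups)
print(normalized)
```

After this step the two skin samples have identical sorted values, as do the two blood samples, while the global difference between skin and blood is left intact.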

Helper functions

Other than checkMisAnnotation and checkTissuesToMerge, we provide a few plotting functions: plotCMDS, plotDensity, and plotHeatmap.

plotCMDS - PCoA / Classical Multi-Dimensional Scaling of the most variable genes.

data(skin)
res = plotCMDS(skin,pch=21,bg=factor(pData(skin)$SMTSD))
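
For readers unfamiliar with the technique, the following base R sketch shows the two ingredients named above on toy data: ranking genes by variability and running classical multidimensional scaling on the top ones. The values, the gene names, the use of the standard deviation as the variability measure, and the choice of two top genes are assumptions for illustration, not a description of plotCMDS internals.

```r
# Toy log2 expression: two flat genes and two highly variable genes.
log_expr <- rbind(
  flat1 = c(5.0, 5.1, 5.0, 5.1, 5.0, 5.1),
  var1  = c(1.0, 1.2, 0.9, 9.0, 9.1, 8.8),
  flat2 = c(3.0, 3.0, 3.1, 3.0, 3.1, 3.0),
  var2  = c(8.0, 7.9, 8.2, 2.0, 2.1, 1.9)
)
colnames(log_expr) <- paste0("s", 1:6)

# Rank genes by standard deviation and keep the n most variable.
n_top <- 2
gene_sd <- apply(log_expr, 1, sd)
top_genes <- names(sort(gene_sd, decreasing = TRUE))[seq_len(n_top)]

# Classical MDS (PCoA) on Euclidean distances between samples.
coords <- cmdscale(dist(t(log_expr[top_genes, ])), k = 2)

print(top_genes)
print(dim(coords))
```

Each row of coords is one sample; plotting the two columns against each other gives the kind of scatter in which samples s1 to s3 and s4 to s6 fall into separate clusters.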

plotDensity - Density plots colored by a phenotype of your choosing. Allows for inspection of global trend differences.

filtData = filterLowGenes(skin,"SMTSD")
plotDensity(filtData,groups="SMTSD",legendPos="topleft")

plotHeatmap - Heatmap of the most variable genes.

library(RColorBrewer)
tissues = pData(skin)$SMTSD
heatmapColColors=brewer.pal(12,"Set3")[as.integer(factor(tissues))]
heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50)
plotHeatmap(skin,normalized=FALSE,log=TRUE,trace="none",n=10,
 col = heatmapCols,ColSideColors = heatmapColColors,cexRow = 0.25,cexCol = 0.25)

Information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-3  yarn_1.33.0         Biobase_2.67.0     
## [4] BiocGenerics_0.53.3 generics_0.1.3      rmarkdown_2.29     
## 
## loaded via a namespace (and not attached):
##   [1] sys_3.4.3                   jsonlite_1.8.9             
##   [3] magrittr_2.0.3              GenomicFeatures_1.59.1     
##   [5] BiocIO_1.17.1               zlibbioc_1.52.0            
##   [7] vctrs_0.6.5                 multtest_2.63.0            
##   [9] Rsamtools_2.23.1            memoise_2.0.1              
##  [11] DelayedMatrixStats_1.29.0   RCurl_1.98-1.16            
##  [13] askpass_1.2.1               htmltools_0.5.8.1          
##  [15] S4Arrays_1.7.1              progress_1.2.3             
##  [17] curl_6.0.1                  Rhdf5lib_1.29.0            
##  [19] SparseArray_1.7.2           rhdf5_2.51.1               
##  [21] sass_0.4.9                  nor1mix_1.3-3              
##  [23] KernSmooth_2.23-24          bslib_0.8.0                
##  [25] plyr_1.8.9                  httr2_1.0.7                
##  [27] cachem_1.1.0                GenomicAlignments_1.43.0   
##  [29] buildtools_1.0.0            lifecycle_1.0.4            
##  [31] iterators_1.0.14            pkgconfig_2.0.3            
##  [33] Matrix_1.7-1                R6_2.5.1                   
##  [35] fastmap_1.2.0               GenomeInfoDbData_1.2.13    
##  [37] MatrixGenerics_1.19.0       digest_0.6.37              
##  [39] colorspace_2.1-1            siggenes_1.81.0            
##  [41] reshape_0.8.9               AnnotationDbi_1.69.0       
##  [43] S4Vectors_0.45.2            GenomicRanges_1.59.1       
##  [45] RSQLite_2.3.9               base64_2.0.2               
##  [47] filelock_1.0.3              httr_1.4.7                 
##  [49] abind_1.4-8                 compiler_4.4.2             
##  [51] beanplot_1.3.1              rngtools_1.5.2             
##  [53] bit64_4.5.2                 doParallel_1.0.17          
##  [55] downloader_0.4              quantro_1.41.0             
##  [57] BiocParallel_1.41.0         DBI_1.2.3                  
##  [59] HDF5Array_1.35.2            gplots_3.2.0               
##  [61] biomaRt_2.63.0              MASS_7.3-61                
##  [63] openssl_2.3.0               rappdirs_0.3.3             
##  [65] DelayedArray_0.33.3         rjson_0.2.23               
##  [67] gtools_3.9.5                caTools_1.18.3             
##  [69] tools_4.4.2                 rentrez_1.2.3              
##  [71] quadprog_1.5-8              glue_1.8.0                 
##  [73] restfulr_0.0.15             nlme_3.1-166               
##  [75] rhdf5filters_1.19.0         grid_4.4.2                 
##  [77] gtable_0.3.6                tzdb_0.4.0                 
##  [79] preprocessCore_1.69.0       tidyr_1.3.1                
##  [81] data.table_1.16.4           hms_1.1.3                  
##  [83] xml2_1.3.6                  XVector_0.47.1             
##  [85] foreach_1.5.2               pillar_1.10.0              
##  [87] stringr_1.5.1               limma_3.63.2               
##  [89] genefilter_1.89.0           splines_4.4.2              
##  [91] dplyr_1.1.4                 BiocFileCache_2.15.0       
##  [93] lattice_0.22-6              rtracklayer_1.67.0         
##  [95] survival_3.8-3              bit_4.5.0.1                
##  [97] GEOquery_2.75.0             annotate_1.85.0            
##  [99] tidyselect_1.2.1            locfit_1.5-9.10            
## [101] maketools_1.3.1             Biostrings_2.75.3          
## [103] knitr_1.49                  IRanges_2.41.2             
## [105] edgeR_4.5.1                 SummarizedExperiment_1.37.0
## [107] stats4_4.4.2                xfun_0.49                  
## [109] scrime_1.3.5                statmod_1.5.0              
## [111] matrixStats_1.4.1           stringi_1.8.4              
## [113] UCSC.utils_1.3.0            yaml_2.3.10                
## [115] evaluate_1.0.1              codetools_0.2-20           
## [117] tibble_3.2.1                minfi_1.53.1               
## [119] cli_3.6.3                   bumphunter_1.49.0          
## [121] xtable_1.8-4                munsell_0.5.1              
## [123] jquerylib_0.1.4             Rcpp_1.0.13-1              
## [125] GenomeInfoDb_1.43.2         dbplyr_2.5.0               
## [127] png_0.1-8                   XML_3.99-0.17              
## [129] parallel_4.4.2              readr_2.1.5                
## [131] ggplot2_3.5.1               blob_1.2.4                 
## [133] prettyunits_1.2.0           mclust_6.1.1               
## [135] doRNG_1.8.6                 sparseMatrixStats_1.19.0   
## [137] bitops_1.0-9                scales_1.3.0               
## [139] illuminaio_0.49.0           purrr_1.0.2                
## [141] crayon_1.5.3                rlang_1.1.4                
## [143] KEGGREST_1.47.0