The goal of yarn is to expedite large RNA-seq analyses using a combination of previously developed tools. yarn is meant to make it easier for the user to compare conditions accurately by leveraging many Bioconductor tools and various statistical and normalization techniques, while accounting for the large heterogeneity and sparsity found in very large RNA-seq experiments.
You can install yarn from GitHub with:
If you’re here to grab the GTEx version 6.0 data then look no further than this gist that uses yarn to download all the data and preprocess it for you.
Below are a few of the functions we can use to preprocess a large RNA-seq experiment. We follow a particular procedure where we:
We will make use of the skin dataset for examples. The skin dataset is a small sample of the full GTEx data that can be downloaded using the downloadGTEx function. The skin dataset looks like this:
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 40824 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## rowNames: GTEX-OHPL-0008-SM-4E3I9 GTEX-145MN-1526-SM-5SI9T ...
## GTEX-144FL-0626-SM-5LU43 (20 total)
## varLabels: SAMPID SMATSSCR ... DTHHRDY (65 total)
## varMetadata: labelDescription
## featureData
## featureNames: 48350 48365 ... 7565 (40824 total)
## fvarLabels: ensembl_gene_id hgnc_symbol ... gene_biotype (6 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
This is a basic workflow. Details will be fleshed out:
For computational reasons we load the sample skin data instead of having the user download the full GTEx data.
## Features Samples
## 40824 20
## Features Samples
## 19933 20
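The feature counts above drop from 40824 to 19933 after filtering. One common way to obtain such a reduction is to discard genes whose counts per million (CPM) stay below a cutoff in too many samples. The following is a minimal base R sketch of that idea on a toy matrix; the data, the cutoff of 1 CPM and the minimum of 2 samples are all assumptions chosen for illustration and are not taken from yarn.

```r
# Hypothetical count matrix: 5 genes x 4 samples
counts <- matrix(
  c(1000000, 1200000, 900000, 1100000,
     500000,  400000, 600000,  450000,
          0,       1,      0,       1,
         30,      25,     40,      35,
          0,       0,      0,       0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:5]), paste0("sample", 1:4))
)

# Counts per million: divide each sample (column) by its library size
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Keep genes whose CPM exceeds the cutoff in at least min_samples samples
threshold <- 1     # assumed CPM cutoff
min_samples <- 2   # assumed minimum number of samples
keep <- rowSums(cpm > threshold) >= min_samples
filtered <- counts[keep, , drop = FALSE]

dim(counts)
dim(filtered)
```

In this toy example geneC and geneE never reach 1 CPM, so only geneA, geneB and geneD remain.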
Or filter group-specific genes:
# Remove the genes annotated to the X, Y and MT chromosomes
tmp = filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name")
# Keep only the genes annotated to the X, Y and MT chromosomes
tmp = filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name",keepOnly=TRUE)
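The two calls above differ only in keepOnly. As a rough illustration of the underlying set logic, the base R sketch below applies the same kind of label matching to a small, hypothetical annotation table standing in for the feature data of skin. It does not call yarn, and the variable names are made up for the example.

```r
# Hypothetical feature annotation with the two columns used in the example
features <- data.frame(
  hgnc_symbol = c("XIST", "SRY", "MT-CO1", "TP53", "GAPDH"),
  chromosome_name = c("X", "Y", "MT", "17", "12"),
  stringsAsFactors = FALSE
)
labels <- c("X", "Y", "MT")

# TRUE for genes annotated to one of the listed chromosomes
on_listed <- features$chromosome_name %in% labels

# Analogue of the first call: drop the genes on the listed chromosomes
removed <- features[!on_listed, ]
# Analogue of keepOnly = TRUE: keep only the genes on the listed chromosomes
kept <- features[on_listed, ]

removed$hgnc_symbol
kept$hgnc_symbol
```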
skin_filtered = normalizeTissueAware(skin_filtered,"SMTSD")
plotDensity(skin_filtered,"SMTSD",normalized=TRUE,main="Normalized")
## normalizedMatrix is assumed to already be log-transformed
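To give a feel for what normalizing with respect to a grouping variable such as SMTSD can mean, here is a deliberately simplified base R sketch: plain quantile normalization applied separately within each group of samples. This is a stand-in technique chosen for brevity and is not presented as the algorithm behind normalizeTissueAware. The helper names quantile_normalize and normalize_within_groups, the toy data and the group labels are all hypothetical.

```r
# Quantile-normalize the columns of a matrix: every column is mapped onto
# the same reference distribution, the mean of the sorted columns
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  reference <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) reference[r])
  dimnames(out) <- dimnames(x)
  out
}

# Apply quantile normalization separately within each group of samples
normalize_within_groups <- function(x, groups) {
  out <- x
  for (g in unique(groups)) {
    cols <- which(groups == g)
    out[, cols] <- quantile_normalize(x[, cols, drop = FALSE])
  }
  out
}

# Toy log-expression matrix: 10 genes x 4 samples, two hypothetical tissues
set.seed(1)
x <- matrix(rnorm(40, mean = 5), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("s", 1:4)))
x[, 3:4] <- x[, 3:4] + 2   # the second tissue has a global shift
groups <- c("tissueA", "tissueA", "tissueB", "tissueB")

normalized <- normalize_within_groups(x, groups)
```

After this step the samples within a tissue share one distribution, while the global difference between the two tissues is left in place rather than being normalized away.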
Other than checkMisAnnotation and checkTissuesToMerge, we provide a few plotting functions: plotCMDS, plotDensity, and plotHeatmap.
plotCMDS - PCoA / Classical Multi-Dimensional Scaling of the most variable genes.
plotDensity - Density plots colored by a phenotype of your choosing. Allows for inspection of global trend differences.
plotHeatmap - Heatmap of the most variable genes.
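As a rough illustration of classical multi-dimensional scaling on the most variable genes, the sketch below uses only base R (sd, dist and cmdscale) on simulated data. It is not the implementation of plotCMDS. The matrix, the group labels, the use of the standard deviation as the variability measure and the choice of n = 20 are assumptions made for the example.

```r
set.seed(42)
# Hypothetical log-expression matrix: 200 genes x 12 samples in two groups
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))
group <- rep(c("A", "B"), each = 6)
# Shift the first 20 genes in group B so that they carry a group signal
expr[1:20, group == "B"] <- expr[1:20, group == "B"] + 4

# Select the n genes with the largest standard deviation across samples
n <- 20
gene_sd <- apply(expr, 1, sd)
top <- order(gene_sd, decreasing = TRUE)[seq_len(n)]

# Classical MDS on the Euclidean distances between samples
d <- dist(t(expr[top, ]))
coords <- cmdscale(d, k = 2)

# Plot the first two coordinates, colored by group
plot(coords, col = ifelse(group == "A", "steelblue", "firebrick"), pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
```

Restricting the distance computation to the most variable genes keeps the scaling focused on features that actually differ between samples.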
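library(RColorBrewer)
# One color per tissue type for the column side bar
tissues = pData(skin)$SMTSD
heatmapColColors=brewer.pal(12,"Set3")[as.integer(factor(tissues))]
# Red-blue palette with 50 steps for the expression values
heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50)
# Heatmap of the 10 most variable genes on log-transformed raw counts
plotHeatmap(skin,normalized=FALSE,log=TRUE,trace="none",n=10,
            col = heatmapCols,ColSideColors = heatmapColColors,cexRow = 0.25,cexCol = 0.25)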
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 yarn_1.33.0 Biobase_2.67.0
## [4] BiocGenerics_0.53.1 generics_0.1.3 rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9
## [3] magrittr_2.0.3 GenomicFeatures_1.59.0
## [5] BiocIO_1.17.0 zlibbioc_1.52.0
## [7] vctrs_0.6.5 multtest_2.63.0
## [9] Rsamtools_2.22.0 memoise_2.0.1
## [11] DelayedMatrixStats_1.29.0 RCurl_1.98-1.16
## [13] askpass_1.2.1 htmltools_0.5.8.1
## [15] S4Arrays_1.6.0 progress_1.2.3
## [17] curl_5.2.3 Rhdf5lib_1.28.0
## [19] SparseArray_1.6.0 rhdf5_2.50.0
## [21] sass_0.4.9 nor1mix_1.3-3
## [23] KernSmooth_2.23-24 bslib_0.8.0
## [25] plyr_1.8.9 httr2_1.0.5
## [27] cachem_1.1.0 GenomicAlignments_1.43.0
## [29] buildtools_1.0.0 lifecycle_1.0.4
## [31] iterators_1.0.14 pkgconfig_2.0.3
## [33] Matrix_1.7-1 R6_2.5.1
## [35] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] MatrixGenerics_1.19.0 digest_0.6.37
## [39] colorspace_2.1-1 siggenes_1.80.0
## [41] reshape_0.8.9 AnnotationDbi_1.69.0
## [43] S4Vectors_0.44.0 GenomicRanges_1.59.0
## [45] RSQLite_2.3.7 base64_2.0.2
## [47] filelock_1.0.3 fansi_1.0.6
## [49] httr_1.4.7 abind_1.4-8
## [51] compiler_4.4.1 beanplot_1.3.1
## [53] rngtools_1.5.2 bit64_4.5.2
## [55] doParallel_1.0.17 downloader_0.4
## [57] quantro_1.41.0 BiocParallel_1.41.0
## [59] DBI_1.2.3 highr_0.11
## [61] HDF5Array_1.35.1 gplots_3.2.0
## [63] biomaRt_2.63.0 MASS_7.3-61
## [65] openssl_2.2.2 rappdirs_0.3.3
## [67] DelayedArray_0.33.1 rjson_0.2.23
## [69] gtools_3.9.5 caTools_1.18.3
## [71] tools_4.4.1 rentrez_1.2.3
## [73] quadprog_1.5-8 glue_1.8.0
## [75] restfulr_0.0.15 nlme_3.1-166
## [77] rhdf5filters_1.19.0 grid_4.4.1
## [79] gtable_0.3.6 tzdb_0.4.0
## [81] preprocessCore_1.69.0 tidyr_1.3.1
## [83] data.table_1.16.2 hms_1.1.3
## [85] xml2_1.3.6 utf8_1.2.4
## [87] XVector_0.46.0 foreach_1.5.2
## [89] pillar_1.9.0 stringr_1.5.1
## [91] limma_3.63.0 genefilter_1.89.0
## [93] splines_4.4.1 dplyr_1.1.4
## [95] BiocFileCache_2.15.0 lattice_0.22-6
## [97] rtracklayer_1.66.0 survival_3.7-0
## [99] bit_4.5.0 GEOquery_2.75.0
## [101] annotate_1.85.0 tidyselect_1.2.1
## [103] locfit_1.5-9.10 maketools_1.3.1
## [105] Biostrings_2.75.0 knitr_1.48
## [107] IRanges_2.41.0 edgeR_4.4.0
## [109] SummarizedExperiment_1.36.0 stats4_4.4.1
## [111] xfun_0.48 scrime_1.3.5
## [113] statmod_1.5.0 matrixStats_1.4.1
## [115] stringi_1.8.4 UCSC.utils_1.2.0
## [117] yaml_2.3.10 evaluate_1.0.1
## [119] codetools_0.2-20 tibble_3.2.1
## [121] minfi_1.53.0 cli_3.6.3
## [123] bumphunter_1.49.0 xtable_1.8-4
## [125] munsell_0.5.1 jquerylib_0.1.4
## [127] Rcpp_1.0.13 GenomeInfoDb_1.43.0
## [129] dbplyr_2.5.0 png_0.1-8
## [131] XML_3.99-0.17 parallel_4.4.1
## [133] readr_2.1.5 ggplot2_3.5.1
## [135] blob_1.2.4 prettyunits_1.2.0
## [137] mclust_6.1.1 doRNG_1.8.6
## [139] sparseMatrixStats_1.18.0 bitops_1.0-9
## [141] scales_1.3.0 illuminaio_0.49.0
## [143] purrr_1.0.2 crayon_1.5.3
## [145] rlang_1.1.4 KEGGREST_1.47.0