awst
Installing awst
R is an open-source statistical environment that can be easily extended via packages. awst is an R package available from the Bioconductor repository. R can be installed on any operating system from CRAN, after which you can install awst by using the following commands in your R session:
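A minimal sketch of those commands, assuming the standard BiocManager-based installation route that Bioconductor recommends for its packages:

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install awst from Bioconductor
BiocManager::install("awst")
```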
awst builds on many other packages, in particular those that implement the infrastructure needed for handling RNA-seq data, such as SummarizedExperiment.
If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve, so it is critical to learn where to ask for help. The blog post quoted above mentions some options, but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the awst tag and check the older posts. Other alternatives are available, such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
Citing awst
We hope that awst will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation("awst")
#> Warning in person1(given = given[[i]], family = family[[i]], middle =
#> middle[[i]], : It is recommended to use 'given' instead of 'middle'.
#> To cite package 'awst' in publications use:
#>
#> Risso D, Pagnotta SM (2021). "Per-sample standardization and
#> asymmetric winsorization lead to accurate clustering of RNA-seq
#> expression profiles." _Bioinformatics_.
#> doi:10.1093/bioinformatics/btab091
#> <https://doi.org/10.1093/bioinformatics/btab091>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles},
#> author = {Davide Risso and Stefano Maria Pagnotta},
#> year = {2021},
#> journal = {Bioinformatics},
#> doi = {10.1093/bioinformatics/btab091},
#> }
What does awst do?
AWST aims to regularize the original read counts to reduce the effect of noise on the clustering of samples. In fact, gene expression data are characterized by high levels of noise in both lowly expressed features, which suffer from background effects and a low signal-to-noise ratio, and highly expressed features, which may be the result of amplification bias and other experimental artifacts. These effects are of utmost importance in highly degraded or low-input samples, such as tumor samples and single cells.
AWST comprises two main steps. In the first one, namely the
standardization step, we standardize the counts by centering and scaling
them, exploiting the log-normal probability distribution. We refer to
the standardized counts as z-counts. The second step, namely the
smoothing step, leverages a highly skewed transformation that decreases
the noise while preserving the influence of genes to separate molecular
subtypes. These two steps are implemented in the awst function.
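To make the two steps concrete, here is a toy sketch in base R. It is not the code used inside awst: the log(x + 1) offset, the plain mean/sd standardization, the hard clipping, and the bounds -2 and 4 are all simplifying assumptions chosen for illustration, and the helper names are hypothetical. The package's own smoothing step is described above as a highly skewed transformation, which this hard clipping only approximates in spirit.

```r
# Toy illustration only: NOT the awst implementation.

# Hypothetical helper: standardize one sample's counts on the log scale.
toy_standardize <- function(counts) {
    log_counts <- log(counts + 1)  # +1 avoids log(0); an assumption of this sketch
    (log_counts - mean(log_counts)) / sd(log_counts)
}

# Hypothetical helper: asymmetric clipping with illustrative bounds.
toy_clip <- function(z, lower = -2, upper = 4) {
    pmin(pmax(z, lower), upper)
}

set.seed(1)
counts <- rnbinom(1000, mu = 100, size = 0.5)  # simulated counts for one sample
z <- toy_standardize(counts)  # centered and scaled log-counts
w <- toy_clip(z)              # values limited to [-2, 4]
summary(w)
```

The asymmetric bounds mean that low values are cut off much closer to the center than high values, which is the intuition behind limiting the influence of lowly expressed features while leaving more room for highly expressed ones.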
A further filtering method, implemented in the gene_filter function, is suggested to remove those features that only contribute noise to the clustering.
Here, we will use the data in the airway package to illustrate the awst approach.
Please see our paper (Risso and Pagnotta, 2021) and this repository for more extensive and biologically relevant examples.
library(awst)
library(airway)
library(SummarizedExperiment)
data(airway)
airway
#> class: RangedSummarizedExperiment
#> dim: 63677 8
#> metadata(1): ''
#> assays(1): counts
#> rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
#> ENSG00000273493
#> rowData names(10): gene_id gene_name ... seq_coord_system symbol
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample
The data are stored in a RangedSummarizedExperiment, a special case of the SummarizedExperiment class, one of the central classes in Bioconductor. If you are not familiar with it, we recommend looking at its vignette, available at SummarizedExperiment.
First, we filter out non-expressed genes. For simplicity, we remove those genes with fewer than 10 reads on average across samples.
filter <- rowMeans(assay(airway)) >= 10
table(filter)
#> filter
#> FALSE TRUE
#> 47587 16090
se <- airway[filter,]
We are left with 16090 genes. We are now ready to apply awst to the data.
se <- awst(se)
se
#> class: RangedSummarizedExperiment
#> dim: 16090 8
#> metadata(1): ''
#> assays(2): counts awst
#> rownames(16090): ENSG00000000003 ENSG00000000419 ... ENSG00000273472
#> ENSG00000273486
#> rowData names(10): gene_id gene_name ... seq_coord_system symbol
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample
plot(density(assay(se, "awst")[,1]), main = "Sample 1")
We can see that the majority of the values have been shrunk to around −2, while the other values gradually increase up to around 4. Reducing the contribution of lowly expressed genes and winsorizing the highly expressed ones results in a better separation of the samples, reflecting biological differences (Risso and Pagnotta, 2021).
The other main function of the awst package is gene_filter. It can be used to remove those genes that contribute little to nothing to the distance between samples. The function uses an entropy measure to remove the uninformative genes.
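A minimal sketch of this step, assuming the awst, airway, and SummarizedExperiment packages are installed and that gene_filter() is called with its default settings; the object name filtered is our own choice:

```r
library(awst)
library(airway)
library(SummarizedExperiment)

# Rebuild the transformed object so that this example runs on its own
data(airway)
se <- airway[rowMeans(assay(airway)) >= 10, ]
se <- awst(se)

# Keep only the genes that gene_filter() considers informative
filtered <- gene_filter(se)
dim(filtered)
```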
Our final dataset is made of 8 genes.
We can see how the awst transformation leads to separation between treatment (along PC1) and cell line (along PC2).
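One way to inspect this separation is a principal component analysis of the filtered, transformed values. The following is a sketch under our own assumptions: it uses base R's prcomp() and base graphics, and the colors and plotting symbols are arbitrary choices rather than the styling of the original figure.

```r
library(awst)
library(airway)
library(SummarizedExperiment)

# Rebuild the filtered object so that this example runs on its own
data(airway)
se <- airway[rowMeans(assay(airway)) >= 10, ]
filtered <- gene_filter(awst(se))

# PCA expects samples in rows, hence the transpose
res_pca <- prcomp(t(assay(filtered, "awst")))

# Color encodes treatment (dex), symbol encodes cell line (cell)
plot(res_pca$x[, 1:2],
     col = as.integer(factor(filtered$dex)),
     pch = as.integer(factor(filtered$cell)) + 14,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(factor(filtered$dex)),
       col = seq_along(levels(factor(filtered$dex))), pch = 15)
```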
Although in this example awst applied to raw data works well, a prior normalization step can help. We have found that full-quantile normalization works well and has the computational advantage of allowing awst to estimate the parameters only once for all samples (Risso and Pagnotta, 2021).
Here we show the results of awst after full-quantile normalization (implemented in EDASeq).
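A sketch of how this could be done, assuming EDASeq's betweenLaneNormalization() with which = "full" for the normalization and the expr_values argument of awst() to select the normalized assay; the assay name "fq" is our own choice:

```r
library(awst)
library(airway)
library(EDASeq)
library(SummarizedExperiment)

# Start again from the filtered raw counts so that this example runs on its own
data(airway)
se <- airway[rowMeans(assay(airway)) >= 10, ]

# Full-quantile normalization of the raw counts, stored as a new assay
assay(se, "fq") <- betweenLaneNormalization(assay(se), which = "full")

# Apply awst to the normalized values instead of the raw counts
se <- awst(se, expr_values = "fq")
assayNames(se)
```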
The awst package (Risso and Pagnotta, 2021) was made possible thanks to:
This package was developed using biocthis.
Code for creating the vignette
## Create the vignette
library("rmarkdown")
system.time(render("awst_intro.Rmd", "BiocStyle::html_document"))
## Extract the R code
library("knitr")
knit("awst_intro.Rmd", tangle = TRUE)
Date the vignette was generated.
#> [1] "2024-10-30 04:17:40 UTC"
Wallclock time spent generating the vignette.
#> Time difference of 14.317 secs
R session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.1 (2024-06-14)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2024-10-30
#> pandoc 3.2.1 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [2] RSPM (R 4.4.0)
#> airway * 1.25.0 2024-05-02 [2] Bioconductor 3.20 (R 4.4.1)
#> AnnotationDbi 1.69.0 2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#> aroma.light 3.37.0 2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#> awst * 1.15.0 2024-10-30 [1] https://bioc.r-universe.dev (R 4.4.1)
#> backports 1.5.0 2024-05-23 [2] RSPM (R 4.4.0)
#> bibtex 0.5.1 2023-01-26 [2] RSPM (R 4.4.0)
#> Biobase * 2.65.1 2024-10-27 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocFileCache 2.13.2 2024-10-11 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocGenerics * 0.51.3 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocIO 1.15.2 2024-10-22 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocManager 1.30.25 2024-08-28 [2] RSPM (R 4.4.0)
#> BiocParallel * 1.39.0 2024-10-23 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocStyle * 2.33.1 2024-10-18 [2] https://bioc.r-universe.dev (R 4.4.1)
#> biomaRt 2.61.3 2024-10-06 [2] https://bioc.r-universe.dev (R 4.4.1)
#> Biostrings * 2.73.2 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> bit 4.5.0 2024-09-20 [2] RSPM (R 4.4.0)
#> bit64 4.5.2 2024-09-22 [2] RSPM (R 4.4.0)
#> bitops 1.0-9 2024-10-03 [2] RSPM (R 4.4.0)
#> blob 1.2.4 2023-03-17 [2] RSPM (R 4.4.0)
#> bslib 0.8.0 2024-07-29 [2] RSPM (R 4.4.0)
#> buildtools 1.0.0 2024-10-28 [3] local (/pkg)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.4.0)
#> cli 3.6.3 2024-06-21 [2] RSPM (R 4.4.0)
#> codetools 0.2-20 2024-03-31 [2] RSPM (R 4.4.0)
#> colorspace 2.1-1 2024-07-26 [2] RSPM (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [2] RSPM (R 4.4.0)
#> curl 5.2.3 2024-09-20 [2] RSPM (R 4.4.0)
#> DBI 1.2.3 2024-06-02 [2] RSPM (R 4.4.0)
#> dbplyr 2.5.0 2024-03-19 [2] RSPM (R 4.4.0)
#> DelayedArray 0.31.14 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> deldir 2.0-4 2024-02-28 [2] RSPM (R 4.4.0)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.4.0)
#> dplyr 1.1.4 2023-11-17 [2] RSPM (R 4.4.0)
#> EDASeq * 2.39.0 2024-10-01 [2] https://bioc.r-universe.dev (R 4.4.1)
#> evaluate 1.0.1 2024-10-10 [2] RSPM (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [2] RSPM (R 4.4.0)
#> farver 2.1.2 2024-05-13 [2] RSPM (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.4.0)
#> filelock 1.0.3 2023-12-11 [2] RSPM (R 4.4.0)
#> generics 0.1.3 2022-07-05 [2] RSPM (R 4.4.0)
#> GenomeInfoDb * 1.41.2 2024-10-02 [2] https://bioc.r-universe.dev (R 4.4.1)
#> GenomeInfoDbData 1.2.13 2024-10-30 [2] Bioconductor
#> GenomicAlignments * 1.41.0 2024-10-28 [2] https://bioc.r-universe.dev (R 4.4.1)
#> GenomicFeatures 1.57.1 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> GenomicRanges * 1.57.2 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> ggplot2 * 3.5.1 2024-04-23 [2] RSPM (R 4.4.0)
#> glue 1.8.0 2024-09-30 [2] RSPM (R 4.4.0)
#> gtable 0.3.6 2024-10-25 [2] RSPM (R 4.4.0)
#> highr 0.11 2024-05-26 [2] RSPM (R 4.4.0)
#> hms 1.1.3 2023-03-21 [2] RSPM (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#> httr 1.4.7 2023-08-15 [2] RSPM (R 4.4.0)
#> httr2 1.0.5 2024-09-26 [2] RSPM (R 4.4.0)
#> hwriter 1.3.2.1 2022-04-08 [2] RSPM (R 4.4.0)
#> interp 1.1-6 2024-01-26 [2] RSPM (R 4.4.0)
#> IRanges * 2.39.2 2024-10-25 [2] https://bioc.r-universe.dev (R 4.4.1)
#> jpeg 0.1-10 2022-11-29 [2] RSPM (R 4.4.0)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] RSPM (R 4.4.0)
#> KEGGREST 1.45.1 2024-10-16 [2] https://bioc.r-universe.dev (R 4.4.1)
#> knitr 1.48 2024-07-07 [2] RSPM (R 4.4.0)
#> labeling 0.4.3 2023-08-29 [2] RSPM (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [2] RSPM (R 4.4.0)
#> latticeExtra 0.6-30 2022-07-04 [2] RSPM (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.4.0)
#> lubridate 1.9.3 2023-09-27 [2] RSPM (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.4.0)
#> maketools 1.3.1 2024-10-28 [3] Github (jeroen/maketools@d46f92c)
#> Matrix 1.7-1 2024-10-18 [2] RSPM (R 4.4.0)
#> MatrixGenerics * 1.17.1 2024-10-23 [2] https://bioc.r-universe.dev (R 4.4.1)
#> matrixStats * 1.4.1 2024-09-08 [2] RSPM (R 4.4.0)
#> memoise 2.0.1 2021-11-26 [2] RSPM (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [2] RSPM (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [2] RSPM (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.4.0)
#> plyr 1.8.9 2023-10-02 [2] RSPM (R 4.4.0)
#> png 0.1-8 2022-11-29 [2] RSPM (R 4.4.0)
#> prettyunits 1.2.0 2023-09-24 [2] RSPM (R 4.4.0)
#> progress 1.2.3 2023-12-06 [2] RSPM (R 4.4.0)
#> pwalign 1.1.0 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> R.methodsS3 1.8.2 2022-06-13 [2] RSPM (R 4.4.0)
#> R.oo 1.26.0 2024-01-24 [2] RSPM (R 4.4.0)
#> R.utils 2.12.3 2023-11-18 [2] RSPM (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] RSPM (R 4.4.0)
#> rappdirs 0.3.3 2021-01-31 [2] RSPM (R 4.4.0)
#> RColorBrewer 1.1-3 2022-04-03 [2] RSPM (R 4.4.0)
#> Rcpp 1.0.13 2024-07-17 [2] RSPM (R 4.4.0)
#> RCurl 1.98-1.16 2024-07-11 [2] RSPM (R 4.4.0)
#> RefManageR * 1.4.0 2022-09-30 [2] RSPM (R 4.4.0)
#> restfulr 0.0.15 2022-06-16 [2] RSPM (R 4.4.1)
#> rjson 0.2.23 2024-09-16 [2] RSPM (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] RSPM (R 4.4.0)
#> rmarkdown 2.28 2024-08-17 [2] RSPM (R 4.4.0)
#> Rsamtools * 2.21.2 2024-10-26 [2] https://bioc.r-universe.dev (R 4.4.1)
#> RSQLite 2.3.7 2024-05-27 [2] RSPM (R 4.4.0)
#> rtracklayer 1.65.0 2024-10-23 [2] https://bioc.r-universe.dev (R 4.4.1)
#> S4Arrays 1.5.11 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> S4Vectors * 0.43.2 2024-10-17 [2] https://bioc.r-universe.dev (R 4.4.1)
#> sass 0.4.9 2024-03-15 [2] RSPM (R 4.4.0)
#> scales 1.3.0 2023-11-28 [2] RSPM (R 4.4.0)
#> sessioninfo * 1.2.2 2021-12-06 [2] RSPM (R 4.4.0)
#> ShortRead * 1.63.2 2024-10-26 [2] https://bioc.r-universe.dev (R 4.4.1)
#> SparseArray 1.5.45 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> stringi 1.8.4 2024-05-06 [2] RSPM (R 4.4.0)
#> stringr 1.5.1 2023-11-14 [2] RSPM (R 4.4.0)
#> SummarizedExperiment * 1.35.5 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> sys 3.4.3 2024-10-04 [2] RSPM (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [2] RSPM (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [2] RSPM (R 4.4.0)
#> timechange 0.3.0 2024-01-18 [2] RSPM (R 4.4.0)
#> UCSC.utils 1.1.0 2024-10-29 [2] https://bioc.r-universe.dev (R 4.4.1)
#> utf8 1.2.4 2023-10-22 [2] RSPM (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [2] RSPM (R 4.4.0)
#> withr 3.0.2 2024-10-28 [2] RSPM (R 4.4.0)
#> xfun 0.48 2024-10-03 [2] RSPM (R 4.4.0)
#> XML 3.99-0.17 2024-06-25 [2] RSPM (R 4.4.0)
#> xml2 1.3.6 2023-12-04 [2] RSPM (R 4.4.0)
#> XVector * 0.45.0 2024-10-02 [2] https://bioc.r-universe.dev (R 4.4.1)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.4.0)
#> zlibbioc 1.51.2 2024-10-21 [2] Bioconductor 3.20 (R 4.4.1)
#>
#> [1] /tmp/RtmpkPMt7P/Rinst1cff79536f
#> [2] /github/workspace/pkglib
#> [3] /usr/local/lib/R/site-library
#> [4] /usr/lib/R/site-library
#> [5] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
This vignette was generated using BiocStyle (Oleś, 2024) with knitr (Xie, 2024) and rmarkdown (Allaire, Xie, Dervieux et al., 2024) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.28. 2024. URL: https://github.com/rstudio/rmarkdown.
[2] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[3] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.33.1. 2024. URL: https://github.com/Bioconductor/BiocStyle.
[4] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. URL: https://www.R-project.org/.
[5] D. Risso and S. M. Pagnotta. “Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles”. In: Bioinformatics (2021). DOI: 10.1093/bioinformatics/btab091.
[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[7] H. Wickham, W. Chang, R. Flight, et al. sessioninfo: R Session Information. R package version 1.2.2, https://r-lib.github.io/sessioninfo/. 2021. URL: https://github.com/r-lib/sessioninfo#readme.
[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48. 2024. URL: https://yihui.org/knitr/.