To install and load RNAAgeCalc
It has been shown that both DNA methylation and RNA transcription are linked to chronological age and age related diseases. Several estimators have been developed to predict human aging from DNA level and RNA level. Most of the human transcriptional age predictor are based on microarray data and limited to only a few tissues. To date, transcriptional studies on aging using RNASeq data from different human tissues is limited. The aim of this package is to provide a tool for across-tissue and tissue-specific transcriptional age calculation based on Genotype-Tissue Expression (GTEx) RNASeq data (Lonsdale et al. 2013).
We utilized the GTEx data to construct our across-tissue and
tissue-specific transcriptional age calculator. GTEx is a public
available genetic database for studying tissue specific gene expression
and regulation. GTEx V6 release contains gene expression data at gene,
exon, and transcript level of 9,662 samples from 30 different tissues.
To avoid the influence of tumor on gene expression, the 102 tumor
samples from GTEx V6 release are dropped and the remaining 9,560 samples
were used in the subsequent analysis. To facilitate integrated analysis
and direct comparison of multiple datasets, we utilized recount2 (Collado-Torres et al. 2017) version of GTEx
data, where all samples were processed with the same analytical
pipeline. FPKM values were calculated for each individual sample using
getRPKM
function in Bioconductor package recount.
For the tissue-specific RNASeq age calculator, elastic net (Zou and Hastie 2005) algorithm was used to train the predictors for each individual tissue. Chronological age was response variable whereas logarithm transformed FPKM of genes were predictors. The across-tissue calculator was constructed by first performing differential expression analysis on the RNASeq counts data for each individual tissue. To identify genes consistently differentially expressed across tissues, we adapted the binomial test discussed in de Magalhaes et al. (De Magalhães, Curado, and Church 2009) to find the genes with the largest number of age-related signals. A detailed explanation can be found in our paper.
The package is implemented as follows. For each tissue, signature and sample type (see below for the descriptions), we pre-trained the calculator using elastic net based on the GTEx samples. We saved the pre-trained model coefficients as internal data in the package. The package takes gene expression data as input and then match the input genes to the genes in the internal data. This matching process is automatic so that the users just need to provide gene expression data without having to pull out the internal coefficients.
The main functions to calculate RNASeq age are
predict_age
and predict_age_fromse
. Users can
use either of them. predict_age
function takes data frame
as input whereas predict_age_fromse
function takes SummarizedExperiment
as input. Both functions work in the same way internally. In this
section, we explain the arguments in predict_age
and
predict_age_fromse
respectively.
exprdata
is a matrix or data frame which contains gene
expression data with each row represents a gene and each column
represents a sample. Users are expected to use the argument
exprtype
to specify raw count or FPKM (see below). The
rownames of exprdata
should be gene ids and colnames of
exprdata
should be sample ids. Here is an example of FPKM
expression data:
SRR2166176 SRR2167642
AFF4 23.478648 21.7803347
ZGLP1 1.880514 2.6874455
CHDH 6.062126 5.9185735
MIR6801 0.000000 0.3882558
ZFP90 6.478456 7.1444598
KDM6B 5.849882 5.1937368
tissue
is a string indicates which tissue the gene
expression data is obtained from. Users are expected to provide one of
the following tissues. If the tissue argument is not provide or the
provided tissue is not in this list, the age predictor trained on all
tissues will be used to calculate RNA age.
exprtype
is either “counts” or “FPKM”. If
exprtype
is counts, the expression data will be converted
to FPKM by count2FPKM
automatically and the calculator will
be applied on FPKM data. When calculating FPKM, by default gene length
is obtained from the package’s internal database. The internal gene
length information was obtained from recount2. However, users are able
to provide their own gene length information by using
genelength
argument (see below).
idtype
is a string which indicates the gene id type in
exprdata
. Default is “SYMBOL”. The following id types are
supported.
stype
is a string which specifies which version of
pre-trained calculators to be used. Two versions are provided. If
stype="all"
, the calculator trained on samples from all
races (American Indian/Alaska Native, Asian, Black/African American, and
Caucasian) will be used. If stype="Caucasian"
, the
calculator trained on Caucasian samples only will be used. We found that
RNA Age signatures could be different in different races (see our paper
for details). Thus we provide both the universal calculator and race
specific calculator. The race specific calculator for American
Indian/Alaska Native, Asian, or Black/African American are not provided
due to the small sample size in GTEx data.
signature
is a string which indicate the age signature
to use when calculating RNA age. This argument is not required.
In the case that this argument is not provided, if
tissue
argument is also provided and the tissue is in the
list above, the tissue specific age signature given by our DESeq2
analysis result on GTEx data will be used. Otherwise, the across tissue
signature “GTExAge” will be used.
In the case that this argument is provided, it should be one of the following signatures.
If the genes in exprdata
do not cover all the genes in
the signature, imputation will be made automatically by the
impute.knn
function in Bioconductor package impute.
genelength
is a vector of gene length in bp. The size of
genelength
should be equal to the number of rows in
exprdata
. This argument is optional. When using
exprtype = "FPKM"
, genelength
argument is
ignored. When using exprtype = "counts"
, the raw count will
be converted to FPKM. If genelength
is provided, the
function will convert raw count to FPKM based on the user-supplied gene
length. Otherwise, gene length is obtained from the internal
database.
chronage
is a data frame which contains the
chronological age of each sample. This argument is optional.
If provided, it should be a dataframe with 1st column sample id and
2nd column chronological age. The sample order in chronage
doesn’t have to be in the same order as in exprdata
.
However, the samples in chronage
and exprdata
should be the same. If some samples’ chronological age are not
available, users are expected to set the chronological age in
chronage
to NA. If chronage
contains more than
2 columns, only the first 2 columns will be considered. If more than 30
samples’ chronological age are available, age acceleration residual will
be calculated. Age acceleration residual is defined as the residual of
linear regression with RNASeq age as dependent variable and
chronological age as independent variable.
If this argument is not provided, the age acceleration residual will not be calculated.
chronage = data.frame(sampleid = colnames(fpkm), age = c(30,50))
res = predict_age(exprdata = fpkm, tissue = "brain", exprtype = "FPKM",
chronage = chronage)
signature is not provided, using DESeq2 signature
automatically.
RNAAge ChronAge
SRR2166176 27.32436 30
SRR2167642 49.16285 50
In the above example, we calculated the RNASeq age for 2 samples
based on their gene expression data coming from brain. Since the sample
size is small, age acceleration residual are not calculated.
Here is an example with sample size > 30:
# This example is just for illustration purpose. It does not represent any
# real data.
# construct a large gene expression data
fpkm_large = cbind(fpkm, fpkm+1, fpkm+2, fpkm+3)
fpkm_large = cbind(fpkm_large, fpkm_large, fpkm_large, fpkm_large)
colnames(fpkm_large) = paste0("sample",1:32)
# construct the samples' chronological age
chronage2 = data.frame(sampleid = colnames(fpkm_large), age = 31:62)
res2 = predict_age(exprdata = fpkm_large, exprtype = "FPKM",
chronage = chronage2)
head(res2)
RNAAge ChronAge AgeAccelResid
sample1 34.84154 31 -31.254133
sample2 49.77605 32 -16.909233
sample3 65.00385 33 -2.271030
sample4 76.91602 34 9.051533
sample5 82.66541 35 14.211328
sample6 93.07443 36 24.030742
The main difference between predict_age_fromse
and
predict_age
is that predict_age_fromse
takes
SummarizedExperiment as input. The se
argument is a
SummarizedExperiment object.
assays(se)
should contain gene expression data. The
name of assays(se)
should be either “FPKM” or “counts”. Use
exprtype
argument to specify the type of gene expression
data provided.colData(se)
. This is optional. If provided, the column name
for chronological age in colData(se)
should be “age”. If
some samples’ chronological age are not available, users are expected to
set the chronological age in colData(se)
to NA. If
chronological age is not provided, the age acceleration residual will
not be calculated.rowData(se)
. This is also optional. If using
exprtype = "FPKM"
, the provided gene length will be
ignored. If provided, the column name for gene length in
rowData(se)
should be “bp_length”. The function will
convert raw count to FPKM by the user-supplied gene length. Otherwise,
gene length is obtained from the internal database.tissue
, exprtype
,
idtype
, stype
, signature
are
exactly the same as described in predict_age
function.We suggest visualizing the results by plotting RNAAge vs
chronological age. This can be done by calling makeplot
function and passing in the data frame returned by
predict_age
function.
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] SummarizedExperiment_1.37.0 Biobase_2.67.0
[3] GenomicRanges_1.59.0 GenomeInfoDb_1.43.1
[5] IRanges_2.41.1 S4Vectors_0.45.2
[7] BiocGenerics_0.53.3 generics_0.1.3
[9] MatrixGenerics_1.19.0 matrixStats_1.4.1
[11] RNAAgeCalc_1.19.0 rmarkdown_2.29
loaded via a namespace (and not attached):
[1] sys_3.4.3 rstudioapi_0.17.1 jsonlite_1.8.9
[4] magrittr_2.0.3 GenomicFeatures_1.59.1 farver_2.1.2
[7] BiocIO_1.17.0 zlibbioc_1.52.0 vctrs_0.6.5
[10] memoise_2.0.1 Rsamtools_2.23.0 RCurl_1.98-1.16
[13] base64enc_0.1-3 htmltools_0.5.8.1 S4Arrays_1.7.1
[16] curl_6.0.1 SparseArray_1.7.2 Formula_1.2-5
[19] sass_0.4.9 bslib_0.8.0 htmlwidgets_1.6.4
[22] plyr_1.8.9 impute_1.81.0 cachem_1.1.0
[25] buildtools_1.0.0 GenomicAlignments_1.43.0 lifecycle_1.0.4
[28] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-1
[31] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
[34] digest_0.6.37 colorspace_2.1-1 AnnotationDbi_1.69.0
[37] Hmisc_5.2-0 RSQLite_2.3.8 org.Hs.eg.db_3.20.0
[40] labeling_0.4.3 fansi_1.0.6 mgcv_1.9-1
[43] httr_1.4.7 abind_1.4-8 compiler_4.4.2
[46] rngtools_1.5.2 withr_3.0.2 downloader_0.4
[49] bit64_4.5.2 recount_1.33.0 htmlTable_2.4.3
[52] backports_1.5.0 BiocParallel_1.41.0 DBI_1.2.3
[55] DelayedArray_0.33.2 rjson_0.2.23 derfinderHelper_1.41.0
[58] tools_4.4.2 foreign_0.8-87 rentrez_1.2.3
[61] nnet_7.3-19 glue_1.8.0 restfulr_0.0.15
[64] nlme_3.1-166 grid_4.4.2 checkmate_2.3.2
[67] reshape2_1.4.4 derfinder_1.41.0 cluster_2.1.6
[70] gtable_0.3.6 BSgenome_1.75.0 tzdb_0.4.0
[73] tidyr_1.3.1 data.table_1.16.2 hms_1.1.3
[76] xml2_1.3.6 utf8_1.2.4 XVector_0.47.0
[79] foreach_1.5.2 pillar_1.9.0 stringr_1.5.1
[82] limma_3.63.2 splines_4.4.2 dplyr_1.1.4
[85] lattice_0.22-6 rtracklayer_1.67.0 bit_4.5.0
[88] GEOquery_2.75.0 tidyselect_1.2.1 locfit_1.5-9.10
[91] maketools_1.3.1 Biostrings_2.75.1 knitr_1.49
[94] gridExtra_2.3 xfun_0.49 statmod_1.5.0
[97] stringi_1.8.4 UCSC.utils_1.3.0 yaml_2.3.10
[100] evaluate_1.0.1 codetools_0.2-20 GenomicFiles_1.43.0
[103] qvalue_2.39.0 tibble_3.2.1 cli_3.6.3
[106] bumphunter_1.49.0 rpart_4.1.23 munsell_0.5.1
[109] jquerylib_0.1.4 Rcpp_1.0.13-1 png_0.1-8
[112] XML_3.99-0.17 parallel_4.4.2 ggplot2_3.5.1
[115] readr_2.1.5 blob_1.2.4 doRNG_1.8.6
[118] bitops_1.0-9 VariantAnnotation_1.53.0 scales_1.3.0
[121] purrr_1.0.2 crayon_1.5.3 rlang_1.1.4
[124] KEGGREST_1.47.0