receptLoss
is an R
package designed to identify novel nuclear hormone receptors (NHRs)
whose expression levels in cancers could serve as biomarkers for patient
survival.
By utilizing both expression data from both tumor and normal tissue,
receptLoss
provides biological context to the process of
tumor subclassification that is lacking in existing methods that rely
solely on expression patterns in tumor tissue.
receptLoss
is complementary to oncomix
.
Whereas oncomix
detects genes that gain
expression in subsets of tumors relative to normal tissue,
receptLoss
detects genes that lose
expression in subsets of tumors relative to normal tissue.
receptLoss
consists of 2 main functions:
receptLoss()
takes in 2 matrices of gene expression
data, one from tumor and one from adjacent normal tissue. The output is
a matrix with rows representing genes and columns representing summary
statistics.
plotReceptLoss()
generates a histogram visualization
of the distribution of the gene desired by the user.
We begin by simulating two gene expression data matrices, one from tumor and the other from normal tissue.
library(receptLoss)
library(dplyr)
library(ggplot2)
set.seed(100)
## Simulate matrix of expression values from
## 10 genes measured in both normal tissue and
## tumor tissue in 100 patients
exprMatrNml <- matrix(abs(rnorm(100, mean=2.5)), nrow=10)
exprMatrTum <- matrix(abs(rnorm(1000)), nrow=10)
geneNames <- paste0(letters[seq_len(nrow(exprMatrNml))],
seq_len(nrow(exprMatrNml)))
rownames(exprMatrNml) <- rownames(exprMatrTum) <- geneNames
exprMatrNml
and exprMatrTum
are m × n matrices containing
gene expression data from normal and tumor tissue, respectively, with
m genes as rows and n patients as columns. The row names
of these matrices are the gene names.
These two matrices should have the same number of rows (ie genes), with genes listed in the same order between the two matrices. However, they don’t have to have the same number of columns (ie patients).
To run receptLoss()
, we also define 2 parameters:
nSdBelow
is an integer value that places a lower
boundary (i.e. lowerBound
, shown as the pink ‘B’ in image
below) n standard deviations
below the mean of each gene’s expression levels in normal tissue (dotted
pink curve below). The larger nSdBelow
is, the smaller
(i.e. further to the left) the lowerBound
becomes.
nSdBelow
=2, as ~97.7% of the
normal tissue expression data should be greater than the
lowerBound
(assuming the expression data from normal tissue
is distributed as a Gaussian).minPropPerGroup
- a numeric value between (0, 0.5) indicating the minimum proportion of
tumor samples desired within each of the two tumor subgroups defined by
the lowerBound
. Determines the value of
meetsMinPropPerGrp
(either TRUE or FALSE) in the
output.
minPropPerGroup
=0.20. Values close
to 0 may result in the inclusion of genes that subdivide tumors into
very unequally-sized subgroups. Values closer to 0.5 will identify genes
that subdivide tumors into nearly equal-sized groups and may be
unnecessarily restrictive.<img src=“fig_2_1.png” alt=“fig 2”, width=“60%”>
nSdBelow <- 2
minPropPerGroup <- .2
rl <- receptLoss(exprMatrNml, exprMatrTum, nSdBelow, minPropPerGroup)
head(rl)
#> # A tibble: 6 × 7
#> geneNm lowerBound propTumLessThBound muAb muBl deltaMu meetsMinPropPerGrp
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 f6 0.928 0.58 1.52 0.425 1.10 TRUE
#> 2 b2 0.538 0.38 1.30 0.258 1.05 TRUE
#> 3 i9 0.379 0.24 1.10 0.203 0.893 TRUE
#> 4 g7 0.805 0.57 1.30 0.405 0.891 TRUE
#> 5 d4 0.359 0.32 1.04 0.174 0.866 TRUE
#> 6 c3 0.554 0.41 1.12 0.290 0.826 TRUE
The output of receptLoss()
is an m × 7 matrix, with m equaling the number of genes. The
7 columns are as follows:
geneNm
- the gene name
lowerBound
(B) - the value nSdBelow
the mean of the normal tissue expression data. Can be expressed as B = μN − σN * nsdBelow,
where μN
is the mean of the normal tissue expression data, σN is the
standard deviation of the normal tissue expression data, and nsdBelow
is the value nSdBelow
set by the user.
propTumLessThBound
(πL) - the
proportion of tumor samples with expression levels less than
lowerBound
. Can be expressed as: $$\pi_L =\frac{1}{N_T}\sum_{j=1}^{N_T} \Bigg\{
\begin{array}{ll} 1,~ if~x_{j} <
lowerBound \\
0, ~ otherwise
\end{array},
$$ where xj is the jth
tumor sample and NT is the total
number of tumor samples.
muAb
(μA) - “mu
above”, the arithmetic mean across expression values from tumors greater
than (ie above) the lowerBound
.
muBl
(μB) - “mu
below”, the arithmetic mean across expression values from tumors less
than (ie below) the lowerBound
.
deltaMu
(Δμ) - equal to μA − μB.
The rows in the output matrix are sorted in descending order by the
deltaMu
statistic, which indicates the degree of separation
between the two tumor subgroups. Higher deltaMu
values
indicate tumor subgroups that are more cleanly separated and more likely
to constitute a bimodal distribution within the tumor samples.
meetsMinPropPerGrp
- a logical indicating whether
the proportion of samples in each group is greater than that set by
minPropPerGroup
. If min(πL, 1 − πL)>
minPropPerGroup
, then meetsMinPropPerGrp
is
TRUE
; otherwise, it is FALSE
. Genes for which
meetsMinPropPerGrp
equals FALSE
can be
filtered out - they do not have a sufficient proportion of tumors in
each group to permit useful tumor subgrouping.
Let’s take the top-ranked gene and plot its distribution.
clrs <- c("#E78AC3", "#8DA0CB")
tryCatch({plotReceptLoss(exprMatrNml, exprMatrTum, rl,
geneName=as.character(rl[1,1]), clrs=clrs)},
warning=function(cond){
knitr::include_graphics("rl_fig.png")
}, error=function(cond){
knitr::include_graphics("rl_fig.png")
}
)
Here’s what this graph is showing us:
The x-axis represents RNA expression values, with lower values toward the left and larger values (i.e. higher expression) toward the right. The y-axis represents density. The name of the gene (“f6”) is shown in the upper left of the plot.
The dotted curve represents a Gaussian distribution fit to the expression data from normal tissue, and the blue histogram represents expression data from tumor tissue.
The pink vertical line corresponds to the lowerBound
for the expression data from normal tissue.
Since most normal tissue expresses the RNA above the
lowerBound
, any tumors that express the RNA below this
value have lost RNA expression relative to normal tissue. Thus, the
lowerBound
forms a boundary between 2 tumor subgroups that
either have or have not lost RNA expression relative to normal
tissue.
The question that inspired this package was whether the loss of expression of any of the ~50 NHRs (beyond the well-known estrogen, progesterone, and androgen NHRs) in uterine tumors was associated with differences in patient survival. NHRs might not only serve as survival biomarkers but also as drug targets, as their activity can be modulated by small molecules that resemble their hormonal ligands.
To facilitate the application of this question to additional cancer
types, a list of all NHRs is included in this package as the object
nhrs
.
This object facilitates filtering of NHRs from a matrix of gene expression data, as it contains several commonly-used gene identifiers (e.g. HGNC symbol, HGNC ID, Entrez ID, and Ensembl ID) for the NHRs that might be found in different RNA expression datasets.
The source code for generating nhrs
is available in “data-raw/nhrs.R”.
receptLoss::nhrs
#> # A tibble: 54 × 6
#> hgnc_symbol hgnc_id hgnc_name entrez_gene_id ensembl_gene_id synonyms
#> <chr> <dbl> <chr> <dbl> <chr> <chr>
#> 1 NR0B1 7960 nuclear receptor… 190 ENSG00000169297 AHC|DSS…
#> 2 NR0B2 7961 nuclear receptor… 8431 ENSG00000131910 Small h…
#> 3 THRA 11796 thyroid hormone … 7067 ENSG00000126351 AR7|EAR…
#> 4 THRB 11799 thyroid hormone … 7068 ENSG00000151090 THR1|TH…
#> 5 RARA 9864 retinoic acid re… 5914 ENSG00000131759 RAR alp…
#> 6 RARB 9865 retinoic acid re… 5915 ENSG00000077092 HAP|HBV…
#> 7 RARG 9866 retinoic acid re… 5916 ENSG00000172819 RARC|RA…
#> 8 PPARA 9232 peroxisome proli… 5465 ENSG00000186951 NUC1|nu…
#> 9 PPARD 9235 peroxisome proli… 5467 ENSG00000112033 NUCII|P…
#> 10 PPARG 9236 peroxisome proli… 5468 ENSG00000132170 PPARG1|…
#> # ℹ 44 more rows
receptLoss
identifies genes that subclassify tumors
based on their RNA expression levels relative to normal tissue.
The genes are ranked by their Δμ statistic which reflects
a measure of the cleaness of separation (ie bimodality) between the two
tumor subgroups.
receptLoss
can be expanded for use with a variety of
tumors, genes (e.g. to identify novel candidate tumor suppressors),
biological data (e.g. miRNA, protein expression), and even
non-biological data types where you have numeric data from two groups
(one normal group and one abnormal group) and where subgroup
identification is desired within the abnormal group
receptLoss
is particularly useful when there are a
large number of tumor samples (hundreds) relative to normal samples
(dozens), as is the case in several cancer databases, including the
uterine cancer database from the Cancer Genome
Atlas/Genomic Data Commons. By assuming that the normal expression
data are distributed as a single Gaussian, receptLoss
can
subclassify large numbers of tumors even in the presence of small
numbers of normal tissue samples.
Please contact me at [email protected] with any suggestions, questions, or comments. Thank you!
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 dplyr_1.1.4 receptLoss_1.19.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4
#> [3] generics_0.1.3 tidyr_1.3.1
#> [5] SparseArray_1.6.0 lattice_0.22-6
#> [7] digest_0.6.37 magrittr_2.0.3
#> [9] evaluate_1.0.1 grid_4.4.1
#> [11] fastmap_1.2.0 Matrix_1.7-1
#> [13] jsonlite_1.8.9 GenomeInfoDb_1.43.0
#> [15] httr_1.4.7 purrr_1.0.2
#> [17] fansi_1.0.6 UCSC.utils_1.2.0
#> [19] scales_1.3.0 jquerylib_0.1.4
#> [21] abind_1.4-8 cli_3.6.3
#> [23] crayon_1.5.3 rlang_1.1.4
#> [25] XVector_0.46.0 Biobase_2.67.0
#> [27] munsell_0.5.1 withr_3.0.2
#> [29] DelayedArray_0.33.1 cachem_1.1.0
#> [31] yaml_2.3.10 S4Arrays_1.6.0
#> [33] tools_4.4.1 colorspace_2.1-1
#> [35] GenomeInfoDbData_1.2.13 SummarizedExperiment_1.36.0
#> [37] BiocGenerics_0.53.0 buildtools_1.0.0
#> [39] vctrs_0.6.5 R6_2.5.1
#> [41] matrixStats_1.4.1 stats4_4.4.1
#> [43] lifecycle_1.0.4 zlibbioc_1.52.0
#> [45] S4Vectors_0.44.0 IRanges_2.41.0
#> [47] pkgconfig_2.0.3 pillar_1.9.0
#> [49] bslib_0.8.0 gtable_0.3.6
#> [51] glue_1.8.0 highr_0.11
#> [53] xfun_0.48 tibble_3.2.1
#> [55] GenomicRanges_1.59.0 tidyselect_1.2.1
#> [57] sys_3.4.3 MatrixGenerics_1.19.0
#> [59] knitr_1.48 htmltools_0.5.8.1
#> [61] rmarkdown_2.28 maketools_1.3.1
#> [63] compiler_4.4.1