Differential Allelic Representation (DAR) describes a situation commonly encountered in RNA-seq experiments involving organisms that are not isogenic, and where modern techniques such as CRISPR have been used for generation of individual organisms. DAR compromises the integrity of Differential Expression analysis results as it can bias expression, influencing the classification of genes (or transcripts) as being differentially expressed (DE). The concern does not lie within the statistical measures for detecting differential expression, because the underlying biology does indeed result in differential expression. However, without DAR analysis, the impacts of DAR are confounded with the experimental condition of interest and features can be incorrectly inferred as DE in response to the experimental condition, when they are actually an artefact of DAR. DAR analysis is therefore recommended as a complementary technique alongside Differential Expression analysis where non-isogenic conditions are present.
DAR occurs when the construction of experimental sample groups unexpectedly results in an unequal distribution of polymorphic alleles between the groups. This is often due to random chance when selecting group composition, however, may also be compounded by a study’s experimental design (described further below). When unequally-represented polymorphic alleles are also expression quantitative trait loci (eQTLs), expression differences between experimental groups are observed similarly to that of a functional response to an experimental condition (Figure @ref(fig:ase-example)). Analysis of gene expression in absence of the consideration of DAR is problematic as the source of differential expression will be unclear.
The presence of DAR is influenced by the nature of an experiment’s design. In studies that select experimental groups from a common sample pool, for example “treatment vs control” designs, DAR is predominantly driven by the isogenicity within the initial pool and the stochastic selection of samples. However, in studies involving the selection of groups based on a genetic feature, for example “mutant vs wild-type”, the presence of DAR is often intensified on the chromosome(s) containing the determining feature. This is because the selection criteria also drives selection for alleles linked to the determining feature (Figure @ref(fig:selection-driven-dar)).
DAR analysis results in an easy-to-interpret value between 0 and 1 for each genetic feature of interest, where 0 represents identical allelic representation and 1 represents complete diversity. This metric can be used to identify features prone to false-positive calls in Differential Expression analysis, and can be leveraged with statistical methods to alleviate the impact of such artefacts on RNA-seq data.
BiocManager
is recommended for the installation of
packages required in this vignette. BiocManager
handles the
installation of packages from both the CRAN and Bioconductor
repositories.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
pkgs <- c("tidyverse", "limma", "tadar")
BiocManager::install(pkgs, update = FALSE)
Now we can load the required packages.
The example data used in this vignette is provided within the
tadar
package.
The VCF
file chr1.vcf.bgz
contains
multi-sample genotype calls produced from raw RNA-seq FASTQ
data. The data originates from zebrafish (Danio rerio) brain
transcriptomes, and further information can be found in the associated
manuscript by Gerken et al. (2023). The
VCF
has been modified to remove headers and other data that
are not essential for example purposes, and has been subset to a single
chromosome (Chromosome 1).
The GRanges
object chr1_genes
contains gene
feature information from the Ensembl database (release version 101) for
Danio rerio Chromosome 1 (Cunningham et
al. (2022)).
DAR analysis requires only several lines of code. This section
provides an overview of DAR analysis with tadar
, and exists
as a reference for those familiar with the process. See the next section
if you seek a detailed walkthrough.
First load the required packages.
Then define the input objects.
vcf <- system.file("extdata", "chr1.vcf.bgz", package="tadar")
data("chr1_genes")
data("chr1_tt")
groups <- list(
group1 = paste0("sample", 1:6),
group2 = paste0("sample", 7:13)
)
contrasts <- makeContrasts(
group1v2 = group1 - group2,
levels = names(groups)
)
region_loci <- 5
Now use the base
pipe to assign DAR values to genes
based on their surrounding region.
gene_dar <- readGenotypes(vcf) |>
countAlleles(groups = groups) |>
filterLoci() |>
countsToProps() |>
dar(contrasts = contrasts, region_loci = region_loci) |>
flipRanges(extend_edges = TRUE) |>
assignFeatureDar(features = chr1_genes, dar_val = "region")
Lastly, use the gene-level DAR values to moderate the corresponding p-values from differential gene expression analysis.
DAR analysis is performed by incorporating single nucleotide-level
genotype calls from variant calling software. We recommend the GATK best
practices workflow for RNAseq
short variant discovery (SNPs + Indels) as a reference for the
generation of data required to begin DAR analysis. However, should
individual genomes be also available for the experimental samples, these
can add significant important information. Ultimately, we require a
multi-sample VCF
file with each entry representing the
genomic location of a single nucleotide polymorphism (SNP) that differs
from the reference genome in at least one sample.
The functions contained in this package are intended to be implemented as a sequential workflow, with each function addressing a discrete step in the data processing/analysis procedure. Please follow the steps in order as outlined below.
Genotype data from a VCF
file is parsed into a
GRanges
object using the readGenotypes()
function. This function is essentially a wrapper to
VariantAnnotation::readVcf()
, but only loads the data
required for DAR analysis. By default, phasing information is removed as
it is not required for DAR analysis and complicates downstream
processing. This simply converts genotype calls represented as, for
example 0|1
to 0/1
, and is required if
proceeding with the DAR analysis workflow. This can optionally be turned
off with the unphase
argument if this function is intended
to be used for alternative purposes. The genome
option is
also available to override the genome automatically detected from the
VCF
. We intend to work with multiple GRanges
objects and keeping the genome consistent avoids downstream errors.
Additional arguments will be passed directly to
VariantAnnotation::readVcf()
.
#> GRanges object with 6 ranges and 13 metadata columns:
#> seqnames ranges strand | sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13
#> <Rle> <IRanges> <Rle> | <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character>
#> [1] 1 6698 * | 0/1 0/1 0/1 0/1 0/1 0/1 1/1 0/1 0/1 0/1 1/1 0/1 0/1
#> [2] 1 7214 * | ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 1/1
#> [3] 1 7218 * | ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 1/1
#> [4] 1 7274 * | ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 0/1
#> [5] 1 9815 * | ./. ./. ./. ./. ./. 0/0 ./. ./. ./. ./. 1/1 ./. ./.
#> [6] 1 9842 * | ./. ./. ./. ./. ./. 0/0 ./. ./. ./. ./. 1/1 ./. ./.
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Genotypes are reported as numeric indices. 0
indicates
the reference allele, 1
is the first alternate allele,
2
is the second alternate allele, and 3
is the
third alternate allele. Even though we are working with a diploid
organism, four alleles are theoretically possible as the VCF consists of
multiple samples. Genotypes that could not be determined by the variant
calling software are reported as ./.
. We work directly with
the indices as they are consistent across all samples for a single
variant position.
We aim to calculate a DAR value at each suitable variant locus. This
requires us to firstly summarise the genotype data into counts of the
alleles reported at each variant locus. First we define our sample
grouping structure as a list
, where each element contains a
character vector of samples within a single group.
Now at each locus we can count the number of alleles that exist
within each group with countAlleles()
. This returns a
GRangesList
with each element corresponding to a different
sample group. Columns indicate how many samples had called alleles, how
many were missing and the counts of reference allele, followed by
alternate alleles n_1
, n_2
and
n_3
ensuring multi-allelic sites are retained.
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 16994 ranges and 6 metadata columns:
#> seqnames ranges strand | n_called n_missing n_0 n_1 n_2 n_3
#> <Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#> [1] 1 6698 * | 12 0 6 6 0 0
#> [2] 1 7214 * | 0 12 0 0 0 0
#> [3] 1 7218 * | 0 12 0 0 0 0
#> [4] 1 7274 * | 2 10 2 0 0 0
#> [5] 1 9815 * | 2 10 2 0 0 0
#> ... ... ... ... . ... ... ... ... ... ...
#> [16990] 1 59566458 * | 4 8 4 0 0 0
#> [16991] 1 59567075 * | 6 6 5 1 0 0
#> [16992] 1 59567620 * | 2 10 2 0 0 0
#> [16993] 1 59570107 * | 0 12 0 0 0 0
#> [16994] 1 59570454 * | 2 10 2 0 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
#>
#> $group2
#> GRanges object with 16994 ranges and 6 metadata columns:
#> seqnames ranges strand | n_called n_missing n_0 n_1 n_2 n_3
#> <Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#> [1] 1 6698 * | 14 0 5 9 0 0
#> [2] 1 7214 * | 2 12 0 2 0 0
#> [3] 1 7218 * | 2 12 0 2 0 0
#> [4] 1 7274 * | 4 10 3 1 0 0
#> [5] 1 9815 * | 2 12 0 2 0 0
#> ... ... ... ... . ... ... ... ... ... ...
#> [16990] 1 59566458 * | 4 10 3 1 0 0
#> [16991] 1 59567075 * | 4 10 4 0 0 0
#> [16992] 1 59567620 * | 6 8 6 0 0 0
#> [16993] 1 59570107 * | 0 14 0 0 0 0
#> [16994] 1 59570454 * | 2 12 0 2 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Not all samples were assigned genotypes during the variant calling
procedure. Loci with a large number of assigned genotypes across all
samples possess information that can be used to more accurately
determine differences in allelic representation between experimental
groups. We can remove the less informative data by filtering variant
loci independently within each sample group with the
filterLoci()
function. The filter
argument
controls the criterion for selection of loci that we want to keep, by
providing a logical expression using the metadata column names of our
counts
object (i.e. n_called
,
n_missing
, n_0
, n_1
,
n_2
and n_3
). By default,
filterLoci()
keeps loci that satisfy the criterion: number
of samples with called genotypes > number of samples with missing
genotypes.
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 10701 ranges and 6 metadata columns:
#> seqnames ranges strand | n_called n_missing n_0 n_1 n_2 n_3
#> <Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#> [1] 1 6698 * | 12 0 6 6 0 0
#> [2] 1 10093 * | 12 0 6 6 0 0
#> [3] 1 13098 * | 12 0 12 0 0 0
#> [4] 1 13300 * | 12 0 12 0 0 0
#> [5] 1 13558 * | 12 0 12 0 0 0
#> ... ... ... ... . ... ... ... ... ... ...
#> [10697] 1 59563688 * | 8 4 6 2 0 0
#> [10698] 1 59563996 * | 10 2 10 0 0 0
#> [10699] 1 59564693 * | 10 2 8 2 0 0
#> [10700] 1 59565427 * | 8 4 8 0 0 0
#> [10701] 1 59565445 * | 8 4 8 0 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
#>
#> $group2
#> GRanges object with 12199 ranges and 6 metadata columns:
#> seqnames ranges strand | n_called n_missing n_0 n_1 n_2 n_3
#> <Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#> [1] 1 6698 * | 14 0 5 9 0 0
#> [2] 1 10093 * | 14 0 9 5 0 0
#> [3] 1 13098 * | 14 0 13 1 0 0
#> [4] 1 13300 * | 14 0 14 0 0 0
#> [5] 1 13558 * | 14 0 14 0 0 0
#> ... ... ... ... . ... ... ... ... ... ...
#> [12195] 1 59564693 * | 12 2 11 1 0 0
#> [12196] 1 59565850 * | 10 4 8 2 0 0
#> [12197] 1 59565874 * | 12 2 10 2 0 0
#> [12198] 1 59566219 * | 10 4 10 0 0 0
#> [12199] 1 59566270 * | 8 6 4 4 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Specifying a filter with high stringency will result in greater removal of loci, but more accurate downstream results. For example, we can remove all loci where at least one sample does not possess a genotype call, resulting in complete information of allelic representation in each sample group.
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 7551 ranges and 6 metadata columns:
#> seqnames ranges strand | n_called n_missing n_0 n_1 n_2 n_3
#> <Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#> [1] 1 6698 * | 12 0 6 6 0 0
#> [2] 1 10093 * | 12 0 6 6 0 0
#> [3] 1 13098 * | 12 0 12 0 0 0
#> [4] 1 13300 * | 12 0 12 0 0 0
#> [5] 1 13558 * | 12 0 12 0 0 0
#> ... ... ... ... . ... ... ... ... ... ...
#> [7547] 1 59332656 * | 12 0 11 1 0 0
#> [7548] 1 59335556 * | 12 0 6 6 0 0
#> [7549] 1 59335583 * | 12 0 7 5 0 0
#> [7550] 1 59335775 * | 12 0 12 0 0 0
#> [7551] 1 59335805 * | 12 0 9 3 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
#>
#> $group2
#> GRanges object with 8425 ranges and 6 metadata columns:
#> seqnames ranges strand | n_called n_missing n_0 n_1 n_2 n_3
#> <Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#> [1] 1 6698 * | 14 0 5 9 0 0
#> [2] 1 10093 * | 14 0 9 5 0 0
#> [3] 1 13098 * | 14 0 13 1 0 0
#> [4] 1 13300 * | 14 0 14 0 0 0
#> [5] 1 13558 * | 14 0 14 0 0 0
#> ... ... ... ... . ... ... ... ... ... ...
#> [8421] 1 59335556 * | 14 0 9 5 0 0
#> [8422] 1 59335583 * | 14 0 9 5 0 0
#> [8423] 1 59335775 * | 14 0 13 1 0 0
#> [8424] 1 59406400 * | 14 0 12 2 0 0
#> [8425] 1 59559081 * | 14 0 10 4 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Note that because the filter is applied within each experimental sample group, we now have loci that are present in one group but not the other. This allows greater flexibility for more complex experimental designs that involve multiple comparisons between a greater number of sample groups than provided in this example. Ultimately, comparisons of allelic representation will be performed only on the intersection of loci present in each sample group.
Depending on the stringency of the previously applied filter, it’s still possible that not every sample will have a genotype call. One experimental group may also contain more samples than another, which is the case in this example, where group1 contains 6 samples and group2 contains 7 samples. We account for this by normalising the allele counts as a proportion of total counts at each locus.
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 10701 ranges and 4 metadata columns:
#> seqnames ranges strand | prop_0 prop_1 prop_2 prop_3
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
#> [1] 1 6698 * | 0.5 0.5 0 0
#> [2] 1 10093 * | 0.5 0.5 0 0
#> [3] 1 13098 * | 1.0 0.0 0 0
#> [4] 1 13300 * | 1.0 0.0 0 0
#> [5] 1 13558 * | 1.0 0.0 0 0
#> ... ... ... ... . ... ... ... ...
#> [10697] 1 59563688 * | 0.75 0.25 0 0
#> [10698] 1 59563996 * | 1.00 0.00 0 0
#> [10699] 1 59564693 * | 0.80 0.20 0 0
#> [10700] 1 59565427 * | 1.00 0.00 0 0
#> [10701] 1 59565445 * | 1.00 0.00 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
#>
#> $group2
#> GRanges object with 12199 ranges and 4 metadata columns:
#> seqnames ranges strand | prop_0 prop_1 prop_2 prop_3
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric>
#> [1] 1 6698 * | 0.357143 0.6428571 0 0
#> [2] 1 10093 * | 0.642857 0.3571429 0 0
#> [3] 1 13098 * | 0.928571 0.0714286 0 0
#> [4] 1 13300 * | 1.000000 0.0000000 0 0
#> [5] 1 13558 * | 1.000000 0.0000000 0 0
#> ... ... ... ... . ... ... ... ...
#> [12195] 1 59564693 * | 0.916667 0.0833333 0 0
#> [12196] 1 59565850 * | 0.800000 0.2000000 0 0
#> [12197] 1 59565874 * | 0.833333 0.1666667 0 0
#> [12198] 1 59566219 * | 1.000000 0.0000000 0 0
#> [12199] 1 59566270 * | 0.500000 0.5000000 0 0
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Now that we have normalised values of allelic representation at each
variant locus within our sample groups, we can calculate the DAR metric
between our experimental groups. We require at least two sample groups
to proceed with DAR analysis, however we can set up more comparisons
depending on how many groups are present. We define these comparisons as
a contrast matrix
.
contrasts <- matrix(
data = c(1, -1),
dimnames = list(
Levels = c("group1", "group2"),
Contrasts = c("group1v2")
)
)
Alternatively, this can be simplified with the
makeContrasts()
function from the limma
package (Ritchie et al. (2015)).
DAR is calculated by firstly determining the Euclidean distance
between allelic proportions of the contrasted sample groups. The
Euclidean distance is then converted to the DAR metric by dividing by
the maximum possible distance, $\sqrt{2}$, resulting in an easy-to-interpret
value between 0 and 1, where 0 represents identical allelic
representation and 1 represents complete diversity. This is handled
within the dar()
function by passing our allelic
proportions and intended contrasts as arguments.
Two types of DAR values are reported by the dar()
function as metadata columns of the resulting GRanges
objects:
dar_origin
: The raw DAR values calculated at single
nucleotide positions (the origin) between sample groups. These values
represent DAR estimates at a precise locus.dar_region
: The mean of raw DAR values in a specified
region surrounding the origin. This is optionally returned using either
of the region_fixed
or region_loci
arguments,
which control the strategy and size for establishing regions (more
information below). This option exists because eQTLs don’t necessarily
confer their effects on genes in close proximity. Therefore, DAR
estimates that are representative of regions may be more suitable for
downstream assignment DAR values to genomic features.region_fixed
and region_loci
are optional
arguments that offer users dar_region
values in conjunction
with dar_origin
values. While only one of these two
arguments is required, if both are provided, region_fixed
takes precedence. Understanding the distinctions in how each option
defines regions around an origin is important, as their selection
carries subjective implications downstream.
region_fixed
: Establishes a region with a fixed size
around the origin. The total region size is defined in base pairs,
meaning that the region will extend region_fixed / 2
base
pairs either side of the origin. For example if
region_fixed = 101
is specified, an origin at position 500
will have an associated region spanning positions 450-550.region_loci
: Establishes an elastic region to average
the specified number of loci with dar_origin
values, such
that this number of values are included. Note that whilst the number of
points around the central value may be equal, the genomic distances may
not be. In contrast to region_fixed
,
region_loci
results in DAR estimates that cover a
substantial proportion of each chromosome (i.e from the first locus to
the last), which is useful for downstream assignment of DAR values to a
greater number of genomic features.The remainder of this vignette uses examples produced from specifying
the region_loci
argument.
With a chosen elastic region size of 5 loci, this will smooth the DAR metric at each origin locus with the DAR values of the 2 loci either side.
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 10369 ranges and 2 metadata columns:
#> seqnames ranges strand | dar_origin dar_region
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] 1 6698 * | 0.1428571 NA
#> [2] 1 10093 * | 0.1428571 NA
#> [3] 1 13098 * | 0.0714286 0.0714286
#> [4] 1 13300 * | 0.0000000 0.0547619
#> [5] 1 13558 * | 0.0000000 0.1061905
#> ... ... ... ... . ... ...
#> [10365] 1 59562848 * | 0.000000 0.1300000
#> [10366] 1 59563684 * | 0.150000 0.0800000
#> [10367] 1 59563688 * | 0.150000 0.0833333
#> [10368] 1 59563996 * | 0.000000 NA
#> [10369] 1 59564693 * | 0.116667 NA
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
We now have DAR values for loci shared between the two sample groups
of the defined contrasts. Each element of the resulting
GRangesList
object represents a single contrast and are
named as defined in our contrasts
object.
The final step of DAR analysis involves assigning DAR values to
features of interest. It makes sense to select features that were tested
for differential expression, because this step provides an estimate of
the potential for eQTL impacts on a feature’s expression. This is
performed with the assignFeatureDar()
function, passing the
features of interest as a GRanges
object. In this example
we use the genes contained in chr1_genes
, which was loaded
earlier.
The dar_val
argument controls whether origin or region
DAR values are used when assigned to features. In the code below we
specify dar_val = "origin"
, as the ranges of our
GRanges
objects held in dar
currently
represent those associated with origin DAR values (read further to see
the alternative scenario). The fill_missing
argument
controls what value is assigned to features with no overlaps. This
defaults to NA
for easy filtering of features with
unassigned DAR values. An alternative choice may be to set this to 0, as
there is no evidence for DAR in that region.
For each feature, assignFeatureDar()
takes the mean of
DAR values for any associated ranges that overlap the feature range.
With the following configuration, it means the resulting assigned DAR
values represent the average DAR that exists solely within the
feature.
gene_dar <- assignFeatureDar(
dar = dar,
features = chr1_genes,
dar_val = "origin",
fill_missing = NA
)
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 1456 ranges and 3 metadata columns:
#> seqnames ranges strand | gene_id gene_name dar
#> <Rle> <IRanges> <Rle> | <character> <character> <numeric>
#> [1] 1 6408-12027 - | ENSDARG00000099104 rpl24 0.1428571
#> [2] 1 11822-16373 + | ENSDARG00000102407 cep97 0.1352381
#> [3] 1 18716-23389 + | ENSDARG00000102097 nfkbiz 0.1622024
#> [4] 1 25585-27255 + | ENSDARG00000099319 CU651657.1 NA
#> [5] 1 27690-34330 + | ENSDARG00000099640 eed 0.0794872
#> ... ... ... ... . ... ... ...
#> [1452] 1 59525648-59528626 - | ENSDARG00000105092 FP236157.3 NA
#> [1453] 1 59533317-59539898 + | ENSDARG00000099880 sp6 NA
#> [1454] 1 59542001-59542908 - | ENSDARG00000102872 FP236157.2 NA
#> [1455] 1 59555936-59567685 - | ENSDARG00000098899 zmp:0000001082 0.0810516
#> [1456] 1 59569516-59571785 - | ENSDARG00000101364 wfikkn1 NA
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Whilst eQTLs don’t necessarily exist within a feature itself,
smoothed values within a feature may still be representative of the
surrounding region as inheritied haplotypes. Additionally, we can opt to
use the region DAR values (dar_region
metadata column in
the dar
object) to assign DAR from the regions surrounding
a feature to increase the amount of smoothing. The size of this region
is controlled with the region_loci
argument in the
dar()
function, which we previously set as 5.
To assign DAR values based on regions, we must firstly utilise the
flipRanges()
function. We can also specify the
extend_edges
option to extend the outermost ranges of each
chromosome to encompass the entire chromosome. This is useful for
ensuring the assignment of DAR values to all features. However, caution
should be taken with features that exist toward the edges of a
chromosome as any assigned DAR values may be less accurate.
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 10365 ranges and 3 metadata columns:
#> seqnames ranges strand | dar_origin dar_region origin_ranges
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <IRanges>
#> 1 1 1-13558 * | 0.0714286 0.0714286 13098
#> 1 1 10093-14083 * | 0.0000000 0.0547619 13300
#> 1 1 13098-14740 * | 0.0000000 0.1061905 13558
#> 1 1 13300-15050 * | 0.0595238 0.1204762 14083
#> 1 1 13558-15772 * | 0.4000000 0.1490476 14740
#> . ... ... ... . ... ... ...
#> 1 1 59559120-59562848 * | 0.25 0.1000000 59559963
#> 1 1 59559331-59563684 * | 0.10 0.1200000 59559970
#> 1 1 59559963-59563688 * | 0.00 0.1300000 59562848
#> 1 1 59559970-59563996 * | 0.15 0.0800000 59563684
#> 1 1 59562848-59578282 * | 0.15 0.0833333 59563688
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
flipRanges()
can also be used to revert back to ranges
that represent the origins.
Now we can assign DAR values to features based on their surrounding
region, by supplying our new ranges and selecting
dar_val = region
. A warning will be produced if the ranges
don’t match the selected dar_val
, as this is likely
unintended by the user. However, use cases may exist and the warning can
be ignored if this is intended.
gene_dar_regions <- assignFeatureDar(
dar = dar_regions,
features = chr1_genes,
dar_val = "region"
)
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 1456 ranges and 3 metadata columns:
#> seqnames ranges strand | gene_id gene_name dar
#> <Rle> <IRanges> <Rle> | <character> <character> <numeric>
#> [1] 1 6408-12027 - | ENSDARG00000099104 rpl24 0.0630952
#> [2] 1 11822-16373 + | ENSDARG00000102407 cep97 0.1467262
#> [3] 1 18716-23389 + | ENSDARG00000102097 nfkbiz 0.1711012
#> [4] 1 25585-27255 + | ENSDARG00000099319 CU651657.1 0.1585714
#> [5] 1 27690-34330 + | ENSDARG00000099640 eed 0.0833333
#> ... ... ... ... . ... ... ...
#> [1452] 1 59525648-59528626 - | ENSDARG00000105092 FP236157.3 0.0966071
#> [1453] 1 59533317-59539898 + | ENSDARG00000099880 sp6 0.0966071
#> [1454] 1 59542001-59542908 - | ENSDARG00000102872 FP236157.2 0.0966071
#> [1455] 1 59555936-59567685 - | ENSDARG00000098899 zmp:0000001082 0.0945833
#> [1456] 1 59569516-59571785 - | ENSDARG00000101364 wfikkn1 0.0833333
#> -------
#> seqinfo: 1 sequence from GRCz11 genome
Note that because elastic regions were defined, and
extend_edges
was set to TRUE
, all genes in
chr1_genes
now have an assigned DAR value.
Having assigned DAR values at the gene-level, these can be utilised to moderate p-values from differential gene expression (DGE) analysis.
Typically, under the null hypothesis, p-values are drawn from a uniform distribution. However, in the context of DAR, the p-value distribution is observed as right-skewed. This can be modelled using a beta distribution, where the α is set to a value < 1.
This is handled with the modP()
function, where the
α parameter for the beta
distribution is modelled as a linear relationship with DAR. With this
technique we can directly moderate the p-values obtained from
differential expression testing in the presence of DAR. By default we
implement a conservative approach, resulting in a reduction of false
positives, without the introduction of any further false positives
(i.e. significance is reduced, never increased). For more information
please refer to the tadar
paper.
chr1_tt <- merge(chr1_tt, mcols(gene_dar$group1v2), sort = FALSE)
chr1_tt$darP <- modP(chr1_tt$PValue, chr1_tt$dar)
#> DataFrame with 6 rows and 7 columns
#> gene_id gene_name logFC logCPM PValue dar darP
#> <character> <character> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 ENSDARG00000009026 ank2a -1.611099 7.31684 1.09498e-11 0.0762946 3.50452e-10
#> 2 ENSDARG00000005739 gpm6ba 0.699597 7.90207 3.86672e-09 0.4584656 3.38679e-02
#> 3 ENSDARG00000058287 gpalpp1 -1.037994 4.29701 9.32906e-08 0.6765873 1.98145e-01
#> 4 ENSDARG00000056381 cfap97 -0.773803 4.24842 4.20056e-07 0.7285714 2.30320e-01
#> 5 ENSDARG00000074989 sparcl1 0.766643 3.41472 6.33950e-07 0.2845938 9.48570e-04
#> 6 ENSDARG00000078733 cnnm2b 0.705368 4.06124 7.44741e-06 0.4857843 2.26978e-01
The plotChrDar()
function produces a localised
visualisation of the trend in DAR along a chromosomal axis with the
option to overlay the positions of features of interest. This allows us
to compare DAR between regions and/or chromosomes, and visually assess
if features may be prone eQTL artefacts. plotChrDar()
is a
convenience function that returns a GViz
plot consisting of
an optional number of tracks. To produce a minimal plot, this function
requires the dar
argument, expecting a GRanges
object with DAR values in the metadata columns, and the
dar_val
argument as a character string to select the
appropriate DAR values from either the dar_origin
or
dar_region
metadata columns. These arguments form the
DataTrack
displaying the trend in DAR. Features of interest
(e.g. a mutation locus) can be passed to the foi
argument
as a GRanges
object and will be plotted along the
GenomeAxisTrack
. Additional features (e.g. DE genes) can
similarly be passed to the features
argument to be plotted
as a separate AnnotationTrack
. The *_anno
and
*_highlight
arguments are used to select the metadata
column containing feature labels, and to choose if their positions
should be highlighted over the DataTrack
displaying the
trend in DAR respectively. We can also add a title with the
title
argument. GViz
requires that any
GRanges
objects consist of ranges on a single chromosome,
however these can be internally subset using the chr
argument (which will also control the title of the
GenomeAxisTrack
).
set.seed(230822)
foi <- sample(chr1_genes, 1)
features <- sample(chr1_genes, 20)
plotChrDar(
dar = dar$group1v2, dar_val = "region", chr = "1",
foi = foi, foi_anno = "gene_name", foi_highlight = TRUE,
features = features, features_anno = "gene_name", features_highlight = TRUE,
title = "Example plot of DAR along Chromosome 1"
)
It may be of use to quickly assess whether a particular chromosome is
affected by high DAR more so than other chromosomes. This is
particularly relevant when a study’s experimental design causes
increased DAR on a particular chromosome. The plotDarECDF()
function facilitates this purpose by plotting the Empirical Cumulative
Distribution Function (ECDF) of DAR for each chromosome. This function
returns a ggplot2
object, allowing us to add our own
styling. For this example we use simulated data that results in a
visualisation commonly observed when experimental sample groups have
been constructed based on the presence/absence of a genetic feature of
interest (e.g. a mutant locus).
set.seed(230704)
simulate_dar <- function(n, mean) {
vapply(
rnorm(n = n, mean = mean),
function(x) exp(x) / (1 + exp(x)),
numeric(1)
)
}
gr <- GRanges(
paste0(rep(seq(1,25), each = 100), ":", seq(1,100)),
dar_origin = c(simulate_dar(2400, -2), simulate_dar(100, 0.5))
)
plotDarECDF(dar = gr, dar_val = "origin") +
theme_bw()
We can also choose to highlight the chromosome of interest with the
highlight
argument.
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] tadar_1.5.0 GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [4] IRanges_2.41.0 S4Vectors_0.44.0 BiocGenerics_0.53.1
#> [7] generics_0.1.3 limma_3.63.0 lubridate_1.9.3
#> [10] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
#> [13] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
#> [16] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
#> [19] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3
#> [3] rstudioapi_0.17.1 jsonlite_1.8.9
#> [5] magrittr_2.0.3 GenomicFeatures_1.59.0
#> [7] farver_2.1.2 rmarkdown_2.28
#> [9] BiocIO_1.17.0 zlibbioc_1.52.0
#> [11] vctrs_0.6.5 memoise_2.0.1
#> [13] Rsamtools_2.22.0 RCurl_1.98-1.16
#> [15] base64enc_0.1-3 htmltools_0.5.8.1
#> [17] S4Arrays_1.6.0 progress_1.2.3
#> [19] curl_5.2.3 SparseArray_1.6.0
#> [21] Formula_1.2-5 sass_0.4.9
#> [23] bslib_0.8.0 htmlwidgets_1.6.4
#> [25] Gviz_1.51.0 httr2_1.0.5
#> [27] cachem_1.1.0 buildtools_1.0.0
#> [29] GenomicAlignments_1.43.0 lifecycle_1.0.4
#> [31] pkgconfig_2.0.3 Matrix_1.7-1
#> [33] R6_2.5.1 fastmap_1.2.0
#> [35] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.0
#> [37] digest_0.6.37 colorspace_2.1-1
#> [39] AnnotationDbi_1.69.0 Hmisc_5.2-0
#> [41] RSQLite_2.3.7 labeling_0.4.3
#> [43] filelock_1.0.3 fansi_1.0.6
#> [45] timechange_0.3.0 httr_1.4.7
#> [47] abind_1.4-8 compiler_4.4.1
#> [49] bit64_4.5.2 withr_3.0.2
#> [51] htmlTable_2.4.3 backports_1.5.0
#> [53] BiocParallel_1.41.0 DBI_1.2.3
#> [55] highr_0.11 biomaRt_2.63.0
#> [57] rappdirs_0.3.3 DelayedArray_0.33.1
#> [59] rjson_0.2.23 tools_4.4.1
#> [61] foreign_0.8-87 nnet_7.3-19
#> [63] glue_1.8.0 restfulr_0.0.15
#> [65] grid_4.4.1 checkmate_2.3.2
#> [67] cluster_2.1.6 gtable_0.3.6
#> [69] BSgenome_1.75.0 tzdb_0.4.0
#> [71] ensembldb_2.31.0 data.table_1.16.2
#> [73] hms_1.1.3 xml2_1.3.6
#> [75] utf8_1.2.4 XVector_0.46.0
#> [77] pillar_1.9.0 BiocFileCache_2.15.0
#> [79] lattice_0.22-6 deldir_2.0-4
#> [81] rtracklayer_1.66.0 bit_4.5.0
#> [83] biovizBase_1.55.0 tidyselect_1.2.1
#> [85] maketools_1.3.1 Biostrings_2.75.0
#> [87] knitr_1.48 gridExtra_2.3
#> [89] ProtGenerics_1.39.0 SummarizedExperiment_1.36.0
#> [91] xfun_0.48 Biobase_2.67.0
#> [93] statmod_1.5.0 matrixStats_1.4.1
#> [95] stringi_1.8.4 UCSC.utils_1.2.0
#> [97] lazyeval_0.2.2 yaml_2.3.10
#> [99] evaluate_1.0.1 codetools_0.2-20
#> [101] interp_1.1-6 BiocManager_1.30.25
#> [103] cli_3.6.3 rpart_4.1.23
#> [105] munsell_0.5.1 jquerylib_0.1.4
#> [107] Rcpp_1.0.13 dichromat_2.0-0.1
#> [109] dbplyr_2.5.0 png_0.1-8
#> [111] XML_3.99-0.17 parallel_4.4.1
#> [113] blob_1.2.4 prettyunits_1.2.0
#> [115] jpeg_0.1-10 latticeExtra_0.6-30
#> [117] AnnotationFilter_1.31.0 bitops_1.0-9
#> [119] VariantAnnotation_1.52.0 scales_1.3.0
#> [121] crayon_1.5.3 rlang_1.1.4
#> [123] KEGGREST_1.47.0