Transposable elements (TEs) are autonomous mobile genetic elements. They are DNA sequences that have, or once had, the ability to mobilize within the genome either directly or through an RNA intermediate (Payer and Burns 2019). TEs can be categorized into two classes based on the intermediate substrate propagating insertions (RNA or DNA). Class I TEs, also called retrotransposons, first transcribe an RNA copy that is then reverse transcribed to cDNA before inserting in the genome. In turn, these can be divided into long terminal repeat (LTR) retrotransposons, which refer to endogenous retroviruses (ERVs), and non-LTR retrotransposons, which include long interspersed element class 1 (LINE-1 or L1) and short interspersed elements (SINEs). Class II TEs, also known as DNA transposons, directly excise themselves from one location before reinsertion. TEs are further split into families and subfamilies depending on various structural features (Goerner-Potvin and Bourque 2018; Guffanti et al. 2018).
Most TEs have lost the capacity for generating new insertions over their evolutionary history and are now fixed in the human population. Their insertions have resulted in a complex distribution of interspersed repeats comprising almost half (50%) of the human genome (Payer and Burns 2019).
TE expression has been observed in association with physiological processes in a wide range of species, including humans where it has been described to be important in early embryonic pluripotency and development. Moreover, aberrant TE expression has been associated with diseases such as cancer, neurodegenerative disorders, and infertility (Payer and Burns 2019).
The study of TE expression faces one main challenge: given their repetitive nature, the majority of TE-derived reads map to multiple regions of the genome and these multi-mapping reads are consequently discarded in standard RNA-seq data processing pipelines. For this reason, specific software packages for the quantification of TE expression have been developed (Goerner-Potvin and Bourque 2018), such as TEtranscripts (Jin et al. 2015), ERVmap (Tokuyama et al. 2018) and Telescope (Bendall et al. 2019). The main differences between these three methods are the following:
TEtranscripts (Jin et al. 2015) reassigns multi-mapping reads to TEs proportionally to their relative abundance, which is estimated using an expectation-maximization (EM) algorithm.
ERVmap (Tokuyama et al. 2018) is based on selective filtering of multi-mapping reads. It applies filters that consist in discarding reads when the ratio of sum of hard and soft clipping to the length of the read (base pair) is greater than or equal to 0.02, the ratio of the edit distance to the sequence read length (base pair) is greater or equal to 0.02 and/or the difference between the alignment score from BWA (field AS) and the suboptimal alignment score from BWA (field XS) is less than 5.
Telescope (Bendall et al. 2019) reassigns multi-mapping reads to TEs using their relative abundance, which like in TEtranscripts, is also estimated using an EM algorithm. The main differences with respect to TEtranscripts are: (1) Telescope works with an additional parameter for each TE that estimates the proportion of multi-mapping reads that need to be reassigned to that TE; (2) that reassignment parameter is optimized during the EM algorithm jointly with the TE relative abundances, using a Bayesian maximum a posteriori (MAP) estimate that allows one to use prior values on these two parameters; and (3) using the final estimates on these two parameters, multi-mapping reads can be flexibly reassigned to TEs using different strategies, where the default one is to assign a multi-mapping read to the TE with largest estimated abundance and discard those multi-mapping reads with ties on those largest abundances.
Because these tools were only available outside R and Bioconductor,
the atena
package provides a complete re-implementation in
R of these three methods to facilitate the integration of TE expression
quantification into Bioconductor workflows for the analysis of RNA-seq
data.
Another challenge in TE expression quantification is the lack of
complete TE annotations due to the difficulty to correctly place TEs in
genome assemblies (Goerner-Potvin and Bourque
2018). One of the main sources of TE annotations are RepeatMasker
annotations, available for instance at the RepeatMasker track of the
UCSC Genome Browser. atena
can fetch RepeatMasker
annotations with the function annotaTEs()
and flexibly
parse them by using a parsing function provided through the parameter
parsefun
. Examples of parsefun
included in
atena
are:
rmskidentity()
: returns RepeatMasker annotations
without any modification.rmskbasicparser()
: filters out non-TE repeats and
elements without strand information from RepeatMasker annotations. Then
assigns a unique id to each elements based on their repeat name.OneCodeToFindThemAll()
: implementation of the “One Code
To Find Them All” algorithm by Bailly-Bechet,
Haudry, and Lerat (2014), for parsing RepeatMasker output
files.rmskatenaparser()
: attempts to reconstruct fragmented
TEs by assembling together fragments from the same TE that are close
enough. For LTR class TEs, tries to reconstruct full-length and partial
TEs following the LTR - internal region - LTR structure.Both, the rmskatenaparser()
and
OneCodeToFindThemAll()
parser functions attempt to address
the annotation fragmentation present in the output files of the
RepeatMasker software (i.e. presence of multiple hits, such as
homology-based matches, corresponding to a unique copy of an element).
This is highly frequent for TEs of the LTR class, where the consensus
sequences are split separately into the LTR and internal regions,
causing RepeatMasker to also report these two regions of the TE as two
separate elements. These two functions try to identify these and other
cases of fragmented annotations and assemble them together into single
elements. To do so, the assembled elements must satisfy certain
criteria. These two parser functions differ in those criteria, as well
as in the approach for finding equivalences between LTR and internal
regions to reconstruct LTR retrotransposons. The
rmskatenaparser()
function is also much faster than
OneCodeToFindThemAll()
.
As an example, let’s retrieve TE annotations for Drosophila
melanogaster dm6 genome version. By setting
rmskidentity()
as argument to the parsefun
parameter, RepeatMasker annotations are retrieved intact as a
GRanges
object.
library(atena)
library(BiocParallel)
rmskann <- annotaTEs(genome="dm6", parsefun=rmskidentity)
rmskann
GRanges object with 137555 ranges and 11 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
[1] chr2L 2-154 + | 778 167 7
[2] chr2L 313-408 + | 296 174 207
[3] chr2L 457-612 + | 787 170 7
[4] chr2L 771-866 + | 296 174 207
[5] chr2L 915-1070 + | 787 170 7
... ... ... ... . ... ... ...
[137551] chrUn_DS486004v1 99-466 - | 3224 14 0
[137552] chrUn_DS486005v1 1-1001 + | 930 48 0
[137553] chrUn_DS486008v1 1-488 + | 4554 0 0
[137554] chrUn_DS486008v1 489-717 - | 2107 9 0
[137555] chrUn_DS486008v1 717-1001 - | 2651 3 0
milliIns genoLeft repName repClass repFamily
<numeric> <integer> <character> <character> <character>
[1] 20 -23513558 HETRP_DM Satellite Satellite
[2] 42 -23513304 HETRP_DM Satellite Satellite
[3] 19 -23513100 HETRP_DM Satellite Satellite
[4] 42 -23512846 HETRP_DM Satellite Satellite
[5] 19 -23512642 HETRP_DM Satellite Satellite
... ... ... ... ... ...
[137551] 3 -535 ROVER-LTR_DM LTR Gypsy
[137552] 1 0 (TATACATA)n Simple_repeat Simple_repeat
[137553] 0 -513 NOMAD_LTR LTR Gypsy
[137554] 0 -284 ACCORD_LTR LTR Gypsy
[137555] 0 0 DMRT1A LINE R1
repStart repEnd repLeft
<integer> <integer> <integer>
[1] 1519 1669 -203
[2] 1519 1634 -238
[3] 1516 1669 -203
[4] 1519 1634 -238
[5] 1516 1669 -203
... ... ... ...
[137551] 0 367 1
[137552] 1 1000 0
[137553] 31 518 0
[137554] -123 435 207
[137555] 0 5183 4899
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
We can see that we obtained annotations for 137555 elements. Now,
let’s fetch the same RepeatMasker annotations, but process them using
the OneCodeToFindThemAll
parser function (Bailly-Bechet, Haudry, and Lerat 2014). We set
the parameter strict=FALSE
to avoid applying a filter of
minimum 80% identity with the consensus sequence and minimum 80 bp
length. The insert
parameter is set to 500, meaning that
two elements with the same name are merged if they are closer than 500
bp in the annotations. The BPPARAM
parameter allows one to
run calculations in parallel using the functionality of the BiocParallel
Bioconductor package. In this particular example, we are setting the
BPPARAM
parameter to
SerialParam(progress=FALSE)
to disable parallel
calculations and progress reporting, but a common setting if we want to
run calculations in parallel would be
BPPARAM=Multicore(workers=ncores, progress=TRUE)
, which
would use ncores
parallel threads of execution and report
progress on the calculations.
teann <- annotaTEs(genome="dm6", parsefun=OneCodeToFindThemAll, strict=FALSE,
insert=500, BPPARAM=SerialParam(progress=FALSE))
length(teann)
[1] 22538
teann[1]
GRangesList object of length 1:
$IDEFIX_LTR.1
GRanges object with 1 range and 11 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel milliIns
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <numeric>
[1] chr2L 9726-9859 + | 285 235 64 15
genoLeft repName repClass repFamily repStart repEnd
<integer> <character> <character> <character> <integer> <integer>
[1] -23503853 IDEFIX_LTR LTR Gypsy 425 565
repLeft
<integer>
[1] 29
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
As expected, we get a lower number of elements in the annotations, because repeats that are not TEs have been removed. Furthermore, some fragmented regions of TEs have been assembled together.
This time, the resulting teann
object is of class
GRangesList
. Each element of the list represents an
assembled TE containing a GRanges
object of length one, if
the TE could not be not assembled with another element, or of length
greater than one, if two or more fragments were assembled together into
a single TE.
We can get more information of the parsed annotations by accessing
the metadata columns with mcols()
:
mcols(teann)
DataFrame with 22538 rows and 3 columns
Status RelLength Class
<character> <numeric> <character>
IDEFIX_LTR.1 LTR 0.225589 LTR
DNAREP1_DM.2 noLTR 0.419192 DNA
LINEJ1_DM.3 noLTR 0.997211 LINE
DNAREP1_DM.4 noLTR 0.861953 DNA
BS2.5 noLTR 0.126880 LINE
... ... ... ...
QUASIMODO_I-int.22534 noLTR 0.0882838 LTR
ROVER-I_DM.22535 partialLTR_down 0.0636786 LTR
NOMAD_LTR.22536 noLTR 0.9420849 LTR
ACCORD_LTR.22537 noLTR 0.4103943 LTR
DMRT1A.22538 noLTR 0.0549875 LINE
There is information about the reconstruction status of the TE (Status column), the relative length of the reconstructed TE (RelLength) and the repeat class of the TE (Class). The relative length is calculated by adding the length (in base pairs) of all fragments from the same assembled TE, and dividing that sum by the length (in base pairs) of the consensus sequence. For full-length and partially reconstructed LTR TEs, the consensus sequence length used is the one resulting from adding twice the consensus sequence length of the long terminal repeat (LTR) and the one from the corresponding internal region. For solo-LTRs, the consensus sequence length of the long terminal repeat is used.
We can get an insight into the composition of the assembled annotations using the information from the status column. Let’s look at the absolute frequencies of the status and class of TEs in the annotations.
Here, full-lengthLTR are reconstructed LTR retrotransposons following the LTR - internal region (int) - LTR structure. Partially reconstructed LTR TEs are partialLTR_down (internal region followed by a downstream LTR) and partialLTR_up (LTR upstream of an internal region). int and LTR correspond to internal and solo-LTR regions, respectively. Finally, the noLTR refers to TEs of other classes (not LTR), as well as TEs of class LTR which could not be identified as either internal or long terminal repeat regions based on their name.
Moreover, the atena
package provides getter functions to
retrieve TEs of a specific class, using a specific relative length
threshold. Those TEs with higher relative lengths are more likely to
have intact open reading frames, making them more interesting for
expression quantification and functional analyses. For example, to get
LINEs with a minimum of 0.9 relative length, we can use the
getLINEs()
function. We use the TE annotations in
teann
we obtained before and set the relLength
to 0.9.
rmskLINE <- getLINEs(teann, relLength=0.9)
length(rmskLINE)
[1] 355
rmskLINE[1]
GRangesList object of length 1:
$LINEJ1_DM.3
GRanges object with 1 range and 11 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel milliIns
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <numeric>
[1] chr2L 47514-52519 + | 43674 5 0 0
genoLeft repName repClass repFamily repStart repEnd
<integer> <character> <character> <character> <integer> <integer>
[1] -23461193 LINEJ1_DM LINE Jockey 2 5007
repLeft
<integer>
[1] 13
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
To get LTR retrotransposons, we can use the function
getLTRs()
. This function also allows to get one or more
specific types of reconstructed TEs. To get full-length, partial LTRs
and other fragments that could not be reconstructed, we can:
rmskLTR <- getLTRs(teann, relLength=0.8, fullLength=TRUE, partial=TRUE,
otherLTR=TRUE)
length(rmskLTR)
[1] 1408
rmskLTR[1]
GRangesList object of length 1:
$`ROO_I-int.11`
GRanges object with 4 ranges and 11 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel milliIns
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <numeric>
[1] chr2L 976935-977362 + | 3968 5 0 0
[2] chr2L 977363-983449 + | 54257 1 13 1
[3] chr2L 983448-984084 + | 5412 5 19 0
[4] chr2L 984085-984512 + | 3968 5 0 0
genoLeft repName repClass repFamily repStart repEnd
<integer> <character> <character> <character> <integer> <integer>
[1] -22536350 ROO_LTR LTR Pao 1 428
[2] -22530263 ROO_I-int LTR Pao 1 6166
[3] -22529628 ROO_I-int LTR Pao 7608 8256
[4] -22529200 ROO_LTR LTR Pao 1 428
repLeft
<integer>
[1] 0
[2] 2090
[3] 0
[4] 0
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
To obtain DNA transposons and SINEs, one can use the
getDNAtransposons()
and getSINEs()
functions,
respectively.
Quantification of TE expression with atena
consists in
the following two steps:
Building of a parameter object for one of the available quantification methods.
Calling the TE expression quantification method
qtex()
using the previously built parameter
object.
The dataset that will be used to illustrate how to quantify TE
expression with atena
is a published RNA-seq dataset of
Drosophila melanogaster available at the National Center for
Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) under
accession GSE47006).
The two selected samples are: a piwi knockdown and a piwi control
(GSM1142845 and GSM1142844). These files have been subsampled. The
piwi-associated silencing complex (piRISC) silences TEs in the
Drosophila ovary, hence the knockdown of piwi causes the
de-repression of TEs. Here we show how the expression of full-length LTR
retrotransposons present in rmskLTR
can be easily
quantified using atena
.
A parameter object is build calling a specific function for the quantification method we want to use. Independenty of each method, all parameter object constructor functions require that the first two arguments specify the BAM files and the TE annotation, respectively.
To use the ERVmap method in atena
we should first build
an object of the class ERVmapParam
using the function
ERVmapParam()
. The singleEnd
parameter is set
to TRUE
since the example BAM files are single-end. The
ignoreStrand
parameter works analogously to the same
parameter in the function summarizeOverlaps()
from package
GenomicAlignments
and should be set to TRUE
whenever the RNA library
preparation protocol was stranded.
One of the filters applied by the ERVmap method compares the
alignment score of a given primary alignment, stored in the
AS
tag of a SAM record, to the largest alignment score
among every other secondary alignment, known as the suboptimal alignment
score. The original ERVmap software assumes that input BAM files are
generated using the Burrows-Wheeler Aligner (BWA) software (Li and Durbin 2009), which stores suboptimal
alignment scores in the XS
tag. Although AS
is
an optional tag, most short-read aligners provide this tag with
alignment scores in BAM files. However, the suboptimal alignment score,
stored in the XS
tag by BWA, is either stored in a
different tag or not stored at all by other short-read aligner software,
such as STAR (Dobin et al. 2013).
To enable using ERVmap on BAM files produced by short-read aligner
software other than BWA, atena
allows the user to set the
argument suboptimalAlignmentTag
to one of the following
three possible values:
The name of a tag different to XS
that stores the
suboptimal alignment score.
The value “none”, which will trigger the calculation of the
suboptimal alignment score by searching for the largest value stored in
the AS
tag among all available secondary
alignments.
The value “auto” (default), by which atena
will
first extract the name of the short-read aligner software from the BAM
file and if that software is BWA, then suboptimal alignment scores will
be obtained from the XS
tag. Otherwise, it will trigger the
calculation previously explained for
suboptimalAlignemntTag="none"
.
Finally, this filter is applied by comparing the difference between
alignment and suboptimal alignment scores to a cutoff value, which by
default is 5 but can be modified using the parameter
suboptimalAlignmentCutoff
. The default value 5 is the one
employed in the original ERVmap software that assumes the BAM file was
generated with BWA and for which lower values are interpreted as
“equivalent to second best match has one or more mismatches than the
best match” (Tokuyama et al. 2018, pg.
12571). From a different perspective, in BWA the mismatch penalty
has a value of 4 and therefore, a suboptimalAlignmentCutoff
value of 5 only retains those reads where the suboptimal alignment has
at least 1 mismatch more than the best match. Therefore, the
suboptimalAlignmentCutoff
value is specific to the
short-read mapper software and we recommend to set this value according
to the mismatch penalty of that software. Another option is to set
suboptimalAlignmentCutoff=NA
, which prevents the filtering
of reads based on this criteria, as set in the following example.
bamfiles <- list.files(system.file("extdata", package="atena"),
pattern="*.bam", full.names=TRUE)
empar <- ERVmapParam(bamfiles,
teFeatures=rmskLTR,
singleEnd=TRUE,
ignoreStrand=TRUE,
suboptimalAlignmentCutoff=NA)
ℹ Locating BAM files
ℹ Processing features
✔ Parameter object successfully created
empar
ERVmapParam object
# BAM files (2): control_KD.bam, piwi_KD.bam
# features (1408): ACCORD2_I-int.18752, ..., ZAM_LTR.21390
# single-end, unstranded
In the case of paired-end BAM files (singleEnd=FALSE
),
two additional arguments can be specified, strandMode
and
fragments
:
strandMode
defines the behavior of the strand getter
when internally reading the BAM files with the
GAlignmentPairs()
function. See the help page of
strandMode
in the GenomicAlignments
package for further details.
fragments
controls how read filtering and counting
criteria are applied to the read mates in a paired-end read. To use the
original ERVmap algorithm (Tokuyama et al.
2018) one should set fragments=TRUE
(default when
singleEnd=FALSE
), which filters and counts each mate of a
paired-end read independently (i.e., two read mates overlapping the same
feature count twice on that feature, treating paired-end reads as if
they were single-end). On the other hand, when
fragments=FALSE
, if the two read mates pass the filtering
criteria and overlap the same feature, they count once on that feature.
If either read mate fails to pass the filtering criteria, then both read
mates are discarded.
An additional functionality with respect to the original ERVmap
software is the integration of gene and TE expression quantification.
The original ERVmap software doesn’t quantify TE and gene expression
coordinately and this can potentially lead to counting twice reads that
simultaneously overlap a gene and a TE. In atena
, gene
expression is quantified based on the approach used in the TEtranscripts
software (Jin et al. 2015): unique reads
are preferably assigned to genes, whereas multi-mapping reads are
preferably assigned to TEs.
In case that a unique read does not overlap a gene or a multi-mapping
read does not overlap a TE, atena
searches for overlaps
with TEs or genes, respectively. Given the different treatment of unique
and multi-mapping reads, atena
requires the information
regarding the unique or multi-mapping status of a
read. This information is obtained from the presence of secondary
alignments in the BAM file or, alternatively, from the NH
tag in the BAM file (number of reported alignments that contain the
query in the current SAM record). Therefore, either secondary alignments
or the NH
tag need to be present for gene expression
quantification.
The original ERVmap approach does not discard any read overlapping
gene annotations. However, this can be changed using the parameter
geneCountMode
, which by default
geneCountMode="all"
and follows the behavior in the
original ERVmap method. On the contrary, by setting
geneCountMode="ervmap"
, atena
also applies the
filtering criteria employed to quantify TE expression to the reads
overlapping gene annotations.
Finally, atena
also allows one to aggregate TE
expression quantifications. By default, the names of the input
GRanges
or GRangesList
object given in the
teFeatures
parameter are used to aggregate quantifications.
However, the aggregateby
parameter can be used to specify
other column names in the feature annotations to be used to aggregate TE
counts, for example at the sub-family level.
To use the Telescope method for TE expression quantification, the
TelescopeParam()
function is used to build a parameter
object of the class TelescopeParam
.
As in the case of ERVmapParam()
, the
aggregateby
argument, which should be a character vector of
column names in the annotation, determines the columns to be used to
aggregate TE expression quantifications. This way, atena
provides not only quantifications at the subfamily level, but also
allows to quantify TEs at the desired level (family, class, etc.),
including locus based quantifications. For such a use case, the object
with the TE annotations should include a column with unique identifiers
for each TE locus and the aggregateby
argument should
specify the name of that column. When aggregateby
is not
specified, the names()
of the object containing TE
annotations are used to aggregate quantifications.
Here, TE quantifications will be aggregated according to the
names()
of the rmskLTR
object.
bamfiles <- list.files(system.file("extdata", package="atena"),
pattern="*.bam", full.names=TRUE)
tspar <- TelescopeParam(bfl=bamfiles,
teFeatures=rmskLTR,
singleEnd=TRUE,
ignoreStrand=TRUE)
ℹ Locating BAM files
ℹ Processing features
✔ Parameter object successfully created
tspar
TelescopeParam object
# BAM files (2): control_KD.bam, piwi_KD.bam
# features (CompressedGRangesList length 1408): ACCORD2_I-int.18752, ..., ZAM_LTR.21390
# single-end; unstranded
In case of paired-end data (singleEnd=FALSE
), the
argument usage is similar to that of ERVmapParam()
. In
relation to the BAM file, Telescope follows the same approach as the
ERVmap method: when fragments=FALSE
, only mated read
pairs from opposite strands are considered, while when
fragments=TRUE
, same-strand pairs, singletons, reads with
unmapped pairs and other fragments are also considered by the algorithm.
However, there is one important difference with respect to the counting
approach followed by ERVmap: when fragments=TRUE
mated
read pairs mapping to the same element are counted once, whereas in
the ERVmap method they are counted twice.
As in the ERVmap method from atena
, the gene expression
quantification method in Telescope is based on the approach of the
TEtranscripts software (Jin et al. 2015).
This way, atena
provides the possibility to integrate TE
expression quantification by Telescope with gene expression
quantification. As in the case of the ERVmap method implemented in
atena
, either secondary alignments or the NH
tag are required for gene expression quantification.
Finally, the third method available is TEtranscripts. First, the
TEtranscriptsParam()
function is called to build a
parameter object of the class TEtranscriptsParam
. The usage
of the aggregateby
argument is the same as in
TelescopeParam()
and ERVmapParam()
. Locus
based quantifications in the TEtranscripts method from
atena
is possible because the TEtranscripts algorithm
actually computes TE quantifications at the locus level and then sums up
all instances of each TE subfamily to provide expression at the
subfamily level. By avoiding this last step, atena
can
provide TE expression quantification at the locus level using the
TEtranscripts method. For such a use case, the object with the TE
annotations should include a column with unique identifiers for each TE
and the aggregateby
argument should specify the name of
that column.
In this example, the aggregateby
argument will be set to
aggregateby="repName"
in order to aggregate quantifications
at the repeat name level. Moreover, gene expression will also be
quantified. To do so, gene annotations are loaded from a TxDb
object.
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene
gannot <- exonsBy(txdb, by="gene")
length(gannot)
[1] 17807
bamfiles <- list.files(system.file("extdata", package="atena"),
pattern="*.bam", full.names=TRUE)
ttpar <- TEtranscriptsParam(bamfiles,
teFeatures=rmskLTR,
geneFeatures=gannot,
singleEnd=TRUE,
ignoreStrand=TRUE,
aggregateby="repName")
ℹ Locating BAM files
ℹ Processing features
✔ Parameter object successfully created
ttpar
TEtranscriptsParam object
# BAM files (2): control_KD.bam, piwi_KD.bam
# features (CompressedGRangesList length 19215): ACCORD2_I-int.18752, ..., ZAM_LTR.21390
# aggregated by: repName
# single-end; unstranded
For paired-end data, where would set singleEnd=FALSE
,
the fragments
parameter has the same purpose as in
TelescopeParam()
. We can also extract the TEs and gene
combined feature set using the features()
function on the
parameter object. A metadata column called isTE
is added to
enable distinguishing TEs from gene annotations.
features(ttpar)
GRangesList object of length 19215:
$`ACCORD2_I-int.18752`
GRanges object with 5 ranges and 15 metadata columns:
seqnames ranges strand | swScore milliDiv
<Rle> <IRanges> <Rle> | <integer> <numeric>
ACCORD2_I-int.18752 chrY 2683299-2683474 - | 1419 0
ACCORD2_I-int.18752 chrY 2683475-2685353 - | 8817 30
ACCORD2_I-int.18752 chrY 2685348-2688057 - | 22228 20
ACCORD2_I-int.18752 chrY 2688056-2689854 - | 6698 75
ACCORD2_I-int.18752 chrY 2689855-2690073 - | 1821 5
milliDel milliIns genoLeft repName repClass
<numeric> <numeric> <integer> <character> <character>
ACCORD2_I-int.18752 6 23 -983878 ACCORD2_LTR LTR
ACCORD2_I-int.18752 98 0 -981999 ACCORD2_I-int LTR
ACCORD2_I-int.18752 1 0 -979295 ACCORD2_I-int LTR
ACCORD2_I-int.18752 57 41 -977498 ACCORD2_I-int LTR
ACCORD2_I-int.18752 0 18 -977279 ACCORD2_LTR LTR
repFamily repStart repEnd repLeft exon_id
<character> <integer> <integer> <integer> <integer>
ACCORD2_I-int.18752 Gypsy 42 173 1 <NA>
ACCORD2_I-int.18752 Gypsy 9 7203 5142 <NA>
ACCORD2_I-int.18752 Gypsy 2371 4841 2128 <NA>
ACCORD2_I-int.18752 Gypsy 5197 2015 1 <NA>
ACCORD2_I-int.18752 Gypsy 0 215 1 <NA>
exon_name type isTE
<character> <character> <Rle>
ACCORD2_I-int.18752 <NA> <NA> TRUE
ACCORD2_I-int.18752 <NA> <NA> TRUE
ACCORD2_I-int.18752 <NA> <NA> TRUE
ACCORD2_I-int.18752 <NA> <NA> TRUE
ACCORD2_I-int.18752 <NA> <NA> TRUE
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
$`ACCORD2_I-int.18766`
GRanges object with 6 ranges and 15 metadata columns:
seqnames ranges strand | swScore milliDiv
<Rle> <IRanges> <Rle> | <integer> <numeric>
ACCORD2_I-int.18766 chrY 2737073-2737248 - | 1521 0
ACCORD2_I-int.18766 chrY 2737249-2739132 - | 9404 31
ACCORD2_I-int.18766 chrY 2739127-2741836 - | 23688 20
ACCORD2_I-int.18766 chrY 2741835-2742273 - | 2682 64
ACCORD2_I-int.18766 chrY 2742370-2743581 - | 5660 69
ACCORD2_I-int.18766 chrY 2743582-2743800 - | 1947 5
milliDel milliIns genoLeft repName repClass
<numeric> <numeric> <integer> <character> <character>
ACCORD2_I-int.18766 6 23 -930104 ACCORD2_LTR LTR
ACCORD2_I-int.18766 97 1 -928220 ACCORD2_I-int LTR
ACCORD2_I-int.18766 2 0 -925516 ACCORD2_I-int LTR
ACCORD2_I-int.18766 32 46 -925079 ACCORD2_I-int LTR
ACCORD2_I-int.18766 20 42 -923771 ACCORD2_I-int LTR
ACCORD2_I-int.18766 0 18 -923552 ACCORD2_LTR LTR
repFamily repStart repEnd repLeft exon_id
<character> <integer> <integer> <integer> <integer>
ACCORD2_I-int.18766 Gypsy 42 173 1 <NA>
ACCORD2_I-int.18766 Gypsy 6 7206 5142 <NA>
ACCORD2_I-int.18766 Gypsy 2371 4841 2128 <NA>
ACCORD2_I-int.18766 Gypsy 5197 2015 1583 <NA>
ACCORD2_I-int.18766 Gypsy 5845 1367 1 <NA>
ACCORD2_I-int.18766 Gypsy 0 215 1 <NA>
exon_name type isTE
<character> <character> <Rle>
ACCORD2_I-int.18766 <NA> <NA> TRUE
ACCORD2_I-int.18766 <NA> <NA> TRUE
ACCORD2_I-int.18766 <NA> <NA> TRUE
ACCORD2_I-int.18766 <NA> <NA> TRUE
ACCORD2_I-int.18766 <NA> <NA> TRUE
ACCORD2_I-int.18766 <NA> <NA> TRUE
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
$`ACCORD2_I-int.2712`
GRanges object with 7 ranges and 15 metadata columns:
seqnames ranges strand | swScore milliDiv
<Rle> <IRanges> <Rle> | <integer> <numeric>
ACCORD2_I-int.2712 chr2R 2204406-2204447 + | 246 119
ACCORD2_I-int.2712 chr2R 2204527-2207748 + | 15834 93
ACCORD2_I-int.2712 chr2R 2207743-2208758 + | 6631 72
ACCORD2_I-int.2712 chr2R 2208789-2209734 + | 2458 83
ACCORD2_I-int.2712 chr2R 2209729-2211681 + | 12355 81
ACCORD2_I-int.2712 chr2R 2211689-2211834 + | 556 102
ACCORD2_I-int.2712 chr2R 2212158-2212233 + | 475 107
milliDel milliIns genoLeft repName repClass
<numeric> <numeric> <integer> <character> <character>
ACCORD2_I-int.2712 0 0 -23082489 ACCORD2_I-int LTR
ACCORD2_I-int.2712 62 18 -23079188 ACCORD2_I-int LTR
ACCORD2_I-int.2712 65 8 -23078178 ACCORD2_I-int LTR
ACCORD2_I-int.2712 103 7 -23077202 ACCORD2_I-int LTR
ACCORD2_I-int.2712 57 9 -23075255 ACCORD2_I-int LTR
ACCORD2_I-int.2712 30 123 -23075102 ACCORD2_LTR LTR
ACCORD2_I-int.2712 0 13 -23074703 ACCORD2_LTR LTR
repFamily repStart repEnd repLeft exon_id
<character> <integer> <integer> <integer> <integer>
ACCORD2_I-int.2712 Gypsy 1326 1367 5845 <NA>
ACCORD2_I-int.2712 Gypsy 1522 4950 2262 <NA>
ACCORD2_I-int.2712 Gypsy 5159 6236 976 <NA>
ACCORD2_I-int.2712 Gypsy 3851 4950 2262 <NA>
ACCORD2_I-int.2712 Gypsy 5159 7209 3 <NA>
ACCORD2_I-int.2712 Gypsy 1 132 83 <NA>
ACCORD2_I-int.2712 Gypsy 141 215 0 <NA>
exon_name type isTE
<character> <character> <Rle>
ACCORD2_I-int.2712 <NA> <NA> TRUE
ACCORD2_I-int.2712 <NA> <NA> TRUE
ACCORD2_I-int.2712 <NA> <NA> TRUE
ACCORD2_I-int.2712 <NA> <NA> TRUE
ACCORD2_I-int.2712 <NA> <NA> TRUE
ACCORD2_I-int.2712 <NA> <NA> TRUE
ACCORD2_I-int.2712 <NA> <NA> TRUE
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
...
<19212 more elements>
mcols(features(ttpar))
DataFrame with 19215 rows and 4 columns
Status RelLength Class isTE
<character> <numeric> <character> <logical>
ACCORD2_I-int.18752 full-lengthLTR 0.887595 LTR TRUE
ACCORD2_I-int.18766 full-lengthLTR 0.868882 LTR TRUE
ACCORD2_I-int.2712 partialLTR_down 0.968464 LTR TRUE
ACCORD2_I-int.6123 full-lengthLTR 1.000000 LTR TRUE
ACCORD2_LTR.19682 noLTR 1.000000 LTR TRUE
... ... ... ... ...
TRANSPAC_I-int.8329 full-lengthLTR 0.999810 LTR TRUE
TRANSPAC_I-int.9501 full-lengthLTR 1.000000 LTR TRUE
ZAM_I-int.2573 full-lengthLTR 0.998459 LTR TRUE
ZAM_I-int.7530 full-lengthLTR 0.861987 LTR TRUE
ZAM_LTR.21390 noLTR 0.957627 LTR TRUE
table(mcols(features(ttpar))$isTE)
FALSE TRUE
17807 1408
Regarding gene expression quantification, atena
has
implemented the approach of the original TEtranscripts software (Jin et al. 2015). As in the case of the ERVmap
and Telescope methods from atena
, either secondary
alignments or the NH
tag are required.
Following the gene annotation processing present in the TEtranscripts
algorithm, in case that geneFeatures
contains a metadata
column named “type”, only the elements with type="exon"
are
considered for quantification. If those elements are grouped through a
GRangesList
object, then counts are aggregated at the level
of those GRangesList
elements, such as genes or
transcripts. This also applies to the ERVmap and Telescope methods
implemented in atena
when gene features are present. Let’s
see an example of this processing:
## Create a toy example of gene annotations
geneannot <- GRanges(seqnames=rep("2L", 8),
ranges=IRanges(start=c(1,20,45,80,110,130,150,170),
width=c(10,20,35,10,5,15,10,25)),
strand="*",
type=rep("exon",8))
names(geneannot) <- paste0("gene",c(rep(1,3),rep(2,4),rep(3,1)))
geneannot
GRanges object with 8 ranges and 1 metadata column:
seqnames ranges strand | type
<Rle> <IRanges> <Rle> | <character>
gene1 2L 1-10 * | exon
gene1 2L 20-39 * | exon
gene1 2L 45-79 * | exon
gene2 2L 80-89 * | exon
gene2 2L 110-114 * | exon
gene2 2L 130-144 * | exon
gene2 2L 150-159 * | exon
gene3 2L 170-194 * | exon
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
ttpar2 <- TEtranscriptsParam(bamfiles,
teFeatures=rmskLTR,
geneFeatures=geneannot,
singleEnd=TRUE,
ignoreStrand=TRUE)
ℹ Locating BAM files
ℹ Processing features
✔ Parameter object successfully created
mcols(features(ttpar2))
DataFrame with 1411 rows and 4 columns
Status RelLength Class isTE
<character> <numeric> <character> <logical>
ACCORD2_I-int.18752 full-lengthLTR 0.887595 LTR TRUE
ACCORD2_I-int.18766 full-lengthLTR 0.868882 LTR TRUE
ACCORD2_I-int.2712 partialLTR_down 0.968464 LTR TRUE
ACCORD2_I-int.6123 full-lengthLTR 1.000000 LTR TRUE
ACCORD2_LTR.19682 noLTR 1.000000 LTR TRUE
... ... ... ... ...
ZAM_I-int.7530 full-lengthLTR 0.861987 LTR TRUE
ZAM_LTR.21390 noLTR 0.957627 LTR TRUE
gene1 NA NA NA FALSE
gene2 NA NA NA FALSE
gene3 NA NA NA FALSE
features(ttpar2)[!mcols(features(ttpar2))$isTE]
GRangesList object of length 3:
$gene1
GRanges object with 3 ranges and 13 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel milliIns
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <numeric>
gene1 chr2L 1-10 * | <NA> NA NA NA
gene1 chr2L 20-39 * | <NA> NA NA NA
gene1 chr2L 45-79 * | <NA> NA NA NA
genoLeft repName repClass repFamily repStart repEnd
<integer> <character> <character> <character> <integer> <integer>
gene1 <NA> <NA> <NA> <NA> <NA> <NA>
gene1 <NA> <NA> <NA> <NA> <NA> <NA>
gene1 <NA> <NA> <NA> <NA> <NA> <NA>
repLeft type isTE
<integer> <character> <Rle>
gene1 <NA> exon FALSE
gene1 <NA> exon FALSE
gene1 <NA> exon FALSE
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
$gene2
GRanges object with 4 ranges and 13 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel milliIns
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <numeric>
gene2 chr2L 80-89 * | <NA> NA NA NA
gene2 chr2L 110-114 * | <NA> NA NA NA
gene2 chr2L 130-144 * | <NA> NA NA NA
gene2 chr2L 150-159 * | <NA> NA NA NA
genoLeft repName repClass repFamily repStart repEnd
<integer> <character> <character> <character> <integer> <integer>
gene2 <NA> <NA> <NA> <NA> <NA> <NA>
gene2 <NA> <NA> <NA> <NA> <NA> <NA>
gene2 <NA> <NA> <NA> <NA> <NA> <NA>
gene2 <NA> <NA> <NA> <NA> <NA> <NA>
repLeft type isTE
<integer> <character> <Rle>
gene2 <NA> exon FALSE
gene2 <NA> exon FALSE
gene2 <NA> exon FALSE
gene2 <NA> exon FALSE
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
$gene3
GRanges object with 1 range and 13 metadata columns:
seqnames ranges strand | swScore milliDiv milliDel milliIns
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <numeric>
gene3 chr2L 170-194 * | <NA> NA NA NA
genoLeft repName repClass repFamily repStart repEnd
<integer> <character> <character> <character> <integer> <integer>
gene3 <NA> <NA> <NA> <NA> <NA> <NA>
repLeft type isTE
<integer> <character> <Rle>
gene3 <NA> exon FALSE
-------
seqinfo: 1870 sequences (1 circular) from dm6 genome
Finally, to quantify TE expression we call the qtex()
method using one of the previously defined parameter objects
(ERVmapParam
, TEtranscriptsParam
or
TelescopeParam
) according to the quantification method we
want to use. As with the OneCodeToFindThemAll()
function
described before, here we can also use the BPPARAM
parameter to perform calculations in parallel.
The qtex()
method returns a
SummarizedExperiment
object containing the resulting
quantification of expression in an assay slot. Additionally, when a
data.frame
, or DataFrame
, object storing
phenotypic data is passed to the qtex()
function through
the phenodata
parameter, this will be included as column
data in the resulting SummarizedExperiment
object and the
row names of these phenotypic data will be set as column names in the
output SummarizedExperiment
object.
In the current example, the call to quantify TE expression using the ERVmap method would be the following:
emq
class: RangedSummarizedExperiment
dim: 1408 2
metadata(0):
assays(1): counts
rownames(1408): ACCORD2_I-int.18752 ACCORD2_I-int.18766 ...
ZAM_I-int.7530 ZAM_LTR.21390
rowData names(4): Status RelLength Class isTE
colnames(2): control_KD piwi_KD
colData names(0):
colSums(assay(emq))
control_KD piwi_KD
95 75
In the case of the Telescope method, the call would be as follows:
tsq
class: RangedSummarizedExperiment
dim: 1408 2
metadata(0):
assays(1): counts
rownames(1408): ACCORD2_I-int.18752 ACCORD2_I-int.18766 ...
ZAM_I-int.7530 ZAM_LTR.21390
rowData names(4): Status RelLength Class isTE
colnames(2): control_KD piwi_KD
colData names(0):
colSums(assay(tsq))
control_KD piwi_KD
144 3
For the TEtranscripts method, TE expression is quantified by using the following call:
ttq
class: RangedSummarizedExperiment
dim: 17917 2
metadata(0):
assays(1): counts
rownames(17917): ACCORD2_I-int ACCORD2_LTR ... FBgn0286940 FBgn0286941
rowData names(4): Status RelLength Class isTE
colnames(2): control_KD piwi_KD
colData names(0):
colSums(assay(ttq))
control_KD piwi_KD
149 133
As mentioned, TE expression quantification is provided at the repeat name level.
The qtex()
function returns a
SummarizedExperiment
object that, on the one hand, stores
the quantified expression in its assay data.
head(assay(ttq))
control_KD piwi_KD
ACCORD2_I-int 0 0
ACCORD2_LTR 0 0
ACCORD_LTR 0 0
BATUMI_I-int 0 0
BATUMI_LTR 0 0
BEL_I-int 0 0
On the other hand, it contains metadata about the features that may
be useful to select subsets of the quantified data and extract and
explore the feature annotations, using the function
rowData()
on this SummarizedExperiment
object.
rowData(ttq)
DataFrame with 17917 rows and 4 columns
Status RelLength Class isTE
<CharacterList> <numeric> <character> <logical>
ACCORD2_I-int partialLTR_down 0.968464 LTR TRUE
ACCORD2_LTR full-lengthLTR,noLTR 0.928405 LTR TRUE
ACCORD_LTR full-lengthLTR,noLTR 0.879776 LTR TRUE
BATUMI_I-int int,partialLTR_down,noLTR 0.934641 LTR TRUE
BATUMI_LTR full-lengthLTR,partialLTR_up 0.963187 LTR TRUE
... ... ... ... ...
FBgn0286937 NA NA NA FALSE
FBgn0286938 NA NA NA FALSE
FBgn0286939 NA NA NA FALSE
FBgn0286940 NA NA NA FALSE
FBgn0286941 NA NA NA FALSE
Because we have aggregated quantifications by RepName
the number of TE quantified features has been substantially reduced with
respect to the original number of TE features.
Let’s say we want to select full-length LTRs features, this could be a way of doing it.
temask <- rowData(ttq)$isTE
fullLTRs <- rowData(ttq)$Status == "full-lengthLTR"
fullLTRs <- (sapply(fullLTRs, sum, na.rm=TRUE) == 1) &
(lengths(rowData(ttq)$Status) == 1)
sum(fullLTRs)
[1] 14
rowData(ttq)[fullLTRs, ]
DataFrame with 14 rows and 4 columns
Status RelLength Class isTE
<CharacterList> <numeric> <character> <logical>
BEL_LTR full-lengthLTR 0.969638 LTR TRUE
BLASTOPIA_LTR full-lengthLTR 0.990062 LTR TRUE
BLOOD_LTR full-lengthLTR 0.991468 LTR TRUE
BURDOCK_LTR full-lengthLTR 0.982514 LTR TRUE
Chouto_LTR full-lengthLTR 0.972210 LTR TRUE
... ... ... ... ...
Gypsy6_LTR full-lengthLTR 0.899885 LTR TRUE
Gypsy8_LTR full-lengthLTR 1.000000 LTR TRUE
Gypsy9_LTR full-lengthLTR 0.975136 LTR TRUE
Invader3_LTR full-lengthLTR 0.989059 LTR TRUE
TRANSPAC_LTR full-lengthLTR 0.999943 LTR TRUE
Note also that since we restricted expression quantification to LTRs, we do have only quantification for that TE class.
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.12.0
[2] GenomicFeatures_1.59.1
[3] AnnotationDbi_1.69.0
[4] RColorBrewer_1.1-3
[5] BiocParallel_1.41.0
[6] atena_1.13.0
[7] SummarizedExperiment_1.37.0
[8] Biobase_2.67.0
[9] GenomicRanges_1.59.1
[10] GenomeInfoDb_1.43.2
[11] IRanges_2.41.2
[12] S4Vectors_0.45.2
[13] BiocGenerics_0.53.3
[14] generics_0.1.3
[15] MatrixGenerics_1.19.0
[16] matrixStats_1.4.1
[17] knitr_1.49
[18] BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
[4] filelock_1.0.3 Biostrings_2.75.3 bitops_1.0-9
[7] fastmap_1.2.0 RCurl_1.98-1.16 BiocFileCache_2.15.0
[10] GenomicAlignments_1.43.0 XML_3.99-0.17 digest_0.6.37
[13] mime_0.12 lifecycle_1.0.4 KEGGREST_1.47.0
[16] RSQLite_2.3.9 magrittr_2.0.3 compiler_4.4.2
[19] rlang_1.1.4 sass_0.4.9 tools_4.4.2
[22] yaml_2.3.10 rtracklayer_1.67.0 S4Arrays_1.7.1
[25] bit_4.5.0.1 curl_6.0.1 DelayedArray_0.33.3
[28] abind_1.4-8 withr_3.0.2 purrr_1.0.2
[31] sys_3.4.3 grid_4.4.2 cli_3.6.3
[34] rmarkdown_2.29 crayon_1.5.3 httr_1.4.7
[37] rjson_0.2.23 DBI_1.2.3 cachem_1.1.0
[40] zlibbioc_1.52.0 parallel_4.4.2 BiocManager_1.30.25
[43] XVector_0.47.1 restfulr_0.0.15 vctrs_0.6.5
[46] Matrix_1.7-1 jsonlite_1.8.9 bit64_4.5.2
[49] maketools_1.3.1 jquerylib_0.1.4 glue_1.8.0
[52] codetools_0.2-20 BiocVersion_3.21.1 BiocIO_1.17.1
[55] UCSC.utils_1.3.0 tibble_3.2.1 pillar_1.10.0
[58] rappdirs_0.3.3 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
[61] R6_2.5.1 dbplyr_2.5.0 sparseMatrixStats_1.19.0
[64] evaluate_1.0.1 lattice_0.22-6 AnnotationHub_3.15.0
[67] png_0.1-8 Rsamtools_2.23.1 memoise_2.0.1
[70] SQUAREM_2021.1 bslib_0.8.0 Rcpp_1.0.13-1
[73] SparseArray_1.7.2 xfun_0.49 buildtools_1.0.0
[76] pkgconfig_2.0.3