--- title: "An introduction to the atena package" author: - name: Beatriz Calvo-Serra affiliation: - &id Dept. of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, Spain email: beatriz.calvo@upf.edu - name: Robert Castelo affiliation: *id email: robert.castelo@upf.edu package: "`r pkg_ver('atena')`" abstract: > The `atena` package provides methods to quantify the expression of transposable elements within R and Bioconductor. vignette: > %\VignetteIndexEntry{An introduction to the atena package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc: true toc_float: true number_sections: true bibliography: bibliography.bib --- ```{r setup, echo=FALSE} library(knitr) options(width=80) knitr::opts_chunk$set( collapse=TRUE, comment="") ``` # What are transposable elements Transposable elements (TEs) are autonomous mobile genetic elements. They are DNA sequences that have, or once had, the ability to mobilize within the genome either directly or through an RNA intermediate [@payer2019transposable]. TEs can be categorized into two classes based on the intermediate substrate propagating insertions (RNA or DNA). Class I TEs, also called retrotransposons, first transcribe an RNA copy that is then reverse transcribed to cDNA before inserting in the genome. In turn, these can be divided into long terminal repeat (LTR) retrotransposons, which refer to endogenous retroviruses (ERVs), and non-LTR retrotransposons, which include long interspersed element class 1 (LINE-1 or L1) and short interspersed elements (SINEs). Class II TEs, also known as DNA transposons, directly excise themselves from one location before reinsertion. TEs are further split into families and subfamilies depending on various structural features [@goerner2018computational; @guffanti2018novel]. Most TEs have lost the capacity for generating new insertions over their evolutionary history and are now fixed in the human population. Their insertions have resulted in a complex distribution of interspersed repeats comprising almost half (50%) of the human genome [@payer2019transposable]. TE expression has been observed in association with physiological processes in a wide range of species, including humans where it has been described to be important in early embryonic pluripotency and development. Moreover, aberrant TE expression has been associated with diseases such as cancer, neurodegenerative disorders, and infertility [@payer2019transposable]. # Currently available methods for quantifying TE expression The study of TE expression faces one main challenge: given their repetitive nature, the majority of TE-derived reads map to multiple regions of the genome and these multi-mapping reads are consequently discarded in standard RNA-seq data processing pipelines. For this reason, specific software packages for the quantification of TE expression have been developed [@goerner2018computational], such as TEtranscripts [@jin2015tetranscripts], ERVmap [@tokuyama2018ervmap] and Telescope [@bendall2019telescope]. The main differences between these three methods are the following: * [TEtranscripts](https://github.com/mhammell-laboratory/TEtranscripts) [@jin2015tetranscripts] reassigns multi-mapping reads to TEs proportionally to their relative abundance, which is estimated using an expectation-maximization (EM) algorithm. * [ERVmap](https://github.com/mtokuyama/ERVmap) [@tokuyama2018ervmap] is based on selective filtering of multi-mapping reads. It applies filters that consist in discarding reads when the ratio of sum of hard and soft clipping to the length of the read (base pair) is greater than or equal to 0.02, the ratio of the edit distance to the sequence read length (base pair) is greater or equal to 0.02 and/or the difference between the alignment score from BWA (field AS) and the suboptimal alignment score from BWA (field XS) is less than 5. * [Telescope](https://github.com/mlbendall/telescope) [@bendall2019telescope] reassigns multi-mapping reads to TEs using their relative abundance, which like in TEtranscripts, is also estimated using an EM algorithm. The main differences with respect to TEtranscripts are: (1) Telescope works with an additional parameter for each TE that estimates the proportion of multi-mapping reads that need to be reassigned to that TE; (2) that reassignment parameter is optimized during the EM algorithm jointly with the TE relative abundances, using a Bayesian maximum a posteriori (MAP) estimate that allows one to use prior values on these two parameters; and (3) using the final estimates on these two parameters, multi-mapping reads can be flexibly reassigned to TEs using different strategies, where the default one is to assign a multi-mapping read to the TE with largest estimated abundance and discard those multi-mapping reads with ties on those largest abundances. Because these tools were only available outside R and Bioconductor, the `atena` package provides a complete re-implementation in R of these three methods to facilitate the integration of TE expression quantification into Bioconductor workflows for the analysis of RNA-seq data. # TEs annotations Another challenge in TE expression quantification is the lack of complete TE annotations due to the difficulty to correctly place TEs in genome assemblies [@goerner2018computational]. One of the main sources of TE annotations are RepeatMasker annotations, available for instance at the RepeatMasker track of the UCSC Genome Browser. `atena` can fetch RepeatMasker annotations with the function `annotaTEs()` and flexibly parse them by using a parsing function provided through the parameter `parsefun`. Examples of `parsefun` included in `atena` are: * `rmskidentity()`: returns RepeatMasker annotations without any modification. * `rmskbasicparser()`: filters out non-TE repeats and elements without strand information from RepeatMasker annotations. Then assigns a unique id to each elements based on their repeat name. * `OneCodeToFindThemAll()`: implementation of the "One Code To Find Them All" algorithm by @bailly2014one, for parsing RepeatMasker output files. * `rmskatenaparser()`: attempts to reconstruct fragmented TEs by assembling together fragments from the same TE that are close enough. For LTR class TEs, tries to reconstruct full-length and partial TEs following the LTR - internal region - LTR structure. Both, the `rmskatenaparser()` and `OneCodeToFindThemAll()` parser functions attempt to address the annotation fragmentation present in the output files of the RepeatMasker software (i.e. presence of multiple hits, such as homology-based matches, corresponding to a unique copy of an element). This is highly frequent for TEs of the LTR class, where the consensus sequences are split separately into the LTR and internal regions, causing RepeatMasker to also report these two regions of the TE as two separate elements. These two functions try to identify these and other cases of fragmented annotations and assemble them together into single elements. To do so, the assembled elements must satisfy certain criteria. These two parser functions differ in those criteria, as well as in the approach for finding equivalences between LTR and internal regions to reconstruct LTR retrotransposons. The `rmskatenaparser()` function is also much faster than `OneCodeToFindThemAll()`. ## Retrieving and parsing TE annotations As an example, let's retrieve TE annotations for *Drosophila melanogaster* *dm6* genome version. By setting `rmskidentity()` as argument to the `parsefun` parameter, RepeatMasker annotations are retrieved intact as a `GRanges` object. ```{r, message=FALSE, warning=FALSE} library(atena) library(BiocParallel) rmskann <- annotaTEs(genome="dm6", parsefun=rmskidentity) rmskann ``` We can see that we obtained annotations for `r length(rmskann)` elements. Now, let's fetch the same RepeatMasker annotations, but process them using the `OneCodeToFindThemAll` parser function [@bailly2014one]. We set the parameter `strict=FALSE` to avoid applying a filter of minimum 80% identity with the consensus sequence and minimum 80 bp length. The `insert` parameter is set to 500, meaning that two elements with the same name are merged if they are closer than 500 bp in the annotations. The `BPPARAM` parameter allows one to run calculations in parallel using the functionality of the [BiocParallel](https://bioconductor.org/packages/BiocParallel) Bioconductor package. In this particular example, we are setting the `BPPARAM` parameter to `SerialParam(progress=FALSE)` to disable parallel calculations and progress reporting, but a common setting if we want to run calculations in parallel would be `BPPARAM=Multicore(workers=ncores, progress=TRUE)`, which would use `ncores` parallel threads of execution and report progress on the calculations. ```{r, message=FALSE, warning=FALSE} teann <- annotaTEs(genome="dm6", parsefun=OneCodeToFindThemAll, strict=FALSE, insert=500, BPPARAM=SerialParam(progress=FALSE)) length(teann) teann[1] ``` As expected, we get a lower number of elements in the annotations, because repeats that are not TEs have been removed. Furthermore, some fragmented regions of TEs have been assembled together. This time, the resulting `teann` object is of class `GRangesList`. Each element of the list represents an assembled TE containing a `GRanges` object of length one, if the TE could not be not assembled with another element, or of length greater than one, if two or more fragments were assembled together into a single TE. We can get more information of the parsed annotations by accessing the metadata columns with `mcols()`: ```{r} mcols(teann) ``` There is information about the reconstruction status of the TE (*Status* column), the relative length of the reconstructed TE (*RelLength*) and the repeat class of the TE (*Class*). The relative length is calculated by adding the length (in base pairs) of all fragments from the same assembled TE, and dividing that sum by the length (in base pairs) of the consensus sequence. For full-length and partially reconstructed LTR TEs, the consensus sequence length used is the one resulting from adding twice the consensus sequence length of the long terminal repeat (LTR) and the one from the corresponding internal region. For solo-LTRs, the consensus sequence length of the long terminal repeat is used. We can get an insight into the composition of the assembled annotations using the information from the *status* column. Let's look at the absolute frequencies of the status and class of TEs in the annotations. ```{r comparsedann, message=FALSE, height=6, width=10, out.width="800px", fig.cap="Composition of parsed TE annotations.", echo=FALSE} library(RColorBrewer) pal1 <- brewer.pal(6, "Pastel2") pal2 <- brewer.pal(length(unique(mcols(teann)$Class)), "Set2") par(mfrow = c(1,2), mar = c(5,4,3,1)) bp1 <- barplot(table(mcols(teann)$Status), col = pal1, border = "black", main = "TEs by status", cex.axis=0.8, xaxt = "n") grid(nx=NA, ny=NULL) axis(1, at=bp1, labels = FALSE, las=1, line=0, lwd = 0, lwd.ticks = 1) par(xpd=TRUE) text(x= bp1[, 1] - 0.3, y = 10, labels=names(table(mcols(teann)$Status)), srt=40, offset = 1.7, cex = 0.8, pos = 1) par(xpd=FALSE) bp2 <- barplot(table(mcols(teann)$Class), col = pal2, border = "black", main = "TEs by class", cex.axis=0.8, xaxt = "n", ylim = c(0,max(table(mcols(teann)$Status)))) grid(nx=NA, ny=NULL) axis(1, at=bp2, labels = FALSE, las=1, line=0, lwd = 0, lwd.ticks = 1) par(xpd=TRUE) text(x= bp2[, 1] - 0.1, y = 10, labels=names(table(mcols(teann)$Class)), srt=35, offset = 1.2, cex = 0.8, pos = 1) par(xpd=FALSE) ``` Here, *full-lengthLTR* are reconstructed LTR retrotransposons following the LTR - internal region (int) - LTR structure. Partially reconstructed LTR TEs are *partialLTR_down* (internal region followed by a downstream LTR) and *partialLTR_up* (LTR upstream of an internal region). *int* and *LTR* correspond to internal and solo-LTR regions, respectively. Finally, the *noLTR* refers to TEs of other classes (not LTR), as well as TEs of class LTR which could not be identified as either internal or long terminal repeat regions based on their name. Moreover, the `atena` package provides getter functions to retrieve TEs of a specific class, using a specific relative length threshold. Those TEs with higher relative lengths are more likely to have intact open reading frames, making them more interesting for expression quantification and functional analyses. For example, to get LINEs with a minimum of 0.9 relative length, we can use the `getLINEs()` function. We use the TE annotations in `teann` we obtained before and set the `relLength` to 0.9. ```{r} rmskLINE <- getLINEs(teann, relLength=0.9) length(rmskLINE) rmskLINE[1] ``` To get LTR retrotransposons, we can use the function `getLTRs()`. This function also allows to get one or more specific types of reconstructed TEs. To get full-length, partial LTRs and other fragments that could not be reconstructed, we can: ```{r} rmskLTR <- getLTRs(teann, relLength=0.8, fullLength=TRUE, partial=TRUE, otherLTR=TRUE) length(rmskLTR) rmskLTR[1] ``` To obtain DNA transposons and SINEs, one can use the `getDNAtransposons()` and `getSINEs()` functions, respectively. # TE expression quantification Quantification of TE expression with `atena` consists in the following two steps: 1. Building of a parameter object for one of the available quantification methods. 2. Calling the TE expression quantification method `qtex()` using the previously built parameter object. The dataset that will be used to illustrate how to quantify TE expression with `atena` is a published RNA-seq dataset of _Drosophila melanogaster_ available at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) under accession [GSE47006](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47006)). The two selected samples are: a piwi knockdown and a piwi control (GSM1142845 and GSM1142844). These files have been subsampled. The piwi-associated silencing complex (piRISC) silences TEs in the _Drosophila_ ovary, hence the knockdown of piwi causes the de-repression of TEs. Here we show how the expression of full-length LTR retrotransposons present in `rmskLTR` can be easily quantified using `atena`. ## Building a parameter object A parameter object is build calling a specific function for the quantification method we want to use. Independenty of each method, all parameter object constructor functions require that the first two arguments specify the BAM files and the TE annotation, respectively. ### ERVmap To use the ERVmap method in `atena` we should first build an object of the class `ERVmapParam` using the function `ERVmapParam()`. The `singleEnd` parameter is set to `TRUE` since the example BAM files are single-end. The `ignoreStrand` parameter works analogously to the same parameter in the function `summarizeOverlaps()` from package `r Biocpkg("GenomicAlignments")` and should be set to `TRUE` whenever the RNA library preparation protocol was stranded. One of the filters applied by the ERVmap method compares the alignment score of a given primary alignment, stored in the `AS` tag of a SAM record, to the largest alignment score among every other secondary alignment, known as the suboptimal alignment score. The original ERVmap software assumes that input BAM files are generated using the Burrows-Wheeler Aligner (BWA) software [@li2009fast], which stores suboptimal alignment scores in the `XS` tag. Although `AS` is an optional tag, most short-read aligners provide this tag with alignment scores in BAM files. However, the suboptimal alignment score, stored in the `XS` tag by BWA, is either stored in a different tag or not stored at all by other short-read aligner software, such as STAR [@dobin2013star]. To enable using ERVmap on BAM files produced by short-read aligner software other than BWA, `atena` allows the user to set the argument `suboptimalAlignmentTag` to one of the following three possible values: * The name of a tag different to `XS` that stores the suboptimal alignment score. * The value "none", which will trigger the calculation of the suboptimal alignment score by searching for the largest value stored in the `AS` tag among all available secondary alignments. * The value "auto" (default), by which `atena` will first extract the name of the short-read aligner software from the BAM file and if that software is BWA, then suboptimal alignment scores will be obtained from the `XS` tag. Otherwise, it will trigger the calculation previously explained for `suboptimalAlignemntTag="none"`. Finally, this filter is applied by comparing the difference between alignment and suboptimal alignment scores to a cutoff value, which by default is 5 but can be modified using the parameter `suboptimalAlignmentCutoff`. The default value 5 is the one employed in the original ERVmap software that assumes the BAM file was generated with BWA and for which lower values are interpreted as "equivalent to second best match has one or more mismatches than the best match" [@tokuyama2018ervmap, pg. 12571]. From a different perspective, in BWA the mismatch penalty has a value of 4 and therefore, a `suboptimalAlignmentCutoff` value of 5 only retains those reads where the suboptimal alignment has at least 1 mismatch more than the best match. Therefore, the `suboptimalAlignmentCutoff` value is specific to the short-read mapper software and we recommend to set this value according to the mismatch penalty of that software. Another option is to set `suboptimalAlignmentCutoff=NA`, which prevents the filtering of reads based on this criteria, as set in the following example. ```{r} bamfiles <- list.files(system.file("extdata", package="atena"), pattern="*.bam", full.names=TRUE) empar <- ERVmapParam(bamfiles, teFeatures=rmskLTR, singleEnd=TRUE, ignoreStrand=TRUE, suboptimalAlignmentCutoff=NA) empar ``` In the case of paired-end BAM files (`singleEnd=FALSE`), two additional arguments can be specified, `strandMode` and `fragments`: * `strandMode` defines the behavior of the strand getter when internally reading the BAM files with the `GAlignmentPairs()` function. See the help page of `strandMode` in the `r Biocpkg("GenomicAlignments")` package for further details. * `fragments` controls how read filtering and counting criteria are applied to the read mates in a paired-end read. To use the original ERVmap algorithm [@tokuyama2018ervmap] one should set `fragments=TRUE` (default when `singleEnd=FALSE`), which filters and counts each mate of a paired-end read independently (i.e., two read mates overlapping the same feature count twice on that feature, treating paired-end reads as if they were single-end). On the other hand, when `fragments=FALSE`, if the two read mates pass the filtering criteria and overlap the same feature, they count once on that feature. If either read mate fails to pass the filtering criteria, then both read mates are discarded. An additional functionality with respect to the original ERVmap software is the integration of gene and TE expression quantification. The original ERVmap software doesn't quantify TE and gene expression coordinately and this can potentially lead to counting twice reads that simultaneously overlap a gene and a TE. In `atena`, gene expression is quantified based on the approach used in the TEtranscripts software [@jin2015tetranscripts]: unique reads are preferably assigned to genes, whereas multi-mapping reads are preferably assigned to TEs. In case that a unique read does not overlap a gene or a multi-mapping read does not overlap a TE, `atena` searches for overlaps with TEs or genes, respectively. Given the different treatment of unique and multi-mapping reads, `atena` requires the information regarding the _unique_ or _multi-mapping_ status of a read. This information is obtained from the presence of secondary alignments in the BAM file or, alternatively, from the `NH` tag in the BAM file (number of reported alignments that contain the query in the current SAM record). Therefore, either secondary alignments or the `NH` tag need to be present for gene expression quantification. The original ERVmap approach does not discard any read overlapping gene annotations. However, this can be changed using the parameter `geneCountMode`, which by default `geneCountMode="all"` and follows the behavior in the original ERVmap method. On the contrary, by setting `geneCountMode="ervmap"`, `atena` also applies the filtering criteria employed to quantify TE expression to the reads overlapping gene annotations. Finally, `atena` also allows one to aggregate TE expression quantifications. By default, the names of the input `GRanges` or `GRangesList` object given in the `teFeatures` parameter are used to aggregate quantifications. However, the `aggregateby` parameter can be used to specify other column names in the feature annotations to be used to aggregate TE counts, for example at the sub-family level. ### Telescope To use the Telescope method for TE expression quantification, the `TelescopeParam()` function is used to build a parameter object of the class `TelescopeParam`. As in the case of `ERVmapParam()`, the `aggregateby` argument, which should be a character vector of column names in the annotation, determines the columns to be used to aggregate TE expression quantifications. This way, `atena` provides not only quantifications at the subfamily level, but also allows to quantify TEs at the desired level (family, class, etc.), including locus based quantifications. For such a use case, the object with the TE annotations should include a column with unique identifiers for each TE locus and the `aggregateby` argument should specify the name of that column. When `aggregateby` is not specified, the `names()` of the object containing TE annotations are used to aggregate quantifications. Here, TE quantifications will be aggregated according to the `names()` of the `rmskLTR` object. ```{r} bamfiles <- list.files(system.file("extdata", package="atena"), pattern="*.bam", full.names=TRUE) tspar <- TelescopeParam(bfl=bamfiles, teFeatures=rmskLTR, singleEnd=TRUE, ignoreStrand=TRUE) tspar ``` In case of paired-end data (`singleEnd=FALSE`), the argument usage is similar to that of `ERVmapParam()`. In relation to the BAM file, Telescope follows the same approach as the ERVmap method: when `fragments=FALSE`, only _mated read pairs_ from opposite strands are considered, while when `fragments=TRUE`, same-strand pairs, singletons, reads with unmapped pairs and other fragments are also considered by the algorithm. However, there is one important difference with respect to the counting approach followed by ERVmap: when `fragments=TRUE` _mated read pairs_ mapping to the same element are counted once, whereas in the ERVmap method they are counted twice. As in the ERVmap method from `atena`, the gene expression quantification method in Telescope is based on the approach of the TEtranscripts software [@jin2015tetranscripts]. This way, `atena` provides the possibility to integrate TE expression quantification by Telescope with gene expression quantification. As in the case of the ERVmap method implemented in `atena`, either secondary alignments or the `NH` tag are required for gene expression quantification. ### TEtranscripts Finally, the third method available is TEtranscripts. First, the `TEtranscriptsParam()` function is called to build a parameter object of the class `TEtranscriptsParam`. The usage of the `aggregateby` argument is the same as in `TelescopeParam()` and `ERVmapParam()`. Locus based quantifications in the TEtranscripts method from `atena` is possible because the TEtranscripts algorithm actually computes TE quantifications at the locus level and then sums up all instances of each TE subfamily to provide expression at the subfamily level. By avoiding this last step, `atena` can provide TE expression quantification at the locus level using the TEtranscripts method. For such a use case, the object with the TE annotations should include a column with unique identifiers for each TE and the `aggregateby` argument should specify the name of that column. In this example, the `aggregateby` argument will be set to `aggregateby="repName"` in order to aggregate quantifications at the repeat name level. Moreover, gene expression will also be quantified. To do so, gene annotations are loaded from a *TxDb* object. ```{r, message=FALSE} library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene gannot <- exonsBy(txdb, by="gene") length(gannot) ``` ```{r} bamfiles <- list.files(system.file("extdata", package="atena"), pattern="*.bam", full.names=TRUE) ttpar <- TEtranscriptsParam(bamfiles, teFeatures=rmskLTR, geneFeatures=gannot, singleEnd=TRUE, ignoreStrand=TRUE, aggregateby="repName") ttpar ``` For paired-end data, where would set `singleEnd=FALSE`, the `fragments` parameter has the same purpose as in `TelescopeParam()`. We can also extract the TEs and gene combined feature set using the `features()` function on the parameter object. A metadata column called `isTE` is added to enable distinguishing TEs from gene annotations. ```{r} features(ttpar) mcols(features(ttpar)) table(mcols(features(ttpar))$isTE) ``` Regarding gene expression quantification, `atena` has implemented the approach of the original TEtranscripts software [@jin2015tetranscripts]. As in the case of the ERVmap and Telescope methods from `atena`, either secondary alignments or the `NH` tag are required. Following the gene annotation processing present in the TEtranscripts algorithm, in case that `geneFeatures` contains a metadata column named "type", only the elements with `type="exon"` are considered for quantification. If those elements are grouped through a `GRangesList` object, then counts are aggregated at the level of those `GRangesList` elements, such as genes or transcripts. This also applies to the ERVmap and Telescope methods implemented in `atena` when gene features are present. Let's see an example of this processing: ```{r} ## Create a toy example of gene annotations geneannot <- GRanges(seqnames=rep("2L", 8), ranges=IRanges(start=c(1,20,45,80,110,130,150,170), width=c(10,20,35,10,5,15,10,25)), strand="*", type=rep("exon",8)) names(geneannot) <- paste0("gene",c(rep(1,3),rep(2,4),rep(3,1))) geneannot ttpar2 <- TEtranscriptsParam(bamfiles, teFeatures=rmskLTR, geneFeatures=geneannot, singleEnd=TRUE, ignoreStrand=TRUE) mcols(features(ttpar2)) features(ttpar2)[!mcols(features(ttpar2))$isTE] ``` ### Quantifying expression Finally, to quantify TE expression we call the `qtex()` method using one of the previously defined parameter objects (`ERVmapParam`, `TEtranscriptsParam` or `TelescopeParam`) according to the quantification method we want to use. As with the `OneCodeToFindThemAll()` function described before, here we can also use the `BPPARAM` parameter to perform calculations in parallel. The `qtex()` method returns a `SummarizedExperiment` object containing the resulting quantification of expression in an assay slot. Additionally, when a `data.frame`, or `DataFrame`, object storing phenotypic data is passed to the `qtex()` function through the `phenodata` parameter, this will be included as column data in the resulting `SummarizedExperiment` object and the row names of these phenotypic data will be set as column names in the output `SummarizedExperiment` object. In the current example, the call to quantify TE expression using the ERVmap method would be the following: ```{r, results='hide'} emq <- qtex(empar) ``` ```{r} emq colSums(assay(emq)) ``` In the case of the Telescope method, the call would be as follows: ```{r, results='hide'} tsq <- qtex(tspar) ``` ```{r} tsq colSums(assay(tsq)) ``` For the TEtranscripts method, TE expression is quantified by using the following call: ```{r, results='hide'} ttq <- qtex(ttpar) ``` ```{r} ttq colSums(assay(ttq)) ``` As mentioned, TE expression quantification is provided at the repeat name level. # Accesing expression quantifications and metadata The `qtex()` function returns a `SummarizedExperiment` object that, on the one hand, stores the quantified expression in its assay data. ```{r} head(assay(ttq)) ``` On the other hand, it contains metadata about the features that may be useful to select subsets of the quantified data and extract and explore the feature annotations, using the function `rowData()` on this `SummarizedExperiment` object. ```{r} rowData(ttq) ``` Because we have aggregated quantifications by `RepName` the number of TE quantified features has been substantially reduced with respect to the original number of TE features. ```{r} table(rowData(ttq)$isTE) ``` Let's say we want to select full-length LTRs features, this could be a way of doing it. ```{r} temask <- rowData(ttq)$isTE fullLTRs <- rowData(ttq)$Status == "full-lengthLTR" fullLTRs <- (sapply(fullLTRs, sum, na.rm=TRUE) == 1) & (lengths(rowData(ttq)$Status) == 1) sum(fullLTRs) rowData(ttq)[fullLTRs, ] ``` Note also that since we restricted expression quantification to LTRs, we do have only quantification for that TE class. ```{r} table(rowData(ttq)$Class[temask]) ``` # Session information ```{r session_info, cache=FALSE} sessionInfo() ``` # References