BREW3R.r

Introduction

The BREW3R.r package has been written to be part of the BREW3R workflow. Today, the package contains a single function, which extends the three-prime ends of gene annotations using another gene annotation as a template. This is very helpful when you use a technique that only sequences the three-prime end of genes, such as 10X scRNA-seq or BRB-seq.

Installation

To install from Bioconductor use:

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("BREW3R.r")

To install from GitHub use:

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("lldelisle/BREW3R.r")

Example

Load dependencies

library(rtracklayer)
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
library(GenomicRanges)

Get gtfs

In this example, I will extend the transcripts from GENCODE using RefSeq on mm10. To decrease the size of the input files, the input files of this vignette have been subset to chromosome 19. The original gtf for GENCODE is available here and the gtf for RefSeq is available here.

input_gtf_file_to_extend <-
    system.file(
        "extdata/chr19.gencode.vM25.annotation.gtf.gz",
        package = "BREW3R.r",
        mustWork = TRUE
    )
input_gtf_file_template <-
    system.file(
        "extdata/chr19.mm10.ncbiRefSeq.gtf.gz",
        package = "BREW3R.r",
        mustWork = TRUE
    )

Convert gtf files to GRanges

We will use the rtracklayer package to import the gtf files:

input_gr_to_extend <- rtracklayer::import(input_gtf_file_to_extend)
input_gr_template <- rtracklayer::import(input_gtf_file_template)

Save annotations

The package only uses exon information. It may be useful to save the other annotations, such as ‘CDS’, ‘start_codon’ and ‘stop_codon’.

You should not save the ‘gene’ and ‘transcript’ annotations, as they will be out of date after the extension. The same applies to three-prime UTRs.

input_gr_CDS <- subset(input_gr_to_extend, type == "CDS")
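
If you want to keep several feature types at once, one option is to filter with `%in%`. The following is only a minimal sketch on a toy GRanges: the coordinates, the IDs and the names `toy_gr`, `features_to_keep` and `toy_saved` are made up for illustration, and the feature names assume the usual gtf vocabulary.

```r
library(GenomicRanges)

# Toy stand-in for an imported gtf; coordinates and IDs are made up
toy_gr <- GRanges(
    seqnames = rep("chr19", 5),
    ranges = IRanges(
        start = c(100, 100, 150, 150, 400),
        end = c(600, 200, 200, 152, 600)
    ),
    strand = rep("+", 5),
    type = c("transcript", "exon", "CDS", "start_codon", "exon"),
    transcript_id = rep("txA", 5)
)

# Feature types to carry over unchanged (assumed gtf feature names)
features_to_keep <- c("CDS", "start_codon", "stop_codon")
toy_saved <- toy_gr[toy_gr$type %in% features_to_keep]

# Check which feature types were kept
table(toy_saved$type)
```

On the real data, the same filter could be applied to `input_gr_to_extend` instead of `toy_gr`.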

Extend the GRanges

Now we can run the main function of the package:

library(BREW3R.r)
new_gr_exons <- extend_granges(
    input_gr_to_extend = input_gr_to_extend,
    input_gr_to_overlap = input_gr_template
)
## Found 4343 last exons to potentially extend.
## Compute overlap between 4343 exons and 38576 exons.
## 2331  exons may be extended.
## 2263 exons have been extended while preventing collision with other genes.
## Found 32563  exons that may be included into 723 transcripts.
## Stay 68 candidate exons that may be included into 26 transcripts.
## Finally 29 combined exons will be included into 26 transcripts.

By default, you get a few statistics. You can mute them with options(rlib_message_verbosity = "quiet") or, on the contrary, set options(BREW3R.r.verbose = "progression") to get a message for every step. Here, the statistics show that about half of the last exons were extended (2263 of 4343) and that 29 exons were then added to 26 transcripts.
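
If you only want to change the verbosity for a single call, a possible pattern is to store the previous option values and restore them afterwards. This is a sketch in base R; the call to extend_granges() is left as a comment so that the block runs on its own.

```r
# options() returns the previous values, which makes restoring them easy
old_opts <- options(rlib_message_verbosity = "quiet")
# ... call extend_granges() here to run it silently ...
options(old_opts)

# Same pattern to get a message for every step
old_opts <- options(BREW3R.r.verbose = "progression")
# ... call extend_granges() here to follow the progression ...
options(old_opts)
```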

Explore your data

Here is an example for the Btrc gene, which has been extended:

Here is an example for the Mrpl21 gene, which has a new exon at the 3’ end of one of its transcripts:
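
Extensions can also be checked numerically by comparing the 3’ end of each transcript before and after. The sketch below uses a toy transcript with made-up coordinates; `three_prime_ends()` and `make_toy()` are hypothetical helpers written for this example and are not part of BREW3R.r.

```r
library(GenomicRanges)

# Return the 3' end coordinate of each transcript, taking strand into account
three_prime_ends <- function(gr) {
    exons <- gr[gr$type == "exon"]
    tx_range <- unlist(range(split(exons, exons$transcript_id)))
    # resize() is strand-aware: fix = "end" keeps the 3' end on both strands
    tx_end <- resize(tx_range, width = 1, fix = "end")
    setNames(start(tx_end), names(tx_end))
}

# Build a toy two-exon transcript whose last exon ends at 'last_exon_end'
make_toy <- function(last_exon_end) {
    GRanges(
        seqnames = c("chr19", "chr19"),
        ranges = IRanges(start = c(100, 400), end = c(200, last_exon_end)),
        strand = c("+", "+"),
        type = c("exon", "exon"),
        transcript_id = c("txA", "txA")
    )
}

toy_before <- make_toy(600)
toy_after <- make_toy(900)

# Number of bases gained at the 3' end of each transcript
gained <- three_prime_ends(toy_after) - three_prime_ends(toy_before)
gained
```

On the real data, the same helper could be applied to `input_gr_to_extend` and `new_gr_exons`, for example after subsetting both to a gene of interest.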

Recompose the GRanges

We can add back the annotations that were saved:

new_gr <- c(new_gr_exons, input_gr_CDS)

Write new GRanges to gtf

rtracklayer::export.gff(sort(new_gr), "my_new.gtf")
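
To make sure the written file can be read again, you can import it back and compare it with what was exported. This is a minimal round-trip sketch on a toy GRanges written to a temporary file; the coordinates, the IDs and the names `toy_gr`, `toy_path` and `toy_back` are made up for illustration.

```r
library(GenomicRanges)
library(rtracklayer)

# Toy exons with the attributes expected in a gtf (made-up values)
toy_gr <- GRanges(
    seqnames = c("chr19", "chr19"),
    ranges = IRanges(start = c(100, 400), end = c(200, 900)),
    strand = c("+", "+"),
    type = c("exon", "exon"),
    gene_id = c("geneA", "geneA"),
    transcript_id = c("txA", "txA")
)

# Write to a temporary gtf file; the format is taken from the file extension
toy_path <- tempfile(fileext = ".gtf")
export(sort(toy_gr), toy_path)

# Read the file back and compare the number of features and the coordinates
toy_back <- import(toy_path)
length(toy_back) == length(toy_gr)
all(end(toy_back) == end(sort(toy_gr)))
```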

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] BREW3R.r_1.3.0       rtracklayer_1.67.0   GenomicRanges_1.59.1
## [4] GenomeInfoDb_1.43.2  IRanges_2.41.1       S4Vectors_0.45.2    
## [7] BiocGenerics_0.53.3  generics_0.1.3       BiocStyle_2.35.0    
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9                  SparseArray_1.7.2          
##  [3] bitops_1.0-9                lattice_0.22-6             
##  [5] digest_0.6.37               grid_4.4.2                 
##  [7] evaluate_1.0.1              fastmap_1.2.0              
##  [9] Matrix_1.7-1                jsonlite_1.8.9             
## [11] restfulr_0.0.15             BiocManager_1.30.25        
## [13] httr_1.4.7                  UCSC.utils_1.3.0           
## [15] XML_3.99-0.17               Biostrings_2.75.1          
## [17] codetools_0.2-20            jquerylib_0.1.4            
## [19] abind_1.4-8                 cli_3.6.3                  
## [21] rlang_1.1.4                 crayon_1.5.3               
## [23] XVector_0.47.0              Biobase_2.67.0             
## [25] DelayedArray_0.33.2         cachem_1.1.0               
## [27] yaml_2.3.10                 S4Arrays_1.7.1             
## [29] tools_4.4.2                 parallel_4.4.2             
## [31] BiocParallel_1.41.0         GenomeInfoDbData_1.2.13    
## [33] Rsamtools_2.23.1            SummarizedExperiment_1.37.0
## [35] curl_6.0.1                  buildtools_1.0.0           
## [37] R6_2.5.1                    BiocIO_1.17.1              
## [39] matrixStats_1.4.1           lifecycle_1.0.4            
## [41] zlibbioc_1.52.0             bslib_0.8.0                
## [43] xfun_0.49                   GenomicAlignments_1.43.0   
## [45] sys_3.4.3                   MatrixGenerics_1.19.0      
## [47] knitr_1.49                  rjson_0.2.23               
## [49] htmltools_0.5.8.1           rmarkdown_2.29             
## [51] maketools_1.3.1             compiler_4.4.2             
## [53] RCurl_1.98-1.16