Many eukaryotic genes give rise to multiple RNA isoforms, increasing the protein-coding capacity of the genome and extending the range of post-transcriptional regulation possibilities. High-throughput sequencing is often used to deduce repertoires of transcripts expressed in specific biological samples by aligning the data to genomic sequences and assembling the alignments into transcript architectures. This typically outputs Gene Transfer Format (GTF) files describing newly identified transcripts as sets of exonic coordinates, but lacking the information about their coding sequences (CDSs; also known as ORFs) and possible biological functions.
To this end, we developed a package for functional annotation of custom-assembled transcriptomes in R (factR). factR predicts CDSs for novel RNA isoforms using a reference-guided process and then determines domain organisation of the protein products and possible susceptibility of transcripts to nonsense-mediated decay (NMD; a pathway destabilizing mRNAs with premature translation termination codons). factR also provides supporting tools for matching new transcripts to “official” gene IDs, visualizing transcript architectures and annotating alternatively spliced segments.
To install factR, enter the following commands in your R environment:
factR requires a custom transcriptome file in the GTF format as an input, and we provide the following three sample custom transcriptome files that can be used to test factR tools:
Method to download the above GTF files is described in the “Importing and inspecting GTF
data” section below.
Users may alternatively prepare their own GTF files making sure that
these contain both gene_id and
transcript_id attributes.
Several factR functions require reference transcriptome files (GTF or GFF3) as a guide. Such files can be accessed from within the R environment using e.g. the R package AnnotationHub or downloaded from an external database such as GENCODE or Ensembl. Both possibilities are described in the “Updating gene info” section below.
factR needs genomic DNA sequence to predict CDSs. Users may obtain genome files using e.g. BSgenome or AnnotationHub or download them from an external database such as GENCODE or Ensembl. We describe this in more detail in the “Constructing CDS information” section below.
Load factR to the R environment as follows:
factR handles transcriptome information in the form of
GenomicRanges objects containing genomic interval data and
relevant metadata. To create such an object from a GTF file, we use the
importGTF
function:
gtf <- system.file("extdata", "sc_merged_sample.gtf.gz", package = "factR")
custom.gtf <- importGTF(gtf)
The imported GTF file is stored as a GenomicRanges object.
Contents of the object can be examined using head
:
head(custom.gtf)
#> GRanges object with 6 ranges and 9 metadata columns:
#> seqnames ranges strand | source type score
#> <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
#> [1] chr15 3180731-3180944 * | StringTie transcript 1000
#> [2] chr15 3180731-3180944 * | StringTie exon 1000
#> [3] chr15 3217391-3219698 - | StringTie transcript 1000
#> [4] chr15 3217391-3219698 - | StringTie exon 1000
#> [5] chr15 3268547-3277274 + | StringTie transcript 1000
#> [6] chr15 3268547-3268768 + | StringTie exon 1000
#> phase gene_id transcript_id exon_number gene_name
#> <integer> <character> <character> <character> <character>
#> [1] <NA> MSTRG.14523 MSTRG.14523.1 <NA> <NA>
#> [2] <NA> MSTRG.14523 MSTRG.14523.1 1 <NA>
#> [3] <NA> MSTRG.14524 ENSMUST00000227053.1 <NA> Gm7962
#> [4] <NA> MSTRG.14524 ENSMUST00000227053.1 1 Gm7962
#> [5] <NA> MSTRG.14525 ENSMUST00000160787.8 <NA> Selenop
#> [6] <NA> MSTRG.14525 ENSMUST00000160787.8 1 Selenop
#> ref_gene_id
#> <character>
#> [1] <NA>
#> [2] <NA>
#> [3] ENSMUSG00000114999.1
#> [4] ENSMUSG00000114999.1
#> [5] ENSMUSG00000064373.12
#> [6] ENSMUSG00000064373.12
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
In addition to genomic coordinates (seqnames and ranges), a typical GTF file contains metadata describing the feature type (e.g. transcript or exon), transcript IDs and some information on their parental genes.
Use the following command to calculate the total number of transcripts in the input transcriptome:
Note that none of the transcripts in the custom.gtf object contain CDS information:
Users may visualize specific sets of transcripts using
viewTranscripts
. For example, the following will plot
transcripts from the Zfr gene encoding a conserved zinc
finger-containing RNA-binding protein with known neuronal functions
:
StringTie2, a popular transcript assembler used to generate our custom GTF, typically assigns arbitrary names to newly identified transcripts (e.g. “MSTRG.x.y”) and uses the same prefix for their gene IDs (e.g. “MSTRG.x”). Incidentally, this is why the output above contains 10 previously known Zfr transcripts but lacks any novel entries.
factR can update gene metadata in custom transcriptome objects using a reference annotation as guide. Below, we describe two alternative ways to obtain mouse reference transcriptome data from GENCODE.
# query database for mouse gencode basic annotation
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c('Mus musculus', 'gencode', 'gff'))
#> AnnotationHub with 9 records
#> # snapshotDate(): 2024-10-24
#> # $dataprovider: Gencode
#> # $species: Mus musculus
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH49545"]]'
#>
#> title
#> AH49545 | gencode.vM6.2wayconspseudos.gff3.gz
#> AH49546 | gencode.vM6.annotation.gff3.gz
#> AH49547 | gencode.vM6.basic.annotation.gff3.gz
#> AH49548 | gencode.vM6.chr_patch_hapl_scaff.annotation.gff3.gz
#> AH49549 | gencode.vM6.chr_patch_hapl_scaff.basic.annotation.gff3.gz
#> AH49550 | gencode.vM6.long_noncoding_RNAs.gff3.gz
#> AH49551 | gencode.vM6.polyAs.gff3.gz
#> AH49552 | gencode.vM6.primary_assembly.annotation.gff3.gz
#> AH49553 | gencode.vM6.tRNAs.gff3.gz
# Download full annotation
ref.gtf <- ah[['AH49546']]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
tmp <- tempfile()
download.file("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz",
destfile = tmp)
ref.gtf <- importGTF(tmp)
When choosing a reference, users should consider using one with the
same chromosome naming style (e.g. “chr1” or “1”). Alternatively, the
styles can be matched using the matchChromosomes
function
(see help(matchChromosomes)
for more detail).
Once the ref.gtf object is ready, novel transcripts in custom.gtf can
be assigned to “official” gene IDs, whenever possible, using
matchGeneInfo
. By default, this function matches the query
(custom.gtf) to the reference (ref.gtf) by finding overlapping
coordinates:
# matching gene metadata
custom_matched_1.gtf <- matchGeneInfo(custom.gtf, ref.gtf)
#> Number of mismatched gene_ids found: 500
#> ---> Attempting to match gene_ids by finding overlapping coordinates...
#> ---> 386 gene_id matched
#> Total gene_ids corrected: 386
#> Remaining number of mismatched gene_ids: 114
To tune the performance of matchGeneInfo
, we provide
additional arguments specifying the name of columns containing “primary”
and potentially “secondary” gene IDs from the query (custom.gtf). For
further information, see help(matchGeneInfo)
.
# matching gene metadata
custom_matched_2.gtf <- matchGeneInfo(custom.gtf, ref.gtf,
primary_gene_id = "gene_id",
secondary_gene_id = "ref_gene_id")
#> Number of mismatched gene_ids found: 500
#> -> Attempting to correct gene ids by replacing gene_id with ref_gene_id...
#> -> 212 gene_ids matched
#> --> Attempting to match ensembl gene_ids...
#> --> All ensembl gene ids have been matched
#> ---> Attempting to match gene_ids by finding overlapping coordinates...
#> ---> 174 gene_id matched
#> Total gene_ids corrected: 386
#> Remaining number of mismatched gene_ids: 114
Note that custom.gtf updated by matchGeneInfo
now
contains 10 known and 7 newly Zfr assembled transcripts:
As seen in the above example, custom transcriptomes typically combine both new and previously annotated transcripts. To select only newly predicted transcripts, run the following:
custom_new.gtf <- subsetNewTranscripts(custom_matched_2.gtf, ref.gtf)
#> Removing transcripts with exact exon coordinates
viewTranscripts(custom_new.gtf,"Zfr")
This will subset custom.gtf transcripts with distinct exonic coordinates compared to ref.gtf and will store these transcripts in the custom_new.gtf object. Some custom-built transcripts may differ from their reference counterparts by having different start or/and end coordinates, with otherwise similar exon-intron structure. To shortlist novel transcripts with distinct intronic coordinates only, simply set the “refine.by” argument to “intron”:
custom_new.gtf <- subsetNewTranscripts(custom_matched_2.gtf, ref.gtf, refine.by = "intron")
#> Removing transcripts with exact exon coordinates
#> Removing transcripts with exact intron coordinates
viewTranscripts(custom_new.gtf, "Zfr")
We will use the custom_new.gtf object in the rest of the workflow.
Functional annotation of newly assembled transcripts in factR begins by deducing their protein-coding sequences (CDSs). To search for putative CDSs, factR requires a genome sequence file, which can be obtained using R packages or downloaded from online databases (e.g. UCSC, GENCODE or Ensembl). Three alternative ways to retrieve mouse genomic sequence are described below.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")
and loaded into R environment:
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("mm10","2bit"))
#> AnnotationHub with 1 record
#> # snapshotDate(): 2024-10-24
#> # names(): AH14005
#> # $dataprovider: UCSC
#> # $species: Mus musculus
#> # $rdataclass: TwoBitFile
#> # $rdatadateadded: 2014-12-15
#> # $title: mm10.2bit
#> # $description: UCSC 2 bit file for mm10
#> # $taxonomyid: 10090
#> # $genome: mm10
#> # $sourcetype: TwoBit
#> # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/mm10/bigZips/mm10.2bit
#> # $sourcesize: NA
#> # $tags: c("2bit", "UCSC", "genome")
#> # retrieve record with 'object[["AH14005"]]'
tmp <- tempfile()
download.file("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz",
tmp)
Mmusculus <- importFASTA(paste0(tmp, "/GRCm38.primary_assembly.genome.fa.gz"))
Once the genome sequence object is ready, factR can predict CDSs using its buildCDS() function and reference transcriptome data as a guide. buildCDS() first generates a database of previously annotated ATGs and uses this information to search for a potential translation start sites in query transcripts. buildCDS() then deduces the CDS and appends its coordinates to the custom transcriptome object. Let’s run this function for our novel transcripts:
custom_new_CDS.gtf <- buildCDS(custom_new.gtf, ref.gtf, Mmusculus)
#> Searching for reference mRNAs in query
#> No reference mRNAs found
#> Building database of annotated ATG codons
#> Selecting best ATG start codon for remaining transcripts and determining open-reading frame
#> 114 new CDSs constructed
#>
#> Summary: Out of 380 transcripts in `custom_new.gtf`,
#> 114 transcript CDSs were built
Note that the novel Zfr transcripts have been updated with information about likely CDSs (dark blue) and untranslated regions (light blue) :
We can display exonic regions and CDSs more clearly (at the expense of loosing their bona fide genomic coordinates) by setting rescale_intron argument to TRUE.
For comparison, here is the CDS situation in the reference Zfr transcripts:
To explore possible susceptibility of newly identified mRNA isoforms
to NMD, we use the predictNMD
function:
NMDprediction.out <- predictNMD(custom_new_CDS.gtf)
#> Predicting NMD sensitivities for 114 mRNAs
head(NMDprediction.out)
#> # A tibble: 6 × 6
#> transcript stop_to_lastEJ num_of_downEJs `3'UTR_length` is_NMD PTC_coord
#> <chr> <dbl> <int> <dbl> <lgl> <chr>
#> 1 ENSMUST00000100… 222 3 2895 TRUE chr15:91…
#> 2 ENSMUST00000228… NA 0 1672 FALSE <NA>
#> 3 MSTRG.14527.7 -131 0 2244 FALSE <NA>
#> 4 MSTRG.14531.1 1937 9 4141 TRUE chr15:39…
#> 5 MSTRG.14531.5 872 3 5317 TRUE chr15:39…
#> 6 MSTRG.14533.2 1073 1 2541 TRUE chr15:40…
predictNMD
outputs a data frame containing information
on features that may promote NMD and predicts NMD sensitivity for each
CDS-containing transcript based on the distance between the stop codon
and the last exon-exon junction.
To identify putative NMD targets for specific genes (e.g. Zfr), run the following:
NMDprediction.Zfr <- predictNMD(custom_new_CDS.gtf, gene_name == "Zfr")
#> Predicting NMD sensitivities for 7 mRNAs
head(NMDprediction.Zfr)
#> # A tibble: 6 × 6
#> transcript stop_to_lastEJ num_of_downEJs `3'UTR_length` is_NMD PTC_coord
#> <chr> <dbl> <int> <dbl> <lgl> <chr>
#> 1 MSTRG.14649.1 3857 11 8774 TRUE chr15:1215…
#> 2 MSTRG.14649.10 -177 0 4740 FALSE <NA>
#> 3 MSTRG.14649.11 41 1 4958 FALSE <NA>
#> 4 MSTRG.14649.12 -300 0 2242 FALSE <NA>
#> 5 MSTRG.14649.3 3857 11 8774 TRUE chr15:1215…
#> 6 MSTRG.14649.4 -300 0 2242 FALSE <NA>
factR can also inspect domain structure of protein products
encoded by newly identified mRNA isoforms using its
predictDomains()
function. This requires connection to the
online PFAM database and may require a substantial amount of time and
stable internet connection to query multiple transcripts simultaneously.
To quickly explore functionality of the predictDomains()
tool, users may prefer to begin with a relatively small subset of
transcripts. For example, the following will predict domain structure
for all transcripts of the Zfr gene:
predictDomains(custom_new_CDS.gtf, Mmusculus, gene_name == "Zfr", progress_bar = FALSE)
#> Checking CDSs and translating protein sequences
#> Predicting domain families for 7 proteins
#> # A tibble: 6 × 5
#> transcript description eval begin end
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 MSTRG.14649.1 beta-beta-alpha zinc fingers 1.54e-20 303 364
#> 2 MSTRG.14649.1 beta-beta-alpha zinc fingers 1.54e-20 371 414
#> 3 MSTRG.14649.3 beta-beta-alpha zinc fingers 1.54e-20 303 350
#> 4 MSTRG.14649.4 beta-beta-alpha zinc fingers 1.54e-20 311 358
#> 5 MSTRG.14649.9 beta-beta-alpha zinc fingers 1.54e-20 289 350
#> 6 MSTRG.14649.9 beta-beta-alpha zinc fingers 1.54e-20 357 400
The domain architectures for the set of query proteins can be additionally plotted by switching the argument “plot” to TRUE:
domains.Zfr <- predictDomains(custom_new_CDS.gtf, Mmusculus, gene_name == "Zfr", plot = TRUE, progress_bar = FALSE)
If you want to predict domains for all new transcripts and do not mind waiting for some time depending on the connection speed and the PFAM server load, run the following:
Annotated custom_new_CDS.gtf object can be exported to a GTF file as follows:
Finally, to export NMDprediction.out and domains.out data frames as tab-delimited text files, run the following:
Please cite factR if you find it useful:
This workflow was conducted on:
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] lubridate_1.9.3 forcats_1.0.0
#> [3] stringr_1.5.1 dplyr_1.1.4
#> [5] purrr_1.0.2 readr_2.1.5
#> [7] tidyr_1.3.1 tibble_3.2.1
#> [9] ggplot2_3.5.1 tidyverse_2.0.0
#> [11] GenomicFeatures_1.57.1 AnnotationDbi_1.69.0
#> [13] Biobase_2.67.0 BSgenome.Mmusculus.UCSC.mm10_1.4.3
#> [15] BSgenome_1.75.0 rtracklayer_1.65.0
#> [17] BiocIO_1.17.0 GenomicRanges_1.57.2
#> [19] Biostrings_2.75.0 GenomeInfoDb_1.41.2
#> [21] XVector_0.45.0 IRanges_2.39.2
#> [23] S4Vectors_0.43.2 AnnotationHub_3.15.0
#> [25] BiocFileCache_2.15.0 dbplyr_2.5.0
#> [27] BiocGenerics_0.53.0 factR_1.9.0
#> [29] rmarkdown_2.28
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] pbapply_1.7-2 rlang_1.1.4
#> [5] magrittr_2.0.3 matrixStats_1.4.1
#> [7] compiler_4.4.1 RSQLite_2.3.7
#> [9] png_0.1-8 vctrs_0.6.5
#> [11] pkgconfig_2.0.3 crayon_1.5.3
#> [13] fastmap_1.2.0 labeling_0.4.3
#> [15] utf8_1.2.4 Rsamtools_2.21.2
#> [17] tzdb_0.4.0 UCSC.utils_1.1.0
#> [19] bit_4.5.0 xfun_0.48
#> [21] zlibbioc_1.51.2 cachem_1.1.0
#> [23] jsonlite_1.8.9 blob_1.2.4
#> [25] highr_0.11 DelayedArray_0.33.1
#> [27] BiocParallel_1.41.0 parallel_4.4.1
#> [29] R6_2.5.1 bslib_0.8.0
#> [31] stringi_1.8.4 jquerylib_0.1.4
#> [33] assertthat_0.2.1 SummarizedExperiment_1.35.5
#> [35] knitr_1.48 Matrix_1.7-1
#> [37] timechange_0.3.0 tidyselect_1.2.1
#> [39] abind_1.4-8 yaml_2.3.10
#> [41] codetools_0.2-20 curl_5.2.3
#> [43] lattice_0.22-6 withr_3.0.2
#> [45] KEGGREST_1.45.1 evaluate_1.0.1
#> [47] pillar_1.9.0 BiocManager_1.30.25
#> [49] filelock_1.0.3 MatrixGenerics_1.17.1
#> [51] generics_0.1.3 RCurl_1.98-1.16
#> [53] BiocVersion_3.21.1 hms_1.1.3
#> [55] munsell_0.5.1 scales_1.3.0
#> [57] glue_1.8.0 maketools_1.3.1
#> [59] tools_4.4.1 sys_3.4.3
#> [61] data.table_1.16.2 GenomicAlignments_1.41.0
#> [63] buildtools_1.0.0 XML_3.99-0.17
#> [65] grid_4.4.1 colorspace_2.1-1
#> [67] GenomeInfoDbData_1.2.13 patchwork_1.3.0
#> [69] wiggleplotr_1.29.0 restfulr_0.0.15
#> [71] cli_3.6.3 rappdirs_0.3.3
#> [73] fansi_1.0.6 S4Arrays_1.5.11
#> [75] gtable_0.3.6 sass_0.4.9
#> [77] digest_0.6.37 SparseArray_1.5.45
#> [79] farver_2.1.2 rjson_0.2.23
#> [81] memoise_2.0.1 htmltools_0.5.8.1
#> [83] lifecycle_1.0.4 httr_1.4.7
#> [85] mime_0.12 bit64_4.5.2