Welcome to the introduction of data management with ORFik experiment.
This vignette will walk you through which data and how you import and
export with ORFik, focusing on NGS library import and annotation.
ORFik
is an R package containing various functions for
analysis of RiboSeq, RNASeq, RCP-seq, TCP-seq, Chip-seq and Cage data,
we advice you to read ORFikOverview vignette, before starting this
one.
There exists a myriad of formats in bioinformatics, ORFik focuses on NGS data anlysis and therefor supports many functions to import that data. We will here learn how to use ORFik effectivly in importing data (how to convert to the format you should use). That is the format with optimal information content relative to load speed.
Sequencing reads starts with fastq files. In the Annotation & Alignment vignette we walk through how to acquire (download usually) and map these reads into bam files. We here presume that this is done and you have the bam files.
ORFik currently supports these formats:
All these formats can be imported using the function
fimport
ORFik also supports bam files that are collapsed, the number of duplicates is stored in the header information per read. This is a very effective speed up for short read sequencing like Ribo-seq. bigwig etc, internally supports collapsing since it uses a ‘score’ column.
We advice you always to create an ORFik experiment of your data, to make it simpler to code (see Data management vignette).
But here we will show you how to do direct import of files. To load a file, usually it is sufficient to use fimport.
# Use bam file that exists in ORFik package
library(ORFik)
bam_file <- system.file("extdata/Danio_rerio_sample", "ribo-seq.bam", package = "ORFik")
footprints <- fimport(bam_file)
# This is identical to:
footprints <- readBam(bam_file)
Bam files are very cumbersome to read, so we should only do this once, and then convert to something faster (described bellow)
Since bam files are cumbersome to load, we should convert files to more optimized formats. Which formats to convert and export to, depends on if you have the files loaded already or not.
There are several converters in ORFik, here are some examples:
ofst_out_dir <- file.path(tempdir(), "ofst/")
convert_bam_to_ofst(NULL, bam_file, ofst_out_dir)
# Find the file again
ofst_file <- list.files(ofst_out_dir, full.names = TRUE)[1]
# Load it
fimport(ofst_file)
## GAlignments object with 16649 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr23 - 28M 28 17599129 17599156 28
## [2] chr23 - 28M 28 17599129 17599156 28
## [3] chr23 - 28M 28 17599129 17599156 28
## [4] chr23 - 28M 28 17599129 17599156 28
## [5] chr23 - 28M 28 17599129 17599156 28
## ... ... ... ... ... ... ... ...
## [16645] chr8 + 29M 29 24068894 24068922 29
## [16646] chr8 + 28M 28 24068907 24068934 28
## [16647] chr8 + 30M 30 24068919 24068948 30
## [16648] chr8 + 30M 30 24068919 24068948 30
## [16649] chr8 + 30M 30 24068939 24068968 30
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [16645] 0
## [16646] 0
## [16647] 0
## [16648] 0
## [16649] 0
## -------
## seqinfo: 1133 sequences from an unspecified genome; no seqlengths
Bigwig is a bit special, in that it is very fast (and good compression), but to make it failsafe we need the information about size for all chromosomes. Bioconductor functions are very picky about correct chromosome sizes of GRanges objects etc.
bigwig_out_dir <- file.path(tempdir(), "bigwig/")
convert_to_bigWig(NULL, bam_file, bigwig_out_dir,
seq_info = seqinfo(BamFile(bam_file)))
# Find the file again
bigwig_file <- list.files(bigwig_out_dir, full.names = TRUE)
# You see we have 2 files here, 1 for forward strand, 1 for reverse
# Load it
fimport(bigwig_file)
## GRanges object with 3104 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] MT 1-16596 + | 0
## [2] Zv9_NA1 1-22128 + | 0
## [3] Zv9_NA10 1-48062 + | 0
## [4] Zv9_NA100 1-26026 + | 0
## [5] Zv9_NA1000 1-942 + | 0
## ... ... ... ... . ...
## [3100] chr5 1-75682077 - | 0
## [3101] chr6 1-59938731 - | 0
## [3102] chr7 1-77276063 - | 0
## [3103] chr8 1-56184765 - | 0
## [3104] chr9 1-58232459 - | 0
## -------
## seqinfo: 1133 sequences from an unspecified genome
For preloaded files it is better to just convert it there and then, and not convert through the filepath, because then you have just loaded the file twice.
For details, see ?convertLibs and ?convertToOneBasedRanges()
The really cool thing with bigwig is that it has very fast random access.
bigwig_file <- list.files(bigwig_out_dir, full.names = TRUE)
# Let us access a location without loading the full file.
random_point <- GRangesList(GRanges("chr24", 22711508, "-"))
# Getting raw vector (Then you need to know which strand it is on:)
bigwig_for_random_point <- bigwig_file[2] # the reverse strand file
rtracklayer::import.bw(bigwig_for_random_point, as = "NumericList",
which = random_point[[1]])
## NumericList of length 1
## [[1]] 4
ORFik also has an automatic wrapper for spliced transcript coordinates: As data.table
dt <- coveragePerTiling(random_point, bigwig_file)
dt # Position is 1, because it is relative to first
## count genes position
## <int> <int> <int>
## 1: 4 1 1
Annotation consists of 2 primary files:
If you don’t have the annotation yet on your hard drive, to get these two files for your organism, ORFik supports a direct downloader of annotation from ENSEMBL/refseq, for yeast it would look like this:
library(ORFik)
organism <- "Saccharomyces cerevisiae" # Baker's yeast
# This is where you should usually store you annotation references ->
#output_dir <- file.path(ORFik::config()["ref"], gsub(" ", "_", organism))
output_dir <- tempdir()
annotation <- getGenomeAndAnnotation(
organism = organism,
output.dir = output_dir,
assembly_type = "toplevel"
)
## Warning in dir.create(output.dir, recursive = TRUE): '/tmp/Rtmpz834kP' already
## exists
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
## stop_codon. This information was ignored.
The gtf is a very slow format and ORFik will almost never use this directly. We therefor use a much faster format called TxDb (transcript database) The nice thing with using getGenomeAndAnnotation is that it will do a lot of important fixes for you related to import. It will make the fasta index and txdb, and a lot more optimizations that for large species like human make import time go from minutes to less than a second.
If you don’t want to use getGenomeAndAnnotation, you can do it for your own annotation like this:
## [1] "/tmp/Rtmpz834kP/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.fai"
# Make TxDb object for large speedup:
txdb <- makeTxdbFromGenome(gtf, genome, organism, optimize = TRUE, return = TRUE)
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
## stop_codon. This information was ignored.
The txdb is saved in the same as gtf with a “.db” extension.
## Seqinfo object with 17 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## I 230218 <NA> <NA>
## II 813184 <NA> <NA>
## III 316620 <NA> <NA>
## IV 1531933 <NA> <NA>
## V 576874 <NA> <NA>
## ... ... ... ...
## XIII 924431 <NA> <NA>
## XIV 784333 <NA> <NA>
## XV 1091291 <NA> <NA>
## XVI 948066 <NA> <NA>
## Mito 85779 <NA> <NA>
# Load a 10 first bases from chromosome 1
txSeqsFromFa(GRangesList(GRanges("I", 1:10, "+")), fa, TRUE)
## DNAStringSet object of length 1:
## width seq
## [1] 10 CCACACCACA
# Load a 10 first bases from chromosome 1 and 2.
txSeqsFromFa(GRangesList(GRanges("I", 1:10, "+"), GRanges("II", 1:10, "+")), fa, TRUE)
## DNAStringSet object of length 2:
## width seq
## [1] 10 CCACACCACA
## [2] 10 AAATAGCCCT
ORFik makes it very easy to load specific regions from a txdb. We already have the txdb loaded, but let us load it again
Lets get these regions of the transcripts:
## GRangesList object of length 1:
## $YAL069W_mRNA
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] I 335-649 + | 1 YAL069W_mRNA-E1 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
## GRangesList object of length 1:
## $YAL069W_mRNA
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] I 335-649 + | 1 YAL069W_mRNA-E1 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
## GRangesList object of length 1:
## $YIL111W_mRNA
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] IX 155311-155312 + | 3384 YIL111W_mRNA-E1 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
## GRangesList object of length 1:
## $YAL069W_mRNA
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] I 335-649 + | 1 <NA> 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
## GRangesList object of length 0:
## <0 elements>
These are output as GRangesList, which are list elements of GRanges (1 list elements per transcript, which can contain multiple GRanges). If the gene region is not spliced, it has only 1 GRanges object.
Your annotations contains many transcripts, the 2022 human GTFs usually contain around 200K transcripts, at least half of these are from non coding transcripts (they have no CDS). So how to filter out what you do not need?
Let’s say you only want mRNAs, which have these properties: - Leaders >= 1nt - CDS >= 150nt - Trailers >= 0nt
We can in ORFik get this with:
tx_to_keep <- filterTranscripts(txdb, minFiveUTR = 1, minCDS = 150, minThreeUTR = 0)
loadRegion(txdb, "mrna")
## GRangesList object of length 6600:
## $YAL069W_mRNA
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] I 335-649 + | 1 YAL069W_mRNA-E1 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## $`YAL068W-A_mRNA`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] I 538-792 + | 2 YAL068W-A_mRNA-E1 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## $`YAL067W-A_mRNA`
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] I 2480-2707 + | 3 YAL067W-A_mRNA-E1 1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## ...
## <6597 more elements>
But what if you do this?
You see the SacCer3 Yeast gtf does not have any defined leaders at size 10, because the annotation is incomplete. Luckily ORFik supports creating pseudo 5’ UTRs.
txdb <- makeTxdbFromGenome(gtf, genome, organism, optimize = TRUE, return = TRUE,
pseudo_5UTRS_if_needed = 100)
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
## stop_codon. This information was ignored.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence VI.
## Note that ranges located on a sequence whose length is unknown (NA) or
## on a circular sequence are not considered out-of-bound (use
## seqlengths() and isCircular() to get the lengths and circularity flags
## of the underlying sequences). You can use trim() to trim these ranges.
## See ?`trim,GenomicRanges-method` for more information.
## [1] "Q0010_mRNA" "Q0017_mRNA" "Q0032_mRNA"
Great, This now worked. For detailed access of single points like start sites or regions, see the ‘working with transcripts vignette’.
To get canonical mRNA isoform (primary isoform defined e.g. by Ensembl)
## [1] "Q0010_mRNA" "Q0017_mRNA" "Q0032_mRNA"
You here get all canonical isoform mRNAs (cds exists since minCDS >= 1)
You can also add in a uORF annotation defined by ORFik, like this: Here we needed the pseudo leaders (5’ UTRs), because the yeast SacCer3 genome has no proper leader definitions.
## GRangesList object of length 6910:
## $YAL069W_mRNA_1
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## YAL069W_mRNA I 300-323 + | YAL069W_mRNA_1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## $YAL069W_mRNA_2
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## YAL069W_mRNA I 248-286 + | YAL069W_mRNA_2
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## $`YAL067W-A_mRNA_1`
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## YAL067W-A_mRNA I 2380-2406 + | YAL067W-A_mRNA_1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## ...
## <6907 more elements>
## GRangesList object of length 6910:
## $YAL069W_mRNA_1
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## YAL069W_mRNA I 300-323 + | YAL069W_mRNA_1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## $YAL069W_mRNA_2
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## YAL069W_mRNA I 248-286 + | YAL069W_mRNA_2
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## $`YAL067W-A_mRNA_1`
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## YAL067W-A_mRNA I 2380-2406 + | YAL067W-A_mRNA_1
## -------
## seqinfo: 17 sequences (1 circular) from an unspecified genome
##
## ...
## <6907 more elements>