Title: | Genome Intervals and Read Alignments for Functional Exploration |
---|---|
Description: | The package 'girafe' deals with the genome-level representation of aligned reads from next-generation sequencing data. It contains an object class for enabling a detailed description of genome intervals with aligned reads and functions for comparing, visualising, exporting and working with such intervals and the aligned reads. As such, the package interacts with and provides a link between the packages ShortRead, IRanges and genomeIntervals. |
Authors: | Joern Toedling, with contributions from Constance Ciaudo, Olivier Voinnet, Edith Heard, Emmanuel Barillot, and Wolfgang Huber |
Maintainer: | J. Toedling <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.59.1 |
Built: | 2024-12-26 06:30:41 UTC |
Source: | https://github.com/bioc/girafe |
This function can be used to assess the significance of sliding-window
read counts. The background distribution of read counts in windows
is assumed to be a Negative-Binomial (NB) one.
The two parameters of the NB distribution, mean ‘mu’ and
dispersion ‘size’, are estimated using any of the methods
described below (see details).
The estimated NB distribution is used to assign a p-value to
each window based on the number of aligned reads in the window.
The p-values can be corrected for multiple testing using any
of the correction methods implemented for p.adjust
.
addNBSignificance(x, estimate="NB.012", correct = "none", max.n=10L)
addNBSignificance(x, estimate="NB.012", correct = "none", max.n=10L)
x |
A |
estimate |
string; which method to use to estimate the parameters of the NB background distribution; see below for details |
correct |
string; which method to use for p-value adjustment;
can be any method that is implemented for |
max.n |
integer; only relevant if |
The two parameters of the Negative-Binomial (NB) distribution are:
mean ‘’ (or ‘mu’) and size
‘
’ (or ‘size’).
The function knows a number of methods to estimate the parameters of the NB distribution.
Solely the windows with only 0,
1, or 2 aligned reads are used for estimating
and ‘
’.
From the probability mass function
of the NB
distribution, it follows that the ratios
and
The observed numbers of windows with 0-2 aligned reads are used to estimate
and
and from these estimates, one can obtain estimates for
and
.
This estimation method uses the function
fitdistr
from package ‘MASS’. Windows with up to
n.max
aligned reads are considered for this estimate.
This estimate also uses the windows the 0-2
aligned reads, but uses these numbers to estimates the parameter
of a Poisson distribution. The parameter
‘
’ is set to a very large number, such that the
estimated NB distribution actually is a Poisson distribution with
mean and variance equal to
.
A data.frame
of class slidingWindowSummary
, which is the
the supplied argument x
extended by an additional column
p.value
which holds the p-value for each window.
The attribute NBparams
of the result contains the list of the
estimated parameters of the Negative-Binomial background
distribution.
Joern Toedling
Such an estimation of the Negative-Binomial parameters has also been
described in the paper:
Ji et al.(2008) An integrated system CisGenome for analyzing
ChIP-chip and ChIP-seq data. Nat Biotechnol. 26(11):1293-1300.
exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") exPX <- perWindow(exAI, chr="chrX", winsize=1e5, step=0.5e5) exPX <- addNBSignificance(exPX, correct="bonferroni") str(exPX) exPX[exPX$p.value <= 0.05,]
exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") exPX <- perWindow(exAI, chr="chrX", winsize=1e5, step=0.5e5) exPX <- addNBSignificance(exPX, correct="bonferroni") str(exPX) exPX[exPX$p.value <= 0.05,]
Function to create AlignedGenomeIntervals
objects from BAM
(binary alignment map format) files. Uses functions from package
Rsamtools
to parse BAM files.
agiFromBam(bamfile, ...)
agiFromBam(bamfile, ...)
bamfile |
File path of BAM file. BAM file should be sorted and have an index in the same directory (see Details below). |
... |
further arguments passed on to function |
Note: the BAM files must be sorted and must also have an index
file (*.bai
) in the same directory. These should be done when
creating the BAM. However, the functions sortBam
and
indexBam
can be used for the same purpose, as can the
respective modules of the “samtools” library (‘samtools
sort’ and ‘samtools index’).
The BAM files are parsed chromosome by chromosome to limit the memory
footprint of the function. Thus, this function aims to be a
less-memory-consuming alternative to first reading in the BAM file
using the readAligned
function and then converting the
AlignedRead
object into an AlignedGenomeIntervals
object.
An object of class AlignedGenomeIntervals
.
J Toedling
http://samtools.sourceforge.net
scanBam
,
AlignedGenomeIntervals-class
fl <- system.file("extdata", "ex1.bam", package="Rsamtools") ExGi <- agiFromBam(fl) head(detail(ExGi))
fl <- system.file("extdata", "ex1.bam", package="Rsamtools") ExGi <- agiFromBam(fl) head(detail(ExGi))
A class for representing reads from next-generation sequencing experiments that have been aligned to genomic intervals.
Objects can be created either by:
calls of the form
new("AlignedGenomeIntervals", .Data, closed, ...)
.
using the auxiliary function AlignedGenomeIntervals
and
supplying separate vectors of same length which hold the
required information:AlignedGenomeIntervals(start, end, chromosome, strand, reads,
matches, sequence)
If arguments reads
or matches
are not specified, they
are assumed to be '1' for all intervals.
or, probably the most common way, by coercing from objects of
class AlignedRead
.
.Data
:two-column integer matrix, holding the start and end coordinates of the intervals on the chromosomes
sequence
:character; sequence of the read aligned to the interval
reads
:integer; total number of reads that were aligned to this interval
matches
:integer; the total number of genomic intervals that reads which were aligned to this interval were aligned to. A value of '1' thus means that this read sequence matches uniquely to this one genome interval only
organism
:string; an identifier for the genome of
which organism the intervals are related to. Functions making use
of this slot require a specific annotation package
org.<organism>.eg.db
. For example if organism
is
'Hs', the annotation package 'org.Hs.eg.db' is utilised by these
functions. The annotation packages can be obtained from the
Bioconductor repositories.
annotation
:data.frame; see class
genome_intervals
for details
closed
:matrix; see class
genome_intervals
for details
type
:character; see class
genome_intervals
for details
score
:numeric; optional score for each aligned genome interval
id
:character; optional identifier for each aligned genome interval
chrlengths
:integer; optional named integer vector of
chromosome lengths for the respective genome; if present it is
used in place of the chromosome lengths retrieved from the
annotation package (see slot organism
)
Class Genome_intervals-class
, directly.
Class Intervals_full
, by class
"Genome_intervals", distance 2.
Coercion method from objects of class
AlignedRead
, which is defined in package ShortRead
,
to objects of class AlignedGenomeIntervals
signature("AlignedGenomeIntervals")
: computes
the read coverage over all chromosomes. If the organism
of
the object is set correctly, the chromosome lengths are retrieved
from the appropriate annotation package, otherwise the maximum
interval end is taken to be the absolute length of that chromosome
(strand).
The result of this method is a list and the individual list
elements are of class Rle
, a class for encoding long
repetitive vectors that is defined in package IRanges
.
The additional argument byStrand
governs whether
the coverage is computed separately for each strand. If
byStrand=FALSE
(default) only one result is returned per
chromosome. If byStrand=TRUE
, there result is
two separate Rle
objects per chromosome with the strand
appended to the chromosome name.
signature("AlignedGenomeIntervals")
: a more
detailed output of all the intervals than provided by show
;
only advisable for objects containing few intervals
signature("AlignedGenomeIntervals")
with
additional arguments fiveprime=0L
and
threeprime=0L
. These must be integer numbers and greater
than or equal to 0. They specify how much is subtracted from the
left border of the interval and added to the right side. Which end
is 5' and which one is 3' are determined from the strand
information of the object.
Lastly, if the object has an organism
annotation, it is
checked that the right ends of the intervals do not exceed the
respective chromosome lengths.
export the aligned intervals as tab-delimited text
files which can be uploaded to the UCSC genome
browser as ‘custom tracks’.
Currently, there are methods for exporting the data
into ‘bed’ format and ‘bedGraph’ format,
either writing the intervals from both strands into one file or
into two separate files (formats ‘bedStrand’ and
‘bedGraphStrand’, respectively).
Details about these track formats can be found
at the UCSC genome browser web pages.
The additional argument writeHeader
can be set to
FALSE
to suppress writing of the track definition header
line to the file.
For Genome_intervals
objects, only ‘bed’ format is
supported at the moment and does not need to be specified.
signature("AlignedGenomeIntervals")
: creates
a histogram of the lengths of the reads aligned to the intervals
Get or set the organism that the genome intervals in
the object correspond to. Should be a predefined code, such as
'Mm' for mouse and 'Hs' for human. The reason for this code, that,
if the organism is set, a corresponding annotation package that is
called org.<organism>.eg.db
is used, for example for
obtaining the chromosome lengths to be used in methods such as
coverage
. These annotation packages can be obtained from
the Bioconductor repository.
visualisation method; a second argument of class
Genome_intervals_stranded
can be provided for additional
annotation to the plot. Please see below and in the vignette for
examples. Refer to the documentation of plotAligned
for more details on the plotting function.
collapse/reduce aligned genome intervals by combining
intervals which are completely included in each other, combining
overlapping intervals AND combining immediately adjacent
intervals (if method="standard"
).
Intervals are only combined if they are on the same
chromosome, the same strand AND have the same match specificity
of the aligned reads.
If you only want to combine intervals that have exactly the same
start and stop position (but may have reads of slightly different
sequence aligned to them), then use the argument
method="exact"
.
If you only want to combine intervals that have exactly the same
5' or 3' end (but may differ in the other end and in the aligned
sequence), then use the argument
method="same5"
(same 5' end) or
method="same3"
(same 3' end).
Finally, it's possible to only collapse/reduce aligned genome
intervals that overlap each other by at least a certain fraction
using the argument min.frac
. min.frac
is a number
between 0.0 and 1.0. For example, if you call reduce
with
argument min.frac=0.4
, only intervals that overlap
each other by at least 40 percent are collapsed/merged.
draw a random sample of n
(Argument
size
) of the aligned reads (without or with replacement)
and returns the AlignedGenomeIntervals
object defined by
these aligned reads.
access or set a custom score for the object
sorts the intervals by chromosome name, start and end
coordinate in increasing order (unless decreasing=TRUE
is
specified) and returns the sorted object
take a subset of reads, matrix-like subsetting via '\[' can also be used
Joern Toedling
Genome_intervals-class
,
AlignedRead-class
,
plotAligned
############# toy example: A <- new("AlignedGenomeIntervals", .Data=cbind(c(1,3,4,5,8,10), c(5,5,6,8,9,11)), annotation=data.frame( seq_name=factor(rep(c("chr1","chr2","chr3"), each=2)), strand=factor(c("-","-","+","+","+","+") ,levels=c("-","+")), inter_base=rep(FALSE, 6)), reads=rep(3L, 6), matches=rep(1L,6), sequence=c("ACATT","ACA","CGT","GTAA","AG","CT")) show(A) detail(A) ## alternative initiation of this object: A <- AlignedGenomeIntervals( start=c(1,3,4,5,8,10), end=c(5,5,6,8,9,11), chromosome=rep(c("chr2","chrX","chr1"), each=2), strand=c("-","-","+","+","+","+"), sequence=c("ACATT","ACA","CGT","GGAA","AG","CT"), reads=c(1L, 5L, 2L, 7L, 3L, 3L)) detail(A) ## custom identifiers can be assigned to the intervals id(A) <- paste("gi", 1:6, sep="") ## subsetting and combining detail(A[c(1:4)]) detail(c(A[1], A[4])) ## sorting: always useful A <- sort(A) detail(A) ## the 'reduce' method provides a cleaned-up, compact set detail(reduce(A)) ## with arguments specifying additional conditions for merging detail(reduce(A, min.frac=0.8)) ## 'sample' to draw a sample subset of reads and their intervals detail(sample(A, 10)) ## biological example exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") organism(exAI) <- "Mm" show(exAI) ## which chromosomes are the intervals on? table(chromosome(exAI)) ## subset exAI[is.element(chromosome(exAI), c("chr1","chr2"))] ## compute coverage per chromosome: coverage(exAI[is.element(chromosome(exAI), c("chr1","chr2"))]) ### plotting: load(file.path(exDir, "mgi_gi.RData")) if (interactive()) plot(exAI, mgi.gi, chr="chrX", start=50400000, end=50410000) ### overlap with annotated genome elements: exOv <- interval_overlap(exAI, mgi.gi) ## how many elements do read match positions generally overlap: table(listLen(exOv)) ## what are the 13 elements overlapped by a single match position: mgi.gi[exOv[[which.max(listLen(exOv))]]] ## what kinds of elements are overlapped (tabOv <- table(as.character(mgi.gi$type)[unlist(exOv)])) ### display those classes: my.cols <- rainbow(length(tabOv)) if (interactive()) pie(tabOv, col=my.cols, radius=0.85)
############# toy example: A <- new("AlignedGenomeIntervals", .Data=cbind(c(1,3,4,5,8,10), c(5,5,6,8,9,11)), annotation=data.frame( seq_name=factor(rep(c("chr1","chr2","chr3"), each=2)), strand=factor(c("-","-","+","+","+","+") ,levels=c("-","+")), inter_base=rep(FALSE, 6)), reads=rep(3L, 6), matches=rep(1L,6), sequence=c("ACATT","ACA","CGT","GTAA","AG","CT")) show(A) detail(A) ## alternative initiation of this object: A <- AlignedGenomeIntervals( start=c(1,3,4,5,8,10), end=c(5,5,6,8,9,11), chromosome=rep(c("chr2","chrX","chr1"), each=2), strand=c("-","-","+","+","+","+"), sequence=c("ACATT","ACA","CGT","GGAA","AG","CT"), reads=c(1L, 5L, 2L, 7L, 3L, 3L)) detail(A) ## custom identifiers can be assigned to the intervals id(A) <- paste("gi", 1:6, sep="") ## subsetting and combining detail(A[c(1:4)]) detail(c(A[1], A[4])) ## sorting: always useful A <- sort(A) detail(A) ## the 'reduce' method provides a cleaned-up, compact set detail(reduce(A)) ## with arguments specifying additional conditions for merging detail(reduce(A, min.frac=0.8)) ## 'sample' to draw a sample subset of reads and their intervals detail(sample(A, 10)) ## biological example exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") organism(exAI) <- "Mm" show(exAI) ## which chromosomes are the intervals on? table(chromosome(exAI)) ## subset exAI[is.element(chromosome(exAI), c("chr1","chr2"))] ## compute coverage per chromosome: coverage(exAI[is.element(chromosome(exAI), c("chr1","chr2"))]) ### plotting: load(file.path(exDir, "mgi_gi.RData")) if (interactive()) plot(exAI, mgi.gi, chr="chrX", start=50400000, end=50410000) ### overlap with annotated genome elements: exOv <- interval_overlap(exAI, mgi.gi) ## how many elements do read match positions generally overlap: table(listLen(exOv)) ## what are the 13 elements overlapped by a single match position: mgi.gi[exOv[[which.max(listLen(exOv))]]] ## what kinds of elements are overlapped (tabOv <- table(as.character(mgi.gi$type)[unlist(exOv)])) ### display those classes: my.cols <- rainbow(length(tabOv)) if (interactive()) pie(tabOv, col=my.cols, radius=0.85)
A function to sum up aligned reads per category of genome feature (i.e. gene, ncRNA, etc.).
countReadsAnnotated(GI, M, typeColumn="type", fractionGI=0.7, mem.friendly=FALSE, showAllTypes=FALSE)
countReadsAnnotated(GI, M, typeColumn="type", fractionGI=0.7, mem.friendly=FALSE, showAllTypes=FALSE)
GI |
object of class |
M |
Annotation object of class |
typeColumn |
string; which column of the annotation object
|
fractionGI |
which fraction of the intervals in object |
mem.friendly |
logical; should a version which requires less memory but takes a bit longer be used |
showAllTypes |
logical; should a table of genome feature types in
|
The read counts are summed up over each type of genome feature, and the read counts are normalised by their number of genomic matches. For example if a read has two matches in the genome, but only one inside a miRNA, it would count 0.5 for miRNAs.
A named numeric vector which gives the summed read counts for each supplied type of genome feature.
J Toedling
A <- AlignedGenomeIntervals( start=c(1,8,14,20), end=c(5,15,19,25), chromosome=rep("chr1", each=4), strand=c("+","+","+","+"), sequence=c("ACATT","TATCGGAC","TCGGACT","GTAACG"), reads=c(7L, 2L, 4L, 5L) ) M2 <- new("Genome_intervals_stranded", rbind(c(2,6), c(1,15), c(20,30)), closed = matrix(TRUE, ncol=2, nrow=3), annotation = data.frame( seq_name= factor(rep("chr1", 3)), inter_base= logical(3), strand=factor(rep("+", 3), levels=c("+","-")), alias=c("miRNA1","gene1","tRNA1"), type=c("miRNA","gene","tRNA")) ) if (interactive()){ grid.newpage() plot(A, M2, chr="chr1", start=0, end=35, nameColum="alias", show="plus") } countReadsAnnotated(A, M2, typeColumn="type")
A <- AlignedGenomeIntervals( start=c(1,8,14,20), end=c(5,15,19,25), chromosome=rep("chr1", each=4), strand=c("+","+","+","+"), sequence=c("ACATT","TATCGGAC","TCGGACT","GTAACG"), reads=c(7L, 2L, 4L, 5L) ) M2 <- new("Genome_intervals_stranded", rbind(c(2,6), c(1,15), c(20,30)), closed = matrix(TRUE, ncol=2, nrow=3), annotation = data.frame( seq_name= factor(rep("chr1", 3)), inter_base= logical(3), strand=factor(rep("+", 3), levels=c("+","-")), alias=c("miRNA1","gene1","tRNA1"), type=c("miRNA","gene","tRNA")) ) if (interactive()){ grid.newpage() plot(A, M2, chr="chr1", start=0, end=35, nameColum="alias", show="plus") } countReadsAnnotated(A, M2, typeColumn="type")
Function to retrieve overlapping intervals that overlap at least by a specified fraction of their widths.
fracOverlap(I1, I2, min.frac=0.0, both=TRUE, mem.friendly=FALSE)
fracOverlap(I1, I2, min.frac=0.0, both=TRUE, mem.friendly=FALSE)
I1 |
object that inherits from class |
I2 |
object that inherits from class |
min.frac |
numeric; minimum required fraction of each of the two interval widths by which two intervals should overlap in order to be marked as overlapping. |
both |
logical; shall both overlap partners meet the minimum
fraction |
mem.friendly |
logical; if set to |
An object of class data.frame
with one row each for a pair of
overlapping elements.
Index1 |
Index of interval in first interval list |
Index2 |
Index of interval in second interval list |
n |
number of bases that the two intervals overlap |
fraction1 |
fraction of interval 1's width by which the two intervals overlap |
fraction2 |
fraction of interval 2's width by which the two intervals overlap |
J. Toedling
data("gen_ints", package="genomeIntervals") i[4,2] <- 13L fracOverlap(i, i, 0.5)
data("gen_ints", package="genomeIntervals") i[4,2] <- 13L fracOverlap(i, i, 0.5)
Function to extract integer Phred score values from FastQ data.
intPhred(x, method="Sanger", returnType="list")
intPhred(x, method="Sanger", returnType="list")
x |
object of class |
method |
string; one of 'Sanger', 'Solexa' or 'previousSolexa'. See details below. |
returnType |
string; in which format should the result be returned, either as a 'list' or as a 'matrix'. |
There are different standards for encoding read qualities in Fastq files. The 'Sanger' format encodes a Phred quality score from 0 to 93 using ASCII 33 to 126. The current 'Solexa'/llumina format (1.3 and higher) encodes a Phred quality score from 0 to 40 using ASCII 64 to 104. The 'previous Solexa'/Illumina format (1.0) encodes a custom Solexa/Illumina quality score from -5 to 40 using ASCII 59 to 104. This custom Solexa quality score is approximately equal to the Phred scores for high qualities, but differs in the low quality range.
If returnType
is equal to ‘list’:
A list of integer Phred quality values of the same length as the
number of reads in the object x
.
If returnType
is equal to ‘matrix’:
A matrix of integer Phred quality values. The number of rows is the
number of reads in the object x
. The number of columns is the
maximum length (width) over all reads in object x
. The last
entries for reads that are shorter than this maximum width are
'NA'.
Joern Toedling
http://maq.sourceforge.net/fastq.shtml
exDir <- system.file("extdata", package="girafe") ra <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") ra.quals <- intPhred(ra, method="Sanger", returnType="matrix") ra.qmed <- apply(ra.quals, 2, median) if (interactive()) plot(ra.qmed, type="h", ylim=c(0,42), xlab="Base postion", ylab="Median Phred Quality Score", lwd=2, col="steelblue")
exDir <- system.file("extdata", package="girafe") ra <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") ra.quals <- intPhred(ra, method="Sanger", returnType="matrix") ra.qmed <- apply(ra.quals, 2, median) if (interactive()) plot(ra.qmed, type="h", ylim=c(0,42), xlab="Base postion", ylab="Median Phred Quality Score", lwd=2, col="steelblue")
This function computes the median quality for each position in a read over all reads in a ShortReadQ object.
medianByPosition(x, method = "Sanger", batchSize = 100000L)
medianByPosition(x, method = "Sanger", batchSize = 100000L)
x |
object of class |
method |
string; passed on to function |
batchSize |
number of rows to process in each iteration; directly influences RAM usage of this function |
The quality values are computed for each batch of reads and stored as
numeric Rle
objects for each postion. In each iteration, the
Rle
object of the current batch is merged with the previous one
in order to keep the RAM usage low.
A numeric vector of the median values per nucleotide position in the reads. The length of this vector corresponds to the length of the longest read in the data.
Joern Toedling
exDir <- system.file("extdata", package="girafe") ra <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") medianByPosition(ra, batchSize=200)
exDir <- system.file("extdata", package="girafe") ra <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") medianByPosition(ra, batchSize=200)
Investigate aligned reads in genome intervals with sliding windows.
perWindow(object, chr, winsize, step, normaliseByMatches = TRUE, mem.friendly = FALSE)
perWindow(object, chr, winsize, step, normaliseByMatches = TRUE, mem.friendly = FALSE)
object |
object of class |
chr |
string; which chromosome to investigate with sliding windows |
winsize |
integer; size of the sliding window in base-pairs |
step |
integer; offset between the start positions of two sliding windows |
normaliseByMatches |
logical; should the number of reads per
|
mem.friendly |
logical; argument passed on to function
|
The windows are constructed from the first base position onto which a read has been mapped until the end of the chromosome.
a data.frame
with the following information for each sliding
window on the chromosome
chr |
string; which chromosome the interval is on |
start |
integer; start coordinate of the windows on the chromosome |
end |
integer; end coordinate of the windows on the chromosome |
n.overlap |
integer; number of read match positions inside the
window. Per match position there can be one or more reads mapped, so
this number always is smaller than |
n.reads |
numeric; number of reads which match positions inside
this window; can be floating-point numbers if argument
|
n.unique |
integer; number of reads which each only have one match position in the genome and for which this position is contained inside this window |
max.reads |
integer; the maximal number of reads at any single one match position contained inside this window |
first |
integer; coordinate of the first read alignment found inside the window |
last |
integer; coordinate of the last read alignment found inside the window |
The result is of class data.frame
and in addition of the
(S3) class slidingWindowSummary
, which may be utilized by
follow-up functions.
Joern Toedling
exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") exPX <- perWindow(exAI, chr="chrX", winsize=1e5, step=0.5e5) head(exPX[order(exPX$n.overlap, decreasing=TRUE),])
exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") exPX <- perWindow(exAI, chr="chrX", winsize=1e5, step=0.5e5) head(exPX[order(exPX$n.overlap, decreasing=TRUE),])
Function to remove 3' adapter contamination from reads
trimAdapter(fq, adapter, match.score = 1, mismatch.score = -1, score.threshold = 2)
trimAdapter(fq, adapter, match.score = 1, mismatch.score = -1, score.threshold = 2)
fq |
Object of class |
adapter |
object of class |
match.score |
numeric; alignment score for matching bases |
mismatch.score |
numeric; alignment score for mismatches |
score.threshold |
numeric; minimum total alignment score required for an overlap match between the 3' end of the read and the 5' end of the adapter sequence. |
Performs an overlap alignment between the ends of the reads and the start of the adapter sequence.
An object of class ShortReadQ
containing the reads without the
3' adapter contamination.
The function trimLRPatterns
from package ShortRead
may
be a faster alternative to this function.
J. Toedling
pairwiseAlignment
,
narrow
,
readFastq
,
writeFastq
exDir <- system.file("extdata", package="girafe") ## load reads containing adapter fragments at the end ra23.wa <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") table(width(ra23.wa)) # adapter sequence obtained from GEO page # accession number: GSE10364 #adapter <- DNAString("CTGTAGGCACCATCAAT") adapter <- "CTGTAGGCACCATCAAT" # trim adapter ra23.na <- trimAdapter(ra23.wa, adapter) table(width(ra23.na))
exDir <- system.file("extdata", package="girafe") ## load reads containing adapter fragments at the end ra23.wa <- readFastq(dirPath=exDir, pattern= "aravinSRNA_23_plus_adapter_excerpt.fastq") table(width(ra23.wa)) # adapter sequence obtained from GEO page # accession number: GSE10364 #adapter <- DNAString("CTGTAGGCACCATCAAT") adapter <- "CTGTAGGCACCATCAAT" # trim adapter ra23.na <- trimAdapter(ra23.wa, adapter) table(width(ra23.na))
For each genome interval in one set, finds the nearest interval in a second set of genome intervals.
a data.frame
with a number of rows equal to the number of
intervals in argument from
. The elements of the data.frame are:
distance_to_nearest |
numeric; distance to nearest interval from
object |
which_nearest |
list; each list element are the indices or the
intervals in object |
which_overlap |
list; each list element are the indices or the
intervals in object |
Currently, the package girafe contains method implementations
for the first object (Argument: from
) being of any of the
classes “AlignedGenomeIntervals”,“Genome_intervals” or
“Genome_intervals_stranded”. The second object (Argument:
to
) has be of class “Genome_intervals_stranded” or
“Genome_intervals”.
If the supplied objects are stranded, as it is the case with objects of classes ‘AlignedGenomeIntervals’ and ‘Genome_intervals_stranded’, then the overlap and distance is solely computed between intervals on the same strand.
For objects of class ‘Genome_intervals’, overlap and distances are computed regardless of strand information.
Joern Toedling
### process aligned reads exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") ## load annotated genome features load(file.path(exDir, "mgi_gi.RData")) ## subset for sake of speed: A <- exAI[is.element(seqnames(exAI), c("chrX","chrY"))] G <- mgi.gi[is.element(seqnames(mgi.gi), c("chrX","chrY"))] ## find nearest annotated feature for each AlignedGenomeInterval WN <- which_nearest(A, G) dim(WN); tail(WN) ## notice the difference to: tail(which_nearest(as(A, "Genome_intervals"), G)) # the last interval in A is located antisense to a gene, # but not overlapping anything on the same strand
### process aligned reads exDir <- system.file("extdata", package="girafe") exA <- readAligned(dirPath=exDir, type="Bowtie", pattern="aravinSRNA_23_no_adapter_excerpt_mm9_unmasked.bwtmap") exAI <- as(exA, "AlignedGenomeIntervals") ## load annotated genome features load(file.path(exDir, "mgi_gi.RData")) ## subset for sake of speed: A <- exAI[is.element(seqnames(exAI), c("chrX","chrY"))] G <- mgi.gi[is.element(seqnames(mgi.gi), c("chrX","chrY"))] ## find nearest annotated feature for each AlignedGenomeInterval WN <- which_nearest(A, G) dim(WN); tail(WN) ## notice the difference to: tail(which_nearest(as(A, "Genome_intervals"), G)) # the last interval in A is located antisense to a gene, # but not overlapping anything on the same strand