Title: | Representation and manipulation of genomic intervals |
---|---|
Description: | The ability to efficiently represent and manipulate genomic annotations and alignments is playing a central role when it comes to analyzing high-throughput sequencing data (a.k.a. NGS data). The GenomicRanges package defines general purpose containers for storing and manipulating genomic intervals and variables defined along a genome. More specialized containers for representing and manipulating short alignments against a reference genome, or a matrix-like summarization of an experiment, are defined in the GenomicAlignments and SummarizedExperiment packages, respectively. Both packages build on top of the GenomicRanges infrastructure. |
Authors: | Patrick Aboyoun [aut], Hervé Pagès [aut, cre], Michael Lawrence [aut], Sonali Arora [ctb], Martin Morgan [ctb], Kayla Morrell [ctb], Valerie Obenchain [ctb], Marcel Ramos [ctb], Lori Shepherd [ctb], Dan Tenenbaum [ctb], Daniel van Twisk [ctb] |
Maintainer: | Hervé Pagès <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.59.1 |
Built: | 2024-11-19 02:58:55 UTC |
Source: | https://github.com/bioc/GenomicRanges |
absoluteRanges
transforms the genomic ranges in x
into
absolute ranges i.e. into ranges counted from the beginning of
the virtual sequence obtained by concatenating all the sequences in the
underlying genome (in the order reported by seqlevels(x)
).
relativeRanges
performs the reverse transformation.
NOTE: These functions only work on small genomes. See Details section below.
absoluteRanges(x) relativeRanges(x, seqlengths) ## Related utility: isSmallGenome(seqlengths)
absoluteRanges(x) relativeRanges(x, seqlengths) ## Related utility: isSmallGenome(seqlengths)
x |
For For |
seqlengths |
An object holding sequence lengths. This can be a named integer
(or numeric) vector with no duplicated names as returned by
For |
Because absoluteRanges
returns the absolute ranges in an
IRanges object, and because an IRanges
object cannot hold ranges with an end > .Machine$integer.max
(i.e. >= 2^31 on most platforms), absoluteRanges
cannot be used
if the size of the underlying genome (i.e. the total length of the
sequences in it) is > .Machine$integer.max
. Utility function
isSmallGenome
is provided as a mean for the user to check
upfront whether the genome is small (i.e. its size is <=
.Machine$integer.max
) or not, and thus compatible with
absoluteRanges
or not.
relativeRanges
applies the same restriction by looking at the
seqlengths
argument.
An IRanges object for absoluteRanges
.
A GRanges object for relativeRanges
.
absoluteRanges
and relativeRanges
both return an object that
is parallel to the input object (i.e. same length and names).
isSmallGenome
returns TRUE if the total length of the underlying
sequences is <= .Machine$integer.max
(e.g. Fly genome),
FALSE if not (e.g. Human genome), or NA if it cannot be computed (because
some sequence lengths are NA).
H. Pagès
GRanges objects.
IntegerRanges objects in the IRanges package.
Seqinfo objects and the seqlengths
getter in
the GenomeInfoDb package.
genomicvars for manipulating genomic variables.
The tileGenome
function for putting tiles on a
genome.
## --------------------------------------------------------------------- ## TOY EXAMPLE ## --------------------------------------------------------------------- gr <- GRanges(Rle(c("chr2", "chr1", "chr3", "chr1"), 4:1), IRanges(1:10, width=5), seqinfo=Seqinfo(c("chr1", "chr2", "chr3"), c(100, 50, 20))) ar <- absoluteRanges(gr) ar gr2 <- relativeRanges(ar, seqlengths(gr)) gr2 ## Sanity check: stopifnot(all(gr == gr2)) ## --------------------------------------------------------------------- ## ON REAL DATA ## --------------------------------------------------------------------- ## With a "small" genome library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ex <- exons(txdb) ex isSmallGenome(ex) ## Note that because isSmallGenome() can return NA (see Value section ## above), its result should always be wrapped inside isTRUE() when ## used in an if statement: if (isTRUE(isSmallGenome(ex))) { ar <- absoluteRanges(ex) ar ex2 <- relativeRanges(ar, seqlengths(ex)) ex2 # original strand is not restored ## Sanity check: strand(ex2) <- strand(ex) # restore the strand stopifnot(all(ex == ex2)) } ## With a "big" genome (but we can reduce it) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ex <- exons(txdb) isSmallGenome(ex) ## Not run: absoluteRanges(ex) # error! ## End(Not run) ## However, if we are only interested in some chromosomes, we might ## still be able to use absoluteRanges(): seqlevels(ex, pruning.mode="coarse") <- paste0("chr", 1:10) isSmallGenome(ex) # TRUE! ar <- absoluteRanges(ex) ex2 <- relativeRanges(ar, seqlengths(ex)) ## Sanity check: strand(ex2) <- strand(ex) stopifnot(all(ex == ex2))
## --------------------------------------------------------------------- ## TOY EXAMPLE ## --------------------------------------------------------------------- gr <- GRanges(Rle(c("chr2", "chr1", "chr3", "chr1"), 4:1), IRanges(1:10, width=5), seqinfo=Seqinfo(c("chr1", "chr2", "chr3"), c(100, 50, 20))) ar <- absoluteRanges(gr) ar gr2 <- relativeRanges(ar, seqlengths(gr)) gr2 ## Sanity check: stopifnot(all(gr == gr2)) ## --------------------------------------------------------------------- ## ON REAL DATA ## --------------------------------------------------------------------- ## With a "small" genome library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ex <- exons(txdb) ex isSmallGenome(ex) ## Note that because isSmallGenome() can return NA (see Value section ## above), its result should always be wrapped inside isTRUE() when ## used in an if statement: if (isTRUE(isSmallGenome(ex))) { ar <- absoluteRanges(ex) ar ex2 <- relativeRanges(ar, seqlengths(ex)) ex2 # original strand is not restored ## Sanity check: strand(ex2) <- strand(ex) # restore the strand stopifnot(all(ex == ex2)) } ## With a "big" genome (but we can reduce it) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ex <- exons(txdb) isSmallGenome(ex) ## Not run: absoluteRanges(ex) # error! ## End(Not run) ## However, if we are only interested in some chromosomes, we might ## still be able to use absoluteRanges(): seqlevels(ex, pruning.mode="coarse") <- paste0("chr", 1:10) isSmallGenome(ex) # TRUE! ar <- absoluteRanges(ex) ex2 <- relativeRanges(ar, seqlengths(ex)) ## Sanity check: strand(ex2) <- strand(ex) stopifnot(all(ex == ex2))
Attaching a Constraint object to an object of class A (the "constrained" object) is meant to be a convenient/reusable/extensible way to enforce a particular set of constraints on particular instances of A.
THIS IS AN EXPERIMENTAL FEATURE AND STILL VERY MUCH A WORK-IN-PROGRESS!
For the developer, using constraints is an alternative to the more traditional approach that consists in creating subclasses of A and implementing specific validity methods for each of them. However, using constraints offers the following advantages over the traditional approach:
The traditional approach often tends to lead to a proliferation of subclasses of A.
Constraints can easily be re-used across different classes without the need to create any new class.
Constraints can easily be combined.
All constraints are implemented as concrete subclasses of the Constraint class, which is a virtual class with no slots. Like the Constraint virtual class itself, concrete Constraint subclasses cannot have slots.
Here are the 7 steps typically involved in the process of putting constraints on objects of class A:
Add a slot named constraint
to the definition of class A.
The type of this slot must be Constraint_OR_NULL. Note that
any subclass of A will inherit this slot.
Implements the constraint()
accessors (getter and setter)
for objects of class A. This is done by implementing the
"constraint"
method (getter) and replacement method (setter)
for objects of class A (see the examples below). As a convenience
to the user, the setter should also accept the name of a constraint
(i.e. the name of its class) in addition to an instance of that
class. Note that those accessors will work on instances of any
subclass of A.
Modify the validity method for class A so it also returns the
result of checkConstraint(x, constraint(x))
(append this
result to the result returned by the validity method).
Testing: Create x
, an instance of class A (or subclass of A).
By default there is no constraint on it (constraint(x)
is
NULL
). validObject(x)
should return TRUE
.
Create a new constraint (MyConstraint) by extending the
Constraint class, typically with
setClass("MyConstraint", contains="Constraint")
.
This constraint is not enforcing anything yet so you could put
it on x
(with constraint(x) <- "MyConstraint"
),
but not much would happen. In order to actually enforce something,
a "checkConstraint"
method for signature
c(x="A", constraint="MyConstraint")
needs to be implemented.
Implement a "checkConstraint"
method for signature
c(x="A", constraint="MyConstraint")
. Like validity methods,
"checkConstraint"
methods must return NULL
or
a character vector describing the problems found.
Like validity methods, they should never fail (i.e. they should
never raise an error).
Note that, alternatively, an existing constraint (e.g.
SomeConstraint) can be adapted to work on objects of class A
by just defining a new "checkConstraint"
method for
signature c(x="A", constraint="SomeConstraint")
.
Also, stricter constraints can be built on top of existing
constraints by extending one or more existing constraints
(see the examples below).
Testing: Try constraint(x) <- "MyConstraint"
. It will or
will not work depending on whether x
satisfies the
constraint or not. In the former case, trying to modify x
in a way that breaks the constraint on it will also raise an
error.
WARNING: This note is not true anymore as the constraint
slot has
been temporarily removed from GenomicRanges objects (starting with
package GenomicRanges >= 1.7.9).
Currently, only GenomicRanges objects can be constrained, that is:
they have a constraint
slot;
they have constraint()
accessors (getter and setter)
for this slot;
their validity method has been modified so it also returns the
result of checkConstraint(x, constraint(x))
.
More classes in the GenomicRanges and IRanges packages will support constraints in the near future.
H. Pagès
setClass
,
is
,
setMethod
,
showMethods
,
validObject
,
GenomicRanges-class
## The examples below show how to define and set constraints on ## GenomicRanges objects. Note that this is how the constraint() ## setter is defined for GenomicRanges objects: #setReplaceMethod("constraint", "GenomicRanges", # function(x, value) # { # if (isSingleString(value)) # value <- new(value) # if (!is(value, "Constraint_OR_NULL")) # stop("the supplied 'constraint' must be a ", # "Constraint object, a single string, or NULL") # x@constraint <- value # validObject(x) # x # } #) #selectMethod("constraint", "GenomicRanges") # the getter #selectMethod("constraint<-", "GenomicRanges") # the setter ## We'll use the GRanges instance 'gr' created in the GRanges examples ## to test our constraints: example(GRanges, echo=FALSE) gr #constraint(gr) ## --------------------------------------------------------------------- ## EXAMPLE 1: The HasRangeTypeCol constraint. ## --------------------------------------------------------------------- ## The HasRangeTypeCol constraint checks that the constrained object ## has a unique "rangeType" metadata column and that this column ## is a 'factor' Rle with no NAs and with the following levels ## (in this order): gene, transcript, exon, cds, 5utr, 3utr. setClass("HasRangeTypeCol", contains="Constraint") ## Like validity methods, "checkConstraint" methods must return NULL or ## a character vector describing the problems found. They should never ## fail i.e. they should never raise an error. setMethod("checkConstraint", c("GenomicRanges", "HasRangeTypeCol"), function(x, constraint, verbose=FALSE) { x_mcols <- mcols(x) idx <- match("rangeType", colnames(x_mcols)) if (length(idx) != 1L || is.na(idx)) { msg <- c("'mcols(x)' must have exactly 1 column ", "named \"rangeType\"") return(paste(msg, collapse="")) } rangeType <- x_mcols[[idx]] .LEVELS <- c("gene", "transcript", "exon", "cds", "5utr", "3utr") if (!is(rangeType, "Rle") || S4Vectors:::anyMissing(runValue(rangeType)) || !identical(levels(rangeType), .LEVELS)) { msg <- c("'mcols(x)$rangeType' must be a ", "'factor' Rle with no NAs and with levels: ", paste(.LEVELS, collapse=", ")) return(paste(msg, collapse="")) } NULL } ) #\dontrun{ #constraint(gr) <- "HasRangeTypeCol" # will fail #} checkConstraint(gr, new("HasRangeTypeCol")) # with GenomicRanges >= 1.7.9 levels <- c("gene", "transcript", "exon", "cds", "5utr", "3utr") rangeType <- Rle(factor(c("cds", "gene"), levels=levels), c(8, 2)) mcols(gr)$rangeType <- rangeType #constraint(gr) <- "HasRangeTypeCol" # OK checkConstraint(gr, new("HasRangeTypeCol")) # with GenomicRanges >= 1.7.9 ## Use is() to check whether the object has a given constraint or not: #is(constraint(gr), "HasRangeTypeCol") # TRUE #\dontrun{ #mcols(gr)$rangeType[3] <- NA # will fail #} mcols(gr)$rangeType[3] <- NA checkConstraint(gr, new("HasRangeTypeCol")) # with GenomicRanges >= 1.7.9 ## --------------------------------------------------------------------- ## EXAMPLE 2: The GeneRanges constraint. ## --------------------------------------------------------------------- ## The GeneRanges constraint is defined on top of the HasRangeTypeCol ## constraint. It checks that all the ranges in the object are of type ## "gene". setClass("GeneRanges", contains="HasRangeTypeCol") ## The checkConstraint() generic will check the HasRangeTypeCol constraint ## first, and, only if it's statisfied, it will then check the GeneRanges ## constraint. setMethod("checkConstraint", c("GenomicRanges", "GeneRanges"), function(x, constraint, verbose=FALSE) { rangeType <- mcols(x)$rangeType if (!all(rangeType == "gene")) { msg <- c("all elements in 'mcols(x)$rangeType' ", "must be equal to \"gene\"") return(paste(msg, collapse="")) } NULL } ) #\dontrun{ #constraint(gr) <- "GeneRanges" # will fail #} checkConstraint(gr, new("GeneRanges")) # with GenomicRanges >= 1.7.9 mcols(gr)$rangeType[] <- "gene" ## This replace the previous constraint (HasRangeTypeCol): #constraint(gr) <- "GeneRanges" # OK checkConstraint(gr, new("GeneRanges")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "GeneRanges") # TRUE ## However, 'gr' still indirectly has the HasRangeTypeCol constraint ## (because the GeneRanges constraint extends the HasRangeTypeCol ## constraint): #is(constraint(gr), "HasRangeTypeCol") # TRUE #\dontrun{ #mcols(gr)$rangeType[] <- "exon" # will fail #} mcols(gr)$rangeType[] <- "exon" checkConstraint(gr, new("GeneRanges")) # with GenomicRanges >= 1.7.9 ## --------------------------------------------------------------------- ## EXAMPLE 3: The HasGCCol constraint. ## --------------------------------------------------------------------- ## The HasGCCol constraint checks that the constrained object has a ## unique "GC" metadata column, that this column is of type numeric, ## with no NAs, and that all the values in that column are >= 0 and <= 1. setClass("HasGCCol", contains="Constraint") setMethod("checkConstraint", c("GenomicRanges", "HasGCCol"), function(x, constraint, verbose=FALSE) { x_mcols <- mcols(x) idx <- match("GC", colnames(x_mcols)) if (length(idx) != 1L || is.na(idx)) { msg <- c("'mcols(x)' must have exactly ", "one column named \"GC\"") return(paste(msg, collapse="")) } GC <- x_mcols[[idx]] if (!is.numeric(GC) || S4Vectors:::anyMissing(GC) || any(GC < 0) || any(GC > 1)) { msg <- c("'mcols(x)$GC' must be a numeric vector ", "with no NAs and with values between 0 and 1") return(paste(msg, collapse="")) } NULL } ) ## This replace the previous constraint (GeneRanges): #constraint(gr) <- "HasGCCol" # OK checkConstraint(gr, new("HasGCCol")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "HasGCCol") # TRUE #is(constraint(gr), "GeneRanges") # FALSE #is(constraint(gr), "HasRangeTypeCol") # FALSE ## --------------------------------------------------------------------- ## EXAMPLE 4: The HighGCRanges constraint. ## --------------------------------------------------------------------- ## The HighGCRanges constraint is defined on top of the HasGCCol ## constraint. It checks that all the ranges in the object have a GC ## content >= 0.5. setClass("HighGCRanges", contains="HasGCCol") ## The checkConstraint() generic will check the HasGCCol constraint ## first, and, if it's statisfied, it will then check the HighGCRanges ## constraint. setMethod("checkConstraint", c("GenomicRanges", "HighGCRanges"), function(x, constraint, verbose=FALSE) { GC <- mcols(x)$GC if (!all(GC >= 0.5)) { msg <- c("all elements in 'mcols(x)$GC' ", "must be >= 0.5") return(paste(msg, collapse="")) } NULL } ) #\dontrun{ #constraint(gr) <- "HighGCRanges" # will fail #} checkConstraint(gr, new("HighGCRanges")) # with GenomicRanges >= 1.7.9 mcols(gr)$GC[6:10] <- 0.5 #constraint(gr) <- "HighGCRanges" # OK checkConstraint(gr, new("HighGCRanges")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "HighGCRanges") # TRUE #is(constraint(gr), "HasGCCol") # TRUE ## --------------------------------------------------------------------- ## EXAMPLE 5: The HighGCGeneRanges constraint. ## --------------------------------------------------------------------- ## The HighGCGeneRanges constraint is the combination (AND) of the ## GeneRanges and HighGCRanges constraints. setClass("HighGCGeneRanges", contains=c("GeneRanges", "HighGCRanges")) ## No need to define a method for this constraint: the checkConstraint() ## generic will automatically check the GeneRanges and HighGCRanges ## constraints. #constraint(gr) <- "HighGCGeneRanges" # OK checkConstraint(gr, new("HighGCGeneRanges")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "HighGCGeneRanges") # TRUE #is(constraint(gr), "HighGCRanges") # TRUE #is(constraint(gr), "HasGCCol") # TRUE #is(constraint(gr), "GeneRanges") # TRUE #is(constraint(gr), "HasRangeTypeCol") # TRUE ## See how all the individual constraints are checked (from less ## specific to more specific constraints): #checkConstraint(gr, constraint(gr), verbose=TRUE) checkConstraint(gr, new("HighGCGeneRanges"), verbose=TRUE) # with # GenomicRanges # >= 1.7.9 ## See all the "checkConstraint" methods: showMethods("checkConstraint")
## The examples below show how to define and set constraints on ## GenomicRanges objects. Note that this is how the constraint() ## setter is defined for GenomicRanges objects: #setReplaceMethod("constraint", "GenomicRanges", # function(x, value) # { # if (isSingleString(value)) # value <- new(value) # if (!is(value, "Constraint_OR_NULL")) # stop("the supplied 'constraint' must be a ", # "Constraint object, a single string, or NULL") # x@constraint <- value # validObject(x) # x # } #) #selectMethod("constraint", "GenomicRanges") # the getter #selectMethod("constraint<-", "GenomicRanges") # the setter ## We'll use the GRanges instance 'gr' created in the GRanges examples ## to test our constraints: example(GRanges, echo=FALSE) gr #constraint(gr) ## --------------------------------------------------------------------- ## EXAMPLE 1: The HasRangeTypeCol constraint. ## --------------------------------------------------------------------- ## The HasRangeTypeCol constraint checks that the constrained object ## has a unique "rangeType" metadata column and that this column ## is a 'factor' Rle with no NAs and with the following levels ## (in this order): gene, transcript, exon, cds, 5utr, 3utr. setClass("HasRangeTypeCol", contains="Constraint") ## Like validity methods, "checkConstraint" methods must return NULL or ## a character vector describing the problems found. They should never ## fail i.e. they should never raise an error. setMethod("checkConstraint", c("GenomicRanges", "HasRangeTypeCol"), function(x, constraint, verbose=FALSE) { x_mcols <- mcols(x) idx <- match("rangeType", colnames(x_mcols)) if (length(idx) != 1L || is.na(idx)) { msg <- c("'mcols(x)' must have exactly 1 column ", "named \"rangeType\"") return(paste(msg, collapse="")) } rangeType <- x_mcols[[idx]] .LEVELS <- c("gene", "transcript", "exon", "cds", "5utr", "3utr") if (!is(rangeType, "Rle") || S4Vectors:::anyMissing(runValue(rangeType)) || !identical(levels(rangeType), .LEVELS)) { msg <- c("'mcols(x)$rangeType' must be a ", "'factor' Rle with no NAs and with levels: ", paste(.LEVELS, collapse=", ")) return(paste(msg, collapse="")) } NULL } ) #\dontrun{ #constraint(gr) <- "HasRangeTypeCol" # will fail #} checkConstraint(gr, new("HasRangeTypeCol")) # with GenomicRanges >= 1.7.9 levels <- c("gene", "transcript", "exon", "cds", "5utr", "3utr") rangeType <- Rle(factor(c("cds", "gene"), levels=levels), c(8, 2)) mcols(gr)$rangeType <- rangeType #constraint(gr) <- "HasRangeTypeCol" # OK checkConstraint(gr, new("HasRangeTypeCol")) # with GenomicRanges >= 1.7.9 ## Use is() to check whether the object has a given constraint or not: #is(constraint(gr), "HasRangeTypeCol") # TRUE #\dontrun{ #mcols(gr)$rangeType[3] <- NA # will fail #} mcols(gr)$rangeType[3] <- NA checkConstraint(gr, new("HasRangeTypeCol")) # with GenomicRanges >= 1.7.9 ## --------------------------------------------------------------------- ## EXAMPLE 2: The GeneRanges constraint. ## --------------------------------------------------------------------- ## The GeneRanges constraint is defined on top of the HasRangeTypeCol ## constraint. It checks that all the ranges in the object are of type ## "gene". setClass("GeneRanges", contains="HasRangeTypeCol") ## The checkConstraint() generic will check the HasRangeTypeCol constraint ## first, and, only if it's statisfied, it will then check the GeneRanges ## constraint. setMethod("checkConstraint", c("GenomicRanges", "GeneRanges"), function(x, constraint, verbose=FALSE) { rangeType <- mcols(x)$rangeType if (!all(rangeType == "gene")) { msg <- c("all elements in 'mcols(x)$rangeType' ", "must be equal to \"gene\"") return(paste(msg, collapse="")) } NULL } ) #\dontrun{ #constraint(gr) <- "GeneRanges" # will fail #} checkConstraint(gr, new("GeneRanges")) # with GenomicRanges >= 1.7.9 mcols(gr)$rangeType[] <- "gene" ## This replace the previous constraint (HasRangeTypeCol): #constraint(gr) <- "GeneRanges" # OK checkConstraint(gr, new("GeneRanges")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "GeneRanges") # TRUE ## However, 'gr' still indirectly has the HasRangeTypeCol constraint ## (because the GeneRanges constraint extends the HasRangeTypeCol ## constraint): #is(constraint(gr), "HasRangeTypeCol") # TRUE #\dontrun{ #mcols(gr)$rangeType[] <- "exon" # will fail #} mcols(gr)$rangeType[] <- "exon" checkConstraint(gr, new("GeneRanges")) # with GenomicRanges >= 1.7.9 ## --------------------------------------------------------------------- ## EXAMPLE 3: The HasGCCol constraint. ## --------------------------------------------------------------------- ## The HasGCCol constraint checks that the constrained object has a ## unique "GC" metadata column, that this column is of type numeric, ## with no NAs, and that all the values in that column are >= 0 and <= 1. setClass("HasGCCol", contains="Constraint") setMethod("checkConstraint", c("GenomicRanges", "HasGCCol"), function(x, constraint, verbose=FALSE) { x_mcols <- mcols(x) idx <- match("GC", colnames(x_mcols)) if (length(idx) != 1L || is.na(idx)) { msg <- c("'mcols(x)' must have exactly ", "one column named \"GC\"") return(paste(msg, collapse="")) } GC <- x_mcols[[idx]] if (!is.numeric(GC) || S4Vectors:::anyMissing(GC) || any(GC < 0) || any(GC > 1)) { msg <- c("'mcols(x)$GC' must be a numeric vector ", "with no NAs and with values between 0 and 1") return(paste(msg, collapse="")) } NULL } ) ## This replace the previous constraint (GeneRanges): #constraint(gr) <- "HasGCCol" # OK checkConstraint(gr, new("HasGCCol")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "HasGCCol") # TRUE #is(constraint(gr), "GeneRanges") # FALSE #is(constraint(gr), "HasRangeTypeCol") # FALSE ## --------------------------------------------------------------------- ## EXAMPLE 4: The HighGCRanges constraint. ## --------------------------------------------------------------------- ## The HighGCRanges constraint is defined on top of the HasGCCol ## constraint. It checks that all the ranges in the object have a GC ## content >= 0.5. setClass("HighGCRanges", contains="HasGCCol") ## The checkConstraint() generic will check the HasGCCol constraint ## first, and, if it's statisfied, it will then check the HighGCRanges ## constraint. setMethod("checkConstraint", c("GenomicRanges", "HighGCRanges"), function(x, constraint, verbose=FALSE) { GC <- mcols(x)$GC if (!all(GC >= 0.5)) { msg <- c("all elements in 'mcols(x)$GC' ", "must be >= 0.5") return(paste(msg, collapse="")) } NULL } ) #\dontrun{ #constraint(gr) <- "HighGCRanges" # will fail #} checkConstraint(gr, new("HighGCRanges")) # with GenomicRanges >= 1.7.9 mcols(gr)$GC[6:10] <- 0.5 #constraint(gr) <- "HighGCRanges" # OK checkConstraint(gr, new("HighGCRanges")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "HighGCRanges") # TRUE #is(constraint(gr), "HasGCCol") # TRUE ## --------------------------------------------------------------------- ## EXAMPLE 5: The HighGCGeneRanges constraint. ## --------------------------------------------------------------------- ## The HighGCGeneRanges constraint is the combination (AND) of the ## GeneRanges and HighGCRanges constraints. setClass("HighGCGeneRanges", contains=c("GeneRanges", "HighGCRanges")) ## No need to define a method for this constraint: the checkConstraint() ## generic will automatically check the GeneRanges and HighGCRanges ## constraints. #constraint(gr) <- "HighGCGeneRanges" # OK checkConstraint(gr, new("HighGCGeneRanges")) # with GenomicRanges >= 1.7.9 #is(constraint(gr), "HighGCGeneRanges") # TRUE #is(constraint(gr), "HighGCRanges") # TRUE #is(constraint(gr), "HasGCCol") # TRUE #is(constraint(gr), "GeneRanges") # TRUE #is(constraint(gr), "HasRangeTypeCol") # TRUE ## See how all the individual constraints are checked (from less ## specific to more specific constraints): #checkConstraint(gr, constraint(gr), verbose=TRUE) checkConstraint(gr, new("HighGCGeneRanges"), verbose=TRUE) # with # GenomicRanges # >= 1.7.9 ## See all the "checkConstraint" methods: showMethods("checkConstraint")
coverage
methods for GRanges and
GRangesList objects.
NOTE: The coverage
generic function and methods
for IntegerRanges and IntegerRangesList
objects are defined and documented in the IRanges package.
Methods for GAlignments and
GAlignmentPairs objects are defined and
documented in the GenomicAlignments package.
## S4 method for signature 'GenomicRanges' coverage(x, shift=0L, width=NULL, weight=1L, method=c("auto", "sort", "hash", "naive")) ## S4 method for signature 'GRangesList' coverage(x, shift=0L, width=NULL, weight=1L, method=c("auto", "sort", "hash", "naive"))
## S4 method for signature 'GenomicRanges' coverage(x, shift=0L, width=NULL, weight=1L, method=c("auto", "sort", "hash", "naive")) ## S4 method for signature 'GRangesList' coverage(x, shift=0L, width=NULL, weight=1L, method=c("auto", "sort", "hash", "naive"))
x |
A GenomicRanges or GRangesList object. |
shift , weight
|
A numeric vector or a list-like object. If numeric, it must be parallel
to Alternatively, each of these arguments can also be specified as a
single string naming a metadata column in See Note that when |
width |
Either See |
method |
See |
When x
is a GRangesList object, coverage(x, ...)
is equivalent to coverage(unlist(x), ...)
.
A named RleList object with one coverage vector per
seqlevel in x
.
H. Pagès and P. Aboyoun
coverage
in the IRanges package.
coverage-methods in the GenomicAlignments package.
RleList objects in the IRanges package.
GRanges, GPos, and GRangesList objects.
## Coverage of a GRanges object: gr <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, end=10), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13)) cvg <- coverage(gr) pcvg <- coverage(gr[strand(gr) == "+"]) mcvg <- coverage(gr[strand(gr) == "-"]) scvg <- coverage(gr[strand(gr) == "*"]) stopifnot(identical(pcvg + mcvg + scvg, cvg)) ## Coverage of a GPos object: pos_runs <- GRanges(c("chr1", "chr1", "chr2"), IRanges(c(1, 5, 9), c(10, 8, 15))) gpos <- GPos(pos_runs) coverage(gpos) ## Coverage of a GRangesList object: gr1 <- GRanges(seqnames="chr2", ranges=IRanges(3, 6), strand = "+") gr2 <- GRanges(seqnames=c("chr1", "chr1"), ranges=IRanges(c(7,13), width=3), strand=c("+", "-")) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-")) grl <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) stopifnot(identical(coverage(grl), coverage(unlist(grl))))
## Coverage of a GRanges object: gr <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, end=10), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13)) cvg <- coverage(gr) pcvg <- coverage(gr[strand(gr) == "+"]) mcvg <- coverage(gr[strand(gr) == "-"]) scvg <- coverage(gr[strand(gr) == "*"]) stopifnot(identical(pcvg + mcvg + scvg, cvg)) ## Coverage of a GPos object: pos_runs <- GRanges(c("chr1", "chr1", "chr2"), IRanges(c(1, 5, 9), c(10, 8, 15))) gpos <- GPos(pos_runs) coverage(gpos) ## Coverage of a GRangesList object: gr1 <- GRanges(seqnames="chr2", ranges=IRanges(3, 6), strand = "+") gr2 <- GRanges(seqnames=c("chr1", "chr1"), ranges=IRanges(c(7,13), width=3), strand=c("+", "-")) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-")) grl <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) stopifnot(identical(coverage(grl), coverage(unlist(grl))))
The DelegatingGenomicRanges
class implements the virtual
GenomicRanges
class using a delegate GenomicRanges
. This
enables developers to create GenomicRanges
subclasses that add
specialized columns or other structure, while remaining agnostic to
how the data are actually stored. See the Extending GenomicRanges
vignette.
M. Lawrence
Various methods for finding/counting overlaps between objects containing genomic ranges. This man page describes the methods that operate on GenomicRanges and GRangesList objects.
NOTE: The findOverlaps
generic function
and methods for IntegerRanges and
IntegerRangesList objects are defined and
documented in the IRanges package.
The methods for GAlignments,
GAlignmentPairs, and
GAlignmentsList objects are defined and
documented in the GenomicAlignments package.
GenomicRanges and GRangesList objects also support
countOverlaps
, overlapsAny
, and subsetByOverlaps
thanks to the default methods defined in the IRanges package and
to the findOverlaps
and countOverlaps
methods defined in
this package and documented below.
## S4 method for signature 'GenomicRanges,GenomicRanges' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GRangesList,GenomicRanges' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GRangesList' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GRangesList,GRangesList' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' countOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), ignore.strand=FALSE)
## S4 method for signature 'GenomicRanges,GenomicRanges' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GRangesList,GenomicRanges' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GRangesList' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GRangesList,GRangesList' findOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), select=c("all", "first", "last", "arbitrary"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' countOverlaps(query, subject, maxgap=-1L, minoverlap=0L, type=c("any", "start", "end", "within", "equal"), ignore.strand=FALSE)
query , subject
|
A GRanges or GRangesList object. |
maxgap , minoverlap , type
|
See IMPORTANT NOTE about how |
select |
When |
ignore.strand |
When set to |
When the query and the subject are GRanges or
GRangesList objects, findOverlaps
uses the triplet
(sequence name, range, strand) to determine which features (see
paragraph below for the definition of feature) from the query
overlap which features in the subject
, where a strand value
of "*"
is treated as occurring on both the "+"
and
"-"
strand.
An overlap is recorded when a feature in the query
and a feature
in the subject
have the same sequence name, have a compatible
pairing of strands (e.g. "+"
/"+"
, "-"
/"-"
,
"*"
/"+"
, "*"
/"-"
, etc.), and satisfy the
interval overlap requirements.
In the context of findOverlaps
, a feature is a collection of
ranges that are treated as a single entity. For GRanges objects,
a feature is a single range; while for GRangesList objects,
a feature is a list element containing a set of ranges. In the results,
the features are referred to by number, which run from 1 to
length(query)
/length(subject)
.
For type="equal"
with GRangesList objects, query[[i]]
matches subject[[j]]
iff for each range in query[[i]]
there is an identical range in subject[[j]]
, and vice versa.
For findOverlaps
: either a Hits object when
select="all"
or an integer vector otherwise.
For countOverlaps
: an integer vector containing the tabulated
query overlap hits.
P. Aboyoun, S. Falcon, M. Lawrence, and H. Pagès
The Hits class in the S4Vectors package for representing a set of hits between 2 vector-like objects.
The findOverlaps
generic function defined
in the IRanges package.
The GNCList constructor and class for preprocessing and representing a GenomicRanges or object as a data structure based on Nested Containment Lists.
The GRanges and GRangesList classes.
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## GRanges object: gr <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=head(letters,10)), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) gr ## GRangesList object: gr1 <- GRanges(seqnames="chr2", ranges=IRanges(4:3, 6), strand="+", score=5:4, GC=0.45) gr2 <- GRanges(seqnames=c("chr1", "chr1"), ranges=IRanges(c(7,13), width=3), strand=c("+", "-"), score=3:4, GC=c(0.3, 0.5)) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-"), score=c(6L, 2L), GC=c(0.4, 0.1)) grl <- GRangesList("gr1"=gr1, "gr2"=gr2, "gr3"=gr3) ## Overlapping two GRanges objects: table(!is.na(findOverlaps(gr, gr1, select="arbitrary"))) countOverlaps(gr, gr1) findOverlaps(gr, gr1) subsetByOverlaps(gr, gr1) countOverlaps(gr, gr1, type="start") findOverlaps(gr, gr1, type="start") subsetByOverlaps(gr, gr1, type="start") findOverlaps(gr, gr1, select="first") findOverlaps(gr, gr1, select="last") findOverlaps(gr1, gr) findOverlaps(gr1, gr, type="start") findOverlaps(gr1, gr, type="within") findOverlaps(gr1, gr, type="equal") ## --------------------------------------------------------------------- ## MORE EXAMPLES ## --------------------------------------------------------------------- table(!is.na(findOverlaps(gr, gr1, select="arbitrary"))) countOverlaps(gr, gr1) findOverlaps(gr, gr1) subsetByOverlaps(gr, gr1) ## Overlaps between a GRanges and a GRangesList object: table(!is.na(findOverlaps(grl, gr, select="first"))) countOverlaps(grl, gr) findOverlaps(grl, gr) subsetByOverlaps(grl, gr) countOverlaps(grl, gr, type="start") findOverlaps(grl, gr, type="start") subsetByOverlaps(grl, gr, type="start") findOverlaps(grl, gr, select="first") table(!is.na(findOverlaps(grl, gr1, select="first"))) countOverlaps(grl, gr1) findOverlaps(grl, gr1) subsetByOverlaps(grl, gr1) countOverlaps(grl, gr1, type="start") findOverlaps(grl, gr1, type="start") subsetByOverlaps(grl, gr1, type="start") findOverlaps(grl, gr1, select="first") ## Overlaps between two GRangesList objects: countOverlaps(grl, rev(grl)) findOverlaps(grl, rev(grl)) subsetByOverlaps(grl, rev(grl)) ## --------------------------------------------------------------------- ## INTERPRETATION OF 'minoverlap' WHEN 'query' OR 'subject' IS A ## GRangesList OBJECT ## --------------------------------------------------------------------- gr1 <- GRanges("chr5:1-26") gr2 <- GRanges("chr5:31-40") gr3 <- c(GRanges("chr5:11-20"), gr2) grl123 <- GRangesList(gr1, gr2, gr3) grl123 query <- GRanges("chr5:17-35") findOverlaps(query, grl123[[1]], minoverlap=8) # 1 hit findOverlaps(query, grl123[[2]], minoverlap=8) # no hit findOverlaps(query, grl123[[3]], minoverlap=8) # no hit ## Using GRangesList object 'grl123' as the subject: findOverlaps(query, grl123, minoverlap=8) ## As we can see, a hit is reported with the 3rd element in the subject. ## That's because the total number of positions in this overlap is 9: ## - positions 17 to 20 in the first range of grl123[[3]], so 4 positions ## - positions 31 to 35 in its second range, so 5 positions ## Sanity checks: hits <- findOverlaps(query, grl123[[1]], minoverlap=8) stopifnot(length(hits) == 1) hits <- findOverlaps(query, grl123[[2]], minoverlap=8) stopifnot(length(hits) == 0) hits <- findOverlaps(query, grl123[[3]], minoverlap=8) stopifnot(length(hits) == 0) hits <- findOverlaps(query, grl123, minoverlap=8) stopifnot(identical(subjectHits(hits), c(1L, 3L))) hits <- findOverlaps(query, grl123, minoverlap=9) stopifnot(identical(subjectHits(hits), c(1L, 3L))) hits <- findOverlaps(query, grl123, minoverlap=10) stopifnot(identical(subjectHits(hits), 1L)) hits <- findOverlaps(query, grl123, minoverlap=11) stopifnot(length(hits) == 0)
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## GRanges object: gr <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=head(letters,10)), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) gr ## GRangesList object: gr1 <- GRanges(seqnames="chr2", ranges=IRanges(4:3, 6), strand="+", score=5:4, GC=0.45) gr2 <- GRanges(seqnames=c("chr1", "chr1"), ranges=IRanges(c(7,13), width=3), strand=c("+", "-"), score=3:4, GC=c(0.3, 0.5)) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-"), score=c(6L, 2L), GC=c(0.4, 0.1)) grl <- GRangesList("gr1"=gr1, "gr2"=gr2, "gr3"=gr3) ## Overlapping two GRanges objects: table(!is.na(findOverlaps(gr, gr1, select="arbitrary"))) countOverlaps(gr, gr1) findOverlaps(gr, gr1) subsetByOverlaps(gr, gr1) countOverlaps(gr, gr1, type="start") findOverlaps(gr, gr1, type="start") subsetByOverlaps(gr, gr1, type="start") findOverlaps(gr, gr1, select="first") findOverlaps(gr, gr1, select="last") findOverlaps(gr1, gr) findOverlaps(gr1, gr, type="start") findOverlaps(gr1, gr, type="within") findOverlaps(gr1, gr, type="equal") ## --------------------------------------------------------------------- ## MORE EXAMPLES ## --------------------------------------------------------------------- table(!is.na(findOverlaps(gr, gr1, select="arbitrary"))) countOverlaps(gr, gr1) findOverlaps(gr, gr1) subsetByOverlaps(gr, gr1) ## Overlaps between a GRanges and a GRangesList object: table(!is.na(findOverlaps(grl, gr, select="first"))) countOverlaps(grl, gr) findOverlaps(grl, gr) subsetByOverlaps(grl, gr) countOverlaps(grl, gr, type="start") findOverlaps(grl, gr, type="start") subsetByOverlaps(grl, gr, type="start") findOverlaps(grl, gr, select="first") table(!is.na(findOverlaps(grl, gr1, select="first"))) countOverlaps(grl, gr1) findOverlaps(grl, gr1) subsetByOverlaps(grl, gr1) countOverlaps(grl, gr1, type="start") findOverlaps(grl, gr1, type="start") subsetByOverlaps(grl, gr1, type="start") findOverlaps(grl, gr1, select="first") ## Overlaps between two GRangesList objects: countOverlaps(grl, rev(grl)) findOverlaps(grl, rev(grl)) subsetByOverlaps(grl, rev(grl)) ## --------------------------------------------------------------------- ## INTERPRETATION OF 'minoverlap' WHEN 'query' OR 'subject' IS A ## GRangesList OBJECT ## --------------------------------------------------------------------- gr1 <- GRanges("chr5:1-26") gr2 <- GRanges("chr5:31-40") gr3 <- c(GRanges("chr5:11-20"), gr2) grl123 <- GRangesList(gr1, gr2, gr3) grl123 query <- GRanges("chr5:17-35") findOverlaps(query, grl123[[1]], minoverlap=8) # 1 hit findOverlaps(query, grl123[[2]], minoverlap=8) # no hit findOverlaps(query, grl123[[3]], minoverlap=8) # no hit ## Using GRangesList object 'grl123' as the subject: findOverlaps(query, grl123, minoverlap=8) ## As we can see, a hit is reported with the 3rd element in the subject. ## That's because the total number of positions in this overlap is 9: ## - positions 17 to 20 in the first range of grl123[[3]], so 4 positions ## - positions 31 to 35 in its second range, so 5 positions ## Sanity checks: hits <- findOverlaps(query, grl123[[1]], minoverlap=8) stopifnot(length(hits) == 1) hits <- findOverlaps(query, grl123[[2]], minoverlap=8) stopifnot(length(hits) == 0) hits <- findOverlaps(query, grl123[[3]], minoverlap=8) stopifnot(length(hits) == 0) hits <- findOverlaps(query, grl123, minoverlap=8) stopifnot(identical(subjectHits(hits), c(1L, 3L))) hits <- findOverlaps(query, grl123, minoverlap=9) stopifnot(identical(subjectHits(hits), c(1L, 3L))) hits <- findOverlaps(query, grl123, minoverlap=10) stopifnot(identical(subjectHits(hits), 1L)) hits <- findOverlaps(query, grl123, minoverlap=11) stopifnot(length(hits) == 0)
S4 generic functions for squeezing the genomic ranges out of a range-based object.
These are analog to range squeezers ranges
and
rglist
defined in the IRanges package, except
that granges
returns the ranges in a GRanges object (instead
of an IRanges object for ranges
),
and grglist
returns them in a GRangesList object (instead of
an IRangesList object for rglist
).
granges(x, use.names=TRUE, use.mcols=FALSE, ...) grglist(x, use.names=TRUE, use.mcols=FALSE, ...)
granges(x, use.names=TRUE, use.mcols=FALSE, ...) grglist(x, use.names=TRUE, use.mcols=FALSE, ...)
x |
An object containing genomic ranges e.g. a GenomicRanges, RangedSummarizedExperiment, GAlignments, GAlignmentPairs, or GAlignmentsList object, or a Pairs object containing genomic ranges. |
use.names , use.mcols , ...
|
See |
See ranges
in the IRanges package for
some details.
For some objects (e.g. GAlignments and
GAlignmentPairs objects defined in the
GenomicAlignments package), as(x, "GRanges")
and
as(x, "GRangesList")
, are equivalent to
granges(x, use.names=TRUE, use.mcols=TRUE)
and
grglist(x, use.names=TRUE, use.mcols=TRUE)
, respectively.
A GRanges object for granges
.
A GRangesList object for grglist
.
If x
is a vector-like object (e.g.
GAlignments), the returned object is expected
to be parallel to x
, that is, the i-th element in the output
corresponds to the i-th element in the input.
If use.names
is TRUE, then the names on x
(if any) are propagated to the returned object.
If use.mcols
is TRUE, then the metadata columns on x
(if any) are propagated to the returned object.
H. Pagès
GRanges and GRangesList objects.
RangedSummarizedExperiment objects in the SummarizedExperiment packages.
GAlignments, GAlignmentPairs, and GAlignmentsList objects in the GenomicAlignments package.
## See ?GAlignments in the GenomicAlignments package for examples of ## "ranges" and "rglist" methods.
## See ?GAlignments in the GenomicAlignments package for examples of ## "ranges" and "rglist" methods.
Methods for comparing and/or ordering GenomicRanges objects.
## duplicated() ## ------------ ## S4 method for signature 'GenomicRanges' duplicated(x, incomparables=FALSE, fromLast=FALSE, nmax=NA, method=c("auto", "quick", "hash")) ## match() & selfmatch() ## --------------------- ## S4 method for signature 'GenomicRanges,GenomicRanges' match(x, table, nomatch=NA_integer_, incomparables=NULL, method=c("auto", "quick", "hash"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' selfmatch(x, method=c("auto", "quick", "hash"), ignore.strand=FALSE) ## order() and related methods ## ---------------------------- ## S4 method for signature 'GenomicRanges' is.unsorted(x, na.rm=FALSE, strictly=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' order(..., na.last=TRUE, decreasing=FALSE, method=c("auto", "shell", "radix")) ## S4 method for signature 'GenomicRanges' sort(x, decreasing=FALSE, ignore.strand=FALSE, by) ## S4 method for signature 'GenomicRanges' rank(x, na.last=TRUE, ties.method=c("average", "first", "last", "random", "max", "min"), ignore.strand=FALSE) ## Generalized parallel comparison of 2 GenomicRanges objects ## ---------------------------------------------------------- ## S4 method for signature 'GenomicRanges,GenomicRanges' pcompare(x, y)
## duplicated() ## ------------ ## S4 method for signature 'GenomicRanges' duplicated(x, incomparables=FALSE, fromLast=FALSE, nmax=NA, method=c("auto", "quick", "hash")) ## match() & selfmatch() ## --------------------- ## S4 method for signature 'GenomicRanges,GenomicRanges' match(x, table, nomatch=NA_integer_, incomparables=NULL, method=c("auto", "quick", "hash"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' selfmatch(x, method=c("auto", "quick", "hash"), ignore.strand=FALSE) ## order() and related methods ## ---------------------------- ## S4 method for signature 'GenomicRanges' is.unsorted(x, na.rm=FALSE, strictly=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' order(..., na.last=TRUE, decreasing=FALSE, method=c("auto", "shell", "radix")) ## S4 method for signature 'GenomicRanges' sort(x, decreasing=FALSE, ignore.strand=FALSE, by) ## S4 method for signature 'GenomicRanges' rank(x, na.last=TRUE, ties.method=c("average", "first", "last", "random", "max", "min"), ignore.strand=FALSE) ## Generalized parallel comparison of 2 GenomicRanges objects ## ---------------------------------------------------------- ## S4 method for signature 'GenomicRanges,GenomicRanges' pcompare(x, y)
x , table , y
|
GenomicRanges objects. |
incomparables |
Not supported. |
fromLast , method , nomatch , nmax , na.rm , strictly , na.last , decreasing
|
See |
ignore.strand |
Whether or not the strand should be ignored when comparing 2 genomic ranges. |
... |
One or more GenomicRanges objects. The GenomicRanges objects after the first one are used to break ties. |
ties.method |
A character string specifying how ties are treated. Only |
by |
An optional formula that is resolved against |
Two elements of a GenomicRanges derivative (i.e. two genomic ranges)
are considered equal iff they are on the same underlying sequence and strand,
and share the same start and width. duplicated()
and unique()
on a GenomicRanges derivative are conforming to this.
The "natural order" for the elements of a GenomicRanges derivative
is to order them (a) first by sequence level, (b) then by strand, (c) then
by start, (d) and finally by width.
This way, the space of genomic ranges is totally ordered.
Note that, because we already do (c) and (d) for regular ranges (see
?`IPosRanges-comparison`
), genomic ranges that
belong to the same underlying sequence and strand are ordered like regular
ranges.
pcompare()
, ==
, !=
, <=
, >=
, <
and >
on GenomicRanges derivatives behave accordingly to this
"natural order".
is.unsorted()
, order()
, sort()
, rank()
on
GenomicRanges derivatives also behave accordingly to this
"natural order".
Finally, note that some inter range transformations like
reduce
or disjoin
also use this "natural order" implicitly when operating on
GenomicRanges derivatives.
H. Pagès, is.unsorted
contributed by Pete Hickey
The GenomicRanges class.
IPosRanges-comparison in the IRanges package for comparing and ordering genomic ranges.
findOverlaps-methods for finding overlapping genomic ranges.
intra-range-methods and inter-range-methods for intra range and inter range transformations of a GRanges object.
setops-methods for set operations on GenomicRanges objects.
gr0 <- GRanges( Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(c(1:9,7L), end=10), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13) ) gr <- c(gr0, gr0[7:3]) names(gr) <- LETTERS[seq_along(gr)] ## --------------------------------------------------------------------- ## A. ELEMENT-WISE (AKA "PARALLEL") COMPARISON OF 2 GenomicRanges OBJECTS ## --------------------------------------------------------------------- gr[2] == gr[2] # TRUE gr[2] == gr[5] # FALSE gr == gr[4] gr >= gr[3] ## --------------------------------------------------------------------- ## B. match(), selfmatch(), %in%, duplicated(), unique() ## --------------------------------------------------------------------- table <- gr[1:7] match(gr, table) match(gr, table, ignore.strand=TRUE) gr %in% table duplicated(gr) unique(gr) ## --------------------------------------------------------------------- ## C. findMatches(), countMatches() ## --------------------------------------------------------------------- findMatches(gr, table) countMatches(gr, table) findMatches(gr, table, ignore.strand=TRUE) countMatches(gr, table, ignore.strand=TRUE) gr_levels <- unique(gr) countMatches(gr_levels, gr) ## --------------------------------------------------------------------- ## D. order() AND RELATED METHODS ## --------------------------------------------------------------------- is.unsorted(gr) order(gr) sort(gr) is.unsorted(sort(gr)) is.unsorted(gr, ignore.strand=TRUE) gr2 <- sort(gr, ignore.strand=TRUE) is.unsorted(gr2) # TRUE is.unsorted(gr2, ignore.strand=TRUE) # FALSE ## TODO: Broken. Please fix! #sort(gr, by = ~ seqnames + start + end) # equivalent to (but slower than) above score(gr) <- rev(seq_len(length(gr))) ## TODO: Broken. Please fix! #sort(gr, by = ~ score) rank(gr, ties.method="first") rank(gr, ties.method="first", ignore.strand=TRUE) ## --------------------------------------------------------------------- ## E. GENERALIZED ELEMENT-WISE COMPARISON OF 2 GenomicRanges OBJECTS ## --------------------------------------------------------------------- gr3 <- GRanges(c(rep("chr1", 12), "chr2"), IRanges(c(1:11, 6:7), width=3)) strand(gr3)[12] <- "+" gr4 <- GRanges("chr1", IRanges(5, 9)) pcompare(gr3, gr4) rangeComparisonCodeToLetter(pcompare(gr3, gr4))
gr0 <- GRanges( Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(c(1:9,7L), end=10), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13) ) gr <- c(gr0, gr0[7:3]) names(gr) <- LETTERS[seq_along(gr)] ## --------------------------------------------------------------------- ## A. ELEMENT-WISE (AKA "PARALLEL") COMPARISON OF 2 GenomicRanges OBJECTS ## --------------------------------------------------------------------- gr[2] == gr[2] # TRUE gr[2] == gr[5] # FALSE gr == gr[4] gr >= gr[3] ## --------------------------------------------------------------------- ## B. match(), selfmatch(), %in%, duplicated(), unique() ## --------------------------------------------------------------------- table <- gr[1:7] match(gr, table) match(gr, table, ignore.strand=TRUE) gr %in% table duplicated(gr) unique(gr) ## --------------------------------------------------------------------- ## C. findMatches(), countMatches() ## --------------------------------------------------------------------- findMatches(gr, table) countMatches(gr, table) findMatches(gr, table, ignore.strand=TRUE) countMatches(gr, table, ignore.strand=TRUE) gr_levels <- unique(gr) countMatches(gr_levels, gr) ## --------------------------------------------------------------------- ## D. order() AND RELATED METHODS ## --------------------------------------------------------------------- is.unsorted(gr) order(gr) sort(gr) is.unsorted(sort(gr)) is.unsorted(gr, ignore.strand=TRUE) gr2 <- sort(gr, ignore.strand=TRUE) is.unsorted(gr2) # TRUE is.unsorted(gr2, ignore.strand=TRUE) # FALSE ## TODO: Broken. Please fix! #sort(gr, by = ~ seqnames + start + end) # equivalent to (but slower than) above score(gr) <- rev(seq_len(length(gr))) ## TODO: Broken. Please fix! #sort(gr, by = ~ score) rank(gr, ties.method="first") rank(gr, ties.method="first", ignore.strand=TRUE) ## --------------------------------------------------------------------- ## E. GENERALIZED ELEMENT-WISE COMPARISON OF 2 GenomicRanges OBJECTS ## --------------------------------------------------------------------- gr3 <- GRanges(c(rep("chr1", 12), "chr2"), IRanges(c(1:11, 6:7), width=3)) strand(gr3)[12] <- "+" gr4 <- GRanges("chr1", IRanges(5, 9)) pcompare(gr3, gr4) rangeComparisonCodeToLetter(pcompare(gr3, gr4))
The GenomicRangesList virtual class is a general container for storing a list of GenomicRanges objects.
Most users are probably more interested in the GRangesList container, a GenomicRangesList derivative for storing a list of GRanges objects.
The place of GenomicRangesList in the Vector class hierarchy:
Vector ^ | List ^ | RangesList ^ ^ / \ / \ / \ / \ / \ / \ IntegerRangesList GenomicRangesList ^ ^ | | IRangesList GRangesList ^ ^ ^ ^ / \ / \ / \ / \ / \ / \ SimpleIRangesList \ SimpleGRangesList \ CompressedIRangesList CompressedGRangesList
Note that the Vector class hierarchy has many more classes. In particular Vector, List, IRangesList, and IntegerRangesList have other subclasses not shown here.
H. Pagès and M. Lawrence
GRangesList objects.
GenomicRanges and GRanges objects.
A genomic variable is a variable defined along a genome. Here are 2 ways a genomic variable is generally represented in Bioconductor:
as a named RleList object with one list element per chromosome;
as a metadata column on a disjoint GRanges object.
This man page documents tools for switching from one form to the other.
bindAsGRanges(...) mcolAsRleList(x, varname) binnedAverage(bins, numvar, varname, na.rm=FALSE)
bindAsGRanges(...) mcolAsRleList(x, varname) binnedAverage(bins, numvar, varname, na.rm=FALSE)
... |
One or more genomic variables in the form of named RleList objects. |
x |
A disjoint GRanges object with metadata columns on it.
A GRanges object is said to be disjoint if it contains
ranges that do not overlap with each other. This can be tested with
|
varname |
The name of the genomic variable. For For |
bins |
A GRanges object representing the genomic bins. Typically
obtained by calling |
numvar |
A named RleList object representing a numerical variable
defined along the genome covered by |
na.rm |
A logical value indicating whether |
bindAsGRanges
allows to switch the representation of one or
more genomic variables from the named RleList form to the
metadata column on a disjoint GRanges object form by binding
the supplied named RleList objects together and putting them
on the same GRanges object. This transformation is lossless.
mcolAsRleList
performs the opposite transformation and is also
lossless (however the circularity flags and genome information in
seqinfo(x)
won't propagate). It works for any metadata column on
x
that can be put in Rle form i.e. that is an
atomic vector or a factor.
binnedAverage
computes the binned average of a numerical variable
defined along a genome.
For bindAsGRanges
: a GRanges object with 1 metadata column
per supplied genomic variable.
For mcolAsRleList
: a named RleList object with
1 list element per seqlevel in x
.
For binnedAverage
: input GRanges object bins
with
an additional metadata column named varname
containing the binned
average.
H. Pagès
RleList objects in the IRanges package.
coverage,GenomicRanges-method for computing the coverage of a GRanges object.
The tileGenome
function for putting tiles on a
genome.
GRanges objects and isDisjoint,GenomicRanges-method
for the isDisjoint
method for GenomicRanges objects.
## --------------------------------------------------------------------- ## A. TWO WAYS TO REPRESENT A GENOMIC VARIABLE ## ----------------------------------------------------------------- ## 1) As a named RleList object ## ---------------------------- ## Let's create a genomic variable in the "named RleList" form: library(BSgenome.Scerevisiae.UCSC.sacCer2) set.seed(55) my_var <- RleList( lapply(seqlengths(Scerevisiae), function(seqlen) { tmp <- sample(50L, seqlen, replace=TRUE) Rle(cumsum(tmp - rev(tmp))) } ), compress=FALSE) my_var ## 2) As a metadata column on a disjoint GRanges object ## ---------------------------------------------------- gr1 <- bindAsGRanges(my_var=my_var) gr1 gr2 <- GRanges(c("chrI:1-150", "chrI:211-285", "chrI:291-377", "chrV:51-60"), score=c(0.4, 8, -10, 2.2), id=letters[1:4], seqinfo=seqinfo(Scerevisiae)) gr2 ## Going back to the "named RleList" form: mcolAsRleList(gr1, "my_var") score <- mcolAsRleList(gr2, "score") score id <- mcolAsRleList(gr2, "id") id bindAsGRanges(score=score, id=id) ## Bind 'my_var', 'score', and 'id' together: gr3 <- bindAsGRanges(my_var=my_var, score=score, id=id) ## Sanity checks: stopifnot(identical(my_var, mcolAsRleList(gr3, "my_var"))) stopifnot(identical(score, mcolAsRleList(gr3, "score"))) stopifnot(identical(id, mcolAsRleList(gr3, "id"))) gr2b <- bindAsGRanges(score=score, id=id) seqinfo(gr2b) <- seqinfo(gr2) stopifnot(identical(gr2, gr2b)) ## --------------------------------------------------------------------- ## B. BIND TOGETHER THE COVERAGES OF SEVERAL BAM FILES ## --------------------------------------------------------------------- library(pasillaBamSubset) library(GenomicAlignments) untreated1_cvg <- coverage(BamFile(untreated1_chr4())) untreated3_cvg <- coverage(BamFile(untreated3_chr4())) all_cvg <- bindAsGRanges(untreated1=untreated1_cvg, untreated3=untreated3_cvg) ## Keep regions with coverage: all_cvg[with(mcols(all_cvg), untreated1 + untreated3 >= 1)] ## Plot the coverage profiles with the Gviz package: library(Gviz) plotNumvars <- function(numvars, region, name="numvars", ...) { stopifnot(is(numvars, "GRanges")) stopifnot(is(region, "GRanges"), length(region) == 1L) gtrack <- GenomeAxisTrack() dtrack <- DataTrack(numvars, chromosome=as.character(seqnames(region)), name=name, groups=colnames(mcols(numvars)), type="l", ...) plotTracks(list(gtrack, dtrack), from=start(region), to=end(region)) } plotNumvars(all_cvg, GRanges("chr4:1-25000"), "coverage", col=c("red", "blue")) plotNumvars(all_cvg, GRanges("chr4:1.03e6-1.08e6"), "coverage", col=c("red", "blue")) ## Sanity checks: stopifnot(identical(untreated1_cvg, mcolAsRleList(all_cvg, "untreated1"))) stopifnot(identical(untreated3_cvg, mcolAsRleList(all_cvg, "untreated3"))) ## --------------------------------------------------------------------- ## C. COMPUTE THE BINNED AVERAGE OF A NUMERICAL VARIABLE DEFINED ALONG A ## GENOME ## --------------------------------------------------------------------- ## In some applications (e.g. visualization), there is the need to compute ## the average of a genomic variable for a set of predefined fixed-width ## regions (sometimes called "bins"). ## Let's use tileGenome() to create such a set of bins: bins1 <- tileGenome(seqinfo(Scerevisiae), tilewidth=100, cut.last.tile.in.chrom=TRUE) ## Compute the binned average for 'my_var' and 'score': bins1 <- binnedAverage(bins1, my_var, "binned_var") bins1 bins1 <- binnedAverage(bins1, score, "binned_score") bins1 ## Binned average in "named RleList" form: binned_var1 <- mcolAsRleList(bins1, "binned_var") binned_var1 stopifnot(all.equal(mean(my_var), mean(binned_var1))) # sanity check mcolAsRleList(bins1, "binned_score") ## With bigger bins: bins2 <- tileGenome(seqinfo(Scerevisiae), tilewidth=50000, cut.last.tile.in.chrom=TRUE) bins2 <- binnedAverage(bins2, my_var, "binned_var") bins2 <- binnedAverage(bins2, score, "binned_score") bins2 binned_var2 <- mcolAsRleList(bins2, "binned_var") binned_var2 stopifnot(all.equal(mean(my_var), mean(binned_var2))) # sanity check mcolAsRleList(bins2, "binned_score") ## Not surprisingly, the "binned" variables are much more compact in ## memory than the original variables (they contain much less runs): object.size(my_var) object.size(binned_var1) object.size(binned_var2) ## --------------------------------------------------------------------- ## D. SANITY CHECKS ## --------------------------------------------------------------------- bins3 <- tileGenome(c(chr1=10, chr2=8), tilewidth=5, cut.last.tile.in.chrom=TRUE) my_var3 <- RleList(chr1=Rle(c(1:3, NA, 5:7)), chr2=Rle(c(-3, NA, -3, NaN))) bins3 <- binnedAverage(bins3, my_var3, "binned_var3", na.rm=TRUE) binned_var3 <- mcols(bins3)$binned_var3 stopifnot( identical(mean(my_var3$chr1[1:5], na.rm=TRUE), binned_var3[1]), identical(mean(c(my_var3$chr1, 0, 0, 0)[6:10], na.rm=TRUE), binned_var3[2]), #identical(mean(c(my_var3$chr2, 0), na.rm=TRUE), # binned_var3[3]), identical(0, binned_var3[4]) )
## --------------------------------------------------------------------- ## A. TWO WAYS TO REPRESENT A GENOMIC VARIABLE ## ----------------------------------------------------------------- ## 1) As a named RleList object ## ---------------------------- ## Let's create a genomic variable in the "named RleList" form: library(BSgenome.Scerevisiae.UCSC.sacCer2) set.seed(55) my_var <- RleList( lapply(seqlengths(Scerevisiae), function(seqlen) { tmp <- sample(50L, seqlen, replace=TRUE) Rle(cumsum(tmp - rev(tmp))) } ), compress=FALSE) my_var ## 2) As a metadata column on a disjoint GRanges object ## ---------------------------------------------------- gr1 <- bindAsGRanges(my_var=my_var) gr1 gr2 <- GRanges(c("chrI:1-150", "chrI:211-285", "chrI:291-377", "chrV:51-60"), score=c(0.4, 8, -10, 2.2), id=letters[1:4], seqinfo=seqinfo(Scerevisiae)) gr2 ## Going back to the "named RleList" form: mcolAsRleList(gr1, "my_var") score <- mcolAsRleList(gr2, "score") score id <- mcolAsRleList(gr2, "id") id bindAsGRanges(score=score, id=id) ## Bind 'my_var', 'score', and 'id' together: gr3 <- bindAsGRanges(my_var=my_var, score=score, id=id) ## Sanity checks: stopifnot(identical(my_var, mcolAsRleList(gr3, "my_var"))) stopifnot(identical(score, mcolAsRleList(gr3, "score"))) stopifnot(identical(id, mcolAsRleList(gr3, "id"))) gr2b <- bindAsGRanges(score=score, id=id) seqinfo(gr2b) <- seqinfo(gr2) stopifnot(identical(gr2, gr2b)) ## --------------------------------------------------------------------- ## B. BIND TOGETHER THE COVERAGES OF SEVERAL BAM FILES ## --------------------------------------------------------------------- library(pasillaBamSubset) library(GenomicAlignments) untreated1_cvg <- coverage(BamFile(untreated1_chr4())) untreated3_cvg <- coverage(BamFile(untreated3_chr4())) all_cvg <- bindAsGRanges(untreated1=untreated1_cvg, untreated3=untreated3_cvg) ## Keep regions with coverage: all_cvg[with(mcols(all_cvg), untreated1 + untreated3 >= 1)] ## Plot the coverage profiles with the Gviz package: library(Gviz) plotNumvars <- function(numvars, region, name="numvars", ...) { stopifnot(is(numvars, "GRanges")) stopifnot(is(region, "GRanges"), length(region) == 1L) gtrack <- GenomeAxisTrack() dtrack <- DataTrack(numvars, chromosome=as.character(seqnames(region)), name=name, groups=colnames(mcols(numvars)), type="l", ...) plotTracks(list(gtrack, dtrack), from=start(region), to=end(region)) } plotNumvars(all_cvg, GRanges("chr4:1-25000"), "coverage", col=c("red", "blue")) plotNumvars(all_cvg, GRanges("chr4:1.03e6-1.08e6"), "coverage", col=c("red", "blue")) ## Sanity checks: stopifnot(identical(untreated1_cvg, mcolAsRleList(all_cvg, "untreated1"))) stopifnot(identical(untreated3_cvg, mcolAsRleList(all_cvg, "untreated3"))) ## --------------------------------------------------------------------- ## C. COMPUTE THE BINNED AVERAGE OF A NUMERICAL VARIABLE DEFINED ALONG A ## GENOME ## --------------------------------------------------------------------- ## In some applications (e.g. visualization), there is the need to compute ## the average of a genomic variable for a set of predefined fixed-width ## regions (sometimes called "bins"). ## Let's use tileGenome() to create such a set of bins: bins1 <- tileGenome(seqinfo(Scerevisiae), tilewidth=100, cut.last.tile.in.chrom=TRUE) ## Compute the binned average for 'my_var' and 'score': bins1 <- binnedAverage(bins1, my_var, "binned_var") bins1 bins1 <- binnedAverage(bins1, score, "binned_score") bins1 ## Binned average in "named RleList" form: binned_var1 <- mcolAsRleList(bins1, "binned_var") binned_var1 stopifnot(all.equal(mean(my_var), mean(binned_var1))) # sanity check mcolAsRleList(bins1, "binned_score") ## With bigger bins: bins2 <- tileGenome(seqinfo(Scerevisiae), tilewidth=50000, cut.last.tile.in.chrom=TRUE) bins2 <- binnedAverage(bins2, my_var, "binned_var") bins2 <- binnedAverage(bins2, score, "binned_score") bins2 binned_var2 <- mcolAsRleList(bins2, "binned_var") binned_var2 stopifnot(all.equal(mean(my_var), mean(binned_var2))) # sanity check mcolAsRleList(bins2, "binned_score") ## Not surprisingly, the "binned" variables are much more compact in ## memory than the original variables (they contain much less runs): object.size(my_var) object.size(binned_var1) object.size(binned_var2) ## --------------------------------------------------------------------- ## D. SANITY CHECKS ## --------------------------------------------------------------------- bins3 <- tileGenome(c(chr1=10, chr2=8), tilewidth=5, cut.last.tile.in.chrom=TRUE) my_var3 <- RleList(chr1=Rle(c(1:3, NA, 5:7)), chr2=Rle(c(-3, NA, -3, NaN))) bins3 <- binnedAverage(bins3, my_var3, "binned_var3", na.rm=TRUE) binned_var3 <- mcols(bins3)$binned_var3 stopifnot( identical(mean(my_var3$chr1[1:5], na.rm=TRUE), binned_var3[1]), identical(mean(c(my_var3$chr1, 0, 0, 0)[6:10], na.rm=TRUE), binned_var3[2]), #identical(mean(c(my_var3$chr2, 0), na.rm=TRUE), # binned_var3[3]), identical(0, binned_var3[4]) )
The GNCList class is a container for storing the Nested Containment List
representation of a vector of genomic ranges (typically represented as
a GRanges object).
To preprocess a GRanges object, simply call the GNCList
constructor function on it. The resulting GNCList object can then be used
for efficient overlap-based operations on the genomic ranges.
GNCList(x)
GNCList(x)
x |
The GRanges (or more generally GenomicRanges) object to preprocess. |
The IRanges package also defines the NCList
and NCLists
constructors and classes for
preprocessing and representing a IntegerRanges or
IntegerRangesList object as a data structure based on
Nested Containment Lists.
Note that GNCList objects (introduced in BioC 3.1) are replacements for GIntervalTree objects (BioC < 3.1).
See ?NCList
in the IRanges package for
some important differences between the new algorithm based on Nested
Containment Lists and the old algorithm based on interval trees.
In particular, the new algorithm supports preprocessing of a
GenomicRanges object with ranges defined on circular sequences
(e.g. on the mitochnodrial chromosome). See below for some examples.
A GNCList object.
H. Pagès
Alexander V. Alekseyenko and Christopher J. Lee – Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases. Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647
The NCList
and NCLists
constructors and classs defined in the IRanges package.
findOverlaps
for finding/counting interval overlaps
between two range-based objects.
GRanges objects.
## The examples below are for illustration purpose only and do NOT ## reflect typical usage. This is because, for a one time use, it is ## NOT advised to explicitely preprocess the input for findOverlaps() ## or countOverlaps(). These functions will take care of it and do a ## better job at it (by preprocessing only what's needed when it's ## needed, and release memory as they go). ## --------------------------------------------------------------------- ## PREPROCESS QUERY OR SUBJECT ## --------------------------------------------------------------------- query <- GRanges(Rle(c("chrM", "chr1", "chrM", "chr1"), 4:1), IRanges(1:10, width=5), strand=rep(c("+", "-"), 5)) subject <- GRanges(Rle(c("chr1", "chr2", "chrM"), 3:1), IRanges(6:1, width=5), strand="+") ## Either the query or the subject of findOverlaps() can be preprocessed: ppsubject <- GNCList(subject) hits1a <- findOverlaps(query, ppsubject) hits1a hits1b <- findOverlaps(query, ppsubject, ignore.strand=TRUE) hits1b ppquery <- GNCList(query) hits2a <- findOverlaps(ppquery, subject) hits2a hits2b <- findOverlaps(ppquery, subject, ignore.strand=TRUE) hits2b ## Note that 'hits1a' and 'hits2a' contain the same hits but not ## necessarily in the same order. stopifnot(identical(sort(hits1a), sort(hits2a))) ## Same for 'hits1b' and 'hits2b'. stopifnot(identical(sort(hits1b), sort(hits2b))) ## --------------------------------------------------------------------- ## WITH CIRCULAR SEQUENCES ## --------------------------------------------------------------------- seqinfo <- Seqinfo(c("chr1", "chr2", "chrM"), seqlengths=c(100, 50, 10), isCircular=c(FALSE, FALSE, TRUE)) seqinfo(query) <- seqinfo[seqlevels(query)] seqinfo(subject) <- seqinfo[seqlevels(subject)] ppsubject <- GNCList(subject) hits3 <- findOverlaps(query, ppsubject) hits3 ## Circularity introduces more hits: stopifnot(all(hits1a %in% hits3)) new_hits <- setdiff(hits3, hits1a) new_hits # 1 new hit query[queryHits(new_hits)] subject[subjectHits(new_hits)] # positions 11:13 on chrM are the same # as positions 1:3 ## Sanity checks: stopifnot(identical(new_hits, Hits(9, 6, 10, 6, sort.by.query=TRUE))) ppquery <- GNCList(query) hits4 <- findOverlaps(ppquery, subject) stopifnot(identical(sort(hits3), sort(hits4)))
## The examples below are for illustration purpose only and do NOT ## reflect typical usage. This is because, for a one time use, it is ## NOT advised to explicitely preprocess the input for findOverlaps() ## or countOverlaps(). These functions will take care of it and do a ## better job at it (by preprocessing only what's needed when it's ## needed, and release memory as they go). ## --------------------------------------------------------------------- ## PREPROCESS QUERY OR SUBJECT ## --------------------------------------------------------------------- query <- GRanges(Rle(c("chrM", "chr1", "chrM", "chr1"), 4:1), IRanges(1:10, width=5), strand=rep(c("+", "-"), 5)) subject <- GRanges(Rle(c("chr1", "chr2", "chrM"), 3:1), IRanges(6:1, width=5), strand="+") ## Either the query or the subject of findOverlaps() can be preprocessed: ppsubject <- GNCList(subject) hits1a <- findOverlaps(query, ppsubject) hits1a hits1b <- findOverlaps(query, ppsubject, ignore.strand=TRUE) hits1b ppquery <- GNCList(query) hits2a <- findOverlaps(ppquery, subject) hits2a hits2b <- findOverlaps(ppquery, subject, ignore.strand=TRUE) hits2b ## Note that 'hits1a' and 'hits2a' contain the same hits but not ## necessarily in the same order. stopifnot(identical(sort(hits1a), sort(hits2a))) ## Same for 'hits1b' and 'hits2b'. stopifnot(identical(sort(hits1b), sort(hits2b))) ## --------------------------------------------------------------------- ## WITH CIRCULAR SEQUENCES ## --------------------------------------------------------------------- seqinfo <- Seqinfo(c("chr1", "chr2", "chrM"), seqlengths=c(100, 50, 10), isCircular=c(FALSE, FALSE, TRUE)) seqinfo(query) <- seqinfo[seqlevels(query)] seqinfo(subject) <- seqinfo[seqlevels(subject)] ppsubject <- GNCList(subject) hits3 <- findOverlaps(query, ppsubject) hits3 ## Circularity introduces more hits: stopifnot(all(hits1a %in% hits3)) new_hits <- setdiff(hits3, hits1a) new_hits # 1 new hit query[queryHits(new_hits)] subject[subjectHits(new_hits)] # positions 11:13 on chrM are the same # as positions 1:3 ## Sanity checks: stopifnot(identical(new_hits, Hits(9, 6, 10, 6, sort.by.query=TRUE))) ppquery <- GNCList(query) hits4 <- findOverlaps(ppquery, subject) stopifnot(identical(sort(hits3), sort(hits4)))
The GPos class is a container for storing a set of genomic positions (a.k.a. genomic loci). It exists in 2 flavors: UnstitchedGPos and StitchedGPos. Each flavor uses a particular internal representation:
In an UnstitchedGPos instance the positions are stored as an integer vector.
In a StitchedGPos instance the positions are stored as an IRanges object where each range represents a run of consecutive positions (i.e. a run of positions that are adjacent and in ascending order). This storage is particularly memory-efficient when the vector of positions contains long runs of consecutive positions.
Because genomic positions can be seen as genomic ranges of width 1, the GPos class extends the GenomicRanges virtual class (via the GRanges class).
## Constructor function GPos(seqnames=NULL, pos=NULL, strand=NULL, ..., seqinfo=NULL, seqlengths=NULL, stitch=NA)
## Constructor function GPos(seqnames=NULL, pos=NULL, strand=NULL, ..., seqinfo=NULL, seqlengths=NULL, stitch=NA)
seqnames , strand , ... , seqinfo , seqlengths
|
See documentation of the |
pos |
When |
stitch |
Controls which internal representation should be used: StitchedGPos
(when When |
Even though a GRanges object can be used for storing genomic positions, using a GPos object is more efficient. In particular the memory footprint of an UnstitchedGPos object is typically about half that of a GRanges object.
OTOH the memory footprint of a StitchedGPos object can vary a lot but will never be worse than that of a GRanges object. However it will reduce dramatically if the vector of positions contains long runs of consecutive positions. In the worst case scenario (i.e. when the object contains no consecutive positions) its memory footprint will be the same as that of a GRanges object.
Like for any Vector derivative, the length of a
GPos object cannot exceed .Machine$integer.max
(i.e. 2^31 on
most platforms). GPos()
will return an error if pos
contains too many positions.
An UnstitchedGPos or StitchedGPos object.
GPos objects support the same set of getters as other GenomicRanges
derivatives (i.e. seqnames()
, ranges()
, start()
,
end()
, strand()
, mcols()
, seqinfo()
,
etc...), plus the pos()
getter which is equivalent to
start()
or end()
. See ?GenomicRanges
for the
list of getters supported by GenomicRanges derivatives.
Note that ranges()
returns an IPos derivative
instead of the IRanges object that one gets with other
GenomicRanges derivatives. To get an IRanges
object, you need to call ranges()
again on this
IPos derivative i.e. do ranges(ranges(x))
on GPos object x
.
Like GRanges objects, GPos derivatives support the following setters:
The seqnames()
and strand()
setters.
The names()
, mcols()
and metadata()
setters.
The family of setters that operate on the seqinfo
component of an object:
seqlevels()
,
seqlevelsStyle()
,
seqlengths()
,
isCircular()
,
genome()
,
and seqinfo()
.
These setters are defined and documented in the GenomeInfoDb
package.
However, there is no pos()
setter for GPos derivatives at the
moment (although one might be added in the future).
From UnstitchedGPos to StitchedGPos and vice-versa: coercion back and
forth between UnstitchedGPos and StitchedGPos is supported via
as(x, "StitchedGPos")
and as(x, "UnstitchedGPos")
.
This is the most efficient and recommended way to switch between the
2 internal representations. Note that this switch can have dramatic
consequences on memory usage so is for advanced users only.
End users should almost never need to do this switch when following
a typical workflow.
From GenomicRanges to UnstitchedGPos, StitchedGPos, or GPos:
A GenomicRanges derivative x
in which all the ranges have
a width of 1 can be coerced to an UnstitchedGPos or StitchedGPos object
with as(x, "UnstitchedGPos")
or as(x, "StitchedGPos")
,
respectively.
For convenience as(x, "GPos")
is supported and is equivalent to
as(x, "UnstitchedGPos")
.
From GPos to GRanges:
A GPos derivative x
can be coerced to a GRanges object
with as(x, "GRanges")
. However be aware that the resulting object
can use thousands times (or more) memory than x
!
See "MEMORY USAGE" in the Examples section below.
From GPos to ordinary R objects:
Like with any other GenomicRanges derivative, as.character()
,
as.factor()
, and as.data.frame()
work on a GPos derivative
x
. Note however that as.data.frame(x)
returns a data frame
with a pos
column (containing pos(x)
) instead of the
start
, end
, and width
columns that one gets with other
GenomicRanges derivatives.
A GPos derivative can be subsetted exactly like a GRanges object.
GPos derivatives can be concatenated with c()
or append()
.
See ?c
in the S4Vectors package for
more information about concatenating Vector derivatives.
Like with any other GRanges object, split()
and relist()
work on a GPos derivative.
Internal representation of GPos objects has changed in GenomicRanges
1.29.10 (Bioc 3.6). Update any old object x
with:
x <- updateObject(x, verbose=TRUE)
.
Hervé Pagès; based on ideas borrowed from Georg Stricker [email protected] and Julien Gagneur [email protected]
The IPos class in the IRanges package for storing a set of integer positions (i.e. integer ranges of width 1).
The GRanges class for storing a set of genomic ranges of arbitrary width.
Seqinfo objects and the
seqinfo
accessor and family in the
GenomeInfoDb package for accessing/modifying information
about the underlying sequences of a GenomicRanges derivative.
GenomicRanges-comparison for comparing and ordering genomic ranges and/or positions.
findOverlaps-methods for finding overlapping genomic ranges and/or positions.
intra-range-methods and inter-range-methods for intra range and inter range transformations of GenomicRanges derivatives.
coverage-methods for computing the coverage of a set of genomic ranges and/or positions.
nearest-methods for finding the nearest genomic range/position neighbor.
The snpsBySeqname
,
snpsByOverlaps
, and
snpsById
methods for
SNPlocs objects defined in the BSgenome
package for extractors that return a GPos derivative.
SummarizedExperiment objects and derivatives in the SummarizedExperiment package.
showClass("GPos") # shows the known subclasses ## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## Example 1: gpos1a <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)), pos=c(44:53, 5:10, 2:5)) gpos1a # unstitched length(gpos1a) seqnames(gpos1a) pos(gpos1a) # same as 'start(gpos1a)' and 'end(gpos1a)' strand(gpos1a) as.character(gpos1a) as.data.frame(gpos1a) as(gpos1a, "GRanges") as.data.frame(as(gpos1a, "GRanges")) gpos1a[9:17] gpos1b <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)), pos=c(44:53, 5:10, 2:5), stitch=TRUE) gpos1b # stitched ## 'gpos1a' and 'gpos1b' are semantically equivalent, only their ## internal representations differ: all(gpos1a == gpos1b) gpos1c <- GPos(c("chr1:44-53", "chr2:5-10", "chr1:2-5")) gpos1c # stitched identical(gpos1b, gpos1c) ## Example 2: pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)), strand=c("*", "-", "-", "+")) gpos2 <- GPos(pos_runs) gpos2 # stitched strand(gpos2) ## Example 3: gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000")) npos <- length(gpos3A) mcols(gpos3A)$sample <- Rle("sA") sA_counts <- sample(10, npos, replace=TRUE) mcols(gpos3A)$counts <- sA_counts mcols(gpos3B)$sample <- Rle("sB") sB_counts <- sample(10, npos, replace=TRUE) mcols(gpos3B)$counts <- sB_counts gpos3 <- c(gpos3A, gpos3B) gpos3 ## Example 4: library(BSgenome.Scerevisiae.UCSC.sacCer2) genome <- BSgenome.Scerevisiae.UCSC.sacCer2 gpos4 <- GPos(seqinfo(genome)) gpos4 # all the positions along the genome are represented mcols(gpos4)$dna <- do.call("c", unname(as.list(genome))) gpos4 ## Note however that, like for any Vector derivative, the length of a ## GPos derivative cannot exceed '.Machine$integer.max' (i.e. 2^31 on ## most platforms) so the above only works with a "small" genome. ## For example it doesn't work with the Human genome: library(TxDb.Hsapiens.UCSC.hg38.knownGene) ## Not run: GPos(seqinfo(TxDb.Hsapiens.UCSC.hg38.knownGene)) # error! ## End(Not run) ## You can use isSmallGenome() to check upfront whether the genome is ## "small" or not. isSmallGenome(genome) # TRUE isSmallGenome(TxDb.Hsapiens.UCSC.hg38.knownGene) # FALSE ## --------------------------------------------------------------------- ## MEMORY USAGE ## --------------------------------------------------------------------- ## Coercion to GRanges works... gr4 <- as(gpos4, "GRanges") gr4 ## ... but is generally not a good idea: object.size(gpos4) object.size(gr4) # 8 times bigger than the StitchedGPos object! ## Shuffling the order of the positions impacts memory usage: gpos4r <- rev(gpos4) object.size(gpos4r) # significantly gpos4s <- sample(gpos4) object.size(gpos4s) # even worse! ## If one anticipates a lot of shuffling of the genomic positions, ## then an UnstitchedGPos object should be used instead: gpos4b <- as(gpos4, "UnstitchedGPos") object.size(gpos4b) # initial size is bigger than stitched version object.size(rev(gpos4b)) # size didn't change object.size(sample(gpos4b)) # size increased, but is still < stitched # version ## AN IMPORTANT NOTE: In the worst situations, GPos still performs as ## good as a GRanges object. object.size(as(gpos4r, "GRanges")) # same size as 'gpos4r' object.size(as(gpos4s, "GRanges")) # same size as 'gpos4s' ## Best case scenario is when the object is strictly sorted (i.e. ## positions are in strict ascending order). ## This can be checked with: is.unsorted(gpos4, strict=TRUE) # 'gpos4' is strictly sorted ## --------------------------------------------------------------------- ## USING MEMORY-EFFICIENT METADATA COLUMNS ## --------------------------------------------------------------------- ## In order to keep memory usage as low as possible, it is recommended ## to use a memory-efficient representation of the metadata columns that ## we want to set on the object. Rle's are particularly well suited for ## this, especially if the metadata columns contain long runs of ## identical values. This is the case for example if we want to use a ## GPos object to represent the coverage of sequencing reads along a ## genome. ## Example 5: library(pasillaBamSubset) library(Rsamtools) # for the BamFile() constructor function bamfile1 <- BamFile(untreated1_chr4()) bamfile2 <- BamFile(untreated3_chr4()) gpos5 <- GPos(seqinfo(bamfile1)) library(GenomicAlignments) # for "coverage" method for BamFile objects cvg1 <- unlist(coverage(bamfile1), use.names=FALSE) cvg2 <- unlist(coverage(bamfile2), use.names=FALSE) mcols(gpos5) <- DataFrame(cvg1, cvg2) gpos5 object.size(gpos5) # lightweight ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: gpos5[mcols(gpos5)$cvg1 >= 10 | mcols(gpos5)$cvg2 >= 10] ## --------------------------------------------------------------------- ## USING A GPos OBJECT IN A RangedSummarizedExperiment OBJECT ## --------------------------------------------------------------------- ## Because the GPos class extends the GenomicRanges virtual class, a ## GPos derivative can be used as the rowRanges component of a ## RangedSummarizedExperiment object. ## As a 1st example, we show how the counts for samples sA and sB in ## 'gpos3' can be stored in a SummarizedExperiment object where the rows ## correspond to unique genomic positions and the columns to samples: library(SummarizedExperiment) counts <- cbind(sA=sA_counts, sB=sB_counts) mcols(gpos3A) <- NULL rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A) rse3 rowRanges(rse3) head(assay(rse3)) ## Finally we show how the coverage data from Example 5 can be easily ## stored in a lightweight SummarizedExperiment derivative: cvg <- mcols(gpos5) mcols(gpos5) <- NULL rse5 <- SummarizedExperiment(list(cvg=cvg), rowRanges=gpos5) rse5 rowRanges(rse5) assay(rse5) ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: rse5[assay(rse5)$cvg1 >= 10 | assay(rse5)$cvg2 >= 10]
showClass("GPos") # shows the known subclasses ## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## Example 1: gpos1a <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)), pos=c(44:53, 5:10, 2:5)) gpos1a # unstitched length(gpos1a) seqnames(gpos1a) pos(gpos1a) # same as 'start(gpos1a)' and 'end(gpos1a)' strand(gpos1a) as.character(gpos1a) as.data.frame(gpos1a) as(gpos1a, "GRanges") as.data.frame(as(gpos1a, "GRanges")) gpos1a[9:17] gpos1b <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)), pos=c(44:53, 5:10, 2:5), stitch=TRUE) gpos1b # stitched ## 'gpos1a' and 'gpos1b' are semantically equivalent, only their ## internal representations differ: all(gpos1a == gpos1b) gpos1c <- GPos(c("chr1:44-53", "chr2:5-10", "chr1:2-5")) gpos1c # stitched identical(gpos1b, gpos1c) ## Example 2: pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)), strand=c("*", "-", "-", "+")) gpos2 <- GPos(pos_runs) gpos2 # stitched strand(gpos2) ## Example 3: gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000")) npos <- length(gpos3A) mcols(gpos3A)$sample <- Rle("sA") sA_counts <- sample(10, npos, replace=TRUE) mcols(gpos3A)$counts <- sA_counts mcols(gpos3B)$sample <- Rle("sB") sB_counts <- sample(10, npos, replace=TRUE) mcols(gpos3B)$counts <- sB_counts gpos3 <- c(gpos3A, gpos3B) gpos3 ## Example 4: library(BSgenome.Scerevisiae.UCSC.sacCer2) genome <- BSgenome.Scerevisiae.UCSC.sacCer2 gpos4 <- GPos(seqinfo(genome)) gpos4 # all the positions along the genome are represented mcols(gpos4)$dna <- do.call("c", unname(as.list(genome))) gpos4 ## Note however that, like for any Vector derivative, the length of a ## GPos derivative cannot exceed '.Machine$integer.max' (i.e. 2^31 on ## most platforms) so the above only works with a "small" genome. ## For example it doesn't work with the Human genome: library(TxDb.Hsapiens.UCSC.hg38.knownGene) ## Not run: GPos(seqinfo(TxDb.Hsapiens.UCSC.hg38.knownGene)) # error! ## End(Not run) ## You can use isSmallGenome() to check upfront whether the genome is ## "small" or not. isSmallGenome(genome) # TRUE isSmallGenome(TxDb.Hsapiens.UCSC.hg38.knownGene) # FALSE ## --------------------------------------------------------------------- ## MEMORY USAGE ## --------------------------------------------------------------------- ## Coercion to GRanges works... gr4 <- as(gpos4, "GRanges") gr4 ## ... but is generally not a good idea: object.size(gpos4) object.size(gr4) # 8 times bigger than the StitchedGPos object! ## Shuffling the order of the positions impacts memory usage: gpos4r <- rev(gpos4) object.size(gpos4r) # significantly gpos4s <- sample(gpos4) object.size(gpos4s) # even worse! ## If one anticipates a lot of shuffling of the genomic positions, ## then an UnstitchedGPos object should be used instead: gpos4b <- as(gpos4, "UnstitchedGPos") object.size(gpos4b) # initial size is bigger than stitched version object.size(rev(gpos4b)) # size didn't change object.size(sample(gpos4b)) # size increased, but is still < stitched # version ## AN IMPORTANT NOTE: In the worst situations, GPos still performs as ## good as a GRanges object. object.size(as(gpos4r, "GRanges")) # same size as 'gpos4r' object.size(as(gpos4s, "GRanges")) # same size as 'gpos4s' ## Best case scenario is when the object is strictly sorted (i.e. ## positions are in strict ascending order). ## This can be checked with: is.unsorted(gpos4, strict=TRUE) # 'gpos4' is strictly sorted ## --------------------------------------------------------------------- ## USING MEMORY-EFFICIENT METADATA COLUMNS ## --------------------------------------------------------------------- ## In order to keep memory usage as low as possible, it is recommended ## to use a memory-efficient representation of the metadata columns that ## we want to set on the object. Rle's are particularly well suited for ## this, especially if the metadata columns contain long runs of ## identical values. This is the case for example if we want to use a ## GPos object to represent the coverage of sequencing reads along a ## genome. ## Example 5: library(pasillaBamSubset) library(Rsamtools) # for the BamFile() constructor function bamfile1 <- BamFile(untreated1_chr4()) bamfile2 <- BamFile(untreated3_chr4()) gpos5 <- GPos(seqinfo(bamfile1)) library(GenomicAlignments) # for "coverage" method for BamFile objects cvg1 <- unlist(coverage(bamfile1), use.names=FALSE) cvg2 <- unlist(coverage(bamfile2), use.names=FALSE) mcols(gpos5) <- DataFrame(cvg1, cvg2) gpos5 object.size(gpos5) # lightweight ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: gpos5[mcols(gpos5)$cvg1 >= 10 | mcols(gpos5)$cvg2 >= 10] ## --------------------------------------------------------------------- ## USING A GPos OBJECT IN A RangedSummarizedExperiment OBJECT ## --------------------------------------------------------------------- ## Because the GPos class extends the GenomicRanges virtual class, a ## GPos derivative can be used as the rowRanges component of a ## RangedSummarizedExperiment object. ## As a 1st example, we show how the counts for samples sA and sB in ## 'gpos3' can be stored in a SummarizedExperiment object where the rows ## correspond to unique genomic positions and the columns to samples: library(SummarizedExperiment) counts <- cbind(sA=sA_counts, sB=sB_counts) mcols(gpos3A) <- NULL rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A) rse3 rowRanges(rse3) head(assay(rse3)) ## Finally we show how the coverage data from Example 5 can be easily ## stored in a lightweight SummarizedExperiment derivative: cvg <- mcols(gpos5) mcols(gpos5) <- NULL rse5 <- SummarizedExperiment(list(cvg=cvg), rowRanges=gpos5) rse5 rowRanges(rse5) assay(rse5) ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: rse5[assay(rse5)$cvg1 >= 10 | assay(rse5)$cvg2 >= 10]
The GRanges class is a container for the genomic locations and their associated annotations.
GRanges is a vector of genomic locations and associated annotations. Each element in the vector is comprised of a sequence name, an interval, a strand, and optional metadata columns (e.g. score, GC content, etc.). This information is stored in four components:
seqnames
a 'factor' Rle object containing the sequence names.
ranges
an IRanges object containing the ranges.
strand
mcols
a DataFrame object
containing the metadata columns. Columns cannot be named
"seqnames"
, "ranges"
, "strand"
,
"seqlevels"
, "seqlengths"
, "isCircular"
,
"start"
, "end"
, "width"
, or "element"
.
seqinfo
a Seqinfo object containing information about the set of genomic sequences present in the GRanges object.
GRanges(seqnames=NULL, ranges=NULL, strand=NULL,
..., seqinfo=NULL, seqlengths=NULL)
:Creates a GRanges object.
seqnames
NULL
, or an Rle object, character vector,
or factor containing the sequence names.
ranges
NULL
, or an IRanges object containing the
ranges.
strand
NULL
, or an Rle object, character vector,
or factor containing the strand information.
...
Metadata columns to set on the GRanges object. All the metadata
columns must be vector-like objects of the same length as the object
to construct. They cannot be named "start"
, "end"
,
"width"
, or "element"
.
seqinfo
Either NULL
, or a Seqinfo object,
or a character vector of unique sequence names (a.k.a.
seqlevels), or a named numeric vector of sequence lengths.
When not NULL
, seqinfo
must be compatible with the
sequence names in seqnames
, that is, it must have one entry
for each unique sequence name in seqnames
. Note that it can
have additional entries i.e. entries for seqlevels not present
in seqnames
.
seqlengths
NULL
, or an integer vector named with levels(seqnames)
and containing the lengths (or NA) for each level in
levels(seqnames)
.
If ranges
is not supplied and/or NULL then the constructor
proceeds in 2 steps:
An initial GRanges object is created with
as(seqnames, "GRanges")
.
Then this GRanges object is updated according to whatever
non-NULL remaining arguments were passed to the call to
GRanges()
.
As a consequence of this behavior, GRanges(x)
is equivalent to
as(x, "GRanges")
.
In the following code snippets, x
is a GRanges object.
length(x)
:Get the number of elements.
seqnames(x)
, seqnames(x) <- value
:Get or set the sequence names.
value
can be an Rle object, a character vector,
or a factor.
ranges(x)
, ranges(x) <- value
:Get or set the ranges. value
can be an
IntegerRanges object.
start(x)
, start(x) <- value
:Get or set start(ranges(x))
.
end(x)
, end(x) <- value
:Get or set end(ranges(x))
.
width(x)
, width(x) <- value
:Get or set width(ranges(x))
.
strand(x)
, strand(x) <- value
:Get or set the strand. value
can be an Rle object, character
vector, or factor.
names(x)
, names(x) <- value
:Get or set the names of the elements.
mcols(x, use.names=FALSE)
, mcols(x) <- value
:Get or set the metadata columns.
If use.names=TRUE
and the metadata columns are not NULL
,
then the names of x
are propagated as the row names of the
returned DataFrame object.
When setting the metadata columns, the supplied value must be NULL
or a data-frame-like object (i.e. DataFrame or data.frame)
holding element-wise metadata.
elementMetadata(x)
, elementMetadata(x) <- value
,
values(x)
, values(x) <- value
:Alternatives to mcols
functions. Their use is discouraged.
seqinfo(x)
, seqinfo(x) <- value
:Get or set the information about the underlying sequences.
value
must be a Seqinfo object.
seqlevels(x)
,
seqlevels(x, pruning.mode=c("error", "coarse", "fine", "tidy")) <- value
:Get or set the sequence levels.
seqlevels(x)
is equivalent to seqlevels(seqinfo(x))
or to levels(seqnames(x))
, those 2 expressions being
guaranteed to return identical character vectors on a GRanges object.
value
must be a character vector with no NAs.
See ?seqlevels
for more information.
seqlengths(x)
, seqlengths(x) <- value
:Get or set the sequence lengths.
seqlengths(x)
is equivalent to seqlengths(seqinfo(x))
.
value
can be a named non-negative integer or numeric vector
eventually with NAs.
isCircular(x)
, isCircular(x) <- value
:Get or set the circularity flags.
isCircular(x)
is equivalent to isCircular(seqinfo(x))
.
value
must be a named logical vector eventually with NAs.
genome(x)
, genome(x) <- value
:Get or set the genome identifier or assembly name for each sequence.
genome(x)
is equivalent to genome(seqinfo(x))
.
value
must be a named character vector eventually with NAs.
seqlevelsStyle(x)
, seqlevelsStyle(x) <- value
:Get or set the seqname style for x
.
See the seqlevelsStyle generic getter and setter
in the GenomeInfoDb package for more information.
score(x), score(x) <- value
:Get or set the “score” column from the element metadata.
granges(x, use.names=FALSE, use.mcols=FALSE)
: Squeeze the genomic
ranges out of GenomicRanges object x
and return them in a
GRanges object parallel to x
(i.e. same length as x
).
If use.mcols
is TRUE
, the metadata columns are propagated.
If x
is a GenomicRanges derivative with extra column
slots, these will be propagated as metadata columns on the returned
GRanges object.
In the code snippets below, x
is a GRanges object.
as(from, "GRanges")
:Creates a GRanges object from a character vector, a factor, or IntegerRangesList object.
When from
is a character vector (or a factor), each element
in it must represent a genomic range in format chr1:2501-2800
(unstranded range) or chr1:2501-2800:+
(stranded range).
..
is also supported as a separator between the start and end
positions. Strand can be +
, -
, *
, or missing.
The names on from
are propagated to the returned GRanges object.
See as.character()
and as.factor()
below for the
reverse transformations.
Coercing a data.frame or DataFrame into a GRanges object is also
supported. See makeGRangesFromDataFrame
for the details.
as(from, "IntegerRangesList")
:Creates a IntegerRangesList object from a GRanges
object. The strand
and metadata columns become inner
metadata columns (i.e. metadata columns on the ranges).
The seqlengths(from)
, isCircular(from)
, and
genome(from)
vectors become the metadata columns.
as.character(x, ignore.strand=FALSE)
:Turn GRanges object x
into a character vector where each
range in x
is represented by a string in format
chr1:2501-2800:+
. If ignore.strand
is TRUE or if
all the ranges in x
are unstranded (i.e. their strand
is set to *
), then all the strings in the output are in
format chr1:2501-2800
.
The names on x
are propagated to the returned character vector.
Its metadata (metadata(x)
) and metadata columns (mcols(x)
)
are ignored.
See as(from, "GRanges")
above for the reverse transformation.
as.factor(x)
:Equivalent to
factor(as.character(x), levels=as.character(sort(unique(x))))
See as(from, "GRanges")
above for the reverse transformation.
Note that table(x)
is supported on a GRanges object. It is
equivalent to, but much faster than, table(as.factor(x))
.
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
:Creates a data.frame with columns seqnames
(factor),
start
(integer), end
(integer), width
(integer),
strand
(factor), as well as the additional metadata columns
stored in mcols(x)
. Pass an explicit
stringsAsFactors=TRUE/FALSE
argument via ...
to
override the default conversions for the metadata columns in
mcols(x)
.
as(from, "Grouping")
: Creates a
ManyToOneGrouping
object that groups
from
by seqname, strand, start and end (same as the default
sort order). This makes it convenient, for example, to aggregate a
GenomicRanges object by range.
In the code snippets below, x
is a Seqinfo
object.
as(x, "GRanges")
, as(x, "GenomicRanges")
,
as(x, "IntegerRangesList")
: Turns Seqinfo
object x
(with no NA
lengths) into a GRanges or
IntegerRangesList.
In the code snippets below, x
is a GRanges object.
x[i]
:Return a new GRanges object made of the elements selected by i
.
x[i, j]
:Like the above, but allow the user to conveniently subset the metadata
columns thru j
.
x[i] <- value
:Replacement version of x[i]
.
x$name
, x$name <- value
:Shortcuts for mcols(x)$name
and mcols(x)$name <- value
,
respectively. Provided as a convenience, for GRanges objects *only*,
and as the result of strong popular demand.
Note that those methods are not consistent with the other $
and $<-
methods in the IRanges/GenomicRanges infrastructure,
and might confuse some users by making them believe that a GRanges
object can be manipulated as a data.frame-like object.
Therefore we recommend using them only interactively, and we discourage
their use in scripts or packages. For the latter, use
mcols(x)$name
and mcols(x)$name <- value
, instead
of x$name
and x$name <- value
, respectively.
See ?`[`
in the S4Vectors package for more
information about subsetting Vector derivatives and for an important note
about the x[i, j]
form.
Note that a GRanges object can be used as a subscript to subset a
list-like object that has names on it. In that case, the names on the
list-like object are interpreted as sequence names.
In the code snippets below, x
is a list or List object with
names on it, and the subscript gr
is a GRanges object with all its
seqnames being valid x
names.
x[gr]
:Return an object of the same class as x
and parallel
to gr
. More precisely, it's conceptually doing:
lapply(gr, function(gr1) x[[seqnames(gr1)]][ranges(gr1)])
c(x, ..., ignore.mcols=FALSE)
:Concatenate GRanges object x
and the GRanges objects in
...
together.
See ?c
in the S4Vectors package
for more information about concatenating Vector derivatives.
split(x, f, drop=FALSE)
:Splits GRanges object x
according to f
to create a
GRangesList object.
If f
is a list-like object then drop
is ignored
and f
is treated as if it was
rep(seq_len(length(f)), sapply(f, length))
,
so the returned object has the same shape as f
(it also
receives the names of f
).
Otherwise, if f
is not a list-like object, empty list
elements are removed from the returned object if drop
is
TRUE
.
In the code snippets below, x
is a GRanges object.
show(x)
:By default the show
method displays 5 head and 5 tail
elements. This can be changed by setting the global options
showHeadLines
and showTailLines
. If the object
length is less than (or equal to) the sum of these 2 options
plus 1, then the full object is displayed.
Note that these options also affect the display of
GAlignments and
GAlignmentPairs objects (defined in
the GenomicAlignments package), as well as other objects
defined in the IRanges and Biostrings packages (e.g.
IRanges and DNAStringSet objects).
P. Aboyoun and H. Pagès
The IRanges class in the IRanges package for storing a set of integer ranges.
The GPos class for representing a set of genomic positions (i.e. genomic ranges of width 1, a.k.a. genomic loci).
makeGRangesFromDataFrame
for making a GRanges object
from a data.frame or DataFrame object.
Seqinfo objects and the
seqinfo
accessor and family in the
GenomeInfoDb package for accessing/modifying information
about the underlying sequences of a GenomicRanges derivative.
GenomicRanges-comparison for comparing and ordering genomic ranges and/or positions.
findOverlaps-methods for finding overlapping genomic ranges and/or positions.
intra-range-methods and inter-range-methods for intra range and inter range transformations of GenomicRanges derivatives.
coverage-methods for computing the coverage of a set of genomic ranges and/or positions.
setops-methods for set operations on GRanges objects.
subtract for subtracting a set of genomic ranges from a GRanges object (similar to bedtools subtract).
nearest-methods for finding the nearest genomic range/position neighbor.
absoluteRanges
for transforming genomic ranges into
absolute ranges (i.e. into ranges on the sequence obtained
by virtually concatenating all the sequences in a genome).
tileGenome
for putting tiles on a genome.
genomicvars for manipulating genomic variables.
GRangesList objects.
Vector, Rle, and DataFrame objects in the S4Vectors package.
showClass("GRanges") # shows the known subclasses ## --------------------------------------------------------------------- ## CONSTRUCTION ## --------------------------------------------------------------------- ## Specifying the bare minimum i.e. seqnames and ranges only. The ## GRanges object will have no names, no strand information, and no ## metadata columns: gr0 <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1)) gr0 ## Specifying names, strand, metadata columns. They can be set on an ## existing object: names(gr0) <- head(letters, 10) strand(gr0) <- Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)) mcols(gr0)$score <- 1:10 mcols(gr0)$GC <- seq(1, 0, length=10) gr0 ## ... or specified at construction time: gr <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1, names=head(letters, 10)), Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10)) stopifnot(identical(gr0, gr)) ## Specifying the seqinfo. It can be set on an existing object: seqinfo <- Seqinfo(paste0("chr", 1:3), c(1000, 2000, 1500), NA, "mock1") seqinfo(gr0) <- merge(seqinfo(gr0), seqinfo) seqlevels(gr0) <- seqlevels(seqinfo) ## ... or specified at construction time: gr <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1, names=head(letters, 10)), Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10), seqinfo=seqinfo) stopifnot(identical(gr0, gr)) ## --------------------------------------------------------------------- ## COERCION ## --------------------------------------------------------------------- ## From GRanges: as.character(gr) as.factor(gr) as.data.frame(gr) ## From character to GRanges: x1 <- "chr2:56-125" as(x1, "GRanges") as(rep(x1, 4), "GRanges") x2 <- c(A=x1, B="chr1:25-30:-") as(x2, "GRanges") ## From data.frame to GRanges: df <- data.frame(chrom="chr2", start=11:15, end=20:24) gr3 <- as(df, "GRanges") ## Alternatively, coercion to GRanges can be done by just calling the ## GRanges() constructor on the object to coerce: gr1 <- GRanges(x1) # same as as(x1, "GRanges") gr2 <- GRanges(x2) # same as as(x2, "GRanges") gr3 <- GRanges(df) # same as as(df, "GRanges") ## Sanity checks: stopifnot(identical(as(x1, "GRanges"), gr1)) stopifnot(identical(as(x2, "GRanges"), gr2)) stopifnot(identical(as(df, "GRanges"), gr3)) ## --------------------------------------------------------------------- ## SUMMARIZING ELEMENTS ## --------------------------------------------------------------------- table(seqnames(gr)) table(strand(gr)) sum(width(gr)) table(gr) summary(mcols(gr)[,"score"]) ## The number of lines displayed in the 'show' method are controlled ## with two global options: longGR <- sample(gr, 25, replace=TRUE) longGR options(showHeadLines=7) options(showTailLines=2) longGR ## Revert to default values options(showHeadLines=NULL) options(showTailLines=NULL) ## --------------------------------------------------------------------- ## INVERTING THE STRAND ## --------------------------------------------------------------------- invertStrand(gr) ## --------------------------------------------------------------------- ## RENAMING THE UNDERLYING SEQUENCES ## --------------------------------------------------------------------- seqlevels(gr) seqlevels(gr) <- sub("chr", "Chrom", seqlevels(gr)) gr seqlevels(gr) <- sub("Chrom", "chr", seqlevels(gr)) # revert ## --------------------------------------------------------------------- ## COMBINING OBJECTS ## --------------------------------------------------------------------- gr2 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)), IRanges(1:10, width=5), strand='-', score=101:110, GC=runif(10), seqinfo=seqinfo) gr3 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 4, 3)), IRanges(101:110, width=10), strand='-', score=21:30, seqinfo=seqinfo) some.gr <- c(gr, gr2) c(gr, gr2, gr3) c(gr, gr2, gr3, ignore.mcols=TRUE) ## --------------------------------------------------------------------- ## USING A GRANGES OBJECT AS A SUBSCRIPT TO SUBSET ANOTHER OBJECT ## --------------------------------------------------------------------- ## Subsetting *by* a GRanges subscript is supported only if the object ## to subset is a named list-like object: x <- RleList(chr1=101:120, chr2=2:-8, chr3=31:40) x[gr]
showClass("GRanges") # shows the known subclasses ## --------------------------------------------------------------------- ## CONSTRUCTION ## --------------------------------------------------------------------- ## Specifying the bare minimum i.e. seqnames and ranges only. The ## GRanges object will have no names, no strand information, and no ## metadata columns: gr0 <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1)) gr0 ## Specifying names, strand, metadata columns. They can be set on an ## existing object: names(gr0) <- head(letters, 10) strand(gr0) <- Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)) mcols(gr0)$score <- 1:10 mcols(gr0)$GC <- seq(1, 0, length=10) gr0 ## ... or specified at construction time: gr <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1, names=head(letters, 10)), Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10)) stopifnot(identical(gr0, gr)) ## Specifying the seqinfo. It can be set on an existing object: seqinfo <- Seqinfo(paste0("chr", 1:3), c(1000, 2000, 1500), NA, "mock1") seqinfo(gr0) <- merge(seqinfo(gr0), seqinfo) seqlevels(gr0) <- seqlevels(seqinfo) ## ... or specified at construction time: gr <- GRanges(Rle(c("chr2", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), IRanges(1:10, width=10:1, names=head(letters, 10)), Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10), seqinfo=seqinfo) stopifnot(identical(gr0, gr)) ## --------------------------------------------------------------------- ## COERCION ## --------------------------------------------------------------------- ## From GRanges: as.character(gr) as.factor(gr) as.data.frame(gr) ## From character to GRanges: x1 <- "chr2:56-125" as(x1, "GRanges") as(rep(x1, 4), "GRanges") x2 <- c(A=x1, B="chr1:25-30:-") as(x2, "GRanges") ## From data.frame to GRanges: df <- data.frame(chrom="chr2", start=11:15, end=20:24) gr3 <- as(df, "GRanges") ## Alternatively, coercion to GRanges can be done by just calling the ## GRanges() constructor on the object to coerce: gr1 <- GRanges(x1) # same as as(x1, "GRanges") gr2 <- GRanges(x2) # same as as(x2, "GRanges") gr3 <- GRanges(df) # same as as(df, "GRanges") ## Sanity checks: stopifnot(identical(as(x1, "GRanges"), gr1)) stopifnot(identical(as(x2, "GRanges"), gr2)) stopifnot(identical(as(df, "GRanges"), gr3)) ## --------------------------------------------------------------------- ## SUMMARIZING ELEMENTS ## --------------------------------------------------------------------- table(seqnames(gr)) table(strand(gr)) sum(width(gr)) table(gr) summary(mcols(gr)[,"score"]) ## The number of lines displayed in the 'show' method are controlled ## with two global options: longGR <- sample(gr, 25, replace=TRUE) longGR options(showHeadLines=7) options(showTailLines=2) longGR ## Revert to default values options(showHeadLines=NULL) options(showTailLines=NULL) ## --------------------------------------------------------------------- ## INVERTING THE STRAND ## --------------------------------------------------------------------- invertStrand(gr) ## --------------------------------------------------------------------- ## RENAMING THE UNDERLYING SEQUENCES ## --------------------------------------------------------------------- seqlevels(gr) seqlevels(gr) <- sub("chr", "Chrom", seqlevels(gr)) gr seqlevels(gr) <- sub("Chrom", "chr", seqlevels(gr)) # revert ## --------------------------------------------------------------------- ## COMBINING OBJECTS ## --------------------------------------------------------------------- gr2 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)), IRanges(1:10, width=5), strand='-', score=101:110, GC=runif(10), seqinfo=seqinfo) gr3 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 4, 3)), IRanges(101:110, width=10), strand='-', score=21:30, seqinfo=seqinfo) some.gr <- c(gr, gr2) c(gr, gr2, gr3) c(gr, gr2, gr3, ignore.mcols=TRUE) ## --------------------------------------------------------------------- ## USING A GRANGES OBJECT AS A SUBSCRIPT TO SUBSET ANOTHER OBJECT ## --------------------------------------------------------------------- ## Subsetting *by* a GRanges subscript is supported only if the object ## to subset is a named list-like object: x <- RleList(chr1=101:120, chr2=2:-8, chr3=31:40) x[gr]
A GRangesFactor object is a Factor derivative where the levels are a GRanges object.
See ?Factor
and in the S4Vectors package
for general information about Factor objects.
GRangesFactor(x, levels, index=NULL, ...) # constructor function
GRangesFactor(x, levels, index=NULL, ...) # constructor function
x , levels
|
Like with the When When |
index |
|
... |
Optional metadata columns. |
Like with the Factor()
constructor function,
there are 4 different ways to use the GRangesFactor()
constructor function. See Details section in the man page for
Factor objects for more information.
A GRangesFactor object.
GRangesFactor objects support the accessors documented in the man page for Factor objects.
In addition, the following getters are supported for convenience:
seqnames()
, start()
, end()
, width()
,
strand()
, seqinfo()
, granges()
, and ranges()
.
When called on GRangesFactor object x
, they all behave as if they
were called on unfactor(x)
.
Because a GRangesFactor object x
is a Factor
derivative, unfactor(x)
can be used to decode it.
unfactor(x)
returns an object of the same class as levels(x)
(i.e. a GRanges object or derivative) and same length as x
.
See ?unfactor
for more information.
GRangesFactor objects support the coercions documented in the man page for Factor objects.
In addition, coercion back and forth between GRanges and
GRangesFactor is supported via as(x, "GRanges")
and
as(x, "GRangesFactor")
.
A GRangesFactor object can be subsetted with [
, like a
Factor object.
2 or more GRangesFactor objects can be concatenated with c()
.
The result of this concatenation is another GRangesFactor object.
See Concatenation section in ?Factor
.
See Comparing & Ordering section in ?Factor
.
Hervé Pagès
GRanges objects.
Factor objects in the S4Vectors package for the parent class of GRangesFactor.
anyDuplicated
in the BiocGenerics
package.
showClass("GRangesFactor") # GRangesFactor extends Factor ## --------------------------------------------------------------------- ## CONSTRUCTOR & ACCESSORS ## --------------------------------------------------------------------- set.seed(123) ir0 <- IRanges(sample(5, 8, replace=TRUE), width=10, names=letters[1:8]) gr0 <- GRanges("chrA", ir0, ID=paste0("ID", 1:8)) ## Use explicit levels: gr1 <- GRanges("chrA", IRanges(1:6, width=10)) grf1 <- GRangesFactor(gr0, levels=gr1) grf1 length(grf1) names(grf1) levels(grf1) # gr1 nlevels(grf1) as.integer(grf1) # encoding ## If we don't specify the levels, they'll be set to unique(gr0): grf2 <- GRangesFactor(gr0) grf2 length(grf2) names(grf2) levels(grf2) # unique(gr0) nlevels(grf2) as.integer(grf2) ## --------------------------------------------------------------------- ## DECODING ## --------------------------------------------------------------------- unfactor(grf1) stopifnot(identical(gr0, unfactor(grf1))) stopifnot(identical(gr0, unfactor(grf2))) unfactor(grf1, use.names=FALSE) unfactor(grf1, ignore.mcols=TRUE) ## --------------------------------------------------------------------- ## COERCION ## --------------------------------------------------------------------- grf2b <- as(gr0, "GRangesFactor") # same as GRangesFactor(gr0) stopifnot(identical(grf2, grf2b)) as.factor(grf2) as.factor(grf1) as.character(grf1) # same as unfactor(as.factor(grf1)), # and also same as as.character(unfactor(grf1)) ## --------------------------------------------------------------------- ## CONCATENATION ## --------------------------------------------------------------------- gr3 <- GRanges("chrA", IRanges(c(5, 2, 8:6), width=10)) grf3 <- GRangesFactor(levels=gr3, index=2:4) grf13 <- c(grf1, grf3) grf13 levels(grf13) stopifnot(identical(c(unfactor(grf1), unfactor(grf3)), unfactor(grf13))) ## --------------------------------------------------------------------- ## COMPARING & ORDERING ## --------------------------------------------------------------------- grf1 == grf2 # same as unfactor(grf1) == unfactor(grf2) order(grf1) # same as order(unfactor(grf1)) order(grf2) # same as order(unfactor(grf2)) ## The levels of the GRangesFactor influence the order of the table: table(grf1) table(grf2)
showClass("GRangesFactor") # GRangesFactor extends Factor ## --------------------------------------------------------------------- ## CONSTRUCTOR & ACCESSORS ## --------------------------------------------------------------------- set.seed(123) ir0 <- IRanges(sample(5, 8, replace=TRUE), width=10, names=letters[1:8]) gr0 <- GRanges("chrA", ir0, ID=paste0("ID", 1:8)) ## Use explicit levels: gr1 <- GRanges("chrA", IRanges(1:6, width=10)) grf1 <- GRangesFactor(gr0, levels=gr1) grf1 length(grf1) names(grf1) levels(grf1) # gr1 nlevels(grf1) as.integer(grf1) # encoding ## If we don't specify the levels, they'll be set to unique(gr0): grf2 <- GRangesFactor(gr0) grf2 length(grf2) names(grf2) levels(grf2) # unique(gr0) nlevels(grf2) as.integer(grf2) ## --------------------------------------------------------------------- ## DECODING ## --------------------------------------------------------------------- unfactor(grf1) stopifnot(identical(gr0, unfactor(grf1))) stopifnot(identical(gr0, unfactor(grf2))) unfactor(grf1, use.names=FALSE) unfactor(grf1, ignore.mcols=TRUE) ## --------------------------------------------------------------------- ## COERCION ## --------------------------------------------------------------------- grf2b <- as(gr0, "GRangesFactor") # same as GRangesFactor(gr0) stopifnot(identical(grf2, grf2b)) as.factor(grf2) as.factor(grf1) as.character(grf1) # same as unfactor(as.factor(grf1)), # and also same as as.character(unfactor(grf1)) ## --------------------------------------------------------------------- ## CONCATENATION ## --------------------------------------------------------------------- gr3 <- GRanges("chrA", IRanges(c(5, 2, 8:6), width=10)) grf3 <- GRangesFactor(levels=gr3, index=2:4) grf13 <- c(grf1, grf3) grf13 levels(grf13) stopifnot(identical(c(unfactor(grf1), unfactor(grf3)), unfactor(grf13))) ## --------------------------------------------------------------------- ## COMPARING & ORDERING ## --------------------------------------------------------------------- grf1 == grf2 # same as unfactor(grf1) == unfactor(grf2) order(grf1) # same as order(unfactor(grf1)) order(grf2) # same as order(unfactor(grf2)) ## The levels of the GRangesFactor influence the order of the table: table(grf1) table(grf2)
The GRangesList class is a container for storing a collection of GRanges objects. It is a subclass of GenomicRangesList. It exists in 2 flavors: SimpleGRangesList and CompressedGRangesList. Each flavor uses a particular internal representation. The CompressedGRangesList flavor is the default. It is particularly efficient for storing a large number of list elements and operating on them.
GRangesList(..., compress=TRUE)
:Creates a GRangesList object using the GRanges objects
supplied in ...
, either consecutively or in a list.
By default a CompressedGRangesList instance is returned,
that is, a GRangesList object of the CompressedGRangesList flavor.
Use compress=FALSE
to get a SimpleGRangesList instance
instead.
makeGRangesListFromFeatureFragments(seqnames=Rle(factor()),
fragmentStarts=list(), fragmentEnds=list(),
fragmentWidths=list(),
strand=character(0), sep=",")
:Constructs a GRangesList object from a list of fragmented features. See the Examples section below.
In the code snippets below, x
is a GRangesList object.
length(x)
:Get the number of list elements.
names(x)
, names(x) <- value
:Get or set the names on x
.
seqnames(x)
, seqnames(x) <- value
:Get or set the sequence names in the form of an RleList.
value
can be an RleList or CharacterList object.
ranges(x, use.mcols=FALSE)
, ranges(x) <- value
:Get or set the ranges in the form of a CompressedIRangesList.
value
can be an IntegerRangesList object.
start(x)
, start(x) <- value
:Get or set start(ranges(x))
.
end(x)
, end(x) <- value
:Get or set end(ranges(x))
.
width(x)
, width(x) <- value
:Get or set width(ranges(x))
.
strand(x)
, strand(x) <- value
:Get or set the strand in the form of an RleList object.
value
can be an RleList, CharacterList or
single character. value
as a single character converts all
ranges in x
to the same value
; for selective strand
conversion (i.e., mixed +
and -
) use RleList
or CharacterList.
mcols(x, use.names=FALSE)
, mcols(x) <- value
:Get or set the metadata columns.
value
can be NULL
, or a data.frame-like object (i.e.
DataFrame or data.frame) holding element-wise metadata.
elementNROWS(x)
:Get a vector of the length
of each of the list element.
isEmpty(x)
:Returns a logical indicating either if the GRangesList has no elements or if all its elements are empty.
seqinfo(x)
, seqinfo(x) <- value
:Get or set the information about the underlying sequences.
value
must be a Seqinfo object.
seqlevels(x)
,
seqlevels(x, pruning.mode=c("error", "coarse", "fine", "tidy"))
<- value
:Get or set the sequence levels.
seqlevels(x)
is equivalent to seqlevels(seqinfo(x))
or to levels(seqnames(x))
, those 2 expressions being
guaranteed to return identical character vectors on a GRangesList
object. value
must be a character vector with no NAs.
See ?seqlevels
for more information.
seqlengths(x)
, seqlengths(x) <- value
:Get or set the sequence lengths.
seqlengths(x)
is equivalent to seqlengths(seqinfo(x))
.
value
can be a named non-negative integer or numeric vector
eventually with NAs.
isCircular(x)
, isCircular(x) <- value
:Get or set the circularity flags.
isCircular(x)
is equivalent to isCircular(seqinfo(x))
.
value
must be a named logical vector eventually with NAs.
genome(x)
, genome(x) <- value
:Get or set the genome identifier or assembly name for each sequence.
genome(x)
is equivalent to genome(seqinfo(x))
.
value
must be a named character vector eventually with NAs.
seqlevelsStyle(x)
, seqlevelsStyle(x) <- value
:Get or set the seqname style for x
.
See the seqlevelsStyle generic getter and setter
in the GenomeInfoDb package for more information.
score(x), score(x) <- value
:Get or set the score
metadata column.
In the code snippets below, x
is a GRangesList object.
as.data.frame(x, row.names=NULL, optional=FALSE, ...,
value.name="value", use.outer.mcols=FALSE,
group_name.as.factor=FALSE)
:Coerces x
to a data.frame
. See as.data.frame on the
List
man page for details (?List
).
as.list(x, use.names = TRUE)
:Creates a list containing the elements of x
.
as(x, "IRangesList")
:Turns x
into an IRangesList object.
When x
is a list
of GRanges
, it can be coerced to a
GRangesList
.
as(x, "GRangesList")
:Turns x
into a GRangesList
.
In the following code snippets, x
is a GRangesList object.
x[i, j]
, x[i, j] <- value
:Get or set elements i
with optional metadata columns
mcols(x)[,j]
, where i
can be missing; an NA-free
logical, numeric, or character vector; a logical-Rle object,
or an AtomicList object.
x[[i]]
, x[[i]] <- value
:Get or set element i
, where i
is a numeric or character
vector of length 1.
x$name
, x$name <- value
:Get or set element name
, where name
is a name or character
vector of length 1.
head(x, n = 6L)
:If n
is non-negative, returns the first n elements of the
GRangesList object.
If n
is negative, returns all but the last abs(n)
elements
of the GRangesList object.
rep(x, times, length.out, each)
:Repeats the values in x
through one of the following conventions:
times
Vector giving the number of times to repeat each
element if of length length(x)
, or to repeat the whole vector
if of length 1.
length.out
Non-negative integer. The desired length of the output vector.
each
Non-negative integer. Each element of x
is
repeated each
times.
subset(x, subset)
:Returns a new object of the same class as x
made of the subset
using logical vector subset
, where missing values are taken as
FALSE
.
tail(x, n = 6L)
:If n
is non-negative, returns the last n list elements of the
GRangesList object.
If n
is negative, returns all but the first abs(n)
list elements of the GRangesList object.
In the code snippets below, x
is a GRangesList object.
c(x, ...)
:Combines x
and the GRangesList objects in ...
together. Any object in ...
must belong to the same class
as x
, or to one of its subclasses, or must be NULL
.
The result is an object of the same class as x
.
append(x, values, after = length(x))
:Inserts the values
into x
at the position given by
after
, where x
and values
are of the same
class.
unlist(x, recursive = TRUE, use.names = TRUE)
:Concatenates the elements of x
into a single GRanges
object.
In the code snippets below, x
is a GRangesList object.
endoapply(X, FUN, ...)
:Similar to lapply
, but performs an endomorphism,
i.e. returns an object of class(X)
.
lapply(X, FUN, ...)
:Like the standard lapply
function defined in the
base package, the lapply
method for GRangesList objects
returns a list of the same length as X
, with each element being
the result of applying FUN
to the corresponding element of
X
.
Map(f, ...)
:Applies a function to the corresponding elements of given GRangesList objects.
mapply(FUN, ..., MoreArgs=NULL, SIMPLIFY=TRUE, USE.NAMES=TRUE)
:Like the standard mapply
function defined in the
base package, the mapply
method for GRangesList objects is a
multivariate version of sapply
.
mendoapply(FUN, ..., MoreArgs = NULL)
:Similar to mapply
, but performs an endomorphism
across multiple objects, i.e. returns an object of
class(list(...)[[1]])
.
Reduce(f, x, init, right = FALSE, accumulate = FALSE)
:Uses a binary function to successively combine the elements of x
and a possibly given initial value.
f
A binary argument function.
init
An R object of the same kind as the elements of x
.
right
A logical indicating whether to proceed from left to right (default) or from right to left.
nomatch
The value to be returned in the case when "no match" (no element satisfying the predicate) is found.
sapply(X, FUN, ..., simplify=TRUE, USE.NAMES=TRUE)
:Like the standard sapply
function defined in
the base package, the sapply
method for GRangesList objects
is a user-friendly version of lapply
by default returning a vector
or matrix if appropriate.
P. Aboyoun & H. Pagès
GRanges objects.
seqinfo
in the GenomeInfoDb
package.
IntegerRangesList objects in the IRanges package.
RleList objects in the IRanges package.
DataFrameList objects in the IRanges package.
intra-range-methods, inter-range-methods, coverage-methods, setops-methods, and findOverlaps-methods.
GenomicRangesList objects.
## Construction with GRangesList(): gr1 <- GRanges("chr2", IRanges(3, 6), strand="+", score=5L, GC=0.45) gr2 <- GRanges(c("chr1", "chr1"), IRanges(c(7,13), width=3), strand=c("+", "-"), score=3:4, GC=c(0.3, 0.5)) gr3 <- GRanges(c("chr1", "chr2"), IRanges(c(1, 4), c(3, 9)), strand=c("-", "-"), score=c(6L, 2L), GC=c(0.4, 0.1)) grl <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) grl ## Summarizing elements: elementNROWS(grl) table(seqnames(grl)) ## Extracting subsets: grl[seqnames(grl) == "chr1", ] grl[seqnames(grl) == "chr1" & strand(grl) == "+", ] ## Renaming the underlying sequences: seqlevels(grl) seqlevels(grl) <- sub("chr", "Chrom", seqlevels(grl)) grl ## Coerce to IRangesList (seqnames and strand information is lost): as(grl, "IRangesList") ## isDisjoint(): isDisjoint(grl) ## disjoin(): disjoin(grl) # metadata columns and order NOT preserved ## Construction with makeGRangesListFromFeatureFragments(): filepath <- system.file("extdata", "feature_frags.txt", package="GenomicRanges") featfrags <- read.table(filepath, header=TRUE, stringsAsFactors=FALSE) grl2 <- with(featfrags, makeGRangesListFromFeatureFragments(seqnames=targetName, fragmentStarts=targetStart, fragmentWidths=blockSizes, strand=strand)) names(grl2) <- featfrags$RefSeqID grl2
## Construction with GRangesList(): gr1 <- GRanges("chr2", IRanges(3, 6), strand="+", score=5L, GC=0.45) gr2 <- GRanges(c("chr1", "chr1"), IRanges(c(7,13), width=3), strand=c("+", "-"), score=3:4, GC=c(0.3, 0.5)) gr3 <- GRanges(c("chr1", "chr2"), IRanges(c(1, 4), c(3, 9)), strand=c("-", "-"), score=c(6L, 2L), GC=c(0.4, 0.1)) grl <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) grl ## Summarizing elements: elementNROWS(grl) table(seqnames(grl)) ## Extracting subsets: grl[seqnames(grl) == "chr1", ] grl[seqnames(grl) == "chr1" & strand(grl) == "+", ] ## Renaming the underlying sequences: seqlevels(grl) seqlevels(grl) <- sub("chr", "Chrom", seqlevels(grl)) grl ## Coerce to IRangesList (seqnames and strand information is lost): as(grl, "IRangesList") ## isDisjoint(): isDisjoint(grl) ## disjoin(): disjoin(grl) # metadata columns and order NOT preserved ## Construction with makeGRangesListFromFeatureFragments(): filepath <- system.file("extdata", "feature_frags.txt", package="GenomicRanges") featfrags <- read.table(filepath, header=TRUE, stringsAsFactors=FALSE) grl2 <- with(featfrags, makeGRangesListFromFeatureFragments(seqnames=targetName, fragmentStarts=targetStart, fragmentWidths=blockSizes, strand=strand)) names(grl2) <- featfrags$RefSeqID grl2
This man page documents inter range transformations of a GenomicRanges object (i.e. of an object that belongs to the GenomicRanges class or one of its subclasses, this includes for example GRanges objects), or a GRangesList object.
See ?`intra-range-methods`
and
?`inter-range-methods`
in the IRanges
package for a quick introduction to intra range and inter
range transformations.
See ?`intra-range-methods`
for
intra range transformations of a GenomicRanges object or
GRangesList object.
## S4 method for signature 'GenomicRanges' range(x, ..., with.revmap=FALSE, ignore.strand=FALSE, na.rm=FALSE) ## S4 method for signature 'GRangesList' range(x, ..., with.revmap=FALSE, ignore.strand=FALSE, na.rm=FALSE) ## S4 method for signature 'GenomicRanges' reduce(x, drop.empty.ranges=FALSE, min.gapwidth=1L, with.revmap=FALSE, with.inframe.attrib=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GRangesList' reduce(x, drop.empty.ranges=FALSE, min.gapwidth=1L, with.revmap=FALSE, with.inframe.attrib=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' gaps(x, start=1L, end=seqlengths(x), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' disjoin(x, with.revmap=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GRangesList' disjoin(x, with.revmap=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' isDisjoint(x, ignore.strand=FALSE) ## S4 method for signature 'GRangesList' isDisjoint(x, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' disjointBins(x, ignore.strand=FALSE)
## S4 method for signature 'GenomicRanges' range(x, ..., with.revmap=FALSE, ignore.strand=FALSE, na.rm=FALSE) ## S4 method for signature 'GRangesList' range(x, ..., with.revmap=FALSE, ignore.strand=FALSE, na.rm=FALSE) ## S4 method for signature 'GenomicRanges' reduce(x, drop.empty.ranges=FALSE, min.gapwidth=1L, with.revmap=FALSE, with.inframe.attrib=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GRangesList' reduce(x, drop.empty.ranges=FALSE, min.gapwidth=1L, with.revmap=FALSE, with.inframe.attrib=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' gaps(x, start=1L, end=seqlengths(x), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' disjoin(x, with.revmap=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GRangesList' disjoin(x, with.revmap=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' isDisjoint(x, ignore.strand=FALSE) ## S4 method for signature 'GRangesList' isDisjoint(x, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' disjointBins(x, ignore.strand=FALSE)
x |
A GenomicRanges or GenomicRangesList object. |
drop.empty.ranges , min.gapwidth , with.revmap , with.inframe.attrib , start , end
|
See |
ignore.strand |
|
... |
For |
na.rm |
Ignored. |
range
returns an object of the same type as x
containing range bounds for each distinct (seqname, strand) pairing.
The names (names(x)
) and the metadata columns in x
are
dropped.
reduce
returns an object of the same type as x
containing reduced ranges for each distinct (seqname, strand) pairing.
The names (names(x)
) and the metadata columns in x
are
dropped.
See ?reduce
for more information about range
reduction and for a description of the optional arguments.
gaps
returns an object of the same type as x
containing complemented ranges for each distinct (seqname, strand) pairing.
The names (names(x)
) and the metadata columns in x
are
dropped.
For the start
and end
arguments of this gaps
method,
it is expected that the user will supply a named integer vector (where
the names correspond to the appropriate seqlevels).
See ?gaps
for more information about range
complements and for a description of the optional arguments.
disjoin
returns an object of the same type as x
containing disjoint ranges for each distinct (seqname, strand) pairing.
The names (names(x)
) and the metadata columns in x
are
dropped. If with.revmap=TRUE
, a metadata column that maps the
ouput ranges to the input ranges is added to the returned object.
See ?disjoin
for more information.
isDisjoint
returns a logical value indicating whether the ranges
in x
are disjoint (i.e. non-overlapping).
disjointBins
returns bin indexes for the ranges in x
, such
that ranges in the same bin do not overlap. If ignore.strand=FALSE
,
the two features cannot overlap if they are on different strands.
When they are supported on GRangesList object x
, the above inter
range transformations will apply the transformation to each of the list
elements in x
and return a list-like object parallel to
x
(i.e. with 1 list element per list element in x
).
If x
has names on it, they're propagated to the returned object.
H. Pagès and P. Aboyoun
The GenomicRanges and GRanges classes.
The IntegerRanges class in the IRanges package.
The inter-range-methods man page in the IRanges package.
GenomicRanges-comparison for comparing and ordering genomic ranges.
endoapply
in the S4Vectors package.
gr <- GRanges( seqnames=Rle(paste("chr", c(1, 2, 1, 3), sep=""), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=letters[1:10]), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) gr gr1 <- GRanges(seqnames="chr2", ranges=IRanges(3, 6), strand="+", score=5L, GC=0.45) gr2 <- GRanges(seqnames="chr1", ranges=IRanges(c(10, 7, 19), width=5), strand=c("+", "-", "+"), score=3:5, GC=c(0.3, 0.5, 0.66)) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-"), score=c(6L, 2L), GC=c(0.4, 0.1)) grl <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) grl ## --------------------------------------------------------------------- ## range() ## --------------------------------------------------------------------- ## On a GRanges object: range(gr) range(gr, with.revmap=TRUE) ## On a GRangesList object: range(grl) range(grl, ignore.strand=TRUE) range(grl, with.revmap=TRUE, ignore.strand=TRUE) # --------------------------------------------------------------------- ## reduce() ## --------------------------------------------------------------------- reduce(gr) gr2 <- reduce(gr, with.revmap=TRUE) revmap <- mcols(gr2)$revmap # an IntegerList ## Use the mapping from reduced to original ranges to group the original ## ranges by reduced range: relist(gr[unlist(revmap)], revmap) ## Or use it to split the DataFrame of original metadata columns by ## reduced range: relist(mcols(gr)[unlist(revmap), ], revmap) # a SplitDataFrameList ## [For advanced users] Use this reverse mapping to compare the reduced ## ranges with the ranges they originate from: expanded_gr2 <- rep(gr2, elementNROWS(revmap)) reordered_gr <- gr[unlist(revmap)] codes <- pcompare(expanded_gr2, reordered_gr) ## All the codes should translate to "d", "e", "g", or "h" (the 4 letters ## indicating that the range on the left contains the range on the right): alphacodes <- rangeComparisonCodeToLetter(pcompare(expanded_gr2, reordered_gr)) stopifnot(all(alphacodes %in% c("d", "e", "g", "h"))) ## On a big GRanges object with a lot of seqlevels: mcols(gr) <- NULL biggr <- c(gr, GRanges("chr1", IRanges(c(4, 1), c(5, 2)), strand="+")) seqlevels(biggr) <- paste0("chr", 1:2000) biggr <- rep(biggr, 25000) set.seed(33) seqnames(biggr) <- sample(factor(seqlevels(biggr), levels=seqlevels(biggr)), length(biggr), replace=TRUE) biggr2 <- reduce(biggr, with.revmap=TRUE) revmap <- mcols(biggr2)$revmap expanded_biggr2 <- rep(biggr2, elementNROWS(revmap)) reordered_biggr <- biggr[unlist(revmap)] codes <- pcompare(expanded_biggr2, reordered_biggr) alphacodes <- rangeComparisonCodeToLetter(pcompare(expanded_biggr2, reordered_biggr)) stopifnot(all(alphacodes %in% c("d", "e", "g", "h"))) table(alphacodes) ## On a GRangesList object: reduce(grl) # Doesn't really reduce anything but note the reordering # of the inner elements in the 2nd and 3rd list elements: # the ranges are reordered by sequence name first (which # should appear in the same order as in 'seqlevels(grl)'), # and then by strand. reduce(grl, ignore.strand=TRUE) # 2nd list element got reduced ## --------------------------------------------------------------------- ## gaps() ## --------------------------------------------------------------------- gaps(gr, start=3, end=12) gaps(gr, start=3, end=12, ignore.strand=TRUE) ## Note that if the lengths of the underlying sequences are known, then ## by default 'gaps(gr)' returns the regions of the sequences that are ## not covered by 'gr': seqlengths(gr) # lengths of underlying sequences are not known seqlengths(gr) <- c(chr1=50, chr2=30, chr3=18) gaps(gr) gaps(gr, ignore.strand=TRUE) ## --------------------------------------------------------------------- ## disjoin(), isDisjoint(), disjointBins() ## --------------------------------------------------------------------- disjoin(gr) disjoin(gr, with.revmap=TRUE) disjoin(gr, with.revmap=TRUE, ignore.strand=TRUE) isDisjoint(gr) stopifnot(isDisjoint(disjoin(gr))) disjointBins(gr) stopifnot(all(sapply(split(gr, disjointBins(gr)), isDisjoint))) ## On a GRangesList object: disjoin(grl) # doesn't really disjoin anything but note the reordering disjoin(grl, with.revmap=TRUE)
gr <- GRanges( seqnames=Rle(paste("chr", c(1, 2, 1, 3), sep=""), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=letters[1:10]), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) gr gr1 <- GRanges(seqnames="chr2", ranges=IRanges(3, 6), strand="+", score=5L, GC=0.45) gr2 <- GRanges(seqnames="chr1", ranges=IRanges(c(10, 7, 19), width=5), strand=c("+", "-", "+"), score=3:5, GC=c(0.3, 0.5, 0.66)) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-"), score=c(6L, 2L), GC=c(0.4, 0.1)) grl <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) grl ## --------------------------------------------------------------------- ## range() ## --------------------------------------------------------------------- ## On a GRanges object: range(gr) range(gr, with.revmap=TRUE) ## On a GRangesList object: range(grl) range(grl, ignore.strand=TRUE) range(grl, with.revmap=TRUE, ignore.strand=TRUE) # --------------------------------------------------------------------- ## reduce() ## --------------------------------------------------------------------- reduce(gr) gr2 <- reduce(gr, with.revmap=TRUE) revmap <- mcols(gr2)$revmap # an IntegerList ## Use the mapping from reduced to original ranges to group the original ## ranges by reduced range: relist(gr[unlist(revmap)], revmap) ## Or use it to split the DataFrame of original metadata columns by ## reduced range: relist(mcols(gr)[unlist(revmap), ], revmap) # a SplitDataFrameList ## [For advanced users] Use this reverse mapping to compare the reduced ## ranges with the ranges they originate from: expanded_gr2 <- rep(gr2, elementNROWS(revmap)) reordered_gr <- gr[unlist(revmap)] codes <- pcompare(expanded_gr2, reordered_gr) ## All the codes should translate to "d", "e", "g", or "h" (the 4 letters ## indicating that the range on the left contains the range on the right): alphacodes <- rangeComparisonCodeToLetter(pcompare(expanded_gr2, reordered_gr)) stopifnot(all(alphacodes %in% c("d", "e", "g", "h"))) ## On a big GRanges object with a lot of seqlevels: mcols(gr) <- NULL biggr <- c(gr, GRanges("chr1", IRanges(c(4, 1), c(5, 2)), strand="+")) seqlevels(biggr) <- paste0("chr", 1:2000) biggr <- rep(biggr, 25000) set.seed(33) seqnames(biggr) <- sample(factor(seqlevels(biggr), levels=seqlevels(biggr)), length(biggr), replace=TRUE) biggr2 <- reduce(biggr, with.revmap=TRUE) revmap <- mcols(biggr2)$revmap expanded_biggr2 <- rep(biggr2, elementNROWS(revmap)) reordered_biggr <- biggr[unlist(revmap)] codes <- pcompare(expanded_biggr2, reordered_biggr) alphacodes <- rangeComparisonCodeToLetter(pcompare(expanded_biggr2, reordered_biggr)) stopifnot(all(alphacodes %in% c("d", "e", "g", "h"))) table(alphacodes) ## On a GRangesList object: reduce(grl) # Doesn't really reduce anything but note the reordering # of the inner elements in the 2nd and 3rd list elements: # the ranges are reordered by sequence name first (which # should appear in the same order as in 'seqlevels(grl)'), # and then by strand. reduce(grl, ignore.strand=TRUE) # 2nd list element got reduced ## --------------------------------------------------------------------- ## gaps() ## --------------------------------------------------------------------- gaps(gr, start=3, end=12) gaps(gr, start=3, end=12, ignore.strand=TRUE) ## Note that if the lengths of the underlying sequences are known, then ## by default 'gaps(gr)' returns the regions of the sequences that are ## not covered by 'gr': seqlengths(gr) # lengths of underlying sequences are not known seqlengths(gr) <- c(chr1=50, chr2=30, chr3=18) gaps(gr) gaps(gr, ignore.strand=TRUE) ## --------------------------------------------------------------------- ## disjoin(), isDisjoint(), disjointBins() ## --------------------------------------------------------------------- disjoin(gr) disjoin(gr, with.revmap=TRUE) disjoin(gr, with.revmap=TRUE, ignore.strand=TRUE) isDisjoint(gr) stopifnot(isDisjoint(disjoin(gr))) disjointBins(gr) stopifnot(all(sapply(split(gr, disjointBins(gr)), isDisjoint))) ## On a GRangesList object: disjoin(grl) # doesn't really disjoin anything but note the reordering disjoin(grl, with.revmap=TRUE)
This man page documents intra range transformations of a GenomicRanges object (i.e. of an object that belongs to the GenomicRanges class or one of its subclasses, this includes for example GRanges objects), or a GRangesList object.
See ?`intra-range-methods`
and
?`inter-range-methods`
in the IRanges
package for a quick introduction to intra range and inter
range transformations.
Intra range methods for GAlignments and GAlignmentsList objects are defined and documented in the GenomicAlignments package.
See ?`inter-range-methods`
for
inter range transformations of a GenomicRanges or
GRangesList object.
## S4 method for signature 'GenomicRanges' shift(x, shift=0L, use.names=TRUE) ## S4 method for signature 'GenomicRanges' narrow(x, start=NA, end=NA, width=NA, use.names=TRUE) ## S4 method for signature 'GenomicRanges' resize(x, width, fix="start", use.names=TRUE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' flank(x, width, start=TRUE, both=FALSE, use.names=TRUE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' promoters(x, upstream=2000, downstream=200, use.names=TRUE) ## S4 method for signature 'GenomicRanges' terminators(x, upstream=2000, downstream=200, use.names=TRUE) ## S4 method for signature 'GenomicRanges' restrict(x, start=NA, end=NA, keep.all.ranges=FALSE, use.names=TRUE) ## S4 method for signature 'GenomicRanges' trim(x, use.names=TRUE)
## S4 method for signature 'GenomicRanges' shift(x, shift=0L, use.names=TRUE) ## S4 method for signature 'GenomicRanges' narrow(x, start=NA, end=NA, width=NA, use.names=TRUE) ## S4 method for signature 'GenomicRanges' resize(x, width, fix="start", use.names=TRUE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' flank(x, width, start=TRUE, both=FALSE, use.names=TRUE, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges' promoters(x, upstream=2000, downstream=200, use.names=TRUE) ## S4 method for signature 'GenomicRanges' terminators(x, upstream=2000, downstream=200, use.names=TRUE) ## S4 method for signature 'GenomicRanges' restrict(x, start=NA, end=NA, keep.all.ranges=FALSE, use.names=TRUE) ## S4 method for signature 'GenomicRanges' trim(x, use.names=TRUE)
x |
A GenomicRanges object. |
shift , use.names , start , end , width , both , fix , keep.all.ranges , upstream , downstream
|
See |
ignore.strand |
|
shift
: behaves like the shift
method for
IntegerRanges objects. See
?`intra-range-methods`
for the details.
narrow
: on a GenomicRanges object behaves
like on an IntegerRanges object. See
?`intra-range-methods`
for the details.
A major difference though is that it returns a GenomicRanges
object instead of an IntegerRanges object.
The returned object is parallel (i.e. same length and names)
to the original object x
.
resize
: returns an object of the same type and length as
x
containing intervals that have been resized to width
width
based on the strand(x)
values. Elements where
strand(x) == "+"
or strand(x) == "*"
are anchored at
start(x)
and elements where strand(x) == "-"
are anchored
at the end(x)
. The use.names
argument determines whether
or not to keep the names on the ranges.
flank
: returns an object of the same type and length
as x
containing intervals of width width
that flank
the intervals in x
. The start
argument takes a
logical indicating whether x
should be flanked at the
"start" (TRUE
) or the "end" (FALSE
), which for
strand(x) != "-"
is start(x)
and end(x)
respectively and for strand(x) == "-"
is end(x)
and
start(x)
respectively. The both
argument takes a
single logical value indicating whether the flanking region
width
positions extends into the range. If
both=TRUE
, the resulting range thus straddles the end
point, with width
positions on either side.
promoters
: assumes that the ranges in x
represent
transcript regions and returns the ranges of the corresponding promoter
regions. The result is another GenomicRanges derivative
parallel to the input, that is, of the same length as x
and with the i-th element in the output corresponding to the i-th
element in the input.
The promoter regions extend around the transcription start
sites (TSS) which are located at start(x)
for ranges on the
+
or *
strand, and at end(x)
for ranges on the
-
strand.
The upstream
and downstream
arguments define the
number of nucleotides in the 5' and 3' direction, respectively.
More precisely, the output range is defined as
(start(x) - upstream) to (start(x) + downstream - 1)
for ranges on the +
or *
strand, and as
(end(x) - downstream + 1) to (end(x) + upstream)
for ranges on the -
strand.
Be aware that the returned object might contain out-of-bound ranges i.e. ranges that start before the first nucleotide position and/or end after the last nucleotide position of the underlying sequence.
The returned object will always have the same class as x
,
except when x
is a GPos object in which case a
GRanges instance is returned.
terminators
: like promoters
but returns the ranges
of the terminator regions. These regions extend around the transcription
end sites (TES) which are located at end(x)
for ranges on the
+
or *
strand, and at start(x)
for ranges on the
-
strand.
restrict
: returns an object of the same type and length as
x
containing restricted ranges for distinct seqnames. The
start
and end
arguments can be a named numeric vector of
seqnames for the ranges to be resticted or a numeric vector or length 1
if the restriction operation is to be applied to all the sequences in
x
. See ?`intra-range-methods`
for more
information about range restriction and for a description of the optional
arguments.
trim
:trims out-of-bound ranges located on non-circular sequences whose length is not NA.
P. Aboyoun, V. Obenchain, and H. Pagès
GenomicRanges, GRanges, and GRangesList objects.
The intra-range-methods man page in the IRanges package.
The IntegerRanges class in the IRanges package.
## --------------------------------------------------------------------- ## A. ON A GRanges OBJECT ## --------------------------------------------------------------------- gr <- GRanges( seqnames=Rle(paste("chr", c(1, 2, 1, 3), sep=""), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=letters[1:10]), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) gr shift(gr, 1) narrow(gr[-10], start=2, end=-2) resize(gr, width=10) flank(gr, width=10) restrict(gr, start=3, end=7) gr <- GRanges("chr1", IRanges(rep(10, 3), width=8), c("+", "-", "*")) promoters(gr, 2, 5) promoters(gr, upstream=0, downstream=1) # TSS terminators(gr, 2, 5) terminators(gr, upstream=0, downstream=1) # TES ## --------------------------------------------------------------------- ## B. ON A GRangesList OBJECT ## --------------------------------------------------------------------- gr1 <- GRanges("chr2", IRanges(3, 6)) gr2 <- GRanges(c("chr1", "chr1"), IRanges(c(7,13), width=3), strand=c("+", "-")) gr3 <- GRanges(c("chr1", "chr2"), IRanges(c(1, 4), c(3, 9)), strand="-") grl <- GRangesList(gr1= gr1, gr2=gr2, gr3=gr3) grl resize(grl, width=20) flank(grl, width=20) restrict(grl, start=3)
## --------------------------------------------------------------------- ## A. ON A GRanges OBJECT ## --------------------------------------------------------------------- gr <- GRanges( seqnames=Rle(paste("chr", c(1, 2, 1, 3), sep=""), c(1, 3, 2, 4)), ranges=IRanges(1:10, width=10:1, names=letters[1:10]), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score=1:10, GC=seq(1, 0, length=10) ) gr shift(gr, 1) narrow(gr[-10], start=2, end=-2) resize(gr, width=10) flank(gr, width=10) restrict(gr, start=3, end=7) gr <- GRanges("chr1", IRanges(rep(10, 3), width=8), c("+", "-", "*")) promoters(gr, 2, 5) promoters(gr, upstream=0, downstream=1) # TSS terminators(gr, 2, 5) terminators(gr, upstream=0, downstream=1) # TES ## --------------------------------------------------------------------- ## B. ON A GRangesList OBJECT ## --------------------------------------------------------------------- gr1 <- GRanges("chr2", IRanges(3, 6)) gr2 <- GRanges(c("chr1", "chr1"), IRanges(c(7,13), width=3), strand=c("+", "-")) gr3 <- GRanges(c("chr1", "chr2"), IRanges(c(1, 4), c(3, 9)), strand="-") grl <- GRangesList(gr1= gr1, gr2=gr2, gr3=gr3) grl resize(grl, width=20) flank(grl, width=20) restrict(grl, start=3)
makeGRangesFromDataFrame
takes a data-frame-like object as
input and tries to automatically find the columns that describe
genomic ranges. It returns them as a GRanges object.
makeGRangesFromDataFrame
is also the workhorse behind the
coercion method from data.frame (or DataFrame) to
GRanges.
makeGRangesFromDataFrame(df, keep.extra.columns=FALSE, ignore.strand=FALSE, seqinfo=NULL, seqnames.field=c("seqnames", "seqname", "chromosome", "chrom", "chr", "chromosome_name", "seqid"), start.field="start", end.field=c("end", "stop"), strand.field="strand", starts.in.df.are.0based=FALSE, na.rm=FALSE)
makeGRangesFromDataFrame(df, keep.extra.columns=FALSE, ignore.strand=FALSE, seqinfo=NULL, seqnames.field=c("seqnames", "seqname", "chromosome", "chrom", "chr", "chromosome_name", "seqid"), start.field="start", end.field=c("end", "stop"), strand.field="strand", starts.in.df.are.0based=FALSE, na.rm=FALSE)
df |
A data.frame or DataFrame object. If not, then
the function first tries to turn |
keep.extra.columns |
|
ignore.strand |
|
seqinfo |
Either |
seqnames.field |
A character vector of recognized names for the column in |
start.field |
A character vector of recognized names for the column in |
end.field |
A character vector of recognized names for the column in |
strand.field |
A character vector of recognized names for the column in |
starts.in.df.are.0based |
|
na.rm |
|
A GRanges object with one element per row in the input.
If the seqinfo
argument was supplied, the returned object will
have exactly the seqlevels specified in seqinfo
and in the same
order. Otherwise, the seqlevels are ordered according to the output of
the rankSeqlevels
function (except if
df
contains the seqnames in the form of a factor-Rle, in which
case the levels of the factor-Rle become the seqlevels of the returned
object and with no re-ordering).
If df
has non-automatic row names (i.e. rownames(df)
is
not NULL
and is not seq_len(nrow(df))
), then they will be
used to set names on the returned GRanges object.
Coercing data.frame or DataFrame df
into
a GRanges object (with as(df, "GRanges")
), or
calling GRanges(df)
, are both equivalent to calling
makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)
.
H. Pagès, based on a proposal by Kasper Daniel Hansen
GRanges objects.
Seqinfo objects and the
rankSeqlevels
function in the
GenomeInfoDb package.
The makeGRangesListFromFeatureFragments
function
for making a GRangesList object from a list of fragmented
features.
The getTable
function in the
rtracklayer package for an R interface to the UCSC
Table Browser.
DataFrame objects in the S4Vectors package.
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- df <- data.frame(chr="chr1", start=11:15, end=12:16, strand=c("+","-","+","*","."), score=1:5) df makeGRangesFromDataFrame(df) # strand value "." is replaced with "*" ## NA in ranges df$start[5] <- df$end[2] <- NA df #makeGRangesFromDataFrame(df) # error! makeGRangesFromDataFrame(df, na.rm=TRUE) # rows with NAs got dropped ## The strand column is optional: df <- data.frame(chr="chr1", start=11:15, end=12:16, score=1:5) makeGRangesFromDataFrame(df) gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE) gr2 <- as(df, "GRanges") # equivalent to the above stopifnot(identical(gr, gr2)) gr2 <- GRanges(df) # equivalent to the above stopifnot(identical(gr, gr2)) makeGRangesFromDataFrame(df, ignore.strand=TRUE) makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, ignore.strand=TRUE) makeGRangesFromDataFrame(df, seqinfo=paste0("chr", 4:1)) makeGRangesFromDataFrame(df, seqinfo=c(chrM=NA, chr1=500, chrX=100)) makeGRangesFromDataFrame(df, seqinfo=Seqinfo(paste0("chr", 4:1))) ## --------------------------------------------------------------------- ## ABOUT AUTOMATIC DETECTION OF THE seqnames/start/end/strand COLUMNS ## --------------------------------------------------------------------- ## Automatic detection of the seqnames/start/end/strand columns is ## case insensitive: df <- data.frame(ChRoM="chr1", StarT=11:15, stoP=12:16, STRAND=c("+","-","+","*","."), score=1:5) makeGRangesFromDataFrame(df) ## It also ignores a common prefix between the start and end columns: df <- data.frame(seqnames="chr1", tx_start=11:15, tx_end=12:16, strand=c("+","-","+","*","."), score=1:5) makeGRangesFromDataFrame(df) ## The common prefix between the start and end columns is used to ## disambiguate between more than one seqnames column: df <- data.frame(chrom="chr1", tx_start=11:15, tx_end=12:16, tx_chr="chr2", score=1:5) makeGRangesFromDataFrame(df) ## --------------------------------------------------------------------- ## 0-BASED VS 1-BASED START POSITIONS ## --------------------------------------------------------------------- if (require(rtracklayer)) { session <- browserSession() genome(session) <- "sacCer2" query <- ucscTableQuery(session, "Assembly") df <- getTable(query) head(df) ## A common pitfall is to forget that the UCSC Table Browser uses the ## "0-based start" convention: gr0 <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, start.field="chromStart", end.field="chromEnd") head(gr0) ## The start positions need to be converted into 1-based positions, ## to adhere to the convention used in Bioconductor: gr1 <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, start.field="chromStart", end.field="chromEnd", starts.in.df.are.0based=TRUE) head(gr1) }
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- df <- data.frame(chr="chr1", start=11:15, end=12:16, strand=c("+","-","+","*","."), score=1:5) df makeGRangesFromDataFrame(df) # strand value "." is replaced with "*" ## NA in ranges df$start[5] <- df$end[2] <- NA df #makeGRangesFromDataFrame(df) # error! makeGRangesFromDataFrame(df, na.rm=TRUE) # rows with NAs got dropped ## The strand column is optional: df <- data.frame(chr="chr1", start=11:15, end=12:16, score=1:5) makeGRangesFromDataFrame(df) gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE) gr2 <- as(df, "GRanges") # equivalent to the above stopifnot(identical(gr, gr2)) gr2 <- GRanges(df) # equivalent to the above stopifnot(identical(gr, gr2)) makeGRangesFromDataFrame(df, ignore.strand=TRUE) makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, ignore.strand=TRUE) makeGRangesFromDataFrame(df, seqinfo=paste0("chr", 4:1)) makeGRangesFromDataFrame(df, seqinfo=c(chrM=NA, chr1=500, chrX=100)) makeGRangesFromDataFrame(df, seqinfo=Seqinfo(paste0("chr", 4:1))) ## --------------------------------------------------------------------- ## ABOUT AUTOMATIC DETECTION OF THE seqnames/start/end/strand COLUMNS ## --------------------------------------------------------------------- ## Automatic detection of the seqnames/start/end/strand columns is ## case insensitive: df <- data.frame(ChRoM="chr1", StarT=11:15, stoP=12:16, STRAND=c("+","-","+","*","."), score=1:5) makeGRangesFromDataFrame(df) ## It also ignores a common prefix between the start and end columns: df <- data.frame(seqnames="chr1", tx_start=11:15, tx_end=12:16, strand=c("+","-","+","*","."), score=1:5) makeGRangesFromDataFrame(df) ## The common prefix between the start and end columns is used to ## disambiguate between more than one seqnames column: df <- data.frame(chrom="chr1", tx_start=11:15, tx_end=12:16, tx_chr="chr2", score=1:5) makeGRangesFromDataFrame(df) ## --------------------------------------------------------------------- ## 0-BASED VS 1-BASED START POSITIONS ## --------------------------------------------------------------------- if (require(rtracklayer)) { session <- browserSession() genome(session) <- "sacCer2" query <- ucscTableQuery(session, "Assembly") df <- getTable(query) head(df) ## A common pitfall is to forget that the UCSC Table Browser uses the ## "0-based start" convention: gr0 <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, start.field="chromStart", end.field="chromEnd") head(gr0) ## The start positions need to be converted into 1-based positions, ## to adhere to the convention used in Bioconductor: gr1 <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE, start.field="chromStart", end.field="chromEnd", starts.in.df.are.0based=TRUE) head(gr1) }
makeGRangesListFromDataFrame
extends the
makeGRangesFromDataFrame functionality from
GenomicRanges
. It can take a data-frame-like object as input
and tries to automatically find the columns that describe the genomic
ranges. It returns a GRangesList object. This is different from the
makeGRangesFromDataFrame
function by requiring a
split.field
. The split.field
acts like the
"f" argument in the split
function. This factor
must be of the same length as the number of rows in the DataFrame
argument. The split.field
may also be a character vector.
makeGRangesListFromDataFrame(df, split.field = NULL, names.field = NULL, ...)
makeGRangesListFromDataFrame(df, split.field = NULL, names.field = NULL, ...)
df |
A |
split.field |
A character string of a recognized column name in |
names.field |
An optional single |
... |
Additional arguments passed on to makeGRangesFromDataFrame |
A GRangesList of the same length as the number of levels or
unique character strings in the df
column indicated by
split.field
. When split.field
is not provided the df
is split by row and the resulting GRangesList has the
same length as nrow(df).
Names on the individual ranges are taken from the names.field
argument. Names on the outer list elements of the GRangesList
are propagated from split.field
.
M. Ramos
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- df <- data.frame(chr="chr1", start=11:15, end=12:16, strand=c("+","-","+","*","."), score=1:5, specimen = c("a", "a", "b", "b", "c"), gene_symbols = paste0("GENE", letters[1:5])) df grl <- makeGRangesListFromDataFrame(df, split.field = "specimen", names.field = "gene_symbols") grl names(grl) ## Keep metadata columns makeGRangesListFromDataFrame(df, split.field = "specimen", keep.extra.columns = TRUE)
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- df <- data.frame(chr="chr1", start=11:15, end=12:16, strand=c("+","-","+","*","."), score=1:5, specimen = c("a", "a", "b", "b", "c"), gene_symbols = paste0("GENE", letters[1:5])) df grl <- makeGRangesListFromDataFrame(df, split.field = "specimen", names.field = "gene_symbols") grl names(grl) ## Keep metadata columns makeGRangesListFromDataFrame(df, split.field = "specimen", keep.extra.columns = TRUE)
The nearest
, precede
, follow
, distance
,
nearestKNeighbors
, and distanceToNearest
methods for
GenomicRanges
objects and subclasses.
## S4 method for signature 'GenomicRanges,GenomicRanges' precede(x, subject, select=c("first", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' precede(x, subject, select=c("first", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' follow(x, subject, select=c("last", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' follow(x, subject, select=c("last", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' nearest(x, subject, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' nearest(x, subject, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' nearestKNeighbors(x, subject, k=1L, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' nearestKNeighbors(x, subject, k=1L, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' distanceToNearest(x, subject, ignore.strand=FALSE, ...) ## S4 method for signature 'GenomicRanges,missing' distanceToNearest(x, subject, ignore.strand=FALSE, ...) ## S4 method for signature 'GenomicRanges,GenomicRanges' distance(x, y, ignore.strand=FALSE, ...)
## S4 method for signature 'GenomicRanges,GenomicRanges' precede(x, subject, select=c("first", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' precede(x, subject, select=c("first", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' follow(x, subject, select=c("last", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' follow(x, subject, select=c("last", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' nearest(x, subject, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' nearest(x, subject, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' nearestKNeighbors(x, subject, k=1L, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,missing' nearestKNeighbors(x, subject, k=1L, select=c("arbitrary", "all"), ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' distanceToNearest(x, subject, ignore.strand=FALSE, ...) ## S4 method for signature 'GenomicRanges,missing' distanceToNearest(x, subject, ignore.strand=FALSE, ...) ## S4 method for signature 'GenomicRanges,GenomicRanges' distance(x, y, ignore.strand=FALSE, ...)
x |
The query GenomicRanges instance. |
subject |
The subject GenomicRanges instance
within which the nearest neighbors are found. Can be missing,
in which case |
y |
For the |
k |
For the |
select |
Logic for handling ties. By default, all methods
select a single interval (arbitrary for When |
ignore.strand |
A |
... |
Additional arguments for methods. |
nearest
:Performs conventional nearest neighbor finding.
Returns an integer vector containing the index of the nearest neighbor
range in subject
for each range in x
. If there is no
nearest neighbor NA
is returned. For details of the algorithm
see the man page in the IRanges package (?nearest
).
precede
:For each range in x
, precede
returns
the index of the range in subject
that is directly
preceded by the range in x
. Overlapping ranges are excluded.
NA
is returned when there are no qualifying ranges in
subject
.
follow
:The opposite of precede
, follow
returns
the index of the range in subject
that is directly followed by the
range in x
. Overlapping ranges are excluded. NA
is returned
when there are no qualifying ranges in subject
.
nearestKNeighbors
:Performs conventional k-nearest neighbor finding.
Returns an IntegerList containing the index of the
k-nearest neighbors in subject
for each range in x
. If there
is no nearest neighbor NA
is returned. If select="all"
is
specified, ties will be included in the resulting
IntegerList.
precede
and follow
: Orientation is 5' to 3', consistent with the direction of translation.
Because positional numbering along a chromosome is from left to
right and transcription takes place from 5' to 3', precede
and
follow
can appear to have ‘opposite’ behavior on the +
and -
strand. Using positions 5 and 6 as an example, 5 precedes
6 on the +
strand but follows 6 on the -
strand.
The table below outlines the orientation when ranges on different
strands are compared. In general, a feature on *
is considered
to belong to both strands. The single exception is when both x
and subject
are *
in which case both are treated as +
.
x | subject | orientation -----+-----------+---------------- a) + | + | ---> b) + | - | NA c) + | * | ---> d) - | + | NA e) - | - | <--- f) - | * | <--- g) * | + | ---> h) * | - | <--- i) * | * | ---> (the only situation where * arbitrarily means +)
distanceToNearest
:Returns the distance for each range in x
to its nearest neighbor in the subject
.
distance
:Returns the distance for each range in x
to the range in y
.
The behavior of distance
has changed in Bioconductor 2.12.
See the man page ?distance
in the IRanges package for
details.
For nearest
, precede
and follow
, an integer
vector of indices in subject
, or a Hits if
select="all"
.
For nearestKNeighbors
, an IntegerList of vertices in
subject
.
For distanceToNearest
, a Hits object with a
column for the query
index (queryHits), subject
index
(subjectHits) and the distance
between the pair.
For distance
, an integer vector of distances between the ranges
in x
and y
.
P. Aboyoun and V. Obenchain
The GenomicRanges and GRanges classes.
The IntegerRanges class in the IRanges package.
The Hits class in the S4Vectors package.
The nearest-methods man page in the IRanges package.
findOverlaps-methods for finding just the overlapping ranges.
The nearest-methods man page in the GenomicFeatures package.
## ----------------------------------------------------------- ## precede() and follow() ## ----------------------------------------------------------- query <- GRanges("A", IRanges(c(5, 20), width=1), strand="+") subject <- GRanges("A", IRanges(rep(c(10, 15), 2), width=1), strand=c("+", "+", "-", "-")) precede(query, subject) follow(query, subject) strand(query) <- "-" precede(query, subject) follow(query, subject) ## ties choose first in order query <- GRanges("A", IRanges(10, width=1), c("+", "-", "*")) subject <- GRanges("A", IRanges(c(5, 5, 5, 15, 15, 15), width=1), rep(c("+", "-", "*"), 2)) precede(query, subject) precede(query, rev(subject)) ## ignore.strand=TRUE treats all ranges as '+' precede(query[1], subject[4:6], select="all", ignore.strand=FALSE) precede(query[1], subject[4:6], select="all", ignore.strand=TRUE) ## ----------------------------------------------------------- ## nearest() ## ----------------------------------------------------------- ## When multiple ranges overlap an "arbitrary" range is chosen query <- GRanges("A", IRanges(5, 15)) subject <- GRanges("A", IRanges(c(1, 15), c(5, 19))) nearest(query, subject) ## select="all" returns all hits nearest(query, subject, select="all") ## Ranges in 'x' will self-select when 'subject' is present query <- GRanges("A", IRanges(c(1, 10), width=5)) nearest(query, query) ## Ranges in 'x' will not self-select when 'subject' is missing nearest(query) ## ----------------------------------------------------------- ## nearestKNeighbors() ## ----------------------------------------------------------- ## Without an argument, k defaults to 1 query <- GRanges("A", IRanges(c(2, 5), c(8, 15))) subject <- GRanges("A", IRanges(c(1, 4, 10, 15), c(5, 7, 12, 19))) nearestKNeighbors(query, subject) ## Return multiple neighbors with k > 1 nearestKNeighbors(query, subject, k=3) ## select="all" returns all hits nearestKNeighbors(query, subject, select="all") ## ----------------------------------------------------------- ## distance(), distanceToNearest() ## ----------------------------------------------------------- ## Adjacent, overlap, separated by 1 query <- GRanges("A", IRanges(c(1, 2, 10), c(5, 8, 11))) subject <- GRanges("A", IRanges(c(6, 5, 13), c(10, 10, 15))) distance(query, subject) ## recycling distance(query[1], subject) ## zero-width ranges zw <- GRanges("A", IRanges(4,3)) stopifnot(distance(zw, GRanges("A", IRanges(3,4))) == 0L) sapply(-3:3, function(i) distance(shift(zw, i), GRanges("A", IRanges(4,3)))) query <- GRanges(c("A", "B"), IRanges(c(1, 5), width=1)) distanceToNearest(query, subject) ## distance() with GRanges and TxDb see the ## ?'distance,GenomicRanges,TxDb-method' man ## page in the GenomicFeatures package.
## ----------------------------------------------------------- ## precede() and follow() ## ----------------------------------------------------------- query <- GRanges("A", IRanges(c(5, 20), width=1), strand="+") subject <- GRanges("A", IRanges(rep(c(10, 15), 2), width=1), strand=c("+", "+", "-", "-")) precede(query, subject) follow(query, subject) strand(query) <- "-" precede(query, subject) follow(query, subject) ## ties choose first in order query <- GRanges("A", IRanges(10, width=1), c("+", "-", "*")) subject <- GRanges("A", IRanges(c(5, 5, 5, 15, 15, 15), width=1), rep(c("+", "-", "*"), 2)) precede(query, subject) precede(query, rev(subject)) ## ignore.strand=TRUE treats all ranges as '+' precede(query[1], subject[4:6], select="all", ignore.strand=FALSE) precede(query[1], subject[4:6], select="all", ignore.strand=TRUE) ## ----------------------------------------------------------- ## nearest() ## ----------------------------------------------------------- ## When multiple ranges overlap an "arbitrary" range is chosen query <- GRanges("A", IRanges(5, 15)) subject <- GRanges("A", IRanges(c(1, 15), c(5, 19))) nearest(query, subject) ## select="all" returns all hits nearest(query, subject, select="all") ## Ranges in 'x' will self-select when 'subject' is present query <- GRanges("A", IRanges(c(1, 10), width=5)) nearest(query, query) ## Ranges in 'x' will not self-select when 'subject' is missing nearest(query) ## ----------------------------------------------------------- ## nearestKNeighbors() ## ----------------------------------------------------------- ## Without an argument, k defaults to 1 query <- GRanges("A", IRanges(c(2, 5), c(8, 15))) subject <- GRanges("A", IRanges(c(1, 4, 10, 15), c(5, 7, 12, 19))) nearestKNeighbors(query, subject) ## Return multiple neighbors with k > 1 nearestKNeighbors(query, subject, k=3) ## select="all" returns all hits nearestKNeighbors(query, subject, select="all") ## ----------------------------------------------------------- ## distance(), distanceToNearest() ## ----------------------------------------------------------- ## Adjacent, overlap, separated by 1 query <- GRanges("A", IRanges(c(1, 2, 10), c(5, 8, 11))) subject <- GRanges("A", IRanges(c(6, 5, 13), c(10, 10, 15))) distance(query, subject) ## recycling distance(query[1], subject) ## zero-width ranges zw <- GRanges("A", IRanges(4,3)) stopifnot(distance(zw, GRanges("A", IRanges(3,4))) == 0L) sapply(-3:3, function(i) distance(shift(zw, i), GRanges("A", IRanges(4,3)))) query <- GRanges(c("A", "B"), IRanges(c(1, 5), width=1)) distanceToNearest(query, subject) ## distance() with GRanges and TxDb see the ## ?'distance,GenomicRanges,TxDb-method' man ## page in the GenomicFeatures package.
The phicoef
function calculates the "phi coefficient" between
two binary variables.
phicoef(x, y=NULL)
phicoef(x, y=NULL)
x , y
|
Two logical vectors of the same length.
If |
The "phi coefficient" between the two binary variables. This is a single numeric value ranging from -1 to +1.
H. Pagès
http://en.wikipedia.org/wiki/Phi_coefficient
set.seed(33) x <- sample(c(TRUE, FALSE), 100, replace=TRUE) y <- sample(c(TRUE, FALSE), 100, replace=TRUE) phicoef(x, y) phicoef(rep(x, 10), c(rep(x, 9), y)) stopifnot(phicoef(table(x, y)) == phicoef(x, y)) stopifnot(phicoef(y, x) == phicoef(x, y)) stopifnot(phicoef(x, !y) == - phicoef(x, y)) stopifnot(phicoef(x, x) == 1)
set.seed(33) x <- sample(c(TRUE, FALSE), 100, replace=TRUE) y <- sample(c(TRUE, FALSE), 100, replace=TRUE) phicoef(x, y) phicoef(rep(x, 10), c(rep(x, 9), y)) stopifnot(phicoef(table(x, y)) == phicoef(x, y)) stopifnot(phicoef(y, x) == phicoef(x, y)) stopifnot(phicoef(x, !y) == - phicoef(x, y)) stopifnot(phicoef(x, x) == 1)
Performs set operations on GRanges and GRangesList objects.
NOTE: The punion
, pintersect
,
psetdiff
, and pgap
generic
functions and methods for IntegerRanges objects are defined
and documented in the IRanges package.
## Vector-wise set operations ## -------------------------- ## S4 method for signature 'GenomicRanges,GenomicRanges' union(x, y, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' intersect(x, y, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' setdiff(x, y, ignore.strand=FALSE) ## Element-wise (aka "parallel") set operations ## -------------------------------------------- ## S4 method for signature 'GRanges,GRanges' punion(x, y, fill.gap=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GRanges,GRanges' pintersect(x, y, drop.nohit.ranges=FALSE, ignore.strand=FALSE, strict.strand=FALSE) ## S4 method for signature 'GRanges,GRanges' psetdiff(x, y, ignore.strand=FALSE)
## Vector-wise set operations ## -------------------------- ## S4 method for signature 'GenomicRanges,GenomicRanges' union(x, y, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' intersect(x, y, ignore.strand=FALSE) ## S4 method for signature 'GenomicRanges,GenomicRanges' setdiff(x, y, ignore.strand=FALSE) ## Element-wise (aka "parallel") set operations ## -------------------------------------------- ## S4 method for signature 'GRanges,GRanges' punion(x, y, fill.gap=FALSE, ignore.strand=FALSE) ## S4 method for signature 'GRanges,GRanges' pintersect(x, y, drop.nohit.ranges=FALSE, ignore.strand=FALSE, strict.strand=FALSE) ## S4 method for signature 'GRanges,GRanges' psetdiff(x, y, ignore.strand=FALSE)
x , y
|
For For For For In addition, for the parallel operations, |
fill.gap |
Logical indicating whether or not to force a union by using the rule
|
ignore.strand |
For set operations: If set to TRUE, then the strand of For parallel set operations: If set to TRUE, the strand information is
ignored in the computation and the result has the strand information of
|
drop.nohit.ranges |
If TRUE then elements in If FALSE (the default) then nothing is removed and a |
strict.strand |
If set to FALSE (the default), features on the |
The pintersect
methods involving GRanges and/or
GRangesList objects use the triplet (sequence name, range, strand)
to determine the element by element intersection of features, where a
strand value of "*"
is treated as occurring on both the "+"
and "-"
strand (unless strict.strand
is set to TRUE, in
which case the strand of intersecting elements must be strictly the same).
The psetdiff
methods involving GRanges and/or
GRangesList objects use the triplet (sequence name, range,
strand) to determine the element by element set difference of features,
where a strand value of "*"
is treated as occurring on both the
"+"
and "-"
strand.
For union
, intersect
, and setdiff
: a GRanges
object if x
and y
are GenomicRanges objects,
and a GRangesList object if they are GRangesList objects.
For punion
and pintersect
: when x
or y
is
not a GRanges object, an object of the same class as this
non-GRanges object. Otherwise, a GRanges object.
For psetdiff
: either a GRanges object when both x
and y
are GRanges objects, or a GRangesList object
when y
is a GRangesList object.
For pgap
: a GRanges object.
P. Aboyoun and H. Pagès
subtract for subtracting a set of genomic ranges from a GRanges object (similar to bedtools subtract).
setops-methods in the IRanges package for set operations on IntegerRanges and IntegerRangesList objects.
findOverlaps-methods for finding/counting overlapping genomic ranges.
intra-range-methods and inter-range-methods for intra range and inter range transformations of a GRanges object.
GRanges and GRangesList objects.
mendoapply
in the S4Vectors package.
## --------------------------------------------------------------------- ## A. SET OPERATIONS ## --------------------------------------------------------------------- x <- GRanges("chr1", IRanges(c(2, 9) , c(7, 19)), strand=c("+", "-")) y <- GRanges("chr1", IRanges(5, 10), strand="-") union(x, y) union(x, y, ignore.strand=TRUE) intersect(x, y) intersect(x, y, ignore.strand=TRUE) setdiff(x, y) setdiff(x, y, ignore.strand=TRUE) ## With 2 GRangesList objects: gr1 <- GRanges(seqnames="chr2", ranges=IRanges(3, 6)) gr2 <- GRanges(seqnames=c("chr1", "chr1"), ranges=IRanges(c(7,13), width = 3), strand=c("+", "-")) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-")) grlist <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) union(grlist, shift(grlist, 3)) intersect(grlist, shift(grlist, 3)) setdiff(grlist, shift(grlist, 3)) ## Sanity checks: grlist2 <- shift(grlist, 3) stopifnot(identical( union(grlist, grlist2), mendoapply(union, grlist, grlist2) )) stopifnot(identical( intersect(grlist, grlist2), mendoapply(intersect, grlist, grlist2) )) stopifnot(identical( setdiff(grlist, grlist2), mendoapply(setdiff, grlist, grlist2) )) ## --------------------------------------------------------------------- ## B. PARALLEL SET OPERATIONS ## --------------------------------------------------------------------- punion(x, shift(x, 6)) ## Not run: punion(x, shift(x, 7)) # will fail ## End(Not run) punion(x, shift(x, 7), fill.gap=TRUE) pintersect(x, shift(x, 6)) pintersect(x, shift(x, 7)) psetdiff(x, shift(x, 7)) ## --------------------------------------------------------------------- ## C. MORE EXAMPLES ## --------------------------------------------------------------------- ## GRanges object: gr <- GRanges(seqnames=c("chr2", "chr1", "chr1"), ranges=IRanges(1:3, width = 12), strand=Rle(strand(c("-", "*", "-")))) ## Parallel intersection of a GRanges and a GRangesList object pintersect(gr, grlist) pintersect(grlist, gr) ## For a fast 'mendoapply(intersect, grlist, as(gr, "GRangesList"))' ## call pintersect() with 'strict.strand=TRUE' and call reduce() on ## the result with 'drop.empty.ranges=TRUE': reduce(pintersect(grlist, gr, strict.strand=TRUE), drop.empty.ranges=TRUE) ## Parallel set difference of a GRanges and a GRangesList object psetdiff(gr, grlist)
## --------------------------------------------------------------------- ## A. SET OPERATIONS ## --------------------------------------------------------------------- x <- GRanges("chr1", IRanges(c(2, 9) , c(7, 19)), strand=c("+", "-")) y <- GRanges("chr1", IRanges(5, 10), strand="-") union(x, y) union(x, y, ignore.strand=TRUE) intersect(x, y) intersect(x, y, ignore.strand=TRUE) setdiff(x, y) setdiff(x, y, ignore.strand=TRUE) ## With 2 GRangesList objects: gr1 <- GRanges(seqnames="chr2", ranges=IRanges(3, 6)) gr2 <- GRanges(seqnames=c("chr1", "chr1"), ranges=IRanges(c(7,13), width = 3), strand=c("+", "-")) gr3 <- GRanges(seqnames=c("chr1", "chr2"), ranges=IRanges(c(1, 4), c(3, 9)), strand=c("-", "-")) grlist <- GRangesList(gr1=gr1, gr2=gr2, gr3=gr3) union(grlist, shift(grlist, 3)) intersect(grlist, shift(grlist, 3)) setdiff(grlist, shift(grlist, 3)) ## Sanity checks: grlist2 <- shift(grlist, 3) stopifnot(identical( union(grlist, grlist2), mendoapply(union, grlist, grlist2) )) stopifnot(identical( intersect(grlist, grlist2), mendoapply(intersect, grlist, grlist2) )) stopifnot(identical( setdiff(grlist, grlist2), mendoapply(setdiff, grlist, grlist2) )) ## --------------------------------------------------------------------- ## B. PARALLEL SET OPERATIONS ## --------------------------------------------------------------------- punion(x, shift(x, 6)) ## Not run: punion(x, shift(x, 7)) # will fail ## End(Not run) punion(x, shift(x, 7), fill.gap=TRUE) pintersect(x, shift(x, 6)) pintersect(x, shift(x, 7)) psetdiff(x, shift(x, 7)) ## --------------------------------------------------------------------- ## C. MORE EXAMPLES ## --------------------------------------------------------------------- ## GRanges object: gr <- GRanges(seqnames=c("chr2", "chr1", "chr1"), ranges=IRanges(1:3, width = 12), strand=Rle(strand(c("-", "*", "-")))) ## Parallel intersection of a GRanges and a GRangesList object pintersect(gr, grlist) pintersect(grlist, gr) ## For a fast 'mendoapply(intersect, grlist, as(gr, "GRangesList"))' ## call pintersect() with 'strict.strand=TRUE' and call reduce() on ## the result with 'drop.empty.ranges=TRUE': reduce(pintersect(grlist, gr, strict.strand=TRUE), drop.empty.ranges=TRUE) ## Parallel set difference of a GRanges and a GRangesList object psetdiff(gr, grlist)
A bunch of useful strand
and invertStrand
methods.
## S4 method for signature 'missing' strand(x) ## S4 method for signature 'character' strand(x) ## S4 method for signature 'factor' strand(x) ## S4 method for signature 'integer' strand(x) ## S4 method for signature 'logical' strand(x) ## S4 method for signature 'Rle' strand(x) ## S4 method for signature 'RleList' strand(x) ## S4 method for signature 'DataFrame' strand(x) ## S4 replacement method for signature 'DataFrame,ANY' strand(x) <- value ## S4 method for signature 'character' invertStrand(x) ## S4 method for signature 'factor' invertStrand(x) ## S4 method for signature 'integer' invertStrand(x) ## S4 method for signature 'logical' invertStrand(x) ## S4 method for signature 'Rle' invertStrand(x) ## S4 method for signature 'RleList' invertStrand(x)
## S4 method for signature 'missing' strand(x) ## S4 method for signature 'character' strand(x) ## S4 method for signature 'factor' strand(x) ## S4 method for signature 'integer' strand(x) ## S4 method for signature 'logical' strand(x) ## S4 method for signature 'Rle' strand(x) ## S4 method for signature 'RleList' strand(x) ## S4 method for signature 'DataFrame' strand(x) ## S4 replacement method for signature 'DataFrame,ANY' strand(x) <- value ## S4 method for signature 'character' invertStrand(x) ## S4 method for signature 'factor' invertStrand(x) ## S4 method for signature 'integer' invertStrand(x) ## S4 method for signature 'logical' invertStrand(x) ## S4 method for signature 'Rle' invertStrand(x) ## S4 method for signature 'RleList' invertStrand(x)
x |
The object from which to obtain a strand factor, strand factor Rle, or strand factor RleList object. Can be missing. See Details and Value sections below for more information. |
value |
Replacement value for the strand. |
All the strand
and invertStrand
methods documented
here return either a strand factor, strand factor
Rle, or strand factor RleList object.
These are factor, factor-Rle, or factor-RleList
objects containing the "standard strand levels" (i.e. +
, -
,
and *
) and no NAs.
All the strand
and invertStrand
methods documented here
return an object that is parallel to input object x
when
x
is a character, factor, integer, logical, Rle,
or RleList object.
For the strand
methods:
If x
is missing, returns an empty factor with the
"standard strand levels" i.e. +
, -
, and *
.
If x
is a character vector or factor, it is coerced to a
factor with the levels listed above. NA
values in x
are not accepted.
If x
is an integer vector, it is coerced to a factor
with the levels listed above. 1
, -1
, and NA
values in x
are mapped to the +
, -
, and
*
levels respectively.
If x
is a logical vector, it is coerced to a factor
with the levels listed above. FALSE
, TRUE
, and
NA
values in x
are mapped to the +
, -
,
and *
levels respectively.
If x
is a character-, factor-, integer-, or
logical-Rle, it is transformed with
runValue(x) <- strand(runValue(x))
and returned.
If x
is an RleList object, each list element in
x
is transformed by calling strand()
on it and
the resulting RleList object is returned. More precisely
the returned object is endoapply(x, strand)
.
Note that in addition to being parallel to x
, this
object also has the same shape as x
(i.e. its list
elements have the same lengths as in x
).
If x
is a DataFrame
object, the "strand"
column is passed thru strand()
and returned.
If x
has no "strand"
column, this return value is
populated with *
s.
Each invertStrand
method returns the same object as its corresponding
strand
method but with "+"
and "-"
switched.
M. Lawrence and H. Pagès
strand() x1 <- c("-", "*", "*", "+", "-", "*") x2 <- factor(c("-", "-", "+", "-")) x3 <- c(-1L, NA, NA, 1L, -1L, NA) x4 <- c(TRUE, NA, NA, FALSE, TRUE, NA) strand(x1) invertStrand(x1) strand(x2) invertStrand(x2) strand(x3) invertStrand(x3) strand(x4) invertStrand(x4) strand(Rle(x1)) invertStrand(Rle(x1)) strand(Rle(x2)) invertStrand(Rle(x2)) strand(Rle(x3)) invertStrand(Rle(x3)) strand(Rle(x4)) invertStrand(Rle(x4)) x5 <- RleList(x1, character(0), as.character(x2)) strand(x5) invertStrand(x5) strand(DataFrame(score=2:-3)) strand(DataFrame(score=2:-3, strand=x3)) strand(DataFrame(score=2:-3, strand=Rle(x3))) ## Sanity checks: target <- strand(x1) stopifnot(identical(target, strand(x3))) stopifnot(identical(target, strand(x4))) stopifnot(identical(Rle(strand(x1)), strand(Rle(x1)))) stopifnot(identical(Rle(strand(x2)), strand(Rle(x2)))) stopifnot(identical(Rle(strand(x3)), strand(Rle(x3)))) stopifnot(identical(Rle(strand(x4)), strand(Rle(x4))))
strand() x1 <- c("-", "*", "*", "+", "-", "*") x2 <- factor(c("-", "-", "+", "-")) x3 <- c(-1L, NA, NA, 1L, -1L, NA) x4 <- c(TRUE, NA, NA, FALSE, TRUE, NA) strand(x1) invertStrand(x1) strand(x2) invertStrand(x2) strand(x3) invertStrand(x3) strand(x4) invertStrand(x4) strand(Rle(x1)) invertStrand(Rle(x1)) strand(Rle(x2)) invertStrand(Rle(x2)) strand(Rle(x3)) invertStrand(Rle(x3)) strand(Rle(x4)) invertStrand(Rle(x4)) x5 <- RleList(x1, character(0), as.character(x2)) strand(x5) invertStrand(x5) strand(DataFrame(score=2:-3)) strand(DataFrame(score=2:-3, strand=x3)) strand(DataFrame(score=2:-3, strand=Rle(x3))) ## Sanity checks: target <- strand(x1) stopifnot(identical(target, strand(x3))) stopifnot(identical(target, strand(x4))) stopifnot(identical(Rle(strand(x1)), strand(Rle(x1)))) stopifnot(identical(Rle(strand(x2)), strand(Rle(x2)))) stopifnot(identical(Rle(strand(x3)), strand(Rle(x3)))) stopifnot(identical(Rle(strand(x4)), strand(Rle(x4))))
Similar to bedtools subtract.
subtract(x, y, minoverlap=1L, ...) ## S4 method for signature 'GenomicRanges,GenomicRanges' subtract(x, y, minoverlap=1L, ignore.strand=FALSE)
subtract(x, y, minoverlap=1L, ...) ## S4 method for signature 'GenomicRanges,GenomicRanges' subtract(x, y, minoverlap=1L, ignore.strand=FALSE)
x , y
|
Two GRanges objects, typically, but any GenomicRanges
derivative should be supported.
Note that reduce(y, ignore.strand=ignore.strand) internally. |
minoverlap |
Minimum overlap (in number of genomic positions) between a range in
|
ignore.strand |
If set to TRUE, the strand information is ignored in the computation
and the strand of |
... |
Further arguments to be passed to specific methods. |
subtract()
first replaces its second argument y
with:
reduce(y, ignore.strand=ignore.strand)
Then it searches for genomic ranges in y
that overlap genomic
ranges in x
by at least the number of base pairs specified via
the minoverlap
argument. If an overlapping range is found in
y
, the overlapping portion is removed from any range in x
involved in the overlap.
Note that by default subtract(x, y)
is equivalent to:
psetdiff(x, rep(GRangesList(y), length(x)))
but will typically be hundred times more efficient.
A GRangesList object parallel to x
, that is, with
one list element per range in x
.
The names and metadata columns on x
are propagated to the result.
H. Pagès
bedtools subtract at https://bedtools.readthedocs.io/en/latest/content/tools/subtract.html
setops-methods for set operations on GRanges objects.
findOverlaps-methods for finding/counting overlapping genomic ranges.
intra-range-methods and inter-range-methods for intra range and inter range transformations of a GRanges object.
GRanges and GRangesList objects.
x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500")) y <- GRanges(c("chr1:21-25", "chr1:38-150")) z <- subtract(x, y) z unlist(z)
x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500")) y <- GRanges(c("chr1:21-25", "chr1:38-150")) z <- subtract(x, y) z unlist(z)
tile
and slidingWindows
methods for GenomicRanges. tile
partitions each range
into a set of tiles, which are defined in terms of their number or
width. slidingWindows
generates sliding windows of a specified
width and frequency.
## S4 method for signature 'GenomicRanges' tile(x, n, width) ## S4 method for signature 'GenomicRanges' slidingWindows(x, width, step=1L)
## S4 method for signature 'GenomicRanges' tile(x, n, width) ## S4 method for signature 'GenomicRanges' slidingWindows(x, width, step=1L)
x |
A GenomicRanges object, like a |
n |
The number of tiles to generate.
See |
width |
The (maximum) width of each tile.
See |
step |
The distance between the start positions of the sliding windows. |
The tile
function splits x
into a GRangesList
,
each element of which corresponds
to a tile, or partition, of x
. Specify the tile geometry with either
n
or width
(not both). Passing n
creates n
tiles
of approximately equal width, truncated by sequence end, while passing
width
tiles the region with ranges of the given width, again truncated
by sequence end.
The slidingWindows
function generates sliding windows within
each range of x
, according to width
and step
,
returning a GRangesList
. If the sliding windows do not exactly
cover a range in x
, the last window is partial.
A GRangesList
object, each element of which corresponds to a window.
M. Lawrence
tile
in the IRanges package.
gr <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, end=11), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13)) # split every range in half tiles <- tile(gr, n = 2L) stopifnot(all(elementNROWS(tiles) == 2L)) # split ranges into subranges of width 2 # odd width ranges must contain one subrange of width 1 tiles <- tile(gr, width = 2L) stopifnot(all(all(width(tiles) %in% c(1L, 2L)))) windows <- slidingWindows(gr, width=3L, step=2L) width(windows[[1L]]) # last range is truncated
gr <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges=IRanges(1:10, end=11), strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), seqlengths=c(chr1=11, chr2=12, chr3=13)) # split every range in half tiles <- tile(gr, n = 2L) stopifnot(all(elementNROWS(tiles) == 2L)) # split ranges into subranges of width 2 # odd width ranges must contain one subrange of width 1 tiles <- tile(gr, width = 2L) stopifnot(all(all(width(tiles) %in% c(1L, 2L)))) windows <- slidingWindows(gr, width=3L, step=2L) width(windows[[1L]]) # last range is truncated
tileGenome
returns a set of genomic regions that form a
partitioning of the specified genome. Each region is called a "tile".
tileGenome(seqlengths, ntile, tilewidth, cut.last.tile.in.chrom=FALSE)
tileGenome(seqlengths, ntile, tilewidth, cut.last.tile.in.chrom=FALSE)
seqlengths |
Either a named numeric vector of chromosome lengths or a Seqinfo
object. More precisely, if a named numeric vector, it must have a length
>= 1, cannot contain NAs or negative values, and cannot have duplicated
names. If a Seqinfo object, then it's first replaced with the
vector of sequence lengths stored in the object (extracted from the object
with the |
ntile |
The number of tiles to generate. |
tilewidth |
The desired tile width. The effective tile width might be slightly different but is guaranteed to never be more than the desired width. |
cut.last.tile.in.chrom |
Whether or not to cut the last tile in each chromosome.
This is set to |
If cut.last.tile.in.chrom
is FALSE
(the default),
a GRangesList object with one list element per tile, each of
them containing a number of genomic ranges equal to the number of
chromosomes it overlaps with. Note that when the tiles are small (i.e.
much smaller than the chromosomes), most of them only overlap with a
single chromosome.
If cut.last.tile.in.chrom
is TRUE
, a GRanges
object with one element (i.e. one genomic range) per tile.
H. Pagès, based on a proposal by M. Morgan
genomicvars for an example of how to compute the binned average of a numerical variable defined along a genome.
GRangesList and GRanges objects.
Seqinfo objects and the seqlengths
getter.
IntegerList objects.
Views objects.
## --------------------------------------------------------------------- ## A. WITH A TOY GENOME ## --------------------------------------------------------------------- seqlengths <- c(chr1=60, chr2=20, chr3=25) ## Create 5 tiles: tiles <- tileGenome(seqlengths, ntile=5) tiles elementNROWS(tiles) # tiles 3 and 4 contain 2 ranges width(tiles) ## Use sum() on this IntegerList object to get the effective tile ## widths: sum(width(tiles)) # each tile covers exactly 21 genomic positions ## Create 9 tiles: tiles <- tileGenome(seqlengths, ntile=9) elementNROWS(tiles) # tiles 6 and 7 contain 2 ranges table(sum(width(tiles))) # some tiles cover 12 genomic positions, # others 11 ## Specify the tile width: tiles <- tileGenome(seqlengths, tilewidth=20) length(tiles) # 6 tiles table(sum(width(tiles))) # effective tile width is <= specified ## Specify the tile width and cut the last tile in each chromosome: tiles <- tileGenome(seqlengths, tilewidth=24, cut.last.tile.in.chrom=TRUE) tiles width(tiles) # each tile covers exactly 24 genomic positions, except # the last tile in each chromosome ## Partition a genome by chromosome ("natural partitioning"): tiles <- tileGenome(seqlengths, tilewidth=max(seqlengths), cut.last.tile.in.chrom=TRUE) tiles # one tile per chromosome ## sanity check stopifnot(all.equal(setNames(end(tiles), seqnames(tiles)), seqlengths)) ## --------------------------------------------------------------------- ## B. WITH A REAL GENOME ## --------------------------------------------------------------------- library(BSgenome.Scerevisiae.UCSC.sacCer2) tiles <- tileGenome(seqinfo(Scerevisiae), ntile=20) tiles tiles <- tileGenome(seqinfo(Scerevisiae), tilewidth=50000, cut.last.tile.in.chrom=TRUE) tiles ## --------------------------------------------------------------------- ## C. AN APPLICATION: COMPUTE THE BINNED AVERAGE OF A NUMERICAL VARIABLE ## DEFINED ALONG A GENOME ## --------------------------------------------------------------------- ## See '?genomicvars' for an example of how to compute the binned ## average of a numerical variable defined along a genome.
## --------------------------------------------------------------------- ## A. WITH A TOY GENOME ## --------------------------------------------------------------------- seqlengths <- c(chr1=60, chr2=20, chr3=25) ## Create 5 tiles: tiles <- tileGenome(seqlengths, ntile=5) tiles elementNROWS(tiles) # tiles 3 and 4 contain 2 ranges width(tiles) ## Use sum() on this IntegerList object to get the effective tile ## widths: sum(width(tiles)) # each tile covers exactly 21 genomic positions ## Create 9 tiles: tiles <- tileGenome(seqlengths, ntile=9) elementNROWS(tiles) # tiles 6 and 7 contain 2 ranges table(sum(width(tiles))) # some tiles cover 12 genomic positions, # others 11 ## Specify the tile width: tiles <- tileGenome(seqlengths, tilewidth=20) length(tiles) # 6 tiles table(sum(width(tiles))) # effective tile width is <= specified ## Specify the tile width and cut the last tile in each chromosome: tiles <- tileGenome(seqlengths, tilewidth=24, cut.last.tile.in.chrom=TRUE) tiles width(tiles) # each tile covers exactly 24 genomic positions, except # the last tile in each chromosome ## Partition a genome by chromosome ("natural partitioning"): tiles <- tileGenome(seqlengths, tilewidth=max(seqlengths), cut.last.tile.in.chrom=TRUE) tiles # one tile per chromosome ## sanity check stopifnot(all.equal(setNames(end(tiles), seqnames(tiles)), seqlengths)) ## --------------------------------------------------------------------- ## B. WITH A REAL GENOME ## --------------------------------------------------------------------- library(BSgenome.Scerevisiae.UCSC.sacCer2) tiles <- tileGenome(seqinfo(Scerevisiae), ntile=20) tiles tiles <- tileGenome(seqinfo(Scerevisiae), tilewidth=50000, cut.last.tile.in.chrom=TRUE) tiles ## --------------------------------------------------------------------- ## C. AN APPLICATION: COMPUTE THE BINNED AVERAGE OF A NUMERICAL VARIABLE ## DEFINED ALONG A GENOME ## --------------------------------------------------------------------- ## See '?genomicvars' for an example of how to compute the binned ## average of a numerical variable defined along a genome.