Title: | Introduce *Ranges to bedtools users |
---|---|
Description: | Translates bedtools command-line invocations to R code calling functions from the Bioconductor *Ranges infrastructure. This is intended to educate novice Bioconductor users and to compare the syntax and semantics of the two frameworks. |
Authors: | Michael Lawrence |
Maintainer: | Michael Lawrence <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.33.0 |
Built: | 2024-12-18 03:44:00 UTC |
Source: | https://github.com/bioc/HelloRanges |
HelloRanges uses docopt for parsing the argument string
passed as the cmd
argument to functions like
bedtools_intersect
. bedtools
has its own style of
argument formatting. Here we document the subtle differences.
Here are the specific differences:
docopt requires that multi-character arguments are
prefixed by two hyphens, e.g., --bed. However,
bedtools
expects only a single hyphen. It turns out
docopt is robust to the single-hyphen case, except for
the first argument. Since the typical convention is to first
indicate the file, e.g., -a or -i, this
incompatibility does not often arise in practice.
docopt does not allow values of argument to be
space-separated, while bedtools
often expects space
separation for multi-valued arguments. As a compromise,
HelloRanges expects the values to be comma-separated. Thus,
-b b.bed c.bed needs to be -b b.bed,c.bed.
Most shells support nested commands within parentheses, e.g.,
-b < (grep foo file.bed), but docopt does not
support that. Instead, nested commands should be enclosed in
double quotes, e.g., -b < "grep foo file.bed". Such
constructs are supported via pipe
.
Michael Lawrence
Finds the features in one dataset that are closest to those in another. Supports restriction by strand, upstream, downstream, and overlap. There are several methods for resolving ties. Optionally returns the distance.
bedtools_closest(cmd = "--help") R_bedtools_closest(a, b, s = FALSE, S = FALSE, d = FALSE, D = c("none", "ref", "a", "b"), io = FALSE, iu = FALSE, id = FALSE, fu = FALSE, fd = FALSE, t = c("all", "first", "last"), mdb = c("each", "all"), k = 1L, names = NULL, filenames = FALSE, N = FALSE) do_bedtools_closest(a, b, s = FALSE, S = FALSE, d = FALSE, D = c("none", "ref", "a", "b"), io = FALSE, iu = FALSE, id = FALSE, fu = FALSE, fd = FALSE, t = c("all", "first", "last"), mdb = c("each", "all"), k = 1L, names = NULL, filenames = FALSE, N = FALSE)
bedtools_closest(cmd = "--help") R_bedtools_closest(a, b, s = FALSE, S = FALSE, d = FALSE, D = c("none", "ref", "a", "b"), io = FALSE, iu = FALSE, id = FALSE, fu = FALSE, fd = FALSE, t = c("all", "first", "last"), mdb = c("each", "all"), k = 1L, names = NULL, filenames = FALSE, N = FALSE) do_bedtools_closest(a, b, s = FALSE, S = FALSE, d = FALSE, D = c("none", "ref", "a", "b"), io = FALSE, iu = FALSE, id = FALSE, fu = FALSE, fd = FALSE, t = c("all", "first", "last"), mdb = c("each", "all"), k = 1L, names = NULL, filenames = FALSE, N = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
a |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Each feature in |
b |
Like |
s |
Require same strandedness. That is, find the closest feature in
|
S |
Require opposite strandedness. That is, find the closest feature in
|
d |
In addition to the closest feature in |
D |
Like |
io |
Ignore features in |
iu |
Ignore features in |
id |
Ignore features in |
fu |
Choose first from features in |
fd |
Choose first from features in |
t |
Specify how ties for closest feature should be handled. This occurs
when two features in |
mdb |
How multiple databases should be resolved, either “each”
(find closest in each |
k |
Not supported yet. Report the |
names |
When using multiple databases, provide an alias for each to use instead of their integer index. If a single string, can be comma-separated. |
filenames |
When using multiple databases, use their complete filename instead of their integer index. |
N |
Not yet supported, but related use cases are often solved by
passing a single argument to
|
As with all commands, there are three interfaces to the
closest
command:
bedtools_closest
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_closest
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_closest
Evaluates the result of
R_bedtools_closest
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
The generated code includes calls to utilities like nearest, precede and follow. nearest lacks the ability to restrict its search by direction/overlap, so some complex code is generated to support all of the argument combinations.
Arguments io
, iu
, id
, fu
, and fd
require a notion of upstream/downstream to be defined via D
,
which accepts one of these values:
Report distance with respect to the reference genome. B features with a lower (start, stop) are “upstream”.
Report distance with respect to A. When A is on the - strand, “upstream” means B has a higher (start,stop).
Report distance with respect to B. When B is on the - strand, “upstream” means A has a higher (start,stop).
A language object containing the compiled R code, evaluating to a
Pairs object with the closest hits from a
and b
. If
d
or D
is TRUE
, has a metadata column called
“distance”.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/closest.html
nearest-methods for the various ways to find the nearest ranges.
## Not run: setwd(system.file("unitTests", "data", "closest", package="HelloRanges")) ## End(Not run) ## basic bedtools_closest("-a a.bed -b b.bed -d") ## strand-specific bedtools_closest("-a strand-test-a.bed -b strand-test-b.bed -s") ## break ties bedtools_closest("-a close-a.bed -b close-b.bed -t first") ## multiple databases bedtools_closest("-a mq1.bed -b mdb1.bed,mdb2.bed,mdb3.bed -names a,b,c") ## ignoring upstream bedtools_closest("-a d.bed -b d_iu.bed -D ref -iu")
## Not run: setwd(system.file("unitTests", "data", "closest", package="HelloRanges")) ## End(Not run) ## basic bedtools_closest("-a a.bed -b b.bed -d") ## strand-specific bedtools_closest("-a strand-test-a.bed -b strand-test-b.bed -s") ## break ties bedtools_closest("-a close-a.bed -b close-b.bed -t first") ## multiple databases bedtools_closest("-a mq1.bed -b mdb1.bed,mdb2.bed,mdb3.bed -names a,b,c") ## ignoring upstream bedtools_closest("-a d.bed -b d_iu.bed -D ref -iu")
Finds regions of the genome that are not covered by a genomic dataset.
bedtools_complement(cmd = "--help") R_bedtools_complement(i, g) do_bedtools_complement(i, g)
bedtools_complement(cmd = "--help") R_bedtools_complement(i, g) do_bedtools_complement(i, g)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
As with all commands, there are three interfaces to the
complement
command:
bedtools_complement
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_complement
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_complement
Evaluates the result of
R_bedtools_complement
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
The generated code is subtracts, via
setdiff
, the ranges from the set of
ranges representing the entire genome.
While it may be tempting to call gaps
instead, it is very unlikely to behave as expected. The GenomicRanges
set operations treat all three strand values (+, -, *) as separate
spaces. gaps
takes as its universe the genome on all three
strands, rather than just the “*” strand, resulting in
extraneous stranded ranges.
A language object containing the compiled R code, evaluating to a GRanges object with the complementary ranges.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/complement.html
setops-methods for the various set operations.
## Not run: setwd(system.file("unitTests", "data", "coverage", package="HelloRanges")) ## End(Not run) bedtools_complement("-i a.bed -g test.genome")
## Not run: setwd(system.file("unitTests", "data", "coverage", package="HelloRanges")) ## End(Not run) bedtools_complement("-i a.bed -g test.genome")
Compute the coverage of one or more datasets over a set of query ranges.
bedtools_coverage(cmd = "--help") R_bedtools_coverage(a, b, hist = FALSE, d = FALSE, counts = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, sortout = FALSE) do_bedtools_coverage(a, b, hist = FALSE, d = FALSE, counts = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, sortout = FALSE)
bedtools_coverage(cmd = "--help") R_bedtools_coverage(a, b, hist = FALSE, d = FALSE, counts = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, sortout = FALSE) do_bedtools_coverage(a, b, hist = FALSE, d = FALSE, counts = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, sortout = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
a |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. The coverage is computed
over these ranges. Use |
b |
Like |
hist |
Report a histogram of coverage for each feature in |
d |
Report the depth at each position in each |
counts |
Only report the count of overlaps, not fraction, etc.
Restricted by |
f |
Minimum overlap required as a fraction of A [default: any overlap]. |
F |
Minimum overlap required as a fraction of B [default: any overlap]. |
r |
Require that the fraction of overlap be reciprocal for |
e |
Require that the minimum fraction be satisfied for |
s |
Force strandedness. That is, only count ranges in |
S |
Require opposite strandedness. That is, count the features in
|
split |
Treat split BAM (i.e., having an ‘N’ CIGAR operation) or BED12 entries as compound ranges with gaps, i.e., as GRangesList objects. |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
header |
Ignored. |
sortout |
Sort the result by position. |
As with all commands, there are three interfaces to the
coverage
command:
bedtools_coverage
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_coverage
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_coverage
Evaluates the result of
R_bedtools_coverage
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
Typically, we compute coverage with
coverage
, but features like fractional
overlap restriction and histograms add (educational) complexity. One
key trick is the [,List,GenomicRanges
method, which lets us
extract coverage vectors for specific regions (see the generated
code).
A language object containing the compiled R code, evaluating to a GRanges object with coverage information. The exact type of information depends on the mode:
default |
Three metadata columns: “count” (the number of overlapping ranges), “covered” (the number of bases covered in the query), “fraction” (the fraction of bases covered). |
d |
Metadata column “coverage” is an RleList with position-level coverage (depth). This is what we typically refer to as coverage in Bioconductor. |
hist |
Metadata column “coverage” is a list of DataFrames. Each
DataFrame contains a histogram of the coverage depth over that
range, with columns “coverage” (the coverage value),
“count” (the number of positions with that coverage),
“len” (the length of the region, all the same) and
“fraction” (the fraction of positions at that coverage).
There is also a “coverage” component on |
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html
coverage-methods for ways to compute coverage.
## Not run: setwd(system.file("unitTests", "data", "coverage", package="HelloRanges")) ## End(Not run) ## default behavior bedtools_coverage("-a a.bed -b b.bed") ## histogram bedtools_coverage("-a a.bed -b b.bed -hist -g test.genome") ## per-position depth bedtools_coverage("-a a.bed -b b.bed -d -g test.genome")
## Not run: setwd(system.file("unitTests", "data", "coverage", package="HelloRanges")) ## End(Not run) ## default behavior bedtools_coverage("-a a.bed -b b.bed") ## histogram bedtools_coverage("-a a.bed -b b.bed -hist -g test.genome") ## per-position depth bedtools_coverage("-a a.bed -b b.bed -d -g test.genome")
Compute flanking regions.
bedtools_flank(cmd = "--help") R_bedtools_flank(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE) do_bedtools_flank(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE)
bedtools_flank(cmd = "--help") R_bedtools_flank(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE) do_bedtools_flank(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
b |
Increase the BED/GFF/VCF range by the same number base pairs in each direction. Integer. |
l |
The number of base pairs to subtract from the start coordinate. Integer. |
r |
The number of base pairs to add to the end coordinate. Integer. |
s |
Define |
pct |
Define |
g |
Genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
header |
Ignored. |
As with all commands, there are three interfaces to the
flank
command:
bedtools_flank
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_flank
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_flank
Evaluates the result of
R_bedtools_flank
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
We compute the flanks with flank
,
although flank
only computes one side at a time, so we may call
it multiple times.
A language object containing the compiled R code, evaluating to a GRanges object.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/flank.html
intra-range-methods for flank
.
## Not run: setwd(system.file("unitTests", "data", "flank", package="HelloRanges")) ## End(Not run) ## 5 on both sides r <- bedtools_flank("-i a.bed -b 5 -g tiny.genome") ## 5 on left side bedtools_flank("-i a.bed -l 5 -r 0 -g tiny.genome") ## define left/right in terms of transcription direction bedtools_flank("-i a.bed -l 5 -r 0 -s -g tiny.genome")
## Not run: setwd(system.file("unitTests", "data", "flank", package="HelloRanges")) ## End(Not run) ## 5 on both sides r <- bedtools_flank("-i a.bed -b 5 -g tiny.genome") ## 5 on left side bedtools_flank("-i a.bed -l 5 -r 0 -g tiny.genome") ## define left/right in terms of transcription direction bedtools_flank("-i a.bed -l 5 -r 0 -s -g tiny.genome")
Compute coverage over the genome. By default, this computes a per-chromosome histogram of the coverage, but options allow for per-position coverage to be returned in different ways.
bedtools_genomecov(cmd = "--help") R_bedtools_genomecov(i, g = NA_character_, d = FALSE, dz = FALSE, bg = FALSE, bga = FALSE, split = FALSE, strand = c("any", "+", "-"), `5` = FALSE, `3` = FALSE, max = NULL, scale = 1, pc = FALSE, fs = NULL) do_bedtools_genomecov(i, g = NA_character_, d = FALSE, dz = FALSE, bg = FALSE, bga = FALSE, split = FALSE, strand = c("any", "+", "-"), `5` = FALSE, `3` = FALSE, max = NULL, scale = 1, pc = FALSE, fs = NULL)
bedtools_genomecov(cmd = "--help") R_bedtools_genomecov(i, g = NA_character_, d = FALSE, dz = FALSE, bg = FALSE, bga = FALSE, split = FALSE, strand = c("any", "+", "-"), `5` = FALSE, `3` = FALSE, max = NULL, scale = 1, pc = FALSE, fs = NULL) do_bedtools_genomecov(i, g = NA_character_, d = FALSE, dz = FALSE, bg = FALSE, bga = FALSE, split = FALSE, strand = c("any", "+", "-"), `5` = FALSE, `3` = FALSE, max = NULL, scale = 1, pc = FALSE, fs = NULL)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
g |
Genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
d |
Report the depth at each genome position. This causes a GPos object to be returned. |
dz |
Same as |
bg |
Like |
bga |
Like |
split |
Treat split BAM (i.e., having an ‘N’ CIGAR operation) or BED12 entries as compound ranges with gaps, i.e., as GRangesList objects. |
strand |
Calculate coverage of intervals from a specific strand. |
5 |
Calculate coverage of 5' positions (instead of entire interval). |
3 |
Calculate coverage of 3' positions (instead of entire interval). |
max |
Combine all positions with a depth >= |
scale |
Scale the coverage by a constant factor. Each coverage value is multiplied by this factor before being reported. Useful for normalizing coverage by, e.g., reads per million (RPM). |
pc |
Calculates coverage of intervals from left point of a pair reads to the right point. Works for BAM files only. |
fs |
Forces to use the given fragment size instead of read length. |
As with all commands, there are three interfaces to the
genomecov
command:
bedtools_genomecov
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_genomecov
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_genomecov
Evaluates the result of
R_bedtools_genomecov
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
We typically compute the coverage with
coverage
. Computing the histogram
requires more work.
A language object containing the compiled R code, evaluating to an object that depends on the mode:
default |
A DataFrame that is a per-chromosome histogram of the coverage that includes the whole genome margin. Includes columns “seqnames” (the chromosome name, or “genome”), “coverage” (the coverage value), “count” (the count of positions covered at that value), “len” (the length of the chromosome/genome), “fraction” (the fraction of bases covered at the value). |
d , dz
|
A GPos object with per-position coverage values. |
bg , bga
|
A GRanges object with coverage runs. |
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
intra-range-methods for genomecov
.
## Not run: setwd(system.file("unitTests", "data", "genomecov", package="HelloRanges")) ## End(Not run) ## get coverage runs as a GRanges bedtools_genomecov("-i y.bed -bg -g test.genome") ## get coverage depth as a GPos, dropping zero values, ignore junctions bedtools_genomecov("-i three_blocks.bam -dz -split") ## custom fragment size bedtools_genomecov("-i chip.bam -bg -fs 100")
## Not run: setwd(system.file("unitTests", "data", "genomecov", package="HelloRanges")) ## End(Not run) ## get coverage runs as a GRanges bedtools_genomecov("-i y.bed -bg -g test.genome") ## get coverage depth as a GPos, dropping zero values, ignore junctions bedtools_genomecov("-i three_blocks.bam -dz -split") ## custom fragment size bedtools_genomecov("-i chip.bam -bg -fs 100")
Query sequence from a FASTA file given a set of ranges, including compound regions like transcripts and junction reads. This assumes the sequence is DNA.
bedtools_getfasta(cmd = "--help") R_bedtools_getfasta(fi, bed, s = FALSE, split = FALSE) do_bedtools_getfasta(fi, bed, s = FALSE, split = FALSE)
bedtools_getfasta(cmd = "--help") R_bedtools_getfasta(fi, bed, s = FALSE, split = FALSE) do_bedtools_getfasta(fi, bed, s = FALSE, split = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
fi |
Path to a FASTA file, or an XStringSet object. |
bed |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges, as the query. Use
|
s |
Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. |
split |
Given BED12 or BAM input, extract and concatenate the sequences from the blocks (e.g., exons). |
As with all commands, there are three interfaces to the
getfasta
command:
bedtools_getfasta
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_getfasta
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_getfasta
Evaluates the result of
R_bedtools_getfasta
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
It is recommended to retrieve reference sequence using a
BSgenome package, either custom or provided by
Bioconductor. Call getSeq
to query for
specific regions of the BSgenome object. If one must access a file,
consider converting it to 2bit or FA (razip) format for indexed
access using import
and its which
argument.
But if one must access a FASTA file, we need to read all of it with
readDNAStringSet
and extract regions using
x[gr]
, where gr
is a GRanges or GRangesList.
A language object containing the compiled R code, evaluating to a DNAStringSet object.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
getSeq, the primary sequence query interface.
## Not run: setwd(system.file("unitTests", "data", "getfasta", package="HelloRanges")) ## End(Not run) ## simple query bedtools_getfasta("--fi t.fa -bed blocks.bed") ## get spliced transcript/read sequence bedtools_getfasta("--fi t.fa -bed blocks.bed -split")
## Not run: setwd(system.file("unitTests", "data", "getfasta", package="HelloRanges")) ## End(Not run) ## simple query bedtools_getfasta("--fi t.fa -bed blocks.bed") ## get spliced transcript/read sequence bedtools_getfasta("--fi t.fa -bed blocks.bed -split")
Query sequence from a FASTA file given a set of ranges, including compound regions like transcripts and junction reads. This assumes the sequence is DNA.
bedtools_groupby(cmd = "--help") R_bedtools_groupby(i, g = 1:3, c, o = "sum", delim=",") do_bedtools_groupby(i, g = 1:3, c, o = "sum", delim=",")
bedtools_groupby(cmd = "--help") R_bedtools_groupby(i, g = 1:3, c, o = "sum", delim=",") do_bedtools_groupby(i, g = 1:3, c, o = "sum", delim=",")
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
g |
Column index(es) for grouping the input. Columns may be comma-separated. By default, the grouping is by range. |
c |
Specify columns (by integer index) from the input file to operate
upon (see |
o |
Specify the operations (by name) that should be applied to the
columns indicated in |
delim |
Delimiter character used to collapse strings. |
As with all commands, there are three interfaces to the
groupby
command:
bedtools_groupby
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_groupby
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_groupby
Evaluates the result of
R_bedtools_groupby
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
The workhorse for aggregation in R is
aggregate
and we have extended its
interface to make it more convenient. See
aggregate
for details.
The following operations are supported (with R translation):
sum(X)
min(X)
max(X)
min(abs(X))
max(abs(X))
mean(X)
median(X)
distmode(X)
distmode(X, anti=TRUE)
unstrsplit(X, delim)
unstrsplit(unique(X), delim)
lengths(X)
lengths(unique(X))
sd(X)
freq
table(X)
first
drop(heads(X, 1L))
last
drop(tails(X, 1L))
For the sake of simplicity, and because the use cases are not clear, we do not support aggregation of every column. Here are some of the restrictions:
No support for the last column of GFF (the ragged list of attributes).
No support for the INFO, FORMAT and GENO fields of VCF.
No support for the FLAG field of BAM (bedtools
does
not support this either).
A language object containing the compiled R code, generally evaluating to a DataFrame, with a column for each grouping variable and each summarized variable. As a special case, if there are no grouping variables specified, then the grouping is by range, and an aggregated GRanges is returned.
We admit that using column subscripts for c
makes code hard
to read. All the more reason to just write R code.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/groupby.html
aggregate-methods
for general aggregation.
## Not run: setwd(system.file("unitTests", "data", "groupby", package="HelloRanges")) ## End(Not run) ## aggregation by range bedtools_groupby("-i values3.header.bed -c 5") ## average variant qualities by chromosome and reference base ## Not run: indexTabix(bgzip("a_vcfSVtest.vcf", overwrite=TRUE), "vcf") ## End(Not run) bedtools_groupby("-i a_vcfSVtest.vcf.bgz -g 1,4 -c 6 -o mean")
## Not run: setwd(system.file("unitTests", "data", "groupby", package="HelloRanges")) ## End(Not run) ## aggregation by range bedtools_groupby("-i values3.header.bed -c 5") ## average variant qualities by chromosome and reference base ## Not run: indexTabix(bgzip("a_vcfSVtest.vcf", overwrite=TRUE), "vcf") ## End(Not run) bedtools_groupby("-i a_vcfSVtest.vcf.bgz -g 1,4 -c 6 -o mean")
Finds and/or counts the intersections between two ranged datasets.
bedtools_intersect(cmd = "--help") R_bedtools_intersect(a, b, ubam = FALSE, bed = FALSE, wa = FALSE, wb = FALSE, loj = FALSE, wo = FALSE, wao = FALSE, u = FALSE, c = FALSE, v = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, names = NULL, filenames = FALSE, sortout = FALSE) do_bedtools_intersect(a, b, ubam = FALSE, bed = FALSE, wa = FALSE, wb = FALSE, loj = FALSE, wo = FALSE, wao = FALSE, u = FALSE, c = FALSE, v = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, names = NULL, filenames = FALSE, sortout = FALSE)
bedtools_intersect(cmd = "--help") R_bedtools_intersect(a, b, ubam = FALSE, bed = FALSE, wa = FALSE, wb = FALSE, loj = FALSE, wo = FALSE, wao = FALSE, u = FALSE, c = FALSE, v = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, names = NULL, filenames = FALSE, sortout = FALSE) do_bedtools_intersect(a, b, ubam = FALSE, bed = FALSE, wa = FALSE, wb = FALSE, loj = FALSE, wo = FALSE, wao = FALSE, u = FALSE, c = FALSE, v = FALSE, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE, g = NA_character_, header = FALSE, names = NULL, filenames = FALSE, sortout = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
a |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Each feature in |
b |
Like |
ubam |
Not supported yet. Write uncompressed BAM output. The default is write compressed BAM output. |
bed |
When using BAM input, return a GRanges (with a “blocks” column) instead of a GAlignments. For VCF input, return a VRanges instead of a VCF object. |
wa |
Return the original entry in |
wb |
Return the original entry in |
loj |
Perform a ‘left outer join’. That is, for each feature in
|
wo |
Return the number of base pairs of
overlap between the two features as the “overlap_width”
metadata column. Implies |
wao |
Like |
u |
Like |
c |
Like |
v |
Like |
f |
Minimum overlap required as a fraction of |
F |
Minimum overlap required as a fraction of |
r |
Require that the fraction of overlap be reciprocal for |
e |
Require that the minimum fraction be satisfied for |
s |
Require same strandedness. That is, find the intersect feature in
|
S |
Require opposite strandedness. That is, find the intersect feature in
|
split |
Treat split BAM (i.e., having an ‘N’ CIGAR operation) or BED12 entries as compound ranges with gaps, i.e., as GRangesList objects. |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
header |
Ignored. |
names |
When using multiple databases, provide an alias for each to use instead of their integer index. If a single string, can be comma-separated. |
filenames |
When using multiple databases, use their complete filename instead of their integer index. |
sortout |
Sort the result by genomic coordinate. |
As with all commands, there are three interfaces to the
intersect
command:
bedtools_intersect
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_intersect
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_intersect
Evaluates the result of
R_bedtools_intersect
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
This is by far the most complex bedtools
command, and it
offers a dizzying list of parameters, many of which are redundant or
mutually exclusive. The complexity of the generated code is highest
when using the fractional restriction feature, since no such support
exists in the GenomicRanges overlap routines.
A language object containing the compiled R code, evaluating to a
ranges object, the exact type of which depends on the arguments. If
both wa
and wb
are TRUE
, return a Pairs object
with the original, matched up ranges, possibly with metadata
columns. By default, the return value is a GAlignments for BAM input,
a VCF object for VCF input, or a GRanges for any other type of input.
If bed
is TRUE
, BAM input is converted to a GRanges,
containing a “blocks” column (encoding the junctions) if the
input is BAM. If the input is VCF, bed=TRUE
converts the input
to a VRanges.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
setops-methods for set operations including intersect, findOverlaps-methods for different ways to detect overlaps.
## Not run: setwd(system.file("unitTests", "data", "intersect", package="HelloRanges")) ## End(Not run) ## return intersecting ranges bedtools_intersect("-a a.bed -b a.bed") ## add count bedtools_intersect("-a a.bed -b b.bed -c") ## restrict by strand and fraction of overlap bedtools_intersect("-a a.bed -b b.bed -c -s -f 0.1") ## return original 'a' ranges bedtools_intersect("-a a.bed -b b.bed -wa") ## return both 'a' and 'b' ranges, along with overlap widths bedtools_intersect("-a a.bed -b b.bed -wo") ## same as above, except left outer join bedtools_intersect("-a a.bed -b b.bed -wao") ## consider read junction structure bedtools_intersect("-a three_blocks.bam -b three_blocks_nomatch.bed -split")
## Not run: setwd(system.file("unitTests", "data", "intersect", package="HelloRanges")) ## End(Not run) ## return intersecting ranges bedtools_intersect("-a a.bed -b a.bed") ## add count bedtools_intersect("-a a.bed -b b.bed -c") ## restrict by strand and fraction of overlap bedtools_intersect("-a a.bed -b b.bed -c -s -f 0.1") ## return original 'a' ranges bedtools_intersect("-a a.bed -b b.bed -wa") ## return both 'a' and 'b' ranges, along with overlap widths bedtools_intersect("-a a.bed -b b.bed -wo") ## same as above, except left outer join bedtools_intersect("-a a.bed -b b.bed -wao") ## consider read junction structure bedtools_intersect("-a three_blocks.bam -b three_blocks_nomatch.bed -split")
Compare two sets of genomic regions using the Jaccard statistic, defined as the total width of the intersection, divided by the total width of the union.
bedtools_jaccard(cmd = "--help") R_bedtools_jaccard(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE) do_bedtools_jaccard(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE)
bedtools_jaccard(cmd = "--help") R_bedtools_jaccard(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE) do_bedtools_jaccard(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, split = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
a |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
b |
Like |
f |
Minimum overlap required as a fraction of |
F |
Minimum overlap required as a fraction of |
r |
Require that the fraction of overlap be reciprocal for |
e |
Require that the minimum fraction be satisfied for |
s |
Require same strandedness. That is, find the jaccard feature in
|
S |
Require opposite strandedness. That is, find the jaccard feature in
|
split |
Treat split BAM (i.e., having an ‘N’ CIGAR operation) or BED12 entries as compound ranges with gaps, i.e., as GRangesList objects. |
As with all commands, there are three interfaces to the
jaccard
command:
bedtools_jaccard
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_jaccard
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_jaccard
Evaluates the result of
R_bedtools_jaccard
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
This is mostly just intersect
and
union
, except when fractional overlap
restrictions are involved.
A language object containing the compiled R code, evaluating to a a DataFrame with four columns:
intersection |
total width of intersection |
union |
total width of union |
jaccard |
the jaccard statistic |
n_intersections |
the number of ranges representing the intersection |
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/jaccard.html
setops-methods for set operations including intersect and union.
## Not run: setwd(system.file("unitTests", "data", "jaccard", package="HelloRanges")) ## End(Not run) ## basic bedtools_jaccard("-a a.bed -b a.bed") ## excluding the gaps in compound ranges bedtools_jaccard("-a three_blocks_match.bed -b e.bed -split") ## strand and fractional overlap restriction bedtools_jaccard("-a aMixedStrands.bed -b bMixedStrands.bed -s -f 0.8")
## Not run: setwd(system.file("unitTests", "data", "jaccard", package="HelloRanges")) ## End(Not run) ## basic bedtools_jaccard("-a a.bed -b a.bed") ## excluding the gaps in compound ranges bedtools_jaccard("-a three_blocks_match.bed -b e.bed -split") ## strand and fractional overlap restriction bedtools_jaccard("-a aMixedStrands.bed -b bMixedStrands.bed -s -f 0.8")
Generate a partitioning/tiling or set of sliding windows over the genome or a set of ranges.
bedtools_makewindows(cmd = "--help") R_bedtools_makewindows(b, g = NA_character_, w, s, n) do_bedtools_makewindows(b, g = NA_character_, w, s, n)
bedtools_makewindows(cmd = "--help") R_bedtools_makewindows(b, g = NA_character_, w, s, n) do_bedtools_makewindows(b, g = NA_character_, w, s, n)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
b |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
g |
A genome file, identifier or Seqinfo object that defines the order
and size of the sequences. Specifying this generates windows over
the genome. Exclusive with |
w |
Window size, exclusive with |
s |
Step size (generates sliding windows). |
n |
Number of windows, exclusive with |
As with all commands, there are three interfaces to the
makewindows
command:
bedtools_makewindows
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_makewindows
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_makewindows
Evaluates the result of
R_bedtools_makewindows
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
We view the generation of a partitioning (or tiling) as a distinct use
case from the generation of sliding windows. The two use cases
correspond to the tile
and
slidingWindows
functions, respectively.
A language object containing the compiled R code, evaluating to a a GRangesList containing the windows for each range (or chromosome).
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/makewindows.html
tile-methods for generating windows.
## Not run: setwd(system.file("unitTests", "data", "makewindows", package="HelloRanges")) ## End(Not run) ## tiles of width 5000 bedtools_makewindows("-b input.bed -w 5000") ## sliding windows, 5kb wide, every 2kb bedtools_makewindows("-b input.bed -w 5000 -s 2000") ## 3 tiles in each range bedtools_makewindows("-b input.bed -n 3") ## 3 tiles for each chromosome of the genome bedtools_makewindows("-g test.genome -n 3")
## Not run: setwd(system.file("unitTests", "data", "makewindows", package="HelloRanges")) ## End(Not run) ## tiles of width 5000 bedtools_makewindows("-b input.bed -w 5000") ## sliding windows, 5kb wide, every 2kb bedtools_makewindows("-b input.bed -w 5000 -s 2000") ## 3 tiles in each range bedtools_makewindows("-b input.bed -n 3") ## 3 tiles for each chromosome of the genome bedtools_makewindows("-g test.genome -n 3")
Group ranges by overlap with query ranges and aggregate. By default, the scores are summed.
bedtools_map(cmd = "--help") R_bedtools_map(a, b, c = "5", o = "sum", f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, header = FALSE, split = FALSE, g = NA_character_, delim=",") do_bedtools_map(a, b, c = "5", o = "sum", f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, header = FALSE, split = FALSE, g = NA_character_, delim=",")
bedtools_map(cmd = "--help") R_bedtools_map(a, b, c = "5", o = "sum", f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, header = FALSE, split = FALSE, g = NA_character_, delim=",") do_bedtools_map(a, b, c = "5", o = "sum", f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, header = FALSE, split = FALSE, g = NA_character_, delim=",")
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
a |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
b |
Like |
c |
Specify columns (by integer index) from the input file to operate
upon (see |
o |
Specify the operations (by name) that should be applied to the
columns indicated in |
f |
Minimum overlap required as a fraction of |
F |
Minimum overlap required as a fraction of |
r |
Require that the fraction of overlap be reciprocal for |
e |
Require that the minimum fraction be satisfied for |
s |
Require same strandedness. That is, find the jaccard feature in
|
S |
Require opposite strandedness. That is, find the jaccard feature in
|
header |
Ignored. |
split |
Treat split BAM (i.e., having an ‘N’ CIGAR operation) or BED12 entries as compound ranges with gaps, i.e., as GRangesList objects. |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
delim |
Delimiter character used to collapse strings. |
As with all commands, there are three interfaces to the
map
command:
bedtools_map
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_map
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_map
Evaluates the result of
R_bedtools_map
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
Computing overlaps with findOverlaps
generates a Hits object, which we can pass directly
to aggregate
to aggregate the subject
features that overlap the same range in the query.
There are several commands in the bedtools
suite that might
be approximately implemented by passing multiple files to b
and
specifying the aggregate expression table(b)
. That counts how
many ranges from each database/sample overlap a given query. The
covered commands are: bedtools annotate -counts
,
bedtools multicov
and bedtools tag
.
A language object containing the compiled R code, evaluating to a
DataFrame with a “grouping” column corresponding to
as(hits, "List")
, and a column for each summary.
We do not support the bedtools
null
argument, because
it seems more sensible to just let R decide on the value of statistics
when a group is empty.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/map.html
findOverlaps-methods for finding hits, Hits-class for manipulating them, aggregate-methods for aggregating them.
## Not run: setwd(system.file("unitTests", "data", "map", package="HelloRanges")) ## End(Not run) ## default behavior bedtools_map("-a ivls.bed -b values.bed") ## take the mode of the scores bedtools_map("-a ivls.bed -b values.bed -o mode") ## collapse the chromosome names bedtools_map("-a ivls.bed -b test.gff2 -c 1 -o collapse") ## collapse the names, restricted by fractional overlap bedtools_map("-a ivls2.bed -b values5.bed -c 4 -o collapse -f 0.7")
## Not run: setwd(system.file("unitTests", "data", "map", package="HelloRanges")) ## End(Not run) ## default behavior bedtools_map("-a ivls.bed -b values.bed") ## take the mode of the scores bedtools_map("-a ivls.bed -b values.bed -o mode") ## collapse the chromosome names bedtools_map("-a ivls.bed -b test.gff2 -c 1 -o collapse") ## collapse the names, restricted by fractional overlap bedtools_map("-a ivls2.bed -b values5.bed -c 4 -o collapse -f 0.7")
Collapse overlapping and adjacent ranges into a single range, i.e.,
reduce
the ranges. Then, group the original ranges by
reduced range and aggregate. By default, the scores are summed.
bedtools_merge(cmd = "--help") R_bedtools_merge(i, s = FALSE, S = c("any", "+", "-"), d = 0L, c = NULL, o = "sum", delim = ",") do_bedtools_merge(i, s = FALSE, S = c("any", "+", "-"), d = 0L, c = NULL, o = "sum", delim = ",")
bedtools_merge(cmd = "--help") R_bedtools_merge(i, s = FALSE, S = c("any", "+", "-"), d = 0L, c = NULL, o = "sum", delim = ",") do_bedtools_merge(i, s = FALSE, S = c("any", "+", "-"), d = 0L, c = NULL, o = "sum", delim = ",")
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
s |
Require same strandedness. That is, find the jaccard feature in
|
S |
Force merge for one specific strand only. Follow with + or - to force merge from only the forward or reverse strand, respectively. By default, merging is done without respect to strand. |
d |
Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged. |
c |
Specify columns (by integer index) from the input file to operate
upon (see |
o |
Specify the operations (by name) that should be applied to the
columns indicated in |
delim |
Delimiter character used to collapse strings. |
As with all commands, there are three interfaces to the
merge
command:
bedtools_merge
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_merge
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_merge
Evaluates the result of
R_bedtools_merge
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
The workhorse for reduction is
reduce
. Passing with.revmap=TRUE
to reduce
causes it to return a list of integers, which can be
passed directly to aggregate
to aggregate the
original ranges.
Since the grouping information is preserved in the result, this
function serves as a proxy for bedtools cluster
.
A language object containing the compiled R code, evaluating to a
DataFrame with a “grouping” column corresponding to
as(hits, "List")
, and a column for each summary.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/merge.html
bedtools_groupby for more details on bedtools-style aggregation, reduce for merging, aggregate-methods for aggregating.
## Not run: setwd(system.file("unitTests", "data", "merge", package="HelloRanges")) ## End(Not run) ## default behavior, sum the score bedtools_merge("-i a.bed") ## count the seqnames bedtools_merge("-i a.bed -c 1 -o count") ## collapse the names using "|" as the delimiter bedtools_merge("-i a.names.bed -delim \"|\" -c 4 -o collapse") ## collapse the names and sum the scores bedtools_merge("-i a.full.bed -c 4,5 -o collapse,sum") ## count and sum the scores bedtools_merge("-i a.full.bed -c 5 -o count,sum") ## only merge the positive strand features bedtools_merge("-i a.full.bed -S +")
## Not run: setwd(system.file("unitTests", "data", "merge", package="HelloRanges")) ## End(Not run) ## default behavior, sum the score bedtools_merge("-i a.bed") ## count the seqnames bedtools_merge("-i a.bed -c 1 -o count") ## collapse the names using "|" as the delimiter bedtools_merge("-i a.names.bed -delim \"|\" -c 4 -o collapse") ## collapse the names and sum the scores bedtools_merge("-i a.full.bed -c 4,5 -o collapse,sum") ## count and sum the scores bedtools_merge("-i a.full.bed -c 5 -o count,sum") ## only merge the positive strand features bedtools_merge("-i a.full.bed -S +")
Summarize the ranges according to disjoin
and annotate each disjoint range with the samples that overlap the
range.
bedtools_multiinter(cmd = "--help") R_bedtools_multiinter(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE) do_bedtools_multiinter(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE)
bedtools_multiinter(cmd = "--help") R_bedtools_multiinter(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE) do_bedtools_multiinter(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Paths to BAM/BED/GFF/VCF/etc files (vector or comma-separated), or a list of objects. |
header |
Ignored. |
names |
Provide an alias for each to use for each |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
empty |
Report empty regions (i.e., regions not covered in any of the
files). This essentially yields a partitioning of the genome (and
thus requires |
As with all commands, there are three interfaces to the
multiinter
command:
bedtools_multiinter
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_multiinter
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_multiinter
Evaluates the result of
R_bedtools_multiinter
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
The workhorse is
disjoin
. Passing with.revmap=TRUE
to disjoin
causes it to return a list of integers, which we use
to extract the sample identifiers. The empty
case requires a
bit more code, because we have to combine the disjoint ranges with the
gaps.
A language object containing the compiled R code, evaluating to a GRanges with a column “i” indicating the sample memberships.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/multiinter.html
disjoin for forming disjoint ranges.
## Not run: setwd(system.file("unitTests", "data", "multiinter", package="HelloRanges")) ## End(Not run) ## default behavior bedtools_multiinter("-i a.bed,b.bed,c.bed") ## custom names bedtools_multiinter("-i a.bed,b.bed,c.bed -names A,B,C") ## include empty regions, i.e., partition the genome bedtools_multiinter("-i a.bed,b.bed,c.bed -names A,B,C -empty -g test.genome")
## Not run: setwd(system.file("unitTests", "data", "multiinter", package="HelloRanges")) ## End(Not run) ## default behavior bedtools_multiinter("-i a.bed,b.bed,c.bed") ## custom names bedtools_multiinter("-i a.bed,b.bed,c.bed -names A,B,C") ## include empty regions, i.e., partition the genome bedtools_multiinter("-i a.bed,b.bed,c.bed -names A,B,C -empty -g test.genome")
Summarize DNA sequences over the specified ranges.
bedtools_nuc(cmd = "--help") R_bedtools_nuc(fi, bed, s = FALSE, pattern = NULL, fullHeader = FALSE) do_bedtools_nuc(fi, bed, s = FALSE, pattern = NULL, fullHeader = FALSE)
bedtools_nuc(cmd = "--help") R_bedtools_nuc(fi, bed, s = FALSE, pattern = NULL, fullHeader = FALSE) do_bedtools_nuc(fi, bed, s = FALSE, pattern = NULL, fullHeader = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
fi |
Path to a FASTA file, or an XStringSet. |
bed |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges, as the query. Use
|
s |
Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. |
pattern |
Optional sequence pattern to count in each subsequence. |
fullHeader |
Use the full FASTA header as the names. By default, use just the first word. |
As with all commands, there are three interfaces to the
nuc
command:
bedtools_nuc
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_nuc
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_nuc
Evaluates the result of
R_bedtools_nuc
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
Computes AT/GC percentage and counts each type of base. Relies on
Biostrings utilities like letterFrequency
and alphabetFrequency
. The counting of
pattern
occurrences uses
vcountPattern
.
A language object containing the compiled R code, evaluating to a
DataFrame with summary statistics including the AC and GT
percentage, and the counts of each type of base. Also includes the
count of pattern
, if specified.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/nuc.html
letterFrequency for summarizing sequences, matchPattern for pattern matching.
## Not run: setwd(system.file("unitTests", "data", "nuc", package="HelloRanges")) ## End(Not run) ## default behavior, note the two dashes in '--fi' bedtools_nuc("--fi test.fasta -bed a.bed") ## with pattern counting bedtools_nuc("--fi test.fasta -bed a.bed -pattern ATA")
## Not run: setwd(system.file("unitTests", "data", "nuc", package="HelloRanges")) ## End(Not run) ## default behavior, note the two dashes in '--fi' bedtools_nuc("--fi test.fasta -bed a.bed") ## with pattern counting bedtools_nuc("--fi test.fasta -bed a.bed -pattern ATA")
Compute shifting regions.
bedtools_shift(cmd = "--help") R_bedtools_shift(i, s = 0, m = 0, p = 0, pct = FALSE, g = NULL, header = FALSE) do_bedtools_shift(i, s = 0, m = 0, p = 0, pct = FALSE, g = NULL, header = FALSE)
bedtools_shift(cmd = "--help") R_bedtools_shift(i, s = 0, m = 0, p = 0, pct = FALSE, g = NULL, header = FALSE) do_bedtools_shift(i, s = 0, m = 0, p = 0, pct = FALSE, g = NULL, header = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
s |
Amount to shift all features. |
m |
Amount to shift negative strand features. |
p |
Amount to shift positive strand features. |
pct |
Define |
g |
Genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
header |
Ignored. |
As with all commands, there are three interfaces to the
shift
command:
bedtools_shift
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_shift
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_shift
Evaluates the result of
R_bedtools_shift
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
This is a fairly straight-forward application of
shift
.
A language object containing the compiled R code, evaluating to a GRanges, or similar, object. In principle, this is an endomorphism.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/shift.html
intra-range-methods for shift
.
## Not run: setwd(system.file("unitTests", "data", "shift", package="HelloRanges")) ## End(Not run) ## shift all ranges by 5 bedtools_shift("-i a.bed -s 5 -g tiny.genome") ## shift only the negative strand features by 5 bedtools_shift("-i a.bed -p 0 -m 5 -g tiny.genome")
## Not run: setwd(system.file("unitTests", "data", "shift", package="HelloRanges")) ## End(Not run) ## shift all ranges by 5 bedtools_shift("-i a.bed -s 5 -g tiny.genome") ## shift only the negative strand features by 5 bedtools_shift("-i a.bed -p 0 -m 5 -g tiny.genome")
Widen ranges on the left and/or right side.
bedtools_slop(cmd = "--help") R_bedtools_slop(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE) do_bedtools_slop(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE)
bedtools_slop(cmd = "--help") R_bedtools_slop(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE) do_bedtools_slop(i, b = 0, l = 0, r = 0, s = FALSE, pct = FALSE, g = NULL, header = FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Use |
b |
Widen the same number base pairs in each direction. |
l |
The number of base pairs to subtract from the start coordinate. |
r |
The number of base pairs to add to the end coordinate. |
s |
Define |
pct |
Define |
g |
Genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
header |
Ignored. |
As with all commands, there are three interfaces to the
slop
command:
bedtools_slop
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_slop
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_slop
Evaluates the result of
R_bedtools_slop
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
This is a fairly straight-forward application of
resize
and the +
operator on
GRanges.
A language object containing the compiled R code, evaluating to a GRanges, or similar, object. In principle, this is an endomorphism.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/slop.html
intra-range-methods for resize
.
## Not run: setwd(system.file("unitTests", "data", "slop", package="HelloRanges")) ## End(Not run) ## widen on both ends bedtools_slop("-i a.bed -b 5 -g tiny.genome") ## widen only on the left end bedtools_slop("-i a.bed -l 5 -r 0 -g tiny.genome")
## Not run: setwd(system.file("unitTests", "data", "slop", package="HelloRanges")) ## End(Not run) ## widen on both ends bedtools_slop("-i a.bed -b 5 -g tiny.genome") ## widen only on the left end bedtools_slop("-i a.bed -l 5 -r 0 -g tiny.genome")
Subtracts one set of ranges from another, either by position or range.
bedtools_subtract(cmd = "--help") R_bedtools_subtract(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, A = FALSE, N = FALSE, g = NA_character_) do_bedtools_subtract(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, A = FALSE, N = FALSE, g = NA_character_)
bedtools_subtract(cmd = "--help") R_bedtools_subtract(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, A = FALSE, N = FALSE, g = NA_character_) do_bedtools_subtract(a, b, f = 1e-09, F = 1e-09, r = FALSE, e = FALSE, s = FALSE, S = FALSE, A = FALSE, N = FALSE, g = NA_character_)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
a |
Path to a BAM/BED/GFF/VCF/etc file, a BED stream, a file object, or
a ranged data structure, such as a GRanges. Each feature in |
b |
Like |
f |
Minimum overlap required as a fraction of |
F |
Minimum overlap required as a fraction of |
r |
Require that the fraction of overlap be reciprocal for |
e |
Require that the minimum fraction be satisfied for |
s |
Require same strandedness. That is, find the subtract feature in
|
S |
Require opposite strandedness. That is, find the subtract feature in
|
A |
Remove entire feature if any overlap. If a feature in |
N |
Same as |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
As with all commands, there are three interfaces to the
subtract
command:
bedtools_subtract
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_subtract
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_subtract
Evaluates the result of
R_bedtools_subtract
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
We typically subtract sets of ranges using
setdiff
; however, that will not work
here, because we cannot merge the ranges in a
.
The algorithm has two modes: by position (where ranges are clipped)
and by range (where ranges are discarded entirely). The position mode
is the default. We find overlaps, optionally restrict them, and for
each range in a
, we subtract all of the qualifying
intersections in b
.
When A
or N
are TRUE
, we use the second mode. In
the simplest case, that is just
subsetByOverlaps
with invert=TRUE
,
but fractional overlap restrictions and N
make that more
complicated.
A language object containing the compiled R code, evaluating to a
GRanges object, except when A
or N
are TRUE
,
where the value might be a GRanges, GAlignments or VCF object,
depending on the input.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/subtract.html
setops-methods for set operations including setdiff, findOverlaps-methods for different ways to detect overlaps.
## Not run: setwd(system.file("unitTests", "data", "subtract", package="HelloRanges")) ## End(Not run) ## simple case, position-wise subtraction bedtools_subtract("-a a.bed -b b.bed") ## fractional overlap restriction bedtools_subtract("-a a.bed -b b.bed -f 0.5") ## range-wise subtraction bedtools_subtract("-a a.bed -b b.bed -A -f 0.5")
## Not run: setwd(system.file("unitTests", "data", "subtract", package="HelloRanges")) ## End(Not run) ## simple case, position-wise subtraction bedtools_subtract("-a a.bed -b b.bed") ## fractional overlap restriction bedtools_subtract("-a a.bed -b b.bed -f 0.5") ## range-wise subtraction bedtools_subtract("-a a.bed -b b.bed -A -f 0.5")
Summarize the ranges according to disjoin
and construct a matrix of scores (disjoint range by
sample/file). Empty cells are filled with NA
.
bedtools_unionbedg(cmd = "--help") R_bedtools_unionbedg(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE) do_bedtools_unionbedg(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE)
bedtools_unionbedg(cmd = "--help") R_bedtools_unionbedg(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE) do_bedtools_unionbedg(i, header=FALSE, names=NULL, g=NA_character_, empty=FALSE)
cmd |
String of bedtools command line arguments, as they would be entered at the shell. There are a few incompatibilities between the docopt parser and the bedtools style. See argument parsing. |
i |
Paths to BAM/BED/GFF/VCF/etc files (vector or comma-separated), or a list of objects. |
header |
Ignored. |
names |
Provide an alias for each to use for each |
g |
A genome file, identifier or Seqinfo object that defines the order and size of the sequences. |
empty |
Report empty regions (i.e., regions not covered in any of the
files). This essentially yields a partitioning of the genome (and
thus requires |
As with all commands, there are three interfaces to the
unionbedg
command:
bedtools_unionbedg
Parses the bedtools command line and compiles it to the equivalent R code.
R_bedtools_unionbedg
Accepts R arguments corresponding to the command line arguments and compiles the equivalent R code.
do_bedtools_unionbedg
Evaluates the result of
R_bedtools_unionbedg
. Recommended only for
demonstration and testing. It is best to integrate the compiled
code into an R script, after studying it.
This is essentially the same operation as
bedtools_multiinter
, except we build a score matrix
and embed it into a SummarizedExperiment. This is a bit tricky and
relies on the as.matrix,AtomicList-method
coercion.
A language object containing the compiled R code, evaluating to a RangedSummarizedExperiment with an assay called “score”.
Michael Lawrence
http://bedtools.readthedocs.io/en/latest/content/tools/unionbedg.html
disjoin for forming disjoint ranges, RangedSummarizedExperiment-class for SummarizedExperiment objects.
## Not run: setwd(system.file("unitTests", "data", "unionbedg", package="HelloRanges")) ## End(Not run) ## combine three samples bedtools_unionbedg("-i a.bedGraph,b.bedGraph,c.bedGraph -names A,B,C") ## include empty ranges (filled with NAs) bedtools_unionbedg("-i a.bedGraph,b.bedGraph,c.bedGraph -names A,B,C -empty -g test.genome")
## Not run: setwd(system.file("unitTests", "data", "unionbedg", package="HelloRanges")) ## End(Not run) ## combine three samples bedtools_unionbedg("-i a.bedGraph,b.bedGraph,c.bedGraph -names A,B,C") ## include empty ranges (filled with NAs) bedtools_unionbedg("-i a.bedGraph,b.bedGraph,c.bedGraph -names A,B,C -empty -g test.genome")
Computes the mode (and “antimode”) of a distribution. It is not
clear whether this is a generally useful statistic, but
bedtools
defined it, so we did for completeness.
distmode(x, anti = FALSE)
distmode(x, anti = FALSE)
x |
The vector for which the mode is computed. |
anti |
Whether to return the value with the least representation, instead of the highest. |
There are methods for List subclasses and ordinary vectors/factors. The List methods are useful for aggregation.
The value that is the most (or least) prevalent in the x
.
Michael Lawrence
Not to be confused with the data mode
of a vector.
bedtools_map
for an example in the context of aggregation.
distmode(c(1L, 2L, 1L))
distmode(c(1L, 2L, 1L))
Creates a Pairs from two vectors, optionally via a left outer join.
pair(x, y, ...) ## S4 method for signature 'Vector,Vector' pair(x, y, by = findMatches(x, y), all.x = FALSE, NA.VALUE = y[NA])
pair(x, y, ...) ## S4 method for signature 'Vector,Vector' pair(x, y, by = findMatches(x, y), all.x = FALSE, NA.VALUE = y[NA])
x |
The “first” vector. |
y |
The “second” vector. |
by |
The |
all.x |
If |
NA.VALUE |
The value to fill holes in |
... |
Arguments for methods. |
This might move to S4Vectors at some point. It is distinct from
simple Pairs
construction, because it
performs transformations like a left outer join. More options might be
added in the future.
For GRanges and other ranged objects, pair
adds “.” to
the seqlevels, because that is the seqname of the missing GRanges.
A Pairs object
Michael Lawrence
Pairs-class
, created by this function.
bedtools_intersect
, whose loj
argument motivated
the creation of this function.