Title: | An R package for the creation of complex genomic profile plots |
---|---|
Description: | recoup calculates and plots signal profiles created from short sequence reads derived from Next Generation Sequencing technologies. The profiles provided are either sumarized curve profiles or heatmap profiles. Currently, recoup supports genomic profile plots for reads derived from ChIP-Seq and RNA-Seq experiments. The package uses ggplot2 and ComplexHeatmap graphics facilities for curve and heatmap coverage profiles respectively. |
Authors: | Panagiotis Moulos <[email protected]> |
Maintainer: | Panagiotis Moulos <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.35.0 |
Built: | 2024-12-17 03:29:02 UTC |
Source: | https://github.com/bioc/recoup |
This function creates a local annotation database to be used with recoup so as to avoid long time on the fly annotation downloads and formatting.
buildAnnotationDatabase(organisms, sources, db = file.path(system.file(package = "recoup"), "annotation.sqlite"), forceDownload = TRUE, rc = NULL)
buildAnnotationDatabase(organisms, sources, db = file.path(system.file(package = "recoup"), "annotation.sqlite"), forceDownload = TRUE, rc = NULL)
organisms |
a list of organisms and versions
for which to download and build annotations. Check
the main |
sources |
a character vector of public sources
from which to download and build annotations. Check
the main |
db |
a valid path (accessible at least by the
current user) where the annotation database will be
set up. It defaults to
|
forceDownload |
by default,
|
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
Regarding the organisms
argument, it is a list
with specific format which instructs
buildAnnotationDatabase
on which organisms and
versions to download from the respective sources. Such
a list may have the format:
organisms=list(hg19=75, mm9=67, mm10=96:97)
This is explained as follows:
A database comprising the human genome versions
hg19
and the mouse genome versions
mm9, mm10
will be constructed.
If "ensembl"
is in sources
,
version 75 is downloaded for hg19
and versions
67, 96, 97
for mm9, mm10
.
If "ucsc"
or "refseq"
are in
sources
, the latest versions are downloaded
and marked by the download date. As UCSC and RefSeq
versions are not accessible in the same way as
Ensembl, this procedure cannot always be replicated.
organisms
can also be a character vector with organism
names/versions (e.g. organisms = c("mm10","hg19")
),
then the latest versions are downloaded in the case of
Ensembl.
Regarding db
, this controls the location of the
installation database. If the default is used, then there is
no need to provide the local database path to any function
that uses the database (e.g. the main recoup
).
Otherwise, the user will either have to provide this each
time, or the annotation will have to be downloaded and used
on-the-fly.
The function does not return anything. Only the SQLite database is created or updated.
Panagiotis Moulos
# Build a test database with one genome myDb <- file.path(tempdir(),"testann.sqlite") organisms <- list(mm10=96) sources <- "ensembl" # If the example is not running in a multicore system, rc is ignored buildAnnotationDatabase(organisms,sources,db=myDb,rc=0.5) # A more complete case, don't run as example # Since we are using Ensembl, we can also ask for a version #organisms <- list( # mm9=67, # mm10=96:97, # hg19=75, # hg38=96:97 #) #sources <- c("ensembl", "refseq") ## Build on the default location (depending on package location, it may ## require root/sudo) #buildAnnotationDatabase(organisms,sources) ## Build on an alternative location #myDb <- file.path(path.expand("~"),"my_ann.sqlite") #buildAnnotationDatabase(organisms,sources,db=myDb)
# Build a test database with one genome myDb <- file.path(tempdir(),"testann.sqlite") organisms <- list(mm10=96) sources <- "ensembl" # If the example is not running in a multicore system, rc is ignored buildAnnotationDatabase(organisms,sources,db=myDb,rc=0.5) # A more complete case, don't run as example # Since we are using Ensembl, we can also ask for a version #organisms <- list( # mm9=67, # mm10=96:97, # hg19=75, # hg38=96:97 #) #sources <- c("ensembl", "refseq") ## Build on the default location (depending on package location, it may ## require root/sudo) #buildAnnotationDatabase(organisms,sources) ## Build on an alternative location #myDb <- file.path(path.expand("~"),"my_ann.sqlite") #buildAnnotationDatabase(organisms,sources,db=myDb)
*This function is defunct! Please use
buildAnnotationDatabase
.*
This function creates a local annotation database to be
used with recoup so as to avoid long time on the fly
annotation downloads and formatting.
buildAnnotationStore(organisms, sources, home = file.path(path.expand("~"), ".recoup"), forceDownload = TRUE, rc = NULL)
buildAnnotationStore(organisms, sources, home = file.path(path.expand("~"), ".recoup"), forceDownload = TRUE, rc = NULL)
organisms |
a character vector of organisms
for which to download and build annotations. Check
the main |
sources |
a character vector of public sources
from which to download and build annotations. Check
the main |
home |
a valid path (accessible at least by the
current user) where the annotation database will be
set up. It defaults to |
forceDownload |
by default,
|
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
The function does not return anything. Only the annotation directory and contents are created.
Panagiotis Moulos
buildAnnotationStore("mm10","ensembl")
buildAnnotationStore("mm10","ensembl")
This function imports a GTF file with some custom annotation to the recoup annotation database.
buildCustomAnnotation(gtfFile, metadata, db = file.path(system.file(package = "recoup"), "annotation.sqlite"), rewrite=TRUE)
buildCustomAnnotation(gtfFile, metadata, db = file.path(system.file(package = "recoup"), "annotation.sqlite"), rewrite=TRUE)
gtfFile |
a GTF file containing the gene structure of the organism to be imported. |
metadata |
a list with additional information about the annotation to be imported. See Details. |
db |
a valid path (accessible at least by the
current user) where the annotation database will be
set up. It defaults to
|
rewrite |
if custom annotation found, rwrite?
(default |
Regarding the metadata
argument, it is a list
with specific format which instructs
buildCustomAnnotation
on importing the custom
annotation. Such a list may has the following members:
organism
a name of the organism which is
imported (e.g. "my_mm9"
). This is the only
mandatory member.
source
a name of the source for this
custom annotation (e.g. "my_mouse_db"
). If
not given or NULL
, the word "inhouse"
is used.
version
a string denoting the version.
If not given or NULL
, current date is used.
chromInfo
it can be one of the following:
a tab-delimited file with two columns, the first being the chromosome/sequence names and the second being the chromosome/sequence lengths.
a BAM file to read the header from and obtain the required information
a data.frame
with one column
with chromosome lengths and chromosome names as
rownames
.
See the examples below for a metadata
example.
Regarding db
, this controls the location of the
installation database. If the default is used, then there is
no need to provide the local database path to any function
that uses the database (e.g. the main metaseqr2
).
Otherwise, the user will either have to provide this each
time, or the annotation will have to be downloaded and used
on-the-fly.
The function does not return anything. Only the SQLite database is created or updated.
Panagiotis Moulos
# Dummy database as example customDir <- file.path(tempdir(),"test_custom") dir.create(customDir) myDb <- file.path(customDir,"testann.sqlite") chromInfo <- data.frame(length=c(1000L,2000L,1500L), row.names=c("A","B","C")) # Build with the metadata list filled (you can also provide a version) buildCustomAnnotation( gtfFile=file.path(system.file(package="recoup"),"dummy.gtf"), metadata=list( organism="dummy", source="dummy_db", version=1, chromInfo=chromInfo ), db=myDb ) # Try to retrieve some data myGenes <- loadAnnotation(genome="dummy",refdb="dummy_db", type="gene",db=myDb) myGenes ## Real data! ## Setup a temporary directory to download files etc. #customDir <- file.path(tempdir(),"test_custom") #dir.create(customDir) #myDb <- file.path(customDir,"testann.sqlite") ## Gene annotation dump from Ensembl #download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/", # "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz")) ## Chromosome information will be provided from the following BAM file ## available from Ensembl #bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/", # "dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam") ## Build with the metadata list filled (you can also provide a version) #buildCustomAnnotation( # gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # metadata=list( # organism="dasNov3_test", # source="ensembl_test", # chromInfo=bamForInfo # ), # db=myDb #) ## Try to retrieve some data #dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test", # level="gene",type="gene",db=myDb) #dasGenes
# Dummy database as example customDir <- file.path(tempdir(),"test_custom") dir.create(customDir) myDb <- file.path(customDir,"testann.sqlite") chromInfo <- data.frame(length=c(1000L,2000L,1500L), row.names=c("A","B","C")) # Build with the metadata list filled (you can also provide a version) buildCustomAnnotation( gtfFile=file.path(system.file(package="recoup"),"dummy.gtf"), metadata=list( organism="dummy", source="dummy_db", version=1, chromInfo=chromInfo ), db=myDb ) # Try to retrieve some data myGenes <- loadAnnotation(genome="dummy",refdb="dummy_db", type="gene",db=myDb) myGenes ## Real data! ## Setup a temporary directory to download files etc. #customDir <- file.path(tempdir(),"test_custom") #dir.create(customDir) #myDb <- file.path(customDir,"testann.sqlite") ## Gene annotation dump from Ensembl #download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/", # "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz")) ## Chromosome information will be provided from the following BAM file ## available from Ensembl #bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/", # "dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam") ## Build with the metadata list filled (you can also provide a version) #buildCustomAnnotation( # gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # metadata=list( # organism="dasNov3_test", # source="ensembl_test", # chromInfo=bamForInfo # ), # db=myDb #) ## Try to retrieve some data #dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test", # level="gene",type="gene",db=myDb) #dasGenes
This function returns a coverage list for the genomic
regions in mask
argument. Generally it should not
be used alone and is intended for internal use, although
it is useful for calculating stand-alone coverages.
calcCoverage(input, mask, strand = NULL, ignore.strand = TRUE, rc = NULL)
calcCoverage(input, mask, strand = NULL, ignore.strand = TRUE, rc = NULL)
input |
a |
mask |
a |
strand |
see the |
ignore.strand |
see the |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
input
contains the short reads in one of the
formats described in the arguments section. When
input
is a list, this list should contain one
member per chromosome of the organism of interest.
mask
contains the genomic regions over which
the coverage will be calculated from the input reads.
When calculating RNA-Seq profiles, mask
must be
a named GRangesList
where each member represents
the exons of the respective gene.
A list of Rle
objects representing the
genomic coverages of interest.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverage Rle mask <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) small.cov <- calcCoverage(test.input[[1]]$ranges,mask)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverage Rle mask <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) small.cov <- calcCoverage(test.input[[1]]$ranges,mask)
This function fills the coverage
field in the
main input
argument in recoup
function.
coverageRef(input, mainRanges, strandedParams = list(strand=NULL, ignoreStrand=TRUE), rc = NULL)
coverageRef(input, mainRanges, strandedParams = list(strand=NULL, ignoreStrand=TRUE), rc = NULL)
input |
an input list as in |
mainRanges |
the |
strandedParams |
see the |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
Same as input with the ranges
fields filled.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) test.input <- coverageRef( test.input, mainRanges=testGenomeRanges )
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) test.input <- coverageRef( test.input, mainRanges=testGenomeRanges )
*This function is defunct! Please use
coverageRef
.*
This function fills the coverage
field in the
main input
argument in recoup
function.
coverageRnaRef( input, mainRanges, strandedParams = list(strand=NULL, ignoreStrand=TRUE), rc = NULL)
coverageRnaRef( input, mainRanges, strandedParams = list(strand=NULL, ignoreStrand=TRUE), rc = NULL)
input |
an input list as in |
mainRanges |
a named |
strandedParams |
see the |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
Same as input with the ranges
fields filled.
Panagiotis Moulos
# Load some data #data("recoup_test_data",package="recoup") # Note: the figures that will be produced will not look # realistic or pretty and will be "bumpy". This is because # package size limitations posed by Bioconductor guidelines # do not allow for a full test dataset. As a result, the input # below is not an RNA-Seq dataset. Have a look at the # vignette on how to test with more realistic data. # Calculate coverages #testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, # keep.extra.columns=TRUE) #test.input <- coverageRef( # test.input, # mainRanges=test.exons #)
# Load some data #data("recoup_test_data",package="recoup") # Note: the figures that will be produced will not look # realistic or pretty and will be "bumpy". This is because # package size limitations posed by Bioconductor guidelines # do not allow for a full test dataset. As a result, the input # below is not an RNA-Seq dataset. Have a look at the # vignette on how to test with more realistic data. # Calculate coverages #testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, # keep.extra.columns=TRUE) #test.input <- coverageRef( # test.input, # mainRanges=test.exons #)
This function connects to the EBI's Biomart service using
the package biomaRt and downloads annotation elements
(gene co-ordinates, exon co-ordinates, gene
identifications, biotypes etc.) for each of the supported
organisms. See the help page of recoup
for a list of supported organisms. The function downloads
annotation for an organism genes or exons. It also uses
the UCSC public database connection API to download UCSC
and RefSeq annotations.
getAnnotation(org, type, refdb = "ensembl", ver = NULL, rc = NULL)
getAnnotation(org, type, refdb = "ensembl", ver = NULL, rc = NULL)
org |
the organism for which to download
annotation. Check the main |
type |
either |
refdb |
the online source to use to fetch
annotation. It can be |
ver |
the version of |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
A data frame with the canonical (not isoforms!) genes or
exons of the requested organism. When
type="genes"
, the data frame has the following
columns: chromosome, start, end, gene_id, gc_content,
strand, gene_name, biotype. When type="exon"
the
data frame has the following columns: chromosome, start,
end, exon_id, gene_id, strand, gene_name, biotype. The
gene_id and exon_id correspond to Ensembl gene and exon
accessions respectively. The gene_name corresponds to
HUGO nomenclature gene names.
The data frame that is returned contains only "canonical" chromosomes for each organism. It does not contain haplotypes or random locations and does not contain chromosome M.
Panagiotis Moulos
mm10.genes <- getAnnotation("mm10","gene")
mm10.genes <- getAnnotation("mm10","gene")
This function returns a character vector of Ensembl biotypes for each supported organism. Mostly for internal use, but can also be used to list the biotypes and use some of them to subset initial genomic regions to be profiled.
getBiotypes(org)
getBiotypes(org)
org |
One of the supported |
A character vector of biotypes.
Panagiotis Moulos
hg18.bt <- getBiotypes("hg18")
hg18.bt <- getBiotypes("hg18")
This function returns a data frame with information on locally installed, supported or custom, annotations.
getInstalledAnnotations(obj = NULL)
getInstalledAnnotations(obj = NULL)
obj |
|
The function returns a data.frame
object with
the installed local annotations.
Panagiotis Moulos
db <- file.path(system.file(package="recoup"), "annotation.sqlite") if (file.exists(db)) ig <- getInstalledAnnotations(obj=db)
db <- file.path(system.file(package="recoup"), "annotation.sqlite") if (file.exists(db)) ig <- getInstalledAnnotations(obj=db)
This function imports GenomicRanges to be used with recoup from a local GTF file.
importCustomAnnotation(gtfFile, metadata, type = c("gene", "exon", "utr"))
importCustomAnnotation(gtfFile, metadata, type = c("gene", "exon", "utr"))
gtfFile |
a GTF file containing the gene structure of the organism to be imported. |
metadata |
a list with additional information about
the annotation to be imported. The same as in the
|
type |
one of the |
The function returns a GenomicRanges
object with
the requested annotation.
Panagiotis Moulos
# Dummy GTF as example chromInfo <- data.frame(length=c(1000L,2000L,1500L), row.names=c("A","B","C")) # Build with the metadata list filled (you can also provide a version) myGenes <- importCustomAnnotation( gtfFile=file.path(system.file(package="recoup"),"dummy.gtf"), metadata=list( organism="dummy", source="dummy_db", version=1, chromInfo=chromInfo ), type="gene" ) ## Real data! ## Gene annotation dump from Ensembl #download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/", # "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # file.path(tempdir(),"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz")) ## Build with the metadata list filled (you can also provide a version) #dasGenes <- importCustomAnnotation( # gtfFile=file.path(tempdir(),"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # metadata=list( # organism="dasNov3_test", # source="ensembl_test" # ), # type="gene" #)
# Dummy GTF as example chromInfo <- data.frame(length=c(1000L,2000L,1500L), row.names=c("A","B","C")) # Build with the metadata list filled (you can also provide a version) myGenes <- importCustomAnnotation( gtfFile=file.path(system.file(package="recoup"),"dummy.gtf"), metadata=list( organism="dummy", source="dummy_db", version=1, chromInfo=chromInfo ), type="gene" ) ## Real data! ## Gene annotation dump from Ensembl #download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/", # "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # file.path(tempdir(),"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz")) ## Build with the metadata list filled (you can also provide a version) #dasGenes <- importCustomAnnotation( # gtfFile=file.path(tempdir(),"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"), # metadata=list( # organism="dasNov3_test", # source="ensembl_test" # ), # type="gene" #)
This function performs k-means clustering on
recoup
generated profile matrices and
stores the result as a factor in the design element.
If no design is present, then one is created from the
k-means result.
kmeansDesign(input, design = NULL, kmParams)
kmeansDesign(input, design = NULL, kmParams)
input |
a list object created from
|
design |
See the respective argument in
|
kmParams |
Contains parameters for k-means
clustering on profiles. See the respective argument
in |
The design data frame, either created from scratch or augmented by k-means clustering.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(1000,1000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Re-design based on k-means kmParams=list(k=2,nstart=20,algorithm="MacQueen",iterMax=20, reference=NULL) design <- kmeansDesign(test.tss$data,kmParams=kmParams)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(1000,1000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Re-design based on k-means kmParams=list(k=2,nstart=20,algorithm="MacQueen",iterMax=20, reference=NULL) design <- kmeansDesign(test.tss$data,kmParams=kmParams)
This function creates loads an annotation element from
the local annotation database to be used with recoup.
If the annotation is not found and the organism is
supported, the annotation is created on the fly but not
imported in the local database. Use
buildAnnotationDatabase
for this purpose.
loadAnnotation(genome, refdb, type = c("gene", "exon", "utr"), version="auto", db = file.path(system.file(package = "recoup"), "annotation.sqlite"), summarized = FALSE, asdf = FALSE, rc = NULL)
loadAnnotation(genome, refdb, type = c("gene", "exon", "utr"), version="auto", db = file.path(system.file(package = "recoup"), "annotation.sqlite"), summarized = FALSE, asdf = FALSE, rc = NULL)
genome |
a |
refdb |
a |
type |
one of the |
version |
same as the |
db |
same as the |
summarized |
if |
asdf |
return the result as a |
rc |
same as the |
The function returns a GenomicRanges
object with
the requested annotation.
Panagiotis Moulos
db <- file.path(system.file(package="recoup"), "annotation.sqlite") if (file.exists(db)) gr <- loadAnnotation(genome="hg19",refdb="ensembl", type="gene",db=db)
db <- file.path(system.file(package="recoup"), "annotation.sqlite") if (file.exists(db)) gr <- loadAnnotation(genome="hg19",refdb="ensembl", type="gene",db=db)
This function accepts two or more recoup
output objects holding single samples to a merged object
so that all samples can be used together. This is useful
when many coverages must be calulated/plotted and memory
issues do not allow effective parallelization.
mergeRuns(..., withDesign = c("auto", "drop"), dropPlots = TRUE)
mergeRuns(..., withDesign = c("auto", "drop"), dropPlots = TRUE)
... |
one or more |
withDesign |
one of |
dropPlots |
if profile and/or heatmap plots are
attached to the input object(s), they will be
recalculated if |
.
The withDesign
argument controls what should be
done if any input has an attached design. The default
behavioir ("auto"
) will try to do its best to
preserve compatible designs. If one or more inputs have
the same design, it will be applied to the rest of the
samples. If there is only one design, it will be applied
to all samples (if you don't want this to happen, choose
"drop"
). If more than one sample has an attached
design but these are incompatible (different numbers of
rows/rownames, columns/columnnames), then all designs
are dropped. Obviously, withDesign="drop"
drops
all attached designs and the output object is free of
a design data frame.
A recoup
output object with as many
samples as in ...
.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") test.input.shift <- test.input names(test.input.shift) <- paste(names(test.input.shift),"_1",sep="") test.input.shift[[1]]$id <- paste0(test.input.shift[[1]]$id,"_1") test.input.shift[[1]]$ranges <- shift(test.input.shift[[1]]$ranges,100) test.input.shift[[2]]$id <- paste0(test.input.shift[[2]]$id,"_1") test.input.shift[[2]]$ranges <- shift(test.input.shift[[2]]$ranges,100) test.tss.1 <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, rc=0.1 ) test.tss.2 <- recoup( test.input.shift, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, rc=0.1 ) test.tss <- mergeRuns(test.tss.1,test.tss.2)
# Load some data data("recoup_test_data",package="recoup") test.input.shift <- test.input names(test.input.shift) <- paste(names(test.input.shift),"_1",sep="") test.input.shift[[1]]$id <- paste0(test.input.shift[[1]]$id,"_1") test.input.shift[[1]]$ranges <- shift(test.input.shift[[1]]$ranges,100) test.input.shift[[2]]$id <- paste0(test.input.shift[[2]]$id,"_1") test.input.shift[[2]]$ranges <- shift(test.input.shift[[2]]$ranges,100) test.tss.1 <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, rc=0.1 ) test.tss.2 <- recoup( test.input.shift, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, rc=0.1 ) test.tss <- mergeRuns(test.tss.1,test.tss.2)
This function reads the BAM/BED files present in the
input list object and fills the ranges
field of
the latter. At the same time it takes care of certain
preprocessing steps like normalization.
preprocessRanges(input, preprocessParams, genome, bamRanges=NULL, bamParams = NULL, rc = NULL)
preprocessRanges(input, preprocessParams, genome, bamRanges=NULL, bamParams = NULL, rc = NULL)
input |
an input list as in |
preprocessParams |
see the |
genome |
see the |
bamRanges |
a |
bamParams |
see the |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
This function fills the ranges
field in the
main input
argument in recoup
function.
Panagiotis Moulos
# This example only demonstrates the usage of the # preprocessRanges function. The input BAM files # included with the package will not produce # realistic plots as they contain only a very small # subset of the original data presented in the # vignettes (50k reads). Please see recoup vignettes # for further demonstrations. test.in <- list( WT_H4K20me1=list( id="WT_H4K20me1", name="WT H4K20me1", file=system.file("extdata", "WT_H4K20me1_50kr.bam", package="recoup"), format="bam", color="#EE0000" ), Set8KO_H4K20me1=list( id="Set8KO_H4K20me1", name="Set8KO H4K20me1", file=system.file("extdata", "Set8KO_H4K20me1_50kr.bam", package="recoup"), format="bam", color="#00BB00" ) ) pp=list( normalize="none", spliceAction="split", spliceRemoveQ=0.75 ) test.in <- preprocessRanges(test.in,pp)
# This example only demonstrates the usage of the # preprocessRanges function. The input BAM files # included with the package will not produce # realistic plots as they contain only a very small # subset of the original data presented in the # vignettes (50k reads). Please see recoup vignettes # for further demonstrations. test.in <- list( WT_H4K20me1=list( id="WT_H4K20me1", name="WT H4K20me1", file=system.file("extdata", "WT_H4K20me1_50kr.bam", package="recoup"), format="bam", color="#EE0000" ), Set8KO_H4K20me1=list( id="Set8KO_H4K20me1", name="Set8KO H4K20me1", file=system.file("extdata", "Set8KO_H4K20me1_50kr.bam", package="recoup"), format="bam", color="#00BB00" ) ) pp=list( normalize="none", spliceAction="split", spliceRemoveQ=0.75 ) test.in <- preprocessRanges(test.in,pp)
This function fills the profile
field in the
main input
argument in recoup
function by calculating profile matrices from coverages
which will be used for plotting.
profileMatrix(input, flank, binParams, rc = NULL, .feNoSplit = FALSE)
profileMatrix(input, flank, binParams, rc = NULL, .feNoSplit = FALSE)
input |
an input list as in |
flank |
see the |
binParams |
see the |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
.feNoSplit |
Temporary internal variable. Do not change unless you know what you are doing! |
Same as input with the profile
fields filled.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Do some work testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) w <- width(testGenomeRanges) testGenomeRanges <- promoters(testGenomeRanges,upstream=2000,downstream=0) testGenomeRanges <- resize(testGenomeRanges,width=w+4000) test.input <- coverageRef( test.input, mainRanges=testGenomeRanges ) test.input <- profileMatrix( test.input, flank=c(2000,2000), binParams=list(flankBinSize=50,regionBinSize=150, sumStat="mean",interpolation="auto"), rc=0.1 )
# Load some data data("recoup_test_data",package="recoup") # Do some work testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) w <- width(testGenomeRanges) testGenomeRanges <- promoters(testGenomeRanges,upstream=2000,downstream=0) testGenomeRanges <- resize(testGenomeRanges,width=w+4000) test.input <- coverageRef( test.input, mainRanges=testGenomeRanges ) test.input <- profileMatrix( test.input, flank=c(2000,2000), binParams=list(flankBinSize=50,regionBinSize=150, sumStat="mean",interpolation="auto"), rc=0.1 )
This function calculates and plots signal profiles
created from short sequence reads derived from Next
Generation Sequencing technologies. The profiles
provided are either sumarized curve profiles or
heatmap profiles. Currently, recoup
supports
genomic profile plots for reads derived from ChIP-Seq
and RNA-Seq experiments. The function uses ggplot2 and
ComplexHeatmap graphics facilities for curve and
heatmap coverage profiles respectively. The output
list object can be reused as input to this function
which will automatically recognize which profile
elements needto be recalculated, saving time.
recoup( input, design = NULL, region = c("genebody", "tss", "tes", "utr3", "custom"), type = c("chipseq", "rnaseq"), signal = c("coverage", "rpm"), genome = c("hg18", "hg19", "hg38", "mm9" ,"mm10", "rn5", "rn6", "dm3", "dm6", "danrer7", "danrer10", "pantro4", "pantro5", "susscr3", "susscr11", "ecucab2", "tair10"), version = "auto", refdb = c("ensembl", "ucsc", "refseq"), flank = c(2000, 2000), onFlankFail = c("drop","trim"), fraction = 1, orderBy = list( what = c("none", "suma", "sumn", "maxa","maxn", "avga", "avgn", "hcn"), order = c("descending", "ascending"), custom = NULL ), binParams = list( flankBinSize = 0, regionBinSize = 0, sumStat = c("mean", "median"), interpolation = c("auto", "spline", "linear", "neighborhood"), binType = c("variable", "fixed"), forceHeatmapBinning = TRUE, forcedBinSize = c(50, 200), chunking = FALSE ), selector = NULL, preprocessParams = list( fragLen = NA, cleanLevel = c(0, 1, 2, 3), normalize = c("none", "linear", "downsample","sampleto"), sampleTo = 1e+6, spliceAction = c("split", "keep", "remove"), spliceRemoveQ = 0.75, bedGenome = NA ), plotParams = list( plot = TRUE, profile = TRUE, heatmap = TRUE, correlation = TRUE, signalScale = c("natural", "log2"), heatmapScale = c("common", "each" ), heatmapFactor = 1, corrScale = c("normalized", "each"), sumStat = c("mean", "median"), smooth = TRUE, corrSmoothPar = ifelse(is.null(design), 0.1, 0.5), singleFacet = c("none", "wrap", "grid"), multiFacet = c("wrap", "grid"), singleFacetDirection = c("horizontal", "vertical"), conf = TRUE, device = c("x11", "png", "jpg", "tiff", "bmp", "pdf", "ps"), outputDir = ".", outputBase = NULL ), saveParams = list( ranges = TRUE, coverage = TRUE, profile = TRUE, profilePlot = TRUE, heatmapPlot = TRUE, correlationPlot = TRUE ), kmParams = list( k = 0, nstart = 20, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), iterMax = 20, reference = NULL ), strandedParams = list( strand = NULL, ignoreStrand = TRUE ), ggplotParams = list( title = element_text(size = 12), axis.title.x = element_text(size = 10, face = "bold"), axis.title.y = element_text(size = 10, face = "bold"), axis.text.x = element_text(size = 9, face = "bold"), axis.text.y = element_text(size = 10, face = "bold"), strip.text.x = element_text(size = 10, face = "bold"), strip.text.y = element_text(size = 10, face = "bold"), legend.position = "bottom", panel.spacing = grid::unit(1, "lines") ), complexHeatmapParams = list( main = list( cluster_rows = ifelse(length(grep( "hc", orderBy$what)) > 0, TRUE, FALSE), cluster_columns = FALSE, column_title_gp = grid::gpar(fontsize = 10, font = 2), show_row_names = FALSE, show_column_names = FALSE, heatmap_legend_param = list( color_bar = "continuous" ) ), group=list( cluster_rows = ifelse(length(grep( "hc", orderBy$what)) > 0, TRUE, FALSE), cluster_columns = FALSE, column_title_gp = grid::gpar(fontsize = 10, font = 2), show_row_names = FALSE, show_column_names = FALSE, row_title_gp = grid::gpar(fontsize = 8, font = 2), gap = unit(5, "mm"), heatmap_legend_param = list( color_bar = "continuous" ) ) ), bamParams = NULL, onTheFly = FALSE, localDb = file.path(system.file(package = "recoup"), "annotation.sqlite"), rc = NULL )
recoup( input, design = NULL, region = c("genebody", "tss", "tes", "utr3", "custom"), type = c("chipseq", "rnaseq"), signal = c("coverage", "rpm"), genome = c("hg18", "hg19", "hg38", "mm9" ,"mm10", "rn5", "rn6", "dm3", "dm6", "danrer7", "danrer10", "pantro4", "pantro5", "susscr3", "susscr11", "ecucab2", "tair10"), version = "auto", refdb = c("ensembl", "ucsc", "refseq"), flank = c(2000, 2000), onFlankFail = c("drop","trim"), fraction = 1, orderBy = list( what = c("none", "suma", "sumn", "maxa","maxn", "avga", "avgn", "hcn"), order = c("descending", "ascending"), custom = NULL ), binParams = list( flankBinSize = 0, regionBinSize = 0, sumStat = c("mean", "median"), interpolation = c("auto", "spline", "linear", "neighborhood"), binType = c("variable", "fixed"), forceHeatmapBinning = TRUE, forcedBinSize = c(50, 200), chunking = FALSE ), selector = NULL, preprocessParams = list( fragLen = NA, cleanLevel = c(0, 1, 2, 3), normalize = c("none", "linear", "downsample","sampleto"), sampleTo = 1e+6, spliceAction = c("split", "keep", "remove"), spliceRemoveQ = 0.75, bedGenome = NA ), plotParams = list( plot = TRUE, profile = TRUE, heatmap = TRUE, correlation = TRUE, signalScale = c("natural", "log2"), heatmapScale = c("common", "each" ), heatmapFactor = 1, corrScale = c("normalized", "each"), sumStat = c("mean", "median"), smooth = TRUE, corrSmoothPar = ifelse(is.null(design), 0.1, 0.5), singleFacet = c("none", "wrap", "grid"), multiFacet = c("wrap", "grid"), singleFacetDirection = c("horizontal", "vertical"), conf = TRUE, device = c("x11", "png", "jpg", "tiff", "bmp", "pdf", "ps"), outputDir = ".", outputBase = NULL ), saveParams = list( ranges = TRUE, coverage = TRUE, profile = TRUE, profilePlot = TRUE, heatmapPlot = TRUE, correlationPlot = TRUE ), kmParams = list( k = 0, nstart = 20, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), iterMax = 20, reference = NULL ), strandedParams = list( strand = NULL, ignoreStrand = TRUE ), ggplotParams = list( title = element_text(size = 12), axis.title.x = element_text(size = 10, face = "bold"), axis.title.y = element_text(size = 10, face = "bold"), axis.text.x = element_text(size = 9, face = "bold"), axis.text.y = element_text(size = 10, face = "bold"), strip.text.x = element_text(size = 10, face = "bold"), strip.text.y = element_text(size = 10, face = "bold"), legend.position = "bottom", panel.spacing = grid::unit(1, "lines") ), complexHeatmapParams = list( main = list( cluster_rows = ifelse(length(grep( "hc", orderBy$what)) > 0, TRUE, FALSE), cluster_columns = FALSE, column_title_gp = grid::gpar(fontsize = 10, font = 2), show_row_names = FALSE, show_column_names = FALSE, heatmap_legend_param = list( color_bar = "continuous" ) ), group=list( cluster_rows = ifelse(length(grep( "hc", orderBy$what)) > 0, TRUE, FALSE), cluster_columns = FALSE, column_title_gp = grid::gpar(fontsize = 10, font = 2), show_row_names = FALSE, show_column_names = FALSE, row_title_gp = grid::gpar(fontsize = 8, font = 2), gap = unit(5, "mm"), heatmap_legend_param = list( color_bar = "continuous" ) ) ), bamParams = NULL, onTheFly = FALSE, localDb = file.path(system.file(package = "recoup"), "annotation.sqlite"), rc = NULL )
input |
the main input to |
design |
either a data frame with grouping factors
as columns (e.g. two grouping factors can be strand,
and Ensembl biotype) or a tab delimited text file with
the same content (grouping factors in columns). If a
data frame, the |
region |
one of |
type |
one of |
signal |
plots signal based on coverage
( |
genome |
when |
version |
the version of |
refdb |
one of |
flank |
a vector of length two with the number of
base pairs to flank upstream and downstream the
|
onFlankFail |
action to be taken when flanking
causes the requested plot genomic coordinates to go
beyond the lengths of reference sequences (e.g.
chromosomes). It can be |
region
Minimum flank is 0bp and maximum is
50kb. It is always expressed in bp.
fraction |
a number from 0 to 1 (default) denoting the fraction of total data to be used. See Details for further information. |
orderBy |
a named list whose members control the
order of the genomic regions (related to the
|
binParams |
a named list whose members control the resolution of the coverage profiles. The list has the following fields:
See Details for a few further notes in the usage of
|
selector |
a named list whose members control
some subsetting abilities regarding the input reference
genomic regions (
|
preprocessParams |
a named list whose members
control certain preprocessing steps applied to the
|
plotParams |
a named list whose members control profile (curve and heatmap) plotting parameters. The list has the following fields:
|
saveParams |
a named list which controls the
information to be stored in the
See the Details section for some additional information. |
kmParams |
a named list which controls the
execution of k-means clustering using standard R base
function
The result of k-means clustering will be appended to
|
strandedParams |
a named list which controls how strand information will be treated (if present). The list has the following fields:
|
ggplotParams |
a named list with theme parameters
passed to the |
complexHeatmapParams |
a named list with groups
of parameters passed to the
|
bamParams |
BAM file read parameters passed to
|
onTheFly |
Read short reads directly from BAM files
when input contains paths to BAM files. In this case the
storage of short reads in the output list object as a
GRanges object is not possible and the final object
becomes less reusable but the memory footprint is
lower. Defaults to |
localDb |
local path with the annotation database.
See also |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
When input
is a list, is should contain as many
sublists as the number of samples. Each sublist must
have at least the following fields:
id
: a unique identifier for each
sample which should not contain whitespaces
and preferably no special characters.
file
: the full path to the BAM, BED
of BigWig file. If the path to the BAM is a
hyperlink, the BAM file must be indexed. BigWigs
are already indexed.
format
: one of "bam"
,
"bed"
or "bigwig"
.
Additionally, each sublist may also contain the following fields:
name
: a sample name which will
appear in plots.
color
: either an R color (see the
colors
function) or a hexadecimal
color (e.g. "#FF0000"
).
When input
is a text file, this should be
strictly tab-delimited (no other delimiters like
comma), should contain a header with the same names
(case sensitive) as the sublist fields above (
id, file, format
are mandatory and name, color
are optional).
When genome
is not one of the supported
organisms, it should be a text tab delimited file
(only tabs supported) with a header line, or a
data frame, where the basic BED field must be
present, that means that at least "chromosome"
,
"start"
, "end"
, a unique identifier
column and "strand"
must be present, \
preferably in this order. This file is read in a
data frame and then passed to the
makeGRangesFromDataFrame
function
from the GenomicRanges
package which takes
care of the rest. See also the
makeGRangesFromDataFrame
's documentation.
When genome
is one of the supported organisms,
recoup
takes care of the rest.
The version
argument controls what annotation
version is used (when using local annotation after
having built a store with
buildAnnotationStore
or when downloading
on the fly). When "auto", it will use the latest
annotation version for the selected source. So, if
source="ensembl"
, it will use the latest
installed or available version for the specified
organism based on information retrieved from the
biomaRt
package. For example, for
organism="hg19"
, it will be 91
at
the point where this manual is written. If
source="refseq"
, recoup
will
either use the latest downloaded annotation
according to a timestamp in the directory structure
or download and use the latest tables from UCSC on
the fly. If an annotation version does not exist,
recoup
will throw an error and exit.
When region
is "tss"
, the curve and
heatmap profiles are centered around the TSS of the
(gene) regions provided with the genome
argument, flanked accordind to the flank
argument. The same applied for region="tes"
where the plots are centered around the transcription
end site. When region
is "genebody"
,
the profiles consist of two flanking parts (upstream
of the TSS and downstream of the TES) and a middle
part consisting of the gene body coverage profile.
The latter is constructed by creating a fixed number
of intervals (bins) along each gene and averaging
the coverage of each interval. In some extreme cases
(e.g. for small genes), the number of bins may be
larger than the gene length. In these cases, a few
zeros are distributed randomly across the number of
bins to reach the predefined number of gene body
intervals. When region
is "custom"
the behavior depends on the custom regions length.
If it contains single-base intervals (e.g. ChIP-Seq
peak centers), then the behavior is similar to the
TSS behavior above. If it contains genomic intervals
of equal or unequal size, the behavior is similar
to the gene body case.
The fraction
parameter controls the total
fraction of both total reads and genomic regions
to be used for profile creation. This means that
the total reads for each sample are randomly
downsampled to fraction*100%
of the
original reads and the same applied to the input
genomic regions. This practice is followed by
similar packages (like ngs.plot) and serves the
purpose of a quick overview of how the actual
profiles look before profiling the whole genome.
Regarding the orderBy
parameters, for the
options of the what
parameter "sum"
type of options order profiles according to i) the
sum of coverages of all samples in each genomic
region when orderBy$what="suma"
or ii) the
sum of coverages of sample n (e.g. 2) in each
genomic region when orderBy$what="sumn"
(e.g. orderBy$what="sum1"
). The same apply
for the "max"
type of options but this time
the ordering is performed according to the position
of the highest coverage in each genomic profile. Ties
in the position of highest coverage are broken
randomly and sorting is performed with the default R
sort
. Similarly for "avg"
type
of options, the ordering is performed according to
the average total coverage of a reference region.
For the "hc"
type of options, hierarchical
clustering is performed on the selected (n)
reference profile (e.g. orderBy$what="hc1"
)
and this ordering is applied to the rest of the
sample profiles. When what="none"
, no
ordering is performed and the input order is used
(genome
argument). If any design is present
through the design
argument or k-means
clustering is also performed (through the
kmParams
argument), the orderBy directives
are applied to each sub-profile created by design
or k-means clustering.
Regarding the flankBinSize
field of
binParams
, it is used only when
region="genebody"
or region="custom"
and the custom regions are not single-base regions.
This happens as when the genomic regions to be
profiled are single-base regions (e.g. TSSs or
ChIP-Seq peak centers), these regions are merged
with the flanking areas and alltogether form the
main genomic region. In these cases, only the
regionBinSize
field value is used. Note that
when type="rnaseq"
or region="genebody"
or region="custom"
with non single-base regions
the values of flankBinSize
and
regionBinSize
offer a fine control over how
the flanking and the main regions are presented in the
profiles. For example, when flankBinSize=100
and regionBinSize=100
with a gene body profile
plot, the outcome will look kind of "unrealistic" as
the e.g. 2kb flanking regions will look very similar
to the usually larger gene bodies. On the other hand
if flankBinSize=50
and regionBinSize=200
,
this setting will create more "realistic" gene body
profiles as the flanking regions will be squished and
the gene body area will look expanded. Within the same
parameter group is also interpolation
. When
working with reference regions of different lengths
(e.g. gene bodies), it happens very often that their
lengths are a little to a lot smaller than the number
of bins into which they should be split and averaged
in order to be able to create the average curve and
heatmap profiles. recoup
allows for dynamic
resolutions by permitting to the user to set the
number of bins into which genomic areas will be
binned or by allowing a per-base resolution where
possible. The interpolation
parameter controls
what happens in such cases. When "spline"
, the
R function spline
is used, with the
default method, to produce a spline interpolation
of the same size as the regionBinSize
option
and is used as the coverage for that region. When
"linear"
, the procedure is the same as above
but using approx
. When
"neighborhood"
, a number of NA
values
are distributed randomly across the small area
coverage vector, excluding the first and the last
two positions, in order to reach regionBinSize
.
Then, each NA
position is filled with the mean
value of the two values before and the two values
after the NA
position, with na.rm=TRUE
.
This method should be avoided when >20% of the values
of the extended vector are NA
's as it may cause
a crash. However, it should be the most accurate one
in the opposite case (few NA
's). When
"auto"
(the default), a hybrid of
"spline"
and "neighborhood"
is applied.
If the NA
's constitute more than 20% of the
extended vector, "spline"
is used, otherwise
"neighborhood"
. None of the above is applied
to regions of equal length as there is no need for
that. Furthermore, the parameter binType
within the same parameter group controls the type
of bins that a genomic interval should be split to
in order to effectively calculate realistic signals
when signal="rpm"
. When "variable"
, the
number of bins that each genomic interval is split to
is proportional to the square root of its size (the
square root smooths the region length distribution,
otherwise many regions e.g. in the set of human
genes/transcripts will end up in unit-size bins even
though they can support larger resolutions). The final
signal is interpolated to a length of
regionBinSize
or flankBinSize
to
produce the final plots. When "fixed"
, the
genomic intervals are "pushed" to have
regionBinSize
or flankBinSize
bins, but
if the areas are not large enought, they may end-up to
many unit-size bins which will inflate and oversmooth
the signal. It may give better results if the regions
where the profile is to be created are all large enough.
Regarding the usage of selector$id
field, this
requires some careful usage, as if the ids present
there and the ids of the genome
areas do not
match, there will be no genomic regions left to
calculate coverage profiles on and the program will
crash.
Regarding the usage of the preprocessParams
argument, the normalize
field controls how
the GRanges
representing the reads extracted
from BAM/BED files or the signals extracted from BigWig
files will be normalized. When "none"
, no
normalization is applied and external normalization is
assumed. When "linear"
, all the library sizes
are divided by the maximum one and a normalization
factor is calculated for each sample. The coverage
of this sample across the input genomic regions is
then multiplied by this factor. When
"downsample"
, all libraries are downsampled to
the minimum library size among samples. When
"sampleto"
, all libraries are downsampled to
a fixed number of reads. The sampleTo
field
of preprocessParams
tells recoup the fixed number
of reads to downsample all libraries when
preprocessParams$normalize="sampleTo"
. It
defaults to 1 million reads (1e+6
). The
spliceAction
field of preprocessParams
is
used to control the action to be taken in the presence
of RNA-Seq spliced reads (implies type="rnaseq"
).
When "keep"
, no action is performed regarding
the spliced reads (represented as very long reads
spanning intronic regions in the GRanges
object).
When "remove"
, these reads are excluded from
coverage calculations according to their length as
follows: firstly the length distribution of all reads
lengths (using the width
function for
GRanges
) is calculated. Then the quantile
defined by the field spliceRemoveQ
of
preprocessParams
is calculated and reads above
the length corresponding to this quantile are excluded.
When "splice"
, then splice junction information
inferred from CIGAR strings (if) present in the BAM
files is used to splice the longer reads and calculate
real coverages. This option is not available with
BED files, however, BED files can contain pre-spliced
reads using for example BEDTools for conversion. It
should also be noted that in the case of BigWig files,
only linear normalization is supported as there is no
information on raw reads. The cleanLevel
field
controls what filtering will be applied to the raw
reads read from BAM/BED files prior to producing the
signal track. It can have four values: 0
for no
read processing/filtering, (use reads as they are, no
uniqueness and no removal of unlocalized regions
and mitochondrial DNA reads, unless filtered by the
user before using recoup
), 1
for removing
unlocalized regions (e.g. chrU, hap, random etc.),
2
for removing reads of level 1 plus
mitochondrial reads (chrM) and 3
for removing
reads of level 2 plus using unique reads only. The
default is level 0
(no filtering).
Regarding the heatmapFactor
option of
plotParams
, it controls the color scale of the
heatmap as follows: the default value (1) causes the
extremes of the heatmap colors to be linearly and
equally distributed across the actual coverage profile
values. If set smaller than 1, the the upper extreme of
the coverage values (which by default maps to the upper
color point) is multiplied by this factor and this new
value is set as the upper color break (limit). This has
the effect of decreasing the brightness of the heatmap
as color is saturated before reaching the maximum
coverage value. If set greater than 1, then the heatmap
brightness is increased. Regarding the correlation
option of plotParams
, if TRUE
then
recoup
calculates average coverage values for
each reference region (row-wise in the profile matrices)
instead of the average coverage in each base of the
reference regions (column-wise in the profile matrices).
This is particularly useful for checking whether total
genome profiles for some biological factor/condition
correlate with each other. This potential correlation
is becoming even clearer when orderBy$what
is
not "none"
. Regarding the corrScale
option
of plotParams
, it controls whether the average
coverage curves over the set of reference genomic
regions (one average coverage vale per genomic region,
note the difference with the profiles where the coverage
is calculated over the genomic locations themselves)
should be normalized to a 0-1 scale or not. This is
particularly useful when plotting data from different
libraries (e.g. PolII and H3K27me1 occupancy over gene
bodies) where other types of normalization (e.g. read
downsampling cannot be applied). Regarding the
corrSmoothPar
option of plotParams
, it
controls the smoothing parameter for coverage
correlation curves. If design
is present,
spline smoothing is applied
(smooth.spline
) with spar=0.5
else lowess smoothing is applied (lowess
)
with f=0.1
. corrSmoothPar
controls the
spar
and f
respectively.
Regarding the usage of saveParams
argument,
this is useful for several purposes: one is for re-using
recoup without re-reading BAM/BED/BigWig files. If the
ranges are present in the input object to recoup, they
are not re-calculated. If not stored, the memory/storage
usage is reduced but the object can be used only for
simply replotting the profiles using
recoupProfile
and/or
recoupHeatmap
functions.
As a note regarding parallel calculations, the number
of cores assigned to recoup
depends both on the
number of cores and the available RAM in your system.
The most RAM expensive part of recoup
is
currently the construction of binned profile matrices.
If you have a lot of cores (e.g. 16) but less than
128Gb of RAM for this number of cores, you should avoid
using all cores, especially with large BAM files. Half of
them would be more appropriate.
Finally, the output list of recoup
can be provided
as input again to recoup
with some input parameters
changed. recoup
will then automatically recognize
what has been changed and recalculate some, all or none
of the genomic region profiles, depending on what input
parameters have changed. For example, if any of the
ordering options change (e.g. from no profile ordering to
k-means clustering), then no recalculations are performed
and the process is very fast. If region binning is changed
(binParams$flankBinSize
or
binParams$regionBinSize
), then only profile matrices
are recalculated and coverages are maintained. If any of the
preprocessParams
changes, this causes all object
including the short reads to be reimported and profiles
recalculated from the beginning.
a named list with five members:
data
: the input
argument if it
was a list or the resulting list from the unexported
internal readConfig
function, with the
ranges
, coverage
and profile
fields filled according to saveParams
. This
data
member can be used again as an argument
to recoup
. The coverage
and
profile
fields will be recalculated according
to recoup
parameters but the ranges
will be resued if the input files are not changed.
design
: the design data frame which is
used to facet the profiles.
plots
: the ggplot2
and/or
Heatmap
objects created by recoup
.
callopts
: the majority of recoup
call parameters. Their storage serves the reuse of
a recoup
list object so that only certain
elements of plots are recalculated.
Panagiotis Moulos
# Load some sample data data("recoup_test_data",package="recoup") # Note: the figures that will be produced will not look # realistic and will be "bumpy". This is because package # size limitations posed by Bioconductor guidelines do not # allow for a full test dataset. Have a look at the # vignette on how to test with more realistic data. # TSS high resolution profile with no design test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, rc=0.1 ) # Genebody low resolution profile with 2-factor design, # wide genebody and more narrow flanking test.gb <- recoup( test.input, design=test.design, region="genebody", type="chipseq", genome=test.genome, flank=c(2000,2000), binParams=list(flankBinSize=50,regionBinSize=150), orderBy=list(what="hc1"), selector=NULL, rc=0.1 )
# Load some sample data data("recoup_test_data",package="recoup") # Note: the figures that will be produced will not look # realistic and will be "bumpy". This is because package # size limitations posed by Bioconductor guidelines do not # allow for a full test dataset. Have a look at the # vignette on how to test with more realistic data. # TSS high resolution profile with no design test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, rc=0.1 ) # Genebody low resolution profile with 2-factor design, # wide genebody and more narrow flanking test.gb <- recoup( test.input, design=test.design, region="genebody", type="chipseq", genome=test.genome, flank=c(2000,2000), binParams=list(flankBinSize=50,regionBinSize=150), orderBy=list(what="hc1"), selector=NULL, rc=0.1 )
The testing data package containes a small gene set, a design data frame, some genomic regions and an input object for testing of recoup with ChIP-Seq and RNA-Seq data. Specifically:
test.input
: A small data set which
contains 10000 reads from H4K20me1 ChIP-Seq
data from WT adult mice and Set8 (Pr-Set7) KO
mice. The tissue is liver.
test.genome
: A small gene set (100
genes) and their coordinates from mouse mm9
chromosome 12.
test.design: A data frame containing the 100 above genes categorized according to expression and strand.
test.exons: A GRangesList
containing
the exons of the 100 above genes for use with
recoup
RNA-Seq mode.
data.frame
and list
objects whose format
is accepted by recoup.
Panagiotis Moulos
Personal communication with the Talianids lab at BSRC 'Alexander Fleming'. Unpublished data.
These functions are provided for compatibility with older versions of ‘recoup’ only, and will be defunct at the next release.
The following functions are defunct and will be made defunct; use the replacement indicated below:
coverageRnaRef: coverageRef
These functions are provided for compatibility with older versions of ‘recoup’ only, and will be defunct at the next release.
The following functions are deprecated and will be made defunct; use the replacement indicated below:
buildAnnotationStore: buildAnnotationDatabase
This function takes as input argument and output object
from recoup
and creates the average genomic
curve correlations according to the options present in the
input object. It can be used with saved recoup outputs
so as to recreate the plots without re-reading BAM/BED
files and re-calculating coverages.
recoupCorrelation(recoupObj, samples = NULL, rc = NULL)
recoupCorrelation(recoupObj, samples = NULL, rc = NULL)
recoupObj |
a list object created from
|
samples |
which samples to plot. Either numeric
(denoting the sample indices) or sample ids. Defaults
to |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
The function returns the recoupObj
with the
slot for the correlation plot filled. See also the
recoupPlot
, getr
and setr
function.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(profile=FALSE,correlation=TRUE, heatmap=FALSE), rc=0.1 ) # Plot coverage correlations recoupCorrelation(test.tss,rc=0.1)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(profile=FALSE,correlation=TRUE, heatmap=FALSE), rc=0.1 ) # Plot coverage correlations recoupCorrelation(test.tss,rc=0.1)
This function takes as input argument and output object
from recoup
and creates heatmaps
depicting genomic coverages using the
ComplexHeatmap
package and the options present
in the input object. It can be used with saved recoup
outputs so as to recreate the plots without re-reading
BAM/BED files and re-calculating coverages.
recoupHeatmap(recoupObj, samples = NULL, rc = NULL)
recoupHeatmap(recoupObj, samples = NULL, rc = NULL)
recoupObj |
a list object created from
|
samples |
which samples to plot. Either numeric
(denoting the sample indices) or sample ids. Defaults
to |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
The function returns the recoupObj
with the
slot for the profile plot filled. See also the
recoupPlot
, getr
and setr
function.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(profile=FALSE,heatmap=FALSE), rc=0.1 ) # Plot coverage profiles recoupHeatmap(test.tss,rc=0.1)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(profile=FALSE,heatmap=FALSE), rc=0.1 ) # Plot coverage profiles recoupHeatmap(test.tss,rc=0.1)
This function takes as input argument an output object
from recoup
and plots the ggplot2
and ComplexHeatmap
objects stored there.
recoupPlot(recoupObj, what = c("profile", "heatmap", "correlation"), device = c("x11", "png", "jpg", "tiff", "bmp", "pdf", "ps"), outputDir = ".", outputBase = paste(vapply(recoupObj, function(x) return(x$data$id), character(1)), sep = "_"), mainh = 1, ...)
recoupPlot(recoupObj, what = c("profile", "heatmap", "correlation"), device = c("x11", "png", "jpg", "tiff", "bmp", "pdf", "ps"), outputDir = ".", outputBase = paste(vapply(recoupObj, function(x) return(x$data$id), character(1)), sep = "_"), mainh = 1, ...)
recoupObj |
a list object created from
|
what |
one or more of |
device |
a valid R graphics device. See the
|
outputDir |
a valid directory when device is not
|
outputBase |
a valid file name to be used as
basis when device is not |
mainh |
the reference heatmap for ordering
operations. Normally, calculated in
|
... |
further parameters passed either to
|
This function does not returns anything, just plots the
recoup
plots.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Plot coverage profiles recoupPlot(test.tss)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Plot coverage profiles recoupPlot(test.tss)
This function takes as input argument and output object
from recoup
and creates the average genomic
curve profiles according to the options present in the
input object. It can be used with saved recoup outputs
so as to recreate the plots without re-reading BAM/BED
files and re-calculating coverages.
recoupProfile(recoupObj, samples = NULL, rc = NULL)
recoupProfile(recoupObj, samples = NULL, rc = NULL)
recoupObj |
a list object created from
|
samples |
which samples to plot. Either numeric
(denoting the sample indices) or sample ids. Defaults
to |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
The function returns the recoupObj
with the
slot for the profile plot filled. See also the
recoupPlot
, getr
and setr
function.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(profile=FALSE,heatmap=FALSE), rc=0.1 ) # Plot coverage profiles recoupProfile(test.tss,rc=0.1)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(profile=FALSE,heatmap=FALSE), rc=0.1 ) # Plot coverage profiles recoupProfile(test.tss,rc=0.1)
This function clears members of the recoup
output object that must be cleared in order to apply
a new set of parameters without completely rerunning
recoup
.
removeData(input, type = c("ranges", "coverage", "profile", "reference"))
removeData(input, type = c("ranges", "coverage", "profile", "reference"))
input |
a list object created from
|
type |
one of |
This function clears members of the recoup
output object which typically take some time to be
calculated but it is necessary to clean them if the
user wants to change input parameters that cause
recalculations of these members. For example, if the
user changes the binParams
, the profile matrices
("profile"
object member) have to be
recalculated.
type
controls what data will be removed.
"ranges"
removes the reads imported from BAM/BED
files. This is useful when for example the normalization
method is changed. "coverage"
removes the
calculated coverages over the reference genomic regions.
This is required again when the normalization method
changes. "profile"
removes the profile matrices
derived from coverages. This is required for example
when the binParams
main argument changes. Finally,
"reference"
removes the genomic loci over which
the calculations are taking place. This is required when
the genome
, refdb
or version
main
arguments change.
A list which is normally the output of
recoup
without the members that have
been removed from it.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Before removing names(test.input) # Remove a member test.input <- removeData(test.input,"ranges") # Removed names(test.input)
# Load some data data("recoup_test_data",package="recoup") # Before removing names(test.input) # Remove a member test.input <- removeData(test.input,"ranges") # Removed names(test.input)
This function fills the profile
field in the
main input
argument in recoup
function by calculating profile matrices using reads
per million (rpm) or reads per kb per million reads
(rpkm) over binned genomic areas of interest, instead
of genomic coverage signals. The profile matrices are
used for later plotting.
rpMatrix(input, mainRanges, flank, binParams, strandedParams = list(strand = NULL, ignoreStrand = TRUE), rc = NULL)
rpMatrix(input, mainRanges, flank, binParams, strandedParams = list(strand = NULL, ignoreStrand = TRUE), rc = NULL)
input |
an input list as in |
mainRanges |
the |
flank |
see the |
binParams |
see the |
strandedParams |
see the |
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
Regarding the calculation of rpm
and rpkm
values, the calculations slightly differ from the
default defintions of these measurements in the sense
that they are also corrected for the bin lengths so as
to acquire human-friendly values for plotting.
Note that the genomic ranges (BAM/BED files) must be
imported before using this function (as per the default
recoup
pipeline). We plan to support streaming
directly from BAM files in the future.
Same as input with the profile
fields filled.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Do some work testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) w <- width(testGenomeRanges) testGenomeRanges <- promoters(testGenomeRanges,upstream=2000,downstream=0) testGenomeRanges <- resize(testGenomeRanges,width=w+4000) test.input <- rpMatrix( test.input, mainRanges=testGenomeRanges, flank=c(2000,2000), binParams=list(flankBinSize=50,regionBinSize=150,binType="fixed", sumStat="mean",interpolation="spline"), rc=0.1 )
# Load some data data("recoup_test_data",package="recoup") # Do some work testGenomeRanges <- makeGRangesFromDataFrame(df=test.genome, keep.extra.columns=TRUE) w <- width(testGenomeRanges) testGenomeRanges <- promoters(testGenomeRanges,upstream=2000,downstream=0) testGenomeRanges <- resize(testGenomeRanges,width=w+4000) test.input <- rpMatrix( test.input, mainRanges=testGenomeRanges, flank=c(2000,2000), binParams=list(flankBinSize=50,regionBinSize=150,binType="fixed", sumStat="mean",interpolation="spline"), rc=0.1 )
The getr
and setr
functions are used to
get several reusable/changeable objects of
recoup
or replcace them (e.g. when the
user wishes to change some ggplot
or
ComplexHeatmap
parameters manually in a plot, or
change the heatmap profile ordering mode).
getr(obj, key = c("design", "profile", "heatmap", "correlation", "orderBy", "kmParams", "plotParams")) setr(obj, key, value = NULL)
getr(obj, key = c("design", "profile", "heatmap", "correlation", "orderBy", "kmParams", "plotParams")) setr(obj, key, value = NULL)
obj |
a list object created from
|
key |
one of |
value |
a valid |
For getr
, the object asked to be retrieved. For
setr
, the obj
with the respective slots
filled or replaced with value
.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Plot coverage profiles # Get the curve profile plot pp <- getr(test.tss,"profile") # Change some ggplot parameter pp <- pp + theme(axis.title.x=element_text(size=14)) # Store the new plot test.tss <- setr(test.tss,"profile",pp) ## or even better # test.tss <- setr(test.tss,list(profile=pp))
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Plot coverage profiles # Get the curve profile plot pp <- getr(test.tss,"profile") # Change some ggplot parameter pp <- pp + theme(axis.title.x=element_text(size=14)) # Store the new plot test.tss <- setr(test.tss,"profile",pp) ## or even better # test.tss <- setr(test.tss,list(profile=pp))
This function takes as input argument an output object
from recoup
and subsets it according to the
inputs i,j,k
. The attached plots may or may not be
recalculated. Other input parameters stores in
obj$callopts
are not changed apart from any
selector
option which is dropped. Note that when
slicing vertically (by j
), the $coverage
member of the input data (if present) is not
sliced, but remains as is. You can drop it using
removeData
as it is used to recalculate
profile matrices only if bin sizes are changed in a
recoup
call.
sliceObj(obj, i = NULL, j = NULL, k = NULL, dropPlots = FALSE, rc = NULL)
sliceObj(obj, i = NULL, j = NULL, k = NULL, dropPlots = FALSE, rc = NULL)
obj |
a list object created from
|
i |
vector of numeric or character indices,
corresponding to the index or rownames or names of
reference genomic regions. The |
j |
vector of numeric indices corresponding to the profile matrix vertical index (or base pair position or bin of base pairs) so as to subset the profile. The function will do its best to "guess" new plotting x-axis labels. |
k |
vector of numeric or character indices corresponding to sample index or sample names. These will be returned. |
dropPlots |
if profile and/or heatmap plots are
attached to the input object, they will be recalculated
if |
.
rc |
fraction (0-1) of cores to use in a multicore
system. It defaults to |
A recoup
list object, susbet according to
i, j
.
Panagiotis Moulos
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Plot coverage profiles o <- sliceObj(test.tss,i=1:10,k=1)
# Load some data data("recoup_test_data",package="recoup") # Calculate coverages test.tss <- recoup( test.input, design=NULL, region="tss", type="chipseq", genome=test.genome, flank=c(2000,2000), selector=NULL, plotParams=list(plot=FALSE,profile=TRUE, heatmap=TRUE,device="x11"), rc=0.1 ) # Plot coverage profiles o <- sliceObj(test.tss,i=1:10,k=1)