Title: | Methods for identifying small RNA loci from high-throughput sequencing data |
---|---|
Description: | High-throughput sequencing technologies allow the production of large volumes of short sequences, which can be aligned to the genome to create a set of matches to the genome. By looking for regions of the genome which to which there are high densities of matches, we can infer a segmentation of the genome into regions of biological significance. The methods in this package allow the simultaneous segmentation of data from multiple samples, taking into account replicate data, in order to create a consensus segmentation. This has obvious applications in a number of classes of sequencing experiments, particularly in the discovery of small RNA loci and novel mRNA transcriptome discovery. |
Authors: | Thomas J. Hardcastle [aut], Samuel Granjeaud [cre] |
Maintainer: | Samuel Granjeaud <[email protected]> |
License: | GPL-3 |
Version: | 2.41.1 |
Built: | 2024-11-30 05:21:33 UTC |
Source: | https://github.com/bioc/segmentSeq |
The segmentSeq package is intended to take multiple samples of high-throughput data (together with replicate information) and identify regions of the genome which have a (reproducibly) high density of tags aligning to them. The package was developed for use in identifying small RNA precursors from small RNA sequencing data, but may also be useful in some mRNA-Seq and chIP-Seq applications.
Package: | segmentSeq |
Type: | Package |
Version: | 0.0.2 |
Date: | 2010-01-20 |
License: | GPL-3 |
LazyLoad: | yes |
Depends: | baySeq, ShortRead |
To use the package, we construct an alignmentData
object
from sets of alignment files using either the readGeneric
function to read text files or the readBAM
function to
read from BAM format files.
We then use the processAD
function to identify all
potential subsegments of the data and the number of tags that align to
these subsegments. We then use either a heuristic or empirical Bayesian
approach to segment the genome into ‘loci’ and ‘null’ regions. We can then
acquire posterior likelihoods for each set of replicates which tell us
whether a region is likely to be a locus or a null in that replicate group.
The segmentation is designed to be usable by the
baySeq
package to allow
differential expression analyses to be carried out on the discovered loci.
The package (optionally) makes use of the 'snow' package for parallelisation of computationally intensive functions. This is highly recommended for large data sets.
See the vignette for more details.
Thomas J. Hardcastle
Maintainer: Thomas J. Hardcastle <[email protected]>
Hardcastle T.J., Kelly, K.A. and Balcombe D.C. (2011). Identifying small RNA loci from high-throughput sequencing data. In press.
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames) # Process the alignmentData object to produce a 'segData' object. sD <- processAD(alignData, gap = 100, cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames) # Process the alignmentData object to produce a 'segData' object. sD <- processAD(alignData, gap = 100, cl = NULL)
The alignmentClass
class records information about
a set of alignments of high-throughput sequencing data to a
genome.
alignments
:Object of class "GRanges"
.
Stores information about the alignments. See Details.
libnames
:Object of class "character"
. The
names of the libraries for which alignment data exists.
replicates
:Object of class "factor"
.
Replicate information for each of the libraries. See Details.
The alignments
slot is a GRanges
object defining the location of aligned
objects to a reference genome.
The replicates
slot is a vector of factors such that the ith
sample is a replicate of the jth sample if and only if @replicates[i] ==
@replicates[j]
.
The libnames
slot is a vector defining the names of the
libraries described by the object.
signature(x = "alignmentClass")
: ...
signature(x = "alignmentClass")
: ...
signature(.Object = "alignmentClass")
: ...
signature(object = "alignmentClass")
: ...
Thomas J. Hardcastle
The alignmentData
class inherits from the
alignmentClass
class and records information about
a set of alignments of high-throughput sequencing data to a
genome. Details include the alignments themselves, the
chromosomes of the genome to which the data are aligned, and
counts of the aligned tags from each of the libraries from which the
data come.
Objects can be created by calls of the form new("alignmentData",
...)
, but more usually by using one of readBAM
or
readGeneric
functions to generate the object from a set of
alignment files.
alignments
:Object of class "GRanges"
.
Stores information about the alignments. See Details.
replicates
:Object of class "factor"
.
Replicate information for each of the libraries. See Details.
data
:Object of class "matrix"
. For each
alignment described in the alignments
slot, contains the
number of times the alignment is seen in each sample.
libnames
:Object of class "character"
. The
names of the libraries for which alignment data exists.
libsizes
:Object of class "numeric"
. The
library sizes (see Details) for each of the libraries.
The alignments
slot is the key element of this class. This is a
GRanges
object that, in addition to the usual elements defining
the location of aligned objects to a reference genome, also describes
the values ‘tag’, giving the sequence of the tag aligning to the
location, ‘matches’, indicating in how many places that tag matches to
the genome, ‘chunk’, an identifier for the sets of tags that align
close enough together to form a potential locus, and ‘chunkDup’,
indicating whether that tag matches to multiple places within the chunk.
The library sizes, defined in the libsizes
slot, provide some
scaling factor for the observed number of counts of a tag in different
samples.
The replicates
slot is a vector of factors such that the ith
sample is a replicate of the jth sample if and only if @replicates[i] ==
@replicates[j]
.
signature(x = "alignmentData")
: ...
signature(x = "alignmentData")
: ...
signature(.Object = "alignmentData")
: ...
signature(object = "alignmentData")
: ...
Thomas J. Hardcastle
alignmentClass
, the class from which
'alignmentData'
inherits.
readGeneric
, which will produce a 'alignmentData'
object from appropriately formatted tab-delimited files.
readBAM
, which will produce a 'alignmentData'
object from BAM files.
processAD
, which will convert an 'alignmentData'
object into a 'segData' object for segmentation.
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames)
The alignmentMeth
class inherits from the
alignmentClass
class and records information about
a set of alignments of high-throughput sequencing data to a
genome. Details include the alignments themselves, the
chromosomes of the genome to which the data are aligned, and
counts of the aligned tags from each of the libraries from which the
data come.
Objects can be created by calls of the form new("alignmentMeth",
...)
, but more usually by using one of readBAM
or
readGeneric
functions to generate the object from a set of
alignment files.
alignments
:Object of class "GRanges"
.
Defines the location of sequenced cytosines, amongst other data. See Details.
libnames
:Object of class "character"
. The
names of the libraries for which alignment data exists.
replicates
:Object of class "factor"
.
Replicate information for each of the libraries. See Details.
Cs
:Object of class "matrix"
. For each
cytosine described in the alignments
slot, contains the
number of times the cytosine is sequenced as a 'C', and is thus
methylated.
Ts
:Object of class "matrix"
. For each
cytosine described in the alignments
slot, contains the
number of times the cytosine is sequenced as a 'T', and is thus
unmethylated.
nonconversion
:Object of class "numeric"
. The
(estimated) nonconversion rate (see Details) for each of the libraries.
The nonconversion slot is an estimate of the rate (for each library) at which an unmethylated cytosine has failed to be converted by sodium bisulphite treatment into thymine, and is thus recorded (incorrectly) as methylated. In some cases, this can be estimated from considering observed methylation rates on regions known to be unmethylated (e.g., chloroplasts) or by introducing unmethylated control sequences.
The replicates
slot is a vector of factors such that the ith
sample is a replicate of the jth sample if and only if @replicates[i] ==
@replicates[j]
.
signature(x = "alignmentMeth")
: ...
signature(x = "alignmentMeth")
: ...
signature(.Object = "alignmentMeth")
: ...
signature(object = "alignmentMeth")
: ...
Thomas J. Hardcastle
alignmentClass
, the class from which
'alignmentMeth'
inherits.
readMeths
, which will produce a 'alignmentMeth'
object from files generated by the YAMA aligner.
processAD
, which will convert an 'alignmentMeth'
object into a 'segData' object for segmentation.
Given an alignmentData or alignmentMeth object and a set of coordinates, plots the average distribution of coverage/methylation over those coordinates. The plotted distributions can be split up into different sample groups by the user.
averageProfiles(mD, samples, coordinates, cuts, maxcuts = 200, bw = 5000, surrounding = 0, add = FALSE, col, ylim, meanOnly = TRUE, plot = TRUE, ...) plotAverageProfile(position, profiles, col, surrounding, ylim, add = FALSE, meanOnly = TRUE, legend = TRUE, titles, ...)
averageProfiles(mD, samples, coordinates, cuts, maxcuts = 200, bw = 5000, surrounding = 0, add = FALSE, col, ylim, meanOnly = TRUE, plot = TRUE, ...) plotAverageProfile(position, profiles, col, surrounding, ylim, add = FALSE, meanOnly = TRUE, legend = TRUE, titles, ...)
mD |
The |
samples |
A factor or list defining the different groups of samples for which to plot different distributions. If a list, each member should consist of integer values defining the column numbers of the 'mD' object. If missing, will use the mD@replicates value. |
coordinates |
A |
cuts |
Optionally, the number of subdivisions to cut the coordinates in when calculating the average coverage/methylation density. |
maxcuts |
The maximum number of subdivisions permitted when calculating the average coverage/methylation density. |
bw |
If 'cuts' is missing, this factor divides the product of the length of the 'coordinates' object and the median width of the coordinates to defines the number of cuts (minimum twenty cuts). |
surrounding |
If non-zero, the size of the region up- and down-stream of the given coordinates for which average coverage/methylation should also be calculated. |
add |
If TRUE, the plotted distribution will be added to the current plot. |
col |
If given, a vector of colours for each of the groups defined in 'samples'. Defaults to 'rainbow(length(samples))'. |
ylim |
See ‘ylim’ option for plot. If missing, will be calculated from data. |
meanOnly |
If TRUE, the mean methylation profile for each member of the 'samples' parameter is plotted on a single graph. If FALSE, every 5th percentile is plotted for each member of the sample parameters, each on a separate graph. |
plot |
Should the profile be plotted? Defaults to TRUE. |
position |
A vector describing the position of each point to be plotted. Take from the ‘$position’ element in the list object returned by ‘averageProfiles’. |
profiles |
A matrix describing the profiles to be plotted. Take from the '$profiles element in the list object returned by ‘averageProfiles’. |
legend |
If TRUE, a legend describing the samples is included on the plot. |
titles |
If given, and ‘meanOnly = FALSE’, a vector of titles for the quantile plots. |
... |
Additional arguments to be passed to the 'plotAverageProfile' function, and hence to the 'plot' or 'lines' methods. |
Invisibly, a list containing the coordinates of the lines plotted.
Thomas J. Hardcastle
This function acquires empirical distributions of sequence tag density from an already existing (or heuristically defined) segment map. It uses these to classify potential segments as either segments or nulls in order to define a new (and improved) segment map.
classifySeg(sD, cD, aD, lociCutoff = 0.9, nullCutoff = 0.9, subRegion = NULL, getLikes = TRUE, lR = FALSE, samplesize = 1e5, largeness = 1e8, tempDir = NULL, recoverFromTemp = FALSE, cl)
classifySeg(sD, cD, aD, lociCutoff = 0.9, nullCutoff = 0.9, subRegion = NULL, getLikes = TRUE, lR = FALSE, samplesize = 1e5, largeness = 1e8, tempDir = NULL, recoverFromTemp = FALSE, cl)
sD |
A |
cD |
A |
aD |
An |
lociCutoff |
The minimum posterior likelihood of being a locus for a region to be treated as a locus. |
nullCutoff |
The minimum posterior likelihood of being a null for a region to be treated as a null. |
subRegion |
A |
getLikes |
Should posterior likelihoods for the new segmented genome (loci and nulls) be assessed? |
lR |
If TRUE, locus and null calls are made on the basis of likelihood ratios rather than posterior likelihoods. Not recommended. |
samplesize |
The sample size to be used when estimating the prior
distribution of the data with the
|
largeness |
The maximum size for a split analysis. |
tempDir |
A directory for storing temporary files produced during the segmentation. |
recoverFromTemp |
If TRUE, will attempt to recover the position saved in 'tempDir'. Defaults to FALSE. See Details. |
cl |
A SNOW cluster object, or NULL. See Details. |
This function acquires empirical distributions of sequence tag density
from the segmentation map defined by the ‘cD’ argument (if ‘cD’ is NULL
or missing, then the heuristicSeg
function is used to
define a segmentation map. It uses these empirical distributions to
acquire posterior likelihoods on each potential segment being either a
true segment or a null region. These posterior likelihoods are then used
to define the segment map.
If recoverFromTemp = TRUE
, the function will attempt to recover
a crashed position from the temporary files in tempDir
. At
present, the function assumes you know what you are doing, and will
perform no checking that these files are suitable for the specified
recovery. Use with caution.
A lociData
object, containing the
segmentation map discovered.
Thomas J. Hardcastle
Hardcastle T.J., Kelly, K.A. and Balcombe D.C. (2011). Identifying small RNA loci from high-throughput sequencing data. In press.
heuristicSeg
a fast heuristic alternative to this function.
plotGenome
, a function for plotting the alignment of
tags to the genome (together with the segments defined by this
function).
baySeq
, a package for discovering
differential expression in lociData
objects.
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL) # Use the classifySeg function on the segData object to produce a lociData object. pS <- classifySeg(aD = alignData, sD = sD, subRegion = data.frame(chr = ">Chr1", start = 1, end = 1e5), getLikes = TRUE, cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL) # Use the classifySeg function on the segData object to produce a lociData object. pS <- classifySeg(aD = alignData, sD = sD, subRegion = data.frame(chr = ">Chr1", start = 1, end = 1e5), getLikes = TRUE, cl = NULL)
This function identifies chunks of data within a set of aligned reads by looking for gaps within the alignments; regions where no reads align. If we assume that a locus should not contain a gap of sufficient length, then we can separate the analysis of the data into chunks defined by these gaps, reducing the complexity of the problem of segmentation.
findChunks(alignments, gap, checkDuplication = TRUE, justChunks = FALSE)
findChunks(alignments, gap, checkDuplication = TRUE, justChunks = FALSE)
alignments |
A |
gap |
The minimum length of a gap across which it is assumed that no locus can exist. |
checkDuplication |
Should we check whether or not reads are duplicated within a chunk? Defaults to TRUE. |
justChunks |
If TRUE, returns a vector of the chunks rather than the GRanges object with chunks attached. Defaults to FALSE. |
This function is called by the readGeneric
and
readBAM
functions but may usefully be called again if
filtering of an linkS4class{alignmentData}
object has altered
the data present, or to increase the computational effort required for
subsequent analysis. The lower the ‘gap’ parameter used to define the
chunks, the faster (though potentially less accurate) any subsequent
analyses will be.
A modified GRanges
object, now containing columns ‘chunk’
and ‘chunkDup’ (if 'checkDuplication' is TRUE), identifying the chunk
to which the alignment belongs and whether the alignment of the tag is
duplicated within the chunk respectively.
Thomas J. Hardcastle
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Read the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Filter the data on number of matches of each tag to the genome alignData <- alignData[values(alignData@alignments)$matches < 5,] # Redefine the chunking structure of the data. alignData <- findChunks(alignData@alignments, gap = 100)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Read the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Filter the data on number of matches of each tag to the genome alignData <- alignData[values(alignData@alignments)$matches < 5,] # Redefine the chunking structure of the data. alignData <- findChunks(alignData@alignments, gap = 100)
A function for extracting count data from an
alignmentData
object given a set of segments defined on the genome.
getCounts(segments, aD, preFiltered = FALSE, adjustMultireads = TRUE, useChunk = FALSE, cl)
getCounts(segments, aD, preFiltered = FALSE, adjustMultireads = TRUE, useChunk = FALSE, cl)
segments |
A |
aD |
An |
preFiltered |
The function internally cleans the data; however, this may not be needed and omitting these steps may save computational time. See Details. |
adjustMultireads |
If working with methylation data, this option toggles an adjustment for reads that align to multiple locations on the genome. Defaults to TRUE. |
useChunk |
If all segments are within defined ‘chunks’ of the alignmentData object, speed increases if this is set to TRUE. Otherwise, counts may be inaccurate. Defaults to FALSE. |
cl |
A SNOW cluster object, or NULL. See Details. |
The function extracts count data from alignmentData
object
'aD' given a set of segments. The non-trivial aspect of this function
is that at a segment which contains a tag that matches to multiple
places in that segment (and thus appears multiple times in the
alignmentData
object) should count it only once.
If preFiltered = FALSE
then the function allows for missing (NA)
data in the segments, unordered segments and duplicated segments. If
the segment list has no missing data, is already ordered, and contains
no duplications, then computational time can be saved by setting
preFiltered = TRUE
.
A cluster
object (package: snow) is recommended for
parallelisation of this function when using large data sets.
Passing NULL to this variable will cause the function to run in non-parallel mode.
In general, this function will probably not be accessed by the user as
the processAD
function includes a call to getCounts
as
part of the standard processing of an alignmentData
object into
a segData
object.
If ‘as.matrix’, a matrix, each column of which corresponds to a library in the
alignmentData
object ‘aD’ and each row to the
segment defined by the corresponding row in ‘segments’. Otherwise an
equivalent DataFrame
object.
Thomas J. Hardcastle
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Get count data for three arbitrarily chosen segments on chromosome 1. getCounts(segments = GRanges(seqnames = c(">Chr1"), IRanges(start = c(1,100,2000), end = c(40,3000,5000))), aD = alignData, cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Get count data for three arbitrarily chosen segments on chromosome 1. getCounts(segments = GRanges(seqnames = c(">Chr1"), IRanges(start = c(1,100,2000), end = c(40,3000,5000))), aD = alignData, cl = NULL)
This function identifies which of a set of genomic segments overlaps with another set of coordinates; either with partial overlap or with the segments completely contained within the coordinates. The function is used within the ‘segmentSeq’ package for various methods of constructing a segmentation map, but may also be useful in downstream analysis (e.g. annotation analyses).
getOverlaps(coordinates, segments, overlapType = "overlapping", whichOverlaps = TRUE, ignoreStrand = FALSE, cl)
getOverlaps(coordinates, segments, overlapType = "overlapping", whichOverlaps = TRUE, ignoreStrand = FALSE, cl)
coordinates |
A |
segments |
A |
overlapType |
Which kind of overlaps are being sought? Can be one of ‘overlapping’, ‘contains’ or ‘within’. See Details. |
whichOverlaps |
If TRUE, returns the ‘segments’ overlapping with the ‘coordinates’. If FALSE, returns a boolean vector specifying which of the ‘coordinates’ overlap with the ‘segments’. |
ignoreStrand |
If TRUE, a segment may overlap a set of coordinates regardless of the strand the two are on. If FALSE, overlaps will only be identified if both are on the same strand (or if either has no strand specified). Defaults to FALSE. |
cl |
A SNOW cluster object, or NULL. See Details. |
If overlapType = "overlapping"
then any overlap between the
‘coordinates’ and the ‘segments’ is sufficient. If overlapType =
"contains"
then a region defined in ‘coordinates’ must completely
contain at least one of the ‘segments’ to count as an overlap. If
overlapType = "within"
then a region defined in ‘coordinates’
must be completely contained by at least one of the ‘segments’ to
count as an overlap.
A 'cluster'
object (package: snow) may usefully be used for
parallelisation of this function when examining large data sets.
Passing NULL to this variable will cause the function to run in
non-parallel mode.
If whichOverlaps = TRUE
, then the function returns a list object
with length equal to the number of rows of the ‘coordinates’
argument. The ‘i’th member of the list will be a numeric vector giving
the row numbers of the ‘segments’ object which overlap with the
‘i’th row of the ‘coordinates’ object, or NA if no segments
overlap with this coordinate region.
If whichOverlaps = FALSE
, then the function returns a boolean
vector with length equal to the number of rows of the ‘coordinates’
argument, indicating which of the regions defined in coordinates have
the correct type of overlap with the ‘segments’.
Thomas J. Hardcastle
# Define the chromosome lengths for the genome of interest. chrlens <- c(2e6, 1e6) # Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, chrs = c(">Chr1", ">Chr2"), chrlens = chrlens, gap = 100) # Find which tags overlap with an arbitrary set of coordinates. getOverlaps(coordinates = GRanges(seqnames = c(">Chr1"), IRanges(start = c(1,100,2000), end = c(40,3000,5000))), segments = alignData@alignments, overlapType = "overlapping", whichOverlaps = TRUE, cl = NULL)
# Define the chromosome lengths for the genome of interest. chrlens <- c(2e6, 1e6) # Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, chrs = c(">Chr1", ">Chr2"), chrlens = chrlens, gap = 100) # Find which tags overlap with an arbitrary set of coordinates. getOverlaps(coordinates = GRanges(seqnames = c(">Chr1"), IRanges(start = c(1,100,2000), end = c(40,3000,5000))), segments = alignData@alignments, overlapType = "overlapping", whichOverlaps = TRUE, cl = NULL)
For certain types of differential behaviour, the likelihood of difference can be high even for regions for which the likelihood that the region represents a true locus is low in all replicate groups. This function corrects the likelihood of differential behaviour by the likelihood that the locus is expressed in at least one replicate group.
givenExpression(cD)
givenExpression(cD)
cD |
A lociData object with computed posteriors and locLikelihoods. |
A lociData object with updated posteriors.
Thomas J. Hardcastle
This method identifies by heuristic methods a set of loci from a
segData
or segMeth
object. It does this by identifying
within replicate groups regions of the genome that satisfy the criteria
for being a locus and have no region within them that satisfies the
criteria for being a null. These criteria can be defined by the user or
inferred from the data.
heuristicSeg(sD, aD, gap = 50, RKPM = 1000, prop, coverage = 1, locCutoff = 0.9, nullCutoff = 0.9, subRegion = NULL, largeness = 1e8, getLikes = TRUE, verbose = TRUE, tempDir = NULL, cl = NULL, recoverFromTemp = FALSE, trimMeth = FALSE)
heuristicSeg(sD, aD, gap = 50, RKPM = 1000, prop, coverage = 1, locCutoff = 0.9, nullCutoff = 0.9, subRegion = NULL, largeness = 1e8, getLikes = TRUE, verbose = TRUE, tempDir = NULL, cl = NULL, recoverFromTemp = FALSE, trimMeth = FALSE)
aD |
An |
sD |
|
gap |
What is the minimum length of a null region? |
RKPM |
For analysis of a |
prop |
For analysis of a |
coverage |
For analysis of a |
locCutoff |
For analysis of a |
nullCutoff |
For analysis of a |
subRegion |
A |
largeness |
The maximum size for a split analysis. |
getLikes |
Should posterior likelihoods for the new segmented genome (loci and nulls) be assessed? |
verbose |
Should the function be verbose? Defaults to TRUE. |
tempDir |
A directory for storing temporary files produced during the segmentation. |
cl |
A SNOW cluster object, or NULL. Defaults to NULL. See Details. |
recoverFromTemp |
If TRUE, will attempt to recover the position saved in 'tempDir'. Defaults to FALSE. See Details. |
trimMeth |
Should putative methylation regions be trimmed? Defaults to FALSE; see Details. |
A 'cluster'
object (package: snow) may be used for
parallelisation of parts of this function when examining large data sets.
Passing NULL to this variable will cause the function to run in
non-parallel mode.
If recoverFromTemp = TRUE
, the function will attempt to recover
a crashed position from the temporary files in tempDir
. At
present, the function assumes you know what you are doing, and will
perform no checking that these files are suitable for the specified
recovery. Use with caution.
The prop
variable can be used to set the proportion of
methylation required to identify a locus by giving a numerical value
between 0-1. It can also be determined automatically (see
thresholdFinder).
Due to the way that methylation loci are identified, it is possible that the cytosines at the edges of methylation loci have limited evidence for methylation. The 'trimMeth' option trims cytosines at the edge of predicted methylation loci that have less than 50% likelihood of being above the required threshold.
A lociData
object, containing count
information on all the segments discovered.
Thomas J. Hardcastle
Hardcastle T.J., Kelly, K.A. and Balcombe D.C. (2011). Identifying small RNA loci from high-throughput sequencing data. In press.
classifySeg
, an alternative approach to this problem using an empirical
Bayes approach to classify segments.
thresholdFinder
, a function for determining a suitable
threshold on methylation by examining the data.
plotGenome
, a function for plotting the alignment of
tags to the genome (together with the segments defined by this
function).
baySeq
, a package for discovering
differential expression in lociData
objects.
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL) # Use the segData object to produce a segmentation of the genome. segD <- heuristicSeg(sD = sD, aD = alignData, prop = 0.2, subRegion = data.frame(chr = ">Chr1", start = 1, end = 1e5), cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL) # Use the segData object to produce a segmentation of the genome. segD <- heuristicSeg(sD = sD, aD = alignData, prop = 0.2, subRegion = data.frame(chr = ">Chr1", start = 1, end = 1e5), cl = NULL)
This is the preprocessed (to save computational time) 'lociData' object described in the vignette 'segmentSeq: methods for detecting methylation loci and differential methylation'.
data(hSL)
data(hSL)
A 'lociData' object.
See vignette.
The lociData
class is based on the countData
class defined in the ‘baySeq’ package, but includes a ‘coordinates’
slot giving the coordinates of genomic loci and a ‘locLikelihoods’
slot which contains the estimated likelihoods that each annotated
region is a locus in each replicate group and a coordinates structure
giving the locations of the loci.
locLikelihoods
:Object of class "matrix"
describing estimated likelihoods that each region defined in
‘coordinates’ is a locus in each replicate group.
coordinates
:Object of class "GRanges"
defining
the coordinates of the genomic loci.
data
:Object of class "matrix"
defining count
data for each locus defined in ‘coordinates’
replicates
:Object of class "factor"
defining
the replicate structure of the data.
groups
:Object of class "list"
defing the group
(model) structure of the data (see the
baySeq package
)
.
annotation
:Object of class "data.frame"
giving
any additional annotation information for each locus.
priorType
:Object of class "character"
describing the type of prior information available in slot 'priors'
.
priors
:Object of class "list"
defing the prior
parameter information. Calculated by
the baySeq package.
posteriors
:Object of class "matrix"
giving the
estimated posterior likelihoods for each replicate group. Calculated by
the baySeq package.
nullPosts
:Object of class "numeric"
which, if
calculated, defines the posterior likelihoods for the data having no
true expression of any kind. Calculated by
the baySeq package.
estProps
:Object of class "numeric"
giving the
estimated proportion of tags belonnging to each group. Calculated by
the baySeq package.
cellObservables
A list object containing arrays of identical dimension to that in the ‘@data’ slot. These arrays define some observed characteristic of the data (e.g., GC content of sRNAs) which may be used in analysis.
rowObservables
A list object containing arrays with first dimension identical to the number of rows in the ‘@data’ slot. These arrays define some observed characteristic of the data (e.g., genomic length of the region) which may be used in analysis.
sampleObservables
A list object containing arrays with first dimension identical to the number of columns of the ‘@data’ slot. These arrays define some observed characteristic of the data (e.g., library scaling factor) which may be used in analysis.
Class "countData"
, directly.
Methods ‘new’, ‘dim’, ‘[’, ‘c’ and ‘show’ have been defined for this class.
Thomas J. Hardcastle
An empirical Bayesian approach that takes a segmentation map and uses this to bootstrap posterior likelihoods on each region being a locus for each replicate group.
lociLikelihoods(cD, aD, newCounts = FALSE, bootStraps = 3, inferNulls = TRUE, nasZero = FALSE, usePosteriors = TRUE, tail = 0.1, subset = NULL, cl)
lociLikelihoods(cD, aD, newCounts = FALSE, bootStraps = 3, inferNulls = TRUE, nasZero = FALSE, usePosteriors = TRUE, tail = 0.1, subset = NULL, cl)
cD |
A |
aD |
An |
newCounts |
Should new counts be evaluated for the segmentation map in ‘cD’ before calculating loci likelihoods? Defaults to FALSE |
bootStraps |
What level of bootstrapping should be carried out on
the inference of posterior likelihoods? See the baySeq function
|
inferNulls |
Should null regions be inferred from the gaps between segments defined by the ‘cD’ object? |
nasZero |
If FALSE, any locus with a posterior likelihood ‘NA’ in the existing segmentation map is treated as a null region for the first bootstrap; If TRUE, it is ignored for the first bootstrap. |
usePosteriors |
If TRUE, the function uses the existing likelihoods to weight the prior estimation of parameters. Defaults to TRUE. |
tail |
The cutoff for the tail of the distribution to be used in
pre-calculating data for methylation analysis. See
|
subset |
A subset of the data on which to calculate the likelihoods. |
cl |
A SNOW cluster object, or NULL. See Details. |
A 'cluster'
object (package: snow) may be used for
parallelisation of this function when examining large data sets.
Passing NULL to this variable will cause the function to run in non-parallel mode.
A lociData
object.
Thomas J. Hardcastle
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL) # Use the segData object to produce a segmentation of the genome, but # without evaluating posterior likelihoods. segD <- heuristicSeg(sD = sD, aD = alignData, subRegion = data.frame(chr= ">Chr1", start = 1, end = 1e5), getLikes = FALSE, cl = NULL) # Use the lociData function to evaluate the posterior likelihoods directly. lociData <- lociLikelihoods(segD, aD = alignData, bootStraps = 5, inferNulls = TRUE, cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL) # Use the segData object to produce a segmentation of the genome, but # without evaluating posterior likelihoods. segD <- heuristicSeg(sD = sD, aD = alignData, subRegion = data.frame(chr= ">Chr1", start = 1, end = 1e5), getLikes = FALSE, cl = NULL) # Use the lociData function to evaluate the posterior likelihoods directly. lociData <- lociLikelihoods(segD, aD = alignData, bootStraps = 5, inferNulls = TRUE, cl = NULL)
Within a region of cytosine methylation, there may be some cytosines which show no evidence of methylation. The presence of these cytosines may lead to the region being split into multiple methylation loci. This function merges neighbouring loci if the pattern of expression is the same in each locus, and if they are not separated by too great a genomic distance.
mergeMethSegs(segs, aD, gap, cl)
mergeMethSegs(segs, aD, gap, cl)
segs |
A |
aD |
An |
gap |
The maximum gap below which neighbouring loci may be merged. |
cl |
A cluster object, or NULL. |
An object of type methData
.
Thomas J. Hardcastle
The methData
class is based on the countData
class defined in the ‘baySeq’ package, but includes a ‘coordinates’
slot giving the coordinates of genomic loci and a ‘locLikelihoods’
slot which contains the estimated likelihoods that each annotated
region is a locus in each replicate group and a coordinates structure
giving the locations of the loci.
locLikelihoods
:Object of class "matrix"
describing estimated likelihoods that each region defined in
‘coordinates’ is a locus in each replicate group.
coordinates
:Object of class "GRanges"
defining
the coordinates of the genomic loci.
data
:Object of class "matrix"
defining the
number of methylated cytosines observed for each locus defined in
‘coordinates’
data
:Object of class "matrix"
defining the
number of un-methylated cytosines observed for each locus defined in
‘coordinates’
replicates
:Object of class "factor"
defining
the replicate structure of the data.
groups
:Object of class "list"
defing the group
(model) structure of the data (see the
baySeq package
)
.
annotation
:Object of class "data.frame"
giving
any additional annotation information for each locus.
priorType
:Object of class "character"
describing the type of prior information available in slot 'priors'
.
priors
:Object of class "list"
defing the prior
parameter information. Calculated by
the baySeq package.
posteriors
:Object of class "matrix"
giving the
estimated posterior likelihoods for each replicate group. Calculated by
the baySeq package.
nullPosts
:Object of class "numeric"
which, if
calculated, defines the posterior likelihoods for the data having no
true expression of any kind. Calculated by
the baySeq package.
estProps
:Object of class "numeric"
giving the
estimated proportion of tags belonnging to each group. Calculated by
the baySeq package.
cellObservables
A list object containing arrays of identical dimension to that in the ‘@data’ slot. These arrays define some observed characteristic of the data (e.g., GC content of sRNAs) which may be used in analysis.
rowObservables
A list object containing arrays with first dimension identical to the number of rows in the ‘@data’ slot. These arrays define some observed characteristic of the data (e.g., genomic length of the region) which may be used in analysis.
sampleObservables
A list object containing arrays with first dimension identical to the number of columns of the ‘@data’ slot. These arrays define some observed characteristic of the data (e.g., nonconversion rates) which may be used in analysis.
Class "countData"
, directly.
Methods ‘new’, ‘dim’, ‘[’ and ‘show’ have been defined for this class.
Thomas J. Hardcastle
This function adjusts the observed cytosine methylated/unmethylated counts at each cytosine site based on the reported nonconversion rates for each samples.
normaliseNC(mD, nonconversion)
normaliseNC(mD, nonconversion)
mD |
Either an |
nonconversion |
A vector defining nonconversion rates for each sample, required if a
|
This function operates by estimating the expected number of unconverted cytosines at each site and subtracting this from the reported methylated cytosines and adding to the reported unmethylated cytosines. It should not be used on data that will be analysed in a way that accounts for nonconversion; e.g., using the ‘bbNCDist’ densityFunction object.
A modified version of the object supplied as ‘mD’.
Thomas J. Hardacastle
Hardcastle T.J. Discovery of methylation loci and analyses of differential methylation from replicated high-throughput sequencing data. bioRxiv (http://dx.doi.org/10.1101/021436)
datadir <- system.file("extdata", package = "segmentSeq") files <- c("short_18B_C24_C24_trim.fastq_CG_methCalls.gz", "short_Sample_17A_trimmed.fastq_CG_methCalls.gz", "short_13_C24_col_trim.fastq_CG_methCalls.gz", "short_Sample_28_trimmed.fastq_CG_methCalls.gz") mD <- readMeths(files = files, dir = datadir, libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"), nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134)) mD <- normaliseNC(mD)
datadir <- system.file("extdata", package = "segmentSeq") files <- c("short_18B_C24_C24_trim.fastq_CG_methCalls.gz", "short_Sample_17A_trimmed.fastq_CG_methCalls.gz", "short_13_C24_col_trim.fastq_CG_methCalls.gz", "short_Sample_28_trimmed.fastq_CG_methCalls.gz") mD <- readMeths(files = files, dir = datadir, libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"), nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134)) mD <- normaliseNC(mD)
Plots the data from an alignmentData
object for a given set of
samples. Can optionally include in the plot the annotation data from a
lociData
object containing segment information.
plotGenome(aD, loci, chr = 1, limits = c(0, 1e4), samples = NULL, plotType = "pileup", plotDuplicated = FALSE, density = 0, showNumber = TRUE, logScale = FALSE, cap = Inf, ...)
plotGenome(aD, loci, chr = 1, limits = c(0, 1e4), samples = NULL, plotType = "pileup", plotDuplicated = FALSE, density = 0, showNumber = TRUE, logScale = FALSE, cap = Inf, ...)
aD |
An |
loci |
A |
chr |
The name of the chromosome to be plotted. Should correspond to a
chromosome name in the |
limits |
The start and end point of the region to be plotted. |
samples |
The sample numbers of the samples to be plotted. If NULL, plots all samples. |
plotType |
The manner in which the plot is created. Currently only
|
plotDuplicated |
If TRUE, then any duplicated sequence tags (i.e., sequence tags that match to multiple places in the genome) in the ‘aD’ object will be plotted on a negative scale for each sample. Defaults to FALSE (recommended). |
density |
The density of the shading lines to be used in plotting each segment. |
showNumber |
Should the row number of each segment be shown? |
logScale |
Should a log scale be used for the number of sequence tags found at each base? |
cap |
A numeric value defining a cap on the maximum number of reads to be plotted at any one point. Useful if a large number of reads at one location prevent a clear signal being seen elsewhere. |
... |
Any additional graphical parameters for passing to |
Plotting function.
Thomas J. Hardcastle
alignmentData
, heuristicSeg
, classifySeg
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Plot the alignments to the genome on chromosome 1 between bases 1 and 10000 plotGenome(alignData, chr = ">Chr1", limits = c(1, 1e5))
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Plot the alignments to the genome on chromosome 1 between bases 1 and 10000 plotGenome(alignData, chr = ">Chr1", limits = c(1, 1e5))
This function takes an alignmentMeth
object and plots the
observed levels of methylation within some defined region on the
genome. If a methData
object is also supplied, loci of
methylation will also be shown.
plotMeth(aM, loci, chr, limits, samples, showNumber = TRUE, rgb = c(1, 0, 0), angle = 45, cap, add = FALSE)
plotMeth(aM, loci, chr, limits, samples, showNumber = TRUE, rgb = c(1, 0, 0), angle = 45, cap, add = FALSE)
aM |
An |
loci |
A |
chr |
The name of the chromosome to be plotted. Should correspond to a
chromosome name in the |
limits |
The start and end point of the region to be plotted. |
samples |
The sample numbers of the samples to be plotted. If NULL, plots all samples. |
showNumber |
Should the row number of each segment be shown? |
rgb |
The rgb code ( |
angle |
The angle at which loci are shaded (see |
cap |
Caps the maximum level of coverage shown on the plot; thus, if a base has been sequenced at a level greater than the cap, the data for that base will be shown as if it has a coverage of cap. |
add |
If TRUE, adds the plot of methylation level to the current plot. Defaults to FALSE. |
Plotting function.
Thomas J. Hardcastle
Plots the distribution of methylation (as defined in an
alignmentMeth
object upon the genome.
plotMethDistribution(meth, samples, bw = 1e-3, subtract, chrs, centromeres, add = FALSE, col, ylim = NULL, legend = TRUE, ...)
plotMethDistribution(meth, samples, bw = 1e-3, subtract, chrs, centromeres, add = FALSE, col, ylim = NULL, legend = TRUE, ...)
meth |
An object of class |
samples |
A numeric vector defining the columns of data in the ‘meth’ object from which to estimate proportions of methylation, or a list object containing numeric vectors if multiple distributions are to be derived from the ‘meth’ object, or a factor in which each level of the factor defines a set of columns for the ‘meth’ object. If missing, derived from the ‘@replicates’ slot of the ‘meth’ object. |
bw |
Gives the bandwidth of the density plots; analagous to the ‘bw’
parameter in |
subtract |
A numeric vector giving values to be subtracted from the density of methylation. See Details. |
chrs |
The names of the chromosomes for which a distribution should be plotted. If mising, derived from the ‘@alignments’ slot of the ‘meth’ object. |
centromeres |
If given, a numeric vector defining the position of the centromeres on the chromosomes. These will be then be plotted. |
add |
Should the distribution curve be added to an existing plot? Defaults to FALSE. |
col |
A vector of colours to be used to plot the distributions. If
missing, generated from |
ylim |
Limits on the y-axis. Defaults to NULL, in which case limits are automatically set. |
legend |
Should a legend be added to the plot? Defaults to TRUE. |
... |
Any additional parameters to be passed to |
The function returns the density of methylation calculated. This can be used in further plots as the 'subtract' parameter, which allows one methylation profile to be subtracted from another.
An object of class density
describing the plotted distribution.
Thomas J. Hardcastle
In order to discover segments of the genome with a high density of sequenced data, a ‘segData’ object must be produced. This is an object containing a set of potential segments, together with the counts for each sample in each potential segment.
processAD(aD, gap = 300, squeeze = 2, filterProp = 0.05, strandSplit = FALSE, verbose = TRUE, getCounts = FALSE, cl)
processAD(aD, gap = 300, squeeze = 2, filterProp = 0.05, strandSplit = FALSE, verbose = TRUE, getCounts = FALSE, cl)
aD |
An |
gap |
The maximum gap between aligned tags that should be allowed in constructing potential segments. Defaults to 300. See Details. |
squeeze |
If greater than zero, the minimum gap between aligned tags that should be allowed in constructing potential segments. See Details. |
filterProp |
If 'aD' is a |
strandSplit |
If TRUE, the data will be split by strand and segments will be constructed separately for each strand. Defaults to FALSE. |
verbose |
Should processing information be displayed? Defaults to TRUE. |
getCounts |
If TRUE, counts will be estimated for the constructed ‘segData’ object. If FALSE, they will not, and must be estimated on the fly for further operations on the ‘segData’ object, which is computationally wasteful but will substantially reduce the memory requirements. |
cl |
A SNOW cluster object, or NULL. See Details. |
This function takes an alignmentData
or
alignmentMeth
object and
constructs a segData
or
segMeth
object from it. The function
creates a set of potential segments by looking for all locations on
the genome where the start of a region of overlapping alignments
(or, if ‘squeeze’ is non-zero, those alignments separated by no more
than ‘squeeze’.) exists in the alignmentData
object. A potential
segment then exists from this start point to the end of all regions
of overlapping alignments such that there is no region in the
segment of at least length ‘gap’ where no tag aligns. The number
of potential segments can therefore be increased by increasing this
limit, or (usually more usefully) decreased by decreasing this limit
in order to save computational effort.
A 'cluster'
object (package: snow) is recommended for
parallelisation of this function when using large data sets.
Passing NULL to this variable will cause the function to run in non-parallel mode.
A segData
object.
Thomas J. Hardcastle
getCounts
, which produces the count data for each
potential segment.
heuristicSeg
and classifySeg
, which segment the
genome based on the segData
object produced by this function
segData
alignmentData
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames, gap = 100) # Process the alignmentData object to produce a `segData' object. sD <- processAD(alignData, gap = 100, cl = NULL)
These functions take alignment files of various formats to produce an object (see Details) describing the alignment of sequencing tags from different libraries. At present, BAM and text files are supported.
readGeneric(files, dir = ".", replicates, libnames, chrs, chrlens, cols, header = TRUE, minlen = 15, maxlen = 1000, multireads = 1000, polyLength, estimationType = "quantile", discardTags = FALSE, verbose = TRUE, filterReport = NULL, ...) readBAM(files, dir = ".", replicates, libnames, chrs, chrlens, countID = NULL, minlen = 15, maxlen = 1000, multireads = 1000, polyLength, estimationType = "quantile", discardTags = FALSE, verbose = TRUE, filterReport = NULL)
readGeneric(files, dir = ".", replicates, libnames, chrs, chrlens, cols, header = TRUE, minlen = 15, maxlen = 1000, multireads = 1000, polyLength, estimationType = "quantile", discardTags = FALSE, verbose = TRUE, filterReport = NULL, ...) readBAM(files, dir = ".", replicates, libnames, chrs, chrlens, countID = NULL, minlen = 15, maxlen = 1000, multireads = 1000, polyLength, estimationType = "quantile", discardTags = FALSE, verbose = TRUE, filterReport = NULL)
files |
Filenames of the files to be read in. |
dir |
Directory (or directories) in which the files can be found. |
replicates |
A vector defining the replicate structure if the group. If and
only if the ith library is a replicate of the jth library
then |
libnames |
Names of the libraries defined by the file names. |
chrs |
A chracter vector defining (a selection of) the chromosome names used in the alignment files; optional, will be inferred from data if not given. |
chrlens |
Lengths of the chromosomes to which the alignments were made; optional, will be inferred from data if not given. |
cols |
A named character vector which describes which column of the input files contains which data. See Details. |
countID |
A (two-character) string used by the BAM file to identify the ‘counts’ of individual sequenced reads; that is, how many times a given read appears in the sequenced library. If NULL, it is assumed that the data are redundant (see Details). |
header |
Do the input files have a header line? Defaults to TRUE. See Details. |
minlen |
Minimum length for reads. |
maxlen |
Maximum length for reads. |
multireads |
The functions will discard any read that aligns to the genome in more locations than given by this value. Set to Inf to keep everything. Defaults to 1000. |
polyLength |
If given, an integer value N defining the length of (approximate) homopolymers which will be removed from the data. If a tag contains a sequence of N+1 reads consisting of at least N identical bases, it will be removed. If not given, all data is used. |
estimationType |
The estimationType that will be used by the ‘baySeq’ function
|
discardTags |
If TRUE, information about the sequence of the aligned reads will be discarded. Useful for very large data sets. Defaults to FALSE. |
verbose |
Should processing information be displayed? Defaults to TRUE. |
filterReport |
If not NULL, this should be a string defining a file to which will be written those data filtered on the basis of chromsome choices, widths of sequences, multireads or polyBase. |
... |
Additional parameters to be passed to
|
readBAM:
This function takes a set of BAM files and generates the
'alignmentData'
object from these. If a character string for
‘countID’ is given, the function assumes the data are non-redundant
and that ‘countID’ identifies the count data (i.e., how many times
each read appears in the sequenced library) in each BAM file. If
‘countID’ is NULL, then it is assumed that the data are redundant, and
the count data are inferred from the file.
readGeneric:
The purpose of this function is to take a set of plain text files
and produce an 'alignmentData'
object. The function uses
read.table
to read in the columns of data in the files
and so by default columns are separated by any white
space. Alternative separators can be used by passing the appropriate
value for 'sep'
to read.table
.
The files may contain columns with column names
'chr'
, 'tag'
, 'count'
, 'start'
,
'end'
, 'strand'
in which case the ‘cols’ argument can be
ommitted and ‘header’ set to TRUE. If this is the case, there is no
requirement for all the files to have the same ordering of columns
(although all must have these column names).
Alternatively, the columns of data in the input files can be specified by
the ‘cols’ argument in the form of a named character vector (e.g;
'cols = c(chr = 1, tag = 2, count = 3, start = 4, end = 5,
strand = 6)'
would cause the function to assume that the first column
contains the chromosome information, the second column contained the
tag information, etc. If ‘cols’ is specified then information in the
header is ignored. If ‘cols’ is missing and ‘header’ is FALSE, then it
is assumed that the data takes the form described in the example above.
The 'tag'
, 'count'
and 'strand'
columns may optionally be
omitted from either the file column headers or the ‘cols’ argument. If
the 'tag'
column is omitted, then the data will not account for
duplicated sequences when estimating the number of counts in loci. If
the 'count'
column is omitted, the 'readGeneric'
function
will assume that the file contains the alignments of each copy of each
sequence tag, rather than an aggregated alignment of each unique
sequence. The unique alignments will be identified and the number of
sequence tags aligning to each position will be calculated. If
'strand'
is omitted, the strand will simply be ignored.
An alignmentData
object.
Thomas J. Hardcastle
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an `alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames)
This function takes as input a set of files that describe the number of
times a set of cytosines are observed to be methylated or unmethylated
in some high-throughput sequencing data. It merges the data from these
files into an object of 'alignmentMeth
' class which can
then be further processed to identify methylation loci.
readMeths(files, dir = ".", libnames, replicates, nonconversion, chrs)
readMeths(files, dir = ".", libnames, replicates, nonconversion, chrs)
files |
A character vector defining the file names of the alignment files to be read in. |
dir |
The directory in which the files are located. |
libnames |
A character vector giving the names of the samples to be read in. |
replicates |
A vector defining the replicate structure of the data. The ‘i’th and ‘j’th libraries are treated as replicates if and only if replicates[i] == replicates[j]. |
nonconversion |
A numeric vector (all members should lie between 0 and 1) defining the
non-conversion rate of each library. See
|
chrs |
An (optional) character vector giving the names of the chromosomes to be read from the files. If ommitted, all chromosomes will be read in. |
An object of class alignmentMeth
.
Thomas J. Hardcastle
datadir <- system.file("extdata", package = "segmentSeq") files <- c("short_18B_C24_C24_trim.fastq_CG_methCalls.gz", "short_Sample_17A_trimmed.fastq_CG_methCalls.gz", "short_13_C24_col_trim.fastq_CG_methCalls.gz", "short_Sample_28_trimmed.fastq_CG_methCalls.gz") mD <- readMeths(files = files, dir = datadir, libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"), nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134))
datadir <- system.file("extdata", package = "segmentSeq") files <- c("short_18B_C24_C24_trim.fastq_CG_methCalls.gz", "short_Sample_17A_trimmed.fastq_CG_methCalls.gz", "short_13_C24_col_trim.fastq_CG_methCalls.gz", "short_Sample_28_trimmed.fastq_CG_methCalls.gz") mD <- readMeths(files = files, dir = datadir, libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"), nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134))
The segClass
class contains data about potential
segments on the genome.
Objects can be created by calls of the form new("segClass", ...,
seglens)
. However, more usually they will be created by calling the
processAD
function.
coordinates
:A GRanges
object defining the coordinates of the segments.
replicates
:Object of class "factor"
. The
replicate structure for the samples.
locLikelihoods
:Object of class "DataFrame"
describing estimated likelihoods that each region defined in
‘coordinates’ is a locus in each replicate group.
The @coordinates
slot contains information on each of the potential
segments; specifically, chromosome, start and end of the segment,
together. Each row of the @coordinates
slot should correspond to
the same row of the @data
slot.
In almost all cases objects of this class should be produced by the
processAD
function.
Methods 'new', 'dim', '[' and 'show' have been defined for this class.
Thomas J. Hardcastle
processAD
, the function that will most often be used to
create objects of this class.
segData
, which inherits from this class.
segMeth
, which inherits from this class.
The segData
class inherits from the
segClass
class and contains data about potential
segments on the genome, together with counts for each of those segments.
Objects can be created by calls of the form new("segData", ...,
seglens)
. However, more usually they will be created by calling the
processAD
function.
coordinates
:A GRanges
object defining the coordinates of the segments.
replicates
:Object of class "factor"
. The
replicate structure for the samples.
locLikelihoods
:Object of class "DataFrame"
describing estimated likelihoods that each region defined in
‘coordinates’ is a locus in each replicate group.
data
:Object of class matrix
. Contains the
number of counts observed for each sample in each potential
segment.
libsizes
:Object of class "numeric"
. The
library sizes for each sample.
The @coordinates
slot contains information on each of the potential
segments; specifically, chromosome, start and end of the segment,
together. Each row of the @coordinates
slot should correspond to
the same row of the @data
slot.
In almost all cases objects of this class should be produced by the
processAD
function.
Methods 'new', 'dim', '[' and 'show' have been defined for this class.
Thomas J. Hardcastle
processAD
, the function that will most often be used to
create objects of this class.
classifySeg
, an empirical Bayesian method for defining a
segmentation based on a segData object.
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames) # Process the alignmentData object to produce a 'segData' object. sD <- processAD(alignData, gap = 100, cl = NULL)
# Define the files containing sample information. datadir <- system.file("extdata", package = "segmentSeq") libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt") # Establish the library names and replicate structure. libnames <- c("SL9", "SL10", "SL26", "SL32") replicates <- c(1,1,2,2) # Process the files to produce an 'alignmentData' object. alignData <- readGeneric(file = libfiles, dir = datadir, replicates = replicates, libnames = libnames) # Process the alignmentData object to produce a 'segData' object. sD <- processAD(alignData, gap = 100, cl = NULL)
The segMeth
class inherits from the
segClass
class and contains data about potential
segments on the genome, together with counts for each of those segments.
Objects can be created by calls of the form new("segMeth", ...,
seglens)
. However, more usually they will be created by calling the
processAD
function.
coordinates
:A GRanges
object defining the coordinates of the segments.
replicates
:Object of class "factor"
. The
replicate structure for the samples.
locLikelihoods
:Object of class "DataFrame"
describing estimated likelihoods that each region defined in
‘coordinates’ is a locus in each replicate group.
Cs
:Object of class matrix
. Contains the
number of methylated cytosines (which are sequenced as a ‘C’)
observed for each sample in each potential segment.
Ts
:Object of class matrix
. Contains the
number of unmethylated cytosines (which are sequenced as a ‘T’)
observed for each sample in each potential segment.
nonconversion
:Object of class "numeric"
. The
(estimated) nonconversion rate (see Details) for each of the libraries.
The @coordinates
slot contains information on each of the potential
segments; specifically, chromosome, start and end of the segment,
together. Each row of the @coordinates
slot should correspond to
the same row of the @C
and @T
slots.
The nonconversion slot is an estimate of the rate (for each library) at which an unmethylated cytosine has failed to be converted by sodium bisulphite treatment into thymine, and is thus recorded (incorrectly) as methylated. In some cases, this can be estimated from considering observed methylation rates on regions known to be unmethylated (e.g., chloroplasts) or by introducing unmethylated control sequences.
In almost all cases objects of this class should be produced by the
processAD
function.
Methods 'new', 'dim', '[' and 'show' have been defined for this class.
Thomas J. Hardcastle
processAD
, the function that will most often be used to
create objects of this class.
segClass
, from which this class inherits.
Selects loci from a ‘lociData’ object based on likelihood, false discovery rate or familywise error rate for downstream processing.
selectLoci(cD, likelihood, FDR, FWER, perReplicate = TRUE)
selectLoci(cD, likelihood, FDR, FWER, perReplicate = TRUE)
cD |
The |
likelihood |
If provided, all loci with a likelihood greater than this criterion will be selected. |
FDR |
If provided (and likelihood is not provided), the maximal set of loci which controls the FDR at this level is selected. |
FWER |
If provided (and likelihood and FDR are not provided), the maximal set of loci which controls the FWER at this level is selected. |
perReplicate |
If TRUE, selection of loci is done on a replicate by replicate basis. If FALSE, selection will be done on the likelihood that the locus represents a true locus in at least one replicate group. |
A lociData
object.
Thomas J. Hardcastle
Each of the files 'SL9', 'SL10', 'SL26' and 'SL32' represents a subset of the data from an Illumina sequencing experiment. These data consist of alignment information; the tag sequence, and the number of times that each sequence is observed.
A set of tab-delimited files containing data from four sequencing experiments.
In-house Illumina sequencing experiments
Summarises the expected number of loci, either in toto or on a per replicate group basis.
summariseLoci(cD, perReplicate = TRUE)
summariseLoci(cD, perReplicate = TRUE)
cD |
A ‘lociData’ object with calculated values in the ‘@lociLikelihoods’ slot. |
perReplicate |
Should the expectation be calculated on a per replicate group basis, or the total number of loci identified in the dataset? |
A numeric vector summarising the expectated number of loci in the cD object.
Thomas J. Hardcastle
This function offers a variety of methods for the analysis of methylation data to determine a suitable threshold for the proportion of methylation at which to distinguish a methylation locus from a non-methylated locus.
thresholdFinder(method, aM, subset, minprop = 0.05, bootstrap = 100, abstol = 1e-4, verbose = FALSE, cl = NULL, processAD.args = list(), heuristicSeg.args = list())
thresholdFinder(method, aM, subset, minprop = 0.05, bootstrap = 100, abstol = 1e-4, verbose = FALSE, cl = NULL, processAD.args = list(), heuristicSeg.args = list())
method |
Character string defining method to use for threshold estimation. Available options are 'varsum', 'minden', 'beta' and 'abc'. See Details. |
aM |
An |
subset |
Numeric vector defining a subset on |
minprop |
For 'minden' method, a minimum proportion permitted for choice of threshold. |
bootstrap |
The maximum number of bootstraps to be permitted in estimating a threshold. Defaults to 100. See Details. |
abstol |
Minimum tolerance fro threshold estimation. |
verbose |
Verbose reporting. Defaults to FALSE. |
cl |
A cluster object, or NULL. Defaults to NULL. |
processAD.args |
Arguments to be passed to processAD function if bootstrapping. |
heuristicSeg.args |
Arguments to be passed to heuristicSeg function if bootstrapping. |
This function operates on the data observed within each replicate group, and then takes the mean of the thresholds calculated for each group.
Methods currently available for threshold estimation are 'varsum', 'minden', 'beta' and 'abc'. The 'varsum' method attempts to split the vector of proportions of methylation observed at each cytosine into two sets of minimal total variance. The 'minden' method finds the minimum point on a smoothed kernel density of the proportions of methylation. The 'beta' method estimates for each cytosine a posterior distribution on proportions of methylation based on the beta-binomial conjugacy, takes the average of these distributions and finds the minumum. The 'abc' method performs like the beta method, but estimates the posterior distribution through approximate Bayesian computation.
Bootstrapping uses the estimated threshold to define loci. Based on the defined loci, cytosines are then only included in a re-estimation of the thresholds if they are identified as belonging to an expressed locus within the current replicate group, or if they are not expressed in any replicate group. Thresholds are re-esimated until the maximum number of bootstraps is reached or the difference between estimated thresholds drops below 'abstol', whichever is the sooner.
A numeric value defining a threshold on methylation.
Thomas J. Hardcastle
datadir <- system.file("extdata", package = "segmentSeq") files <- c("short_18B_C24_C24_trim.fastq_CG_methCalls.gz", "short_Sample_17A_trimmed.fastq_CG_methCalls.gz", "short_13_C24_col_trim.fastq_CG_methCalls.gz", "short_Sample_28_trimmed.fastq_CG_methCalls.gz") mD <- readMeths(files = files, dir = datadir, libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"), nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134)) ## Not run: thresholdFinder("beta", mD, cl = NULL)
datadir <- system.file("extdata", package = "segmentSeq") files <- c("short_18B_C24_C24_trim.fastq_CG_methCalls.gz", "short_Sample_17A_trimmed.fastq_CG_methCalls.gz", "short_13_C24_col_trim.fastq_CG_methCalls.gz", "short_Sample_28_trimmed.fastq_CG_methCalls.gz") mD <- readMeths(files = files, dir = datadir, libnames = c("A1", "A2", "B1", "B2"), replicates = c("A","A","B","B"), nonconversion = c(0.004777, 0.005903, 0.016514, 0.006134)) ## Not run: thresholdFinder("beta", mD, cl = NULL)