Title: | Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays |
---|---|
Description: | Faster implementation of CRLMM specific to SNP 5.0 and 6.0 arrays, as well as a copy number tool specific to 5.0, 6.0, and Illumina platforms. |
Authors: | Benilton S Carvalho, Robert Scharpf, Matt Ritchie, Ingo Ruczinski, Rafael A Irizarry |
Maintainer: | Benilton S Carvalho <[email protected]>, Robert Scharpf <[email protected]>, Matt Ritchie <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.65.0 |
Built: | 2024-10-31 00:38:40 UTC |
Source: | https://github.com/bioc/crlmm |
Faster implementation of CRLMM specific to SNP 5.0 and 6.0 arrays.
Index:
crlmm-package New implementation of the CRLMM Algorithm. crlmm Genotype SNP 5.0 or 6.0 samples. calls Accessor for genotype calls. confs Accessor for confidences.
The 'crlmm' package reimplements the CRLMM algorithm present in the 'oligo' package. This implementation primes for efficient genotyping of samples on SNP 5.0 and SNP 6.0 Affymetrix arrays.
To use this package, the user must have additional data packages: 'genomewidesnp5Crlmm' - SNP 5.0 arrays 'genomewidesnp6Crlmm' - SNP 6.0 arrays
Rafael A Irizarry Maintainer: Benilton S Carvalho <[email protected]>
Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics. 2010 Jan 15;26(2):242-9. Epub 2009 Nov 11.
A panel function for plotting prediction regions and log-normalized intensities
ABpanel(x, y, predictRegion, copyNumber = 0:4, fill, ..., subscripts)
ABpanel(x, y, predictRegion, copyNumber = 0:4, fill, ..., subscripts)
x |
log-normalized intensities for the A or B allele |
y |
log-normalized intensities for the A or B allele |
predictRegion |
A |
copyNumber |
Integer vector. Indicates which prediction regions are drawn. |
fill |
Character or integer vector for coloring the points. Only valid for
certain point symbols. See |
... |
Additional arguments to |
subscripts |
See |
Not applicable
ABpanel
can be passed as the argument to panel in the
xyplot
method for CNSet
objects. See the examples in
xyplot
.
R. Scharpf
The batchStatistics
slot in a CNSet
object is an
instance of the AssayData
slot. In general, the
accessors for AssayData
are called indirectly by the
corresponding method for the CNSet
class and not called
directly by the user.
signature(object="AssayData")
: Accessor for genotype
frequencies
signature(object="AssayData")
: Accessor for the correlation of the
log-transformed normalized intensities within the diallelic genotype clusters
signature(x="AssayData")
: Accessor for the median
absolute deviation of the normalized intensities within the
diallelic genotype clusters
signature(object="AssayData")
: Accessor for
the posterior mean of the normalized intensity within the diallelic genotype
clusters.
signature(object="AssayData")
: Accessor for the
median absolute deviation of the log-transformed intensities within
the diallelic genotype clusters
CNSet-class
, Ns
, tau2
, corr
, mads
, medians
The summary statistics stored here are used by the tools for copy number estimation.
corr(object, ...) tau2(object, ...) mads(object,...) medians(object,...) Ns(object,...)
corr(object, ...) tau2(object, ...) mads(object,...) medians(object,...) Ns(object,...)
object |
An object of class |
... |
Ignored |
An array with dimension R x A x G x C, or R x G x C.
R: number of markers A: number of alleles (2) G: number of biallelic genotypes (3) C: number of batches
Ns
returns an array of genotype frequencies stratified by
batch. Dimension R x G x C.
corr
returns an array of within-genotype correlations
(log2-scale) stratified by batch. Dimension R x G x C.
medians
returns an array of the within-genotype medians
(intensity-scale) stratified by batch and allele. Dimension R x A x G
x C.
mads
returns an array of the within-genotype median absolute
deviations (intensity-scale) stratified by batch and allele. Dimension
is the same as for medians
.
tau2
returns an array of the squared within-genotype median
absolute deviation on the log-scale. Only the mads for AA and BB
genotypes are stored. Dimension is R x A x G x C, where G is AA or
BB. Note that the mad for allele A/B for subjects with genotype BB/AA
is a robust estimate of the background variance, whereas the the mad
for allele A/B for subjects with genotype AA/BB is a robust estimate
of the variance for copy number greater than 0 (we assume that on the
log-scale the variance is rougly constant for CA, CB > 0).
data(cnSetExample) Ns(cnSetExample)[1:5, , ] corr(cnSetExample)[1:5, , ] meds <- medians(cnSetExample) mads(cnSetExample)[1:5, , ,] tau2(cnSetExample)[1:5, , ,]
data(cnSetExample) Ns(cnSetExample)[1:5, , ] corr(cnSetExample)[1:5, , ] meds <- medians(cnSetExample) mads(cnSetExample)[1:5, , ,] tau2(cnSetExample)[1:5, , ,]
Calculate log R ratios and B allele frequencies from
a CNSet
object
calculateRBaf(object, batch.name, chrom)
calculateRBaf(object, batch.name, chrom)
object |
A |
batch.name |
A character string indicating the batch. If
missing, log R ratios and B allele frequencies are calculated for all
batches in the |
chrom |
Integer indicating which chromosome to process. If
missing, B allele frequencies and log R ratios are calculated for all
autosomal chromosomes and chromosome X that are included in
|
batch.name
must be a value in batch(object)
. Currently,
one must specify a single batch.name
. If a character vector for
batch.name
is supplied, only the first is evaluated.
TODO: A description of how these values are calculated.
A named list.
baf
: Each element in the baf list is a matrix of B allele
frequencies (one matrix for each chromosome).
lrr
: Each element in the lrr list is a matrix of log R ratios
(one matrix for each chromosome).
The log R ratios were scaled by a factor of 100 and stored as an integer. B allele frequencies were scaled by a factor of 1000 and stored as an integer.
Lynn Mireless
Peiffer et al., High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping (2006), Genome Research
data(cnSetExample) baf.lrr <- suppressWarnings(calculateRBaf(cnSetExample, "SHELF")) hist(baf.lrr[["baf"]][[1]]/1000, breaks=100) hist(baf.lrr[["lrr"]][[1]]/100, breaks=100) ## Not run: library(ff) baf.lrr <- suppressWarnings(calculateRBaf(cnSetExample, "SHELF")) class(baf.lrr[["baf"]][[1]]) ## ff_matrix class(baf.lrr[["lrr"]][[1]]) ## ff_matrix ## End(Not run)
data(cnSetExample) baf.lrr <- suppressWarnings(calculateRBaf(cnSetExample, "SHELF")) hist(baf.lrr[["baf"]][[1]]/1000, breaks=100) hist(baf.lrr[["lrr"]][[1]]/100, breaks=100) ## Not run: library(ff) baf.lrr <- suppressWarnings(calculateRBaf(cnSetExample, "SHELF")) class(baf.lrr[["baf"]][[1]]) ## ff_matrix class(baf.lrr[["lrr"]][[1]]) ## ff_matrix ## End(Not run)
Quantile normalize nonpolymorphic markers to hapmap reference distribution
cnrmaAffy(cnSet, seed = 1, verbose = TRUE)
cnrmaAffy(cnSet, seed = 1, verbose = TRUE)
cnSet |
Object of class |
seed |
Random number seed |
verbose |
Logical. |
Returns logical. Normalized intensities are written to the alleleA
ff_matrix
stored in the CNSet
assayData.
R. Scharpf
CNSet is a container defined in the oligoClasses package for storing normalized intensities for genotyping platforms, genotype calls, and parameters estimated for copy number. Accessors for data that an object of this class contains are largely defined in the package oligoClasses. CNSet methods that involve more complex calculations that are specific to the crlmm package, such as computing allele-specific copy number, are included in crlmm and described here.
as(from, "oligoSnpSet")
: Method for coercing object
from
(class CNSet
) to an object of class
oligoSnpSet
.
signature(object="CNSet")
: calculates raw copy number
for allele A
signature(object="CNSet")
: calculates raw copy
number for allele B
signature(x="CNSet")
: plot ellipses (95th
percentile) for prediction regions
signature(object="CNSet")
: calculates
total raw copy number
signature(object="CNSet")
: same as totalCopynumber
signature(object="CNSet")
: estimate of mean
background (intensity-scale) for allele A
signature(object="CNSet")
: estimate of mean
background (intensity-scale) for allele A
signature(object="CNSet")
: estimate of slope coefficient (intensity-scale) for allele A
signature(object="CNSet")
: estimate of slope coefficient (intensity-scale) for allele B
signature(object="CNSet")
: genotype frequencies
signature(object="CNSet")
: correlation of
log-transformed normalized intensities within the genotype clusters
signature(x="CNSet")
: ...
signature(object="CNSet")
: ...
signature(object="CNSet")
: ...
OligoSetList(object)
: constructs an object of class
OligoSetList
from object
having class CNSet
.
BafLrrSetList(object)
: constructs an object of class
BafLrrSetList
from object
having class CNSet
.
CNSet-class
, CA
, CB
,
totalCopynumber
, rawCopynumber
The data for the first 16 polymorphic markers in the HapMap analysis.
data(cnSetExample) data(cnSetExample2)
data(cnSetExample) data(cnSetExample2)
The data illustrates the CNSet-class
, with
assayData
containing the quantile-normalized
intensities for the A and B alleles, genotype calls and
confidence scores.
This object was created from the copynumber vignette in inst/scripts. A subset of markers was selected to keep the package size small.
data(cnSetExample) data(cnSetExample2)
data(cnSetExample) data(cnSetExample2)
Construct a container for normalized intensities for Affymetrix cel
files, referred to as a CNSet
constructAffyCNSet(filenames, sns, cdfName, batch, verbose = TRUE, genome)
constructAffyCNSet(filenames, sns, cdfName, batch, verbose = TRUE, genome)
filenames |
Vector of cel file names. |
sns |
Sample identifiers. Defaults to |
cdfName |
Character string indicating annotation package (e.g., "genomewidesnp6Crlmm") |
batch |
Vector of same length as filenames indicating batch. |
verbose |
Logical. |
genome |
Character string indicating UCSC genome build (hg18 or hg19 supported) |
An object of class CNSet
R. Scharpf
Instantiates an object of class CNSet for the Infinium
platforms. Elements of assayData
and
batchStatistics
will be ff
objects. See details.
constructInf(sampleSheet = NULL, arrayNames = NULL, path = ".", arrayInfoColNames = list(barcode="SentrixBarcode_A",position="SentrixPosition_A"), highDensity = FALSE, sep = "_", fileExt = list(green = "Grn.idat", red = "Red.idat"), XY, cdfName, anno, genome, verbose = FALSE, batch=NULL, saveDate = TRUE)
constructInf(sampleSheet = NULL, arrayNames = NULL, path = ".", arrayInfoColNames = list(barcode="SentrixBarcode_A",position="SentrixPosition_A"), highDensity = FALSE, sep = "_", fileExt = list(green = "Grn.idat", red = "Red.idat"), XY, cdfName, anno, genome, verbose = FALSE, batch=NULL, saveDate = TRUE)
sampleSheet |
|
arrayNames |
character vector containing names of arrays to be
read in. If |
path |
character string specifying the location of files to be read by the function |
arrayInfoColNames |
(used when |
highDensity |
logical (used when |
sep |
character string specifying separator used in .idat file names. |
fileExt |
list containing elements 'Green' and 'Red' which specify the .idat file extension for the Cy3 and Cy5 channels. |
XY |
an |
cdfName |
annotation package (see also |
anno |
data.frame containing SNP annotation information from
manifest and additional columns 'isSnp', 'position', 'chromosome'
and 'featureNames'. For use when |
genome |
character string specifying which genome is used in annotation |
verbose |
'logical.' Whether to print descriptive messages during processing. |
batch |
batch variable. See details. |
saveDate |
'logical'. Should the dates from each .idat be saved with sample information? |
This function initializes a container for storing the normalized intensities for the A and B alleles at polymorphic loci and the normalized intensities for the 'A' allele at nonpolymorphic loci. CRLMM genotype calls and confidence scores are also stored in the assayData. This function does not do any preprocessing or genotyping – it only creates an object of the appropriate size. The initialized values will all be 'NA'.
The ff package provides infrastructure for accessing and writing
data to disk instead of keeping data in memory. Each element of
the assayData
and batchStatistics
slot are ff
objects. ff objects in the R workspace contain pointers to
several files with the '.ff' extension on disk. The location of
where the data is stored on disk can be specified by use of the
ldPath
function. Users should not move or rename this
directory. If only output files are stored in ldPath
,
one can either remove the entire directory prior to rerunning
the analysis or all of the '.ff' files. Otherwise, one would
accumulate a large number of '.ff' files on disk that are no
longer in use.
We have adopted the ff
package in order to reduce crlmm's
memory footprint. The memory usage can be fine-tuned by the
utilities ocSamples
and ocProbesets
provided in
the oligoClasses
package. In most instances, the
user-level interface will be no different than accessing data
from ordinary matrices in R. However, the differences in the
underlying representation can become more noticeable for very
large datasets in which the I/O for accessing data from the disk
can be substantial.
A CNSet
object
R. Scharpf
ldPath
, ocSamples
, ocProbesets
, CNSet-class
, preprocessInf
, genotypeInf
## See the Illumina vignettes in inst/scripts of the ## source package for an example
## See the Illumina vignettes in inst/scripts of the ## source package for an example
These methods can be applied after an object of class CNSet
has
been generated by the crlmmCopynumber
function.
CA(object, ...) CB(object, ...) nuA(object) nuB(object) phiA(object) phiB(object) totalCopynumber(object,...) rawCopynumber(object,...)
CA(object, ...) CB(object, ...) nuA(object) nuB(object) phiA(object) phiB(object) totalCopynumber(object,...) rawCopynumber(object,...)
object |
An object of class |
... |
An additional argument named 'i' can be passed to subset the markers and an argument 'j' can be passed to subset the samples. Other arguments are ignored. |
At polymorphic markers, nuA and nuB provide the intercept coefficient (the estimated background intensity) for the A and B alleles, respectively. phiA and phiB provide the slope coefficients for the A and B alleles, respectively.
At nonpolymorphic markers, nuB and phiB are 'NA'.
These functions can be used to tranlate the normalized intensities to the copy number scale. Plotting the copy number estimates as a function of physical position can be used to guide downstream algorithms that smooth, as well as to assess possible mosaicism.
nu[A/B] and phi[A/B] return matrices of the intercept and slope coefficients, respectively.
CA and CB return matrices of allele-specific copy number.
totalCopynumber (or rawCopynumber) returns a matrix of CA+CB.
Subsetting the CNSet
object before extracting copy number can be
very inefficient when the data set is very large, particularly if using
ff objects. The [
method will subset all of the assay data
elements and all of the elements in the LinearModelParameter slot.
## Not run: data(cnSetExample) all(isCurrent(cnSetExample)) ## is the cnSet object current? ## -------------------------------------------------- ## calculating allele-specific copy number ## -------------------------------------------------- ## copy number for allele A, first 5 markers, first 2 samples (ca <- CA(cnSetExample, i=1:5, j=1:2)) ## copy number for allele B, first 5 markers, first 2 samples (cb <- CB(cnSetExample, i=1:5, j=1:2)) ## total copy number for first 5 markers, first 2 samples (cn1 <- ca+cb) ## total copy number at first 5 nonpolymorphic loci index <- which(!isSnp(cnSetExample))[1:5] cn2 <- CA(cnSetExample, i=index, j=1:2) ## note, cb is NA at nonpolymorphic loci (cb <- CB(cnSetExample, i=index, j=1:2)) ## note, ca+cb will give NAs at nonpolymorphic loci CA(cnSetExample, i=index, j=1:2) + cb ## A shortcut for total copy number cn3 <- totalCopynumber(cnSetExample, i=1:5, j=1:2) all.equal(cn3, cn1) cn4 <- totalCopynumber(cnSetExample, i=index, j=1:2) all.equal(cn4, cn2) ## markers 1-5, all samples cn5 <- totalCopynumber(cnSetExample, i=1:5) ## all markers, samples 1-5 cn6 <- totalCopynumber(cnSetExample, j=1:2) ## End(Not run)
## Not run: data(cnSetExample) all(isCurrent(cnSetExample)) ## is the cnSet object current? ## -------------------------------------------------- ## calculating allele-specific copy number ## -------------------------------------------------- ## copy number for allele A, first 5 markers, first 2 samples (ca <- CA(cnSetExample, i=1:5, j=1:2)) ## copy number for allele B, first 5 markers, first 2 samples (cb <- CB(cnSetExample, i=1:5, j=1:2)) ## total copy number for first 5 markers, first 2 samples (cn1 <- ca+cb) ## total copy number at first 5 nonpolymorphic loci index <- which(!isSnp(cnSetExample))[1:5] cn2 <- CA(cnSetExample, i=index, j=1:2) ## note, cb is NA at nonpolymorphic loci (cb <- CB(cnSetExample, i=index, j=1:2)) ## note, ca+cb will give NAs at nonpolymorphic loci CA(cnSetExample, i=index, j=1:2) + cb ## A shortcut for total copy number cn3 <- totalCopynumber(cnSetExample, i=1:5, j=1:2) all.equal(cn3, cn1) cn4 <- totalCopynumber(cnSetExample, i=index, j=1:2) all.equal(cn4, cn2) ## markers 1-5, all samples cn5 <- totalCopynumber(cnSetExample, i=1:5) ## all markers, samples 1-5 cn6 <- totalCopynumber(cnSetExample, j=1:2) ## End(Not run)
This is a faster and more efficient implementation of the CRLMM algorithm, especially designed for Affymetrix SNP 5 and 6 arrays (to be soon extended to other platforms).
crlmm(filenames, row.names=TRUE, col.names=TRUE, probs=c(1/3, 1/3, 1/3), DF=6, SNRMin=5, gender=NULL, save.it=FALSE, load.it=FALSE, intensityFile, mixtureSampleSize=10^5, eps=0.1, verbose=TRUE, cdfName, sns, recallMin=10, recallRegMin=1000, returnParams=FALSE, badSNP=0.7) crlmm2(filenames, row.names=TRUE, col.names=TRUE, probs=c(1/3, 1/3, 1/3), DF=6, SNRMin=5, gender=NULL, save.it=FALSE, load.it=FALSE, intensityFile, mixtureSampleSize=10^5, eps=0.1, verbose=TRUE, cdfName, sns, recallMin=10, recallRegMin=1000, returnParams=FALSE, badSNP=0.7)
crlmm(filenames, row.names=TRUE, col.names=TRUE, probs=c(1/3, 1/3, 1/3), DF=6, SNRMin=5, gender=NULL, save.it=FALSE, load.it=FALSE, intensityFile, mixtureSampleSize=10^5, eps=0.1, verbose=TRUE, cdfName, sns, recallMin=10, recallRegMin=1000, returnParams=FALSE, badSNP=0.7) crlmm2(filenames, row.names=TRUE, col.names=TRUE, probs=c(1/3, 1/3, 1/3), DF=6, SNRMin=5, gender=NULL, save.it=FALSE, load.it=FALSE, intensityFile, mixtureSampleSize=10^5, eps=0.1, verbose=TRUE, cdfName, sns, recallMin=10, recallRegMin=1000, returnParams=FALSE, badSNP=0.7)
filenames |
'character' vector with CEL files to be genotyped. |
row.names |
'logical'. Use rownames - SNP names? |
col.names |
'logical'. Use colnames - Sample names? |
probs |
'numeric' vector with priors for AA, AB and BB. |
DF |
'integer' with number of degrees of freedom to use with t-distribution. |
SNRMin |
'numeric' scalar defining the minimum SNR used to filter out samples. |
gender |
'integer' vector, with same length as 'filenames', defining sex. (1 - male; 2 - female) |
save.it |
'logical'. Save preprocessed data? |
load.it |
'logical'. Load preprocessed data to speed up analysis? |
intensityFile |
'character' with filename to be saved/loaded - preprocessed data. |
mixtureSampleSize |
Number of SNP's to be used with the mixture model. |
eps |
Minimum change for mixture model. |
verbose |
'logical'. |
cdfName |
'character' defining the CDF name to use ('GenomeWideSnp5', 'GenomeWideSnp6') |
sns |
'character' vector with sample names to be used. |
recallMin |
Minimum number of samples for recalibration. |
recallRegMin |
Minimum number of SNP's for regression. |
returnParams |
'logical'. Return recalibrated parameters. |
badSNP |
'numeric'. Threshold to flag as bad SNP (affects batchQC) |
'crlmm2' allows one to genotype very large datasets (via ff package) and also permits the use of clusters or multiple cores (via snow package) to speed up genotyping.
As noted above, the call probabilities are stored using an integer
representation to reduce file size using the transformation
'round(-1000*log2(1-p))', where p is the probability. The function
i2P
can be used to convert the integers back to the scale of
probabilities.
A SnpSet
object.
calls |
Genotype calls (1 - AA, 2 - AB, 3 - BB) |
confs |
Confidence scores 'round(-1000*log2(1-p))' |
SNPQC |
SNP Quality Scores |
batchQC |
Batch Quality Score |
params |
Recalibrated parameters |
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007 Apr;8(2):485-99. Epub 2006 Dec 22. PMID: 17189563.
Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics. 2010 Jan 15;26(2):242-9.
i2p
, snpCall
, snpCallProbability
## this can be slow library(oligoClasses) if (require(genomewidesnp6Crlmm) & require(hapmapsnp6)){ path <- system.file("celFiles", package="hapmapsnp6") ## the filenames with full path... ## very useful when genotyping samples not in the working directory cels <- list.celfiles(path, full.names=TRUE) (crlmmOutput <- crlmm(cels)) ## If gender is known, one should check that the assigned gender is ## correct, or pass the integer coding of gender as an argument to the ## crlmm function as done below } ## Not run: ## HPC Example library(ff) library(snow) library(crlmm) ## genotype 50K SNPs at a time ocProbesets(50000) ## setup cluster - 8 cores on the machine library(doSNOW) cl <- makeCluster(8, "SOCK") registerDoSNOW(cl) ##setCluster(8, "SOCK") path <- system.file("celFiles", package="hapmapsnp6") cels <- list.celfiles(path, full.names=TRUE) crlmmOutput <- crlmm2(cels) ## End(Not run)
## this can be slow library(oligoClasses) if (require(genomewidesnp6Crlmm) & require(hapmapsnp6)){ path <- system.file("celFiles", package="hapmapsnp6") ## the filenames with full path... ## very useful when genotyping samples not in the working directory cels <- list.celfiles(path, full.names=TRUE) (crlmmOutput <- crlmm(cels)) ## If gender is known, one should check that the assigned gender is ## correct, or pass the integer coding of gender as an argument to the ## crlmm function as done below } ## Not run: ## HPC Example library(ff) library(snow) library(crlmm) ## genotype 50K SNPs at a time ocProbesets(50000) ## setup cluster - 8 cores on the machine library(doSNOW) cl <- makeCluster(8, "SOCK") registerDoSNOW(cl) ##setCluster(8, "SOCK") path <- system.file("celFiles", package="hapmapsnp6") cels <- list.celfiles(path, full.names=TRUE) crlmmOutput <- crlmm2(cels) ## End(Not run)
Locus- and allele-specific estimation of copy number.
crlmmCopynumber(object, MIN.SAMPLES=10, SNRMin = 5, MIN.OBS = 1, DF.PRIOR = 50, bias.adj = FALSE, prior.prob = rep(1/4, 4), seed = 1, verbose = TRUE, GT.CONF.THR = 0.80, MIN.NU = 2^3, MIN.PHI = 2^3, THR.NU.PHI = TRUE, type=c("SNP", "NP", "X.SNP", "X.NP"), fit.linearModel=TRUE)
crlmmCopynumber(object, MIN.SAMPLES=10, SNRMin = 5, MIN.OBS = 1, DF.PRIOR = 50, bias.adj = FALSE, prior.prob = rep(1/4, 4), seed = 1, verbose = TRUE, GT.CONF.THR = 0.80, MIN.NU = 2^3, MIN.PHI = 2^3, THR.NU.PHI = TRUE, type=c("SNP", "NP", "X.SNP", "X.NP"), fit.linearModel=TRUE)
object |
object of class |
MIN.SAMPLES |
'Integer'. The minimum number of samples in a batch. Bathes with fewer than MIN.SAMPLES are skipped. Therefore, samples in batches with fewer than MIN.SAMPLES have NA's for the allele-specific copy number and NA's for the linear model parameters. |
SNRMin |
Samples with low signal to noise ratios are excluded. |
MIN.OBS |
For a SNP with with fewer than |
DF.PRIOR |
The 2 x 2 covariance matrix of the background and signal variances is estimated from the data at each locus. This matrix is then smoothed towards a common matrix estimated from all of the loci. DF.PRIOR controls the amount of smoothing towards the common matrix, with higher values corresponding to greater smoothing. Currently, DF.PRIOR is not estimated from the data. Future versions may estimate DF.PRIOR empirically. |
bias.adj |
|
prior.prob |
This argument is currently ignored. A numerical vector providing prior probabilities for copy number states corresponding to homozygous deletion, hemizygous deletion, normal copy number, and amplification, respectively. |
seed |
Seed for random number generation. |
verbose |
Logical. |
GT.CONF.THR |
Confidence threshold for genotype calls (0, 1). Calls with confidence scores below this theshold are not used to estimate the within-genotype medians. See Carvalho et al., 2007 for information regarding confidence scores of biallelic genotypes. |
MIN.NU |
numeric. Minimum value for background
intensity. Ignored if |
MIN.PHI |
numeric. Minimum value for slope. Ignored if
|
THR.NU.PHI |
If |
type |
Character string vector that must be one or more of "SNP", "NP", "X.SNP", or "X.NP". Type refers to a set of markers. See details below |
fit.linearModel |
Logical. If TRUE, a linear model is fit to estimate the parameters for computing the absolute copy number. If FALSE, we compute the batch-specific, within-genotype median and MAD at polymorphic loci and the median and MAD at nonpolymorphic loci. |
We suggest a minimum of 10 samples per batch for using
crlmmCopynumber
. 50 or more samples per batch is preferred
and will improve the estimates.
The functions crlmmCopynumberLD
and
crlmmCopynumber2
have been deprecated.
The argument type
can be used to specify a subset of
markers for which the copy number estimation algorithm is run.
One or more of the following possible entries are valid: 'SNP',
'NP', 'X.SNP', and 'X.NP'.
'SNP' referers to autosomal SNPs.
'NP' refers to autosomal nonpolymorphic markers.
'X.SNP' refers to SNPs on chromosome X.
'X.NP' refers to autosomes on chromosome X.
However, users must run 'SNP' prior to running 'NP' and 'X.NP',
or specify type = c('SNP', 'X.NP')
.
The value returned by the crlmmCopynumber
function
depends on whether the data is stored in RAM or whether the data
is stored on disk using the R package ff
for reading /
writing. If uncertain, the first line of the show
method
defined for CNSet
objects prints whether the
assayData
elements are derived from the ff
package
in the first line. Specifically,
- if the elements of the batchStaticts
slot in the
CNSet
object have the class "ff_matrix" or "ffdf", then
the crlmmCopynumber
function updates the data stored on
disk and returns the value TRUE
.
- if the elements of the batchStatistics
slot in the
CNSet
object have the class 'matrix', then the
crlmmCopynumber
function returns an object of class
CNSet
with the elements of batchStatistics
updated.
R. Scharpf
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007 Apr;8(2):485-99. Epub 2006 Dec 22. PMID: 17189563.
Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics. 2010 Jan 15;26(2):242-9.
Scharpf RB, Ruczinski I, Carvalho B, Doan B, Chakravarti A, and Irizarry RA, Biostatistics. Biostatistics, Epub July 2010.
Preprocessing and genotyping of Affymetrix arrays.
genotype(filenames, cdfName, batch, mixtureSampleSize = 10^5, eps =0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7, genome=c("hg19", "hg18"))
genotype(filenames, cdfName, batch, mixtureSampleSize = 10^5, eps =0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7, genome=c("hg19", "hg18"))
filenames |
complete path to CEL files |
cdfName |
annotation package (see also
|
batch |
vector of class |
mixtureSampleSize |
Sample size to be use when fitting the mixture model. |
eps |
Stop criteria. |
verbose |
Logical. Whether to print descriptive messages during processing. |
seed |
Seed to be used when sampling. Useful for reproducibility |
sns |
The sample identifiers. If missing, the default sample names are |
probs |
'numeric' vector with priors for AA, AB and BB. |
DF |
'integer' with number of degrees of freedom to use with t-distribution. |
SNRMin |
'numeric' scalar defining the minimum SNR used to filter out samples. |
recallMin |
Minimum number of samples for recalibration. |
recallRegMin |
Minimum number of SNP's for regression. |
gender |
integer vector ( male = 1, female =2 ) or missing, with same length as filenames. If missing, the gender is predicted. |
returnParams |
'logical'. Return recalibrated parameters from crlmm. |
badSNP |
'numeric'. Threshold to flag as bad SNP (affects batchQC) |
genome |
character string indicating the UCSC genome build for the SNP annotation |
For large datasets it is important to utilize the large data
support by installing and loading the ff package before calling
the genotype
function. In previous versions of the
crlmm
package, we useed different functions for
genotyping depending on whether the ff package is loaded, namely
genotype
and genotype2
. The genotype
function now handles both instances.
genotype
is essentially a wrapper of the crlmm
function for genotyping. Differences include (1) that the copy
number probes (if present) are also quantile-normalized and (2)
the class of object returned by this function, CNSet
, is
needed for subsequent copy number estimation. Note that the
batch variable that must be passed to this function has no
effect on the normalization or genotyping steps. Rather,
batch
is required in order to initialize a CNSet
container with the appropriate dimensions and is used directly
when estimating copy number.
A SnpSuperSet
instance.
For large datasets, load the 'ff' package prior to genotyping –
this will greatly reduce the RAM required for big jobs. See
ldPath
and ocSamples
.
R. Scharpf
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007 Apr;8(2):485-99. Epub 2006 Dec 22. PMID: 17189563.
Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics. 2010 Jan 15;26(2):242-9.
snprma
, crlmm
,
ocSamples
,
ldOpts
,
batch
,
crlmmCopynumber
if (require(ff) & require(genomewidesnp6Crlmm) & require(hapmapsnp6)){ ldPath(tempdir()) path <- system.file("celFiles", package="hapmapsnp6") ## the filenames with full path... ## very useful when genotyping samples not in the working directory cels <- list.celfiles(path, full.names=TRUE) ## Note: one would need at least 10 CEL files for copy number estimation ## To use less RAM, specify a smaller argument to ocProbesets ocProbesets(50e3) batch <- rep("A", length(cels)) (cnSet <- genotype(cels, cdfName="genomewidesnp6", batch=batch)) ##Segment faults that occur with the above step can often be traced to a ##corrupt cel file. To check if any of the files are corrupt, try ##reading the files in one at a time: ## Not run: require(affyio) validCEL(cels) ## End(Not run) ## when gender is not specified (as in the above example), crlmm tries ## to predict the gender from SNPs on chromosome X cnSet$gender ## If gender is known, one should check that the assigned gender is ## correct. Alternatively, one can pass gender as an argument to the ## genotype function. gender <- c("female", "female", "male") gender[gender == "female"] <- 2 gender[gender == "male"] <- 1 dim(cnSet) table(isSnp(cnSet)) }
if (require(ff) & require(genomewidesnp6Crlmm) & require(hapmapsnp6)){ ldPath(tempdir()) path <- system.file("celFiles", package="hapmapsnp6") ## the filenames with full path... ## very useful when genotyping samples not in the working directory cels <- list.celfiles(path, full.names=TRUE) ## Note: one would need at least 10 CEL files for copy number estimation ## To use less RAM, specify a smaller argument to ocProbesets ocProbesets(50e3) batch <- rep("A", length(cels)) (cnSet <- genotype(cels, cdfName="genomewidesnp6", batch=batch)) ##Segment faults that occur with the above step can often be traced to a ##corrupt cel file. To check if any of the files are corrupt, try ##reading the files in one at a time: ## Not run: require(affyio) validCEL(cels) ## End(Not run) ## when gender is not specified (as in the above example), crlmm tries ## to predict the gender from SNPs on chromosome X cnSet$gender ## If gender is known, one should check that the assigned gender is ## correct. Alternatively, one can pass gender as an argument to the ## genotype function. gender <- c("female", "female", "male") gender[gender == "female"] <- 2 gender[gender == "male"] <- 1 dim(cnSet) table(isSnp(cnSet)) }
Preprocessing and genotyping of Illumina Infinium II arrays.
genotype.Illumina(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path=".", arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"), highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), XY=NULL, anno, genome, call.method="crlmm", trueCalls=NULL, cdfName, copynumber=TRUE, batch=NULL, saveDate=FALSE, stripNorm=TRUE, useTarget=TRUE, quantile.method="between", nopackage.norm="quantile", mixtureSampleSize=10^5, fitMixture=TRUE, eps=0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7) crlmmIllumina(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path=".", arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"), highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), XY=NULL, anno, genome, call.method="crlmm", trueCalls=NULL, cdfName, copynumber=TRUE, batch=NULL, saveDate=FALSE, stripNorm=TRUE, useTarget=TRUE, quantile.method="between", nopackage.norm="quantile", mixtureSampleSize=10^5, fitMixture=TRUE, eps=0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7)
genotype.Illumina(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path=".", arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"), highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), XY=NULL, anno, genome, call.method="crlmm", trueCalls=NULL, cdfName, copynumber=TRUE, batch=NULL, saveDate=FALSE, stripNorm=TRUE, useTarget=TRUE, quantile.method="between", nopackage.norm="quantile", mixtureSampleSize=10^5, fitMixture=TRUE, eps=0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7) crlmmIllumina(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path=".", arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"), highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), XY=NULL, anno, genome, call.method="crlmm", trueCalls=NULL, cdfName, copynumber=TRUE, batch=NULL, saveDate=FALSE, stripNorm=TRUE, useTarget=TRUE, quantile.method="between", nopackage.norm="quantile", mixtureSampleSize=10^5, fitMixture=TRUE, eps=0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7)
sampleSheet |
|
arrayNames |
character vector containing names of arrays to be
read in. If |
ids |
vector containing ids of probes to be read in. If
|
path |
character string specifying the location of files to be read by the function |
arrayInfoColNames |
(used when |
highDensity |
logical (used when |
sep |
character string specifying separator used in .idat file names. |
fileExt |
list containing elements 'Green' and 'Red' which specify the .idat file extension for the Cy3 and Cy5 channels. |
XY |
|
anno |
data.frame containing SNP annotation information from
manifest and additional columns 'isSnp', 'position', 'chromosome'
and 'featureNames'. For use when |
genome |
character string specifying which genome is used in annotation |
call.method |
character string specifying the genotype calling algorithm to use ('crlmm' or 'krlmm'). |
trueCalls |
matrix specifying known Genotype calls(can contain some NAs) for a subset of samples and features (1 - AA, 2 - AB, 3 - BB). |
cdfName |
annotation package (see also |
copynumber |
'logical.' Whether to store copy number intensities with SNP output. |
batch |
character vector indicating the batch variable. Must be the same length as the number of samples. See details. |
saveDate |
'logical'. Should the dates from each .idat be saved with sample information? |
stripNorm |
'logical'. Should the data be strip-level normalized? |
useTarget |
'logical' (only used when |
quantile.method |
character string specifying the quantile normalization method to use ('within' or 'between' channels). |
nopackage.norm |
character string specifying normalization to be used when |
mixtureSampleSize |
Sample size to be use when fitting the mixture model. |
fitMixture |
'logical.' Whether to fit per-array mixture model. |
eps |
Stop criteria. |
verbose |
'logical.' Whether to print descriptive messages during processing. |
seed |
Seed to be used when sampling. Useful for reproducibility |
sns |
The sample identifiers. If missing, the default sample names are |
probs |
'numeric' vector with priors for AA, AB and BB. |
DF |
'integer' with number of degrees of freedom to use with t-distribution. |
SNRMin |
'numeric' scalar defining the minimum SNR used to filter out samples. |
recallMin |
Minimum number of samples for recalibration. |
recallRegMin |
Minimum number of SNP's for regression. |
gender |
integer vector ( male = 1, female = 2 ) or missing, with same length as filenames. If missing, the gender is predicted. |
returnParams |
'logical'. Return recalibrated parameters from crlmm. |
badSNP |
'numeric'. Threshold to flag as bad SNP (affects batchQC) |
genotype.Illumina
(or equivalently crlmmIllumina
)
is a wrapper of the crlmm
function for genotyping.
Differences include (1) that the copy number probes (if present)
are also quantile-normalized and (2) the class of object returned
by this function, CNSet
, is needed for subsequent copy number
estimation. Note that the batch variable (a character string) has
no effect on the normalization or genotyping steps. Rather, batch
is required in order to initialize a CNSet
container with the
appropriate dimensions.
The new 'krlmm' option is available for certain chip types. Optional
argument trueCalls
matrix contains known Genotype calls
(1 - AA, 2 - AB, 3 - BB) for a subset of samples and features. This
will used to compute KRLMM coefficients by calling vglm
function
from VGAM
package.
The 'krlmm' method makes use of functions provided in parallel
package to speed up the process. It by default initialises up to
8 clusters. This is configurable by setting up an option named
"krlmm.cores", e.g. options("krlmm.cores" = 16).
In general, a chip specific annotation package is required to use the
genotype.Illumina
function. If this is not available (newer chip
types or custom chips often don't have a chip-specific package available
on Bioconductor), consider using cdfName
='nopackage' and specifying
anno
and genome
, which runs 'krlmm' on the samples available.
Here anno
is a data.frame read in from the relevant chip-specific
manifest, which must have additional columns 'isSnp' which is a logical that
indicates whether a probe is polymorphic or not, 'position', 'chromosome' and
'featureNames' that give the location on the chromosome and SNP name.
A SnpSuperSet
instance.
Matt Ritchie, Cynthia Liu, Zhiyin Dai
Ritchie ME, Carvalho BS, Hetrick KN, Tavar\'e S, Irizarry RA. R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009 Oct 1;25(19):2621-3.
Liu R, Dai Z, Yeager M, Irizarry RA1, Ritchie ME. KRLMM: an adaptive genotype calling method for common and low frequency variants. BMC Bioinformatics. 2014 May 23;15:158.
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007 Apr;8(2):485-99. Epub 2006 Dec 22. PMID: 17189563.
Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics. 2010 Jan 15;26(2):242-9.
## Not run: # example for 'crlmm' option library(ff) library(crlmm) ## to enable paralellization, set to TRUE if(FALSE){ library(snow) library(doSNOW) ## with 10 workers cl <- makeCluster(10, type="SOCK") registerDoSNOW(cl) } ## path to idat files datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k" ## read in your samplesheet samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE) samplesheet <- samplesheet[-c(28:46,61:75,78:79), ] arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"])) arrayInfo <- list(barcode=NULL, position="SentrixPosition") cnSet <- genotype.Illumina(sampleSheet=samplesheet, arrayNames=arrayNames, arrayInfoColNames=arrayInfo, cdfName="human370v1c", batch=rep("1", nrow(samplesheet))) ## End(Not run) ## Not run: # example for 'krlmm' option library(crlmm) library(ff) # line below is an optional step for krlmm to initialise 16 workers # options("krlmm.cores" = 16) # read in raw X and Y intensities output by GenomeStudio's GenCall genotyping module XY = readGenCallOutput(c("HumanOmni2-5_4v1_FinalReport_83TUSCAN.csv","HumanOmni2-5_4v1_FinalReport_88CHB-JPT.csv"), cdfName="humanomni25quadv1b", verbose=TRUE) krlmmResult = genotype.Illumina(XY=XY, cdfName=ThiscdfName, call.method="krlmm", verbose=TRUE) # example for 'krlmm' option with known genotype call for some SNPs and samples library(VGAM) hapmapCalls = load("hapmapCalls.rda") # hapmapCalls should have rownames and colnames corresponding to XY featureNames and sampleNames krlmmResult = genotype.Illumina(XY=XY, cdfName=ThiscdfName, call.method="krlmm", trueCalls=hapmapCalls, verbose=TRUE) ## End(Not run)
## Not run: # example for 'crlmm' option library(ff) library(crlmm) ## to enable paralellization, set to TRUE if(FALSE){ library(snow) library(doSNOW) ## with 10 workers cl <- makeCluster(10, type="SOCK") registerDoSNOW(cl) } ## path to idat files datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k" ## read in your samplesheet samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE) samplesheet <- samplesheet[-c(28:46,61:75,78:79), ] arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"])) arrayInfo <- list(barcode=NULL, position="SentrixPosition") cnSet <- genotype.Illumina(sampleSheet=samplesheet, arrayNames=arrayNames, arrayInfoColNames=arrayInfo, cdfName="human370v1c", batch=rep("1", nrow(samplesheet))) ## End(Not run) ## Not run: # example for 'krlmm' option library(crlmm) library(ff) # line below is an optional step for krlmm to initialise 16 workers # options("krlmm.cores" = 16) # read in raw X and Y intensities output by GenomeStudio's GenCall genotyping module XY = readGenCallOutput(c("HumanOmni2-5_4v1_FinalReport_83TUSCAN.csv","HumanOmni2-5_4v1_FinalReport_88CHB-JPT.csv"), cdfName="humanomni25quadv1b", verbose=TRUE) krlmmResult = genotype.Illumina(XY=XY, cdfName=ThiscdfName, call.method="krlmm", verbose=TRUE) # example for 'krlmm' option with known genotype call for some SNPs and samples library(VGAM) hapmapCalls = load("hapmapCalls.rda") # hapmapCalls should have rownames and colnames corresponding to XY featureNames and sampleNames krlmmResult = genotype.Illumina(XY=XY, cdfName=ThiscdfName, call.method="krlmm", trueCalls=hapmapCalls, verbose=TRUE) ## End(Not run)
Assign diallelic genotypes at polymorphic markers
genotypeAffy(cnSet, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, badSNP = 0.7, returnParams = TRUE, verbose = TRUE)
genotypeAffy(cnSet, SNRMin = 5, recallMin = 10, recallRegMin = 1000, gender = NULL, badSNP = 0.7, returnParams = TRUE, verbose = TRUE)
cnSet |
An object of class |
SNRMin |
See |
recallMin |
See |
recallRegMin |
See |
gender |
See |
badSNP |
See |
returnParams |
See |
verbose |
Logical. |
Wrapper for crlmm genotyping.
Returns logical. SNP genotypes and confidence scores are written to
ff_matrix
objects.
R.Scharpf
Genotyping of Illumina Infinium II arrays. This function provides CRLMM/KRLMM genotypes and confidence scores for the the polymorphic markers and is a required step prior to copy number estimation.
genotypeInf(cnSet, mixtureParams, probs = rep(1/3, 3), SNRMin = 5, recallMin = 10, recallRegMin = 1000, verbose = TRUE, returnParams = TRUE, badSNP = 0.7, gender = NULL, DF = 6, cdfName, nopackage.norm="quantile", call.method="crlmm", trueCalls = NULL)
genotypeInf(cnSet, mixtureParams, probs = rep(1/3, 3), SNRMin = 5, recallMin = 10, recallRegMin = 1000, verbose = TRUE, returnParams = TRUE, badSNP = 0.7, gender = NULL, DF = 6, cdfName, nopackage.norm="quantile", call.method="crlmm", trueCalls = NULL)
cnSet |
An object of class |
mixtureParams |
|
probs |
'numeric' vector with priors for AA, AB and BB. |
SNRMin |
'numeric' scalar defining the minimum SNR used to filter out samples. |
recallMin |
Minimum number of samples for recalibration. |
recallRegMin |
Minimum number of SNP's for regression. |
verbose |
'logical.' Whether to print descriptive messages during processing. |
returnParams |
'logical'. Return recalibrated parameters from crlmm. |
badSNP |
'numeric'. Threshold to flag as bad SNP (affects batchQC) |
gender |
integer vector ( male = 1, female =2 ) or missing, with same length as filenames. If missing, the gender is predicted. |
DF |
'integer' with number of degrees of freedom to use with t-distribution. |
cdfName |
|
nopackage.norm |
character string specifying normalization to be used when |
call.method |
character string specifying the genotype calling algorithm to use ('crlmm' or 'krlmm'). |
trueCalls |
matrix specifying known Genotype calls for a subset of samples and features(1 - AA, 2 - AB, 3 - BB). |
The genotype calls and confidence scores are written to
file using ff
protocols for I/O. For the most part,
the calls and confidence scores can be accessed as though the
data is in memory through the methods snpCall
and
snpCallProbability
, respectively.
The genotype calls are stored using an integer representation: 1
- AA, 2 - AB, 3 - BB. Similarly, the call probabilities are
stored using an integer representation to reduce file size using
the transformation 'round(-1000*log2(1-p))', where p is the
probability. The function i2P
can be used to convert the
integers back to the scale of probabilities.
An optional trueCalls
argument can be provided to KRLMM method
which contains known genotype calls(can contain some NAs) for some
samples and SNPs. This will used to compute KRLMM parameters by calling
vglm
function from VGAM
package.
The KRLMM method makes use of functions provided in parallel
package to speed up the process. It by default initialises up to
8 clusters. This is configurable by setting up an option named
"krlmm.cores", e.g. options("krlmm.cores" = 16).
Logical. If the genotyping is completed, the value 'TRUE' is
returned. Note that assayData
elements 'call' and
'callProbability' are updated on disk. Therefore, the genotypes and
confidence scores can be retrieved using accessors for the
CNSet
class.
R. Scharpf
crlmm
, snpCall
,
snpCallProbability
, annotationPackages
## See the 'illumina_copynumber' vignette in inst/scripts of ## the source package
## See the 'illumina_copynumber' vignette in inst/scripts of ## the source package
The possible genotypes for an integer copy number (0-4).
genotypes(copyNumber, is.snp=TRUE)
genotypes(copyNumber, is.snp=TRUE)
copyNumber |
Integer (0-4 allowed). |
is.snp |
Logical. If TRUE, possible genotypes for a polymorphic SNP is returned. If FALSE, only monomorphic genotypes returned. |
Character vector.
R. Scharpf
for(i in 0:4) print(genotypes(i)) for(i in 0:4) print(genotypes(i, FALSE))
for(i in 0:4) print(genotypes(i)) for(i in 0:4) print(genotypes(i, FALSE))
BafLrrSetList
in Package crlmm ~~Constructors for BafLrrSetList
and OligoSetList
objects.
BafLrrSetList(object, ...) OligoSetList(object, ...)
BafLrrSetList(object, ...) OligoSetList(object, ...)
object |
A |
... |
Additional arguments |
Constructs a BafLrrSetList
object or a OligoSetList
object from an object of class CNSet
.
A BafLrrSetList
or OligoSetList
data(cnSetExample) oligoList <- OligoSetList(cnSetExample) ## only contains 1 chromosome, so list only has one element dims(oligoList) brList <- BafLrrSetList(cnSetExample) dims(brList)
data(cnSetExample) oligoList <- OligoSetList(cnSetExample) ## only contains 1 chromosome, so list only has one element dims(oligoList) brList <- BafLrrSetList(cnSetExample) dims(brList)
These functions plot the M-values (log-ratios) versus S-values (average intensities) for given SNP/(s) or sample/(s) or beanplots for M-values from different samples.
plotSNPs(cnSet, row=1, offset=0, xlim=c(9,16), ylim=c(-5,5), verbose=FALSE) plotSamples(cnSet, col=1, offset=0, xlim=c(9,16), ylim=c(-5,5), verbose=FALSE, sample=100000, seed=1, type="smoothScatter")
plotSNPs(cnSet, row=1, offset=0, xlim=c(9,16), ylim=c(-5,5), verbose=FALSE) plotSamples(cnSet, col=1, offset=0, xlim=c(9,16), ylim=c(-5,5), verbose=FALSE, sample=100000, seed=1, type="smoothScatter")
cnSet |
An object of class |
row |
scalar/vector of SNP indexes to plot |
col |
scalar/vector of sample indexes to plot |
offset |
numeric, offset to add to intensities in |
xlim |
the x limits of the plot |
ylim |
the y limits of the plot |
verbose |
'logical.' Whether to print descriptive messages during processing |
sample |
integer indicating the number of SNPs to sample for the plot |
seed |
integer seed for the random number generator to sample the SNPs |
type |
character vector specifying the type of sample plot (either 'smoothScatter' or 'beanplot') |
The plotSNPs
and plotSamples
functions plot the M and S
values derived from the cnSet
object.
One or more M vs S plot for plotSNPs
for a given SNP(/s)
or either a smoothed scatter plot of M vs S or a beanplot of the M-values
for a selected sample(/s) for plotSamples
.
Matt Ritchie and Cynthia Liu
## Not run: crlmmResult <- genotype.Illumina(sampleSheet=samples[1:10,], path=path, arrayInfoColNames=list(barcode=NULL, position="SentrixPosition"), saveDate=TRUE, cdfName="human370v1c") par(mfrow=c(2,2)) plotSamples(crlmmResult, col=1:4) plotSNPs(crlmmResult, row=1:4) ## End(Not run)
## Not run: crlmmResult <- genotype.Illumina(sampleSheet=samples[1:10,], path=path, arrayInfoColNames=list(barcode=NULL, position="SentrixPosition"), saveDate=TRUE, cdfName="human370v1c") par(mfrow=c(2,2)) plotSamples(crlmmResult, col=1:4) plotSNPs(crlmmResult, row=1:4) ## End(Not run)
Calculate the posterior probability for integer copy numbers using the bivariate normal prediction regions.
posteriorProbability(object, predictRegion, copyNumber = 0:4, w)
posteriorProbability(object, predictRegion, copyNumber = 0:4, w)
object |
A |
predictRegion |
A list containing the bivariate normal prediction region for each of the possible genotypes. |
copyNumber |
Integer vector. |
w |
numeric vector of prior probabilities for each of the copy
number states. Must be the same length as |
This is currently under development.
An array (features x samples x copy number)
This is under development. Use at your own risk.
R. Scharpf
data(cnSetExample) pr <- predictionRegion(cnSetExample, copyNumber=0:4) pp <- posteriorProbability(cnSetExample, predictRegion=pr) dim(pp) ## multiple batches data(cnSetExample2) pr <- predictionRegion(cnSetExample2, copyNumber=0:4) pp <- posteriorProbability(cnSetExample2, predictRegion=pr)
data(cnSetExample) pr <- predictionRegion(cnSetExample, copyNumber=0:4) pp <- posteriorProbability(cnSetExample, predictRegion=pr) dim(pp) ## multiple batches data(cnSetExample2) pr <- predictionRegion(cnSetExample2, copyNumber=0:4) pp <- posteriorProbability(cnSetExample2, predictRegion=pr)
Bivariate normal prediction regions for integer copy number. Copy numbers 0-4 allowed.
predictionRegion(object, copyNumber)
predictionRegion(object, copyNumber)
object |
A |
copyNumber |
Integer vector. 0-4 allowed. |
We fit a linear regression for each allele to the diallic genotype cluster medians. Denoting the background and slope by nu and phi, respectively, the mean for the bivariate normal prediction region is given by
mu_A = nu_A + CA * phi_A
and
mu_B nu_B + CB * phi_B
The variance and correlation of the normalized intensities is estimated from the diallelic genotype clusters AA, AB, and BB on the log-scale. For copy number not equal to two, we assume that the variance is approximately the same for copy number not equal to 2.
A list named by the genotype. ‘NULL’ refers to copy number zero, ‘A’ is a hemizygous deletion, etc. Each element is a list of the means (mu) and covariance (cov) for each marker stored as an array. For ‘mu’, the dimensions of the array are marker x allele (A or B) x batch. For ‘cov’, the dimensions of the array are marker x 3 (varA, cor, and varB) x batch.
R. Scharpf
Scharpf et al., 2011, Biostatistics.
posteriorProbability
, genotypes
data(cnSetExample) pr <- predictionRegion(cnSetExample, copyNumber=0:4) names(pr) ## bivariate normal prediction region for NULL genotype (homozygous deletion) str(pr[["NULL"]])
data(cnSetExample) pr <- predictionRegion(cnSetExample, copyNumber=0:4) names(pr) ## bivariate normal prediction region for NULL genotype (homozygous deletion) str(pr[["NULL"]])
"PredictionRegion"
A container for bivariate normal prediction regions for SNP data and univarite prediction regions for nonpolymorphic markers.
Objects from the class are created from the predictionRegion
function.
.Data
:Object of class "list"
~~
Class "list"
, from data part.
Class "vector"
, by class "list", distance 2.
Class "AssayData"
, by class "list", distance 2.
Class "list_or_ffdf"
, by class "list", distance 2.
Class vectorORfactor
, by class "list", distance 3.
signature(x = "PredictionRegion")
: ...
Prediction regions can be subset by markers.
R. Scharpf
showClass("PredictionRegion")
showClass("PredictionRegion")
This function normalizes the intensities for the 'A' and 'B'
alleles for a CNSet
object and estimates mixture
parameters used for subsequent genotyping. See details for
how the normalized intensities are written to file. This step
is required for subsequent genotyping and copy number
estimation.
preprocessInf(cnSet, sampleSheet=NULL, arrayNames = NULL, ids = NULL, path = ".", arrayInfoColNames = list(barcode = "SentrixBarcode_A", position = "SentrixPosition_A"), highDensity = TRUE, sep = "_", fileExt = list(green = "Grn.idat", red = "Red.idat"), XY, anno, saveDate = TRUE, stripNorm = TRUE, useTarget = TRUE, mixtureSampleSize = 10^5, fitMixture = TRUE, quantile.method="between", eps = 0.1, verbose = TRUE, seed = 1, cdfName)
preprocessInf(cnSet, sampleSheet=NULL, arrayNames = NULL, ids = NULL, path = ".", arrayInfoColNames = list(barcode = "SentrixBarcode_A", position = "SentrixPosition_A"), highDensity = TRUE, sep = "_", fileExt = list(green = "Grn.idat", red = "Red.idat"), XY, anno, saveDate = TRUE, stripNorm = TRUE, useTarget = TRUE, mixtureSampleSize = 10^5, fitMixture = TRUE, quantile.method="between", eps = 0.1, verbose = TRUE, seed = 1, cdfName)
cnSet |
object of class |
sampleSheet |
|
arrayNames |
character vector containing names of arrays to be
read in. If |
ids |
vector containing ids of probes to be read in. If
|
path |
character string specifying the location of files to be read by the function |
arrayInfoColNames |
(used when |
highDensity |
logical (used when |
sep |
character string specifying separator used in .idat file names. |
fileExt |
list containing elements 'Green' and 'Red' which specify the .idat file extension for the Cy3 and Cy5 channels. |
XY |
an |
anno |
data.frame containing SNP annotation information from
manifest and additional columns 'isSnp', 'position', 'chromosome'
and 'featureNames'. For use when |
saveDate |
'logical'. Should the dates from each .idat be saved with sample information? |
stripNorm |
'logical'. Should the data be strip-level normalized? |
useTarget |
'logical' (only used when |
mixtureSampleSize |
Sample size to be use when fitting the mixture model. |
fitMixture |
'logical.' Whether to fit per-array mixture model. |
quantile.method |
character string specifying the quantile normalization method to use ('within' or 'between' channels). |
eps |
Stop criteria. |
verbose |
'logical.' Whether to print descriptive messages during processing. |
seed |
Seed to be used when sampling. Useful for reproducibility |
cdfName |
|
The normalized intensities are written to disk using package
ff
protocols for writing/reading to disk. Note that the
object CNSet
containing the ff
objects in the
assayData
slot will be updated after applying this
function.
A ff_matrix
object containing parameters for fitting the
mixture model. Note that while the CNSet
object is not
returned by this function, the object will be updated as the
normalized intensities are written to disk. In particular,
after applying this function the normalized intensities in the
alleleA
and alleleB
elements of assayData
are now available.
R. Scharpf
CNSet-class
, A
, B
,
constructInf
, genotypeInf
, annotationPackages
## See the 'illumina_copynumber' vignette in inst/scripts of ## the source package
## See the 'illumina_copynumber' vignette in inst/scripts of ## the source package
This function reads the raw X and Y intensities output by GenomeStudio's GenCall genotyping module in preparation for genotyping with crlmm
.
readGenCallOutput(filenames, path=".", cdfName, colnames=list("SampleID"="Sample ID", "SNPID"="SNP Name", "XRaw"="X Raw", "YRaw"="Y Raw"), type=list("SampleID"="character", "SNPID"="character", "XRaw"="integer", "YRaw"="integer"), verbose=FALSE)
readGenCallOutput(filenames, path=".", cdfName, colnames=list("SampleID"="Sample ID", "SNPID"="SNP Name", "XRaw"="X Raw", "YRaw"="Y Raw"), type=list("SampleID"="character", "SNPID"="character", "XRaw"="integer", "YRaw"="integer"), verbose=FALSE)
filenames |
'character' string, or a vector of character string specifying the name of the file(s) to read in |
path |
'character' string specifying the location of file to be read by the function |
cdfName |
'character' defining the chip annotation (manifest) to use ('human370v1c', human550v3b', 'human650v3a', 'human1mv1c', 'human370quadv3c', 'human610quadv1b', 'human660quadv1a', 'human1mduov3b', 'humanomni1quadv1b', 'humanomniexpress12v1b', 'humancytosnp12v2p1h') |
colnames |
list containing elements 'SampleID', 'SNPID', 'XRaw' and 'YRaw', which specify the column names from in 'file' that pertain to these variables. The default should suffice in most situations. |
type |
list containing data types for the columns to be read in. The default should be fine in most situations. |
verbose |
'logical'. Should processing information be displayed as data is read in? |
This function returns an NChannelSet
containing raw intensity data (X and Y) from GenCall final report file.
It assumes the GenCall output is formatted to have samples listed one below the other, and that the columns 'X Raw' and 'Y Raw' are available in the file.
The function crlmmillumina() can be run on the output of the readGenCallOutput
function.
NChannelSet
containing X and Y bead intensities.
Cynthia Liu, Matt Ritchie, Zhiyin Dai
Ritchie ME, Carvalho BS, Hetrick KN, Tavar\'e S, Irizarry RA. R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009 Oct 1;25(19):2621-3.
#XY = readGenCallOutput(file="Hap650Yv3_Final_Report.txt", cdfName="human650v3a") #crlmmOut = crlmmIllumina(XY=XY)
#XY = readGenCallOutput(file="Hap650Yv3_Final_Report.txt", cdfName="human650v3a") #crlmmOut = crlmmIllumina(XY=XY)
Reads intensity information for each bead type from .idat files of Infinium II genotyping BeadChips
readIdatFiles(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path="", arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"), highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), saveDate=FALSE, verbose=FALSE)
readIdatFiles(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path="", arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"), highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), saveDate=FALSE, verbose=FALSE)
sampleSheet |
|
arrayNames |
character vector containing names of arrays to be
read in. If |
ids |
vector containing ids of probes to be read in. If
|
path |
character string specifying the location of files to be read by the function |
arrayInfoColNames |
(used when |
highDensity |
logical (used when |
sep |
character string specifying separator used in .idat file names. |
fileExt |
list containing elements 'Green' and 'Red' which specify the .idat file extension for the Cy3 and Cy5 channels. |
saveDate |
logical. Should the dates from each .idat be saved with sample information? |
verbose |
logical. Should processing information be displayed as data is read in? |
The summarised Cy3 (G) and Cy5 (R) intensities (on the orginal scale) are read in from the .idat files.
Where available, a sampleSheet
data.frame, in the same format
as used by BeadStudio (columns 'Sample\_ID', 'SentrixBarcode\_A' and
'SentrixPosition\_A' are required) which keeps track of sample
information can be specified.
Thanks to Keith Baggerly who provided the code to read in the binary .idat files.
NChannelSet with intensity data (R
, G
), and indicator
for SNPs with 0 beads (zero
) for each bead type.
Matt Ritchie
Ritchie ME, Carvalho BS, Hetrick KN, Tavar\'e S, Irizarry RA. R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009 Oct 1;25(19):2621-3.
#RG = readIdatFiles()
#RG = readIdatFiles()
SNPRMA will preprocess SNP chips. The preprocessing consists of quantile normalization to a known target distribution and summarization to the SNP-Allele level.
snprma(filenames, mixtureSampleSize = 10^5, fitMixture = FALSE, eps = 0.1, verbose = TRUE, seed = 1, cdfName, sns) snprma2(filenames, mixtureSampleSize = 10^5, fitMixture = FALSE, eps = 0.1, verbose = TRUE, seed = 1, cdfName, sns)
snprma(filenames, mixtureSampleSize = 10^5, fitMixture = FALSE, eps = 0.1, verbose = TRUE, seed = 1, cdfName, sns) snprma2(filenames, mixtureSampleSize = 10^5, fitMixture = FALSE, eps = 0.1, verbose = TRUE, seed = 1, cdfName, sns)
filenames |
'character' vector with file names. |
mixtureSampleSize |
Sample size to be use when fitting the mixture model. |
fitMixture |
'logical'. Fit the mixture model? |
eps |
Stop criteria. |
verbose |
'logical'. |
seed |
Seed to be used when sampling. |
cdfName |
cdfName: 'GenomeWideSnp\_5', 'GenomeWideSnp\_6' |
sns |
Sample names. |
'snprma2' allows one to genotype very large datasets (via ff package) and also permits the use of clusters or multiple cores (via snow package) to speed up preprocessing.
A |
Summarized intensities for Allele A |
B |
Summarized intensities for Allele B |
sns |
Sample names |
gns |
SNP names |
SNR |
Signal-to-noise ratio |
SKW |
Skewness |
mixtureParams |
Parameters from mixture model |
cdfName |
Name of the CDF |
if (require(genomewidesnp6Crlmm) & require(hapmapsnp6) & require(oligoClasses)){ path <- system.file("celFiles", package="hapmapsnp6") ## the filenames with full path... ## very useful when genotyping samples not in the working directory cels <- list.celfiles(path, full.names=TRUE) snprmaOutput <- snprma(cels) snprmaOutput[["A"]][1:10,] snprmaOutput[["B"]][1:10,] } ## Not run: ## HPC Example library(ff) library(snow) library(crlmm) ## genotype 50K SNPs at a time ocProbesets(50000) ## setup cluster - 8 cores on the machine setCluster(8, "SOCK") path <- system.file("celFiles", package="hapmapsnp6") cels <- list.celfiles(path, full.names=TRUE) snprmaOutput <- snprma2(cels) ## End(Not run)
if (require(genomewidesnp6Crlmm) & require(hapmapsnp6) & require(oligoClasses)){ path <- system.file("celFiles", package="hapmapsnp6") ## the filenames with full path... ## very useful when genotyping samples not in the working directory cels <- list.celfiles(path, full.names=TRUE) snprmaOutput <- snprma(cels) snprmaOutput[["A"]][1:10,] snprmaOutput[["B"]][1:10,] } ## Not run: ## HPC Example library(ff) library(snow) library(crlmm) ## genotype 50K SNPs at a time ocProbesets(50000) ## setup cluster - 8 cores on the machine setCluster(8, "SOCK") path <- system.file("celFiles", package="hapmapsnp6") cels <- list.celfiles(path, full.names=TRUE) snprmaOutput <- snprma2(cels) ## End(Not run)
Quantile normalize intensities for SNPs to a HapMap target reference distribution
snprmaAffy(cnSet, mixtureSampleSize = 10^5, eps = 0.1, seed = 1, verbose = TRUE)
snprmaAffy(cnSet, mixtureSampleSize = 10^5, eps = 0.1, seed = 1, verbose = TRUE)
cnSet |
Object of class |
mixtureSampleSize |
Sample size to be use when fitting the mixture model. |
eps |
Stop criteria. |
seed |
Seed to be used when sampling. |
verbose |
Logical. |
Returns nothing. Normalized intensities are written to files.
R.Scharpf
Supported annotation packages for crlmm genotyping
validCdfNames()
validCdfNames()
List of available annotation packages
character vector
R.Scharpf
validCdfNames()
validCdfNames()
Reads cel files and return an error if a file is not read
validCEL(celfiles) celDates(celfiles)
validCEL(celfiles) celDates(celfiles)
celfiles |
vector of cel file names to read |
Returns a message that cel files were successfully read, or an error if there were problems reading the cel files.
R. Scharpf
read.celfile.header,
POSIXt,
read.celfile
library(oligoClasses) if(require(hapmapsnp6)){ path <- system.file("celFiles", package="hapmapsnp6") cels <- list.celfiles(path, full.names=TRUE) validCEL(cels) celDates(cels) }
library(oligoClasses) if(require(hapmapsnp6)){ path <- system.file("celFiles", package="hapmapsnp6") cels <- list.celfiles(path, full.names=TRUE) validCEL(cels) celDates(cels) }
Plot prediction regions for integer copy number and normalized intensities.
xyplot(x, data, ...)
xyplot(x, data, ...)
x |
A |
data |
A |
... |
Additional arguments passed to |
A trellis
object.
R. Scharpf
library(oligoClasses) data(cnSetExample2) table(batch(cnSetExample2)) sample.index <- which(batch(cnSetExample2) == "CUPID") ## A single SNP pr <- predictionRegion(cnSetExample2[1:4, sample.index], copyNumber=0:4) gt <- calls(cnSetExample2[1:4, sample.index]) lim <- c(6,13) xyplot(B~A|snpid, data=cnSetExample2[1:4, sample.index], predictRegion=pr, panel=ABpanel, pch=21, fill=c("red", "blue", "green3")[gt], xlim=lim, ylim=lim) ## multiple SNPs, prediction regions for 3 batches ## Not run: tab <- table(batch(cnSetExample2)) bns <- names(tab)[tab > 50] sample.index <- which(batch(cnSetExample2) pr <- predictionRegion(cnSetExample2[1:10, sample.index], copyNumber=0:4) gt <- as.integer(calls(cnSetExample2[1:10, sample.index])) xyplot(B~A|snpid, data=cnSetExample2[1:10, sample.index], predictRegion=pr, panel=ABpanel, pch=21, fill=c("red", "blue", "green3")[gt], xlim=c(6,12), ylim=c(6,12)) ## nonpolymorphic markers data(cnSetExample2) tab <- table(batch(cnSetExample2)) bns <- names(tab)[tab > 50] sample.index <- which(batch(cnSetExample2) np.index <- which(!isSnp(cnSetExample2))[1:10] taus <- tau2(cnSetExample)[np.index, , , ] pr <- predictionRegion(cnSetExample2[np.index, sample.index], copyNumber=0:4) pp <- posteriorProbability(cnSetExample2[np.index, sample.index], predictRegion=pr, copyNumber=0:4) ## End(Not run)
library(oligoClasses) data(cnSetExample2) table(batch(cnSetExample2)) sample.index <- which(batch(cnSetExample2) == "CUPID") ## A single SNP pr <- predictionRegion(cnSetExample2[1:4, sample.index], copyNumber=0:4) gt <- calls(cnSetExample2[1:4, sample.index]) lim <- c(6,13) xyplot(B~A|snpid, data=cnSetExample2[1:4, sample.index], predictRegion=pr, panel=ABpanel, pch=21, fill=c("red", "blue", "green3")[gt], xlim=lim, ylim=lim) ## multiple SNPs, prediction regions for 3 batches ## Not run: tab <- table(batch(cnSetExample2)) bns <- names(tab)[tab > 50] sample.index <- which(batch(cnSetExample2) pr <- predictionRegion(cnSetExample2[1:10, sample.index], copyNumber=0:4) gt <- as.integer(calls(cnSetExample2[1:10, sample.index])) xyplot(B~A|snpid, data=cnSetExample2[1:10, sample.index], predictRegion=pr, panel=ABpanel, pch=21, fill=c("red", "blue", "green3")[gt], xlim=c(6,12), ylim=c(6,12)) ## nonpolymorphic markers data(cnSetExample2) tab <- table(batch(cnSetExample2)) bns <- names(tab)[tab > 50] sample.index <- which(batch(cnSetExample2) np.index <- which(!isSnp(cnSetExample2))[1:10] taus <- tau2(cnSetExample)[np.index, , , ] pr <- predictionRegion(cnSetExample2[np.index, sample.index], copyNumber=0:4) pp <- posteriorProbability(cnSetExample2[np.index, sample.index], predictRegion=pr, copyNumber=0:4) ## End(Not run)