Title: | Analysis of big data from aCGH experiments using parallel computing and ff objects |
---|---|
Description: | Analysis and plotting of array CGH data. Allows usage of Circular Binary Segementation, wavelet-based smoothing (both as in Liu et al., and HaarSeg as in Ben-Yaacov and Eldar), HMM, GLAD, CGHseg. Most computations are parallelized (either via forking or with clusters, including MPI and sockets clusters) and use ff for storing data. |
Authors: | Ramon Diaz-Uriarte <[email protected]> and Oscar M. Rueda <[email protected]>. Wavelet-based aCGH smoothing code from Li Hsu <[email protected]> and Douglas Grove <[email protected]>. Imagemap code from Barry Rowlingson <[email protected]>. HaarSeg code from Erez Ben-Yaacov; downloaded from <http://www.ee.technion.ac.il/people/YoninaEldar/Info/software/HaarSeg.htm>. Code from ffbase <https://github.com/edwindj/ffbase> by Edwin de Jonge <[email protected]>, Jan Wijffels, Jan van der Laan. |
Maintainer: | Ramon Diaz-Uriarte <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.47.0 |
Built: | 2024-12-29 03:36:33 UTC |
Source: | https://github.com/bioc/ADaCGH2 |
A text file is split by individual columns, creating as
many files as individual columns. Splitting is used using multiple
cores, if available. Specified columns are renamed as ID.txt,
Chrom.txt, Pos.txt, and other columns can be deleted. The individual
files can then all be read from a given directory with
inputToADaCGH
.
cutFile(filename, id.col, chrom.col, pos.col, sep = "\t", cols = NULL, mc.cores = detectCores(), delete.columns = NULL, fork = FALSE)
cutFile(filename, id.col, chrom.col, pos.col, sep = "\t", cols = NULL, mc.cores = detectCores(), delete.columns = NULL, fork = FALSE)
filename |
Name of the text file with the data. The file contains
at least a column for probe ID, a column for Chromosome, and a column
for position, and one or more for data. The column is expected to have
a first row with an identifier. All columns are to be separated by a
single separating ccharacter |
id.col |
The number of the column (starting from 1) that should
be used as ID. This is the name before any columns are possibly
deleted (see argument |
chrom.col |
The number of the column (starting from 1) that
should be used as the identifier for Chromosome. This is the name
before any columns are possibly deleted (see argument
|
pos.col |
The number of the column (starting from 1) that should
be used as the position (the coordinates). This is the name before any
columns are possibly deleted (see argument |
cols |
The number of columns of the file. If not specified, we try to guess it. But guessing can be dangerous. |
sep |
The field or column separator, similar to |
mc.cores |
The number of processes to launch simultaneously. |
delete.columns |
The number of the columns (starting from 1) that you do not want to preserve. You probably do not want to have too many of these, and if you do you should note that the file is cut into pieces BEFORE dealing with the unwanted columns so many unwanted columns will mean that we are doing many unwanted calls to cut. |
fork |
Should we fork R processes, via mclapply, or just send several system commands from this R process. The default (FALSE) is probably the most reasonable option for large files. |
This function is unlikely to work under Windows unless MinGW or similar are installed (and even then it might not work). This function should work under Mac OS, and it does in the machines we've tried it, but it seems not to work on the BioC testing machine.
This function basically calls "head" and "awk" using system
,
and trying to divide all the jobs into as many cores as you specify
(argument cores
).
This function is used for it main effect: cutting a file into individual one-column files. These files are names "col_1.txt", "col_2.txt", etc, and there are three called "ID.txt", "Chrom.txt", "Pos.txt". The files are created in the current working directory.
As we move the files corresponding to "ID", "Chrom", and "Position", the stdout output is shown to the user (to check that things worked).
After calling this function, you can call
inputToADaCGH
.
Ramon Diaz-Uriarte [email protected]
## Read a tab separated file, and assign the first, ## second, and third positions to ID, Chrom, and Position if( (.Platform$OS.type == "unix") && (Sys.info()['sysname'] != "Darwin") ) { ## This will not work in Windows, and might, or might not, work under Mac fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.txt") cutFile(fnametxt, 1, 2, 3, sep = "\t") ## Verify we have ID, Chrom, Pos c("ID.txt", "Chrom.txt", "Pos.txt") %in% list.files(getwd()) ## verify some other column c("col_5.txt") %in% list.files(getwd()) ## Read a white space separated file, and assign the first, second, and ## third positions to ID, Chrom, and Position, but remove the fifth ## column fnametxt2 <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx-sp.txt") cutFile(fnametxt2, 1, 2, 3, sep = " ", delete.columns = 5) }
## Read a tab separated file, and assign the first, ## second, and third positions to ID, Chrom, and Position if( (.Platform$OS.type == "unix") && (Sys.info()['sysname'] != "Darwin") ) { ## This will not work in Windows, and might, or might not, work under Mac fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.txt") cutFile(fnametxt, 1, 2, 3, sep = "\t") ## Verify we have ID, Chrom, Pos c("ID.txt", "Chrom.txt", "Pos.txt") %in% list.files(getwd()) ## verify some other column c("col_5.txt") %in% list.files(getwd()) ## Read a white space separated file, and assign the first, second, and ## third positions to ID, Chrom, and Position, but remove the fifth ## column fnametxt2 <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx-sp.txt") cutFile(fnametxt2, 1, 2, 3, sep = " ", delete.columns = 5) }
A fictitious aCGH data set.
inputEx
inputEx
A data frame with 500 rows and 6 columns; the last three correspond to the aCGH data for three samples. There are data for five chromosomes. The same file is available in three formats: an RData file, a tab separated text file, and a space-separate text file.
Simulated data
Input data with CGH data are converted to several ff files and data checked for potential errors and location duplications.
inputToADaCGH(ff.or.RAM = "RAM", robjnames = c("cgh.dat", "chrom.dat", "pos.dat", "probenames.dat"), ffpattern = paste(getwd(), "/", sep = ""), MAList = NULL, cloneinfo = NULL, RDatafilename = NULL, textfilename = NULL, dataframe = NULL, path = NULL, excludefiles = NULL, cloneinfosep = "\t", cloneinfoquote = "\"", minNumPerChrom = 10, verbose = FALSE, mc.cores = floor(detectCores()/2))
inputToADaCGH(ff.or.RAM = "RAM", robjnames = c("cgh.dat", "chrom.dat", "pos.dat", "probenames.dat"), ffpattern = paste(getwd(), "/", sep = ""), MAList = NULL, cloneinfo = NULL, RDatafilename = NULL, textfilename = NULL, dataframe = NULL, path = NULL, excludefiles = NULL, cloneinfosep = "\t", cloneinfoquote = "\"", minNumPerChrom = 10, verbose = FALSE, mc.cores = floor(detectCores()/2))
ff.or.RAM |
Whether you want to store the output as |
robjnames |
Name of the objects that will be created in you use
|
ffpattern |
See argument |
MAList |
The name of an object of class You have to specify one, and only one, of |
cloneinfo |
A character vector with the full path to a file that
conforms to the characteristics of This is only needed if you use |
RDatafilename |
Name of data RData file that contains the data frame
with original, non-ff, data. Note: this is the name of the RData file
(possibly including path), NOT the name of the data frame. (For
that, look at The first three columns of the data frame are the IDs of the probes, the chromosome number, and the position, and all remaining columns contain the data for the arrays, one column per array. The names of the first three column do not matter, but the order does. Names of the remaining columns will be used if existing; otherwise, fake array names will be created. You have to specify one, and only one, of |
textfilename |
The name of a text file with the data. It should be a tab separated file, with a header. The first three columns of the data frame are the IDs of the probes, the chromosome number, and the position, and all remaining columns contain the data for the arrays, one column per array. The names of the first three column do not matter, but the order does. Names of the remaining columns will be used if existing; otherwise, fake array names will be created. You have to specify one, and only one, of |
dataframe |
The name of a data frame with the data. The first three columns of the data frame are the IDs of the probes, the chromosome number, and the position, and all remaining columns contain the data for the arrays, one column per array. The names of the first three column do not matter, but the order does. Names of the remaining columns will be used if existing; otherwise, fake array names will be created. |
path |
The name of the directory (the full path) to where each of
the individual, one-column, files are. We will read ALL of the files
in this directory, except for those listed under
All of the files are expected to be one-column text files, with a first row with a header. The header will not be used for "ID.txt", "Chrom.txt", or "Pos.txt", but the header will be used as the name of the array/subject for the CGH data files. You have to specify one, and only one, of |
excludefiles |
If you have specified |
cloneinfosep |
Argument to |
cloneinfoquote |
Argument to |
minNumPerChrom |
If any chromosome has fewer observations than minNumPerChrom the function will fail. This can help detect upstream pre-processing errors. |
verbose |
If TRUE, provide additional information that can be useful to debug problems. Right now it provides the list of files that will be read if using a directory. The default is FALSE. |
mc.cores |
The number of cores to use when reading files. This is
always 1 in Windows. See details about the number of cores in
|
If there are identical positions (in the same chromosome) a small random uniform variate is added to get unique locations.
We carry out several checks (e.g., no duplicated positions), but note that we DO NOT check for extremely large or small values, and this includes NOT CHECKING for infinite values.
Missing values are allowed in the data columns. However, we do not check for missing values in the ID, chromosome, or position columns, except if you are using as input an RData file or MA list. You better not have any missing values there; otherwise, things will break in strange ways. Why this inconsistency? Checking for missing values can consume a lot of resources (CPU and memory). If your are really huge, they will probably be stored as text files, and you are expected to use the appropriate tools there to filter (e.g., sed, awk, whatever). If they exist as an MA list or an RData file, they once fitted in RAM, so checking for these NAs is probably reasonable.
If you provide a text file as input (textfilename
), the reading
operation is carried out using read.table.ffdf
, to
allow for reading very large files. Using this option, however, does
not force you to produce as output ff
objects.
Commented examples of reading objects from limma are provided in the vignnette.
This function is used mainly for its side effects: writing either
several ff
files to the current working directory, or several
RAM objects (the usual, in memory, local, R objects). The actual names
are printed out.
Ramon Diaz-Uriarte [email protected]
cutFile
for obtaining files in the format needed if you read from a directory.
## Create a temp dir for storing output. ## (Not needed, but cleaner). dir.create("ADaCGH2_example_input_dir") originalDir <- getwd() setwd("ADaCGH2_example_input_dir") ## Sys.sleep(1) ## Get location (and full filename) of example data file fnameRData <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.RData") fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.txt") namepath <- system.file("example-datadir", package = "ADaCGH2") ## Read from RData and write to ff inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fnameRData) ## Read from text file and write to ff ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "ff", textfilename = fnametxt, mc.cores = 2) ## Read from text file and write to RAM ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "RAM", textfilename = fnametxt, mc.cores = 2) ## Read from a directory and write to ff ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "ff", path = namepath, mc.cores = 2) ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") ### Running in a separate process. Only makes sense ### if returning ff objects (ff.or.RAM = "ff") ### This example will not work on Windows ## Not run: mcparallel(inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fnameRData), silent = FALSE) tableChromArray <- mccollect() if(inherits(tableChromArray, "try-error")) { stop("ERROR in input data conversion") } ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") ## End(Not run) ### Try to prevent problems in R CMD check ## Sys.sleep(2) ### Delete temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_example_input_dir", recursive = TRUE) ## Sys.sleep(2)
## Create a temp dir for storing output. ## (Not needed, but cleaner). dir.create("ADaCGH2_example_input_dir") originalDir <- getwd() setwd("ADaCGH2_example_input_dir") ## Sys.sleep(1) ## Get location (and full filename) of example data file fnameRData <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.RData") fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.txt") namepath <- system.file("example-datadir", package = "ADaCGH2") ## Read from RData and write to ff inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fnameRData) ## Read from text file and write to ff ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "ff", textfilename = fnametxt, mc.cores = 2) ## Read from text file and write to RAM ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "RAM", textfilename = fnametxt, mc.cores = 2) ## Read from a directory and write to ff ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "ff", path = namepath, mc.cores = 2) ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") ### Running in a separate process. Only makes sense ### if returning ff objects (ff.or.RAM = "ff") ### This example will not work on Windows ## Not run: mcparallel(inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fnameRData), silent = FALSE) tableChromArray <- mccollect() if(inherits(tableChromArray, "try-error")) { stop("ERROR in input data conversion") } ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") ## End(Not run) ### Try to prevent problems in R CMD check ## Sys.sleep(2) ### Delete temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_example_input_dir", recursive = TRUE) ## Sys.sleep(2)
Convert ADaCGH2 output to a data frame that can be used as
input for CGHregions
. This function takes as
input the two possible types of input produced by the
pSegment
functions: either an ff object (and its
associated directory) or the names of the RAM objects (the usual, in
memory R objects) with the output, and chromosome, position, and probe
name information.
outputToCGHregions(ffoutput = NULL, directory = getwd(), output.dat = NULL, chrom.dat = NULL, pos.dat = NULL, probenames.dat = NULL)
outputToCGHregions(ffoutput = NULL, directory = getwd(), output.dat = NULL, chrom.dat = NULL, pos.dat = NULL, probenames.dat = NULL)
ffoutput |
The name of the ff object with the output from a call to a
|
directory |
The directory where the initial data transformation and the analysis have been carried out. It is a lot better if you just work on a single directory for a set of files. Otherwise, unless you keep very carefull track of where you do what, you will run into trouble. This is only relevant if you use an |
output.dat |
The name of the RAM object with the output from
a call to a |
chrom.dat |
The name of the RAM object with the chromosome
data information. See the help for |
pos.dat |
The name of the RAM object with the position data
information. See the help for |
probenames.dat |
The name of the RAM object with the probe
names. See the help for |
A data frame of 4 + k columns that can be used as input to the
CGHregions
function. The first four columns
are the probe name, the chromosome, the position and the position. The
last k columns are the calls for the k samples.
This function does NOT check if the calls are meaningfull. In
particular, you probably do NOT want to use this function when
pSegment
has been called using merging = "none".
Moreover, we do not check if there are missing values, and CGHregions
is likely to fail when there are NAs. Finally, we do not try to use
ff
objects, so using this function with very large objects will
probably fail.
Ramon Diaz-Uriarte [email protected]
## Get location (and full filename) of example data file ## We will read from a text file fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.txt") ################################## ##### ##### Using RAM objects ##### ################################## ## Read data into RAM objects ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "RAM", textfilename = fnametxt, mc.cores = 2) ## Run segmentation (e.g., HaarSeg) ## You might want to adapt mc.cores to your hardware haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", mc.cores = 2) forcghr <- outputToCGHregions(output.dat = haar.RAM.fork, chrom.dat = chrom.dat, pos.dat = pos.dat, probenames.dat = probenames.dat) ## Run CGHregions if(require(CGHregions)) { regions1 <- CGHregions(na.omit(forcghr)) regions1 } ################################## ##### ##### Using ff objects ##### ################################## if(.Platform$OS.type != "windows") { ## We do not want this to run in Windows the automated tests since ## issues with I/O. It should work, though, in interactive usage ## Create a temp dir for storing output. ## (Not needed, but cleaner). dir.create("ADaCGH2_cghreg_example_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_cghreg_example_tmp_dir") ## Sys.sleep(1) ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "ff", textfilename = fnametxt, mc.cores = 2) ## You might want to adapt mc.cores to your hardware haar.ff.fork <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", mc.cores = 2) forcghr.ff <- outputToCGHregions(ffoutput = haar.ff.fork) if(require(CGHregions)) { regions1 <- CGHregions(na.omit(forcghr.ff)) regions1 } ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") lapply(haar.ff.fork, delete) rm(haar.ff.fork) ### Delete all files and temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_cghreg_example_tmp_dir", recursive = TRUE) ## Sys.sleep(2) }
## Get location (and full filename) of example data file ## We will read from a text file fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.txt") ################################## ##### ##### Using RAM objects ##### ################################## ## Read data into RAM objects ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "RAM", textfilename = fnametxt, mc.cores = 2) ## Run segmentation (e.g., HaarSeg) ## You might want to adapt mc.cores to your hardware haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", mc.cores = 2) forcghr <- outputToCGHregions(output.dat = haar.RAM.fork, chrom.dat = chrom.dat, pos.dat = pos.dat, probenames.dat = probenames.dat) ## Run CGHregions if(require(CGHregions)) { regions1 <- CGHregions(na.omit(forcghr)) regions1 } ################################## ##### ##### Using ff objects ##### ################################## if(.Platform$OS.type != "windows") { ## We do not want this to run in Windows the automated tests since ## issues with I/O. It should work, though, in interactive usage ## Create a temp dir for storing output. ## (Not needed, but cleaner). dir.create("ADaCGH2_cghreg_example_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_cghreg_example_tmp_dir") ## Sys.sleep(1) ## You might want to adapt mc.cores to your hardware inputToADaCGH(ff.or.RAM = "ff", textfilename = fnametxt, mc.cores = 2) ## You might want to adapt mc.cores to your hardware haar.ff.fork <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", mc.cores = 2) forcghr.ff <- outputToCGHregions(ffoutput = haar.ff.fork) if(require(CGHregions)) { regions1 <- CGHregions(na.omit(forcghr.ff)) regions1 } ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") lapply(haar.ff.fork, delete) rm(haar.ff.fork) ### Delete all files and temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_cghreg_example_tmp_dir", recursive = TRUE) ## Sys.sleep(2) }
Produce PNG figures of segment plots (by chromosome) for aCGH segmentation results. Internal calls are parallelized for increased speed and we use ff objets to allow the handling of very large objects. The output can include files for creating HTML with imagemaps.
pChromPlot(outRDataName, cghRDataName, chromRDataName, probenamesRDataName = NULL, posRDataName = NULL, imgheight = 500, pixels.point = 3, pch = 20, colors = c("orange", "red", "green", "blue", "black"), imagemap = FALSE, typeParall = "fork", mc.cores = detectCores(), typedev = "default", certain_noNA = FALSE, loadBalance = TRUE, ...)
pChromPlot(outRDataName, cghRDataName, chromRDataName, probenamesRDataName = NULL, posRDataName = NULL, imgheight = 500, pixels.point = 3, pch = 20, colors = c("orange", "red", "green", "blue", "black"), imagemap = FALSE, typeParall = "fork", mc.cores = detectCores(), typedev = "default", certain_noNA = FALSE, loadBalance = TRUE, ...)
outRDataName |
The RAM object or the RData file name that contains the results
from the segmentation (as an Note that the type of object in As well, note that if you use RAM objects, you must use
|
cghRDataName |
The Rdata file name that contains the
If this is an |
chromRDataName |
The RData file name with the ff (short integer)
vector with the chromosome indicator, or the name of the RAM object
with the data. Function |
posRDataName |
The RData file name with the ff double vector with
the location (e.g., position in kbases) of each probe in the
chromosome, or the name of the RAM object with the data. Function
This argument is used for the spacing in the plots. If NULL, the x-axis goes from 1:number of probes in that chromosome. |
probenamesRDataName |
The RData file name with the vector with
the probe names or the RAM object. Function
|
imgheight |
Height of png image. See |
pixels.point |
Approximate number of pixels that each point takes; this determines also final figure size. With many probes per chromosome, you will want to make this a small value. |
pch |
The type of plotting symbol. See |
colors |
A five-element character vector with the colors for: probes without change, probes that have a "gained" status, probes that have a "lost" status, the line that connects (smoothed values of) probes, the horizontal line at the 0 level. |
imagemap |
If FALSE only the png figure is produced. If TRUE, for each array * chromosome, two additional files are produced: "pngCoord_ChrNN@MM" and "geneNames_ChrNN@MM", where "NN" is the chromosome number and "MM" is the array name. The first file contains the coordinates of the png and radius and the second the gene or probe names, so that you can easily produce an HTML imagemap. (Former versions of ADaCGH did this automatically with Python. In this version we include the Python files under "imagemap-example".) |
typeParall |
One of "fork" or "cluster". "fork" is unavailable in
Windows, and will lead to sequential execution. "cluster" requires
having set up a cluster before, with appropriate calls to
Using "fork" and "cluster" will lead to different schemes for parallelization. See the vignette. If you use |
mc.cores |
The number of cores used if |
.
typedev |
The device type. One of "cairo", "cairo-png", "Cairo", or "default". "Cairo" requires the Cairo package to be available, but might work with headless Linux server without png support, and might be a better choice with Mac OS. "default" chooses "Cairo" for Mac, and "cairo" otherwise. |
certain_noNA |
Are you certain, absolutely sure, your data contain
no missing values? (Default is FALSE). If you are, you can achieve
considerable speed ups by setting it to TRUE. See the help for this
option in |
loadBalance |
If TRUE (the default) use load balancing with MPI
(use |
... |
Additional arguments; not used. |
Used only for its side effects of producing PNG plots, stored in the
current working directory (getwd()
.)
Ramon Diaz-Uriarte [email protected]
##################################################### ### ### Using forking with RAM objects ### ##################################################### ### Note to windows users: under Windows, this will ### result in sequential execution, as forking is not ### available. ## Get example input data and create data objects data(inputEx) ## (this is not necessary, but is convenient; ## you could do the subsetting in the call themselves) cgh.dat <- inputEx[, -c(1, 2, 3)] chrom.dat <- as.integer(inputEx[, 2]) pos.dat <- inputEx[, 3] ## Segment with HaarSeg ## You might want to adapt mc.cores to your hardware haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", mc.cores = 2) ## You might want to adapt mc.cores to your hardware pChromPlot(haar.RAM.fork, cghRDataName = cgh.dat, chromRDataName = chrom.dat, posRDataName = pos.dat, imgheight = 350, mc.cores = 2) ## Not run: ##################################################### ### ### Using a cluster with ff objects and create imagemaps ### ##################################################### ## Create a temp dir for storing output dir.create("ADaCGH2_plot_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_plot_tmp_dir") ## Start a socket cluster. Change the appropriate number of CPUs ## for your hardware and use other types of clusters (e.g., MPI) ## if you want. cl2 <- makeCluster(4,"PSOCK") clusterSetRNGStream(cl2) setDefaultCluster(cl2) clusterEvalQ(NULL, library("ADaCGH2")) ## The following is not really needed if you create the cluster AFTER ## changing directories. But better to be explicit. wdir <- getwd() clusterExport(NULL, "wdir") clusterEvalQ(NULL, setwd(wdir)) ## Get input data in ff format ## (we loaded the RData above, but we need to find the full path ## to use it in the call to inputToADaCGH) fname <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.RData") inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fname) ## Segment with HaarSeg haar.ff.cluster <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", typeParall= "cluster") ## Save the output (an ff object) and plot save(haar.ff.cluster, file = "haar.ff.cluster.out.RData", compress = FALSE) pChromPlot(outRDataName = "haar.ff.cluster.out.RData", cghRDataName = "cghData.RData", chromRDataName = "chromData.RData", posRDataName = "posData.RData", probenamesRDataName = "probeNames.RData", imgheight = 350, imagemap = TRUE, typeParall= "cluster") ### Explicitly stop cluster stopCluster(NULL) ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") lapply(haar.ff.cluster, delete) rm(haar.ff.cluster) unlink("haar.ff.cluster.out.RData") ### Try to prevent problems in R CMD check ## Sys.sleep(2) ### Delete all png files and temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_plot_tmp_dir", recursive = TRUE) ## Sys.sleep(2) ## End(Not run) ### PNGs are in this directory getwd()
##################################################### ### ### Using forking with RAM objects ### ##################################################### ### Note to windows users: under Windows, this will ### result in sequential execution, as forking is not ### available. ## Get example input data and create data objects data(inputEx) ## (this is not necessary, but is convenient; ## you could do the subsetting in the call themselves) cgh.dat <- inputEx[, -c(1, 2, 3)] chrom.dat <- as.integer(inputEx[, 2]) pos.dat <- inputEx[, 3] ## Segment with HaarSeg ## You might want to adapt mc.cores to your hardware haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", mc.cores = 2) ## You might want to adapt mc.cores to your hardware pChromPlot(haar.RAM.fork, cghRDataName = cgh.dat, chromRDataName = chrom.dat, posRDataName = pos.dat, imgheight = 350, mc.cores = 2) ## Not run: ##################################################### ### ### Using a cluster with ff objects and create imagemaps ### ##################################################### ## Create a temp dir for storing output dir.create("ADaCGH2_plot_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_plot_tmp_dir") ## Start a socket cluster. Change the appropriate number of CPUs ## for your hardware and use other types of clusters (e.g., MPI) ## if you want. cl2 <- makeCluster(4,"PSOCK") clusterSetRNGStream(cl2) setDefaultCluster(cl2) clusterEvalQ(NULL, library("ADaCGH2")) ## The following is not really needed if you create the cluster AFTER ## changing directories. But better to be explicit. wdir <- getwd() clusterExport(NULL, "wdir") clusterEvalQ(NULL, setwd(wdir)) ## Get input data in ff format ## (we loaded the RData above, but we need to find the full path ## to use it in the call to inputToADaCGH) fname <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.RData") inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fname) ## Segment with HaarSeg haar.ff.cluster <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", typeParall= "cluster") ## Save the output (an ff object) and plot save(haar.ff.cluster, file = "haar.ff.cluster.out.RData", compress = FALSE) pChromPlot(outRDataName = "haar.ff.cluster.out.RData", cghRDataName = "cghData.RData", chromRDataName = "chromData.RData", posRDataName = "posData.RData", probenamesRDataName = "probeNames.RData", imgheight = 350, imagemap = TRUE, typeParall= "cluster") ### Explicitly stop cluster stopCluster(NULL) ### Clean up (DO NOT do this with objects you want to keep!!!) load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") lapply(haar.ff.cluster, delete) rm(haar.ff.cluster) unlink("haar.ff.cluster.out.RData") ### Try to prevent problems in R CMD check ## Sys.sleep(2) ### Delete all png files and temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_plot_tmp_dir", recursive = TRUE) ## Sys.sleep(2) ## End(Not run) ### PNGs are in this directory getwd()
These functions parallelize several segmentation algorithms and make their calling use the same conventions as for other methods.
pSegmentDNAcopy(cghRDataName, chromRDataName, merging = "MAD", mad.threshold = 3, smooth = TRUE, alpha=0.01, nperm=10000, p.method = "hybrid", min.width = 2, kmax=25, nmin=200, eta = 0.05,trim = 0.025, undo.splits = "none", undo.prune=0.05, undo.SD=3, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...) pSegmentHaarSeg(cghRDataName, chromRDataName, merging = "MAD", mad.threshold = 3, W = vector(), rawI = vector(), breaksFdrQ = 0.001, haarStartLevel = 1, haarEndLevel = 5, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = FALSE, ...) pSegmentHMM(cghRDataName, chromRDataName, merging = "mergeLevels", mad.threshold = 3, aic.or.bic = "AIC", typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...) pSegmentCGHseg(cghRDataName, chromRDataName, CGHseg.thres = -0.05, merging = "MAD", mad.threshold = 3, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...) pSegmentGLAD(cghRDataName, chromRDataName, deltaN = 0.10, forceGL = c(-0.15, 0.15), deletion = -5, amplicon = 1, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, GLADdetails = FALSE, loadBalance = TRUE, ...) pSegmentWavelets(cghRDataName, chromRDataName, merging = "MAD", mad.threshold = 3, minDiff = 0.25, minMergeDiff = 0.05, thrLvl = 3, initClusterLevels = 10, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...)
pSegmentDNAcopy(cghRDataName, chromRDataName, merging = "MAD", mad.threshold = 3, smooth = TRUE, alpha=0.01, nperm=10000, p.method = "hybrid", min.width = 2, kmax=25, nmin=200, eta = 0.05,trim = 0.025, undo.splits = "none", undo.prune=0.05, undo.SD=3, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...) pSegmentHaarSeg(cghRDataName, chromRDataName, merging = "MAD", mad.threshold = 3, W = vector(), rawI = vector(), breaksFdrQ = 0.001, haarStartLevel = 1, haarEndLevel = 5, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = FALSE, ...) pSegmentHMM(cghRDataName, chromRDataName, merging = "mergeLevels", mad.threshold = 3, aic.or.bic = "AIC", typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...) pSegmentCGHseg(cghRDataName, chromRDataName, CGHseg.thres = -0.05, merging = "MAD", mad.threshold = 3, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...) pSegmentGLAD(cghRDataName, chromRDataName, deltaN = 0.10, forceGL = c(-0.15, 0.15), deletion = -5, amplicon = 1, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, GLADdetails = FALSE, loadBalance = TRUE, ...) pSegmentWavelets(cghRDataName, chromRDataName, merging = "MAD", mad.threshold = 3, minDiff = 0.25, minMergeDiff = 0.05, thrLvl = 3, initClusterLevels = 10, typeParall = "fork", mc.cores = detectCores(), certain_noNA = FALSE, loadBalance = TRUE, ...)
cghRDataName |
The Rdata file name that contains the
If this is an Note that the type of object in |
chromRDataName |
The RData file name with the ff (short integer)
vector with the chromosome indicator, or the name of the in-memory
RAM R object with the data. Function
|
merging |
Merging method; for most methods one of "MAD" or "mergeLevels". For CBS (pSegmentDNAcopy), GGHseg (pSegmentCGHseg), HaarSeg (pSegmentHaarSeg), and Wavelets (as in Hsu et al. —pSegmentWavelets) also "none". This option does not apply to GLAD (which has its own merging-like approach). See details. |
mad.threshold |
If using |
typeParall |
One of "fork" or "cluster". "fork" is unavailable in
Windows, and will lead to sequential execution. "cluster" requires
having set up a cluster before, with appropriate calls to
Using "fork" and "cluster" will lead to different schemes for parallelization. See details and the vignette. |
mc.cores |
The number of cores used if |
.
certain_noNA |
Are you certain, absolutely sure, your data contain no missing values? (Default is FALSE). If you are, you can achieve considerable speed ups by setting it to TRUE. But if you set it to TRUE and you are wrong, some methods will fail (some with harder to understand error messages) and, even worse, other methods might appear to work (but give incorrect results). You've been warned. |
loadBalance |
If TRUE (the default for all methods except HaarSeg)
use load balancing with MPI (use
|
smooth |
For DNAcopy only. If TRUE (default) carry out smoothing as
explained in |
alpha |
For DNAcopy only. See |
nperm |
For DNAcopy only. See |
p.method |
For DNAcopy only. See |
min.width |
For DNAcopy only. See |
kmax |
For DNAcopy only. See |
nmin |
For DNAcopy only. See |
eta |
For DNAcopy only. See |
trim |
For DNAcopy only. See |
undo.splits |
For DNAcopy only. See |
undo.prune |
For DNAcopy only. See |
undo.SD |
For DNAcopy only. See |
W |
For HaarSeg: Weight matrix, corresponding to quality of measurment. Insert 1/(sigma**2) as weights if your platform output sigma as the quality of measurment. W must have the same size as I. |
rawI |
For HaarSeg. Mininum of the raw red and raw green measurment, before the log. rawI is used for the non-stationary variance compensation. rawI must have the same size as I. |
breaksFdrQ |
For HaarSeg. The FDR q parameter. Common used values are 0.05, 0.01, 0.001. Default value is 0.001. |
haarStartLevel |
For HaarSeg. The detail subband from which we start to detect peaks. The higher this value is, the less sensitive we are to short segments. The default is value is 1, corresponding to segments of 2 probes. |
haarEndLevel |
For HaarSeg. The detail subband until which we use to detect peaks. The higher this value is, the more sensitive we re to large trends in the data. This value DOES NOT indicate the largest possible segment that can be detected. The default is value is 5, corresponding to step of 32 probes in each direction. |
aic.or.bic |
For HMM. One of "AIC" or "BIC". |
CGHseg.thres |
The threshold for the adaptive penalization in Picard et al.'s CGHseg. See p. 13 of the original paper. Must be a negative number. The default value used in the original reference is -0.5. However, our experience with the simulated data in Willenbrock and Fridlyand (2005) indicates that for those data values around -0.005 are more appropriate. We use here -0.05 as default. |
deltaN |
Only for GLAD. See deltaN in |
forceGL |
Only for GLAD. See forceGL in |
deletion |
Only for GLAD. See deletion in |
amplicon |
Only for GLAD. See amplicon in
|
GLADdetails |
Only for GLAD. If set to TRUE the function returns verbose output about where it is along the execution. This option (setting it to FALSE) is likely to become hard-coded in the future. |
minMergeDiff |
Used only when doing merging in the wavelet
method of Hsu et al.. The finall call as to which segments go together is done by
a |
minDiff |
For Wavelets (Hsu et al.). Minimum (absolute) difference between the medians of two adjacent clusters for them to be considered truly different. Clusters "closer" together than this are collapsed together to form a single cluster. |
thrLvl |
The level used for the wavelet thresholding in Hsu et al. |
initClusterLevels |
For Wavelets (Hsu et al.). The initial number of clusters to form. |
... |
Additional arguments; not used. |
In most cases, these are wrappers to the original code, with
modifications for parallelization and for using ff
objects, if appropriate.
Using option typeParall = "fork"
will, as it says, use the
forking mechanism available in package parallel
. The objects used
can be either ff
objects or regular R objects. Using
typeParall = "cluster"
will use a pre-existing cluster, and the
objects used must be ff
ones, since we only pass pointers to the
objects, not the objects themselves, to try to minimize communication
and memory usage. To put it the other way around: if you use RAM
objects, you must use typeParall = "fork"
; with ff
objects
you can use both typeParall = "cluster"
and typeParall =
"fork"
. Further details are provided in the vignette.
For HMM, CGHseg, and Wavelets, the first part of the analysis is conducted parallelizing over array by chromosome (because the methods are slow and/or very memory consuming). The final step (merging), however, is carried out over array (it is a step that must be carried array-wise). For all other methods, we have parallelized over arrays: the extra communication overheads of the much finer-grained parallelization of array by chromosome are rarely justified with these methods and, in the case of GLAD, would require modifying the original C code.
CGHseg has been implemented here following the original authors description. Note that several publications incorrectly claim that they use the CGHseg approach when, actually, they are only using the "segment" function in the "tilingArray" package, but they are missing the key step of choosing the optimal number of segments (see p. 13 in Picard et al, 2005). We implement the author's method in our (internal, so use "ADaCGH2:::piccardsKO" to see it) function "piccardsKO".
When using GLAD, we use the HaarSeg approach. This is the same as using
the daglad
function with argument smoothfunc = "haarseg"
.
For HMM the smoothed results are merged, by default by the mergeLevels algorithm, as recommended in Willenbrock and Fridlyand, 2005. For DNAcopy the default used to be mergeLevels, following the above recommendations, but we are now using MAD by default, as it is much faster and it is unclear that mergeLevels is the right approach with the type of data available today. Your mileage might vary and you probably will want to try both on some test data and check which makes more sense.
Merging is also done in GLAD (with GLAD's own merging algorithm). For HaarSeg, calling/merging is carried out using MAD, following page i141 of Ben-Yaacov and Eldar, section 2.3, "Determining aberrant intervals": a MAD (per their definition) is computed and any segment with absolute value larger than mad.threshold * MAD is considered aberrant. Merging is also performed for CGHseg (the default, however, is MAD, not mergeLevels). Merging (using either of "mergeLevels" or "MAD") can also be used with the wavelet-based method of Hsu et al.; please note that the later is an experimental feature implemented by us, and there is no study of its performance.
In summary, for all segmentation methods (except GLAD) merging is available as either "mergeLevels" or "MAD". For DNAcopy, CGHseg, HaarSeg, and wavelets as in Hsu et al., you can also choose no merging, though this will rarely be what you want (we offer this option to allow using the original authors' choices in their first descriptions of methods).
When using mergeLevels, we map the results to states of "Alteration", so that we categorize each probe as taking one, and only one, of three possible values, -1 (loss of genomic DNA), 0 (no change in DNA content), +1 (gain of genomic DNA). We have made the assumption, in this mapping, that the "no change" class is the one that has the absolute value closest to zero, and any other classes are either gains or losses. When the data are normalized, the "no change" class should be the most common one. When using MAD this step is implicit in the procedure ( any segment with absolute value larger than mad.threshold * MAD is considered aberrant).
Note that "mergeLevels", in addition to being used for calling gains and losses, results in a decrease in the number of distinct smoothed values, since it can merge two or more adjacent smoothed levels. "MAD", in contrast, performs no merging as such, but only calling.
A list of two components (the components will be either
ff
or regular, in-memory R objects, depending on the input):
outSmoothed |
The smoothed values, as either a
|
outState |
The calls for each probe, as either an
|
If the output uses ffdf
, rows and columns of each
element can be accessed in the usual way for ffdf
objects, but accept also most of the usual R operations for data
frames.
The code for DNAcopy, HMM, and GLAD are basically wrappers
around the original functions by their corresponding authors, with some
modiffications for parallelization and usage of ff objects. The original
packages are: DNAcopy
, aCGH
, cgh
,
GLAD
, respectively. The CGHseg method uses package
tilingArray
.
HaarSeg has been turned into an R package, available from https://r-forge.r-project.org/projects/haarseg/. That package uses, at its core, the same R and C code as we do, from Ben-Yaacov and Eldar. We have not used the available R package for historical reasons (we used Eldar and Ben-Yaacov's C and R code in the former ADaCGH package, before a proper R package was available).
For the wavelet-based method we have only wrapped the code that was kindly provided by L. Hsu and D. Grove, and parallelized a few calls. Their original code is included in the sources of the package.
Parallelization and modifications for using ff and additions are by Ramon Diaz-Uriarte [email protected]
Diaz-Uriarte, R. (2014). ADaCGH2: parallelized analysis of (big) CNA data. Bioinformatics, 30: 1759–1761.
Carro A, Rico D, Rueda O M, Diaz-Uriarte R, and Pisano DG. (2010). waviCGH: a web application for the analysis and visualization of genomic copy number alterations. Nucleic Acids Research, 38 Suppl:W182–187.
Fridlyand, Jane and Snijders, Antoine M. and Pinkel, Dan and Albertson, Donna G. (2004). Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 90: 132–153.
Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6:211-26.
Hupe, P. and Stransky, N. and Thiery, J. P. and Radvanyi, F. and Barillot, E. (2004). Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20: 3413–3422.
Lingjaerde OC, Baumbusch LO, Liestol K, Glad I, Borresen-Dale AL. (2005). CGH-Explorer: a program for analysis of CGH-data. Bioinformatics, 21: 821–822.
Olshen, A. B. and Venkatraman, E. S. and Lucito, R. and Wigler, M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 4, 557–572. http://www.mskcc.org/biostat/~olshena/research.
Picard, F. and Robin, S. and Lavielle, M. and Vaisse, C. and Daudin, J. J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6, 27. http://dx.doi.org/10.1186/1471-2105-6-27.
Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJ. (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res., 33:3455-64.
Willenbrock, H. and Fridlyand, J. (2005). A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 4084–4091.
Diaz-Uriarte, R. and Rueda, O.M. (2007). ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data, PLoS ONE, 2: e737.
Ben-Yaacov, E. and Eldar, Y.C. (2008). A Fast and Flexible Method for the Segmentation of aCGH Data, Bioinformatics, 24: i139-i145.
##################################################### ### ### Using forking with RAM objects ### ##################################################### ### Note to windows users: under Windows, this will ### result in sequential execution, as forking is not ### available. ## Get example input data and create data objects data(inputEx) ## (this is not necessary, but is convenient; ## you could do the subsetting in the call themselves) cgh.dat <- inputEx[, -c(1, 2, 3)] chrom.dat <- as.integer(inputEx[, 2]) pos.dat <- inputEx[, 3] ## Segment with HaarSeg ## You might want to adapt mc.cores to your hardware haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", mc.cores = 2) ## What does the output look like? lapply(haar.RAM.fork, head) ## Where and what length are segments in first sample? rle(haar.RAM.fork$outSmoothed[, 1]) ## Repeat, without load-balancing ## You might want to adapt mc.cores to your hardware haar.RAM.fork.nlb <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", loadBalance = FALSE, mc.cores = 2) if(.Platform$OS.type != "windows") { ## We do not want this to run in Windows the automated tests since ## issues with I/O. It should work, though, in interactive usage ##################################################### ### ### Using forking with ff objects ### ##################################################### ### Note to windows users: under Windows, this will ### result in sequential execution, as forking is not ### available. ## Create a temp dir for storing output and ff objects. ## (Not needed, but cleaner). dir.create("ADaCGH2_example_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_example_tmp_dir") ## Get input data in ff format ## (we loaded the RData above, but we need to find the full path ## to use it in the call to inputToADaCGH) fname <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.RData") inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fname) ## Segment with HaarSeg ## You might want to adapt mc.cores to your hardware haar.ff.fork <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", mc.cores = 2) ## What does the output look like? haar.ff.fork ## Note the warnings; we will be gentler in next example. ##################################################### ### ### Using a cluster with ff objects ### ##################################################### ## Start a socket cluster. Change the appropriate number of CPUs ## for your hardware and use other types of clusters (e.g., MPI) ## if you want. cl2 <- parallel::makeCluster(2,"PSOCK") parallel::clusterSetRNGStream(cl2) parallel::setDefaultCluster(cl2) parallel::clusterEvalQ(NULL, library("ADaCGH2")) ## The following is not really needed if you create the cluster AFTER ## changing directories. But better to be explicit. wdir <- getwd() parallel::clusterExport(NULL, "wdir") parallel::clusterEvalQ(NULL, setwd(wdir)) ## Segment with HaarSeg haar.ff.cluster <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", typeParall= "cluster") ## Avoid warnings by opening the objects names(haar.ff.cluster) open(haar.ff.cluster$outSmoothed) open(haar.ff.cluster$outState) ## Alternatively, we can open the two ffdfs with lapply ## lapply(haar.ff.cluster, open) ########################################## ### ### Compare output (should be identical) ### ########################################## all.equal(haar.ff.cluster$outSmoothed[ , ], haar.ff.fork$outSmoothed[ , ]) all.equal(haar.ff.cluster$outSmoothed[ , ], haar.RAM.fork$outSmoothed[ , ]) identical(haar.ff.cluster$outState[ , ], haar.ff.fork$outState[ , ]) identical(haar.ff.cluster$outState[ , ], haar.RAM.fork$outState[ , ]) ##################################################################### #### #### Clean up actions #### #### (These are not needed. They are convenient here, to prevent #### leaving garbage in your hard drive. In "real life" you will #### have to decide what to delete and what to store). ##################################################################### ### Explicitly stop cluster parallel::stopCluster(cl2) ### All objects (RData and ff) are left in this directory getwd() ### We will clean it up, and do it step-by-step ### BEWARE: DO NOT do this with objects you want to keep!!! ## Remove ff and RData for the data load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") ## Remove ff and R objects with segmentation results lapply(haar.ff.fork, delete) rm(haar.ff.fork) lapply(haar.ff.cluster, delete) rm(haar.ff.cluster) ### Try to prevent problems in R CMD check ## Sys.sleep(2) ### Delete temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_example_tmp_dir", recursive = TRUE) ## Sys.sleep(2) }
##################################################### ### ### Using forking with RAM objects ### ##################################################### ### Note to windows users: under Windows, this will ### result in sequential execution, as forking is not ### available. ## Get example input data and create data objects data(inputEx) ## (this is not necessary, but is convenient; ## you could do the subsetting in the call themselves) cgh.dat <- inputEx[, -c(1, 2, 3)] chrom.dat <- as.integer(inputEx[, 2]) pos.dat <- inputEx[, 3] ## Segment with HaarSeg ## You might want to adapt mc.cores to your hardware haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", mc.cores = 2) ## What does the output look like? lapply(haar.RAM.fork, head) ## Where and what length are segments in first sample? rle(haar.RAM.fork$outSmoothed[, 1]) ## Repeat, without load-balancing ## You might want to adapt mc.cores to your hardware haar.RAM.fork.nlb <- pSegmentHaarSeg(cgh.dat, chrom.dat, merging = "MAD", loadBalance = FALSE, mc.cores = 2) if(.Platform$OS.type != "windows") { ## We do not want this to run in Windows the automated tests since ## issues with I/O. It should work, though, in interactive usage ##################################################### ### ### Using forking with ff objects ### ##################################################### ### Note to windows users: under Windows, this will ### result in sequential execution, as forking is not ### available. ## Create a temp dir for storing output and ff objects. ## (Not needed, but cleaner). dir.create("ADaCGH2_example_tmp_dir") originalDir <- getwd() setwd("ADaCGH2_example_tmp_dir") ## Get input data in ff format ## (we loaded the RData above, but we need to find the full path ## to use it in the call to inputToADaCGH) fname <- list.files(path = system.file("data", package = "ADaCGH2"), full.names = TRUE, pattern = "inputEx.RData") inputToADaCGH(ff.or.RAM = "ff", RDatafilename = fname) ## Segment with HaarSeg ## You might want to adapt mc.cores to your hardware haar.ff.fork <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", mc.cores = 2) ## What does the output look like? haar.ff.fork ## Note the warnings; we will be gentler in next example. ##################################################### ### ### Using a cluster with ff objects ### ##################################################### ## Start a socket cluster. Change the appropriate number of CPUs ## for your hardware and use other types of clusters (e.g., MPI) ## if you want. cl2 <- parallel::makeCluster(2,"PSOCK") parallel::clusterSetRNGStream(cl2) parallel::setDefaultCluster(cl2) parallel::clusterEvalQ(NULL, library("ADaCGH2")) ## The following is not really needed if you create the cluster AFTER ## changing directories. But better to be explicit. wdir <- getwd() parallel::clusterExport(NULL, "wdir") parallel::clusterEvalQ(NULL, setwd(wdir)) ## Segment with HaarSeg haar.ff.cluster <- pSegmentHaarSeg("cghData.RData", "chromData.RData", merging = "MAD", typeParall= "cluster") ## Avoid warnings by opening the objects names(haar.ff.cluster) open(haar.ff.cluster$outSmoothed) open(haar.ff.cluster$outState) ## Alternatively, we can open the two ffdfs with lapply ## lapply(haar.ff.cluster, open) ########################################## ### ### Compare output (should be identical) ### ########################################## all.equal(haar.ff.cluster$outSmoothed[ , ], haar.ff.fork$outSmoothed[ , ]) all.equal(haar.ff.cluster$outSmoothed[ , ], haar.RAM.fork$outSmoothed[ , ]) identical(haar.ff.cluster$outState[ , ], haar.ff.fork$outState[ , ]) identical(haar.ff.cluster$outState[ , ], haar.RAM.fork$outState[ , ]) ##################################################################### #### #### Clean up actions #### #### (These are not needed. They are convenient here, to prevent #### leaving garbage in your hard drive. In "real life" you will #### have to decide what to delete and what to store). ##################################################################### ### Explicitly stop cluster parallel::stopCluster(cl2) ### All objects (RData and ff) are left in this directory getwd() ### We will clean it up, and do it step-by-step ### BEWARE: DO NOT do this with objects you want to keep!!! ## Remove ff and RData for the data load("chromData.RData") load("posData.RData") load("cghData.RData") delete(cghData); rm(cghData) delete(posData); rm(posData) delete(chromData); rm(chromData) unlink("chromData.RData") unlink("posData.RData") unlink("cghData.RData") unlink("probeNames.RData") ## Remove ff and R objects with segmentation results lapply(haar.ff.fork, delete) rm(haar.ff.fork) lapply(haar.ff.cluster, delete) rm(haar.ff.cluster) ### Try to prevent problems in R CMD check ## Sys.sleep(2) ### Delete temp dir setwd(originalDir) ## Sys.sleep(2) unlink("ADaCGH2_example_tmp_dir", recursive = TRUE) ## Sys.sleep(2) }