Title: | Microarray QA and statistical data analysis for Applied Biosystems Genome Survey Microrarray (AB1700) gene expression data. |
---|---|
Description: | Automated pipline to perform gene expression analysis for Applied Biosystems Genome Survey Microarray (AB1700) data format. Functions include data preprocessing, filtering, control probe analysis, statistical analysis in one single function. A GUI interface is also provided. The raw data, processed data, graphics output and statistical results are organized into folders according to the analysis settings used. |
Authors: | Yongming Andrew Sun |
Maintainer: | Yongming Andrew Sun <[email protected]> |
License: | GPL |
Version: | 1.75.0 |
Built: | 2024-10-30 03:23:56 UTC |
Source: | https://github.com/bioc/ABarray |
(1) Read output from AB1700 software output; (2) Create raw data QA and associated plots including boxplot, control data signal plot; (3) Missing value calculation; (4) Create MA, scatter plot; (5) Perform quantile normalization; (6) Perform t test and fold change, or ANOVA (using separate function if more than 2 subgroups). (7) Create heatmap with hierarchical clustering. (8) The results are either in graphics or text files.
ABarray(dataFile, designFile, group, test = TRUE, impute = "avg", normMethod = "quantile", ...)
ABarray(dataFile, designFile, group, test = TRUE, impute = "avg", normMethod = "quantile", ...)
dataFile |
csv or tab delimit file contain expression measurement that are output from AB1700 software |
designFile |
Experiment design file, including information for sample type and additional phenotype information. |
group |
Specify which group statistical test will be performed on. The samples will be ordered according the group. |
test |
Specify whether to perform t test. By default, t test will be performed using specified group information. |
impute |
Treat flagged value (above 5000) as missing value, and impute the missing value. |
normMethod |
The method of normalizaiton. The default is "quantile". The following normMethods are supported: quantile, mean, median, trimMean, and trimAMean. If the parameter value is one of the supported normMethods, the analysis will be performed on the chosen method. If the parameter value is "all", the analysis will be performed on quantile only, but the normalization results will be produced for each of the normMethods. |
... |
Additional arguments. Use snThresh and/or detectSample to perform filtering. snThresh is the threshold of S/N value to be considered that the probe is detected (default value = 3, if snThresh is not specified). detectSample is used to determine if a probe should be included in statistical analysis (default value = 0.5, ie 50% of samples in any one subgroup). |
The function works on AB1700 software export data file. It expects certain file format to work. The rows of the file represent probes. The columns should contain these headings: probeID, geneID, Signal, S/N, Flag, and optionally SDEV, CV, AssayNormSignal (these values will be ignored in the process).
It is optional to have control probes. If they are present, plots will be generated for the control probes and they will be removed for further analysis.
It is required to have an experiment design file in certain format. The rows of the file are samples or arrays. The first column should be sampleName. Perhaps, sampleName should be concise and no spaces between characters. Second and third columns maybe assayName and arrayName (arrayName is optional). Additional columns should specify what type of samples. Note: It is best to have assayName the same as in dataFile.
Group name should be the same as in designFile. The samples will be ordered according the group information. The samples within the same subgroup will be ordered together. Only one group is accepted.
If test is TRUE (default), t test and ANOVA (if applicable) results will be produced.
If impute is avg (default), the signal values of the flagged probes will be imputed from average of the subgroup only if there are 2 or more values remaining in the subgroup.
Even if snThresh is not specified in the argument, snThresh is set to 3 by default. If a value other than 3 is desired (e.g., 2), put 'snThresh = 2' in the argument.
detectSample is also preset to a value = 0.5. This means that if a probe is detected in 50% or more samples in any subgroup within the group, it is included in statistical analysis. For example, if the group is named 'tissue', and there are 2 subgroups named 'lung' and 'liver', then, if a probe is detected in 50% or more samples in 'lung', it is included in the statistical analysis regardless the detectability in the other subgroup ('liver').
An ExpressionSet
object. The assayDataElement(eset, "exprs")
will be populated with normalized signals, assayDataElement(eset, "snDetect")
will be populated with S/N ratio values, and the phenoData
slot will
be populated with information from designFile
. Further analysis can
be performed on the ExpressionSet
object with various R and Bioconductor
packages.
Y Andrew Sun <[email protected]>
doPlotEset, doPlotFCT, doANOVA, matrixPlot, mvaPair2, doLPE, doVennDiagram, hclusterPlot
#- eset <- ABarray(dataFile, designFile, "sampleGroup") #- eset <- ABarray(dataFile, designFile, "group", detectSample = 0.8)
#- eset <- ABarray(dataFile, designFile, "sampleGroup") #- eset <- ABarray(dataFile, designFile, "group", detectSample = 0.8)
A front end GUI for ABarray package to perform data analysis.
ABarrayGUI()
ABarrayGUI()
The interface gathers required paramters for the ABarray packages to run. See ABarray for more details.
No return values.
Y Andrew Sun <[email protected]>
ABarray, doPlotEset, doPlotFCT, doANOVA, matrixPlot, mvaPair2, doLPE, doVennDiagram, hclusterPlot
#- ABarrayGUI()
#- ABarrayGUI()
Calculate S/N ratio summary for each group
calcsn(sn, snThresh, pdata, group, grpMember)
calcsn(sn, snThresh, pdata, group, grpMember)
sn |
S/N ratio data |
snThresh |
S/N threshold filtering |
pdata |
experiment design |
group |
which group should be calculated |
grpMember |
optional, members of the group |
data matrix
Y Andrew Sun
Calculate signal detection concordance between columns using S/N threshold (default = 3)
concord(sn, snThresh = 3)
concord(sn, snThresh = 3)
sn |
a matrix containing s/n ratio |
snThresh |
S/N threshold to use, default = 3 |
a matrix with the concordance
Y Andrew Sun
#-concordance <- concord(sn) ##- sn ratio matrix
#-concordance <- concord(sn) ##- sn ratio matrix
Calculate cv
cvv(data)
cvv(data)
data |
data matrix contain expression values |
vector of cv for each gene or probe
Yongming Sun
Plot CV value against average intensity (log2)
cvvPlot(data, name)
cvvPlot(data, name)
data |
vector of cv for each gene |
name |
name of the plot |
None
Yongming Sun
cvv cvv
to calulate cv
If only one factor is provided in parameter, one way ANOVA is performed. If two factors are provided, two way ANOVA is performed.
doANOVA(eset, group1, group2, snThresh = 3, detectSample = 0.5)
doANOVA(eset, group1, group2, snThresh = 3, detectSample = 0.5)
eset |
An |
group1 |
A factor name or labels to test on. If eset is an
|
group2 |
A factor name or labels to test on. |
snThresh |
Using probes detectable for ANOVA analysis, default S/N value is 3 or more to be considered detectable. |
detectSample |
The percentage of samples the probe is detected in order to be considered in ANOVA analysis. |
At least one group should be provided. If ExpressionSet
object
is used, group1 or group2 is the name of the sampleGroup defined in
experiment design file. If labels are to be used, they can be either
numeric or text, e.g., c(1,1,2,2,3,3) or c("treat1", "treat1", "treat2",
"treat2", "treat3", "treat3").
If the probe is detectable in 50% (default) or more samples in any one of the subgroup, it is included in the ANOVA analysis.
a vector if one way ANOVA; a matrix if two way ANOVA
Y Andrew Sun
#- one way ANOVA #- anova <- doANOVA(eset, "sampleGroup") #- two way ANOVA #- anova <- doANOVA(eset, "sampleGroup1", "sampleGoup2")
#- one way ANOVA #- anova <- doANOVA(eset, "sampleGroup") #- two way ANOVA #- anova <- doANOVA(eset, "sampleGroup1", "sampleGoup2")
The local pooled error test attempts to reduce dependence on the within-gene estimates in tests for differential expression, by pooling error estimates within regions of similar intensity. Note that with the large number of genes there will be genes with low within-gene error estimates by chance, so that some signal-to-noise ratios will be large regardless of mean expression intensities and fold-change. The local pooled error attempts to avert this by combining within-gene error estimates with those of genes with similar expression intensity.
doLPE(eset, group, member, name = "", snThresh = 3, detectSample = 0.5)
doLPE(eset, group, member, name = "", snThresh = 3, detectSample = 0.5)
eset |
an |
group |
which group should LPE be performed |
member |
optional. The member names in the group specified above |
name |
a prefix name for use when writing output to file |
snThresh |
S/N ratio threshold to use to define gene detectability |
detectSample |
percentage of samples detectable above snThresh to include in LPE test. The default is 50%. If the probe is detected in 50% or more samples in one of the subgroup, it is considered in LPE analysis |
The LPE test statistic numerator is the difference in medians between the two experimental conditions. The test statistic denominator is the combined pooled standard error for the two experimental conditions obtained by looking up the var.M from each baseOlig.error variance function. The conversion to p-values is based on the Gaussian distribution for difference if order statistics (medians). The user may select both the smoother degrees of freedom (smaller is smoother) and the trim percent to obtain a variance function to suit particular issues i.e. variability of genes with low expression intensity.
Dataframe
Y Andrew Sun
Bioconductor LPE package
##---- Some example usage ----
##---- Some example usage ----
Produce boxplot, MA plot, scatter plot, correlation, S/N detection concordance, CV, and t test, ANOVA test if subgroup is more than 2
doPlotEset(eset, group, name = "", snThresh = 3, test = TRUE, ...)
doPlotEset(eset, group, name = "", snThresh = 3, test = TRUE, ...)
eset |
an |
group |
name of the group from experiment design file |
name |
a name for use in output files for record purpose |
snThresh |
threshold of S/N considered detectable, default = 3 |
test |
whether t or ANOVA test should be performed |
... |
Additional arguments, currently not implemented |
The t test and fold change is performed with function
fctPlot
. See additional information with fctPlot
. ANOVA
is performed with doANOVA
.
If there are more than 2 subgroup in group
, t test and fold
change will be performed for each pair of subgroup and one way ANOVA
will be performed. If subgroup is 2, ANOVA will not be performed.
None. A number of plots and t or ANOVA test result file will be produced.
Y Andrew Sun
#-doPlotEset(eset, "sampleGroup") #-doPlotEset(eset, "sampleGroup", name = "perfect") #-doPlotEset(eset, "sampleGroup", test = FALSE) ##- t test will be not performed
#-doPlotEset(eset, "sampleGroup") #-doPlotEset(eset, "sampleGroup", name = "perfect") #-doPlotEset(eset, "sampleGroup", test = FALSE) ##- t test will be not performed
Calculate fold changes and p values from t test, and plot the results using preset FDR threshold
doPlotFCT(eset, group, grpMember, order1 = NULL, order2 = NULL, detectSample = 0.5, snThresh = 3, ...)
doPlotFCT(eset, group, grpMember, order1 = NULL, order2 = NULL, detectSample = 0.5, snThresh = 3, ...)
eset |
an |
group |
which group from experiment design should calculation and plot be performed |
grpMember |
optional group member within the group |
order1 |
optional, For a pairwise comparison the ordering of the first group of replicates |
order2 |
optional, For a pairwise comparison the ordering of the first group of replicates |
detectSample |
optional number between 0 and 1 to indicate the percentage of arrays should be above snThresh to include in the t test analysis. Default = 0.5. If the probe is detected in 50% or more samples on one of the subgroup, the probe is included in the t test, otherwise, it will be excluded in the t test |
snThresh |
optional S/N ratio threshold. Default = 3 |
... |
Additional argument, currently not implemented |
Group members are optional. For example, if group name is "tissue", and group members in experiment design file include "brain", "liver", "lung", "muscle". We could include c("brain", "liver") as group member for the parameter, then t test will be performed between "brain" and "liver", and "lung" "muscle" will be ignored. However, if we omit group member in the arguments, all tissue members will be used for t test. In this case, there will be 6 pairwise t test between each member of the group.
If order1 and order2 are specified then a paired sample t-test will be conducted between the groups, with the arrays in each group sorted according to the ordering specified. For example, if order1 is c(1,3,2) and order2 is c(1,2,3), then the sample pairing is a1-b1, a3-b2, a2-b3, with a and b are subgroup 1 and subgroup 2 within the group.
The fold changes are difference between averaged subgroup1 expression vs averaged subgroup2. If paired t test is performed, the fold changes are calculated using each paired difference and take an average of paired difference.
None. But a number of plot and result files will be produced.
Y Andrew Sun
#- doPlotFCT(eset, "sampleGroup", c("liver", "muscle")) #- For a paired t test #- doPlotFCT(eset, "sampleGroup", c("liver", "muscle"), order1 = c(1,2,3), order2 = c(1,3,2))
#- doPlotFCT(eset, "sampleGroup", c("liver", "muscle")) #- For a paired t test #- doPlotFCT(eset, "sampleGroup", c("liver", "muscle"), order1 = c(1,2,3), order2 = c(1,3,2))
Create Venn diagram from lists.
doVennDiagram(a, b, c = NULL, names, ...)
doVennDiagram(a, b, c = NULL, names, ...)
a |
a vector of first list |
b |
a vector of second list |
c |
a vector of third list, optional |
names |
a vector for the name of the set |
... |
additional graphical parameter |
The funciton will create Venn diagram. If two lists (a and b) are provided, two-way Venn diagram will produced. If three lists (a, b, and c) are provided, three-way Venn diagram will be produced.
This function depends on some functions of limma package, and is derived from limma package.
A plot of Venn diagram
Yongming Sun
Bioconductor limma package.
Drawing actual Venn diagram
drawVennDiagram(object, names, mar = rep(0.5, 4), cex = 1, ...)
drawVennDiagram(object, names, mar = rep(0.5, 4), cex = 1, ...)
object |
VennCounts object produced by |
names |
optional character vector giving names for the sets |
mar |
numeric vector of length 4 specifying the width of the margins around the plot. This argument is passed to par. |
cex |
numerical value giving the amount by which the contrast names should be scaled on the plot relative to the default.plotting text. See par. |
... |
any other arguments are passed to plot |
a plot of Venn Diagram
Yongming Sun
Bioconductor Limma package
##---- Do not call this function directly !! ----
##---- Do not call this function directly !! ----
From a group and its member name, return an ExpressionSet containing just these members
getMemberEset(eset, group, member)
getMemberEset(eset, group, member)
eset |
an |
group |
the name of the group which must be in the experiment design file |
member |
member name(s) in the above mentioned group |
an ExpressionSet
object
Yongming Sun
Given a list of probeID, attempt to find out panther classification information
getPantherMap(probeID, title, figDir)
getPantherMap(probeID, title, figDir)
probeID |
a list of probeIDs |
title |
the title for the figure to be generated |
figDir |
directory for the figures to be placed in |
None. Several figures will be generated.
Yongming Sun
plot clustering heatmap using correlation
hclusterPlot(expr, title, dist)
hclusterPlot(expr, title, dist)
expr |
matrix of gene expression value |
title |
the title for the plot |
dist |
whether to use correlation or distance for clustering, default to use Euclidean distance. Use dist = "Correlation" to cluster with correlation coefficient |
generating heatmap using correlation as distance
None. heatmap will be generated.
Y Andrew Sun
QC plot for internal control probes
icpPlot(controlData, colProbeID = 1, plotWhat = "Signal", pdfDir, jpgDir)
icpPlot(controlData, colProbeID = 1, plotWhat = "Signal", pdfDir, jpgDir)
controlData |
Signal intensity matrix for icp probes |
colProbeID |
the column where probeID is located |
plotWhat |
Whether we are plotting signal or S/N |
pdfDir |
a directory where pdf files should be produced |
jpgDir |
a directory where jpg or bmp files should be produced |
A series of QC plots
Yongming Sun
##---- Do not call this function DIRECTLY !! ----
##---- Do not call this function DIRECTLY !! ----
Perform imputation for missing values.
imputeFlag(rawSig, pd = NULL, group = "", impute = "avg")
imputeFlag(rawSig, pd = NULL, group = "", impute = "avg")
rawSig |
a matrix containing gene expression with missing values labeled as NA |
pd |
phenoData object |
group |
which group should average be performed |
impute |
choice of impute method, only avg (average) is implemented |
a list containing a matrix with the imputed values and rows that are imputed.
Y Andrew Sun
#-imputed <- imputeFlag(raw, pd, group = "tissue", impute = "avg") ##- sn ratio matrix
#-imputed <- imputeFlag(raw, pd, group = "tissue", impute = "avg") ##- sn ratio matrix
Perform Benjamini and Hochberg FDR adjustment on LPE results
lpe.fdr.BH(lpe.result, adjp = "BH")
lpe.fdr.BH(lpe.result, adjp = "BH")
lpe.result |
the result from LPE analysis |
adjp |
Type of adjustment, default "BH" |
Do not call this function directly. Called from doLPE
a matrix with original and ajusted p values
Yongming Sun
Bioconductor LPE package
##---- Do not call this function directly !! ----
##---- Do not call this function directly !! ----
plot MA from vectors A and M
mamaplot(A, M, idx, subset = sample(1:length(M), min(c(10000, length(M)))), span = 2/3, family.loess = "gaussian", cex = 2, ...)
mamaplot(A, M, idx, subset = sample(1:length(M), min(c(10000, length(M)))), span = 2/3, family.loess = "gaussian", cex = 2, ...)
A |
vector of average signal |
M |
vector of difference signal |
idx |
index for which S/N < 3 |
subset |
subset |
span |
span |
family.loess |
loess fit |
cex |
cex value |
... |
additional arguments |
MA plot
Modified from bioconductor affy package
Yongming Sun
bioconductor affy package
See Also as mvaPair2
##---- Do not call this function DIRECTLY !! ----
##---- Do not call this function DIRECTLY !! ----
Create heatmap from a matrix
matrixPlot(x, nrgcols = 50, rlabels = TRUE, clabels = TRUE, rcols = 1, ccols = 1, k = 10, title = "", ...)
matrixPlot(x, nrgcols = 50, rlabels = TRUE, clabels = TRUE, rcols = 1, ccols = 1, k = 10, title = "", ...)
x |
a matrix |
nrgcols |
number of colors to use |
rlabels |
whether to use row labels |
clabels |
whether to use column labels |
rcols |
use supplemental row label |
ccols |
use supplemental column label |
k |
number of tick labels for scale bar |
title |
title for the plot |
... |
additional argument |
This function can be used to plot any numberic matrix, e.g., correlation matrix, S/N matrix, signal intensity matrix, etc
heatmap
Yongming Sun
MA plot for each pair of columns
mvaPair2(x, y = NULL, snThresh = 3, labels = colnames(x), log.it = FALSE, span = 2/3, family.loess = "gaussian", digits = 3, line.col = 2, main = "MA plot", ... )
mvaPair2(x, y = NULL, snThresh = 3, labels = colnames(x), log.it = FALSE, span = 2/3, family.loess = "gaussian", digits = 3, line.col = 2, main = "MA plot", ... )
x |
expression matrix |
y |
S/N ratio matrix |
snThresh |
S/N threshold |
labels |
name for the labels |
log.it |
should data be log transformed |
span |
span of the plot |
family.loess |
curve fitting |
digits |
number of digits to display |
line.col |
size of the line col |
main |
title for the MA plot |
... |
additional argument |
If S/N ratio is available, probes with S/N < 3 in both array will be colored differently.
MA plot
Yongming Sun
##---- exprs expression matrix, sn s/n ratio !! ----
##---- exprs expression matrix, sn s/n ratio !! ----
Create correlation panel
panel.cor(x, y, digits=3, prefix="", cex.cor)
panel.cor(x, y, digits=3, prefix="", cex.cor)
x |
vector of expression value for one sample |
y |
vector of expression value for another sample |
digits |
number of digits to display the correlation |
prefix |
additional text to display |
cex.cor |
size of the text |
None
Yongming Sun
##---- Not intended for direct function call !! ----
##---- Not intended for direct function call !! ----
Create scatter plot
panel.scatter(x, y, col = "blue", bg = NA, pch = ".", cex = 1, col.smooth = "red", span = 2/3, iter = 3, ...)
panel.scatter(x, y, col = "blue", bg = NA, pch = ".", cex = 1, col.smooth = "red", span = 2/3, iter = 3, ...)
x |
vector of expression for one sample |
y |
vector of expression for another sample |
col |
color of points |
bg |
background colors |
pch |
pch paremeter |
cex |
size of text |
col.smooth |
color of smooth line |
span |
span of the plot |
iter |
iteration |
... |
additional arguments |
None
Yongming Sun
##---- Not intended for use this function directly !! ----
##---- Not intended for use this function directly !! ----
Perform quantile normalization between arrays
qnNormalize(eData, snr, method = 'quantile', snThresh = 3, ties = TRUE)
qnNormalize(eData, snr, method = 'quantile', snThresh = 3, ties = TRUE)
eData |
matrix of gene expression values |
snr |
Optional signal/noise ratio. Only used for trimAMean method |
method |
The normalization method desired. Default method is quantile |
snThresh |
Signal/noise threshold (default = 3) to indicate presence or absence of a probe signal |
ties |
handle values with same rank |
This function performs various normalization for the array data. The default is quantile normalization method (adapted from Bioconductor limma package). Other normalization methods include median, mean, trimMean (trimmed mean), trimAMean (mean with absent gene removed).
For the median normalizaiton, the median signal of each array is scaled to the same value (this value is calculated to equal to the median of all values in the data). The signal values for each array are then adjusted by the scaling factor.
For the mean normalization, the approach is similar to the median normalization procedure except that the mean signal of each array is scaled to the same value (this value is median of all signals in the data).
For the trimMean normalization, the approach is similar to the mean normalization except that the mean for each array is calculated after trimming the top and botton 5% of signals (a total of 10% of values).
For the trimAMean normalization, the signal values for absent probes are not considered. If the s/n of a probe is less than snThresh (default = 3), the expression of the probe is considered not present (absent). The remaining values are then trimmed (top and botton 2.5%, a total of 5%), and the mean value for each array after trimming is scaled to the same value (median of all values in the data).
data matrix with quantile normalized data values
Yongming Sun
bioconductor limma package for quantile normalization
Generate color map for heatmap use
rgcolorsfunc(n = 50)
rgcolorsfunc(n = 50)
n |
number of colors to generate |
rgb color vector
Yongming Sun
## Do not call this function directly rgb <- rgcolorsfunc()
## Do not call this function directly rgb <- rgcolorsfunc()
save plot device to jpg image file
savejpg(x, width = 1024, height = 768)
savejpg(x, width = 1024, height = 768)
x |
file name to be saved to |
width |
The width for the figure in pixal |
height |
The height for the figure |
For windows version, it produce bmp formatted image, otherwise, produce jpg images.
Yongming Sun
Create a bar for heatmap scales
scaleColorBar(x, horizontal = FALSE, col = rgcolorsfunc(50), scale = 1:length(x), k = 10, cLen = 9, ...)
scaleColorBar(x, horizontal = FALSE, col = rgcolorsfunc(50), scale = 1:length(x), k = 10, cLen = 9, ...)
x |
vector of scales need to be plotted |
horizontal |
whether the bar is vertical or horizontal |
col |
color function |
scale |
scale of the bar |
k |
number of intervals on scale |
cLen |
length of columns |
... |
additional arguments |
none
Yongming Sun
##--- Do not call this function directly !! ----
##--- Do not call this function directly !! ----
Create summary information for S/N ratio for each sample group
snSummary(eset, snThresh = 3, group, grpMember)
snSummary(eset, snThresh = 3, group, grpMember)
eset |
an |
snThresh |
S/N ratio threshold to use, default = 3 |
group |
sample group |
grpMember |
sample group members, optional |
a matrix containing the number of samples with S/N >=3 for each probe
Yongming Sun