Title: | Cross Platform Meta-Analysis of Microarray Data |
---|---|
Description: | Implements cross-platform and cross-species meta-analyses of Affymentrix, Illumina, and Agilent microarray data. This package automates common tasks such as downloading, normalizing, and annotating raw GEO data. The user then selects control and treatment samples in order to perform differential expression analyses for all comparisons. After analysing each contrast seperately, the user can select tissue sources for each contrast and specify any tissue sources that should be grouped for the subsequent meta-analyses. |
Authors: | Alex Pickering |
Maintainer: | Alex Pickering <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.33.0 |
Built: | 2024-10-30 06:45:58 UTC |
Source: | https://github.com/bioc/crossmeta |
Add expression data adjusted for pairs/surrogate variables
add_adjusted(eset, svobj = list(sv = NULL), numsv = 0)
add_adjusted(eset, svobj = list(sv = NULL), numsv = 0)
eset |
ExpressionSet |
svobj |
surrogate variable object |
numsv |
Number of surrogate variables to adjust for |
eset with adjusted
element added
User selects a tissue source for each contrast and indicates any sources that should be paired. This step is required if you would like to perform source-specific effect-size/pathway meta-analyses.
add_sources(diff_exprs, data_dir = getwd(), postfix = NULL)
add_sources(diff_exprs, data_dir = getwd(), postfix = NULL)
diff_exprs |
Previous result of |
data_dir |
String specifying directory of GSE folders. |
postfix |
Optional string to append to saved results. Useful if need to run multiple meta-analyses on the same series but with different contrasts. |
The Sources tab is used to add a source for each contrast. To do so: click the relevant contrast rows, search for a source in the Sample source dropdown box, and then click the Add button.
The Pairs tab is used to indicate sources that should be paired (treated as the same source for subsequent effect-size and pathway meta-analyses). To do so: select at least two sources from the Paired sources dropdown box, and then click the Add button.
For each GSE, analysis results with added sources/pairs are saved in the corresponding GSE
folder (in data_dir
) that was created by get_raw
.
Same as diff_expr
with added slots for each GSE in diff_exprs
:
sources |
Named vector specifying selected sample source for each contrast. Vector names identify the contrast. |
pairs |
List of character vectors indicating tissue sources that should be treated as the same source for subsequent effect-size and pathway meta-analyses. |
library(lydata) # load result of previous call to diff_expr: data_dir <- system.file("extdata", package = "lydata") gse_names <- c("GSE9601", "GSE34817") anals <- load_diff(gse_names, data_dir) # run shiny GUI to add tissue sources # anals <- add_sources(anals, data_dir)
library(lydata) # load result of previous call to diff_expr: data_dir <- system.file("extdata", package = "lydata") gse_names <- c("GSE9601", "GSE34817") anals <- load_diff(gse_names, data_dir) # run shiny GUI to add tissue sources # anals <- add_sources(anals, data_dir)
For microarray datasets duplicates exprs slot into vsd slot.
add_vsd(eset, rna_seq = TRUE)
add_vsd(eset, rna_seq = TRUE)
eset |
ExpressionSet with group column in |
rna_seq |
Is this an RNA-seq |
eset
with 'vsd'
assayDataElement
added.
Logic for Select Contrasts Interface
bulkPage(input, output, session, eset, gse_name, prev)
bulkPage(input, output, session, eset, gse_name, prev)
input , output , session
|
shiny module boilerplate |
eset |
ExpressionSet |
gse_name |
GEO accession for the series. |
prev |
Previous result of |
UI for Select Contrasts Interface
bulkPageUI(id)
bulkPageUI(id)
id |
The id string to be namespaced. |
After selecting control and test samples for each contrast, surrogate variable
analysis (sva
) and differential expression analysis is performed.
diff_expr( esets, data_dir = getwd(), annot = "SYMBOL", prev_anals = list(NULL), svanal = TRUE, recheck = FALSE, postfix = NULL, port = 3838 )
diff_expr( esets, data_dir = getwd(), annot = "SYMBOL", prev_anals = list(NULL), svanal = TRUE, recheck = FALSE, postfix = NULL, port = 3838 )
esets |
List of annotated esets. Created by |
data_dir |
String specifying directory of GSE folders. |
annot |
String, column name in fData common to all esets. For duplicated values in this column, the row with the highest interquartile range across selected samples will be kept. If meta-analysis will follow, appropriate values are "SYMBOL" (default - for gene level analysis) or, if all esets are from the same platform, "PROBE" (for probe level analysis). |
prev_anals |
Previous result of |
svanal |
Use surrogate variable analysis? Default is |
recheck |
Would you like to recheck previous group/contrast annotations? Requires
|
postfix |
Optional string to append to saved results. Useful if need to run multiple meta-analyses on the same series but with different contrasts. |
port |
See |
Click the Download icon and fill in the Group name column and optionally the Pairs column. Then save and upload the filled in metadata csv. After doing so, select a test and control group to compare and click the + icon to add the contrast. Repeat previous step to add additional contrasts. After control and test samples have been added for all contrasts that you wish to include, click the Done button. Repeat for all GSEs.
Paired samples (e.g. the same subject before and after treatment) can be specified by filling out the Pairs column before uploading the metadata.
For each GSE, analysis results are saved in the corresponding GSE
folder in data_dir
that was created by get_raw
. If analyses
needs to be repeated, previous results can be reloaded with load_diff
and supplied to the prev_anals
parameter. In this case, previous
selections, names, and pairs will be reused.
List of named lists, one for each GSE. Each named list contains:
pdata |
data.frame with phenotype data for selected samples.
Columns |
top_tables |
List with results of |
ebayes_sv |
Results of call to |
annot |
Value of |
library(lydata) # location of raw data data_dir <- system.file("extdata", package = "lydata") # gather GSE names gse_names <- c("GSE9601", "GSE15069", "GSE50841", "GSE34817", "GSE29689") # load first eset esets <- load_raw(gse_names[1], data_dir) # run analysis (opens GUI) # anals_old <- diff_expr(esets, data_dir) # re-run analysis on first eset prev <- load_diff(gse_names[1], data_dir) anals <- diff_expr(esets[1], data_dir, prev_anals = prev)
library(lydata) # location of raw data data_dir <- system.file("extdata", package = "lydata") # gather GSE names gse_names <- c("GSE9601", "GSE15069", "GSE50841", "GSE34817", "GSE29689") # load first eset esets <- load_raw(gse_names[1], data_dir) # run analysis (opens GUI) # anals_old <- diff_expr(esets, data_dir) # re-run analysis on first eset prev <- load_diff(gse_names[1], data_dir) anals <- diff_expr(esets[1], data_dir, prev_anals = prev)
Performs effect-size meta-analyses across all studies and seperately for each tissue source.
es_meta(diff_exprs, cutoff = 0.3, by_source = FALSE)
es_meta(diff_exprs, cutoff = 0.3, by_source = FALSE)
diff_exprs |
Previous result of |
cutoff |
Minimum fraction of contrasts that must have measured each gene. Between 0 and 1. |
by_source |
Should seperate meta-analyses be performed for each tissue
source added with |
Builds on zScores
function from GeneMeta by allowing for genes
that were not measured in all studies. This implementation also uses moderated unbiased
effect sizes calculated by effectsize
from metaMA and determines
false discovery rates using fdrtool
.
A list of named lists, one for each tissue source. Each list contains
two named data.frames. The first, filt
, has all the columns below for genes
present in cutoff or more fraction of contrasts. The second, raw
, has only
dprime
and vardprime
columns, but for all genes (NAs for genes
not measured by a given contrast).
dprime |
Unbiased effect sizes (one column per contrast). |
vardprime |
Variances of unbiased effect sizes (one column per contrast). |
mu |
Overall mean effect sizes. |
var |
Variances of overall mean effect sizes. |
z |
Overall z score = |
fdr |
False discovery rates calculated from column |
pval |
p-values calculated from column |
library(lydata) # location of data data_dir <- system.file("extdata", package = "lydata") # gather GSE names gse_names <- c("GSE9601", "GSE15069", "GSE50841", "GSE34817", "GSE29689") # load previous analysis anals <- load_diff(gse_names, data_dir) # add tissue sources to perform seperate meta-analyses for each source (optional) # anals <- add_sources(anals, data_dir) # perform meta-analysis es <- es_meta(anals, by_source = TRUE)
library(lydata) # location of data data_dir <- system.file("extdata", package = "lydata") # gather GSE names gse_names <- c("GSE9601", "GSE15069", "GSE50841", "GSE34817", "GSE29689") # load previous analysis anals <- load_diff(gse_names, data_dir) # add tissue sources to perform seperate meta-analyses for each source (optional) # anals <- add_sources(anals, data_dir) # perform meta-analysis es <- es_meta(anals, by_source = TRUE)
Converts M and A-values to log-expression values. The output matrix will have two columns for each array, in the order all red then all green. Adapted from plotDensities.MAList instead of exprs.MA so that order is same as phenoData.ch2.
exprs.MA(MA)
exprs.MA(MA)
MA |
an |
A numeric matrix with twice the columns of the input.
Uses filterByExpr to filter based on 'counts' assay or 'exprs' assay if 'counts' isn't available (for ARCHS4 data).
filter_genes(eset)
filter_genes(eset)
eset |
ExpressionSet with 'counts' assayDataElement and group column in pData |
filtered eset
# example ExpressionSet eset <- makeExampleCountsEset() eset <- filter_genes(eset)
# example ExpressionSet eset <- makeExampleCountsEset() eset <- filter_genes(eset)
Fit ebayes model
fit_ebayes( lm_fit, contrasts, robust = TRUE, trend = FALSE, allow.no.resid = FALSE )
fit_ebayes( lm_fit, contrasts, robust = TRUE, trend = FALSE, allow.no.resid = FALSE )
lm_fit |
Result of call to run_limma |
contrasts |
Character vector of contrasts to fit. |
robust |
logical, should the estimation of |
trend |
logical, should an intensity-dependent trend be allowed for the prior variance? If |
allow.no.resid |
Allow no residual degrees of freedom? if |
result of eBayes
Reads raw data files and tries to fix them up so that they can be loaded by read.ilmn.
fix_illum_headers(elist_paths, eset = NULL)
fix_illum_headers(elist_paths, eset = NULL)
elist_paths |
Path to Illumina raw data files. Usually contain patterns: non_normalized.txt, raw.txt, or _supplementary_.txt |
eset |
ExpressionSet from getGEO. |
Character vector for annotation
argument to read.ilmn. Fixed raw data files
are saved with filename ending in _fixed.txt
Downloads and unpacks microarray supplementary files from GEO. Files are stored in the supplied data directory under the GSE name.
get_raw(gse_names, data_dir = getwd())
get_raw(gse_names, data_dir = getwd())
gse_names |
Character vector of GSE names to download. |
data_dir |
String specifying directory for GSE folders. |
NULL (for download/unpack only).
get_raw("GSE41845")
get_raw("GSE41845")
Used by add_adjusted
to create model matrix with surrogate variables.
get_sva_mods(pdata)
get_sva_mods(pdata)
pdata |
|
List with model matrix(mod) and null model matrix (mod0) used for sva
.
Get top table
get_top_table( lm_fit, groups = c("test", "ctrl"), with.es = TRUE, robust = FALSE, trend = FALSE, allow.no.resid = FALSE )
get_top_table( lm_fit, groups = c("test", "ctrl"), with.es = TRUE, robust = FALSE, trend = FALSE, allow.no.resid = FALSE )
lm_fit |
Result of run_limma |
groups |
Test and Control group as strings. |
with.es |
Add |
robust |
logical, should the estimation of |
trend |
logical, should an intensity-dependent trend be allowed for the prior variance? If |
allow.no.resid |
Allow no residual degrees of freedom? if |
result of toptable
Get variance stabilized data for exploratory data analysis
get_vsd(eset)
get_vsd(eset)
eset |
ExpressionSet loaded with load_raw. |
matrix
with variance stabilized expression data.
Used to map human KEGG pathway numbers to names. Updated Feb 2017.
data(gs.names)
data(gs.names)
An object of class character
of length 310.
A named character vector of human KEGG pathway names. Names of vector are KEGG pathway numbers.
Genes for human KEGG pathways. Updated Feb 2017.
data(gslist)
data(gslist)
An object of class list
of length 310.
A named list with entrez ids of genes for human KEGG pathways. List names are KEGG pathway numbers.
Excludes probe ID cols
ilmn.nnum(elist_paths)
ilmn.nnum(elist_paths)
elist_paths |
Paths to raw illumina data files |
Number of numeric columns in elist_paths
excluding probe ID columns.
For rows with duplicated annot, highested IQR retained.
iqr_replicates(eset, annot = "SYMBOL", rm.dup = FALSE)
iqr_replicates(eset, annot = "SYMBOL", rm.dup = FALSE)
eset |
Annotated eset created by |
annot |
feature to use to remove replicates. |
rm.dup |
remove duplicates (same measure, multiple ids)? Used for Pathway analysis so that doesn't treat probes that map to multiple genes as distinct measures. |
Expression set with unique features at probe or gene level.
Load Agilent raw data
load_agil_plat(eset, gse_name, gse_dir, ensql)
load_agil_plat(eset, gse_name, gse_dir, ensql)
eset |
ExpressionSet from getGEO. |
gse_name |
Accession name for |
gse_dir |
Direction with Agilent raw data. |
ensql |
For development. Path to sqlite file with ENTREZID and SYMBOL columns created in data-raw/entrezdt. |
ExpressionSet
Loads previous differential expression analyses.
load_diff(gse_names, data_dir = getwd(), annot = "SYMBOL", postfix = NULL)
load_diff(gse_names, data_dir = getwd(), annot = "SYMBOL", postfix = NULL)
gse_names |
Character vector specifying GSE names to be loaded. |
data_dir |
String specifying directory of GSE folders. |
annot |
Level of previous analysis (e.g. "SYMBOL" or "PROBE"). |
postfix |
Optional string to append to saved results. Useful if need to run multiple meta-analyses on the same series but with different contrasts. |
Result of previous call to diff_expr
.
library(lydata) data_dir <- system.file("extdata", package = "lydata") gse_names <- c("GSE9601", "GSE34817") prev <- load_diff(gse_names, data_dir)
library(lydata) data_dir <- system.file("extdata", package = "lydata") gse_names <- c("GSE9601", "GSE34817") prev <- load_diff(gse_names, data_dir)
Loads and annotates raw data previously downloaded with get_raw
.
Supported platforms include Affymetrix, Agilent, and Illumina.
load_raw( gse_names, data_dir = getwd(), gpl_dir = "..", overwrite = FALSE, ensql = NULL )
load_raw( gse_names, data_dir = getwd(), gpl_dir = "..", overwrite = FALSE, ensql = NULL )
gse_names |
Character vector of GSE names. |
data_dir |
String specifying directory with GSE folders. |
gpl_dir |
String specifying parent directory to search for previously downloaded GPL.soft files. |
overwrite |
Do you want to overwrite saved esets from previous |
ensql |
For development. Path to sqlite file with ENTREZID and SYMBOL columns created in data-raw/entrezdt. |
List of annotated esets.
library(lydata) data_dir <- system.file("extdata", package = "lydata") eset <- load_raw("GSE9601", data_dir = data_dir)
library(lydata) data_dir <- system.file("extdata", package = "lydata") eset <- load_raw("GSE9601", data_dir = data_dir)
adapted from DESeq2::makeExampleDESeqDataSet
makeExampleCountsEset( n = 1000, m = 12, betaSD = 0, interceptMean = 4, interceptSD = 2, dispMeanRel = function(x) 4/x + 0.1, sizeFactors = rep(1, m) )
makeExampleCountsEset( n = 1000, m = 12, betaSD = 0, interceptMean = 4, interceptSD = 2, dispMeanRel = function(x) 4/x + 0.1, sizeFactors = rep(1, m) )
n |
number of rows |
m |
number of columns |
betaSD |
the standard deviation for non-intercept betas, i.e. beta ~ N(0,betaSD) |
interceptMean |
the mean of the intercept betas (log2 scale) |
interceptSD |
the standard deviation of the intercept betas (log2 scale) |
dispMeanRel |
a function specifying the relationship of the dispersions on
|
sizeFactors |
multiplicative factors for each sample |
eset <- makeExampleCountsEset()
eset <- makeExampleCountsEset()
Helper function to open raw Illumina microarray files in order to check that they are formatted correctly. For details on correct format, please see 'Checking Raw Illumina Data' in vignette.
open_raw_illum(gse_names, data_dir = getwd())
open_raw_illum(gse_names, data_dir = getwd())
gse_names |
Character vector of Illumina GSE names to open. |
data_dir |
String specifying directory with GSE folders. |
Character vector of successfully formated Illumina GSE names.
library(lydata) # Illumina GSE names illum_names <- c("GSE50841", "GSE34817", "GSE29689") # location of raw data data_dir <- system.file("extdata", package = "lydata") # open raw data files with default text editor # open_raw_illum(illum_names)
library(lydata) # Illumina GSE names illum_names <- c("GSE50841", "GSE34817", "GSE29689") # location of raw data data_dir <- system.file("extdata", package = "lydata") # open raw data files with default text editor # open_raw_illum(illum_names)
Construct AnnotatedDataFrame from Two-Channel ExpressionSet
phenoData.ch2(eset)
phenoData.ch2(eset)
eset |
ExpressionSet with |
AnnotatedDataFrame with twice as many rows as eset
, one for each channel of each array in order all red then all green.
Run prefix on Illumina raw data files
prefix_illum_headers(elist_paths)
prefix_illum_headers(elist_paths)
elist_paths |
Paths to raw Illumina data files |
Paths to fixed versions of elist_paths
Auto-named columns start with 'V' followed by the column number.
remove_autonamed(ex)
remove_autonamed(ex)
ex |
data.frame loaded with fread |
ex
with auto-named columns removed.
After selecting control and test samples for a contrast, surrogate variable
analysis (sva
) and linear model fitting with lmFit is performed.
run_limma( eset, annot = "SYMBOL", svobj = list(sv = NULL), numsv = 0, filter = TRUE )
run_limma( eset, annot = "SYMBOL", svobj = list(sv = NULL), numsv = 0, filter = TRUE )
eset |
Annotated eset created by |
annot |
String, column name in fData. For duplicated
values in this column, the row with the highest interquartile range
across selected samples will be kept. Appropriate values are |
svobj |
Surrogate variable analysis results. Returned from run_sva. |
numsv |
Number of surrogate variables to model. |
filter |
For RNA-seq. Should genes with low counts be filtered? dseqr shiny app performs this step
separately. Should be |
If analyses need to be repeated, previous results can be reloaded with readRDS
and supplied to the prev_anal
parameter. In this case, previous selections will be reused.
List with:
fit |
result of |
mod |
|
Setup ExpressionSet for running limma analysis
run_limma_setup(eset, prev)
run_limma_setup(eset, prev)
eset |
ExpressionSet |
prev |
previous result of call to diff_expr |
eset
ready for run_limma
Run surrogate variable analysis
run_sva(mods, eset, svanal = TRUE)
run_sva(mods, eset, svanal = TRUE)
mods |
result of get_sva_mods |
eset |
ExpressionSet |
svanal |
Should surrogate variable analysis be run? If |
Function is useful when number of samples makes manual selection with
diff_expr
error prone and time-consuming. This is often true
for large clinical data sets.
setup_prev(eset, contrasts)
setup_prev(eset, contrasts)
eset |
List containing one expression set with pData 'group' and 'pair'
(optional) columns. Name of |
contrasts |
Character vector specifying contrasts to analyse. Each
contrast must take the form "B-A" where both "B" and "A" are present in
|
List containing necessary information for prev_anal
parameter
of diff_expr
.
library(lydata) library(Biobase) # location of raw data data_dir <- system.file("extdata", package = "lydata") # load eset gse_name <- c("GSE34817") eset <- load_raw(gse_name, data_dir) # inspect pData of eset # View(pData(eset$GSE34817)) # if using RStudio head(pData(eset$GSE34817)) # otherwise # get group info from pData (differs based on eset) group <- pData(eset$GSE34817)$characteristics_ch1.1 # make group names concise and valid group <- gsub("treatment: ", "", group) group <- make.names(group) # add group to eset pData pData(eset$GSE34817)$group <- group # setup selections sel <- setup_prev(eset, contrasts = "LY-DMSO") # run differential expression analysis anal <- diff_expr(eset, data_dir, prev_anal = sel)
library(lydata) library(Biobase) # location of raw data data_dir <- system.file("extdata", package = "lydata") # load eset gse_name <- c("GSE34817") eset <- load_raw(gse_name, data_dir) # inspect pData of eset # View(pData(eset$GSE34817)) # if using RStudio head(pData(eset$GSE34817)) # otherwise # get group info from pData (differs based on eset) group <- pData(eset$GSE34817)$characteristics_ch1.1 # make group names concise and valid group <- gsub("treatment: ", "", group) group <- make.names(group) # add group to eset pData pData(eset$GSE34817)$group <- group # setup selections sel <- setup_prev(eset, contrasts = "LY-DMSO") # run differential expression analysis anal <- diff_expr(eset, data_dir, prev_anal = sel)
Function first maps entrez gene ids to homologous human entrez gene ids and then to hgnc symbols.
symbol_annot(eset, gse_name = "", ensql = NULL)
symbol_annot(eset, gse_name = "", ensql = NULL)
eset |
Expression set to annotate. |
gse_name |
GSE name for eset. |
ensql |
For development. Path to sqlite file with ENTREZID and SYMBOL columns created in data-raw/entrezdt. |
Initial entrez gene ids are obtained from bioconductor annotation data packages or from feature data of supplied expression set. Homologous human entrez ids are obtained from homologene and then mapped to hgnc symbols using org.Hs.eg.db. Expression set is expanded if 1:many mappings occur.
Expression set with hgnc symbols ("SYMBOL") and row names ("PROBE") added to fData slot.
library(lydata) # location of raw data data_dir <- system.file("extdata", package = "lydata") # load eset eset <- load_raw("GSE9601", data_dir)[[1]] # annotate eset (need if load_raw failed to annotate) eset <- symbol_annot(eset)
library(lydata) # location of raw data data_dir <- system.file("extdata", package = "lydata") # load eset eset <- load_raw("GSE9601", data_dir)[[1]] # annotate eset (need if load_raw failed to annotate) eset <- symbol_annot(eset)