Title: | An R package for subset-based association analysis of heterogeneous traits and subtypes |
---|---|
Description: | An R package for subset-based analysis of heterogeneous traits and disease subtypes. The package allows the user to search through all possible subsets of z-scores to identify the subset of traits giving the best meta-analyzed z-score. Further, it returns a p-value adjusting for the multiple-testing involved in the search. It also allows for searching for the best combination of disease subtypes associated with each variant. |
Authors: | Samsiddhi Bhattacharjee [aut, cre], Guanghao Qi [aut], Nilanjan Chatterjee [aut], William Wheeler [aut] |
Maintainer: | Samsiddhi Bhattacharjee <[email protected]> |
License: | GPL-2 + file LICENSE |
Version: | 2.25.0 |
Built: | 2024-10-30 03:30:51 UTC |
Source: | https://github.com/bioc/ASSET |
ASSET is a suite of statistical tools specifically designed to be powerful for pooling association signals across multiple studies when true effects may exist only in a subset of the studies and could be in opposite directions across studies. The method explores all possible subsets (or a restricted set if user specifies so) of studies and evaluates fixed-effect meta-analysis-type test-statistics for each subset. The final test-statistic is obtained by maximizing the subset-specific test-statistics over all possible subsets and then evaluating its significance after efficient adjustment for multiple-testing, taking into account the correlation between test-statistics across different subsets due to overlapping subjects. The method not only returns a p-value for significance for the overall evidence of association of a SNP across studies, but also outputs the "best subset" containing the studies that contributed to the overall association signal. For detection of association signals with effects in opposite directions, ASSET allows subset search separately for positively- and negatively- associated studies and then combines association signals from two directions using a chi-square test-statistic. The method can take into account correlation due to overlapping subjects across studies (e.g. shared controls). Although the method is originally developed for conducting genetic association scans, it can also be applied for analysis of non-genetic risk factors as well.
The package consists of two main functions: (1) h.traits
and (2) h.types
. The function h.traits
is suitable for conducting meta-analysis of studies of possibly different traits when summary level data are available from individual
studies. The function allows correlation among different studies/traits, which, for example, may arise due to shared subjects across studies.
The function can also be used to conduct "meta-analysis" across multiple correlated traits on the same individuals by appropriately specifying the correlation
matrix for the multivariate trait. The method, however, is not optimized yet (from a power perspective) for analyzing multivariate traits measured on the same
individuals. The function h.types
is suitable for analysis of case-control studies when cases consist of distinct disease subtypes. This
function assumes individual level data are available.
The functions h.summary
and h.forestPlot
are useful for summarizing results and displaying forest plots.
The helper functions
z.max
and p.dlm
are generic functions called internally for obtaining the maximized subset-based test-statistics and the corresponding
p-values approximated by the Discrete Local Maximization (DLM) method. These functions can be further customized for specific applications.
For example, the default options of these functions currently assume all possible subsets to search. For analysis of case-control studies with ordered
diseased subtypes (e.g. stages of a cancer), however, it may be more meaningful to restrict the subset search to incorporate ordering
constraints among the disease subtypes. In such a situation, one can pass a function argument sub.def
to z.max
and p.dlm
for performing restricted
subset searches.
Samsiddhi Bhattacharjee, Nilanjan Chatterjee and William Wheeler <[email protected]>
Samsiddhi Bhattacharjee, Preetha Rajaraman, Kevin B. Jacobs, William A. Wheeler, Beatrice S. Melin, Patricia Hartge, GliomaScan Consortium, Meredith Yeager, Charles C. Chung, Stephen J. Chanock, Nilanjan Chatterjee. A subset-based approach improves power and interpretation for combined-analysis of genetic association studies of heterogeneous traits. Am J Hum Genet, 2012, 90(5):821-35
Partition the set of traits into blocks of correlated traits using hierarchical clustering.
create_blocks(cormat, cor_thr = 0.2)
create_blocks(cormat, cor_thr = 0.2)
cormat |
A named k by k correlation matrix of z-statistics with row and column names being the names of studies/traits. |
cor_thr |
Threshold of correlation to divide the set of traits into blocks. Z-statistics with correlation less than |
A list of character vectors. Each element is a vector of study/trait names that fall into the same correlation block based on cormat
. Studies from different blocks are treated as independent. Studies that are not in the list are treated as independent from other studies.
data("ex_fast_asset", package="ASSET") block <- create_blocks(ldscintmat)
data("ex_fast_asset", package="ASSET") block <- create_blocks(ldscintmat)
Data for fast_asset
The object data contains estimated log odds-ratios and their standard errors
for 1 snp and 116 traits in betahat
and SE
. The trait names are in traits
and the snp name in SNP
. The Neff
vector contains effective sample sizes of the
individual traits. The ldscintmat
matrix contains correlation matrix between pairs of
traits as estimated using genome-wide LD-score regression.
h.traits, h.types
data(ex_fast_asset, package="ASSET") # Display the parts of the data objects head(betahat) head(SE) head(traits) SNP head(Neff) ldscintmat[1:6, 1:6]
data(ex_fast_asset, package="ASSET") # Display the parts of the data objects head(betahat) head(SE) head(traits) SNP head(Neff) ldscintmat[1:6, 1:6]
Data for h.traits
The object data contains estimated log odds-ratios and their standard errors for 5 SNPs and 6 traits. The matrices N00, N10, and N11 are the case-control overlap matrices.
h.types, fast_asset
data(ex_trait, package="ASSET") # Display the data, and case/control overlap matrices data N00 N11 N10
data(ex_trait, package="ASSET") # Display the data, and case/control overlap matrices data N00 N11 N10
Sample data for h.types
The data frame contains columns for disease subtype, study center, and 3 SNPs.
h.traits, fast_asset
data(ex_types, package="ASSET") # Display the first 10 rows of the data data[1:10, ]
data(ex_types, package="ASSET") # Display the first 10 rows of the data data[1:10, ]
This function implements a fast version of ASSET for analysis of a large number of traits. It accelerates the computation by restricting subset search among the traits with suggestive level of associations. The traits are selected by a liberal p-value threshold. The input is GWAS summary statistics for one SNP. See details for more information.
fast_asset( snp, traits.lab, beta.hat, sigma.hat, Neff, cor, block, scr_pthr = 0.05 )
fast_asset( snp, traits.lab, beta.hat, sigma.hat, Neff, cor, block, scr_pthr = 0.05 )
snp |
A character string giving the SNP name to be analyzed. No default. |
traits.lab |
A character vector giving the names/identifiers of the k studies/traits being analyzed. The order of this vector must match the columns of beta.hat and sigma.hat No default. |
beta.hat |
A numeric vector of length k giving the coefficients obtained from the analysis of that SNP across the k studies/traits. No default. |
sigma.hat |
A vector of same dimension as beta.hat, giving the corresponding standard errors. No default. |
Neff |
A numeric vector of same dimension as beta.hat, giving the effective sample size of each study. For continuous traits, the effective sample size is the total sample size; for binary traits, the effective sample size is Ncase*Ncontrol/(Ncase+Ncontrol). |
cor |
A matrix of dimension k by k storing the correlation of z-statistics under the null hypothesis. It can be estimated using LD score regression (https://github.com/bulik/ldsc). For example, element |
block |
A list of character vectors. Each element is a vector of study/trait names that fall into the same correlation block based on |
scr_pthr |
P-value threshold for pre-screening. Only traits with p-value < scr_pthr are retained. Default to 0.05. |
The standard ASSET (h.traits
) which searches through all subsets can be computationally intractable for analyzing a large number of traits. fast_asset
reduces the computational burden by the following procedure: (1) De-correlate the z-statistics associated with different traits using the correlation matrix (cor
); (2) select the set of traits using a liberal threshold (p<0.05 by default) of the de-correlated z-statistics; (3) adjust the de-correlated z-statistics for the pre-selection independently across different traits; (4) re-introduce the correlation using the matrix cor
and supply the adjusted statistics to standard ASSET.
A list of the same format as the output of h.traits
.
Guanghao Qi, Surya Chhetri, Elizaveta Naydanova, Debashree Ray, Diptavo Dutta, Alexis Battle, Samsiddhi Bhattacharjee and Nilanjan Chatterjee. "Genome-wide Multi-trait Analysis Across 116 Studies Identifies Complex Patterns of Pleiotropy and Unique Trait-Specific Loci". In preparation (2021).
data("ex_fast_asset", package="ASSET") block <- create_blocks(ldscintmat) test <- fast_asset(snp=SNP, traits.lab=traits, beta.hat=betahat, sigma.hat=SE , Neff=Neff, cor=ldscintmat, block=block, scr_pthr=0.05)
data("ex_fast_asset", package="ASSET") block <- create_blocks(ldscintmat) test <- fast_asset(snp=SNP, traits.lab=traits, beta.hat=betahat, sigma.hat=SE , Neff=Neff, cor=ldscintmat, block=block, scr_pthr=0.05)
Forest Plot for meta-analysis of heterogenerous traits or types.
h.forestPlot(rlist, snp.var, level=0.05, p.adj=TRUE, digits=2, calc.miss=FALSE)
h.forestPlot(rlist, snp.var, level=0.05, p.adj=TRUE, digits=2, calc.miss=FALSE)
rlist |
The list of results returned by |
snp.var |
A character string giving the name of the SNP variable to be plotted. No default. |
level |
Level for confidence intervals. Default is |
p.adj |
Logical. Whether to report Bonferroni adjusted p-values for each individual subtype. Default is TRUE. |
digits |
Number of significant digits to display the odds ratios in the plot. |
calc.miss |
Logical. If TRUE, first 3 components of rlist are calculated and filled (if NULL) so that all 3 summarizations can be shown in the forest plot. |
Forest plot for a SNP showing regression coefficients (e.g. log-odds-ratio for case-control studies) for individual studies/traits and confidence intervals, estimate of an overall regression coefficient and confidence interval based on standard fixed-effect meta-analysis and estimate of regression coefficient(s) and confidence intervals associated with the identified best subset(s).
# Use the example data data(ex_trait, package="ASSET") data # Define the input arguments to h.traits snps <- as.vector(data[, "SNP"]) traits.lab <- paste("Trait_", 1:6, sep="") beta.hat <- as.matrix(data[, paste(traits.lab, ".Beta", sep="")]) sigma.hat <- as.matrix(data[, paste(traits.lab, ".SE", sep="")]) cor <- list(N11=N11, N00=N00, N10=N10) ncase <- diag(N11) ncntl <- diag(N00) # Now let us call h.traits on these summary data. res <- h.traits(snps, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=cor, cor.numr=FALSE, search=NULL, side=2, meta=TRUE, zmax.args=NULL, meth.pval="DLM") h.forestPlot(res, "SNP_1", digits=3)
# Use the example data data(ex_trait, package="ASSET") data # Define the input arguments to h.traits snps <- as.vector(data[, "SNP"]) traits.lab <- paste("Trait_", 1:6, sep="") beta.hat <- as.matrix(data[, paste(traits.lab, ".Beta", sep="")]) sigma.hat <- as.matrix(data[, paste(traits.lab, ".SE", sep="")]) cor <- list(N11=N11, N00=N00, N10=N10) ncase <- diag(N11) ncntl <- diag(N00) # Now let us call h.traits on these summary data. res <- h.traits(snps, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=cor, cor.numr=FALSE, search=NULL, side=2, meta=TRUE, zmax.args=NULL, meth.pval="DLM") h.forestPlot(res, "SNP_1", digits=3)
This function produces summary results from subset-based association analysis.
h.summary(rlist, level = 0.05, digits = 3)
h.summary(rlist, level = 0.05, digits = 3)
rlist |
|
level |
Level for confidence intervals. Default is |
digits |
Number of significant digits to retain in odds ratios and confidence intervals in the summary table |
Returns a list of data frames containing p-values, odds-ratios, confidence intervals and the traits/types for each analysis.
The number of data frames in the list will depend on which function (h.traits
or h.types
)
was called and on the function options specified.
A list of data frames, one for each of the methods specified the original call of the functions h.traits
or h.types
. Each row of a data frame corresponds to a SNP and
the values include p-values for overall association (including component-wise p-values for two-sided search), names of phenotypes or disease subtypes included in
the best-subset, summary regression coefficients (e.g. log-odds-ratio for case-control studies) representing strength of association of a SNP with the
identified subset of traits/subtype and corresponding confidence intervals.
h.forestPlot
, h.traits
, h.types
# Use the example data data(ex_trait, package="ASSET") # Define the input arguments to h.traits snps <- as.vector(data[, "SNP"]) traits.lab <- paste("Trait_", 1:6, sep="") beta.hat <- as.matrix(data[, paste(traits.lab, ".Beta", sep="")]) sigma.hat <- as.matrix(data[, paste(traits.lab, ".SE", sep="")]) cor <- list(N11=N11, N00=N00, N10=N10) ncase <- diag(N11) ncntl <- diag(N00) # Now let us call h.traits on these summary data. res <- h.traits(snps, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=cor, cor.numr=FALSE, search=NULL, side=2, meta=TRUE, zmax.args=NULL, meth.pval="DLM") h.summary(res)
# Use the example data data(ex_trait, package="ASSET") # Define the input arguments to h.traits snps <- as.vector(data[, "SNP"]) traits.lab <- paste("Trait_", 1:6, sep="") beta.hat <- as.matrix(data[, paste(traits.lab, ".Beta", sep="")]) sigma.hat <- as.matrix(data[, paste(traits.lab, ".SE", sep="")]) cor <- list(N11=N11, N00=N00, N10=N10) ncase <- diag(N11) ncntl <- diag(N00) # Now let us call h.traits on these summary data. res <- h.traits(snps, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=cor, cor.numr=FALSE, search=NULL, side=2, meta=TRUE, zmax.args=NULL, meth.pval="DLM") h.summary(res)
Performs one-sided or two-sided subset-based meta-analysis of heteregeneous studies/traits.
h.traits(snp.vars, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=NULL, cor.numr=FALSE, search=NULL, side=2, meta=FALSE, zmax.args=NULL, meth.pval="DLM", pval.args=NULL)
h.traits(snp.vars, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=NULL, cor.numr=FALSE, search=NULL, side=2, meta=FALSE, zmax.args=NULL, meth.pval="DLM", pval.args=NULL)
snp.vars |
A character vector giving the SNP names to be analyzed. No default. |
traits.lab |
A character vector giving the names/identifiers of the |
beta.hat |
A matrix of dimension length( |
sigma.hat |
A vector or matrix of same dimension as beta.hat, giving the corresponding standard errors. No default. |
ncase |
The number of cases in each of the |
ncntl |
Same as |
cor |
Either a |
cor.numr |
Logical. When specified as TRUE the correlation information is used for optimal weighting of the studies in the definition of the meta-analysis test-statistics. The default is FALSE. In either case, the correlation is accounted for variance calculation of the meta-analysis test-statistic. |
search |
1, 2 or NULL. Search option 1 and 2 indicate one-sided and two-sided subset-searches respectively. The default option is NULL that automatically returns both one-sided and two-sided subset searches. Default is NULL. |
side |
Either 1 or 2. For two-tailed tests (where absolute values of Z-scores are maximized), |
meta |
Logical. When specified as TRUE, standard fixed effect meta-analysis results are returned together with results from subset-based meta-analysis. The Default is FALSE. |
zmax.args |
Optional arguments to be passed to |
pval.args |
Optional arguments to be passed to internal p-value calculation functions |
meth.pval |
The method of p-value computation. Currently the options are DLM (Discrete Local Maximum), IS(Exact Importance Sampling) and B (Bonferroni) with the default option being DLM. The IS method is currently computationally feasible for analysis of at most k=10 studies/traits |
The one-sided subset search maximizes the standard fixed-effect meta-analysis test-statistics over all possible subsets (or over a restricted set of
subsets if such an option is specified) of studies to detect the best possible association signals. The p-value returned for the maximum test-statistics automatically
accounts for multiple testing penalty due to subset search and can be taken as an evidence of an overall association for the SNP accross the k
studies/traits.
The one-sided method automatically guarantees identification of studies/traits that have associations in the same direction and thus is useful in applications where
it is desirable to identify SNPs that shows effects in the same direction across multiple traits/studies. The two-sided subset search, applies one-side subset search separately for positively and negatively associated traits for a given SNP and then combines the association
signals from two directions into a single combined chi-square type statistic. The method is sensitive in detecting SNPs that may be associated with different
traits in different directions.
The methods allow for accounting for correlation among studies/subject that might arise due to shared subjects across distinct studies or due to correlation
among related traits in the same study. For application of the method for meta-analysis of case-control studies, the matrices N11
, N10
and N00
denote the number subjects that are shared between studies by case-control status. By defintion, the
diagonals of the matrices N11
and N00
contain the number of cases and controls, respectively, in the k
studies. Also, by definition,
the diagonal of N10
is zero since cases cannot serve as controls and vice versa in the same study. The most common situation
may involve shared controls accross studies, ie non-zero off-diagonal elements of the matrix N00.
The output standard errors are approximate (based on inverting p-values) and are used for constructing confidence
intervals in h.summary
and h.forestPlot
.
A list containing 3 main component lists named:
(1) "Meta" (Results from standard fixed effect meta-analysis of all studies/traits).
This list is non-null when meta
is TRUE and contains 3 vectors named (pval, beta, sd) of length same as snp.vars.
(2) "Subset.1sided" (one-sided subset search):
This list is non-null when search
is NULL or 1 and contains, 4 vectors named (pval, beta, sd, sd.meta) of length same as snp.vars
and a logical matrix named "pheno" with one row for each snp and one column for each phenotype. For a particular SNP and phenotype, the entry
has "TRUE" if this phenotype was in the selected subset for that SNP. In the output, the p-value is automatically adjusted for multiple testing
due to subset search. The beta and sd correspond to the standard fixed-effect meta-analysis estimate and corresponding standard error estimate
for the regression coefficient of a SNP based only on those studies/traits that are included in the identified subset.
The vector sd.meta gives the meta-analysis standard errors for estimates of beta based on studies in the identified subset ignoring the randomness of the subset.
(3) Subset.2sided (two-sided subset search)
This list is non-null when search
is NULL or 2 and contains 9 vectors named
(pval, pval.1, pval.2, beta.1, sd.1, beta.2, sd.2, sd.1.meta, sd.2.meta) of length
same as snp.vars and two matrices named "pheno.1" and "pheno.2" giving logical indicators of a phenotype being among the positively or negatively
associated subsets (respectively) as identified by 2-sided subset search. In the output, while pval provides the significance of the overall test-statistics that
combined association signals from two directions, pval.1 and and pval.2 return the corresponding level of significance for each of the component one-sided
test-statistics in the positive and negative directions. The values (beta.1, sd.1, beta.2, sd.2) denote the corresponding meta-analysis estimate of regression coefficients
and standard errors for the identifed subsets of traits/studies that show association in positive and negative directions, respectively.
The vector sd.1.meta and sd.2.meta give the meta-analysis standard errors for estimates of beta based on studies in the identified subset ignoring the randomness of the subset in positive and negative directions, respectively.
The other objects in the list are the input arguments passed into h.traits
.
Samsiddhi Bhattacharjee, Preetha Rajaraman, Kevin B. Jacobs, William A. Wheeler, Beatrice S. Melin, Patricia Hartge, GliomaScan Consortium, Meredith Yeager, Charles C. Chung, Stephen J. Chanock, Nilanjan Chatterjee. A subset-based approach improves power and interpretation for combined-analysis of genetic association studies of heterogeneous traits. Am J Hum Genet, 2012, 90(5):821-35
# Use the example data data(ex_trait, package="ASSET") # Display the data, and case/control overlap matrices data N00 N11 N10 # Define the input arguments to h.traits snps <- as.vector(data[, "SNP"]) traits.lab <- paste("Trait_", 1:6, sep="") beta.hat <- as.matrix(data[, paste(traits.lab, ".Beta", sep="")]) sigma.hat <- as.matrix(data[, paste(traits.lab, ".SE", sep="")]) cor <- list(N11=N11, N00=N00, N10=N10) ncase <- diag(N11) ncntl <- diag(N00) # Now let us call h.traits on these summary data. res <- h.traits(snps, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=cor, cor.numr=FALSE, search=NULL, side=2, meta=TRUE, zmax.args=NULL, meth.pval="DLM") h.summary(res)
# Use the example data data(ex_trait, package="ASSET") # Display the data, and case/control overlap matrices data N00 N11 N10 # Define the input arguments to h.traits snps <- as.vector(data[, "SNP"]) traits.lab <- paste("Trait_", 1:6, sep="") beta.hat <- as.matrix(data[, paste(traits.lab, ".Beta", sep="")]) sigma.hat <- as.matrix(data[, paste(traits.lab, ".SE", sep="")]) cor <- list(N11=N11, N00=N00, N10=N10) ncase <- diag(N11) ncntl <- diag(N00) # Now let us call h.traits on these summary data. res <- h.traits(snps, traits.lab, beta.hat, sigma.hat, ncase, ncntl, cor=cor, cor.numr=FALSE, search=NULL, side=2, meta=TRUE, zmax.args=NULL, meth.pval="DLM") h.summary(res)
Subset-based analysis of case-control studies with heterogeneous disease subtypes.
h.types(dat, response.var, snp.vars, adj.vars, types.lab, cntl.lab, subset=NULL, method=NULL, side=2, logit=FALSE, test.type="Score", zmax.args=NULL, meth.pval="DLM", pval.args=NULL)
h.types(dat, response.var, snp.vars, adj.vars, types.lab, cntl.lab, subset=NULL, method=NULL, side=2, logit=FALSE, test.type="Score", zmax.args=NULL, meth.pval="DLM", pval.args=NULL)
dat |
A data frame containing individual level data for phenotype (disease status/subtype information), covariate data and SNPs. No default. |
response.var |
Variable name or position of the response variable column in the data frame. This variable needs to contain disease status/subtype information in the data frame. No default. |
snp.vars |
A character or numeric vector giving the variable names or positions of the SNP variables. Missing values for SNP genotypes are indicated by NA. No default. |
adj.vars |
A character or numeric vector containing the variable names or positions of the columns in the data frame that would be used as adjusting covariates in the analysis. Use NULL if no covariates are used for adjustment. |
types.lab |
NULL or a character vector giving the names/identifiers of the disease subtypes in |
cntl.lab |
A single character string giving the name/identifier of controls (disease-free subjects) in |
subset |
A logical vector with length=nrow( |
method |
A single character string indicating the choice of method as "case-control" or "case-complement".
The Default option is NULL which will carry out both types of analysis. For the case-complement analysis of disease subtype |
side |
A numeric value of either 1 or 2 indicating whether one or two-tailed p-values should be computed, respectively. The default is 2. |
logit |
If TRUE, results are returned from an overall case-control analysis using standard logistic regression. Default is FALSE. |
test.type |
A character string indicating the type of tests to be performed. The current options are "Score" and "Wald". The default is "Score." |
zmax.args |
Optional arguments to be passed to |
meth.pval |
A character string indicating the method of evaluating the p-value. Currently the options are "DLM" (Discrete Local Maximum), "IS" (Exact Importance Sampling) and "B" (Bonferroni) with the default option being DLM. The IS method is currently computationally feasible for analysis of at most k=10 studies/traits |
pval.args |
Optional arguments to be passed to internal p-value calculation functions |
The output standard errors are approximate (based on inverting DLM pvalues) and are used for constructing confidence intervals
in h.summary
and h.forestPlot
. For a particular SNP, if any of the genotypes are missing, then those
subjects will be removed from the analysis for that SNP.
A list containing 3 component lists named:
(1) "Overall.Logistic" (output for overall case-control analysis using standard logistic regression):
This list is non-null when logit
is TRUE and contains 3 vectors named (pval, beta, sd) of length same as snp.vars.
(2) "Subset.Case.Control" (output for subset-based case-control analysis):
This list is non-null when method
is NULL or "case-control". The output contains, 3 vectors named (pval, beta, sd) of length same as snp.vars
and a logical matrix named "pheno" with one row for each snp and one column for each disease subtype. For a particular SNP and disease-subtype, the
corresponding entry is "TRUE" if that disease subtype is included the best subset of disease subtypes that is identified to be associated with
the SNP in the subset-based case-control analysis. In the output, the p-value is automatically adjusted for multiple testing
due to subset search. The beta and sd corresponds to estimate of log-odds-ratio and standard error for a SNP from a logistic regression analysis involving
the cases of the identified disease subtypes and the controls.
(3) "Subset.Case.Complement" (output for subset-based case-complement analysis):
This list is non-null when method
is NULL or "case-complement". The output contains, 3 vectors named (pval, beta, sd) of length same as snp.vars
and a logical matrix named "pheno" with one row for each snp and one column for each disease subtype. For a particular SNP and disease-subtype, the
corresponding entry is "TRUE" if that disease subtype is included the best subset of disease subtypes that is identified to be associated with
the SNP in the subset-based case-complement analysis. In the output, the p-value is automatically adjusted for multiple testing
due to subset search. The beta and sd corresponds to estimate of log-odds-ratio and standard error for the SNP from a logistic regression analysis involving
the cases of the selected disease subtypes and the whole complement set of subjects that includes original controls and the cases of unselected disease subtypes.
Samsiddhi Bhattacharjee, Preetha Rajaraman, Kevin B. Jacobs, William A. Wheeler, Beatrice S. Melin, Patricia Hartge, GliomaScan Consortium, Meredith Yeager, Charles C. Chung, Stephen J. Chanock, Nilanjan Chatterjee. A subset-based approach improves power and interpretation for combined-analysis of genetic association studies of heterogeneous traits. Am J Hum Genet, 2012, 90(5):821-35
# Use the example data data(ex_types, package="ASSET") # Display the first 10 rows of the data and a table of the subtypes data[1:10, ] table(data[, "TYPE"]) # Define the input arguments to h.types. snps <- paste("SNP_", 1:3, sep="") adj.vars <- c("CENTER_1", "CENTER_2", "CENTER_3") types <- paste("SUBTYPE_", 1:5, sep="") # SUBTYPE_0 will denote the controls res <- h.types(data, "TYPE", snps, adj.vars, types, "SUBTYPE_0", subset=NULL, method="case-control", side=2, logit=FALSE, test.type="Score", zmax.args=NULL, meth.pval="DLM", pval.args=NULL) h.summary(res)
# Use the example data data(ex_types, package="ASSET") # Display the first 10 rows of the data and a table of the subtypes data[1:10, ] table(data[, "TYPE"]) # Define the input arguments to h.types. snps <- paste("SNP_", 1:3, sep="") adj.vars <- c("CENTER_1", "CENTER_2", "CENTER_3") types <- paste("SUBTYPE_", 1:5, sep="") # SUBTYPE_0 will denote the controls res <- h.types(data, "TYPE", snps, adj.vars, types, "SUBTYPE_0", subset=NULL, method="case-control", side=2, logit=FALSE, test.type="Score", zmax.args=NULL, meth.pval="DLM", pval.args=NULL) h.summary(res)
Function to obtain Discrete Local Maxima based estimates of p-values for z-scores maximized over subsets (of traits or subtypes), with possible restrictions and weights. Should not be called directly. See details.
p.dlm(t.vec, k, search, side, cor.def=NULL, cor.args=NULL, sizes=rep(1, k), sub.def=NULL, sub.args=NULL, wt.def=NULL, wt.args=NULL)
p.dlm(t.vec, k, search, side, cor.def=NULL, cor.args=NULL, sizes=rep(1, k), sub.def=NULL, sub.args=NULL, wt.def=NULL, wt.args=NULL)
t.vec |
Numeric vector of (positive) points for which to calculate p-values, i.e. general observed Z-max values. No default. |
k |
Integer (currently less than 30). The number of studies (traits) or subtypes being analyzed. No default. |
search |
0, 1 or 2. Search option, with 0 indicating subtype analysis, 1 and 2 denote one-sided and two-sided subset-search. No default. |
side |
Either 1 or 2. For two-tailed tests (where absolute values of Z-scores are maximized), |
cor.def |
A function with at least 3 arguments which calculates correlation between its first argument (a subset) and its second argument (subsets such as its neighbors). The third argument is the number of traits/subtypes and the function should return a vector of correlations with the neighbors. If NULL or a non-function value is specified, internal default functions for the corresponding search option are used. |
cor.args |
Other arguments to be passed to |
sizes |
Sizes of equivalence classes of traits. By default, no two traits or studies are equivalent. This argument is for internal use. |
sub.def |
A function to restrict subsets, e.g., order restrictions in subtype analysis. Should accept a subset (a logical vector of size k) as its first argument and should return TRUE if the subset satisfies restrictions and FALSE otherwise. Default is NULL implying all (2^k - 1) subsets are considered in the maximum. |
sub.args |
Other arguments to be passed to |
wt.def |
A function that gives the weight of one subset with respect to another. Should accept two subsets as the first two arguments and return a single positive weight. Default NULL. Currently this option is not implemented and the argument is ignored. |
wt.args |
Other arguments to be passed to |
The function is vectorized to handle blocks of SNPs at a time. Currently weight options are ignored.
This is a helper function that is called internally by h.traits
and h.types
and should not be called directly. The arguments of this function that have defaults, can be customized using
the argument pval.args
in h.traits
and h.types
.
A numeric vector of estimated p-values.
# A function to define the correlations between a subset and its neighbors # Returned values should not exceed the value of 1 cor.def <- function(subset, neighbors, k, ncase, ncntl) { n <- ncol(neighbors) mat <- matrix(subset, nrow=k, ncol=n, byrow=FALSE) cor <- (mat + neighbors)*(1:k)/(k^2) cor <- colSums(cor) cor <- cor/max(cor) dim(cor) <- c(n, 1) cor } # Subset definition sub.def <- function(logicalVec) { # Only allow the cummulative subsets: # TRUE FALSE FALSE FALSE ... # TRUE TRUE FALSE FALSE ... # TRUE TRUE TRUE FALSE ... # etc sum <- sum(logicalVec) ret <- all(logicalVec[1:sum]) ret } k <- 5 t.vec <- 1:k p.dlm(t.vec, k, 1, 2, cor.def=cor.def, sub.def=sub.def, cor.args=list(ncase=rep(1000, k), ncntl=rep(1000,k)))
# A function to define the correlations between a subset and its neighbors # Returned values should not exceed the value of 1 cor.def <- function(subset, neighbors, k, ncase, ncntl) { n <- ncol(neighbors) mat <- matrix(subset, nrow=k, ncol=n, byrow=FALSE) cor <- (mat + neighbors)*(1:k)/(k^2) cor <- colSums(cor) cor <- cor/max(cor) dim(cor) <- c(n, 1) cor } # Subset definition sub.def <- function(logicalVec) { # Only allow the cummulative subsets: # TRUE FALSE FALSE FALSE ... # TRUE TRUE FALSE FALSE ... # TRUE TRUE TRUE FALSE ... # etc sum <- sum(logicalVec) ret <- all(logicalVec[1:sum]) ret } k <- 5 t.vec <- 1:k p.dlm(t.vec, k, 1, 2, cor.def=cor.def, sub.def=sub.def, cor.args=list(ncase=rep(1000, k), ncntl=rep(1000,k)))
Function to maximize z-scores over subsets of traits or subtypes, with possible restrictions and weights. Should not be called directly. See details.
z.max(k, snp.vars, side, meta.def, meta.args, th=rep(-1, length(snp.vars)), sub.def=NULL, sub.args=NULL, wt.def=NULL, wt.args=NULL)
z.max(k, snp.vars, side, meta.def, meta.args, th=rep(-1, length(snp.vars)), sub.def=NULL, sub.args=NULL, wt.def=NULL, wt.args=NULL)
k |
Single integer (currently at most 30). The total number of traits or studies or subtypes being analyzed. No default |
snp.vars |
Vector of integers or string labels for the SNPs being analyzed. No default. |
side |
Either 1 or 2. For two-tailed tests (where absolute values of Z-scores are maximzed), |
meta.def |
Function that calculates z-scores for a given subset, for all the SNPs. Should accept a subset (logical vector of length k) as its
first argument, followed by a list of SNPs (subset of snp.vars) as its second argument.
Should return a named list with at least the name "z", which is the vector of z-scores. The length of the vector should be the same
length as |
meta.args |
Other arguments to be passed to |
th |
A vector of thresholds for each SNP, beyond which to stop maximization for that SNP. Default is a threshold of -1 for each SNP , implying no threshold. This argument is for internal use. |
sub.def |
A function to restrict subsets, e.g., order restrictions in subtype analysis. Should accept a subset (a logical vector of size k) as its first argument and should return TRUE if the subset satisfies restrictions and FALSE otherwise. Default is NULL implying all (2^k - 1) subsets are considered in the maximum. |
sub.args |
Other arguments to be passed to |
wt.def |
A function that gives weight of one subset with respect to another. Should accept two subsets as first two argumets and return a single positive weight. Default NULL. Currently this option is not implemented and the argument is ignored. |
wt.args |
Other arguments to be passed to |
This function loops through all possible (2^k - 1) subsets of (k) studies (or traits or subtypes), skips subsets that are not valid (e.g.
that do not satisfy order restrictions), and maximizes the z-scores or re-weighted z-scores if weights are specified.
The function is vectorized to handle blocks of SNPs at a time. Currently weight options are ignored.
This is a helper function that is called internally by h.traits
and h.types
and should not be called directly. The arguments of this function that have defaults, can be customized using
the argument zmax.args
in h.traits
and h.types
.
A list with two components. A vector of optimized z-scores (opt.z) and a logical matrix (opt.s) of dimension length(snp.vars) by k
.
Each row of (opt.s) has indicators of each trait/subtype being included in the best (optimal) subset.
set.seed(123) # Define the function to calculate the z-scores meta.def <- function(logicalVec, SNP.list, arg.beta, arg.sigma) { # Get the snps and subset to use beta <- as.matrix(arg.beta[SNP.list, logicalVec]) se <- as.matrix(arg.sigma[SNP.list, logicalVec]) test <- (beta/se)^2 ret <- apply(test, 1, max) list(z=ret) } # Define the function to determine which subsets to consider sub.def <- function(logicalVec) { # Only allow the cummulative subsets: # TRUE FALSE FALSE FALSE ... # TRUE TRUE FALSE FALSE ... # TRUE TRUE TRUE FALSE ... # etc sum <- sum(logicalVec) ret <- all(logicalVec[1:sum]) ret } # Assume there are 10 subtypes and 3 SNPs k <- 10 snp.vars <- 1:3 # Generate some data nsnp <- length(snp.vars) beta <- matrix(-0.5 + runif(k*nsnp), nrow=nsnp) sigma <- matrix(runif(k*nsnp)^2, nrow=nsnp) meta.args <- list(arg.beta=beta, arg.sigma=sigma) z.max(k, snp.vars, 2, meta.def, meta.args, sub.def=sub.def)
set.seed(123) # Define the function to calculate the z-scores meta.def <- function(logicalVec, SNP.list, arg.beta, arg.sigma) { # Get the snps and subset to use beta <- as.matrix(arg.beta[SNP.list, logicalVec]) se <- as.matrix(arg.sigma[SNP.list, logicalVec]) test <- (beta/se)^2 ret <- apply(test, 1, max) list(z=ret) } # Define the function to determine which subsets to consider sub.def <- function(logicalVec) { # Only allow the cummulative subsets: # TRUE FALSE FALSE FALSE ... # TRUE TRUE FALSE FALSE ... # TRUE TRUE TRUE FALSE ... # etc sum <- sum(logicalVec) ret <- all(logicalVec[1:sum]) ret } # Assume there are 10 subtypes and 3 SNPs k <- 10 snp.vars <- 1:3 # Generate some data nsnp <- length(snp.vars) beta <- matrix(-0.5 + runif(k*nsnp), nrow=nsnp) sigma <- matrix(runif(k*nsnp)^2, nrow=nsnp) meta.args <- list(arg.beta=beta, arg.sigma=sigma) z.max(k, snp.vars, 2, meta.def, meta.args, sub.def=sub.def)