| Title: | Case Control Allele Frequency Estimation |
|---|---|
| Description: | Functions to reconstruct case and control AFs from summary statistics. One function uses OR, NCase, NControl, and SE(log(OR)). The second function uses OR, NCase, NControl, and AF for the whole sample. |
| Authors: | Hayley Wolff [cre, aut] |
| Maintainer: | Hayley Wolff <[email protected]> |
| License: | GPL-3 |
| Version: | 1.5.0 |
| Built: | 2026-05-20 10:12:15 UTC |
| Source: | https://github.com/bioc/CCAFE |
This function derives the case and control allele frequencies (AFs) from GWAS
summary statistics when the user has the whole-sample AF, the case and control
sample sizes, and the OR (or beta) for each variant.
If the user has the SE of log(OR) instead of the whole-sample AF, use CaseControl_SE().
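The reconstruction from the whole-sample AF can be sketched from two identities: the whole-sample AF is the sample-size-weighted mean of the case and control AFs, and the OR is the ratio of the case allele odds to the control allele odds. Substituting one into the other gives a quadratic in the control AF. The function below is an illustrative sketch under those assumptions, not the package's implementation; the name `sketch_af_from_total` and its arguments are hypothetical, and the numbers in the round trip are made up.

```r
# Hypothetical helper (not part of CCAFE): solve for case/control AFs from
# the whole-sample AF, the OR, and the group sizes.
sketch_af_from_total <- function(af_total, or, n_case, n_control) {
  n_total <- n_case + n_control
  # Quadratic a * p0^2 + b * p0 + c0 = 0 in the control AF p0, obtained from
  #   af_total = (n_case * p1 + n_control * p0) / n_total
  #   or       = (p1 / (1 - p1)) / (p0 / (1 - p0))
  a <- n_control * (or - 1)
  b <- n_case * or + n_control - n_total * af_total * (or - 1)
  c0 <- -n_total * af_total
  disc <- b^2 - 4 * a * c0
  # Root lying in [0, 1], written in a form that stays finite when or == 1
  af_control <- 2 * c0 / (-b - sqrt(disc))
  af_case <- (n_total * af_total - n_control * af_control) / n_case
  data.frame(AF_case = af_case, AF_control = af_control)
}

# Round trip on made-up values: start from known AFs, build the inputs,
# then check that the sketch recovers the AFs it started from.
p0 <- 0.2                                   # control AF
or <- 1.5
p1 <- or * p0 / (1 + (or - 1) * p0)         # case AF implied by p0 and the OR
af_total <- (1000 * p1 + 3000 * p0) / 4000  # weighted mean over 1000 cases, 3000 controls
est <- sketch_af_from_total(af_total, or, n_case = 1000, n_control = 3000)
print(est)
```

The inputs are vectorized, so passing whole columns of a summary-statistics data frame yields one row per variant.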

    CaseControl_AF(
      data,
      N_case = 0,
      N_control = 0,
      OR_colname = "OR",
      AF_total_colname = "AF"
    )


| Argument | Description |
|---|---|
| data | dataframe with each row being a variant and columns for AF_total and OR |
| N_case | the number of cases in the sample |
| N_control | the number of controls in the sample |
| OR_colname | a string containing the exact column name in 'data' with the OR |
| AF_total_colname | a string containing the exact column name in 'data' with the whole sample AF |

Returns a dataframe with two columns (AF_case, AF_control) and one row per variant.
Hayley Wolff (Stoneman), [email protected]
https://github.com/wolffha/CCAFE for further documentation

    library(CCAFE)
    data("sampleDat")
    sampleDat <- as.data.frame(sampleDat)

    nCase_sample <- 16550
    nControl_sample <- 403923

    # get the estimated case and control AFs
    af_method_results <- CaseControl_AF(data = sampleDat,
                                        N_case = nCase_sample,
                                        N_control = nControl_sample,
                                        OR_colname = "OR",
                                        AF_total_colname = "true_maf_pop")

    head(af_method_results)

This function derives the case, control, and total minor allele frequencies
(MAFs) from GWAS summary statistics when the user has the sample sizes,
the OR (or beta), and the SE of log(OR) for each variant.
If the user has the total AF instead of the SE, use CaseControl_AF().
This code uses the GroupFreq function, adapted from the C implementation at
https://github.com/Paschou-Lab/ReAct/blob/main/GrpPRS_src/CountConstruct.c
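One way to see why the OR, its SE, and the sample sizes can be enough is the standard Woolf approximation for a 2x2 allele-count table, Var(log OR) ≈ 1/a + 1/b + 1/c + 1/d. Writing the counts in terms of the control allele odds gives a quadratic that can be solved directly. The sketch below works through that algebra; it is an assumption-laden illustration and not the GroupFreq code. The function name `sketch_af_from_se` is hypothetical, the example values are made up, and taking the smaller of the two roots is a choice made here for illustration.

```r
# Hypothetical helper (not part of CCAFE): recover allele frequencies from
# the OR, SE(log(OR)), and group sizes under the Woolf approximation.
sketch_af_from_se <- function(or, se, n_case, n_control) {
  x <- 2 * n_case     # number of alleles among cases
  y <- 2 * n_control  # number of alleles among controls
  w <- se^2
  # With v = control allele odds and case odds = or * v, the approximation
  #   w = (1 + or * v)^2 / (x * or * v) + (1 + v)^2 / (y * v)
  # rearranges to qa * v^2 + qb * v + qc = 0 with:
  qa <- or * (y * or + x)
  qb <- 2 * or * (x + y) - w * x * y * or
  qc <- y + x * or
  disc <- qb^2 - 4 * qa * qc
  disc[which(disc < 0)] <- NA_real_    # no real solution for these inputs
  v <- (-qb - sqrt(disc)) / (2 * qa)   # smaller root (assumption of this sketch)
  v[which(v <= 0)] <- NA_real_         # odds must be positive
  af_control <- v / (1 + v)
  af_case <- or * v / (1 + or * v)
  af_total <- (n_case * af_case + n_control * af_control) / (n_case + n_control)
  data.frame(AF_case = af_case, AF_control = af_control, AF_total = af_total)
}

# Round trip on made-up values: compute the SE implied by known AFs,
# then check that the sketch recovers them.
p0 <- 0.2
or <- 1.5
p1 <- or * p0 / (1 + (or - 1) * p0)
se <- sqrt(1 / (2 * 1000 * p1 * (1 - p1)) + 1 / (2 * 3000 * p0 * (1 - p0)))
est_se <- sketch_af_from_se(or, se, n_case = 1000, n_control = 3000)
print(est_se)
```

The quadratic has two positive roots, so the same OR and SE are compatible with two different frequency pairs. That ambiguity is consistent with the function reporting MAFs and offering an optional correction against proxy MAFs.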

    CaseControl_SE(
      data,
      N_case = 0,
      N_control = 0,
      OR_colname = "OR",
      SE_colname = "SE",
      chromosome_colname = "chr",
      sex_chromosomes = FALSE,
      position_colname = "pos",
      N_XX_case = NA,
      N_XX_control = NA,
      N_XY_case = NA,
      N_XY_control = NA,
      do_correction = FALSE,
      correction_data = NA,
      remove_sex_chromosomes = TRUE,
      verbose = FALSE
    )


| Argument | Description |
|---|---|
| data | dataframe where each row is a variant and columns contain the OR, SE, chromosome and positions |
| N_case | an integer giving the number of case individuals |
| N_control | an integer giving the number of control individuals |
| OR_colname | a string containing the exact column name in 'data' with the OR |
| SE_colname | a string containing the exact column name in 'data' with the SE |
| chromosome_colname | a string containing the exact column name in 'data' with the chromosomes, default "chr" |
| sex_chromosomes | boolean, TRUE if variants from sex chromosomes are included in the dataset. Sex chromosomes can be numeric (23, 24) or character (X, Y). If numeric, assumes X=23 and Y=24. |
| position_colname | a string containing the exact column name in 'data' with the position, default "pos" |
| N_XX_case | the number of case individuals with XX chromosomes (REQUIRED if sex_chromosomes == TRUE) |
| N_XX_control | the number of control individuals with XX chromosomes (REQUIRED if sex_chromosomes == TRUE) |
| N_XY_case | the number of case individuals with XY chromosomes (REQUIRED if sex_chromosomes == TRUE) |
| N_XY_control | the number of control individuals with XY chromosomes (REQUIRED if sex_chromosomes == TRUE) |
| do_correction | boolean, TRUE if data is provided to perform correction |
| correction_data | a dataframe with the following exact columns: CHR, POS, proxy_MAF, with data that is harmonized between the proxy true datasets and the observed dataset |
| remove_sex_chromosomes | boolean, TRUE to keep autosomes only. This is needed when the number of males and females (biological sex) in each case and control group is not known. |
| verbose | boolean, determines whether warnings are displayed (default FALSE) |

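Because correction_data must carry the exact column names CHR, POS, and proxy_MAF, it can help to build it through a small checking function. The sketch below is one hypothetical way to do that in base R; `make_correction_data` is not a CCAFE function, the positions and frequencies are made up, and the 0 to 0.5 range check is an assumption based on proxy_MAF being a minor allele frequency.

```r
# Hypothetical helper (not part of CCAFE): assemble a correction_data frame
# with the exact column names listed in the argument table.
make_correction_data <- function(chr, pos, proxy_maf) {
  stopifnot(length(chr) == length(pos), length(pos) == length(proxy_maf))
  # Assumption: a minor allele frequency lies between 0 and 0.5
  stopifnot(all(proxy_maf >= 0 & proxy_maf <= 0.5, na.rm = TRUE))
  data.frame(CHR = chr, POS = pos, proxy_MAF = proxy_maf,
             stringsAsFactors = FALSE)
}

# Made-up values for three variants on chromosome 1
corr <- make_correction_data(chr = c(1, 1, 1),
                             pos = c(10511L, 13116L, 14464L),
                             proxy_maf = c(0.01, 0.18, 0.16))
print(corr)

# With CCAFE installed, the frame could then be passed along these lines
# (my_sumstats, n_case, and n_control are placeholders):
# CaseControl_SE(data = my_sumstats, N_case = n_case, N_control = n_control,
#                chromosome_colname = "CHR", position_colname = "POS",
#                do_correction = TRUE, correction_data = corr)
```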
Returns data as a dataframe with three additional columns (MAF_case, MAF_control, MAF_total) holding the estimated MAFs for each variant. If do_correction = TRUE, the output has three further columns (MAF_case_adj, MAF_control_adj, MAF_total_adj) with the adjusted estimates.
Hayley Wolff (Stoneman), [email protected]
https://github.com/wolffha/CCAFE for further documentation

    library(CCAFE)
    data("sampleDat")
    sampleDat <- as.data.frame(sampleDat)

    nCase_sample <- 16550
    nControl_sample <- 403923

    # get the estimated case and control MAFs
    se_method_results <- CaseControl_SE(data = sampleDat,
                                        N_case = nCase_sample,
                                        N_control = nControl_sample,
                                        OR_colname = "OR",
                                        SE_colname = "SE",
                                        chromosome_colname = "CHR",
                                        position_colname = "POS")

    head(se_method_results)

Formats information from a VCF object for use in CCAFE methods. From the rowRanges object it takes seqnames (chromosome) and ranges (position); from the geno object it takes ES (effect size of ALT), SE, and AF (allele frequency of ALT).

    CCAFE_convertVCF(vcf)


| Argument | Description |
|---|---|
| vcf | a Variant Call Format (VCF) object read in using the VariantAnnotation Bioconductor package |

a dataframe object with columns Position, RSID, Chromosome, REF, ALT, beta, SE, AF, OR
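The returned data frame carries both beta and OR. Assuming the effect size is a log odds ratio, as is usual for a logistic-regression GWAS, the two are related by OR = exp(beta). The short sketch below shows that conversion on made-up values; it illustrates the relationship and is not taken from the package source.

```r
# Made-up effect sizes on the log-odds scale
sumstats <- data.frame(beta = c(-0.10, 0, 0.25),
                       SE = c(0.02, 0.03, 0.05))

# Odds ratio implied by each beta; a beta of 0 corresponds to an OR of 1
sumstats$OR <- exp(sumstats$beta)
print(sumstats)
```

The SE stays on the log(OR) scale, which is the scale CaseControl_SE() expects.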
Hayley Wolff (Stoneman), [email protected]

    library(VariantAnnotation)
    library(CCAFE)

    # load the data
    data("vcf_sample")

    # run the method
    df_sample <- CCAFE_convertVCF(vcf_sample)
    print(head(df_sample))

    # can then use in CCAFE methods
    # since we have total AF, will use CaseControl_AF
    df_sample <- CaseControl_AF(data = df_sample,
                                N_case = 48286,
                                N_control = 250671,
                                OR_colname = "OR",
                                AF_total_colname = "AF")
    head(df_sample)

This is a subset of 500 variants on chromosome 1 from the Pan-UKBB diabetes GWAS with the whole sample (pop), case, and control minor allele frequency (MAF) for those classified in Pan-UKBB as European (EUR). These variants (which are mapped to GRCh37) have been harmonized with gnomAD non-Finnish European (NFE) MAFs.

    data("sampleDat")

'sampleDat' A data frame with 500 rows and 11 columns:
chromosome number
base-pair position of variant (GRCh37 coordinates)
Reference allele
Alternate allele
MAF in EUR cases in Pan-UKBB Diabetes
MAF in EUR controls in Pan-UKBB Diabetes
MAF in whole EUR sample in Pan-UKBB Diabetes
beta from EUR GWAS in Pan-UKBB Diabetes
SE of beta from EUR GWAS in Pan-UKBB Diabetes
OR from EUR GWAS in Pan-UKBB Diabetes
MAF in gnomAD NFE
<https://pan.ukbb.broadinstitute.org/docs/per-phenotype-files>
<https://gnomad.broadinstitute.org/downloads>
A VCF from a GWAS of Type 2 Diabetes (https://doi.org/10.1038/s41588-018-0084-1), containing a subset of 10,000 variants.

    data("vcf_sample")

'vcf_sample' A CollapsedVCF

    dim: 10000 1
    rowRanges(vcf):
      GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
    info(vcf):
      DataFrame with 1 column: AF
    info(header(vcf)):
          Number Type  Description
       AF A      Float Allele Frequency
    geno(vcf):
      List of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID
    geno(header(vcf)):
          Number Type    Description
       ES A      Float   Effect size estimate relative to the alternative allele
       SE A      Float   Standard error of effect size estimate
       LP A      Float   -log10 p-value for effect estimate
       AF A      Float   Alternate allele frequency in the association study
       SS A      Integer Sample size used to estimate genetic effect
       EZ A      Float   Z-score provided if it was used to derive the EFFECT and SE fields
       SI A      Float   Accuracy score of summary data imputation
       NC A      Integer Number of cases used to estimate genetic effect
       ID 1      String  Study variant identifier