Michigan Imputation Server
Michigan
Imputation Server pre-phases typed genotypes using HAPI-UR, SHAPEIT,
or EAGLE (default is EAGLE2), imputes using Minimac3 imputation engine
and outputs Blocked GNU Zip Format VCF files (.vcf.gz
).
These .vcf.gz
files are used as input for
gwasurvivr
. Minimac uses slightly different metrics to
assess imputation quality (R2 versus info score) and
complete details as to minimac output are available on the Minimac3
Wikipage.
The function, michiganCoxSurv
uses a modification of Cox
proportional hazard regression from the R library
survival:::coxph.fit
. Built specifically for genetic data,
michiganCoxSurv
allows the user to filter on info (R2) score (imputation
quality metric) and minor allele frequency from the reference panel used
for imputation using RefPanelAF
as the input arguement for
maf.filter
. Users are also provided with the sample minor
allele frequency (MAF) in the sangerCoxSurv
output, which
can be used for filtering post analysis.
Samples can be selected for analyses by providing a vector of
sample.ids
. The output from Sanger imputation server
returns the samples as SAMP1, ..., SAMPN
, where
N
is the total number of samples. The sample order
corresponds to the sample order in the vcf.file
used for
imputation. Note, sample order can also be found in the
.fam
file if genotyping data were initially in
.bed
, .bim
and .fam
(PLINK)
format prior to conversion to VCF. If no sample list is specified all
samples are included in the analyses.
vcf.file <- system.file(package="gwasurvivr",
"extdata",
"michigan.chr14.dose.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
"extdata",
"simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
sep=" ",
header=TRUE,
stringsAsFactors = FALSE)
head(pheno.file)
## ID_1 ID_2 event time age DrugTxYes sex group
## 1 1 SAMP1 0 12.00 33.93 0 male control
## 2 2 SAMP2 1 7.61 58.71 1 male experimental
## 3 3 SAMP3 0 12.00 39.38 0 female control
## 4 4 SAMP4 0 4.30 38.85 0 male control
## 5 5 SAMP5 0 12.00 43.58 0 male experimental
## 6 6 SAMP6 1 2.60 57.74 0 male control
# recode sex column and remove first column
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
## [1] "SAMP2" "SAMP5" "SAMP7" "SAMP9" "SAMP11" "SAMP12"
In this example, we will select samples from the
experimental
group and will test survival only on these
patients. The first column in the pheno.file
are sample
IDs, which link the phenotype file to the imputation file. We include
age
, DrugTxYes
, and sex
in the
survival model as covariates.
We perform the analysis using the experimental
group to
demonstrate how one may want to prepare their data if interested in
testing only on a subset of samples (i.e. a case-control study and
survival of cases is of interest). Note that how the IDs
(sample.ids
) need to be a vector of class
character
. The chunk.size
refers to size of
each data chunk read in and is defaulted to 10,000 rows. Users can
customize that to their needs. The larger the chunk.size
the more memory (RAM) required to run the analysis. The recommended
chunk.size=10000
and probably should not exceed
chunk.size=100000
. This does not mean that
gwasurvivr
is limited to only 100,000 SNPs, it just is how
many SNPs are analyzed for each iteration.
By default survival estimates and pvalues for the SNP adjusted for
other covariates are outputted (print.covs='only'
), however
users can select print.covs=all
to get the coefficient
estimates for covariates included in the model. Depending on the number
of covariates included this can add substantially to output file
size.
Single SNP analysis
Next we run michiganCoxSurv
with the default,
print.covs="only"
, load the results into R and provide
descriptions of output by column. We will then run the analysis again
using print.covs="all"
. verbose=TRUE
is used
for these examples so the function display messages while running.
Use ?michiganCoxSurv
for argument specific
documentation.
print.covs="only"
michiganCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term=NULL,
print.covs="only",
out.file="michigan_only",
r2.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=TRUE,
clusterObj=NULL)
## Analysis started on 2024-10-30 at 08:13:42
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analyzing chunk 0-100
## Analyzing chunk 100-200
## Analysis completed on 2024-10-30 at 08:13:54
## 0 SNPs were removed from the analysis for not meeting the threshold criteria.
## List of removed SNPs can be found in /tmp/Rtmp17QU0T/michigan_only1c9f1698a934.snps_removed
## 3 SNPs were analyzed in total
## The survival output can be found at /tmp/Rtmp17QU0T/michigan_only1c9f1698a934.coxph
Here we load the data and glimpse the first few values in each column
outputted from the SNP*interaction survival analyses using
print.covs="only"
.
michigan_only <- read.table("michigan_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
## 'data.frame': 3 obs. of 21 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : logi FALSE FALSE FALSE
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ REF : chr "C" "G" "A"
## $ ALT : chr "T" "T" "G"
## $ AF : num 0.301 0.515 0.52
## $ MAF : num 0.301 0.485 0.48
## $ SAMP_FREQ_ALT: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ R2 : num 0.552 0.479 0.481
## $ ER2 : logi NA NA NA
## $ PVALUE : num 0.713 0.805 0.798
## $ HR : num 1.224 0.885 0.881
## $ HR_lowerCI : num 0.417 0.333 0.332
## $ HR_upperCI : num 3.6 2.35 2.34
## $ Z : num 0.368 -0.246 -0.255
## $ COEF : num 0.203 -0.123 -0.127
## $ SE.COEF : num 0.55 0.498 0.498
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
SNP with covariate interaction
A SNP*covariate interaction can be implemented using the
inter.term
argument. In this example, we will use
DrugTxYes
from the covariate file as the covariate we want
to test for interaction with the SNP.
print.covs="only"
michiganCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term="DrugTxYes",
print.covs="only",
out.file="michigan_intx_only",
r2.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=FALSE,
clusterObj=NULL)
Here we load the data and glimpse the first few values in each column
outputted from the SNP*interaction survival analyses using
print.covs="only"
.
michigan_intx_only <- read.table("michigan_intx_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(michigan_intx_only))
## 'data.frame': 3 obs. of 21 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : logi FALSE FALSE FALSE
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ REF : chr "C" "G" "A"
## $ ALT : chr "T" "T" "G"
## $ AF : num 0.301 0.515 0.52
## $ MAF : num 0.301 0.485 0.48
## $ SAMP_FREQ_ALT: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ R2 : num 0.552 0.479 0.481
## $ ER2 : logi NA NA NA
## $ PVALUE : num 0.9017 0.0806 0.0786
## $ HR : num 1.15 7.03 7.11
## $ HR_lowerCI : num 0.127 0.789 0.799
## $ HR_upperCI : num 10.4 62.7 63.2
## $ Z : num 0.124 1.747 1.759
## $ COEF : num 0.139 1.951 1.961
## $ SE.COEF : num 1.12 1.12 1.11
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
Sanger Imputation Server
Sanger Imputation
Server pre-phases typed genotypes using either SHAPEIT or EAGLE,
imputes genotypes using PBWT algorithm and outputs a
.vcf.gz
file for each chromosome. These
.vcf.gz
files are used as input for
gwasurvivr
. The function, sangerCoxSurv
uses a
modification of Cox proportional hazard regression from
survival::coxph.fit
. Built specifically for genetic data,
sangerCoxSurv
allows the user to filter on info score
(imputation quality metric) and minor allele frequency from the
reference panel used for imputation using RefPanelAF
as the
input arguement for maf.filter
. Users are also provided
with the sample minor allele frequency in the sangerCoxSurv
output.
Samples can be selected for analyses by providing a vector of
sample.ids
. The output from Sanger imputation server
returns the samples as SAMP1, ..., SAMPN
, where
N
is the total number of samples. The sample order
corresponds to the sample order in the VCF file used for imputation.
Note, sample order can also be found in the .fam
file if
genotyping data were initially in .bed
, .bim
and .fam
(PLINK) format prior to conversion to VCF. If no
sample list is specified, all samples are included in the analyses.
In this example, we will select samples from the
experimental
group and will test survival only on these
patients. The first column in the pheno.file
are sample IDs
(we will match on these). We include age
,
DrugTxYes
, and sex
in the survival model as
covariates.
vcf.file <- system.file(package="gwasurvivr",
"extdata",
"sanger.pbwt_reference_impute.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
"extdata",
"simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
sep=" ",
header=TRUE,
stringsAsFactors = FALSE)
head(pheno.file)
## ID_1 ID_2 event time age DrugTxYes sex group
## 1 1 SAMP1 0 12.00 33.93 0 male control
## 2 2 SAMP2 1 7.61 58.71 1 male experimental
## 3 3 SAMP3 0 12.00 39.38 0 female control
## 4 4 SAMP4 0 4.30 38.85 0 male control
## 5 5 SAMP5 0 12.00 43.58 0 male experimental
## 6 6 SAMP6 1 2.60 57.74 0 male control
# recode sex column and remove first column
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
## [1] "SAMP2" "SAMP5" "SAMP7" "SAMP9" "SAMP11" "SAMP12"
We perform the analysis using the experimental
group to
demonstrate how one may want to prepare their data if not initially all
samples are patients or cases (i.e. a case-control study and survival of
cases is of interest). We also are showing how the IDs
(sample.ids
) need to be a vector of class
character
. The chunk.size
refers to size of
each data chunk read in and is defaulted to 10,000 rows, users can
customize that to their needs. The larger the chunk.size
the more memory (RAM) required to run the analysis. The recommended
chunk.size=10000
and probably should not exceed
chunk.size=100000
. This does not mean that
gwasurvivr
is limited to only 100,000 SNPs, it just is how
many SNPs are analyzed for each iteration.
By default survival estimates and pvalues for the SNP adjusted for
other covariates are outputted (print.covs='only'
), however
users can select print.covs=all
to get the coefficient
estimates for covariates included in the model. Depending on the number
of covariates included this can add substantially to output file
size.
Use ?sangerCoxSurv
for argument specific
documentation.
Single SNP analysis
Next we run sangerCoxSurv
with the default,
print.covs="only"
, load the results into R and provide
descriptions of output by column. verbose=TRUE
is used for
these examples so the function display messages while running.
print.covs="only"
sangerCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term=NULL,
print.covs="only",
out.file="sanger_only",
info.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=TRUE,
clusterObj=NULL)
## Analysis started on 2024-10-30 at 08:14:05
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analyzing chunk 0-100
## Analyzing chunk 100-200
## Analysis completed on 2024-10-30 at 08:14:16
## 0 SNPs were removed from the analysis for not meeting the threshold criteria.
## List of removed SNPs can be found in /tmp/Rtmp17QU0T/sanger_only1c9f4039f247.snps_removed
## 3 SNPs were analyzed in total
## The survival output can be found at /tmp/Rtmp17QU0T/sanger_only1c9f4039f247.coxph
Here we load the data and glimpse the first few values in each column
from the survival analyses.
## 'data.frame': 3 obs. of 19 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : logi FALSE FALSE FALSE
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ REF : chr "C" "G" "A"
## $ ALT : chr "T" "T" "G"
## $ RefPanelAF : num 0.301 0.515 0.52
## $ SAMP_FREQ_ALT: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ INFO : num 0.552 0.479 0.481
## $ PVALUE : num 0.713 0.805 0.798
## $ HR : num 1.224 0.885 0.881
## $ HR_lowerCI : num 0.417 0.333 0.332
## $ HR_upperCI : num 3.6 2.35 2.34
## $ Z : num 0.368 -0.246 -0.255
## $ COEF : num 0.203 -0.123 -0.127
## $ SE.COEF : num 0.55 0.498 0.498
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
Column names with descriptions from the survival analyses with
covariates, specifying the default print.covs="only"
SNP with covariate interaction
A SNP*covariate interaction can be implemented using the
inter.term
argument. In this example, we will use
DrugTxYes
from the covariate file as the covariate we want
to test for interaction with the SNP.
print.covs="only"
sangerCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term="DrugTxYes",
print.covs="only",
out.file="sanger_intx_only",
info.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=TRUE,
clusterObj=NULL)
IMPUTE2 Imputation
IMPUTE2 is a genotype imputation and haplotype phasing program (Howie
et al 2009). IMPUTE2 outputs 6 files for each chromosome chunk imputed
(usually 5 MB in size). Only 2 of these files are required for analyses
using gwasurvivr
:
- Genotype file (
.impute
)
- Sample file (
.sample
)
More
information can be read about these formats
We are going to load in and pre-process the genetic data and the
phenotypic data (covariate.file
).
impute.file <- system.file(package="gwasurvivr",
"extdata",
"impute_example.impute2.gz")
sample.file <- system.file(package="gwasurvivr",
"extdata",
"impute_example.impute2_sample")
covariate.file <- system.file(package="gwasurvivr",
"extdata",
"simulated_pheno.txt")
covariate.file <- read.table(covariate.file,
sep=" ",
header=TRUE,
stringsAsFactors = FALSE)
covariate.file$SexFemale <- ifelse(covariate.file$sex=="female", 1L, 0L)
sample.ids <- covariate.file[covariate.file$group=="experimental",]$ID_2
To perform survival analysis using IMPUTE2 the function arguments are
very similarto michiganCoxSurv
and
sangerCoxSurv
, however the function now takes a chromosome
arguement. This is needed to properly annotate the file output with the
chromosome that these SNPs are in. This is purely an artifact of IMPUTE2
and how we leverage GWASTools
in this function.
Single SNP analysis
First we will do the analysis with no interaction term, followed by
doing the analysis with the interaction term. The recommended output
setting for single SNP analysis is print.cov="only"
.
impute2CoxSurv(impute.file=impute.file,
sample.file=sample.file,
chr=14,
covariate.file=covariate.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term=NULL,
print.covs="only",
out.file="impute_example_only",
chunk.size=100,
maf.filter=0.005,
exclude.snps=NULL,
flip.dosage=TRUE,
verbose=TRUE,
clusterObj=NULL)
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analysis started on 2024-10-30 at 08:14:27
## Determining number of SNPs and samples...
## Including all SNPs.
## scan.df not given. Assigning scanIDs automatically.
## Reading sample file...
## Reading genotype file...
## Block 1 of 1
## Writing annotation...
## Compressing...
## Clean up the fragments of GDS file:
## open the file '/tmp/Rtmp17QU0T/1c9fb0adb0.gds' (3.3K)
## # of fragments: 30
## save to '/tmp/Rtmp17QU0T/1c9fb0adb0.gds.tmp'
## rename '/tmp/Rtmp17QU0T/1c9fb0adb0.gds.tmp' (2.4K, reduced: 987B)
## # of fragments: 14
## ***** Compression time ******
## User:0.037
## System: 0.008
## Elapsed: 0.045
## *****************************
## Analyzing part 1/1...
## 0 SNPs were removed from the analysis for not meeting
## the given threshold criteria or for having MAF = 0
## List of removed SNPs are saved to /tmp/Rtmp17QU0T/impute_example_only1c9f6c053f6a.snps_removed
## In total 3 SNPs were included in the analysis
## The Cox model results output was saved to /tmp/Rtmp17QU0T/impute_example_only1c9f6c053f6a.coxph
## Analysis completed on 2024-10-30 at 08:14:27
Here we load the data and glimpse the first few values in each column
output.
impute2_res_only <- read.table("impute_example_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_only))
## 'data.frame': 3 obs. of 17 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : chr "---" "---" "---"
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ A0 : chr "C" "G" "A"
## $ A1 : chr "T" "T" "G"
## $ exp_freq_A1: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ PVALUE : num 0.713 0.805 0.798
## $ HR : num 1.224 0.885 0.881
## $ HR_lowerCI : num 0.417 0.333 0.332
## $ HR_upperCI : num 3.6 2.35 2.34
## $ Z : num 0.368 -0.246 -0.255
## $ COEF : num 0.203 -0.123 -0.127
## $ SE.COEF : num 0.55 0.498 0.498
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
SNP covariate interaction
Now we will perform a SNP*covariate interaction survival analysis
using impute2CoxSurv
.
impute2CoxSurv(impute.file=impute.file,
sample.file=sample.file,
chr=14,
covariate.file=covariate.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term="DrugTxYes",
print.covs="only",
out.file="impute_example_intx",
chunk.size=100,
maf.filter=0.005,
flip.dosage=TRUE,
verbose=FALSE,
clusterObj=NULL,
keepGDS=FALSE)
## Determining number of SNPs and samples...
## Including all SNPs.
## scan.df not given. Assigning scanIDs automatically.
## Reading sample file...
## Reading genotype file...
## Block 1 of 1
## Writing annotation...
## Compressing...
## Clean up the fragments of GDS file:
## open the file '/tmp/Rtmp17QU0T/1c9f5b4656f2.gds' (3.3K)
## # of fragments: 30
## save to '/tmp/Rtmp17QU0T/1c9f5b4656f2.gds.tmp'
## rename '/tmp/Rtmp17QU0T/1c9f5b4656f2.gds.tmp' (2.4K, reduced: 987B)
## # of fragments: 14
## ***** Compression time ******
## User:0.041
## System: 0.004
## Elapsed: 0.045
## *****************************
Here we load the data and glimpse the first few values in each column
outputted from the SNP*interaction survival analyses using
print.covs="only"
.
impute2_res_intx <- read.table("impute_example_intx.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_intx))
## 'data.frame': 3 obs. of 17 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : chr "---" "---" "---"
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ A0 : chr "C" "G" "A"
## $ A1 : chr "T" "T" "G"
## $ exp_freq_A1: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ PVALUE : num 0.9017 0.0806 0.0786
## $ HR : num 1.15 7.03 7.11
## $ HR_lowerCI : num 0.127 0.789 0.799
## $ HR_upperCI : num 10.4 62.7 63.2
## $ Z : num 0.124 1.747 1.759
## $ COEF : num 0.139 1.951 1.961
## $ SE.COEF : num 1.12 1.12 1.11
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21