Genome-wide association studies (GWAS) are widely used to help determine the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (high-performance computing R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures [1]. The kernels of our algorithms are written in C/C++ and have been highly optimized. The calculations of the genetic covariance matrix in PCA and pairwise IBD coefficients are split into non-overlapping parts and assigned to multiple cores for performance acceleration, as shown in Figure 1.
GDS is also used by the R/Bioconductor package GWASTools as one of its data storage formats [2,3]. GWASTools provides many functions for quality control and analysis of GWAS, including statistics by SNP or scan, batch quality, chromosome anomalies, association tests, etc. The extended GDS format is implemented in the SeqArray package to support the storage of single nucleotide variation (SNV), insertion/deletion polymorphism (indel) and structural variation calls. It is strongly suggested to use SeqArray for large-scale whole-exome and whole-genome sequencing variant data instead of SNPRelate.
Figure 1: Flowchart of parallel computing for principal component analysis and identity-by-descent analysis.
R is a widely used statistical programming environment, but it is not typically optimized for the high-performance or parallel computing that would ease the burden of large-scale GWAS calculations. To overcome these limitations we have developed a project named CoreArray (http://corearray.sourceforge.net/) that includes two R packages: gdsfmt to provide efficient, platform-independent memory and file management for genome-wide numerical data, and SNPRelate to solve large-scale, numerically intensive GWAS calculations (i.e., PCA and IBD) on multi-core symmetric multiprocessing (SMP) computer architectures.
This vignette takes the user through the relatedness and principal component analyses used for genome-wide association data. The methods in this vignette were introduced in the paper of Zheng et al. (2012) [1]. For replication purposes the data used here are taken from the HapMap Phase II project. These data were kindly provided by the Center for Inherited Disease Research (CIDR) at Johns Hopkins University and the Broad Institute of MIT and Harvard University (Broad). The data supplied here should not be used for any purpose other than this tutorial.
To install the package SNPRelate, you need a current version (>=2.14.0) of R and the R package gdsfmt. After installing R you can run the following commands from the R command shell to install the R package SNPRelate.
Install the package from Bioconductor repository:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("gdsfmt")
BiocManager::install("SNPRelate")
Install the development version from GitHub:
The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system (see the R FAQ for your operating system); you may also need to install dependencies manually.
To support efficient memory management for genome-wide numerical data, the gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data, which is a container for storing annotation data and SNP genotypes. In this format each byte encodes up to four SNP genotypes, thereby reducing file size and access time. The GDS format supports data blocking so that only the subset of data that is being processed needs to reside in memory. GDS formatted data is also designed for efficient random access to large data sets. A tutorial for the R/Bioconductor package gdsfmt can be found at http://corearray.sourceforge.net/tutorials/gdsfmt/.
## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
Here is a typical GDS file:
## The file name: /tmp/RtmpRSE5ca/Rinst190e3aed40e7/SNPRelate/extdata/hapmap_geno.gds
## The total number of samples: 279
## The total number of SNPs: 9088
## SNP genotypes are stored in SNP-major mode (Sample X SNP).
snpgdsExampleFileName() returns the file name of a GDS file used as an example in SNPRelate. It is a subset of data from the HapMap project, and the samples were genotyped by the Center for Inherited Disease Research (CIDR) at Johns Hopkins University and the Broad Institute of MIT and Harvard University (Broad). snpgdsSummary() summarizes the genotypes stored in the GDS file. “Individual-major mode” indicates listing all SNPs for an individual before listing the SNPs for the next individual, etc. Conversely, “SNP-major mode” indicates listing all individuals for the first SNP before listing all individuals for the second SNP, etc. Sometimes “SNP-major mode” is more computationally efficient than “individual-major mode”. For example, the calculation of the genetic covariance matrix processes genotypic data SNP by SNP, so “SNP-major mode” should be more efficient.
## File: /tmp/RtmpRSE5ca/Rinst190e3aed40e7/SNPRelate/extdata/hapmap_geno.gds (709.6K)
## + [ ] *
## |--+ sample.id { VStr8 279 ZIP(29.9%), 679B }
## |--+ snp.id { Int32 9088 ZIP(34.8%), 12.3K }
## |--+ snp.rs.id { VStr8 9088 ZIP(40.1%), 36.2K }
## |--+ snp.position { Int32 9088 ZIP(94.7%), 33.6K }
## |--+ snp.chromosome { UInt8 9088 ZIP(0.94%), 85B } *
## |--+ snp.allele { VStr8 9088 ZIP(11.3%), 4.0K }
## |--+ genotype { Bit2 279x9088, 619.0K } *
## \--+ sample.annot [ data.frame ] *
## |--+ family.id { VStr8 279 ZIP(34.4%), 514B }
## |--+ father.id { VStr8 279 ZIP(31.5%), 220B }
## |--+ mother.id { VStr8 279 ZIP(30.9%), 214B }
## |--+ sex { VStr8 279 ZIP(17.0%), 95B }
## \--+ pop.group { VStr8 279 ZIP(6.18%), 69B }
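The listing above is what R prints for an open GDS file handle, but the calls that produce it are not shown in this section. The following is a minimal sketch, assuming the gdsfmt and SNPRelate packages are installed. The handle name `gds` is an illustrative choice, and `allow.duplicate=TRUE` is used only so that the sketch still runs if the same file is already open in the session.

```r
library(gdsfmt)
library(SNPRelate)

# Path of the example GDS file shipped with SNPRelate
fn <- snpgdsExampleFileName()

# Open the file; allow.duplicate=TRUE permits a second handle in one session
gds <- snpgdsOpen(fn, allow.duplicate=TRUE)

# Print the tree of variables stored in the file
print(gds)

# Summarize the genotype data; the result is returned invisibly as a list
s <- snpgdsSummary(gds)
n.samp <- length(s$sample.id)
n.snp <- length(s$snp.id)

# Close the handle when done
snpgdsClose(gds)
```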
The output lists all variables stored in the GDS file. At the first level, it stores the variables sample.id, snp.id, etc. Additional information is displayed in braces, indicating the data type, the size, and whether the data are compressed along with the compression ratio. The second-level variables sex and pop.group are both stored in the folder sample.annot. All of the functions in SNPRelate require a minimum set of variables in the annotation data. The minimum required variables are
Users can define the numeric chromosome codes which are stored with the variable snp.chromosome as its attributes when snp.chromosome is numeric only. For example, snp.chromosome has the attributes of chromosome coding:
## $autosome.start
## [1] 1
##
## $autosome.end
## [1] 22
##
## $X
## [1] 23
##
## $XY
## [1] 24
##
## $Y
## [1] 25
##
## $M
## [1] 26
##
## $MT
## [1] 26
autosome.start is the starting numeric code of autosomes, and autosome.end is the last numeric code of autosomes. put.attr.gdsn() can be used to add a new attribute or modify an existing attribute.
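The attribute listing above can be retrieved with get.attr.gdsn() from gdsfmt. The sketch below, which assumes the example file shipped with SNPRelate and uses the illustrative handle name `gds`, reads the chromosome coding and uses it to flag autosomal SNPs.

```r
library(gdsfmt)
library(SNPRelate)

gds <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate=TRUE)

# Node holding the chromosome codes
chr.node <- index.gdsn(gds, "snp.chromosome")

# All attributes attached to the node, returned as a named list
chr.attr <- get.attr.gdsn(chr.node)
print(chr.attr$autosome.start)
print(chr.attr$autosome.end)

# Use the attributes to flag autosomal SNPs
chr <- read.gdsn(chr.node)
is.autosome <- chr >= chr.attr$autosome.start & chr <= chr.attr$autosome.end
n.non.autosome <- sum(!is.autosome)

snpgdsClose(gds)
```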
There are four possible values stored in the variable genotype: 0, 1, 2 and 3. For bi-allelic SNP sites, “0” indicates two B alleles, “1” indicates one A allele and one B allele, “2” indicates two A alleles, and “3” is a missing genotype. For multi-allelic sites, it is a count of the reference allele (3 meaning no call). “Bit2” indicates that each byte encodes up to four SNP genotypes since one byte consists of eight bits.
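To make the “Bit2” encoding concrete, the base R sketch below packs genotype codes into bytes, four per byte, and unpacks them again. It is an illustration only: the helper names pack_bit2() and unpack_bit2() are hypothetical, and the bit order (first genotype in the two least significant bits) is an assumption rather than a statement about the internals of gdsfmt.

```r
# Pack genotype codes (0, 1, 2, 3) into byte values, four genotypes per byte.
# Assumption: the first genotype occupies the two least significant bits.
pack_bit2 <- function(g) {
  stopifnot(all(g %in% 0:3))
  # Pad with the missing code (3) up to a multiple of four
  pad <- (4 - length(g) %% 4) %% 4
  g <- c(as.integer(g), rep(3L, pad))
  m <- matrix(g, nrow=4)
  # Weight the four rows by 1, 4, 16, 64 and sum each column into one byte
  as.integer(colSums(m * c(1L, 4L, 16L, 64L)))
}

# Recover the first n genotype codes from the packed byte values
unpack_bit2 <- function(bytes, n) {
  w <- c(1L, 4L, 16L, 64L)
  g <- unlist(lapply(bytes, function(b) (b %/% w) %% 4L))
  g[seq_len(n)]
}

g <- c(2L, 1L, 0L, 3L, 1L)
bytes <- pack_bit2(g)
print(length(bytes))                              # number of bytes used
print(identical(unpack_bit2(bytes, length(g)), g))

# Storage for 279 samples x 9088 SNPs at 2 bits per genotype, in kilobytes
print(279 * 9088 / 4 / 1024)
```

The last line gives roughly 619, which agrees with the 619.0K reported for the genotype variable in the file listing.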
# Take out genotype data for the first 5 samples and the first 3 SNPs
# (the genotype array in this file is stored as sample-by-SNP, 279x9088)
(g <- read.gdsn(index.gdsn(genofile, "genotype"), start=c(1,1), count=c(5,3)))
## [,1] [,2] [,3]
## [1,] 2 1 0
## [2,] 1 1 0
## [3,] 2 1 1
## [4,] 2 1 1
## [5,] 0 0 0
Or take out genotype data with sample and SNP IDs; four possible values are returned: 0, 1, 2 and NA (3 is replaced by NA):
## $sample.order
## NULL
The returned value could be either “snp.order” or “sample.order”, indicating individual-major mode (snp is the first dimension) and SNP-major mode (sample is the first dimension) respectively.
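The calls behind this step are not shown, so here is a hedged sketch of one way to do it with snpgdsGetGeno() and get.attr.gdsn(). It assumes the example file shipped with SNPRelate; the handle name `gds` and the choice of the first three samples and first five SNPs are illustrative.

```r
library(gdsfmt)
library(SNPRelate)

gds <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate=TRUE)

# Storage-order flag attached to the genotype node
geno.attr <- get.attr.gdsn(index.gdsn(gds, "genotype"))
print(names(geno.attr))

# IDs of the first three samples and the first five SNPs
samp.id <- read.gdsn(index.gdsn(gds, "sample.id"))[1:3]
snp.id <- read.gdsn(index.gdsn(gds, "snp.id"))[1:5]

# Genotypes with IDs attached; snpfirstdim=FALSE requests sample-by-SNP order
g <- snpgdsGetGeno(gds, sample.id=samp.id, snp.id=snp.id,
    snpfirstdim=FALSE, with.id=TRUE, verbose=FALSE)
print(g$sample.id)
print(g$snp.id)
print(g$genotype)   # values are 0, 1, 2 or NA

snpgdsClose(gds)
```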
## [1] 1 2 3 4 5 6
## [1] "rs1695824" "rs13328662" "rs4654497" "rs10915489" "rs12132314"
## [6] "rs12042555"
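The two outputs above look like the first entries of snp.id and snp.rs.id. A minimal sketch of how such values can be read, assuming the example file shipped with SNPRelate (the handle name `gds` is illustrative):

```r
library(gdsfmt)
library(SNPRelate)

gds <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate=TRUE)

# Internal SNP identifiers used by the SNPRelate functions
snp.id <- read.gdsn(index.gdsn(gds, "snp.id"))
print(head(snp.id))

# Reference SNP (rs) identifiers stored alongside them
snp.rs.id <- read.gdsn(index.gdsn(gds, "snp.rs.id"))
print(head(snp.rs.id))

# Look up the rs identifier that belongs to a given snp.id
rs.of.5 <- snp.rs.id[match(5, snp.id)]

snpgdsClose(gds)
```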
There are two additional and optional variables:
Sample annotation can be obtained with the same function, read.gdsn(); for example, the population information. “VStr8” indicates a character-type variable.
# Read population information
pop <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group"))
table(pop)
## pop
## CEU HCB JPT YRI
## 92 47 47 93
The function snpgdsCreateGeno() can be used to create a GDS file. The first argument should be a numeric matrix of SNP genotypes. The possible values stored in the input genotype matrix are 0, 1, 2 and other values: “0” indicates two B alleles, “1” indicates one A allele and one B allele, “2” indicates two A alleles, and other values indicate a missing genotype. The SNP matrix can be either nsample × nsnp (snpfirstdim=FALSE, the argument in snpgdsCreateGeno()) or nsnp × nsample (snpfirstdim=TRUE).
For example,
# Load data
data(hapmap_geno)
# Create a gds file
snpgdsCreateGeno("test.gds", genmat = hapmap_geno$genotype,
    sample.id = hapmap_geno$sample.id, snp.id = hapmap_geno$snp.id,
    snp.chromosome = hapmap_geno$snp.chromosome,
    snp.position = hapmap_geno$snp.position,
    snp.allele = hapmap_geno$snp.allele, snpfirstdim=TRUE)
# Open the GDS file
(genofile <- snpgdsOpen("test.gds"))
## File: /tmp/RtmpRSE5ca/Rbuild190e19a9cf6c/SNPRelate/vignettes/test.gds (79.0K)
## + [ ] *
## |--+ sample.id { Str8 279 ZIP_ra(31.2%), 715B }
## |--+ snp.id { Str8 1000 ZIP_ra(43.7%), 4.4K }
## |--+ snp.position { Int32 1000 ZIP_ra(95.9%), 3.8K }
## |--+ snp.chromosome { Int32 1000 ZIP_ra(2.25%), 97B }
## |--+ snp.allele { Str8 1000 ZIP_ra(14.1%), 571B }
## \--+ genotype { Bit2 1000x279, 68.1K } *
In the following code, the functions createfn.gds(), add.gdsn(), put.attr.gdsn(), write.gdsn() and index.gdsn() are defined in the package gdsfmt:
# Create a new GDS file
newfile <- createfn.gds("your_gds_file.gds")
# add a flag
put.attr.gdsn(newfile$root, "FileFormat", "SNP_ARRAY")
# Add variables
add.gdsn(newfile, "sample.id", sample.id)
add.gdsn(newfile, "snp.id", snp.id)
add.gdsn(newfile, "snp.chromosome", snp.chromosome)
add.gdsn(newfile, "snp.position", snp.position)
add.gdsn(newfile, "snp.allele", c("A/G", "T/C", ...))
#####################################################################
# Create a snp-by-sample genotype matrix
# Add genotypes
var.geno <- add.gdsn(newfile, "genotype",
    valdim=c(length(snp.id), length(sample.id)), storage="bit2")
# Indicate the SNP matrix is snp-by-sample
put.attr.gdsn(var.geno, "snp.order")
# Write SNPs into the file sample by sample
for (i in seq_along(sample.id))
{
    # g: genotypes (0, 1, 2 or 3) of all SNPs for sample i
    g <- ...
    write.gdsn(var.geno, g, start=c(1,i), count=c(-1,1))
}
#####################################################################
# OR, create a sample-by-snp genotype matrix
# Add genotypes
var.geno <- add.gdsn(newfile, "genotype",
    valdim=c(length(sample.id), length(snp.id)), storage="bit2")
# Indicate the SNP matrix is sample-by-snp
put.attr.gdsn(var.geno, "sample.order")
# Write SNPs into the file SNP by SNP
for (i in seq_along(snp.id))
{
    # g: genotypes (0, 1, 2 or 3) of all samples for SNP i
    g <- ...
    write.gdsn(var.geno, g, start=c(1,i), count=c(-1,1))
}
# Get a description of chromosome codes
# allowing to define a new chromosome code, e.g., snpgdsOption(Z=27)
option <- snpgdsOption()
var.chr <- index.gdsn(newfile, "snp.chromosome")
put.attr.gdsn(var.chr, "autosome.start", option$autosome.start)
put.attr.gdsn(var.chr, "autosome.end", option$autosome.end)
for (i in 1:length(option$chromosome.code))
{
    put.attr.gdsn(var.chr, names(option$chromosome.code)[i],
        option$chromosome.code[[i]])
}
# Add your sample annotation
samp.annot <- data.frame(sex = c("male", "male", "female", ...),
    pop.group = c("CEU", "CEU", "JPT", ...), ...)
add.gdsn(newfile, "sample.annot", samp.annot)
# Add your SNP annotation
snp.annot <- data.frame(pass=c(TRUE, TRUE, FALSE, FALSE, TRUE, ...), ...)
add.gdsn(newfile, "snp.annot", snp.annot)
# Close the GDS file
closefn.gds(newfile)
The SNPRelate package provides the functions snpgdsPED2GDS() and snpgdsBED2GDS() for converting a PLINK text/binary file to a GDS file:
# The PLINK BED file, using the example in the SNPRelate package
bed.fn <- system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate")
fam.fn <- system.file("extdata", "plinkhapmap.fam.gz", package="SNPRelate")
bim.fn <- system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate")
Or, use your own PLINK files:
bed.fn <- "C:/your_folder/your_plink_file.bed"
fam.fn <- "C:/your_folder/your_plink_file.fam"
bim.fn <- "C:/your_folder/your_plink_file.bim"
## Start file conversion from PLINK BED to SNP GDS ...
## BED file: '/tmp/RtmpRSE5ca/Rinst190e3aed40e7/SNPRelate/extdata/plinkhapmap.bed.gz'
## SNP-major mode (Sample X SNP), 45.7K
## FAM file: '/tmp/RtmpRSE5ca/Rinst190e3aed40e7/SNPRelate/extdata/plinkhapmap.fam.gz'
## BIM file: '/tmp/RtmpRSE5ca/Rinst190e3aed40e7/SNPRelate/extdata/plinkhapmap.bim.gz'
## Thu Oct 31 05:22:47 2024 (store sample id, snp id, position, and chromosome)
## start writing: 60 samples, 5000 SNPs ...
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Thu Oct 31 05:22:47 2024 Done.
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'test.gds' (98.1K)
## # of fragments: 38
## save to 'test.gds.tmp'
## rename 'test.gds.tmp' (97.8K, reduced: 240B)
## # of fragments: 18
## The file name: /tmp/RtmpRSE5ca/Rbuild190e19a9cf6c/SNPRelate/vignettes/test.gds
## The total number of samples: 60
## The total number of SNPs: 5000
## SNP genotypes are stored in SNP-major mode (Sample X SNP).
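The conversion log above comes from a call that is not shown in this section. A minimal sketch of the conversion, assuming the PLINK example files shipped with SNPRelate; the output path is a temporary file chosen here so that no existing file is overwritten.

```r
library(SNPRelate)

# PLINK example files shipped with SNPRelate
bed.fn <- system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate")
fam.fn <- system.file("extdata", "plinkhapmap.fam.gz", package="SNPRelate")
bim.fn <- system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate")

# Convert to GDS, writing to a temporary file (illustrative choice)
out.fn <- tempfile(fileext=".gds")
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, out.fn, verbose=FALSE)

# Summarize the converted file; the result is a list of sample and SNP IDs
s <- snpgdsSummary(out.fn)
n.samp <- length(s$sample.id)
n.snp <- length(s$snp.id)

# Remove the temporary file
unlink(out.fn)
```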
The SNPRelate package provides a function snpgdsVCF2GDS() to reformat a VCF file. There are two options for extracting markers from a VCF file for downstream analyses: (1) extract and store the dosage of the reference allele only for biallelic SNPs; (2) extract and store the dosage of the reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs, indels and structural variants.
# The VCF file, using the example in the SNPRelate package
vcf.fn <- system.file("extdata", "sequence.vcf", package="SNPRelate")
Or, use your own VCF file:
## Start file conversion from VCF to SNP GDS ...
## Method: extracting biallelic SNPs
## Number of samples: 3
## Parsing "/tmp/RtmpRSE5ca/Rinst190e3aed40e7/SNPRelate/extdata/sequence.vcf" ...
## import 2 variants.
## + genotype { Bit2 3x2, 2B } *
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'test.gds' (2.9K)
## # of fragments: 46
## save to 'test.gds.tmp'
## rename 'test.gds.tmp' (2.6K, reduced: 312B)
## # of fragments: 20
## The file name: /tmp/RtmpRSE5ca/Rbuild190e19a9cf6c/SNPRelate/vignettes/test.gds
## The total number of samples: 3
## The total number of SNPs: 2
## SNP genotypes are stored in SNP-major mode (Sample X SNP).
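The log above reports the method “extracting biallelic SNPs”, which corresponds to the first of the two options. A hedged sketch of that conversion, assuming the example VCF file shipped with SNPRelate and a temporary output file chosen here for illustration:

```r
library(SNPRelate)

# Example VCF file shipped with SNPRelate
vcf.fn <- system.file("extdata", "sequence.vcf", package="SNPRelate")

# Keep biallelic SNPs only; use method="copy.num.of.ref" to keep all
# variant sites and store the dosage of the reference allele instead
out.fn <- tempfile(fileext=".gds")
snpgdsVCF2GDS(vcf.fn, out.fn, method="biallelic.only", verbose=FALSE)

# Summarize the converted file
s <- snpgdsSummary(out.fn)
n.samp <- length(s$sample.id)
n.snp <- length(s$snp.id)

unlink(out.fn)
```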
The SeqArray package provides a function seqVCF2GDS() to reformat a VCF file, and it allows merging multiple VCF files during format conversion. The genotypic and annotation data are stored in a compressed manner by default. SeqArray is strongly suggested for large-scale whole-exome and whole-genome sequencing variant data. See: SeqArray R Integration for more details.
library(SeqArray)
# the VCF file, using the example in the SeqArray package
vcf.fn <- seqExampleFileName("vcf")
# or vcf.fn <- "C:/YourFolder/Your_VCF_File.vcf.gz"
# convert, save in "test.gds" with the default lzma compression algorithm
seqVCF2GDS(vcf.fn, "test.gds")
## Tue Mar 20 13:53:38 2018
## Variant Call Format (VCF) Import:
## file(s):
## CEU_Exon.vcf.gz (226.0K)
## file format: VCFv4.0
## the number of sets of chromosomes (ploidy): 2
## the number of samples: 90
## genotype storage: bit2
## compression method: LZMA_RA
## Output:
## test.gds
## Parsing 'CEU_Exon.vcf.gz':
## + genotype/data { Bit2 2x90x1348 LZMA_ra, 42B }
## Digests:
## sample.id [md5: ac460b05cf0de81d3a307259fb908238]
## variant.id [md5: c9602a5420b6a5a148f5a0120a8750e1]
## position [md5: a23801beb47fb2d7ca26b65d2b71e622]
## chromosome [md5: a46ad5529a68298eb581c7c66b31b99b]
## allele [md5: e65988a36b2675d1e4f6a9ad9d2774a9]
## genotype [md5: 318c71bd2c1878e7d05c6e4b8b3067ef]
## phase [md5: 4873107397a2eec80cca77d8fa09592b]
## annotation/id [md5: 164df6a971c24c99ad386bbaf8759cb2]
## annotation/qual [md5: ff3b3c516fe7081c406d4c26782b44e4]
## annotation/filter [md5: 5b09a6e58b307857c38e3d82284dfff0]
## annotation/info/AA [md5: 7bba129ada9e50a98db7451044abdde9]
## annotation/info/AC [md5: 79076139f25b3f78164182af5d86c680]
## annotation/info/AN [md5: b4c305461e62a78dc439f7a1df50e5fc]
## annotation/info/DP [md5: 9f358649989b5fd48fba25b6b50af02f]
## annotation/info/HM2 [md5: 9b792cdd10840bdda63d77a1ce065588]
## annotation/info/HM3 [md5: b936dc73a3ffa1241305dfdcc14d71e1]
## annotation/info/OR [md5: 6f6f800d686268b592ac50f10c5851b9]
## annotation/info/GP [md5: a1ccfb37b78edd2bb1204c8b9c901b0a]
## annotation/info/BN [md5: 0ac62828c0c8d3d27cbd15aa975532fd]
## annotation/format/DP [md5: d967efdfcb57f3327af2cbf1adc21bbb]
## Done.
## Tue Mar 20 13:53:39 2018
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'test.gds' (163.3K)
## # of fragments: 155
## save to 'test.gds.tmp'
## rename 'test.gds.tmp' (162.3K, reduced: 1.0K)
## # of fragments: 66
## Tue Mar 20 13:53:39 2018
Get Data:
It is suggested to use seqGetData() to take out data from the SeqArray file since this function can take care of variable-length data and multi-allelic genotypes, although users could also use read.gdsn() in the gdsfmt package to read data.
# take out sample id
head(samp.id <- seqGetData(genofile, "sample.id"))
## [1] "NA06984" "NA06985" "NA06986" "NA06989" "NA06994" "NA07000"
# take out variant id
head(variant.id <- seqGetData(genofile, "variant.id"))
## [1] 1 2 3 4 5 6
# get "chromosome"
table(seqGetData(genofile, "chromosome"))
## 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9
## 142 70 16 62 11 61 46 84 100 54 111 59 59 23 23 81 48 61 99 58 51 29
# get "allele"
head(seqGetData(genofile, "allele"))
## [1] "T,C" "G,A" "G,A" "T,C" "G,C" "C,T"
# get "annotation/info/GP"
head(seqGetData(genofile, "annotation/info/GP"))
## [1] "1:1115503" "1:1115548" "1:1120431" "1:3548136" "1:3548832" "1:3551737"
# get "sample.annotation/family"
head(seqGetData(genofile, "sample.annotation/family"))
## [1] "1328" "" "13291" "1328" "1340" "1340"
Users can set a filter on samples and/or variants with seqSetFilter(). For example, a subset consisting of three samples and four variants:
# set sample and variant filters
seqSetFilter(genofile, sample.id=samp.id[c(2,4,6)])
# or seqSetFilter(genofile, sample.sel=c(2,4,6))
## # of selected samples: 3
set.seed(100)
seqSetFilter(genofile, variant.id=sample(variant.id, 4))
# or seqSetFilter(genofile, variant.sel=...) # an integer vector
## # of selected variants: 4
# get "allele"
seqGetData(genofile, "allele")
## [1] "T,A" "G,A" "G,C" "A,G"
Get genotypic data: it is a 3-dimensional array with respect to allele, sample and variant. 0 refers to the reference allele (or the first allele in the variable allele), 1 to the second allele, and so on, while NA indicates a missing allele.
# get genotypic data
seqGetData(genofile, "genotype")
## , , 1
## sample
## allele [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
##
## , , 2
## sample
## allele [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 0 0
##
## , , 3
## sample
## allele [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
##
## , , 4
## sample
## allele [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
Get regular genotypes (i.e., genotype dosage, or the number of copies of the reference allele): it is an integer matrix.
# get the dosage of reference allele
seqGetData(genofile, "$dosage")
## variant
## sample [,1] [,2] [,3] [,4]
## [1,] 2 1 2 2
## [2,] 2 2 2 2
## [3,] 2 2 2 2
# close the file
seqClose(genofile)
# Open the GDS file
genofile <- snpgdsOpen(snpgdsExampleFileName())
# Get population information
# or pop_code <- scan("pop.txt", what=character())
# if it is stored in a text file "pop.txt"
pop_code <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group"))
table(pop_code)
## pop_code
## CEU HCB JPT YRI
## 92 47 47 93
# Display the first six values
head(pop_code)
## [1] "YRI" "YRI" "YRI" "YRI" "CEU" "CEU"
It is suggested to use a pruned set of SNPs which are in approximate linkage equilibrium with each other to avoid the strong influence of SNP clusters in principal component analysis and relatedness analysis.
set.seed(1000)
# Try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)
## SNP pruning based on LD:
## Excluding 365 SNPs on non-autosomes
## Excluding 139 SNPs (monomorphic: TRUE, MAF: 0.005, missing rate: 0.05)
## # of samples: 279
## # of SNPs: 8,584
## using 1 thread
## sliding window: 500,000 basepairs, Inf SNPs
## |LD| threshold: 0.2
## method: composite
## Chrom 1: |====================|====================|
## 74.30%, 532 / 716 (Thu Oct 31 05:22:47 2024)
## Chrom 2: |====================|====================|
## 72.24%, 536 / 742 (Thu Oct 31 05:22:47 2024)
## Chrom 3: |====================|====================|
## 73.40%, 447 / 609 (Thu Oct 31 05:22:47 2024)
## Chrom 4: |====================|====================|
## 72.42%, 407 / 562 (Thu Oct 31 05:22:47 2024)
## Chrom 5: |====================|====================|
## 75.80%, 429 / 566 (Thu Oct 31 05:22:47 2024)
## Chrom 6: |====================|====================|
## 73.81%, 417 / 565 (Thu Oct 31 05:22:47 2024)
## Chrom 7: |====================|====================|
## 75.21%, 355 / 472 (Thu Oct 31 05:22:47 2024)
## Chrom 8: |====================|====================|
## 69.67%, 340 / 488 (Thu Oct 31 05:22:47 2024)
## Chrom 9: |====================|====================|
## 76.92%, 320 / 416 (Thu Oct 31 05:22:47 2024)
## Chrom 10: |====================|====================|
## 73.08%, 353 / 483 (Thu Oct 31 05:22:47 2024)
## Chrom 11: |====================|====================|
## 76.51%, 342 / 447 (Thu Oct 31 05:22:47 2024)
## Chrom 12: |====================|====================|
## 74.71%, 319 / 427 (Thu Oct 31 05:22:47 2024)
## Chrom 13: |====================|====================|
## 76.74%, 264 / 344 (Thu Oct 31 05:22:47 2024)
## Chrom 14: |====================|====================|
## 76.24%, 215 / 282 (Thu Oct 31 05:22:47 2024)
## Chrom 15: |====================|====================|
## 75.95%, 199 / 262 (Thu Oct 31 05:22:47 2024)
## Chrom 16: |====================|====================|
## 70.86%, 197 / 278 (Thu Oct 31 05:22:47 2024)
## Chrom 17: |====================|====================|
## 76.33%, 158 / 207 (Thu Oct 31 05:22:47 2024)
## Chrom 18: |====================|====================|
## 73.31%, 195 / 266 (Thu Oct 31 05:22:47 2024)
## Chrom 19: |====================|====================|
## 82.50%, 99 / 120 (Thu Oct 31 05:22:47 2024)
## Chrom 20: |====================|====================|
## 70.31%, 161 / 229 (Thu Oct 31 05:22:47 2024)
## Chrom 21: |====================|====================|
## 75.40%, 95 / 126 (Thu Oct 31 05:22:47 2024)
## Chrom 22: |====================|====================|
## 75.86%, 88 / 116 (Thu Oct 31 05:22:47 2024)
## 6,468 markers are selected in total.
## List of 22
## $ chr1 : int [1:532] 1 2 4 5 7 10 12 14 15 16 ...
## $ chr2 : int [1:536] 717 718 719 720 721 723 724 725 726 727 ...
## $ chr3 : int [1:447] 1459 1461 1464 1466 1468 1469 1471 1472 1473 1476 ...
## $ chr4 : int [1:407] 2068 2069 2070 2071 2072 2074 2075 2076 2077 2078 ...
## $ chr5 : int [1:429] 2630 2631 2635 2636 2637 2638 2640 2642 2643 2645 ...
## $ chr6 : int [1:417] 3196 3197 3198 3200 3201 3204 3205 3206 3207 3208 ...
## $ chr7 : int [1:355] 3761 3762 3763 3766 3767 3768 3770 3771 3772 3773 ...
## $ chr8 : int [1:340] 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 ...
## $ chr9 : int [1:320] 4721 4722 4724 4727 4728 4730 4731 4732 4733 4735 ...
## $ chr10: int [1:353] 5138 5139 5140 5143 5144 5145 5146 5147 5148 5149 ...
## $ chr11: int [1:342] 5620 5623 5624 5625 5626 5628 5629 5630 5631 5632 ...
## $ chr12: int [1:319] 6067 6068 6069 6070 6073 6074 6075 6077 6078 6079 ...
## $ chr13: int [1:264] 6494 6497 6498 6499 6500 6501 6503 6505 6507 6509 ...
## $ chr14: int [1:215] 6840 6841 6842 6843 6844 6845 6846 6847 6848 6850 ...
## $ chr15: int [1:199] 7120 7121 7122 7124 7125 7126 7127 7128 7129 7130 ...
## $ chr16: int [1:197] 7382 7383 7384 7385 7387 7388 7389 7391 7392 7394 ...
## $ chr17: int [1:158] 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 ...
## $ chr18: int [1:195] 7867 7868 7869 7870 7871 7872 7873 7874 7875 7877 ...
## $ chr19: int [1:99] 8133 8135 8136 8137 8138 8139 8140 8141 8142 8144 ...
## $ chr20: int [1:161] 8253 8254 8257 8258 8259 8260 8261 8262 8265 8266 ...
## $ chr21: int [1:95] 8482 8484 8485 8486 8487 8488 8489 8490 8491 8492 ...
## $ chr22: int [1:88] 8608 8609 8610 8612 8613 8614 8615 8617 8618 8620 ...
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chr20" "chr21" "chr22"
## [1] 1 2 4 5 7 10
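The outputs above show the structure of the pruning result: a list with one integer vector of SNP IDs per chromosome. The sketch below, which assumes the example file shipped with SNPRelate and the illustrative handle name `gds`, inspects that list and flattens it into a single vector of SNP IDs (named `snpset.id` here) that can be passed to the `snp.id` argument of later analyses.

```r
library(gdsfmt)
library(SNPRelate)

gds <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate=TRUE)

# LD pruning; the selected SNPs depend on the random starting position
set.seed(1000)
snpset <- snpgdsLDpruning(gds, ld.threshold=0.2, verbose=FALSE)

# One list element per chromosome
print(names(snpset))
print(head(snpset$chr1))

# Flatten into a single vector of selected SNP IDs
snpset.id <- unlist(unname(snpset))
print(length(snpset.id))

snpgdsClose(gds)
```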
The functions in SNPRelate for PCA include calculating the genetic covariance matrix from genotypes, computing the correlation coefficients between sample loadings and genotypes for each SNP, calculating SNP eigenvectors (loadings), and estimating the sample loadings of a new dataset from specified SNP eigenvectors.
## Principal Component Analysis (PCA) on genotypes:
## Excluding 2,620 SNPs (non-autosomes or non-selection)
## Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 279
## # of SNPs: 6,468
## using 2 threads
## # of principal components: 32
## PCA: the sum of all selected genotypes (0,1,2) = 1809692
## CPU capabilities: Double-Precision SSE2
## Thu Oct 31 05:22:47 2024 (internal increment: 14780)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Thu Oct 31 05:22:47 2024 Begin (eigenvalues and eigenvectors)
## Thu Oct 31 05:22:47 2024 Done.
The code below shows how to calculate the percentage of variation accounted for by the top principal components. The first two eigenvectors clearly hold the largest percentage of variance among the population, although together they still account for less than one-quarter of the total variance.
## [1] 10.30 5.53 1.03 0.98 0.87 0.77
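The PCA call and the variance calculation are not shown in this section. The following is a minimal sketch, assuming the example file shipped with SNPRelate. For brevity it runs the PCA on all autosomal SNPs instead of the LD-pruned set, so the percentages will differ somewhat from those printed above; in the workflow of this vignette the pruned SNP IDs would be passed through the `snp.id` argument.

```r
library(gdsfmt)
library(SNPRelate)

gds <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate=TRUE)

# Run PCA on two threads (all autosomal SNPs; add snp.id=... to restrict)
pca <- snpgdsPCA(gds, num.thread=2, verbose=FALSE)

# Proportion of variance per principal component, as a percentage
pc.percent <- pca$varprop * 100
print(head(round(pc.percent, 2)))

# Combined share of the first two components
top2 <- sum(pc.percent[1:2])

snpgdsClose(gds)
```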
In the case of no prior population information,
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1], # the first eigenvector
    EV2 = pca$eigenvect[,2], # the second eigenvector
    stringsAsFactors = FALSE)
head(tab)
## sample.id EV1 EV2
## 1 NA19152 -0.08171478 0.009230963
## 2 NA19139 -0.08420799 0.010582727
## 3 NA18912 -0.08278560 0.012231918
## 4 NA19160 -0.08751380 0.011927292
## 5 NA07034 0.03176569 -0.078990295
## 6 NA07055 0.03492885 -0.082021198
If population information is available,
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# Get population information
# or pop_code <- scan("pop.txt", what=character())
# if it is stored in a text file "pop.txt"
pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))
# assume the order of sample IDs is the same as that of the population codes
head(cbind(sample.id, pop_code))
## sample.id pop_code
## [1,] "NA19152" "YRI"
## [2,] "NA19139" "YRI"
## [3,] "NA18912" "YRI"
## [4,] "NA19160" "YRI"
## [5,] "NA07034" "CEU"
## [6,] "NA07055" "CEU"
# Make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id, sample.id)],
    EV1 = pca$eigenvect[,1], # the first eigenvector
    EV2 = pca$eigenvect[,2], # the second eigenvector
    stringsAsFactors = FALSE)
head(tab)
## sample.id pop EV1 EV2
## 1 NA19152 YRI -0.08171478 0.009230963
## 2 NA19139 YRI -0.08420799 0.010582727
## 3 NA18912 YRI -0.08278560 0.012231918
## 4 NA19160 YRI -0.08751380 0.011927292
## 5 NA07034 CEU 0.03176569 -0.078990295
## 6 NA07055 CEU 0.03492885 -0.082021198
# Draw
plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))
Plot the principal component pairs for the first four PCs:
lbls <- paste("PC", 1:4, "\n", format(pc.percent[1:4], digits=2), "%", sep="")
pairs(pca$eigenvect[,1:4], col=tab$pop, labels=lbls)
Parallel coordinates plot for the top principal components:
library(MASS)
datpop <- factor(pop_code)[match(pca$sample.id, sample.id)]
parcoord(pca$eigenvect[,1:16], col=datpop)
To calculate the SNP correlations between eigenvectors and SNP genotypes:
# Get chromosome index
chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome"))
CORR <- snpgdsPCACorr(pca, genofile, eig.which=1:4)
## SNP Correlation:
## # of samples: 279
## # of SNPs: 9,088
## using 1 thread
## Correlation: the sum of all selected genotypes (0,1,2) = 2553065
## Thu Oct 31 05:22:48 2024 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Thu Oct 31 05:22:48 2024 Done.
savepar <- par(mfrow=c(2,1), mai=c(0.45, 0.55, 0.1, 0.25))
for (i in 1:2)
{
    plot(abs(CORR$snpcorr[i,]), ylim=c(0,1), xlab="", ylab=paste("PC", i),
        col=chr, pch="+")
}
# Restore the previous graphical parameters
par(savepar)
Given two or more populations, Fst can be estimated by the method of Weir & Cockerham (1984).
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# Get population information
# or pop_code <- scan("pop.txt", what=character())
# if it is stored in a text file "pop.txt"
pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))
# Two populations: HCB and JPT
flag <- pop_code %in% c("HCB", "JPT")
samp.sel <- sample.id[flag]
pop.sel <- pop_code[flag]
v <- snpgdsFst(genofile, sample.id=samp.sel, population=as.factor(pop.sel),
method="W&C84")
## Fst estimation on genotypes:
## Excluding 365 SNPs on non-autosomes
## Excluding 1,682 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 94
## # of SNPs: 7,041
## Method: Weir & Cockerham, 1984
## # of Populations: 2
## HCB (47), JPT (47)
## [1] 0.007560346
## [1] 0.00703106
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.022312 -0.008565 -0.001147 0.007031 0.012537 0.193880 1
# Multiple populations: CEU HCB JPT YRI
# we should remove offspring
father <- read.gdsn(index.gdsn(genofile, "sample.annot/father.id"))
mother <- read.gdsn(index.gdsn(genofile, "sample.annot/mother.id"))
flag <- (father=="") & (mother=="")
samp.sel <- sample.id[flag]
pop.sel <- pop_code[flag]
v <- snpgdsFst(genofile, sample.id=samp.sel, population=as.factor(pop.sel),
method="W&C84")
## Fst estimation on genotypes:
## Excluding 365 SNPs on non-autosomes
## Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 219
## # of SNPs: 8,722
## Method: Weir & Cockerham, 1984
## # of Populations: 4
## CEU (62), HCB (47), JPT (47), YRI (63)
## [1] 0.1377293
## [1] 0.1206991
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.009225 0.042554 0.091801 0.120699 0.167754 0.792465 1
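The three numeric outputs printed after each Fst run appear to be the components of the returned list. A hedged sketch of how they can be inspected, repeating the two-population case and assuming the example file shipped with SNPRelate (the handle name `gds` is illustrative):

```r
library(gdsfmt)
library(SNPRelate)

gds <- snpgdsOpen(snpgdsExampleFileName(), allow.duplicate=TRUE)

sample.id <- read.gdsn(index.gdsn(gds, "sample.id"))
pop_code <- read.gdsn(index.gdsn(gds, "sample.annot/pop.group"))

# Two populations: HCB and JPT
flag <- pop_code %in% c("HCB", "JPT")
v <- snpgdsFst(gds, sample.id=sample.id[flag],
    population=as.factor(pop_code[flag]), method="W&C84", verbose=FALSE)

print(v$Fst)              # weighted Fst estimate
print(v$MeanFst)          # average of the per-SNP estimates
print(summary(v$FstSNP))  # distribution of the per-SNP estimates

snpgdsClose(gds)
```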
For n study individuals, snpgdsIBS() can be used to create an n × n matrix of genome-wide average IBS pairwise identities:
## Identity-By-State (IBS) analysis on genotypes:
## Excluding 365 SNPs on non-autosomes
## Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 279
## # of SNPs: 8,722
## using 2 threads
## IBS: the sum of all selected genotypes (0,1,2) = 2446510
## Thu Oct 31 05:22:55 2024 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 1s
## Thu Oct 31 05:22:56 2024 Done.
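What an average IBS proportion measures can be illustrated on a toy genotype matrix in base R. This sketch assumes the usual definition for genotypes coded 0, 1, 2: the mean of 1 - |g_i - g_j| / 2 over SNPs that are non-missing in both individuals. The helper `ibs_matrix` and the toy data are hypothetical and are not part of SNPRelate.

```r
# Average IBS proportion between all pairs of individuals.
# geno: samples x SNPs matrix of allele counts (0, 1, 2); NA marks missing calls.
ibs_matrix <- function(geno) {
  n <- nrow(geno)
  out <- matrix(NA_real_, n, n,
                dimnames = list(rownames(geno), rownames(geno)))
  for (i in seq_len(n)) {
    for (j in i:n) {
      shared <- 2 - abs(geno[i, ] - geno[j, ])   # alleles shared IBS at each SNP
      out[i, j] <- out[j, i] <- mean(shared, na.rm = TRUE) / 2
    }
  }
  out
}

geno <- rbind(s1 = c(0, 1, 2, 2),
              s2 = c(0, 1, 2, 0),
              s3 = c(2, 1, 0, NA))
ibs_toy <- ibs_matrix(geno)
```

The result is symmetric with ones on the diagonal. Here s1 and s2 differ at only one of four SNPs and share an IBS proportion of 0.75, whereas s1 and s3 are opposite homozygotes at two of their three jointly observed SNPs.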
The IBS matrix can be displayed as a heat map:
# order individuals by population so that each population forms a contiguous block
pop.idx <- order(pop_code)
image(ibs$ibs[pop.idx, pop.idx], col=terrain.colors(16))
To perform multidimensional scaling analysis on the n × n matrix of genome-wide IBS pairwise distances:
loc <- cmdscale(1 - ibs$ibs, k = 2)
x <- loc[, 1]; y <- loc[, 2]
race <- as.factor(pop_code)
plot(x, y, col=race, xlab = "", ylab = "",
main = "Multidimensional Scaling Analysis (IBS)")
legend("topleft", legend=levels(race), pch="o", text.col=1:nlevels(race))
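For readers who want to see what cmdscale() computes, the sketch below reproduces classical multidimensional scaling by hand in base R: the squared distances are double-centred and the leading eigenvectors are scaled by the square roots of their eigenvalues. The function name `classical_mds` and the toy points are hypothetical; with real data the input would be the 1 - IBS matrix used above.

```r
# Classical MDS by hand: double-centre the squared distances, then take the
# leading eigenvectors scaled by the square roots of their eigenvalues.
classical_mds <- function(d, k = 2) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  centre <- diag(n) - matrix(1 / n, n, n)     # centring matrix
  b <- -0.5 * centre %*% d2 %*% centre        # Gram matrix of the centred points
  e <- eigen(b, symmetric = TRUE)
  e$vectors[, seq_len(k), drop = FALSE] %*%
    diag(sqrt(e$values[seq_len(k)]), k)
}

# Toy data (hypothetical): the four corners of a 3 x 1 rectangle
pts <- rbind(c(0, 0), c(3, 0), c(3, 1), c(0, 1))
d <- dist(pts)

mds_manual  <- classical_mds(d, k = 2)
mds_builtin <- cmdscale(d, k = 2)
```

The two results agree up to the sign of each axis, which is arbitrary in an eigen-decomposition, and the pairwise distances between the recovered coordinates match the input distances.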
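To perform cluster analysis on the n × n matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score:
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))
# Determine groups of individuals automatically
rv <- snpgdsCutTree(ibs.hc)
table(rv$samp.group)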
## Identity-By-State (IBS) analysis on genotypes:
## Excluding 365 SNPs on non-autosomes
## Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 279
## # of SNPs: 8,722
## using 2 threads
## IBS: the sum of all selected genotypes (0,1,2) = 2446510
## Thu Oct 31 05:22:56 2024 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Thu Oct 31 05:22:56 2024 Done.
## Determine groups by permutation (Z threshold: 15, outlier threshold: 5):
## Create 3 groups.
##
## G001 G002 G003
## 93 94 92
Alternatively, the groups can be determined from the known population information:
# Determine groups of individuals by population information
rv2 <- snpgdsCutTree(ibs.hc, samp.group=as.factor(pop_code))
## Create 4 groups.
plot(rv2$dendrogram, leaflab="none", main="HapMap Phase II")
legend("topright", legend=levels(race), col=1:nlevels(race), pch=19, ncol=4)
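The clustering steps above can be sketched on a toy similarity matrix using only base R. This example swaps in hclust() with average linkage (chosen here purely for illustration) and cutree() with a hand-picked number of groups; it does not reproduce the permutation score that snpgdsCutTree() uses to choose the groups. The matrix `sim` and its values are hypothetical.

```r
# Toy IBS-like similarity matrix (hypothetical): two tight groups of three
sim <- matrix(0.70, 6, 6)
sim[1:3, 1:3] <- 0.95
sim[4:6, 4:6] <- 0.95
diag(sim) <- 1
rownames(sim) <- colnames(sim) <- paste0("s", 1:6)

# Convert similarities to dissimilarities and build the tree
hc <- hclust(as.dist(1 - sim), method = "average")

# Cut the tree into a fixed number of groups; k is chosen by hand here,
# not by a permutation score
grp <- cutree(hc, k = 2)
table(grp)
```

As with the HapMap data, individuals that are more similar to one another than to the rest end up in the same group; plot(hc) draws the corresponding dendrogram.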
The extended GDS format is implemented in the SeqArray package to support the storage of single nucleotide variation (SNV), insertion/deletion polymorphism (indel) and structural variation calls. See: SeqArray R Integration 4.
Function | Description |
---|---|
Data Format: | |
snpgdsBED2GDS | Conversion from PLINK BED to GDS |
snpgdsGEN2GDS | Conversion from Oxford GEN format to GDS |
snpgdsPED2GDS | Conversion from PLINK PED to GDS |
snpgdsVCF2GDS | Reformat VCF file(s) |
Principal Component Analysis: | |
snpgdsPCA | Principal Component Analysis (PCA) |
snpgdsPCACorr | PC-correlated SNPs in PCA |
snpgdsPCASampLoading | Project individuals onto existing principal component axes |
snpgdsPCASNPLoading | SNP loadings in principal component analysis |
snpgdsEIGMIX | Eigen-analysis on SNP genotype data |
snpgdsAdmixProp | Estimate ancestral proportions from the eigen-analysis |
Identity By Descent: | |
snpgdsIBDMLE | Maximum likelihood estimation (MLE) for the Identity-By-Descent (IBD) Analysis |
snpgdsIBDMLELogLik | Log likelihood for MLE method in the Identity-By-Descent (IBD) Analysis |
snpgdsIBDMoM | PLINK method of moment (MoM) for the Identity-By-Descent (IBD) Analysis |
snpgdsIBDKING | KING method of moment for the identity-by-descent (IBD) analysis |
snpgdsGRM | Genetic Relationship Matrix (GRM) for SNP genotype data |
snpgdsFst | F-statistics (fixation index) |
snpgdsIndInb | Individual Inbreeding Coefficients |
snpgdsIndInbCoef | Individual Inbreeding Coefficient |
Clustering: | |
snpgdsIBS | Identity-By-State (IBS) proportion |
snpgdsIBSNum | Identity-By-State (IBS) |
snpgdsDiss | Individual dissimilarity analysis |
snpgdsHCluster | Hierarchical cluster analysis |
snpgdsCutTree | Determine clusters of individuals |
snpgdsDrawTree | Draw a dendrogram |
Linkage Disequilibrium: | |
snpgdsLDMat | Linkage Disequilibrium (LD) analysis |
snpgdsLDpruning | LD-based SNP pruning |
snpgdsApartSelection | SNP pruning with a minimum basepair distance |
… |
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MASS_7.3-61 SNPRelate_1.41.0 gdsfmt_1.43.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.48
## [5] maketools_1.3.1 cachem_1.1.0 knitr_1.48 htmltools_0.5.8.1
## [9] rmarkdown_2.28 buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
## [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.1 highr_0.11
## [17] sys_3.4.3 tools_4.4.1 evaluate_1.0.1 bslib_0.8.0
## [21] yaml_2.3.10 jsonlite_1.8.9 rlang_1.1.4