Pairwise Measures of Ancestry Divergence
It is possible to identify a subset of mutually unrelated individuals
in a sample based solely on pairwise measures of genetic relatedness
(i.e. kinship coefficients). However, in order to obtain accurate
population structure inference for the entire sample, it is important
that the ancestries of all individuals in the sample are represented by
at least one individual in this subset. In order to identify a mutually
unrelated and ancestry representative subset of individuals, PC-AiR also
utilizes measures of ancestry divergence. These measures are calculated
using the KING-robust kinship coefficient estimator (Manichaikul et al.,
2010), which provides systematically negative estimates for unrelated
pairs of individuals with different ancestry. The number of negative
pairwise estimates that an individual has provides information regarding
how different that individual’s ancestry is from the rest of the sample,
which helps to prioritize individuals that should be kept in the
ancestry representative subset.
The relevant output from the KING software is two text files with the
file extensions .kin0 and .kin. The kingToMatrix
function
can be used to extract the kinship coefficients (which we use as
divergence measures) from this output and create a matrix usable by the
*[GENESIS](https://bioconductor.org/packages/3.20/GENESIS)*
functions.
# create matrix of KING estimates
library(GENESIS)
KINGmat <- kingToMatrix(
c(system.file("extdata", "MXL_ASW.kin0", package="GENESIS"),
system.file("extdata", "MXL_ASW.kin", package="GENESIS")),
estimator = "Kinship")
KINGmat[1:5,1:5]
## 5 x 5 Matrix of class "dsyMatrix"
## NA19625 NA19649 NA19650 NA19651 NA19652
## NA19625 0.5000 -0.0761 -0.0656 -0.0497 -0.0486
## NA19649 -0.0761 0.5000 0.2513 -0.0187 -0.0141
## NA19650 -0.0656 0.2513 0.5000 -0.0037 -0.0033
## NA19651 -0.0497 -0.0187 -0.0037 0.5000 0.0112
## NA19652 -0.0486 -0.0141 -0.0033 0.0112 0.5000
The column and row names of the matrix are the individual IDs, and
each off-diagonal entry is the KING-robust estimate for the specified
pair of individuals.
Alternative to running the KING software, the
snpgdsIBDKING
function from the SNPRelate
package can be used to calculate the KING-robust estimates directly from
a GDS file. The ouput of this function contains a matrix of pairwise
estimates, which can be used by the GENESIS
functions.
Running PC-AiR
The PC-AiR algorithm requires pairwise measures of both kinship and
ancestry divergence in order to partition the sample into an “unrelated
subset” and a “related subset.” The kinship coefficient estimates are
used to identify relatives, as only one individual from a set of
relatives can be included in the “unrelated subset,” and the rest must
be assigned to the “related subset.” The ancestry divergence measures
calculated from KING-robust are used to help select which individual
from a set of relatives has the most unique ancestry and should be given
priority for inclusion in the “unrelated subset.”
The KING-robust estimates read in above are always used as measures
of ancestry divergence for unrelated pairs of individuals, but they can
also be used as measures of kinship for relatives (NOTE: they may be
biased measures of kinship for admixed relatives with different
ancestry). The pcair
function performs the PC-AiR
analysis.
We use the GWASTools
package to create the GenotypeData object needed by GENESIS.
library(GWASTools)
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
HapMap_genoData
## An object of class GenotypeData
## | data:
## File: /tmp/RtmpxaBi6X/Rinst22274f2bb047/GENESIS/extdata/HapMap_ASW_MXL_geno.gds (901.8K)
## + [ ] *
## |--+ sample.id { Int32,factor 173 ZIP(40.9%), 283B } *
## |--+ snp.id { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.position { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.chromosome { Int32 20000 ZIP(0.13%), 103B }
## \--+ genotype { Bit2 20000x173, 844.7K } *
## | SNP Annotation:
## NULL
## | Scan Annotation:
## NULL
# run PC-AiR on pruned SNPs
mypcair <- pcair(HapMap_genoData, kinobj = KINGmat, divobj = KINGmat,
snp.include = pruned)
## Principal Component Analysis (PCA) on genotypes:
## Excluding 16,103 SNPs (non-autosomes or non-selection)
## Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 97
## # of SNPs: 3,897
## using 1 thread
## # of principal components: 32
## PCA: the sum of all selected genotypes (0,1,2) = 193415
## CPU capabilities: Double-Precision SSE2
## Sun Dec 29 06:56:16 2024 (internal increment: 42520)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Sun Dec 29 06:56:16 2024 Begin (eigenvalues and eigenvectors)
## Sun Dec 29 06:56:16 2024 Done.
## SNP Loading:
## # of samples: 97
## # of SNPs: 3,897
## using 1 thread
## using the top 32 eigenvectors
## SNP Loading: the sum of all selected genotypes (0,1,2) = 193415
## Sun Dec 29 06:56:16 2024 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s
## Sun Dec 29 06:56:16 2024 Done.
## Sample Loading:
## # of samples: 76
## # of SNPs: 3,897
## using 1 thread
## using the top 32 eigenvectors
## Sample Loading: the sum of all selected genotypes (0,1,2) = 150718
## Sun Dec 29 06:56:16 2024 (internal increment: 65536)
## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 1s
## Sun Dec 29 06:56:17 2024 Done.
genoData
is a GenotypeData
class
object
kinobj
is a matrix of pairwise kinship coefficient
estimates
divobj
is a matrix of pairwise measures of ancestry
divergence (KING-robust estimates)
snp.include
is a vector of snp IDs
If better estimates of kinship coefficients are available, then the
kinobj
input can be replaced with a similar matrix of these
estimates. The divobj
input should always remain as the
KING-robust estimates.
Reference Population Samples
As PCA is an unsupervised method, it is often difficult to identify
what population structure each of the top PCs represents. In order to
provide some understanding of the inferred structure, it is sometimes
recommended to include reference population samples of known ancestry in
the analysis. If the data set contains such reference population
samples, we may want to make sure that these reference population
samples are included in the “unrelated subset.” This can be accomplished
by setting the input unrel.set
equal to a vector,
IDs
, of the individual IDs for the reference population
samples.
mypcair <- pcair(HapMap_genoData, kinobj = KINGmat, divobj = KINGmat,
snp.include = pruned, unrel.set = IDs)
This will force individuals specified with unrel.set
into the “unrelated subset,” move any of their relatives into the
“related subset,” and then perform the PC-AiR partitioning algorithm on
the remaining samples.
Partition a Sample without Running PCA
It may be of interest to partition a sample into an ancestry
representative “unrelated subset” of individuals and a “related subset”
of individuals without performing a PCA. The pcairPartition
function, which is called within the pcair
function,
enables the user to do this.
part <- pcairPartition(kinobj = KINGmat, divobj = KINGmat)
The output contains two vectors that give the individual IDs for the
“unrelated subset” and the “related subset.”
head(part$unrels); head(part$rels)
## [1] "NA19708" "NA19711" "NA19712" "NA19737" "NA19740" "NA19741"
## [1] "NA19686" "NA20346" "NA20345" "NA20347" "NA19664" "NA19677"
As with the pcair
function, the input
unrel.set
can be used to specify certain individuals that
must be part of the “unrelated subset.”
Output from PC-AiR
An object returned from the pcair
function has class
pcair
. A summary
method for an object of class
pcair
is provided to obtain a quick overview of the
results.
## Call:
## .pcair(gdsobj = gdsobj, kinobj = kinobj, divobj = divobj, kin.thresh = kin.thresh,
## div.thresh = div.thresh, unrel.set = unrel.set, sample.include = sample.include,
## snp.include = snp.include, num.cores = num.cores, verbose = verbose)
##
## PCA Method: PC-AiR
##
## Sample Size: 173
## Unrelated Set: 97 Samples
## Related Set: 76 Samples
##
## Kinship Threshold: 0.02209709
## Divergence Threshold: -0.02209709
##
## Principal Components Returned: 32
## Eigenvalues: 2.915 1.74 1.306 1.281 1.277 1.263 1.242 1.239 1.234 1.208 ...
## SNPs Used: 3897
The output provides the total sample size along with the number of
individuals assigned to the unrelated and related subsets, as well as
the threshold values used for determining which pairs of individuals
were related or ancestrally divergent. The eigenvalues for the top PCs
are also shown, which can assist in determining the number of PCs that
reflect structure. The minor allele frequency (MAF) filter used for
excluding SNPs is also specified, along with the total number of SNPs
analyzed after this filtering. Details describing how to modify the
analysis parameters and the available output of the function can be
found with the command help(pcair)
.
Plotting PC-AiR PCs
The GENESIS
package also provides a plot
method for an object of class
pcair
to quickly visualize pairs of PCs. Each point in one
of these PC pairs plots represents a sample individual. These plots help
to visualize population structure in the sample and identify clusters of
individuals with similar ancestry.
# plot top 2 PCs
plot(mypcair)
# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)
The default is to plot PC values as black dots and blue pluses for
individuals in the “unrelated subset” and “related subsets”
respectively. The plotting colors and characters, as well as other
standard plotting parameters, can be changed with the standard input to
the plot
function.