Here we illustrate how to detect data linkage errors with the omicsPrint package using both artificially generated data and experimental data. Several additional vignettes are available that show the use of the package on further experimental data in different settings, i.e., 450k DNA methylation and imputed genotypes.
Here we generate a single vector with 100 randomly drawn integers from the set; 1, 2, 3, representing 100 SNP calls from a single individual. Three additional individuals are generated by randomly swapping a certain fraction of the SNPs. Swapping 5 SNPs will introduce a few mismatches mimicking a situation where the same individual was measured twice (replicate) but with measurement error. Swapping 50% of the SNPs will be similar to the difference in genotypes between parents and offspring. Swapping all SNPs will result in a situation similar to comparing two unrelated individuals.
swap <- function(x, frac=0.05) {
n <- length(x)
k <- floor(n*frac)
x1 <- sample(1:n,k)
x2 <- sample(1:n,k) ##could be overlapping
x[x2] <- x[x1]
x
}
x1 <- 1 + rbinom(100, size=2, prob=1/3)
x2 <- swap(x1, 0.05) ##strongly related e.g. replicate
x3 <- swap(x1, 0.5) ##related e.g. parent off spring
x4 <- swap(x1, 1) ##unrelated
x <- cbind(x1, x2, x3, x4)
Now x
contains the 100 SNPs for the four individuals
using head
we can inspect the first six SNPs.
## x1 x2 x3 x4
## [1,] 1 1 2 2
## [2,] 2 2 2 1
## [3,] 1 1 3 1
## [4,] 1 1 2 1
## [5,] 1 1 2 2
## [6,] 2 2 2 2
allelesharing
algorithmWe use Identity by State (IBS) for the set of SNPs to infer sample relations. See Abecasis et al. (2001), for the description of this approach applied to genetic data. Briefly, between each sample pair, the IBS-vector is calculated, which is a measure of genetic distance between individuals. Next, the vector is summarized by its mean and variance. A mean of 2 and variance of 0 indicates that the samples are identical.
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 100 polymorphic SNPs to determine allele sharing.
## Running `square` IBS algorithm!
The set of SNPs may contain uninformative SNPs, SNPs of bad quality
or even SNPs could be missing. The following pruning steps are
implemented to yield the most informative set of SNPs (thresholds can be
adjusted see ?alleleSharing
):
callRate = 0.95
)coverageRate = 2/3
)alpha = 0
).maf = 0
)Hardy-Weinberg test-statistics is calculated using a χ2-test and Bonferonni multiple testing correction is performed.
## mean var colnames.x colnames.y relation
## 1 2.00 0.0000000 x1 x1 identical
## 2 1.96 0.0589899 x2 x1 unrelated
## 3 1.62 0.3187879 x3 x1 unrelated
## 4 1.40 0.3636364 x4 x1 unrelated
## 5 2.00 0.0000000 x2 x2 identical
## 6 1.60 0.3434343 x3 x2 unrelated
## 7 1.38 0.3793939 x4 x2 unrelated
## 8 2.00 0.0000000 x3 x3 identical
## 9 1.40 0.4242424 x4 x3 unrelated
## 10 2.00 0.0000000 x4 x4 identical
By default no relations are assumed except for the self-self relations.
The output is a data.frame
containing all pairwise
comparisons with the mean and variance of the IBS over the set of SNPs
and the reported sample relationship, including the identifiers.
Since we provided a list of known relations and assume that the majority is correct, we can build a classifier to discover misclassified relations. Linear discriminant analysis is used to generate a confusion-matrix, which is subsequently used to graphically represent the classification boundaries and generate an output file with misclassified sample pairs.
## mean var colnames.x colnames.y relation predicted
## 2 1.96 0.0589899 x2 x1 unrelated identical
There is one misclassified sample, namely the replicate that we
introduced but was not a priori specified as an existing relationship.
The true relationship with between sample x1
and sample
x2
is an identical relation. Furthermore, we see two sample
pairs with mean IBS of 1.96 and variance 0.06 which is an indication
that also these pairs share a considerable number of alleles. If known,
such relationships can be specified prior to analysis.
relations <- expand.grid(idx = colnames(x), idy= colnames(x))
relations$relation_type <- "unrelated"
relations$relation_type[relations$idx == relations$idy] <- "identical"
relations$relation_type[c(2,5)] <- "identical" ##replicate
relations$relation_type[c(3,7,9,10)] <- "parent offspring"
relations
## idx idy relation_type
## 1 x1 x1 identical
## 2 x2 x1 identical
## 3 x3 x1 parent offspring
## 4 x4 x1 unrelated
## 5 x1 x2 identical
## 6 x2 x2 identical
## 7 x3 x2 parent offspring
## 8 x4 x2 unrelated
## 9 x1 x3 parent offspring
## 10 x2 x3 parent offspring
## 11 x3 x3 identical
## 12 x4 x3 unrelated
## 13 x1 x4 unrelated
## 14 x2 x4 unrelated
## 15 x3 x4 unrelated
## 16 x4 x4 identical
Rerun the allelesharing algorithm now provided with the known relations.
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 100 polymorphic SNPs to determine allele sharing.
## Running `square` IBS algorithm!
## mean var colnames.x colnames.y relation
## 1 2.00 0.0000000 x1 x1 identical
## 2 1.96 0.0589899 x2 x1 identical
## 3 1.62 0.3187879 x3 x1 parent offspring
## 4 1.40 0.3636364 x4 x1 unrelated
## 5 2.00 0.0000000 x2 x2 identical
## 6 1.60 0.3434343 x3 x2 parent offspring
## 7 1.38 0.3793939 x4 x2 unrelated
## 8 2.00 0.0000000 x3 x3 identical
## 9 1.40 0.4242424 x4 x3 unrelated
## 10 2.00 0.0000000 x4 x4 identical
## [1] mean var colnames.x colnames.y relation predicted
## <0 rows> (or 0-length row.names)
All misclassified relations were resolved.
The previous example showed how to perform sample relationship verification within a single omics data type. If a second set of SNPs is available obtained from a different omic data type (and the SNPs are partly overlapping), omicsPrint can be used to verify relationships between omics types, e.g. to establish whether two omics data types were indeed generated for the same individual in order to exclude or detect sample mix-ups.
In this artificial example a random subset of 80 SNPs is selected as the set of SNPs from a different omic type. First running without providing the known relations.
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Pruning 80 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 80 polymophic SNPs to determine allele sharing.
## Running `rectangular` IBS algorithm!
Note that pruning is performed on both data types and automatically a
set of overlapping SNPs (80) is used provided that the rownames of
x
and y
are identical (this also holds for
sample relations where the relation identifiers idx
and
idy
should match with the columnames of x
and
y
).
## mean var colnames.x colnames.y relation
## 1 2.000 0.00000000 x1 x1 identical
## 2 1.950 0.07341772 x2 x1 unrelated
## 3 1.625 0.31329114 x3 x1 unrelated
## 4 1.425 0.34873418 x4 x1 unrelated
## 5 1.950 0.07341772 x1 x2 unrelated
## 6 2.000 0.00000000 x2 x2 identical
## 7 1.600 0.34430380 x3 x2 unrelated
## 8 1.400 0.36962025 x4 x2 unrelated
## 9 1.625 0.31329114 x1 x3 unrelated
## 10 1.600 0.34430380 x2 x3 unrelated
## 11 2.000 0.00000000 x3 x3 identical
## 12 1.375 0.41455696 x4 x3 unrelated
## 13 1.425 0.34873418 x1 x4 unrelated
## 14 1.400 0.36962025 x2 x4 unrelated
## 15 1.375 0.41455696 x3 x4 unrelated
## 16 2.000 0.00000000 x4 x4 identical
## mean var colnames.x colnames.y relation predicted
## 2 1.95 0.07341772 x2 x1 unrelated identical
## 5 1.95 0.07341772 x1 x2 unrelated identical
Now running with providing the known relations.
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Pruning 80 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 80 polymophic SNPs to determine allele sharing.
## Running `rectangular` IBS algorithm!
## mean var colnames.x colnames.y relation
## 1 2.000 0.00000000 x1 x1 identical
## 2 1.950 0.07341772 x2 x1 identical
## 3 1.625 0.31329114 x3 x1 parent offspring
## 4 1.425 0.34873418 x4 x1 unrelated
## 5 1.950 0.07341772 x1 x2 identical
## 6 2.000 0.00000000 x2 x2 identical
## 7 1.600 0.34430380 x3 x2 parent offspring
## 8 1.400 0.36962025 x4 x2 unrelated
## 9 1.625 0.31329114 x1 x3 parent offspring
## 10 1.600 0.34430380 x2 x3 parent offspring
## 11 2.000 0.00000000 x3 x3 identical
## 12 1.375 0.41455696 x4 x3 unrelated
## 13 1.425 0.34873418 x1 x4 unrelated
## 14 1.400 0.36962025 x2 x4 unrelated
## 15 1.375 0.41455696 x3 x4 unrelated
## 16 2.000 0.00000000 x4 x4 identical
## [1] mean var colnames.x colnames.y relation predicted
## <0 rows> (or 0-length row.names)
Providing the known, true relationships thus yields no missclassified sample relationships.
SummarizedExperiment
Here we will show how you could varify sample relationships on publicly available DNA methylation data. The dataset used here contains pairs of monozygotic twins. We will extract the beta-value matrix from GEO GSE100940, paper in press.
First we extract the data from GEO using the GEOquery-package.
library(GEOquery)
library(SummarizedExperiment)
file <- tempfile(fileext = ".txt.gz")
download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100940/matrix/GSE100940_series_matrix.txt.gz", file)
gset <- getGEO(filename=file, getGPL=FALSE)
Next we convert the returned object into a SummarizedExperiment:
## class: RangedSummarizedExperiment
## dim: 485577 24
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873
## rowData names(0):
## colnames(24): GSM2696882 GSM2696883 ... GSM2696904 GSM2696905
## colData names(33): title geo_accession ... data_row_count gender.ch1
Sample data can be extracted from the
SummarizedExperiment
-object using the
colData
-function and we can see which pair of twins each
sample belongs to through the source_name_ch1
field. Using
this knowledge we can construct a table of expected relationships:
r <- expand.grid(idx=colnames(se), idy=colnames(se))
r$Xpair <- sapply(strsplit(as.character(colData(se)[r$idx, "source_name_ch1"]),
split = "_"), head, 1)
r$Ypair <- sapply(strsplit(as.character(colData(se)[r$idy, "source_name_ch1"]),
split = "_"), head, 1)
r$relation_type <- "unrelated"
r$relation_type[r$Xpair == r$Ypair] <- "twin"
r$relation_type[r$idx == r$idy] <- "identical"
head(r)
## idx idy Xpair Ypair relation_type
## 1 GSM2696882 GSM2696882 M01 M01 identical
## 2 GSM2696883 GSM2696882 M01 M01 twin
## 3 GSM2696884 GSM2696882 M02 M01 unrelated
## 4 GSM2696885 GSM2696882 M02 M01 unrelated
## 5 GSM2696886 GSM2696882 M03 M01 unrelated
## 6 GSM2696887 GSM2696882 M03 M01 unrelated
Several probes on the array contain SNPs occurring frequently in different populations(Chen et al. 2013; Zhou, Laird, and Shen 2016). We can use these to verify the expected relationships. We have made these data available from inside of this package.
Now we make a selection of CpGs probably affected by polymorphic SNPS in populations from East Asian, as these samples are from South Korea:
data(hm450.manifest.pop.GoNL)
cpgs <- names(hm450.manifest.pop.GoNL[
mcols(hm450.manifest.pop.GoNL)$MASK.snp5.EAS])
se <- se[cpgs,]
Next the beta-values are converted to genotypes using our enhanced K-means algorithm:
## [1] 821 24
## GSM2696882 GSM2696883 GSM2696884 GSM2696885 GSM2696886
## cg09762182 3 3 1 1 3
## cg25282454 3 3 2 2 2
## cg24345856 3 3 2 2 2
## cg13167158 3 3 3 3 1
## cg12213037 2 2 3 3 3
The DNA methylation based genotype calls can be directly supplied to the allelesharing algorithm to perform the intra-omic sample matching:
## Hash relations
## Pruning 821 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 821 polymorphic SNPs to determine allele sharing.
## Running `square` IBS algorithm!
## 25 of 300 (8.33%) ...
## mean var colnames.x colnames.y relation predicted
## 2 1.995128 0.004854282 GSM2696883 GSM2696882 twin identical
## 49 1.997564 0.002433083 GSM2696885 GSM2696884 twin identical
## 92 1.991474 0.008463801 GSM2696887 GSM2696886 twin identical
## 131 1.995128 0.004854282 GSM2696889 GSM2696888 twin identical
## 166 1.995128 0.004854282 GSM2696891 GSM2696890 twin identical
## 197 1.995128 0.004854282 GSM2696893 GSM2696892 twin identical
## 224 1.996346 0.003645168 GSM2696895 GSM2696894 twin identical
## 247 2.000000 0.000000000 GSM2696897 GSM2696896 twin identical
## 266 1.995128 0.004854282 GSM2696899 GSM2696898 twin identical
## 281 1.992692 0.007263599 GSM2696901 GSM2696900 twin identical
## 292 1.993910 0.006060426 GSM2696903 GSM2696902 twin identical
## 299 1.993910 0.006060426 GSM2696905 GSM2696904 twin identical
The twins are predicted as being identical to each other. This is not unexpected as they are monozygotic.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.35.5 GenomicRanges_1.57.2
## [3] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [5] S4Vectors_0.43.2 MatrixGenerics_1.17.1
## [7] matrixStats_1.4.1 GEOquery_2.73.5
## [9] Biobase_2.67.0 BiocGenerics_0.53.0
## [11] BiocStyle_2.35.0 omicsPrint_1.27.0
## [13] MASS_7.3-61
##
## loaded via a namespace (and not attached):
## [1] xfun_0.48 bslib_0.8.0
## [3] lattice_0.22-6 tzdb_0.4.0
## [5] vctrs_0.6.5 tools_4.4.1
## [7] generics_0.1.3 tibble_3.2.1
## [9] fansi_1.0.6 highr_0.11
## [11] R.oo_1.26.0 pkgconfig_2.0.3
## [13] Matrix_1.7-1 data.table_1.16.2
## [15] rentrez_1.2.3 lifecycle_1.0.4
## [17] GenomeInfoDbData_1.2.13 compiler_4.4.1
## [19] statmod_1.5.0 htmltools_0.5.8.1
## [21] sys_3.4.3 buildtools_1.0.0
## [23] sass_0.4.9 yaml_2.3.10
## [25] tidyr_1.3.1 pillar_1.9.0
## [27] crayon_1.5.3 jquerylib_0.1.4
## [29] R.utils_2.12.3 limma_3.61.12
## [31] DelayedArray_0.33.1 cachem_1.1.0
## [33] abind_1.4-8 tidyselect_1.2.1
## [35] digest_0.6.37 purrr_1.0.2
## [37] dplyr_1.1.4 maketools_1.3.1
## [39] fastmap_1.2.0 grid_4.4.1
## [41] cli_3.6.3 SparseArray_1.5.45
## [43] magrittr_2.0.3 S4Arrays_1.5.11
## [45] XML_3.99-0.17 utf8_1.2.4
## [47] withr_3.0.2 readr_2.1.5
## [49] UCSC.utils_1.1.0 RaggedExperiment_1.29.5
## [51] rmarkdown_2.28 XVector_0.45.0
## [53] httr_1.4.7 R.methodsS3_1.8.2
## [55] hms_1.1.3 evaluate_1.0.1
## [57] knitr_1.48 MultiAssayExperiment_1.31.5
## [59] rlang_1.1.4 glue_1.8.0
## [61] xml2_1.3.6 BiocManager_1.30.25
## [63] jsonlite_1.8.9 R6_2.5.1
## [65] zlibbioc_1.51.2