Title: | Cross omic genetic fingerprinting |
---|---|
Description: | omicsPrint provides functionality for cross omic genetic fingerprinting, for example, to verify sample relationships between multiple omics data types, i.e. genomic, transcriptomic and epigenetic (DNA methylation). |
Authors: | Maarten van Iterson [aut], Davy Cats [cre] |
Maintainer: | Davy Cats <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.27.0 |
Built: | 2024-10-30 09:09:55 UTC |
Source: | https://github.com/bioc/omicsPrint |
Run the allele sharing algorithm based on ibs
alleleSharing(x, y = NULL, relations = NULL, idx.col = "idx", idy.col = "idy", rel.col = "relation_type", callRate = 0.95, coverageRate = 2/3, alpha = 0, maf = 0, alignment = FALSE, assayNameX = NULL, assayNameY = NULL, verbose = TRUE)
alleleSharing(x, y = NULL, relations = NULL, idx.col = "idx", idy.col = "idy", rel.col = "relation_type", callRate = 0.95, coverageRate = 2/3, alpha = 0, maf = 0, alignment = FALSE, assayNameX = NULL, assayNameY = NULL, verbose = TRUE)
x , y
|
genotype matrix with row and column names; if this is a SummarizedExperiment or a MultiAssayExperiment assayName must also be specified |
relations |
'data.frame' with relations and their mapping identifiers, provide columns if different from the default and beware identifiers should match with colnames of x and y |
idx.col |
column name containing mapping identifiers |
idy.col |
column name containing mapping identifiers |
rel.col |
column name containing the relations, i.e. identical, parentoffspring, etc. |
callRate |
default 0.95 SNPs that are called in less then the threshold are dropped |
coverageRate |
default 2/3 samples with less then threshold SNPs called are set to NA |
alpha |
significance level for Hardy-Weinberg test default alpha = 0, no filtering, internaly Bonferonni multiple testing will be applied |
maf |
minor allele frequency threshold, variants with lower frequency (default 0 no filtering) will be dropped |
alignment |
default FALSE |
assayNameX |
the name of the assay to be used for x (see x, y) |
assayNameY |
same as assayNameX, but for y; if y is not specified, the assay will be retreived from x |
verbose |
show progress default TRUE |
calculate mean and variance of identity by state between genotypes, coded as 1,2,3, of all sample pairs either give one omic inferred set of SNPs or two from different omic types.
'Strand Alignment' is required if methylation data inferred genotypes are compared with DNA based genotypes, i.e., DNA based genotype is 3 whereas methylation is 1. The strand alignment step will fix this.
Notice that there are two algorithms for calculating allele sharing. One for the case one matrix is provided and one for the case two, x and y, are provided. If one is provided the algorithm takes in account the symmetric relations between pairs i.e. x12 = x21 etc.
To improve the lookup of relations, which can be millions of say 1000 samples are provided, a hash-table is created from the provided data.frame with relations.
data.frame with mean and variance ibs between all pairs
mvaniterson
set.seed(12345) beta <- matrix(runif(100*10, 0,1), nrow=100) beta[1:5, 1:5] colnames(beta) <- paste0("sample", 1:10) genotype <- beta2genotype(beta) genotype[1:5, 1:5] data <- alleleSharing(genotype) head(data)
set.seed(12345) beta <- matrix(runif(100*10, 0,1), nrow=100) beta[1:5, 1:5] colnames(beta) <- paste0("sample", 1:10) genotype <- beta2genotype(beta) genotype[1:5, 1:5] data <- alleleSharing(genotype) head(data)
convert DNA methylation beta-value to inferred genotypes
beta2genotype(betas, na.rm = TRUE, minSep = 0.25, minSize = 5, centers = c(0.2, 0.5, 0.8), assayName = NULL)
beta2genotype(betas, na.rm = TRUE, minSep = 0.25, minSize = 5, centers = c(0.2, 0.5, 0.8), assayName = NULL)
betas |
beta matrix of probes possibly affected SNPs; if this is a SummarizedExperiment or a MultiAssayExperiment assayName must also be specified |
na.rm |
TRUE drop cpg for which no clustering was observed |
minSep |
minimal separation between clusters |
minSize |
size of smallest cluster (in percentage) |
centers |
center of clusters, defaults to 0.2, 0.5, 0.8. |
assayName |
the name of the assay to be used (see betas) |
Using kmeans unsupervised clustering to infer genotypes based on idea's from Leonard Schalkwyk; wateRmelon packages.
'minSep' and 'minSize' ensure good clusters are found. This function is similar to the gaphunter approach implemented in minfi.
matrix with genotypes
mvaniterson
set.seed(12345) beta <- matrix(runif(100*10, 0,1), nrow=100) beta[1:5, 1:5] genotype <- beta2genotype(beta) genotype[1:5, 1:5]
set.seed(12345) beta <- matrix(runif(100*10, 0,1), nrow=100) beta[1:5, 1:5] genotype <- beta2genotype(beta) genotype[1:5, 1:5]
Dataframe containing all SNPs and short INDELS from GoNLv5 that overlap with 450K probes. This release does not include X and Y chromosomes, so only information for autosomal probes is available. For each overlap there is an unique row. Consequently, some probes are duplicated (probes that overlap with multiple variants) and some variants are duplicated (some variants overlap with more than one probe).
data(hg19.GoNLsnps)
data(hg19.GoNLsnps)
A data frame with 207866 rows and 19 variables:
chromosome, X and Y chromosomes are not available, since they are not included in this GoNL release
probe ID
Infinium probedesign
orientation of the probe
whether the probe measures a CpG-site (cg) or a non-CpG site (ch)
Location of the queried 'C' of the CpG dinucleotide. Note that this is the location of the C that is actually measured. For probes that interrogate the reverse strand (plus-strand probes) this is one base downstream of the C nucleotide on the forward strand
Location of the G nucleotide of the CpG dinucleotide. Note that this is the location of the queried G. For probes that interrogate the reverse strand (plus-strand probes) this is one base upstream of the G nucleotide on the forward strand
SNP ID
Start coordinate of the variant. Identical to snpEnd for SNPs.
End coordinate of the variant. Identical to snpBeg for SNPs
Allele frequency of alternative allele
Reference allele
Alternative allele
Filter information from GoNL.
Minor allele frequency
SNP or INDEL
Distance between SNP and 3'end of the probe. For type I probes the 3'end of the probe coincides with the queried C nucleotide. For type II probes the 3'end of the probe coincides with the G nucleotide directly after the C nucleotide.
Distance from queried C nucleotide. A distance of -1 indicates that the SNPs overlaps the SBE-position for type I probes.
Indicates whether a variant in the SBE-location of type I probes causes a color-channel-switch or overlap with an INDEL. For plus-strand probes C/T, C/A and C/G SNPs are expected to cause a color-channel switch. For min-strand probes A/G, G/T and C/G SNPs are expected to cause a color-channel switch.
http://zwdzwd.github.io/InfiniumAnnotation
https://molgenis26.target.rug.nl/downloads/gonl_public/variants/release5/
data(hg19.GoNLsnps) # Select variants that overlap with queried C nucleotide snps_c <- hg19.GoNLsnps[hg19.GoNLsnps$distance_c == 0, ] # Select all INDELS indels <- hg19.GoNLsnps[hg19.GoNLsnps$variantType == "INDEL",] # Select SNPs that cause a channel-switch channel_switch <- hg19.GoNLsnps[!is.na(hg19.GoNLsnps$channel_switch) & hg19.GoNLsnps$channel_switch == "Yes",]
data(hg19.GoNLsnps) # Select variants that overlap with queried C nucleotide snps_c <- hg19.GoNLsnps[hg19.GoNLsnps$distance_c == 0, ] # Select all INDELS indels <- hg19.GoNLsnps[hg19.GoNLsnps$variantType == "INDEL",] # Select SNPs that cause a channel-switch channel_switch <- hg19.GoNLsnps[!is.na(hg19.GoNLsnps$channel_switch) & hg19.GoNLsnps$channel_switch == "Yes",]
Adapted version of the annotation file provided by Zhou et al. (see source, Mar-13-2017 release). This annotation file contains population-specific probe-masking recommendations based on SNPs within 5 bases from the 3'end of the probe, mapping issues, non-unique 3' 30bp subsequence and channel-switching SNPs in the single-base-extension for type I probes. We added population-specific masking recommendations for the Dutch population using GoNL release 5. This release does not include X and Y chromosomes, so for the Dutch population, only masking information for the autosomal probes is available.
data(hm450.manifest.pop.GoNL)
data(hm450.manifest.pop.GoNL)
A GRanges object with 485577 ranges and 65 metadata columns:
Recommended general purpose masking merged from "MASK.sub30.copy", "MASK.mapping" (in either the hg38 or hg19 genome), "MASK.extBase", "MASK.typeINextBaseSwitch" and "MASK.snp5.pop" from the "hm450.manifest" file and the "hm450.manifest.pop" file (see source).For GoNL, "MASK.typeINextBaseSwitchandINDEL.GoNL" is used instead of "MASK.typeINextBaseSwitch"
Whether the 5bp 3'-subsequence (including extension for type II) overlap with a SNP with population-specific AF > 0.01
SNPs (that cause a color-channel switch) and INDELS with AF > 0.01 in GoNL. In contrast, "MASK.typeINextBaseSwitch" column is based on all SNPs in 1000 genomes and dbSNP, regardless of population or allele frequency
Note: Zhou et al. identified several probes that match to a different location than annotated in the original Illumina manifest file. The authors have used the 'updated' location in their annotation file. Therefore, a handful of probes in this annotation file are annotated to a different location than the original Illumina manifest file. For the identification of the overlaps we used the locations as annotated in the original Illumina file. Therefore, a few of the identified overlaps do not match the locations as specified in this annotation file. This also explains why GoNL masking information is available for a couple of probes that are located in the X- and Y-chromosome in this annotation: these probes map to autosomal probes in the original Illumina file. All probes that map to a different location than originally annotated are recommended to be masked (in the MASK.mapping and MASK.general column), so generally they won't be included in further analyses.
http://zwdzwd.github.io/InfiniumAnnotation
Zhou W, Laird PW and Shen H: Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes. Nucleic Acids Research 2016
# Select probes that should be masked in Dutch population # (note that X and Y chromosomes are not included) hm450.manifest.pop.GoNL <- hm450.manifest.pop.GoNL[!is.na( hm450.manifest.pop.GoNL$MASK.general.GoNL) & hm450.manifest.pop.GoNL$MASK.general.GoNL == TRUE, ] # Select probes that should be masked in Dutch population because there is # a SNP within 5 bases of the 3'end of the probe # (note that X and Y chromosomes are not included) hm450.manifest.pop.GoNL <- hm450.manifest.pop.GoNL[!is.na( hm450.manifest.pop.GoNL$MASK.snp5.GoNL) & hm450.manifest.pop.GoNL$MASK.snp5.GoNL == TRUE, ] # When studying a Dutch population and one wants to include X and Y # chromosomal probes, the EUR or CEU population can be used. # Select probes that should be masked in European population # (these include X and Y chromosomes) hm450.manifest.pop.GoNL <- hm450.manifest.pop.GoNL[ hm450.manifest.pop.GoNL$MASK.general.EUR == TRUE,]
# Select probes that should be masked in Dutch population # (note that X and Y chromosomes are not included) hm450.manifest.pop.GoNL <- hm450.manifest.pop.GoNL[!is.na( hm450.manifest.pop.GoNL$MASK.general.GoNL) & hm450.manifest.pop.GoNL$MASK.general.GoNL == TRUE, ] # Select probes that should be masked in Dutch population because there is # a SNP within 5 bases of the 3'end of the probe # (note that X and Y chromosomes are not included) hm450.manifest.pop.GoNL <- hm450.manifest.pop.GoNL[!is.na( hm450.manifest.pop.GoNL$MASK.snp5.GoNL) & hm450.manifest.pop.GoNL$MASK.snp5.GoNL == TRUE, ] # When studying a Dutch population and one wants to include X and Y # chromosomal probes, the EUR or CEU population can be used. # Select probes that should be masked in European population # (these include X and Y chromosomes) hm450.manifest.pop.GoNL <- hm450.manifest.pop.GoNL[ hm450.manifest.pop.GoNL$MASK.general.EUR == TRUE,]
predict mismatches
inferRelations(data, n = 100, plot.it = TRUE, verbose = FALSE, ...)
inferRelations(data, n = 100, plot.it = TRUE, verbose = FALSE, ...)
data |
output from allelesharing |
n |
= 100 default interpolation for showing the classification boundaries |
plot.it |
= TRUE default plot classification graph and returing mismatches otherwise return all |
verbose |
default FALSE, if TRUE show confusion matrix |
... |
optional plotting argument passed to plot |
based on all data a classifier is build using Linear Discriminant Analysis and on the same data a prediction is performed in order to detect wrong sample relationships. The assumption is that the majority of sample relations is correct otherwise we could not do this!
predicted mismatches
mvaniterson
set.seed(12345) beta <- matrix(runif(100*10, 0,1), nrow=100) beta[1:5, 1:5] colnames(beta) <- paste0("sample", 1:10) genotype <- beta2genotype(beta) genotype[1:5, 1:5] data <- alleleSharing(genotype) head(data) inferRelations(data)
set.seed(12345) beta <- matrix(runif(100*10, 0,1), nrow=100) beta[1:5, 1:5] colnames(beta) <- paste0("sample", 1:10) genotype <- beta2genotype(beta) genotype[1:5, 1:5] data <- alleleSharing(genotype) head(data) inferRelations(data)