Title: | Classification of Helicobacter Pylori Genomes |
---|---|
Description: | To classify Helicobacter pylori genomes according to genetic distance from nine reference populations. The nine reference populations are hpgpAfrica, hpgpAfrica-distant, hpgpAfroamerica, hpgpEuroamerica, hpgpMediterranea, hpgpEurope, hpgpEurasia, hpgpAsia, and hpgpAklavik86-like. The vertex populations are Africa, Europe and Asia. |
Authors: | William Wheeler [aut, cre], Difei Wang [aut], Isaac Zhao [aut], Yumi Jin [aut], Charles Rabkin [aut] |
Maintainer: | William Wheeler <[email protected]> |
License: | GPL-2 |
Version: | 1.3.0 |
Built: | 2025-01-16 06:11:05 UTC |
Source: | https://github.com/bioc/GrafGen |
To classify H. pylori genomes according to genetic distance from nine reference populations.
This package was modified from the GrafPop software
(https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi)
to be applied to H. pylori genomes. The three vertex populations are
"Africa", "Europe" and "Asia".
The nine reference populations are "hpgpAfrica", "hpgpAfrica-distant",
"hpgpAfroamerica", "hpgpEuroamerica", "hpgpMediterranea", "hpgpEurope",
"hpgpEurasia", "hpgpAsia", and "hpgpAklavik86-like".
The training data is based on The Helicobacter pylori Genome
Project (HpGP), see
https://www.ncbi.nlm.nih.gov/bioproject/?term=HpGP or
https://zenodo.org/records/10048320.
To use this package, the user must have a file of genotypes for
H. pylori strains. The genotype file can be a binary PLINK file
in SNP-major format, or a VCF file of genotypes. If it is a PLINK file,
then the corresponding bim and fam files must also be present.
If it is a VCF file, then the genotype format should be declared as
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">.
Ideally, the genotype file will contain all the SNPs with positions given
in the HpyloriData data object, where the positions are based on the
reference genome 26695 (NCBI GenBank Accession NC_000915.1). However, the
software has been shown to work well using only a much smaller
fraction of the SNPs in HpyloriData. The main function
in this package is grafGen.
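The file requirements above can be checked before running an analysis. The following is a minimal sketch in base R; check_geno_file is a hypothetical helper, not a function provided by GrafGen, and it only tests file names and existence, not file contents.

```r
# Hypothetical helper (not part of GrafGen): confirm that a genotype file
# has a supported extension and, for PLINK input, that the companion
# .bim and .fam files sit next to the .bed file.
check_geno_file <- function(path) {
  if (grepl("\\.bed$", path)) {
    stem <- sub("\\.bed$", "", path)
    needed <- paste0(stem, c(".bed", ".bim", ".fam"))
  } else if (grepl("\\.vcf(\\.gz)?$", path)) {
    needed <- path
  } else {
    stop("Expected a .bed, .vcf or .vcf.gz file: ", path)
  }
  absent <- needed[!file.exists(needed)]
  if (length(absent) > 0) {
    stop("Missing file(s): ", paste(absent, collapse = ", "))
  }
  invisible(path)
}

# Demonstration with empty placeholder files in a temporary directory
demo_dir <- tempfile("geno")
dir.create(demo_dir)
bed <- file.path(demo_dir, "strains.bed")
file.create(file.path(demo_dir, c("strains.bed", "strains.bim", "strains.fam")))
check_geno_file(bed)
```

The helper returns the path invisibly, so its result could be passed directly to the analysis function once the check succeeds.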
William Wheeler, Difei Wang, Isaac Zhao, Yumi Jin, Charles Rabkin
Jin Y, Schaffer AA, Feolo M, Holmes JB and Kattman BL (2019). GRAF-pop: A Fast Distance-based Method to Infer Subject Ancestry from Multiple Genotype Datasets without Principal Components Analysis. G3: Genes | Genomes | Genetics. DOI: 10.1534/g3.118.200925.
Thorell K, Munoz-Ramirez ZY, Wang D, Sandoval-Motta S, Boscolo Agostini R, Ghirotto S, Torres RC, HpGP Research Network, Falush D, Camargo MC and Rabkin CS (2023). New insights into Helicobacter pylori population structure from analysis of a worldwide collection of complete genomes: the H. pylori genome project. Nature Communications. DOI: 10.1038/s41467-023-43562-y.
To return an R Shiny app for the user's data.
createApp(obj, metadata=NULL, id=NULL)
obj |
Return object from |
metadata |
NULL or data frame containing meta data for the plot. This data frame must contain an id variable. |
id |
Name of the id column in |
This R function returns an R Shiny app that can be launched by
calling runApp. The app allows the user
to view and filter the plot using up to two variables.
A list containing an R Shiny app and data frames needed to run the app.
library(GrafGen)
data(grafGen_example_results, package="GrafGen")
data(example_metadata, package="GrafGen")
tmp <- createApp(grafGen_example_results, metadata=example_metadata, id="Sample")
reference_results <- tmp$reference_results
user_results <- tmp$user_results
user_metadata <- tmp$user_metadata
if (interactive()) {
  shiny::runApp(tmp$app)
}
A data frame containing metadata used in examples.
The data frame contains the sample id,
type (i.e. source country),
and country abbreviation for the 206 genomes in
grafGen_example_results.
A data frame
data(example_metadata, package="GrafGen")
# Display a few rows
example_metadata[seq_len(5), ]
To determine the ancestry of H. pylori strains.
grafGen(genoFile, print=1)
genoFile |
The complete path to the input genotype file.
This file can only be a PLINK binary file (.bed)
or a VCF file (.vcf, .vcf.gz). If it is a .bed file,
then the corresponding .bim and .fam files
must also exist.
If a VCF file, then the format should be genotypes: |
print |
0 or 1 to print information as the program runs. |
See the references for complete details of the algorithm.
This function is more efficient if the input genotype file only
contains the set (or subset) of SNPs defined in HpyloriData.
The SNPs can be extracted by utilizing the VCFtools software if
the genotype file is a VCF file. For a binary PLINK file, the PLINK
software can be used to extract the SNPs.
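As an illustration of the VCF case, the sketch below writes a tab-separated list of chromosome and position pairs in base R, which is the input layout accepted by the VCFtools --positions option. The positions, the chromosome label and the file names are hypothetical placeholders; in practice the positions would come from HpyloriData, and the chromosome label must match the CHROM field of the user's VCF file.

```r
# Hypothetical SNP positions (illustrative values only). In a real analysis
# these would be the positions stored in HpyloriData.
snps <- data.frame(
  chrom = "NC_000915.1",   # assumed to match the CHROM column of the VCF
  pos   = c(120L, 345L, 678L)
)

# Write one "chromosome<TAB>position" pair per line, with no header
pos_file <- file.path(tempdir(), "snp_positions.txt")
write.table(snps, pos_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# The file can then be passed to VCFtools from a shell, for example:
#   vcftools --gzvcf input.vcf.gz --positions snp_positions.txt \
#            --recode --out subset
# where input.vcf.gz and subset are placeholder names.
readLines(pos_file)
```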
A list of class "grafpop" containing a data frame (table) that includes:
the ancestry percents (F_percent, E_percent, A_percent) for
African, European and Asian ancestry respectively;
the normalized genetic distance scores (GD1_x, GD2_y, GD3_z);
the predicted reference population (Refpop);
the next nearest reference population (Nearest_neighbor);
the separation to the next nearest reference population
(Separation_percent), defined as 100*abs(d1 - d2)/d1,
where d1 and d2 are the genetic distances to the
sample's assigned reference population and next nearest reference
population respectively; and
the genetic distances to each reference population
(hpgpAfrica, hpgpAfrica-distant, hpgpAfroamerica, hpgpEuroamerica,
hpgpMediterranea, hpgpEurope, hpgpEurasia, hpgpAsia, and
hpgpAklavik86-like) as defined by equation 2 in Jin et al. (2019).
The returned object also includes the list vertex, which
gives the x-y coordinates of the vertex populations.
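To make the separation formula concrete, here is a small worked example in base R. The distance values are invented for illustration and are not GrafGen output, and the sketch assumes that the assigned reference population is the one at the smallest genetic distance.

```r
# Hypothetical genetic distances from one sample to the nine reference
# populations (illustrative values only, not GrafGen output).
dists <- c(
  "hpgpAfrica"         = 0.80,
  "hpgpAfrica-distant" = 0.95,
  "hpgpAfroamerica"    = 0.70,
  "hpgpEuroamerica"    = 0.40,
  "hpgpMediterranea"   = 0.25,
  "hpgpEurope"         = 0.20,
  "hpgpEurasia"        = 0.35,
  "hpgpAsia"           = 0.60,
  "hpgpAklavik86-like" = 0.65
)

# Assumption: the assigned population is the one at the smallest distance
ord <- order(dists)
refpop <- names(dists)[ord[1]]
nearest_neighbor <- names(dists)[ord[2]]

# d1: distance to the assigned population; d2: distance to the next nearest
d1 <- dists[[ord[1]]]
d2 <- dists[[ord[2]]]
separation_percent <- 100 * abs(d1 - d2) / d1

cat(refpop, nearest_neighbor, round(separation_percent, 1), "\n")
```

With these values the sample is assigned to hpgpEurope, the next nearest population is hpgpMediterranea, and the separation is about 25 percent. A small separation would indicate that the two closest reference populations are nearly equidistant from the sample.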
dir <- system.file("extdata", package="GrafGen", mustWork=TRUE)
file <- file.path(dir, "data.vcf.gz")
grafGen(file)
The returned object from grafGen in the
analysis of a subset of the reference data.
An object of class "grafpop" containing the
grafGen results for a subset
of 206 genomes and 35528 SNPs in the reference data.
This subset of the reference data is included in the package
(/extdata/data.vcf.gz).
An object of class "grafpop".
data(grafGen_example_results, package="GrafGen")
grafGen_example_results
A data frame of the reference data results used in creating plots.
The data frame contains the results for each of the 1011 genomes in the reference data used in training the model along with some additional columns.
A data frame
data(grafGen_reference_dataframe, package="GrafGen")
# Display a few rows
grafGen_reference_dataframe[seq_len(5), ]
The returned object from grafGen in the
analysis of the reference data.
An object of class "grafpop" containing the
grafGen results for each
of the 1011 genomes in the reference data.
The full set of reference data can be found at
https://github.com/wheelerb/GrafGen/tree/reference/data .
An object of class "grafpop".
data(grafGen_reference_results, package="GrafGen")
grafGen_reference_results
Plot results
grafGenPlot(obj, which=1, legend.pos=NULL, ylim=NULL, showRefData=TRUE, jitter=0)
obj |
An object of class "grafpop" returned from
|
which |
A vector of integers in
|
legend.pos |
The position of the legend.
See |
ylim |
NULL or the limits of the y-axis.
See |
showRefData |
TRUE or FALSE to display the 95 percent confidence ellipses for the reference data results. |
jitter |
Numeric value for the amount of jitter to add
for the plot |
The option legend.pos is only available for which = 1-3,
option ylim is only available for which = 4-5,
and option jitter is only available for which = 5.
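The rules above can be summarized as a small lookup. The following base R sketch uses a hypothetical helper, applicable_options, which is not part of GrafGen; it simply encodes the three statements above for a single value of which between 1 and 5.

```r
# Hypothetical helper (not part of GrafGen): list the plot options that
# apply to a given value of `which`, following the rules stated above.
applicable_options <- function(which) {
  stopifnot(length(which) == 1, which %in% 1:5)
  opts <- character(0)
  if (which %in% 1:3) opts <- c(opts, "legend.pos")
  if (which %in% 4:5) opts <- c(opts, "ylim")
  if (which == 5)     opts <- c(opts, "jitter")
  opts
}

# Show the applicable options for each plot type
for (w in 1:5) {
  cat("which =", w, ":", paste(applicable_options(w), collapse = ", "), "\n")
}
```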
NULL
data(grafGen_example_results, package="GrafGen")
grafGenPlot(grafGen_example_results)
SNP positions and allele frequencies for the reference data
A GPos class object containing
the vertex and reference population
allele frequencies for the
set of 143705 SNPs used in the analysis for H. pylori.
The SNPs were created using
26695 (NCBI GenBank Accession NC_000915.1) as the
reference genome.
The set of SNPs was selected using a MAF threshold of 0.01.
The total sample consisted of 1011 H. pylori strains.
An object of class GPos.
# Load data and view the first few rows
data(HpyloriData, package="GrafGen")
HpyloriData
Create an interactive plot of user data
interactivePlot(obj, metadata=NULL, id=NULL, type=NULL, group=NULL)
obj |
Return object from |
metadata |
NULL or data frame containing meta data for the plot. This data frame must contain an id variable. |
id |
Name of the id column in |
type |
Name of the type variable in |
group |
Name of the group variable in |
This plot will show the results of all samples in the user's data.
Hovering over a point in the plot will display three lines of information.
Line 1 contains the group, type and id of that sample.
Line 2 contains the sample's assigned reference population, next
nearest reference population, and separation
to the next nearest reference population defined as
100*abs(d1 - d2)/d1, where d1 and d2 are the
genetic distances to the sample's assigned reference population and
next nearest reference population respectively.
Line 3 contains the percent African, European and Asian ancestry for that sample.
The legend shows the types for all samples, and clicking a
type will add or remove those samples from the plot.
Note that printing the returned object from grafGen
with the command print(obj)
will display the frequency counts for each reference population.
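The three hover lines described above can be pictured with a short base R sketch. The column names follow those documented for the grafGen results table, but the values, the metadata fields, the label wording and the helper hover_text are all hypothetical; the text actually shown by the package may be formatted differently.

```r
# Hypothetical one-row result and metadata (illustrative values only)
res <- data.frame(
  Refpop = "hpgpEurope", Nearest_neighbor = "hpgpMediterranea",
  Separation_percent = 25, F_percent = 5, E_percent = 85, A_percent = 10,
  stringsAsFactors = FALSE
)
meta <- list(group = "GroupA", type = "Country1", id = "Sample1")

# Hypothetical helper: assemble the three lines of hover information
hover_text <- function(res, meta) {
  line1 <- sprintf("Group: %s, Type: %s, Id: %s",
                   meta$group, meta$type, meta$id)
  line2 <- sprintf("Refpop: %s, Nearest: %s, Separation: %.1f%%",
                   res$Refpop, res$Nearest_neighbor, res$Separation_percent)
  line3 <- sprintf("African: %.1f%%, European: %.1f%%, Asian: %.1f%%",
                   res$F_percent, res$E_percent, res$A_percent)
  paste(line1, line2, line3, sep = "\n")
}

txt <- hover_text(res, meta)
cat(txt, "\n")
```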
NULL
if (interactive()) {
  data(grafGen_example_results, package="GrafGen")
  data(example_metadata, package="GrafGen")
  interactivePlot(grafGen_example_results, metadata=example_metadata,
                  id="Sample", type="Country")
}
Create an interactive plot of the reference data
interactiveReferencePlot()
This plot will show the results of all samples in the reference data.
Hovering over a point in the plot will display three lines of information.
Line 1 contains the type (i.e., the source country) and id of that sample.
Line 2 contains the sample's assigned reference population, next
nearest reference population, and separation
to the next nearest reference population defined as
100*abs(d1 - d2)/d1, where d1 and d2 are the
genetic distances to the sample's assigned reference population and
next nearest reference population respectively.
Line 3 contains the percent African, European and Asian ancestry for that sample.
The legend shows the abbreviated names of the source countries for all samples,
and clicking a country will add or remove those samples from the plot.
NULL
if (interactive()) {
  interactiveReferencePlot()
}
Plot or print an object of class "grafpop".
## S3 method for class 'grafpop'
plot(x, legend.pos="right", showRefData=TRUE, ...)
## S3 method for class 'grafpop'
print(x, ...)
x |
An object of class "grafpop" returned from
|
legend.pos |
The position of the legend.
The default is "topleft".
See |
showRefData |
TRUE or FALSE to display the 95 percent confidence ellipses for the reference data results. |
... |
Additional arguments. |
Printing an object of class "grafpop" will display the frequency counts of the predicted reference populations.
NULL
data(grafGen_example_results, package="GrafGen")
obj <- grafGen_example_results
print(obj)
plot(obj)