The GBScleanR package has been developed to conduct error correction on genotype data obtained via NGS-based genotyping methods such as RAD-seq and GBS (Miller et al. 2007; Elshire et al. 2011). It is designed to estimate true genotypes along chromosomes from given allele read counts in the VCF file generated by SNP callers like GATK and TASSEL-GBS (McKenna et al. 2010; Glaubitz et al. 2014). The current implementation supports genotype data of a mapping population derived from two or more diploid parents followed by selfings, sibling crosses, or random crosses. e.g. F2 and 8-way RILs. Our method supposes markers to be biallelic and ordered along chromosomes by mapping reads on a reference genome sequence. To support smooth access to large size genotype data, every input VCF file is first converted to a Genomic Data Structure (GDS) file (Zheng et al. 2012). The current implementation does not allow non-biallelic markers, and those markers in an input VCF file will be automatically removed from a resultant GDS file. GBScleanR also provides functions for data visualization, filtering, and loading/writing a VCF file. Furthermore, the data structure of the GDS file created via this package is compatible with those used in the SNPRelate, GWASTools and GENESIS packages those are designed to handle large variant data and conduct downstream analyses including regression analysis (Zheng et al. 2012; Stephanie M. Gogarten et al. 2012; Stephanie M. Gogarten et al. 2019). In this document, we first walk through the utility functions implemented in GBScleanR to introduce a basic usage. Then, the core function of GBScleanR which estimates true genotypes for error correction will be introduced.
The latest release of GBScleanR
is version 2. The major update in this version is that GBScleanR
supports polyploid populations. GBScleanR
tries to estimate the haplotype phases of founder and offspring samples
based on given read counts. To process your data as of a polyploid
population, set the ploidy
argument in the
loadGDS()
function.
gds <- loadGDS(x = "/path/to/your/gds", ploidy = 4) # For tetraploid
gds <- loadGDS(x = "/path/to/your/gds", ploidy = 6) # For hexaploid
You can use any other functions in the GBScleanR package for your polyploid population in the same way for diploids.
In addition, the setDominantMarkers()
function enables
you to force GBScleanR
to treat some markers as dominant markers. In the iterative parameter
optimization of the genotype estimation step, GBScleanR
calculates the marker-wise reference-read-bias. We can assume that reads
of either reference or alternative allele only can be observed at
dominant markers and the marker-wise reference-read-bias must be 1 or 0
for dominant markers that give only reference or alternative allele
reads, respectively.
As another important change in estGeno() since version 2, the function now requires the node argument to explicitly specify whether raw or filtered read counts should be used for genotype estimation. If setCallFilter() has not been executed, the user-specified node argument will be ignored, and the “raw” node will be used by default.
The data visualization by GBScleanR
was also enhanced by new plot functions: plotReadRatio()
and plotDosage()
. See the “Plot Read
Ratio and Dosage” section for the details.
This package internally uses the following packages.
- ggplot2
- dplyr
- tidyr
- expm
- gdsfmt
- SeqArray
You can install GBScleanR from the Bioconductor repository with the following code.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GBScleanR")
The code below let you install the package from the github repository. The package released in the github usually get frequent update more than that in Bioconductor due to the release schedule.
You may face to the following error message or similar one if you killed the process that was accessing a GDS file.
This error message indicates the corruption of the GDS file and you
cannot access it anymore. In this case, please remake a GDS file using
the gbsrVCF2GDS()
function.
The main class of the GBScleanR
package is GbsrGenotypData
which inherits the
GenotypeData
class in the SeqArray
package. The gbsrGenotypeData
class object has three slots:
sample
, marker
, and scheme
. The
data
slot holds genotype data as a gds.class
object which is defined in the gdsfmt
package while
snpAnnot
and scanAnnot
contain objects storing
annotation information of SNPs and samples, which are the
SnpAnnotationDataFrame
and
ScanAnnotationDataFrame
objects defined in the GWASTools
package. See the vignette of GWASTools
for more detail. GBScleanR
follows the way of GWASTools
in which a unique genotyping instance (genotyped sample) is called
“scan”.
Load the package.
GBScleanR only supports a VCF file as input. As an example data, we use sample genotype data for a biparental F2 population derived from inbred parents.
vcf_fn <- system.file("extdata", "sample.vcf", package = "GBScleanR")
gds_fn <- tempfile("sample", fileext = ".gds")
As mentioned above, the GbsrGenotypeData
class requires
genotype data in the gds.class
object which enable us quick
access to the genotype data without loading the whole data on RAM. At
the beginning of the processing, we need to convert data format of our
genotype data from VCF to GDS. This conversion can be achieved using
gbsrVCF2GDS()
as shown below. A compressed VCF file
(.vcf.gz) is also acceptable.
# `force = TRUE` allow the function to over write the GDS file,
# even if a GDS file exists at `out_fn`.
gbsrVCF2GDS(vcf_fn = vcf_fn, out_fn = gds_fn, force = TRUE, verbose = FALSE)
## [1] "/tmp/RtmpTLJj4v/sample14b13ce30abc.gds"
Once we converted the VCF to the GDS, we can create a
GbsrGenotypeData
instance for our data.
The first time to load a newly produced GDS file will take long time due to data reformatting for quick access.
Getter functions allow you to retrieve basic information of genotype data, e.g. the number of SNPs and samples, chromosome names, physical position of SNPs and alleles.
## [1] 102
## [1] 242
## [1] "1" "1" "1" "1" "1" "1"
## [1] "1" "1" "1" "1" "1" "1"
## [1] 522289 571177 577905 720086 735019 841286
## [1] "A,G" "A,G" "C,T" "G,C" "C,T" "G,T"
## [1] 1 2 3 4 5 6
## [1] "F2_1054" "F2_1055" "F2_1059" "F2_1170" "F2_1075" "F2_1070"
The function getGenotype()
returns genotype call data in
which integer numbers 0, 1, and 2 indicate the number of reference
allele.
The function getRead()
returns read count data as a list
with two elements ref
and alt
containing
reference read counts and alternative read counts, respectively.
countGenotype()
and countRead()
are class
methods of GbsrGenotypeData
and they summarize genotype
counts and read counts per marker and per sample.
These summary statistics can be visualized via plotting functions.
With the values obtained via countGenotype()
, we can plot
histograms of missing rate , heterozygosity, reference allele frequency
as shown below.
With the values obtained via countRead()
, we can plot
histograms of total read depth , allele read depth , reference read
frequency as shown below.
plotGBSR()
and pairsGBSR()
provide other
ways to visualize statistics. plotGBSR()
draws a line plot
of a specified statistics per marker along each chromosome.
pairsGBSR()
give us a two-dimensional scatter plot to
visualize relationship between statistics.
The statistics obtained via countGenotype()
,
countReat()
, and calcReadStats()
are sotred in
the snpAnnot
and scanAnnot
slots. They can be
retrieved using getter functions as follows.
## [1] 35 38 33 41 42 47
## [1] 172 76 76 118 28 89
## [1] 48 33 45 14 12 25
## [1] 35 79 26 72 179 97
## [1] 19 28 24 23 29 25
## [1] 25 46 122 46 20 39
## [1] 0 3 0 24 19 5
## [1] 10 41 18 6 15 17
## [1] 118 109 111 96 96 119
## [1] 379 231 178 308 235 275
## [1] 86 89 93 60 70 75
## [1] 85 171 270 164 219 175
## [1] 0 6 0 48 38 10
## [1] 20 82 36 12 30 34
## [1] 696 1012 604 182 163 271
## [1] 1227 489 600 1693 1097 859
## [1] 654 1555 537 156 199 428
## [1] 269 575 1516 613 1103 481
# Sum of reference and alternative read counts per marker
head(getCountRead(gds, target = "marker"))
## [1] 1350 2567 1141 338 362 699
# Sum of reference and alternative read counts per sample
head(getCountRead(gds, target = "sample"))
## [1] 1496 1064 2116 2306 2200 1340
## [1] 4034.762 2966.011 4032.764 1213.087 1248.329 1702.554
## [1] 3905.653 3481.716 3993.717 3546.723 2493.182 3237.600
## [1] 3221.486 2783.018 3501.793 1117.116 1469.465 1836.445
## [1] 3047.675 3651.443 3958.266 2507.814 2387.446 2804.338
## [1] 2244.3271 4346.0563 2620.9744 830.4441 818.4920 1080.3064
## [1] 3265.505 2489.965 3467.100 3115.169 2048.550 2443.671
## [1] 1814.5382 4932.9596 2109.4078 739.2224 912.1266 1480.2307
## [1] 2483.399 3011.280 3373.579 2137.899 1875.358 2070.380
## [1] 0.4215686 0.4494949 0.4558824 0.3846154 0.4216867 0.3865979
## [1] 0.1831897 0.4253731 0.3973214 0.3474576 0.4823789 0.3888889
## [1] 86 89 93 60 70 75
## [1] 85 171 178 164 219 175
You can get the proportion of each genotype call with
prop = TRUE
.
## [1] 0.3431373 0.3838384 0.3235294 0.5256410 0.5060241 0.4845361
## [1] 0.4705882 0.3333333 0.4411765 0.1794872 0.1445783 0.2577320
## [1] 0.1862745 0.2828283 0.2352941 0.2948718 0.3493976 0.2577320
## [1] 0.00000000 0.02941176 0.00000000 0.23529412 0.18627451 0.04901961
The proportion of each allele counts.
## [1] 0.5784314 0.5505051 0.5441176 0.6153846 0.5783133 0.6134021
## [1] 0.4215686 0.4494949 0.4558824 0.3846154 0.4216867 0.3865979
## [1] 0.00000000 0.02941176 0.00000000 0.23529412 0.18627451 0.04901961
The proportion of each allele read counts.
## [1] 0.5155556 0.3942345 0.5293602 0.5384615 0.4502762 0.3876967
## [1] 0.4844444 0.6057655 0.4706398 0.4615385 0.5497238 0.6123033
Based on the statistics we obtained, we can filter out less reliable
markers and samples using setMarFilter()
and
setSamFilter()
, respectively.
gds <- setMarFilter(gds, missing = 0.2, het = c(0.1, 0.9), maf = 0.05)
gds <- setSamFilter(gds, missing = 0.8, het = c(0.25, 0.75))
setCallFilter()
is another type of filtering which works
on each genotype call that is a data point at a marker in a sample. We
can replace some genotype calls with missing based on the specified
criteria. If you would like to filter out less reliable genotype calls
that are only supported by less than 5 reads, set the arguments as
below.
If needed to remove genotype calls supported by too many reads, which might be the results of mismapping from repetitive sequences, set as follows.
# Filtering genotype calls based on total read counts
gds <- setCallFilter(gds, dp_qtile = c(0, 0.9))
# Filtering genotype calls based on reference read counts
# and alternative read counts separately.
gds <- setCallFilter(gds, ref_qtile = c(0, 0.9), alt_qtile = c(0, 0.9))
Usually reference reads and alternative reads show different data
distributions. Thus, we can set the different thresholds for them via
dp_qtile
, ref_qtile
, and
alt_qtile
to filter out genotype calls based on quantiles
of total, reference, and alternative read counts in each sample.
Here, the following codes filter out calls supported by less than 5 reads and then filter out markers showing a missing rate of more than 20%.
Based on our study using simulation data and real data for a rice F2
population derived from a cross between distant relatives (cultivar x
wild species), we recommend the setting of
ref_qtile = c(0, 0.9), alt_qtile = c(0, 0.9)
to filter out
markers with over represented reads. If your population contains samples
that have only either of reference or alternative reads at the majority
of markers, filtering with
ref_qtile = c(0, 0.9), alt_qtile = c(0, 0.9)
will set
missing to a large portion of markers for the samples. In that case, it
is better to set dp_qtile = c(0, 0.9)
. In addition, the
error correction by GBScleanR does not require any filtering for markers
based on missing rate, heterozygosity, and allele frequency. Therefore,
setMarFilter()
and setSamFilter()
will be used
only when you have specific markers and samples that should be removed.
In that case, please specify marker IDs and sample IDs to the
id
argument of setMarFilter()
and
setSamFilter()
, respectively.
gds <- setCallFilter(gds, ref_qtile = c(0, 0.9), alt_qtile = c(0, 0.9))
invalid_mar <- getMarID(gds)[1:5]
gds <- setMarFilter(gds, id = invalid_mar)
invalid_sam <- getSamID(gds)[1:3]
gds <- setSamFilter(gds, id = invalid_sam)
In addition to those statistics based filtering functions, GBScleanR
provides a filtering function based on relative marker positions.
Markers locating too close each other usually have redundant
information, especially if those markers are closer each other than the
read length, in which case the markers are supported by completely (or
almost) the same set of reads. To select only one marker from those
markers, we can sue thinMarker()
. This function selects one
marker having the least missing rate from each stretch of the specified
length. If some markers have the least missing rate, select the first
marker in the stretch.
We can obtain the summary statistics using
countGenotype()
, countRead()
, and
calcReadStats()
for only the SNPs and samples retained
after the filtering. Importantly, we need to set
node = "filt
if we have apllied
setCallFiler()
. Otherwise, countGenotype()
used the raw genotype calls.
We can check which markers and samples are retained after the
filtering using validSnp()
and validSam()
.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
The class methods of GbsrGenotypeData
basically work
with only the markers and samples retained after filtering. To use all
markers and samples, please specify valid = FALSE
to the
GbsrGenotypeData
class methods.
## [1] 151
## [1] 242
The error correction algorithm of GBScleanR bases on the HMM assuming observed allele read counts for each SNP marker along a chromosome as the outputs from a sequence of latent true genotypes. Our model supposes that a population of No ≥ 1 sampled offspring was originally derived form the crosses between Nf ≥ 2 parent individuals. The parents can be inbred lines having homozygotes at all markers and outbred lines in which markers show heterozygous genotype.
The update on Mar 7, 2024 added a function to set sample replicate
information to jointly evaluate read counts for replicates in the
genotype estimation by the estGeno()
function. The
estGeno()
function sums up the read counts of replicates
specified by setReplicates()
and estimates genotypes based
on the summed-up read counts. The samples specified as replicates in
setReplicates()
will have the same genotypes at all markers
in the estimated genotypes obtained via estGeno()
. The
setReplicates()
function assumes that replicate information
would be supplied for all samples in the data including parents via the
replicates
argument. In addition, the
setReplicates()
function assumes that the samples that are
assigned same numbers or characters via the replicates
argument are replicates. Therefore, the ordering of samples in the data
and the identifiers in the vector specified to replicates
should match. Replicates can be specified as follows. First, it is
better to confirm the ordering of samples in the data with the setting
“valid = FALSE” to obtain all sample IDs.
## [1] "F2_1054" "F2_1055" "F2_1059" "F2_1170" "F2_1075" "F2_1070"
## [7] "F2_1074" "F2_1007" "F2_1009" "F2_1489" "F2_1010" "F2_1014"
## [13] "F2_1490" "F2_1022" "F2_1267" "F2_1382" "F2_1038" "Founder1"
## [19] "F2_1216" "F2_1571" "F2_1695" "F2_1575" "F2_1584" "F2_1355"
## [25] "F2_1236" "F2_1357" "F2_1470" "F2_1887" "F2_1762" "F2_1884"
## [31] "F2_1763" "F2_1400" "F2_1885" "F2_1643" "F2_1547" "F2_1784"
## [37] "F2_1785" "F2_1314" "F2_1317" "F2_1600" "F2_1848" "F2_1849"
## [43] "F2_1840" "F2_1719" "F2_1736" "F2_1850" "F2_1502" "F2_1868"
## [49] "F2_1740" "F2_1741" "F2_1755" "F2_1756" "F2_1874" "Founder2"
## [55] "F2_1811" "F2_1812" "F2_1700" "F2_1824" "F2_1827" "F2_1710"
## [61] "F2_1714" "F2_1166" "F2_1282" "F2_1178" "F2_1192" "F2_1195"
## [67] "F2_1361" "F2_1122" "F2_1486" "F2_1132" "F2_1378" "F2_1257"
## [73] "F2_1258" "F2_1262" "F2_1384" "F2_1152" "F2_1274" "F2_1150"
## [79] "F2_1688" "F2_1681" "F2_1682" "F2_1684" "F2_1686" "F2_1565"
## [85] "F2_1339" "F2_1333" "F2_1227" "F2_1222" "F2_1592" "F2_1472"
## [91] "F2_1476" "F2_1404" "F2_1527" "F2_1641" "F2_1666" "F2_1541"
## [97] "F2_1677" "F2_1671" "F2_1602" "F2_1618" "F2_1635" "F2_1637"
Here, as an example, we assume that the samples at odd indices in the
sample_id
vector are the replicates of the next samples at
even indices. For example, F2_1054 and F2_1055 are replicates for which
DNA samples were extracted from the same F2 individual although they
have different samples IDs. In this case, you can set replicate
information as shown below.
As another example, if parents in the data are Founder1 and Founder2
and replicates are F2_1054 and F2_1022 for Founder1 and F2_1178 and
F2_1637 for Founder2, you should give a vector to the
replicates
argument like the following.
rep_of_parent1 <- sample_id %in% c("Founder1", "F2_1054", "F2_1022")
rep_of_parent2 <- sample_id %in% c("Founder2", "F2_1178", "F2_1637")
sample_id[rep_of_parent1] <- "Founder1"
sample_id[rep_of_parent2] <- "Founder2"
gds <- setReplicates(gds, replicates = sample_id)
If you set replicates for parents, you should give a sample id of the
replicates as an identifier for a parent in setParents()
as
described in the next section. If you set replicates for parents after
setting parents by setParents()
, the replicates for parents
will be also set as parents with assigning the same member ID for the
replicates of each parent.
To reset the assigned replicate information, please use
setReplicates()
with specifying different values to the
replicates
argument.
As the first step for genotype error correction, we have to specify
which samples are the parents of the population via
setParents()
. In the case of genotype data in the
biparental population, people usually filter out SNPs which are not
monomorphic in each parental sample and not biallelic between parents.
setParents()
automatically do this filtering, if you set
mono = TRUE
and bi = TRUE
.
p1 <- grep("Founder1", getSamID(gds, valid = FALSE), value = TRUE)
p2 <- grep("Founder2", getSamID(gds, valid = FALSE), value = TRUE)
gds <- setParents(gds, parents = c(p1, p2), mono = TRUE, bi = TRUE)
If you set replicates for parents, you should give a sample id of the
replicates as an identifier for a parent in setParents()
.
In the last example in the previous section, we set three replicates for
each parent. To properly set parents, we should specify either of
“Founder1”, “F2_1054”, and “F2_1022” for p1
and either of
“Founder2”, “F2_1178”, “F2_1637” for p2
.
The next step is to visualize statistical summaries of the data. Get genotype data summaries as mentioned in the previous section.
Then, get histograms.
As the histograms showed, the data contains a lot of missing genotype calls with unreasonable heterozygosity in a F2 population. Reference allele frequency shows a bias to reference allele. If you can say your population has no strong segregation distortion in any positions of the genome, you can filter out the markers having too high or too low reference allele frequency.
# filter out markers with reference allele frequency
# less than 5% or more than 95%.
gds <- setMarFilter(gds, maf = 0.05)
However, sometimes filtering based on allele frequency per marker removes all markers from regions truly showing segregation distortion. Although heterozygosity can be a criterion to filter out markers, this will removes too many markers which even contains useful information for genotyping. In addition, as we described in the previous section, the error correction by GBScleanR does not require any filtering for markers based on missing rate, heterozygosity, and allele frequency.
If we found poor quality samples in you dataset based on missing
rate, heterozygosity, and reference allele frequency, we can omit those
samples with setSamFilter()
.
# Filter out samples with more than 90% missing genotype calls,
# less than 5% heterozygosity, and less than 5% minor allele frequency.
gds <- setSamFilter(gds, missing = 0.9, het = c(0.05, 1), maf = 0.05)
Before filtering using setMarFilter()
and
setSamFilter()
, we recomend that you conduct filtering on
each genotype call based on read depth. The error correction via GBScleanR
is robust against low coverage calls, while genotype calls messed up by
mismapping might lead less reliable error correction. Therefore,
filtering for extremely high coverage calls are necessary rather than
that for low coverage ones.
# Filter out genotype calls supported by reads less than 2 reads.
gds <- setCallFilter(gds, dp_count = c(2, Inf))
# Filter out genotype calls supported by reads more than 100.
gds <- setCallFilter(gds, dp_count = c(0, 100))
# Filter out genotype calls based on quantile values
# of read counts at markers in each sample.
gds <- setCallFilter(gds, ref_qtile = c(0, 0.9), alt_qtile = c(0, 0.9))
Since missing genotype calls left in the data basically give no negative effect on genotype error correction. Therefore, you can leave any missing genotype calls. We can, however, remove markers based on missing genotype calls.
# Remove markers having more than 75% of missing genotype calls
gds <- setMarFilter(gds, missing = 0.2)
nmar(gds)
## [1] 200
To check statistical summaries of the filtered genotype data, we need
to set node = "filt
. Otherwise,
countGenotype()
used the raw genotype data.
We can still see the markers showing distortion in allele frequency, while the expected allele frequency is 0.5 in a F2 population. To investigate that those markers having distorted allele frequency were derived from truly distorted regions or just error prone markers, we must check if there are regions where the markers with distorted allele frequency are clustered.
No region seem to have severe distortion. Based on the histogram of reference allele frequency, we can roughly cut off the markers with frequency more than 0.75 or less than 0.25, in other words, less than 0.25 of minor allele frequency. As we mentioned already in the previous section, the error correction by GBScleanR basically works finely without any filtering for markers based on missing rate, heterozygosity, and allele frequency.
## [1] 188
Let’s see the statistics again.
At the end of filtering, check marker density and genotype ratio per marker along chromosomes.
The coord
argument controls the number of rows and
columns of the facets in the plot.
Before executing the function for true genotype estimation, we need
to build a scheme object. The update on May 17, 2024 added a wrapper
function for scheme information preparation. If your population has been
developed in a relatively simple scheme, you can use the
makeScheme()
function.
## [,1]
## [1,] 3
## [2,] 3
## [,1]
## [1,] 3
## [2,] 3
## [,1]
## [1,] 3
## [2,] 3
## [,1]
## [1,] 4
## [2,] 4
## [,1]
## [1,] 5
## [2,] 5
## [,1]
## [1,] 6
## [2,] 6
# For F1 population
gds <- makeScheme(gds, generation = 1) # the crosstype argument will be ignored.
When your population has 2n parents specified by
setParents()
, makeScheme()
assumes those
parents were crossed in the “funnel” design in which 2n parents are crossed to
obtain 2n/2 F1
hybrids followed by successive intercrossings (pairings) of the hybrids
to combine the genomes of all parents in one family of siblings. The
makeScheme()
function assumes that the parents that were
assigned an odd number member ID (N) in setParents()
had
been crossed with the parent that were assigned an even number (N+1).
For example, if you set parents as shown below. The
makeScheme()
function prepare a scheme information that
indicates the intercrossings of “p1 x p2”, “p3 x p4”, “p5 x p6”, and “p7
x p8” followed by crossing of “p1xp2_F1 x p3xp4_F1” and “p5xp6_F1 x
p7xp8_F1” and then crossing of the two 4-way crossed liens to produce
8-way crossed hybrid lines. If, for example, generation = 5
indicating an F5 generation was specified to makeScheme()
,
the function adds 4 successive selfing or sibling crossings in the
scheme.
# Do not run.
gds <- setParents(gds,
parents = c("p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"))
# Member IDs will be 1, 2, 3, 4, 5, 6, 7, and 8
# for p1, p2, p3, p4, p5, p6, p7, and p8, respectively.
In many cases, makeScheme()
is enough to prepare scheme
information. However, if your population underwent more complicated
crossings, please register scheme information step-by-step as shown
below.
Our sample data is of a biparental F2 population derived from inbred
parents. Therefore, we should run initScheme()
and
addScheme()
as following.
# As always the first step of breeding scheme would be "pairing" cross(es) of
# parents, never be "selfing" and a "sibling" cross,
# the argument `crosstype` in initScheme() was deprecated on the update on April 6, 2023.
# gds <- initScheme(gds, crosstype = "pairing", mating = matrix(1:2, 2))
gds <- initScheme(gds, mating = rbind(1, 2))
gds <- addScheme(gds, crosstype = "selfing")
## [,1]
## [1,] 3
## [2,] 3
The function initScheme()
initializes the scheme object
with information about parents. You need to specify a matrix indicating
combinations of mating
, in which each column shows a pair
of parental samples. For example, if you have only two parents, the
mating
matrix should be
mating = matrix(1:2, nrow = 2, ncol = 1)
or equivalents.
The indices used in the matrix should match with the IDs labeled to
parental samples by setParents()
. To confirm the IDs for
parental samples, run the following code.
## sampleID memberID indexes
## 1 Founder1 1 18
## 2 Founder2 2 54
The created GbsrScheme
object is set in the
scheme
slot of the GbsrGenotypeData
object.
The function addScheme()
adds the information about the
next breeding step of your population. In the case of our example data,
the second step was selfing to produce F2 individuals from the F1
obtained via the first parent crossing.
The codes shown below in the rest of the section “Prepare scheme information” are sample codes assuming some specific situations that are not applicable for the sample data used in this vignette. Therefore, you will get error messages if you run the codes.
If your population was derived from a 4-way or 8-way cross, you need
to add more paring
steps. In the case of 8-way RILs
developed by three pairing crosses followed by five selfing cycles, the
scheme object should be built as following. First we need to initialize
the scheme object with specifying the first mating scheme. The crosstype
argument should be “pairing” and the mating argument should be given as
a matrix in which each pairing combination of parents are shown in each
column. The following case indicates the pairing of parent 1 and 2 as
well as 3 and 4, 5 and 6, and 7 and 8.
# As always the first step of breeding scheme would be "pairing" cross(es) of
# parents, never be "selfing" and a "sibling" cross,
# the argument `crosstype` in initScheme() was deprecated on the update on April 6, 2023.
# gds <- initScheme(gds, crosstype = "pair", mating = cbind(c(1:2), c(3:4), c(5:6), c(7:8)))
# Do not run.
gds <- initScheme(gds, mating = cbind(c(1:2), c(3:4), c(5:6), c(7:8)))
Now the progenies of the crosses above have member ID 9, 10, 11, and 12 for each combination of mating. You can check IDs with showScheme().
Then, add the step to make 4-way crosses.
# Do not run.
gds <- addScheme(gds, crosstype = "pair", mating = cbind(c(9:10), c(11:12)))
# Check IDs.
showScheme(gds)
Add the last generation of the paring step .
# Do not run.
gds <- addScheme(gds, crosstype = "pair", mating = cbind(c(13:14)))
#' # Check IDs.
showScheme(gds)
Now we have the scheme information of a 8-way cross. The follwoing steps add the selfing cycles.
# Inbreeding by five times selfing.
# Do not run.
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
You can set crosstype = "sibling"
or
crosstype = "random"
, if your population was developed
through sibling crosses or random crosses, respectively.
The update on April 4, 2023 introduced new function to GBScleanR. The genotype estimation algorithm in estGeno() supports populations that consist of samples belonging to different pedigree. For example, if you have a population of F1 samples that derived from three different crosses of four parents: Founder1 x Founder2, Founder1 x Founder3, Founder1 x Founder4. You can build a scheme info as following.
# Do not run.
gds <- setParents(object = gds,
parents = c("Founder1", "Founder2", "Founder3", "Founder4"))
gds <- initScheme(object = gds,
mating = cbind(c(1, 2), c(1, 3), c(1,4)))
# The initScheme() function here automatically set 5, 6, and 7 as member ID to
# the progenies of the above maiting (pairing) combinations, respectively.
# Then you have to assign member IDs to your samples to indicate which sample
# belongs to which pedigree.
gds <- assignScheme(object = gds,
id = c(rep(5, 10), rep(6, 15), rep(7, 20)))
The assignScheme() assign member IDs id
to the samples
in order. Please confirm the order of the member IDs in id
and the order of the sample IDs shown by getSamID(gds).
# Do not run.
# Get sample ID
sample_id <- getSamID(object = gds)
# Initialize the id vector
id <- integer(nsam(gds))
# Assume your samples were named with prefixes that indicate which
# sample was derived from which combination of parents.
id[grepl("P1xP2", sample_id)] <- 5
id[grepl("P1xP3", sample_id)] <- 6
id[grepl("P1xP4", sample_id)] <- 7
gds <- assignScheme(object = gds, id = id)
The following codes suppose that you built the scheme object for the
example data that is a biparental F2 population derived from a cross
between inbred parents, not for the 8-way RILs explained above. Now we
can execute genotype estimation for error correction. GBScleanR
estimates error pattern via iterative optimization of parameters for
genotype estimation. We could not guess the best number of iterations,
but our simulation tests showed iter = 4
usually saturates
the improvement of estimation accuracy.
As an important change in estGeno() since version 2, the function now requires the node argument to explicitly specify whether raw or filtered read counts should be used for genotype estimation. If setCallFilter() has not been executed, the user-specified node argument will be ignored, and the “raw” node will be used by default.
If your population derived from outbred parents, please set
het_parents = TRUE
.
# Do nut run
# This is an example to show the case to use "het_parents = TRUE".
gds <- estGeno(gds, het_parent = TRUE, iter = 4)
The larger number of iterations makes running time longer. If you
would like to execute no optimization, set optim = FALSE
or
iter = 1
.
# Following codes do the same.
# Do nut run
# These are examples to show the case to set "iter = 1" or "optim = FALSE".
gds <- estGeno(gds, iter = 1)
gds <- estGeno(gds, optim = FALSE)
All of the results of estimation are stored in the gds file linked to
the GbsrGenotypeData
object. You can obtain the estimated
genotype data via the getGenotype()
function with
node = "cor"
.
GBScleanR also estimates phased parent genotypes and you can access it.
GBScleanR
first internally estimate phased haplotype and then convert them to
genotype calls. If you need the estimated haplotype data, run
getHaplotype()
.
The function gbsrGDS2VCF()
generate a VCF file
containing the estimated genotype data and phased haplotype information.
The estimated haplotypes are indicated in the FORMAT field with the HAP
tag. The parent genotypes correspond to each haplotype are indicated in
the INFO field with the PGT tag. HAP shows the pair of haplotype for
each marker of each sample, while PGT shows the allele of each
haplotype. HAP is indicated by two numbers separated by a pipe symbol
“|”. Each of the numbers takes one of the numbers from 1 to 2N,
where N is the number of parents. PGT is indicated by 2N numbers
separated by a pipe symbol “|”. The first number in PGT represents the
allele of haplotype 1 at the marker, and the rest numbers also show the
alleles of rest haplotypes 2, 3 and 4, if your population is of
biparental diploid samples. As a default, gbsrGDS2VCF()
outputs the estimated genotype data as entries of the CGT tag in the
FORMAT field. With node = "cor"
, you can output a VCF file
in which GT field was replaced with the estimated genotype data obtained
via estGeno()
.
## [1] "/tmp/RtmpTLJj4v/sample_est14b1cb69ebb.vcf.gz"
## [1] "/tmp/RtmpTLJj4v/sample_est14b1cb69ebb.vcf.gz"
The output vcf file contains data like the one shown below.
##fileformat=VCFv4.2
##fileDate=20240909
##source=SeqArray_Format_v1.0
##INFO=<ID=PGT,Number=1,Type=String,Description="Estimated allele of parental haplotype by GBScleanR">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=FAD,Number=2,Type=Integer,Description="Call-filtered read counts generated by GBScleanR">
##FORMAT=<ID=FGT,Number=2,Type=Float,Description="Estimated mismapping rate by GBScleanR">
##FORMAT=<ID=HAP,Number=1,Type=String,Description="Estimated haplotype data by GBScleanR">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT F2_1054 F2_1055 F2_1059 F2_1170 F2_1075 F2_1070 F2_1074 F2_1007 F2_1009 F2_1489 F2_1010 F2_1014 F2_1490 F2_1022 F2_1267 F2_1382 F2_1038 Founder1 F2_1216 F2_1571 F2_1695 F2_1575 F2_1584 F2_1355 F2_1236 F2_1357 F2_1470 F2_1762 F2_1884 F2_1763 F2_1400 F2_1885 F2_1643 F2_1547 F2_1784 F2_1785 F2_1314 F2_1317 F2_1600 F2_1848 F2_1849 F2_1719 F2_1850 F2_1502 F2_1868 F2_1740 F2_1741 F2_1755 F2_1756 F2_1874 Founder2 F2_1811 F2_1812 F2_1700 F2_1824 F2_1827 F2_1710 F2_1714 F2_1166 F2_1282 F2_1178 F2_1192 F2_1195 F2_1361 F2_1122 F2_1486 F2_1132 F2_1378 F2_1257 F2_1258 F2_1262 F2_1384 F2_1152 F2_1274 F2_1150 F2_1688 F2_1681 F2_1682 F2_1684 F2_1686 F2_1565 F2_1339 F2_1333 F2_1222 F2_1592 F2_1472 F2_1476 F2_1404 F2_1527 F2_1641 F2_1666 F2_1541 F2_1677 F2_1671 F2_1602 F2_1618 F2_1635 F2_1637
1 522289 S01_522289 A G 100 PASS PGT=0|0|1|1;ADB=0.520833;MR=0,0 GT:AD:FAD:FGT:HAP:EDS 0/0:10,0:10,0:0/0:1|1:0 1/0:6,6:6,6:0/1:2|1:1 0/0:6,0:6,0:0/0:1|1:0 1/1:0,12:0,0:./.:2|2:2 1/0:12,9:0,0:./.:2|1:1 1/0:3,1:3,1:0/1:2|1:1 0/1:5,5:5,5:0/1:1|2:1 1/0:5,2:5,2:0/1:2|1:1 0/1:4,1:4,1:0/1:1|2:1 1/0:1,4:1,4:0/1:2|1:1 1/1:0,14:0,0:./.:2|2:2 1/0:3,6:3,6:0/1:2|1:1 1/0:6,5:6,5:0/1:2|1:1 1/1:9,6:0,0:./.:2|2:2 1/1:0,14:0,14:1/1:2|2:2 0/0:12,0:12,0:0/0:1|1:0 0/0:1,0:1,0:0/0:1|1:0 0/0:146,0:146,0:0/0:1|1:0 0/0:8,0:8,0:0/0:1|1:0 1/0:4,6:4,6:0/1:2|1:1 0/1:11,3:11,3:0/1:1|2:1 0/1:2,2:2,2:0/1:1|2:1 0/1:11,8:11,8:0/1:1|2:1 0/1:6,6:6,6:0/1:1|2:1 0/1:3,2:3,2:0/1:1|2:1 0/0:16,0:16,0:0/0:1|1:0 0/0:24,0:0,0:./.:1|1:0 1/0:6,10:6,10:0/1:2|1:1 0/0:19,0:19,0:0/0:1|1:0 0/0:7,0:7,0:0/0:1|1:0 0/0:8,0:8,0:0/0:1|1:0 1/1:0,9:0,9:1/1:2|2:2 ./.:3,6:0,0:./.:.|.:0 0/0:19,0:19,0:0/0:1|1:0 0/0:8,0:8,0:0/0:1|1:0 0/0:6,0:6,0:0/0:1|1:0 1/0:7,4:7,4:0/1:2|1:1 0/1:10,7:10,7:0/1:1|2:1 0/1:4,8:4,8:0/1:1|2:1 0/1:6,4:6,4:0/1:1|2:1 1/0:7,1:7,1:0/0:2|1:1 0/1:4,8:4,8:0/1:1|2:1 0/1:7,3:7,3:0/1:1|2:1 0/0:5,0:0,0:./.:1|1:0 0/1:4,18:4,18:0/1:1|2:1 0/0:8,0:8,0:0/0:1|1:0 0/1:4,6:0,0:./.:1|2:1 0/1:5,4:5,4:0/1:1|2:1 0/0:8,0:8,0:0/0:1|1:0 1/1:0,8:0,0:./.:2|2:2 1/1:0,285:0,285:1/1:2|2:2 0/1:3,2:3,2:0/1:1|2:1 1/0:6,3:6,3:0/1:2|1:1 1/0:4,5:4,5:0/1:2|1:1 1/0:10,4:0,0:./.:2|1:1 0/0:12,0:12,0:0/0:1|1:0 0/0:10,0:10,0:0/0:1|1:0 0/0:12,0:0,0:./.:1|1:0 0/0:13,0:0,0:./.:1|1:0 0/1:5,3:5,3:0/1:1|2:1 0/0:8,0:8,0:0/0:1|1:0 0/1:5,1:5,1:0/1:1|2:1 ./.:0,6:0,6:1/1:.|.:0 1/0:2,7:2,7:0/1:2|1:1 1/1:0,7:0,7:1/1:2|2:2 1/0:3,1:3,1:0/1:2|1:1 0/0:8,0:0,0:./.:1|1:0 1/0:4,2:4,2:0/1:2|1:1 0/1:2,1:2,1:0/1:1|2:1 1/1:0,10:0,10:1/1:2|2:2 1/0:5,2:5,2:0/1:2|1:1 0/0:3,0:3,0:0/0:1|1:0 0/0:10,0:10,0:0/0:1|1:0 1/1:0,4:0,4:1/1:2|2:2 0/0:4,0:4,0:0/0:1|1:0 0/0:9,0:9,0:0/0:1|1:0 1/1:0,11:0,11:1/1:2|2:2 0/1:3,2:3,2:0/1:1|2:1 0/1:5,2:5,2:0/1:1|2:1 0/0:4,0:4,0:0/0:1|1:0 1/0:3,3:3,3:0/1:2|1:1 1/1:0,8:0,0:./.:2|2:2 0/0:9,0:9,0:0/0:1|1:0 0/1:4,5:4,5:0/1:1|2:1 0/0:10,0:10,0:0/0:1|1:0 0/0:7,0:7,0:0/0:1|1:0 0/1:1,3:1,3:0/1:1|2:1 0/0:7,0:7,0:0/0:1|1:0 1/1:0,11:0,11:1/1:2|2:2 1/0:0,3:0,3:1/1:2|1:1 1/1:0,11:0,0:./.:2|2:2 0/1:6,5:6,5:0/1:1|2:1 1/0:1,4:1,4:0/1:2|1:1 1/0:4,2:4,2:0/1:2|1:1 1/1:0,4:0,4:1/1:2|2:2 0/0:9,0:9,0:0/0:1|1:0 1/0:4,4:4,4:0/1:2|1:1 1/0:6,6:0,0:./.:2|1:1
In the output vcf, “PGT=0|0|1|1” in the INFO field indicates the
parental samples alleles for haplotype 1, 2, 3, and 4 separated by |.
The numbers 0 and 1 represent reference and alternative alleles that are
A and G at the marker shown above, resectively. FAD and FGT in the
FORMAT field show the allele read counts and genotype call after
filtering by setCallFiler()
. HAP and EDS indicate
descendent haplotypes and haplotype dosage at the given marker. EDS will
be provided only if you specified two parents in
setParents()
and specified het_parent = FALSE
for estGeno()
.
Alternatively, you can also output the genotype data into a CSV file
using gbsrGDS2CSV()
## [1] "/tmp/RtmpTLJj4v/sample_est14b14b2ed5bb.csv"
## [1] "/tmp/RtmpTLJj4v/sample_est14b14b2ed5bb.csv"
You can format the output for the qtl package.
## [1] "/tmp/RtmpTLJj4v/sample_est14b12b96da27.csv"
## [1] "/tmp/RtmpTLJj4v/sample_est14b12b96da27.csv"
When you set format = "qtl"
, the marker positions will
be automatically converted from physical positions (bp) to genetic
distances (cM). The conversion is performed by multiplying the physical
positions by the value set to bp2cm
. The default is
bp2cm = 4e-06
if format = "qtl"
.
Please use reopenGDS()
to open the connection again if
you need.
The ratio of reference and alternative allele reads can be visualized
in dot plots using the plotReadRatio()
function. Every
single call of plotReadRatio()
generates dot plots for all
chromosomes but only for a single sample.
Since the sample data contains only one chromosome,
coord = c(3, 4)
does not change any coordinate of the plot.
If, for example, you sample has 12 chromosomes, the dot plots for the
chromosomes will be shown in 3 rows and 4 columns.
If you need to draw plots for all samples in your dataset, please loop the function over your samples.
If you set node = "filt"
, you can visualize the read
ratio obtained by applying the setCallFilter()
function.
The plotDosage()
function enables you to visualize the
estimated dosage overlaying on the read ratio plot. Since this function
visualize dosages estimated by the estGeno()
function, you
will get an error message if you call this function before running
estGeno()
.
The magenta line indicates the estimated alternative allele dosage par
marker while colored dots shows alternative allele read ratio par
marker. The plotDosage()
function also draw plots for a
single sample in a single call. Please loop the function over samples if
needed.
You can also use the filtered read count data for the read ratio visualization in the dosage plot.
To safely close the connection to the GDS file, use
closeGDS()
.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GBScleanR_2.1.1 SeqArray_1.47.0 gdsfmt_1.43.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.1 sass_0.4.9 generics_0.1.3
## [4] lattice_0.22-6 magrittr_2.0.3 digest_0.6.37
## [7] evaluate_1.0.1 grid_4.4.2 fastmap_1.2.0
## [10] jsonlite_1.8.9 Matrix_1.7-1 GenomeInfoDb_1.43.2
## [13] BiocManager_1.30.25 httr_1.4.7 purrr_1.0.2
## [16] UCSC.utils_1.3.0 scales_1.3.0 Biostrings_2.75.3
## [19] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [22] crayon_1.5.3 expm_1.0-0 XVector_0.47.0
## [25] munsell_0.5.1 withr_3.0.2 cachem_1.1.0
## [28] yaml_2.3.10 tools_4.4.2 parallel_4.4.2
## [31] dplyr_1.1.4 colorspace_2.1-1 ggplot2_3.5.1
## [34] GenomeInfoDbData_1.2.13 BiocGenerics_0.53.3 buildtools_1.0.0
## [37] vctrs_0.6.5 R6_2.5.1 stats4_4.4.2
## [40] lifecycle_1.0.4 zlibbioc_1.52.0 S4Vectors_0.45.2
## [43] IRanges_2.41.2 pkgconfig_2.0.3 pillar_1.10.0
## [46] RcppParallel_5.1.9 bslib_0.8.0 gtable_0.3.6
## [49] glue_1.8.0 Rcpp_1.0.13-1 tidyselect_1.2.1
## [52] tibble_3.2.1 xfun_0.49 GenomicRanges_1.59.1
## [55] sys_3.4.3 knitr_1.49 farver_2.1.2
## [58] htmltools_0.5.8.1 labeling_0.4.3 rmarkdown_2.29
## [61] maketools_1.3.1 compiler_4.4.2