Title: | Phylogenetic, Distance and Other Calculations on VCF and Fasta Files |
---|---|
Description: | Calculate distances, build phylogenetic trees or perform hierarchical clustering between the samples of a VCF or FASTA file. Functions are implemented in Java and called via rJava. Parallel implementation that operates directly on the VCF or FASTA file for fast execution. |
Authors: | Anestis Gkanogiannis [aut, cre] |
Maintainer: | Anestis Gkanogiannis <[email protected]> |
License: | GPL-3 |
Version: | 1.11.0 |
Built: | 2024-11-20 06:23:01 UTC |
Source: | https://github.com/bioc/fastreeR |
Performs Hierarchical Clustering on a distance matrix (i.e. calculated with vcf2dist or fasta2dist)
and generates a phylogenetic tree with agglomerative Neighbor Joining method (complete linkage) (as in dist2tree).
The phylogenetic tree is then pruned with cutreeDynamic to get clusters (as in tree2clusters).
dist2clusters(inputDist, cutHeight = NULL, minClusterSize = 1, extra = TRUE)
inputDist |
Input distances file location
(generated with |
cutHeight |
Define at which height to cut tree. Default automatically defined. |
minClusterSize |
Minimum size of clusters. Default 1. |
extra |
Boolean whether to use extra parameters
for the |
A list of:
a character vector of the generated phylogenetic tree in Newick format;
a character vector of the clusters.
Each row contains data for a cluster, separated by space:
the id of the cluster, the size of the cluster (number of elements) and the names of its elements.
Cluster id 0 contains all the objects not assigned to a cluster (singletons).
Example clusters output :
0 | 3 | Sample1 | Sample2 | Sample3 |
1 | 3 | Sample4 | Sample5 | Sample6 |
2 | 2 | Sample7 | Sample8 | |
3 | 2 | Sample9 | Sample0 | |
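The row layout described above (id, size, then member names, separated by space) can be unpacked with base R. The following is a minimal sketch: the `clusters` vector is hypothetical data shaped like the example output, and `parse_clusters` is an illustrative helper, not part of fastreeR.

```r
# Hypothetical clusters vector, shaped like the example output above.
clusters <- c(
  "0 3 Sample1 Sample2 Sample3",
  "1 3 Sample4 Sample5 Sample6",
  "2 2 Sample7 Sample8",
  "3 2 Sample9 Sample0"
)

# Illustrative helper (not a fastreeR function):
# split each row on whitespace into id, size and member names.
parse_clusters <- function(rows) {
  fields <- strsplit(trimws(rows), "[[:space:]]+")
  members <- lapply(fields, function(f) f[-(1:2)])
  names(members) <- vapply(fields, function(f) f[1], character(1))
  sizes <- vapply(fields, function(f) as.integer(f[2]), integer(1))
  # Sanity check: the declared size should match the number of names.
  stopifnot(all(sizes == lengths(members)))
  members
}

members <- parse_clusters(clusters)
singletons <- members[["0"]]  # objects not assigned to a cluster
```

The result is a list named by cluster id, so `members[["2"]]` gives the samples of cluster 2.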
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.clust <- dist2clusters( inputDist = system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR") )
Performs Hierarchical Clustering on a distance matrix (i.e. calculated with vcf2dist or fasta2dist)
and generates a phylogenetic tree with agglomerative Neighbor Joining method (complete linkage).
dist2tree(inputDist)
inputDist |
Input distances file location
(generated with |
A character vector of the generated phylogenetic tree in Newick format.
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.tree <- dist2tree( inputDist = system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR") )
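To give a feel for agglomerative clustering with complete linkage on a distance matrix, the sketch below uses base R's `hclust` on a toy 3-sample matrix. This is a stand-in for illustration only: it is not the Java implementation that dist2tree calls, it does not produce Newick output, and the matrix values are made up.

```r
# Toy symmetric, zero-diagonal distance matrix (made-up values).
labels <- paste0("Sample", 1:3)
m <- matrix(c(0.0, 0.5, 0.2,
              0.5, 0.0, 0.9,
              0.2, 0.9, 0.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(labels, labels))
d <- as.dist(m)

# Base R agglomerative clustering with complete linkage
# (a stand-in, not fastreeR's own tree builder).
hc <- hclust(d, method = "complete")

# The closest pair (Sample1, Sample3) is merged first; with complete
# linkage the merged group is then as far from Sample2 as its
# farthest member.
hc$merge
hc$height
```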
This function calculates a d2_S type dissimilarity measurement between the n sequences
(which can represent samples) of a FASTA file.
See doi:10.1186/s12859-016-1186-3 for more details.
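d2_S type measures compare sequences through their k-mer counts, which is what the kmer argument controls. The sketch below shows only that first step, counting overlapping k-mers in one sequence with base R. `count_kmers` is an illustrative helper, not a fastreeR function, and the d2_S statistic itself is not computed here.

```r
# Illustrative helper (not part of fastreeR):
# count all overlapping k-mers of a single sequence.
count_kmers <- function(sequence, k = 6) {
  n <- nchar(sequence)
  if (n < k) return(table(character(0)))
  starts <- seq_len(n - k + 1)
  kmers <- substring(sequence, starts, starts + k - 1)
  table(kmers)
}

# A 10-base toy sequence has 10 - 6 + 1 = 5 overlapping 6-mers.
counts <- count_kmers("ACGTACGTAC", k = 6)
counts
```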
fasta2dist( ..., outputFile = NULL, threads = 2, kmer = 6, normalize = FALSE, compress = TRUE )
... |
Input fasta files locations (uncompressed or gzip compressed). |
outputFile |
Output distances file location. |
threads |
Number of java threads to use. |
kmer |
Kmer length to use for analyzing fasta sequences. |
normalize |
Normalize on sequences length. |
compress |
Compress output (adds .gz extension). |
A dist distances object of the calculation.
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.dist <- fasta2dist( inputfile = system.file("extdata", "samples.fasta.gz", package = "fastreeR" ) )
The phylogenetic tree is pruned with cutreeDynamic to get clusters.
tree2clusters( treeStr, treeDistances = NULL, treeLabels = NULL, cutHeight = NULL, minClusterSize = 1, extra = TRUE )
treeStr |
A |
treeDistances |
|
treeLabels |
A |
cutHeight |
Define at which height to cut tree. Default automatically defined. |
minClusterSize |
Minimum size of clusters. Default 1. |
extra |
Boolean whether to use extra parameters
for the |
A character vector of the clusters.
Each row contains data for a cluster, separated by space:
the id of the cluster, the size of the cluster (number of elements) and the names of its elements.
Cluster id 0 contains all the objects not assigned to a cluster (singletons).
Example clusters output :
0 | 3 | Sample1 | Sample2 | Sample3 |
1 | 3 | Sample4 | Sample5 | Sample6 |
2 | 2 | Sample7 | Sample8 | |
3 | 2 | Sample9 | Sample0 | |
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.clust <- tree2clusters( treeStr = dist2tree( inputDist = system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR" ) ) )
Performs Hierarchical Clustering on a distance matrix calculated as in vcf2dist
and generates a phylogenetic tree with agglomerative Neighbor Joining method (complete linkage) (as in dist2tree).
The phylogenetic tree is then pruned with cutreeDynamic to get clusters (as in tree2clusters).
vcf2clusters( inputFile, threads = 2, ignoreMissing = FALSE, onlyHets = FALSE, ignoreHets = FALSE, cutHeight = NULL, minClusterSize = 1, extra = TRUE )
inputFile |
Input vcf file location (uncompressed or gzip compressed). |
threads |
Number of java threads to use. |
ignoreMissing |
Ignore variants with missing data
( |
onlyHets |
Only calculate on variants with heterozygous calls. |
ignoreHets |
Only calculate on variants with homozygous calls. |
cutHeight |
Define at which height to cut tree. Default automatically defined. |
minClusterSize |
Minimum size of clusters. Default 1. |
extra |
Boolean whether to use extra parameters
for the |
Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:
heterozygous variants : 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
homozygous to the reference allele variants : 0/0 or 0|0
homozygous to the first alternate allele variants : 1/1 or 1|1
If there are n samples and m variants, an nxn zero-diagonal symmetric distance matrix is calculated.
The calculated cosine type distance (1-cosine_similarity)/2 is in the range [0,1]
where value 0 means completely identical samples (cosine is 1),
value 0.5 means perpendicular samples (cosine is 0)
and value 1 means completely opposite samples (cosine is -1).
The calculation is performed by a Java backend implementation
that supports multi-core CPU utilization and can be demanding in terms of memory resources.
By default a JVM is launched with a maximum memory allocation of 512 MB.
When this amount is not sufficient, the user needs to reserve additional memory resources,
before loading the package, by updating the value of the java.parameters option.
For example, in order to allocate 4GB of RAM,
the user needs to issue options(java.parameters="-Xmx4g") before library(fastreeR).
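The memory setting described above can be written as a short session preamble. This is a minimal sketch; the `library(fastreeR)` call is commented out so the snippet also runs where the package is not installed.

```r
# Raise the JVM maximum heap size to 4 GB.
# This must happen before the package is loaded.
options(java.parameters = "-Xmx4g")

# Confirm that the option is set as intended.
getOption("java.parameters")

# Load the package afterwards.
# library(fastreeR)
```

If the package has already been loaded in the current session, restarting R before setting the option is the safe route.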
A list of:
a dist distances object;
a character vector of the generated phylogenetic tree in Newick format;
a character vector of the clusters.
Each row contains data for a cluster, separated by space:
the id of the cluster, the size of the cluster (number of elements) and the names of its elements.
Cluster id 0 contains all the objects not assigned to a cluster (singletons).
Example clusters output :
0 | 3 | Sample1 | Sample2 | Sample3 |
1 | 3 | Sample4 | Sample5 | Sample6 |
2 | 2 | Sample7 | Sample8 | |
3 | 2 | Sample9 | Sample0 | |
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.clust <- vcf2clusters( inputFile = system.file("extdata", "samples.vcf.gz", package = "fastreeR" ) )
This function calculates a cosine type dissimilarity measurement between the n samples of a VCF file.
vcf2dist( inputFile, outputFile = NULL, threads = 2, ignoreMissing = FALSE, onlyHets = FALSE, ignoreHets = FALSE, compress = FALSE )
inputFile |
Input vcf file location (uncompressed or gzip compressed). |
outputFile |
Output distances file location. |
threads |
Number of java threads to use. |
ignoreMissing |
Ignore variants with missing data
( |
onlyHets |
Only calculate on variants with heterozygous calls. |
ignoreHets |
Only calculate on variants with homozygous calls. |
compress |
Compress output (adds .gz extension). |
Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:
heterozygous variants : 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
homozygous to the reference allele variants : 0/0 or 0|0
homozygous to the first alternate allele variants : 1/1 or 1|1
If there are n samples and m variants, an nxn zero-diagonal symmetric distance matrix is calculated.
The calculated cosine type distance (1-cosine_similarity)/2 is in the range [0,1]
where value 0 means completely identical samples (cosine is 1),
value 0.5 means perpendicular samples (cosine is 0)
and value 1 means completely opposite samples (cosine is -1).
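The three reference values above follow directly from the formula (1-cosine_similarity)/2. The sketch below checks them on toy numeric vectors in base R. How genotypes are encoded into per-sample vectors is not described here, so the vectors are arbitrary and `cosine_distance` is an illustrative helper, not a fastreeR function.

```r
# Illustrative helper (not part of fastreeR):
# cosine type distance between two numeric vectors.
cosine_distance <- function(x, y) {
  cos_sim <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))
  (1 - cos_sim) / 2
}

a <- c(3, 4)  # arbitrary toy vector

d_identical     <- cosine_distance(a, a)              # cosine is 1
d_perpendicular <- cosine_distance(c(1, 0), c(0, 1))  # cosine is 0
d_opposite      <- cosine_distance(a, -a)             # cosine is -1

c(d_identical, d_perpendicular, d_opposite)
```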
The calculation is performed by a Java backend implementation
that supports multi-core CPU utilization and can be demanding in terms of memory resources.
By default a JVM is launched with a maximum memory allocation of 512 MB.
When this amount is not sufficient, the user needs to reserve additional memory resources,
before loading the package, by updating the value of the java.parameters option.
For example, in order to allocate 4GB of RAM,
the user needs to issue options(java.parameters="-Xmx4g") before library(fastreeR).
Output file, if provided, will contain n+1 lines.
The first line contains the number n of samples and the number m of variants, separated by space.
Each of the subsequent n lines contains n+1 values, separated by space.
The first value of each line is a sample name
and the remaining n values are the calculated distances of this sample to all the samples.
Example output file of the distances of 3 samples
calculated from 1000 variants:
3 1000 | | | |
Sample1 | 0.0 | 0.5 | 0.2 |
Sample2 | 0.5 | 0.0 | 0.9 |
Sample3 | 0.2 | 0.9 | 0.0 |
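The file layout described above can be read back into R with a few lines of base code. The following is a sketch under the stated format (a header line with n and m, then n space-separated rows). `read_dist_file` is an illustrative helper, not a fastreeR function, and the temporary file holds the 3-sample example rather than real output.

```r
# Write the 3-sample example above to a temporary, space-separated file.
path <- tempfile(fileext = ".dist")
writeLines(c(
  "3 1000",
  "Sample1 0.0 0.5 0.2",
  "Sample2 0.5 0.0 0.9",
  "Sample3 0.2 0.9 0.0"
), path)

# Illustrative helper (not part of fastreeR).
read_dist_file <- function(path) {
  lines <- readLines(path)
  # Header line: number of samples n and number of variants m.
  header <- as.integer(strsplit(lines[1], " ")[[1]])
  n <- header[1]
  # Each of the next n lines: sample name followed by n distances.
  fields <- strsplit(lines[seq_len(n) + 1], " ")
  labels <- vapply(fields, function(f) f[1], character(1))
  m <- t(vapply(fields, function(f) as.numeric(f[-1]), numeric(n)))
  dimnames(m) <- list(labels, labels)
  list(n_samples = n, n_variants = header[2], dist = as.dist(m))
}

res <- read_dist_file(path)
res$dist
```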
A dist distances object of the calculation.
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.dist <- vcf2dist( inputFile = system.file("extdata", "samples.vcf.gz", package = "fastreeR" ) )
Only biallelic SNPs are considered. For each sample, the following statistics are calculated :
INDIV : Sample name
N_SITES : Total number of SNPs
N_HET : Number of SNPs with heterozygous call (0/1 or 0|1 or 1/0 or 1|0)
N_ALT : Number of SNPs with alternate homozygous call (1/1 or 1|1)
N_REF : Number of SNPs with reference homozygous call (0/0 or 0|0)
N_MISS : Number of SNPs with missing call (./. or .|.)
P_HET : Percentage of heterozygous calls
P_ALT : Percentage of alternate homozygous calls
P_REF : Percentage of reference homozygous calls
P_MISS : Percentage of missing calls (missing rate)
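The call categories listed above can be illustrated by classifying genotype strings for one sample in base R. This is a sketch only: `classify_gt` is an illustrative helper, not a fastreeR function, the `calls` vector is made up, and the percentage columns are left out because their exact denominator is not stated here.

```r
# Illustrative helper (not part of fastreeR):
# map biallelic genotype strings to the categories listed above.
classify_gt <- function(gt) {
  gt <- gsub("|", "/", gt, fixed = TRUE)  # treat phased and unphased alike
  out <- rep(NA_character_, length(gt))
  out[gt %in% c("0/1", "1/0")] <- "HET"
  out[gt == "1/1"] <- "ALT"
  out[gt == "0/0"] <- "REF"
  out[gt == "./."] <- "MISS"
  factor(out, levels = c("HET", "ALT", "REF", "MISS"))
}

# Made-up genotype calls of a single sample across 8 SNPs.
calls <- c("0/1", "1|0", "1/1", "0|0", "0/0", "./.", "0/0", ".|.")
counts <- table(classify_gt(calls))
counts
```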
vcf2istats(inputFile, outputFile = NULL)
inputFile |
Input vcf file location (uncompressed or gzip compressed). |
outputFile |
Output samples statistics file location. |
A data.frame of sample statistics.
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.istats <- vcf2istats( inputFile = system.file("extdata", "samples.vcf.gz", package = "fastreeR") )
This function calculates a distance matrix between the samples of a VCF file as in vcf2dist
and performs Hierarchical Clustering on this distance matrix as in dist2tree.
A phylogenetic tree is calculated with agglomerative Neighbor Joining method (complete linkage).
vcf2tree( inputFile, threads = 2, ignoreMissing = FALSE, onlyHets = FALSE, ignoreHets = FALSE )
inputFile |
Input vcf file location (uncompressed or gzip compressed). |
threads |
Number of java threads to use. |
ignoreMissing |
Ignore variants with missing data
( |
onlyHets |
Only calculate on variants with heterozygous calls. |
ignoreHets |
Only calculate on variants with homozygous calls. |
Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:
heterozygous variants : 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
homozygous to the reference allele variants : 0/0 or 0|0
homozygous to the first alternate allele variants : 1/1 or 1|1
If there are n samples and m variants, an nxn zero-diagonal symmetric distance matrix is calculated.
The calculated cosine type distance (1-cosine_similarity)/2 is in the range [0,1]
where value 0 means completely identical samples (cosine is 1),
value 0.5 means perpendicular samples (cosine is 0)
and value 1 means completely opposite samples (cosine is -1).
The calculation is performed by a Java backend implementation
that supports multi-core CPU utilization and can be demanding in terms of memory resources.
By default a JVM is launched with a maximum memory allocation of 512 MB.
When this amount is not sufficient, the user needs to reserve additional memory resources,
before loading the package, by updating the value of the java.parameters option.
For example, in order to allocate 4GB of RAM,
the user needs to issue options(java.parameters="-Xmx4g") before library(fastreeR).
A character vector of the generated phylogenetic tree in Newick format.
Anestis Gkanogiannis, [email protected]
Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils
my.tree <- vcf2tree( inputFile = system.file("extdata", "samples.vcf.gz", package = "fastreeR" ) )