Package 'fastreeR' reference manual

Title:	Phylogenetic, Distance and Other Calculations on VCF and Fasta Files
Description:	Calculate distances, build phylogenetic trees or perform hierarchical clustering between the samples of a VCF or FASTA file. Functions are implemented in Java and called via rJava. Parallel implementation that operates directly on the VCF or FASTA file for fast execution.
Authors:	Anestis Gkanogiannis [aut, cre]
Maintainer:	Anestis Gkanogiannis <[email protected]>
License:	GPL-3
Version:	1.11.0
Built:	2025-03-01 07:10:02 UTC
Source:	https://github.com/bioc/fastreeR

Perform Hierarchical Clustering and tree pruning on a distance matrix

Description

Performs Hierarchical Clustering on a distance matrix (i.e. calculated with vcf2dist or fasta2dist) and generates a phylogenetic tree with agglomerative Neighbor Joining method (complete linkage) (as in dist2tree). The phylogenetic tree is then pruned with cutreeDynamic to get clusters (as in tree2clusters).

Usage

dist2clusters(inputDist, cutHeight = NULL, minClusterSize = 1, extra = TRUE)
dist2clusters(inputDist, cutHeight = NULL, minClusterSize = 1, extra = TRUE)

Arguments

`inputDist`	Input distances file location (generated with `vcf2dist` or `fasta2dist`). File can be gzip compressed. Or a `dist` distances object.
`cutHeight`	Define at which height to cut tree. Default automatically defined.
`minClusterSize`	Minimum size of clusters. Default 1.
`extra`	Boolean whether to use extra parameters for the `cutreeDynamic`.

Value

A list of :

character vector of the generated phylogenetic tree in Newick format
character vector of the clusters. Each row contains data for a cluster, separated by space. The id of the cluster, the size of the cluster (number of elements) and the names of its elements, Cluster id 0 contains all the objects not assigned to a cluster (singletons). Example clusters output :

0 3 Sample1 Sample2 Sample3

1 3 Sample4 Sample5 Sample6

2 2 Sample7 Sample8

3 2 Sample9 Sample0

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.clust <- dist2clusters(
    inputDist =
        system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR")
)
my.clust <- dist2clusters(
    inputDist =
        system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR")
)

Generate phylogenetic tree from samples of a distance matrix

Description

Usage

dist2tree(inputDist)
dist2tree(inputDist)

Arguments

inputDist

Input distances file location (generated with vcf2dist or fasta2dist). File can be gzip compressed. Or a dist distances object.

Value

A character vector of the generated phylogenetic tree in Newick format.

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.tree <- dist2tree(
    inputDist =
        system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR")
)
my.tree <- dist2tree(
    inputDist =
        system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR")
)

Calculate distances between sequences of a FASTA file

Description

This function calculates a d2_S type dissimilarity measurement between the n sequences (which can represent samples) of a FASTA file. See doi:10.1186/s12859-016-1186-3 for more details.

Usage

fasta2dist(
  ...,
  outputFile = NULL,
  threads = 2,
  kmer = 6,
  normalize = FALSE,
  compress = TRUE
)
fasta2dist(
  ...,
  outputFile = NULL,
  threads = 2,
  kmer = 6,
  normalize = FALSE,
  compress = TRUE
)

Arguments

`...`	Input fasta files locations (uncompressed or gzip compressed).
`outputFile`	Output distances file location.
`threads`	Number of java threads to use.
`kmer`	Kmer length to use for analyzing fasta sequences.
`normalize`	Normalize on sequences length.
`compress`	Compress output (adds .gz extension).

Value

A dist distances object of the calculation.

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.dist <- fasta2dist(
    inputfile = system.file("extdata", "samples.fasta.gz",
        package = "fastreeR"
    )
)
my.dist <- fasta2dist(
    inputfile = system.file("extdata", "samples.fasta.gz",
        package = "fastreeR"
    )
)

Perform Hierarchical Clustering and tree pruning on a phylogenetic tree

Description

The phylogenetic tree is pruned with cutreeDynamic to get clusters.

Usage

tree2clusters(
  treeStr,
  treeDistances = NULL,
  treeLabels = NULL,
  cutHeight = NULL,
  minClusterSize = 1,
  extra = TRUE
)
tree2clusters(
  treeStr,
  treeDistances = NULL,
  treeLabels = NULL,
  cutHeight = NULL,
  minClusterSize = 1,
  extra = TRUE
)

Arguments

`treeStr`	A `character vector` of a phylogenetic tree in Newick format
`treeDistances`	`numeric matrix` of distances, that were used to generate the tree. If NULL, it will be inferred from tree branch lengths.
`treeLabels`	A `character vector` of tree leaf labels.
`cutHeight`	Define at which height to cut tree. Default automatically defined.
`minClusterSize`	Minimum size of clusters. Default 1.
`extra`	Boolean whether to use extra parameters for the `cutreeDynamic`.

Value

character vector of the clusters. Each row contains data for a cluster, separated by space. The id of the cluster, the size of the cluster (number of elements) and the names of its elements, Cluster id 0 contains all the objects not assigned to a cluster (singletons). Example clusters output :

0 3 Sample1 Sample2 Sample3

1 3 Sample4 Sample5 Sample6

2 2 Sample7 Sample8

3 2 Sample9 Sample0

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.clust <- tree2clusters(
    treeStr = dist2tree(
        inputDist = system.file("extdata", "samples.vcf.dist.gz",
            package = "fastreeR"
        )
    )
)
my.clust <- tree2clusters(
    treeStr = dist2tree(
        inputDist = system.file("extdata", "samples.vcf.dist.gz",
            package = "fastreeR"
        )
    )
)

Perform Hierarchical Clustering and tree pruning on samples of VCF file

Description

Performs Hierarchical Clustering on a distance matrix calculated as in vcf2dist and generates a phylogenetic tree with agglomerative Neighbor Joining method (complete linkage) (as in dist2tree). The phylogenetic tree is then pruned with cutreeDynamic to get clusters (as in tree2clusters).

Usage

vcf2clusters(
  inputFile,
  threads = 2,
  ignoreMissing = FALSE,
  onlyHets = FALSE,
  ignoreHets = FALSE,
  cutHeight = NULL,
  minClusterSize = 1,
  extra = TRUE
)
vcf2clusters(
  inputFile,
  threads = 2,
  ignoreMissing = FALSE,
  onlyHets = FALSE,
  ignoreHets = FALSE,
  cutHeight = NULL,
  minClusterSize = 1,
  extra = TRUE
)

Arguments

`inputFile`	Input vcf file location (uncompressed or gzip compressed).
`threads`	Number of java threads to use.
`ignoreMissing`	Ignore variants with missing data (`./.` or `.\|.`)
`onlyHets`	Only calculate on variants with heterozygous calls.
`ignoreHets`	Only calculate on variants with homozygous calls.
`cutHeight`	Define at which height to cut tree. Default automatically defined.
`minClusterSize`	Minimum size of clusters. Default 1.
`extra`	Boolean whether to use extra parameters for the `cutreeDynamic`.

Details

Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:

heterozygous variants : 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
homozygous to the reference allele variants : 0/0 or 0|0
homozygous to the first alternate allele variants : 1/1 or 1|1

If there are n samples and m variants, an nxn zero-diagonal symmetric distance matrix is calculated. The calculated cosine type distance (1-cosine_similarity)/2 is in the range [0,1] where value 0 means completely identical samples (cosine is 1), value 0.5 means perpendicular samples (cosine is 0) and value 1 means completely opposite samples (cosine is -1).

The calculation is performed by a Java back-end implementation, that supports multi-core CPU utilization and can be demanding in terms of memory resources. By default a JVM is launched with a maximum memory allocation of 512 MB. When this amount is not sufficient, the user needs to reserve additional memory resources, before loading the package, by updating the value of the java.parameters option. For example in order to allocate 4GB of RAM, the user needs to issue options(java.parameters="-Xmx4g") before library(fastreeR).

Value

A list of :

dist distances object.
character vector of the generated phylogenetic tree in Newick format
character vector of the clusters. Each row contains data for a cluster, separated by space. The id of the cluster, the size of the cluster (number of elements) and the names of its elements, Cluster id 0 contains all the objects not assigned to a cluster (singletons). Example clusters output :

0 3 Sample1 Sample2 Sample3

1 3 Sample4 Sample5 Sample6

2 2 Sample7 Sample8

3 2 Sample9 Sample0

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.clust <- vcf2clusters(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)
my.clust <- vcf2clusters(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)

Calculate distances between samples of a VCF file

Description

This function calculates a cosine type dissimilarity measurement between the n samples of a VCF file.

Usage

vcf2dist(
  inputFile,
  outputFile = NULL,
  threads = 2,
  ignoreMissing = FALSE,
  onlyHets = FALSE,
  ignoreHets = FALSE,
  compress = FALSE
)
vcf2dist(
  inputFile,
  outputFile = NULL,
  threads = 2,
  ignoreMissing = FALSE,
  onlyHets = FALSE,
  ignoreHets = FALSE,
  compress = FALSE
)

Arguments

`inputFile`	Input vcf file location (uncompressed or gzip compressed).
`outputFile`	Output distances file location.
`threads`	Number of java threads to use.
`ignoreMissing`	Ignore variants with missing data (`./.` or `.\|.`)
`onlyHets`	Only calculate on variants with heterozygous calls.
`ignoreHets`	Only calculate on variants with homozygous calls.
`compress`	Compress output (adds .gz extension).

Details

Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:

heterozygous variants : 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
homozygous to the reference allele variants : 0/0 or 0|0
homozygous to the first alternate allele variants : 1/1 or 1|1

The calculation is performed by a Java backend implementation, that supports multi-core CPU utilization and can be demanding in terms of memory resources. By default a JVM is launched with a maximum memory allocation of 512 MB. When this amount is not sufficient, the user needs to reserve additional memory resources, before loading the package, by updating the value of the java.parameters option. For example in order to allocate 4GB of RAM, the user needs to issue options(java.parameters="-Xmx4g") before library(fastreeR).

Output file, if provided, will contain n+1 lines. The first line contains the number n of samples and number m of variants, separated by space. Each of the subsequent n lines contains n+1 values, separated by space. The first value of each line is a sample name and the rest n values are the calculated distances of this sample to all the samples. Example output file of the distances of 3 samples calculated from 1000 variants:

3 1000
Sample1	0.0	0.5	0.2
Sample2	0.5	0.0	0.9
Sample3	0.2	0.9	0.0

Value

A dist distances object of the calculation.

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.dist <- vcf2dist(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)
my.dist <- vcf2dist(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)

Calculate various per sample statistics from a VCF file

Description

Only biallelic SNPs are considered. For each sample, the following statistics are calculated :

INDIV : Sample name
N_SITES : Total number of SNPs
N_HET : Number of SNPs with heterozygous call (0/1 or 0|1 or 1/0 or 1|0)
N_ALT : Number of SNPs with alternate homozygous call (1/1 or 1|1)
N_REF : Number of SNPs with reference homozygous call (0/0 or 0|0)
N_MISS : Number of SNPs with missing call (./. or .|.)
P_HET : Percentage of heterozygous calls
P_ALT : Percentage of alternate homozygous calls
P_REF : Percentage of reference homozygous calls
P_MISS : Percentage of missing calls (missing rate)

Usage

vcf2istats(inputFile, outputFile = NULL)
vcf2istats(inputFile, outputFile = NULL)

Arguments

`inputFile`	Input vcf file location (uncompressed or gzip compressed).
`outputFile`	Output samples statistics file location.

Value

A data.frame of sample statistics.

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.istats <- vcf2istats(
    inputFile =
        system.file("extdata", "samples.vcf.gz", package = "fastreeR")
)
my.istats <- vcf2istats(
    inputFile =
        system.file("extdata", "samples.vcf.gz", package = "fastreeR")
)

Generate phylogenetic tree from samples of a VCF file

Description

This function calculates a distance matrix between the samples of a VCF file as in vcf2dist and performs Hierarchical Clustering on this distance matrix as in dist2tree. A phylogenetic tree is calculated with agglomerative Neighbor Joining method (complete linkage).

Usage

vcf2tree(
  inputFile,
  threads = 2,
  ignoreMissing = FALSE,
  onlyHets = FALSE,
  ignoreHets = FALSE
)
vcf2tree(
  inputFile,
  threads = 2,
  ignoreMissing = FALSE,
  onlyHets = FALSE,
  ignoreHets = FALSE
)

Arguments

`inputFile`	Input vcf file location (uncompressed or gzip compressed).
`threads`	Number of java threads to use.
`ignoreMissing`	Ignore variants with missing data (`./.` or `.\|.`)
`onlyHets`	Only calculate on variants with heterozygous calls.
`ignoreHets`	Only calculate on variants with homozygous calls.

Details

Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:

heterozygous variants : 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
homozygous to the reference allele variants : 0/0 or 0|0
homozygous to the first alternate allele variants : 1/1 or 1|1

Value

A character vector of the generated phylogenetic tree in Newick format.

Author(s)

Anestis Gkanogiannis, [email protected]

References

Java implementation: https://github.com/gkanogiannis/BioInfoJava-Utils

Examples

my.tree <- vcf2tree(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)
my.tree <- vcf2tree(
    inputFile = system.file("extdata", "samples.vcf.gz",
        package = "fastreeR"
    )
)

0	3	Sample1	Sample2	Sample3
1	3	Sample4	Sample5	Sample6
2	2	Sample7	Sample8
3	2	Sample9	Sample0

0	3	Sample1	Sample2	Sample3
1	3	Sample4	Sample5	Sample6
2	2	Sample7	Sample8
3	2	Sample9	Sample0

0	3	Sample1	Sample2	Sample3
1	3	Sample4	Sample5	Sample6
2	2	Sample7	Sample8
3	2	Sample9	Sample0

Package 'fastreeR'

Help Index

Perform Hierarchical Clustering and tree pruning on a distance matrix

Description

Usage

Arguments

Value

Author(s)

References

Examples

Generate phylogenetic tree from samples of a distance matrix

Description

Usage

Arguments

Value

Author(s)

References

Examples

Calculate distances between sequences of a FASTA file

Description

Usage

Arguments

Value

Author(s)

References

Examples

Perform Hierarchical Clustering and tree pruning on a phylogenetic tree

Description

Usage

Arguments

Value

Author(s)

References

Examples

Perform Hierarchical Clustering and tree pruning on samples of VCF file

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Calculate distances between samples of a VCF file

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Calculate various per sample statistics from a VCF file

Description

Usage

Arguments

Value

Author(s)

References

Examples

Generate phylogenetic tree from samples of a VCF file

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples