Title: | Inference And Analysis Of Synteny Networks |
---|---|
Description: | syntenet can be used to infer synteny networks from whole-genome protein sequences and analyze them. Anchor pairs are detected with the MCScanX algorithm, which was ported to this package with the Rcpp framework for R and C++ integration. Anchor pairs from synteny analyses are treated as an undirected unweighted graph (i.e., a synteny network), and users can perform: i. network clustering; ii. phylogenomic profiling (by identifying which species contain which clusters) and; iii. microsynteny-based phylogeny reconstruction with maximum likelihood. |
Authors: | Fabrício Almeida-Silva [aut, cre] , Tao Zhao [aut] , Kristian K Ullrich [aut] , Yves Van de Peer [aut] |
Maintainer: | Fabrício Almeida-Silva <[email protected]> |
License: | GPL-3 |
Version: | 1.9.1 |
Built: | 2024-12-19 03:36:30 UTC |
Source: | https://github.com/bioc/syntenet |
Original tree file obtained from Zhao et al., 2021.
The tree is an object of class 'phylo', which can be created by reading
the tree file with treeio::read.tree()
.
data(angiosperm_phylogeny)
data(angiosperm_phylogeny)
An object of class 'phylo'.
Zhao, T., Zwaenepoel, A., Xue, J. Y., Kao, S. M., Li, Z., Schranz, M. E., & Van de Peer, Y. (2021). Whole-genome microsynteny-based phylogeny of angiosperms. Nature Communications, 12(1), 1-14.
data(angiosperm_phylogeny)
data(angiosperm_phylogeny)
Data obtained from Pico-PLAZA 3.0. Only annotation data for primary transcripts were included, and only genes for chromosomes 1, 2, and 3.
data(annotation)
data(annotation)
A CompressedGRangesList containing
the elements Olucimarinus
, Osp_RCC809
, and Otauri
.
Van Bel, M., Silvestri, F., Weitz, E. M., Kreft, L., Botzki, A., Coppens, F., & Vandepoele, K. (2021). PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants. Nucleic acids research.
data(annotation)
data(annotation)
Binarize and transpose the phylogenomic profile matrix
binarize_and_transpose(profile_matrix = NULL)
binarize_and_transpose(profile_matrix = NULL)
profile_matrix |
A matrix with phylogenomic profiles obtained
with |
A binary and transposed version of the profiles matrix.
data(clusters) profile_matrix <- phylogenomic_profile(clusters) tmat <- binarize_and_transpose(profile_matrix)
data(clusters) profile_matrix <- phylogenomic_profile(clusters) tmat <- binarize_and_transpose(profile_matrix)
The object was created by running run_diamond
on the protein
sequences for the Ostreococcus algae available in the proteomes
example data. Hits with <50% identity were filtered out. Code to recreate
this data is available at the script/ subdirectory.
data(blast_list)
data(blast_list)
A list of data frames containing the pairwise comparisons between proteomes of Ostreococcus species.
data(blast_list)
data(blast_list)
Check if input objects are ready for further analyses
check_input(seq = NULL, annotation = NULL, gene_field = "gene_id")
check_input(seq = NULL, annotation = NULL, gene_field = "gene_id")
seq |
A list of AAStringSet objects, each list element containing protein sequences for a given species. This list must have names (not NULL), and names of each list element must match the names of list elements in annotation. |
annotation |
A GRangesList, CompressedGRangesList, or list of GRanges with the annotation for the sequences in seq. This list must have names (not NULL), and names of each list element must match the names of list elements in seq. |
gene_field |
Character, name of the column in the GRanges objects that contains gene IDs. Default: "gene_id". |
This function checks the input data for 3 required conditions:
Names of seq list (i.e., names(seq)
) match
the names of annotation GRangesList/CompressedGRangesList
(i.e., names(annotation)
)
For each species (list elements), the number of sequences in seq is not greater than the number of genes in annotation. This is a way to ensure users do not input the translated sequences for multiple isoforms of the same gene (generated by alternative splicing). Ideally, the number of sequences in seq should be equal to the number of genes in annotation, but this may not always stand true because of non-protein-coding genes.
For each species, sequence names (i.e., names(seq[[x]])
,
equivalent to FASTA headers) match gene names in annotation.
TRUE if the objects pass the check.
data(annotation) data(proteomes) check_input(proteomes, annotation)
data(annotation) data(proteomes) check_input(proteomes, annotation)
Cluster the synteny network using the Infomap algorithm
cluster_network( network = NULL, clust_function = igraph::cluster_infomap, clust_params = NULL )
cluster_network( network = NULL, clust_function = igraph::cluster_infomap, clust_params = NULL )
network |
A network represented as an edge list, which is a 2-column data frame with node 1 in the first column and node 2 in the second column. In a synteny network, node 1 and node are the anchor pairs. |
clust_function |
Function to be used to cluster the network. It must be one the functions from the cluster_* family in the igraph package (e.g., cluster_infomap, cluster_leiden, etc). Default: igraph::cluster_infomap. |
clust_params |
A list with additional parameters (if any) to be passed to the igraph clustering function. Default: NULL (no additional parameters). |
A 2-column data frame with the following variables:
Gene ID.
Cluster ID as identified by infomap.
data(network) clusters <- cluster_network(network[1:500, ])
data(network) clusters <- cluster_network(network[1:500, ])
Data obtained from Zhao & Schranz, 2019.
data(clusters)
data(clusters)
A 2-column data frame containing the following variables:
Gene ID
Cluster ID
Zhao, T., & Schranz, M. E. (2019). Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proceedings of the National Academy of Sciences, 116(6), 2165-2174.
data(clusters)
data(clusters)
This function can be used if the sequence names of the AAStringSet objects contain protein IDs instead of gene IDs (what syntenet requires)
collapse_protein_ids(seq, protein2gene = NULL)
collapse_protein_ids(seq, protein2gene = NULL)
seq |
A list of AAStringSet objects, each list element containing protein sequences for a given species. This list must have names (not NULL), and names of each list element must match the names of list elements in protein2gene. |
protein2gene |
A list of 2-column data frames containing protein-to-gene ID correspondences, where the first column contains protein IDs, and the second column contains gene IDs. Names of list elements must match names of seq. |
For each species, this function will replace the protein IDs in sequence names with gene IDs using the protein-to-gene correspondence table in protein2gene. After replacing protein IDs with gene IDs, if there are multiple sequences with the same gene ID (indicating different isoforms of the same gene), only the longest sequence is kept, so that the number of sequences is not greater than the number of genes.
A list of AAStringSet objects as in seq, but with protein IDs replaced with gene IDs.
# Load data seq_path <- system.file( "extdata", "RefSeq_parsing_example", package = "syntenet" ) seq <- fasta2AAStringSetlist(seq_path) annot <- gff2GRangesList(seq_path) # Clean sequence names names(seq$Aalosa) <- gsub(" .*", "", names(seq$Aalosa)) # Create a correspondence data frame cor_df <- as.data.frame(annot$Aalosa[annot$Aalosa$type == "CDS", ]) cor_df <- cor_df[, c("Name", "gene")] # Create a list of correspondence data frames protein2gene <- list(Aalosa = cor_df) # Collapse IDs new_seqs <- collapse_protein_ids(seq, protein2gene)
# Load data seq_path <- system.file( "extdata", "RefSeq_parsing_example", package = "syntenet" ) seq <- fasta2AAStringSetlist(seq_path) annot <- gff2GRangesList(seq_path) # Clean sequence names names(seq$Aalosa) <- gsub(" .*", "", names(seq$Aalosa)) # Create a correspondence data frame cor_df <- as.data.frame(annot$Aalosa[annot$Aalosa$type == "CDS", ]) cor_df <- cor_df[, c("Name", "gene")] # Create a list of correspondence data frames protein2gene <- list(Aalosa = cor_df) # Collapse IDs new_seqs <- collapse_protein_ids(seq, protein2gene)
Create a data frame of species IDs (3-5-character abbreviations)
create_species_id_table(species_names)
create_species_id_table(species_names)
species_names |
A character vector of names extracted from
the seq or annotation lists, which can be extracted with
|
A 2-column data frame with the following variables:
Character, species ID consisting of 3-5 characters.
Character, original names passed as input.
# Load 'seq' list (list of AAStringSet objects) data(proteomes) # Create ID table create_species_id_table(names(proteomes))
# Load 'seq' list (list of AAStringSet objects) data(proteomes) # Create ID table create_species_id_table(names(proteomes))
Check if DIAMOND is installed
diamond_is_installed()
diamond_is_installed()
Logical indicating whether DIAMOND is installed or not.
diamond_is_installed()
diamond_is_installed()
The object was created by running infer_syntenet
on the
blast_list example data. Code to recreate this data set is
available at the script/ subdirectory.
data(edges)
data(edges)
A data frame containing anchor pairs between two Ostreococcus proteomes.
data(edges)
data(edges)
Export processed sequences as FASTA files
export_sequences(seq = NULL, outdir = tempdir())
export_sequences(seq = NULL, outdir = tempdir())
seq |
A processed list of AAStringSet objects
as returned by |
outdir |
Path to output directory where FASTA files will be stored. |
Path to exported FASTA files.
# Load data data(proteomes) data(annotation) # Process data pdata <- process_input(proteomes, annotation) # Export data outdir <- file.path(tempdir(), "example_test") export_sequences(pdata$seq, outdir)
# Load data data(proteomes) data(annotation) # Process data pdata <- process_input(proteomes, annotation) # Export data outdir <- file.path(tempdir(), "example_test") export_sequences(pdata$seq, outdir)
Read FASTA files in a directory as a list of AAStringSet objects
fasta2AAStringSetlist(fasta_dir)
fasta2AAStringSetlist(fasta_dir)
fasta_dir |
Character indicating the path to the directory containing FASTA files. |
A list of AAStringSet objects, where each element represents a different FASTA file.
fasta_dir <- system.file("extdata", "sequences", package = "syntenet") aastringsetlist <- fasta2AAStringSetlist(fasta_dir)
fasta_dir <- system.file("extdata", "sequences", package = "syntenet") aastringsetlist <- fasta2AAStringSetlist(fasta_dir)
Find group-specific clusters based on user-defined species classification
find_GS_clusters( profile_matrix = NULL, species_annotation = NULL, min_percentage = 50 )
find_GS_clusters( profile_matrix = NULL, species_annotation = NULL, min_percentage = 50 )
profile_matrix |
A matrix of phylogenomic profiles obtained
with |
species_annotation |
A 2-column data frame with species IDs in the first column (same as column names of profile matrix), and species annotation (e.g., higher-level taxonomic information) in the second column. |
min_percentage |
Numeric scalar with the minimum percentage of species in a group to consider group specificity. For instance, if a given cluster is present in only 1 group of species, but in less than min_percentage of the species for this group, it will not be considered a group-specific cluster. This filtering criterion is useful to differentiate group-specific clusters (e.g., family-specific) from subgroup-specific clusters (e.g., genus-specific). Default: 50. |
A data frame with the following variables:
To which group of species the cluster is specific.
Percentage of species from the group that are represented by the cluster.
Cluster ID.
data(clusters) profile_matrix <- phylogenomic_profile(clusters) # Species annotation species_order <- c( "vra", "van", "pvu", "gma", "cca", "tpr", "mtr", "adu", "lja", "Lang", "car", "pmu", "ppe", "pbr", "mdo", "roc", "fve", "Mnot", "Zjuj", "hlu", "jcu", "mes", "rco", "lus", "ptr" ) species_annotation <- data.frame( Species = species_order, Family = c(rep("Fabaceae", 11), rep("Rosaceae", 6), "Moraceae", "Ramnaceae", "Cannabaceae", rep("Euphorbiaceae", 3), "Linaceae", "Salicaceae") ) gs_clusters <- find_GS_clusters(profile_matrix, species_annotation)
data(clusters) profile_matrix <- phylogenomic_profile(clusters) # Species annotation species_order <- c( "vra", "van", "pvu", "gma", "cca", "tpr", "mtr", "adu", "lja", "Lang", "car", "pmu", "ppe", "pbr", "mdo", "roc", "fve", "Mnot", "Zjuj", "hlu", "jcu", "mes", "rco", "lus", "ptr" ) species_annotation <- data.frame( Species = species_order, Family = c(rep("Fabaceae", 11), rep("Rosaceae", 6), "Moraceae", "Ramnaceae", "Cannabaceae", rep("Euphorbiaceae", 3), "Linaceae", "Salicaceae") ) gs_clusters <- find_GS_clusters(profile_matrix, species_annotation)
Read GFF/GTF files in a directory as a GRangesList object
gff2GRangesList(gff_dir)
gff2GRangesList(gff_dir)
gff_dir |
Character indicating the path to the directory containing GFF/GTF files. |
A GRangesList object, where each element represents a different GFF/GTF file.
gff_dir <- system.file("extdata", "annotation", package = "syntenet") grangeslist <- gff2GRangesList(gff_dir)
gff_dir <- system.file("extdata", "annotation", package = "syntenet") grangeslist <- gff2GRangesList(gff_dir)
Infer microsynteny-based phylogeny with IQTREE
infer_microsynteny_phylogeny( transposed_profiles = NULL, bootr = 1000, alrtboot = 1000, threads = "AUTO", model = "MK+FO+R", outdir = tempdir(), outgroup = NULL, verbose = FALSE )
infer_microsynteny_phylogeny( transposed_profiles = NULL, bootr = 1000, alrtboot = 1000, threads = "AUTO", model = "MK+FO+R", outdir = tempdir(), outgroup = NULL, verbose = FALSE )
transposed_profiles |
A binary and transposed profile matrix.
The profile matrix can be obtained with |
bootr |
Numeric scalar with the number of bootstrap replicates. Default: 1000. |
alrtboot |
Numeric scalar with the number of replicates for the SH-like approximate likelihood ratio test. Default: 1000. |
threads |
Numeric scalar indicating the number of threads to use or "AUTO", which allows IQTREE to automatically choose the best number of threads to use. Default: "AUTO". |
model |
Substitution model to use. If you are unsure, pick the default. Default: "MK+FO+R". |
outdir |
Path to output directory. By default, files are saved in a temporary directory, so they will be deleted when the R session closes. If you want to keep the files, specify a custom output directory. |
outgroup |
Name of outgroup clade to group the phylogeny. Default: NULL (unrooted phylogeny). |
verbose |
Logical indicating if progress messages should be prompted. Default: FALSE. |
A character vector of paths to output files.
data(clusters) profile_matrix <- phylogenomic_profile(clusters) tmat <- binarize_and_transpose(profile_matrix) # Leave only some legumes and P. mume as an outgroup for testing purposes included <- c("gma", "pvu", "vra", "van", "cca", "pmu") tmat <- tmat[rownames(tmat) %in% included, ] # Remove non-variable sites tmat <- tmat[, colSums(tmat) != length(included)] if(iqtree_is_installed()) { phylo <- infer_microsynteny_phylogeny(tmat, outgroup = "pmu", threads = 1) }
data(clusters) profile_matrix <- phylogenomic_profile(clusters) tmat <- binarize_and_transpose(profile_matrix) # Leave only some legumes and P. mume as an outgroup for testing purposes included <- c("gma", "pvu", "vra", "van", "cca", "pmu") tmat <- tmat[rownames(tmat) %in% included, ] # Remove non-variable sites tmat <- tmat[, colSums(tmat) != length(included)] if(iqtree_is_installed()) { phylo <- infer_microsynteny_phylogeny(tmat, outgroup = "pmu", threads = 1) }
Infer synteny network
infer_syntenet( blast_list = NULL, annotation = NULL, outdir = tempdir(), anchors = 5, max_gaps = 25, is_pairwise = TRUE, verbose = FALSE, bp_param = BiocParallel::SerialParam(), ... )
infer_syntenet( blast_list = NULL, annotation = NULL, outdir = tempdir(), anchors = 5, max_gaps = 25, is_pairwise = TRUE, verbose = FALSE, bp_param = BiocParallel::SerialParam(), ... )
blast_list |
A list of data frames, each data frame having
the tabular output of BLASTp or similar programs, such as DIAMOND.
This is the output of the function |
annotation |
A processed GRangesList, CompressedGRangesList, or
list of GRanges as returned by |
outdir |
Path to the output directory. Default: tempdir(). |
anchors |
Numeric indicating the minimum required number of genes to call a syntenic block. Default: 5. |
max_gaps |
Numeric indicating the number of upstream and downstream genes to search for anchors. Default: 25. |
is_pairwise |
specify if only pairwise blocks should be reported Default: TRUE |
verbose |
Logical indicating if log messages should be printed on screen. Default: FALSE. |
bp_param |
BiocParallel back-end to be used.
Default: |
... |
Any additional arguments to the |
A network represented as an edge list.
# Load data data(proteomes) data(annotation) data(blast_list) # Create processed annotation list annotation <- process_input(proteomes, annotation)$annotation # Infer the synteny network net <- infer_syntenet(blast_list, annotation)
# Load data data(proteomes) data(annotation) data(blast_list) # Create processed annotation list annotation <- process_input(proteomes, annotation)$annotation # Infer the synteny network net <- infer_syntenet(blast_list, annotation)
Detect interspecies synteny
interspecies_synteny( blast_inter = NULL, annotation = NULL, inter_dir = file.path(tempdir(), "inter"), anchors = 5, max_gaps = 25, is_pairwise = TRUE, verbose = FALSE, bp_param = BiocParallel::SerialParam(), ... )
interspecies_synteny( blast_inter = NULL, annotation = NULL, inter_dir = file.path(tempdir(), "inter"), anchors = 5, max_gaps = 25, is_pairwise = TRUE, verbose = FALSE, bp_param = BiocParallel::SerialParam(), ... )
blast_inter |
A list of BLAST/DIAMOND data frames for
interspecies comparisons as returned by |
annotation |
A processed GRangesList or CompressedGRangesList object
as returned by |
inter_dir |
Path to output directory where .collinearity files will be stored. |
anchors |
Numeric indicating the minimum required number of genes to call a syntenic block. Default: 5. |
max_gaps |
Numeric indicating the number of upstream and downstream genes to search for anchors. Default: 25. |
is_pairwise |
specify if only pairwise blocks should be reported Default: TRUE. |
verbose |
Logical indicating if log messages should be printed on screen. Default: FALSE. |
bp_param |
BiocParallel back-end to be used.
Default: |
... |
Any additional arguments to the |
Paths to .collinearity files.
# Load data data(proteomes) data(blast_list) data(annotation) # Get DIAMOND and processed annotation lists blast_inter <- blast_list[2] annotation <- process_input(proteomes, annotation)$annotation # Detect interspecies synteny intersyn <- interspecies_synteny(blast_inter, annotation)
# Load data data(proteomes) data(blast_list) data(annotation) # Get DIAMOND and processed annotation lists blast_inter <- blast_list[2] annotation <- process_input(proteomes, annotation)$annotation # Detect interspecies synteny intersyn <- interspecies_synteny(blast_inter, annotation)
Detect intraspecies synteny
intraspecies_synteny( blast_intra = NULL, annotation = NULL, intra_dir = file.path(tempdir(), "intra"), anchors = 5, max_gaps = 25, is_pairwise = TRUE, verbose = FALSE, bp_param = BiocParallel::SerialParam(), ... )
intraspecies_synteny( blast_intra = NULL, annotation = NULL, intra_dir = file.path(tempdir(), "intra"), anchors = 5, max_gaps = 25, is_pairwise = TRUE, verbose = FALSE, bp_param = BiocParallel::SerialParam(), ... )
blast_intra |
A list of BLAST/DIAMOND data frames for
intraspecies comparisons as returned by |
annotation |
A processed GRangesList or CompressedGRangesList object
as returned by |
intra_dir |
Path to output directory where .collinearity files will be stored. |
anchors |
Numeric indicating the minimum required number of genes to call a syntenic block. Default: 5. |
max_gaps |
Numeric indicating the number of upstream and downstream genes to search for anchors. Default: 25. |
is_pairwise |
Logical indicating if only pairwise blocks should be reported. Default: TRUE. |
verbose |
Logical indicating if log messages should be printed on screen. Default: FALSE. |
bp_param |
BiocParallel back-end to be used.
Default: |
... |
Any additional arguments to the |
Paths to .collinearity files.
# Load data data(scerevisiae_annot) data(scerevisiae_diamond) # Detect intragenome synteny intra_syn <- intraspecies_synteny( scerevisiae_diamond, scerevisiae_annot )
# Load data data(scerevisiae_annot) data(scerevisiae_diamond) # Detect intragenome synteny intra_syn <- intraspecies_synteny( scerevisiae_diamond, scerevisiae_annot )
Check if IQTREE is installed
iqtree_is_installed()
iqtree_is_installed()
Logical indicating whether IQTREE is installed or not.
iqtree_is_installed()
iqtree_is_installed()
Get IQ-TREE version
iqtree_version()
iqtree_version()
Numeric indicating IQ-TREE version, with either 1 or 2.
iqtree_version()
iqtree_version()
Check if last is installed
last_is_installed()
last_is_installed()
Logical indicating whether last is installed or not.
last_is_installed()
last_is_installed()
Data obtained from Zhao & Schranz, 2019.
data(network)
data(network)
An edgelist (i.e., a 2-column data frame with node 1 in column 1 and node 2 in column 2).
Zhao, T., & Schranz, M. E. (2019). Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proceedings of the National Academy of Sciences, 116(6), 2165-2174.
data(network)
data(network)
The .collinearity files can be obtained with intraspecies_synteny
and interspecies_synteny
, which execute a native version of the
MCScan algorithm.
parse_collinearity(collinearity_paths = NULL, as = "anchors")
parse_collinearity(collinearity_paths = NULL, as = "anchors")
collinearity_paths |
Character vector of paths to .collinearity files. |
as |
Character specifying what to extract. One of "anchors" (default), "blocks", or "all". |
If as is "anchors", a data frame with variables "Anchor1", and "Anchor2". If as is "blocks", a data frame with variables "Block", "Block_score", "Chr", and "Orientation". If as is "all", a data frame with all aforementioned variables, which indicate:
Numeric, synteny block ID
Numeric, score of synteny block.
Character, query and target chromosome of the synteny
block formatted as "
Character, the orientation of genes within blocks, with "plus" indicating that genes are in the same direction, and "minus" indicating that genes are in opposite directions.
Character, gene ID of anchor 1.
Character, gene ID of anchor 2.
collinearity_paths <- system.file( "extdata", "Scerevisiae.collinearity", package = "syntenet" ) net <- parse_collinearity(collinearity_paths)
collinearity_paths <- system.file( "extdata", "Scerevisiae.collinearity", package = "syntenet" ) net <- parse_collinearity(collinearity_paths)
Perform phylogenomic profiling for synteny network clusters
phylogenomic_profile(clusters = NULL)
phylogenomic_profile(clusters = NULL)
clusters |
A 2-column data frame with variables Gene and
Cluster as returned by |
A matrix of i rows and j columns containing the number of genes in cluster i for each species j. The number of rows is equal to the number of clusters in clusters, and the number of columns is equal to the number of species in clusters.
data(clusters) profiles <- phylogenomic_profile(clusters)
data(clusters) profiles <- phylogenomic_profile(clusters)
Plot network
plot_network( network = NULL, clusters = NULL, cluster_id = NULL, color_by = "cluster", interactive = FALSE, dim_interactive = c(600, 600) )
plot_network( network = NULL, clusters = NULL, cluster_id = NULL, color_by = "cluster", interactive = FALSE, dim_interactive = c(600, 600) )
network |
The synteny network represented as an edge list, which is a 2-column data frame with each member of the anchor pair in a column. |
clusters |
A 2-column data frame with the variables Gene
and Cluster representing gene ID and cluster ID, respectively,
exactly as returned by |
cluster_id |
Character scalar or vector with cluster ID. If more than one cluster is passed as input, clusters are colored differently. |
color_by |
Either "cluster" or a 2-column data frame with gene IDs in the first column and variable to be used for coloring (e.g., taxonomic information) in the second column. |
interactive |
Logical scalar indicating whether to display an interactive network or not. Default: FALSE. |
dim_interactive |
Numeric vector of length 2 with the window dimensions of the interactive plot. If interactive is set to FALSE, this parameter is ignored. |
A ggplot object with the network.
data(network) data(clusters) # Option 1: 1 cluster cluster_id <- 25 plot_network(network, clusters, cluster_id) # Option 2: 2 clusters cluster_id <- c(25, 1089) plot_network(network, clusters, cluster_id) # Option 3: custom annotation for coloring species_order <- c( "vra", "van", "pvu", "gma", "cca", "tpr", "mtr", "adu", "lja", "Lang", "car", "pmu", "ppe", "pbr", "mdo", "roc", "fve", "Mnot", "Zjuj", "jcu", "mes", "rco", "lus", "ptr" ) species_annotation <- data.frame( Species = species_order, Family = c(rep("Fabaceae", 11), rep("Rosaceae", 6), "Moraceae", "Ramnaceae", rep("Euphorbiaceae", 3), "Linaceae", "Salicaceae") ) genes <- unique(c(network$node1, network$node2)) gene_df <- data.frame( Gene = genes, Species = unlist(lapply(strsplit(genes, "_"), head, 1)) ) gene_df <- merge(gene_df, species_annotation)[, c("Gene", "Family")] plot_network(network, clusters, cluster_id = 25, color_by = gene_df)
data(network) data(clusters) # Option 1: 1 cluster cluster_id <- 25 plot_network(network, clusters, cluster_id) # Option 2: 2 clusters cluster_id <- c(25, 1089) plot_network(network, clusters, cluster_id) # Option 3: custom annotation for coloring species_order <- c( "vra", "van", "pvu", "gma", "cca", "tpr", "mtr", "adu", "lja", "Lang", "car", "pmu", "ppe", "pbr", "mdo", "roc", "fve", "Mnot", "Zjuj", "jcu", "mes", "rco", "lus", "ptr" ) species_annotation <- data.frame( Species = species_order, Family = c(rep("Fabaceae", 11), rep("Rosaceae", 6), "Moraceae", "Ramnaceae", rep("Euphorbiaceae", 3), "Linaceae", "Salicaceae") ) genes <- unique(c(network$node1, network$node2)) gene_df <- data.frame( Gene = genes, Species = unlist(lapply(strsplit(genes, "_"), head, 1)) ) gene_df <- merge(gene_df, species_annotation)[, c("Gene", "Family")] plot_network(network, clusters, cluster_id = 25, color_by = gene_df)
Plot a heatmap of phylogenomic profiles
plot_profiles( profile_matrix = NULL, species_annotation = NULL, palette = "Greens", dist_function = stats::dist, dist_params = list(method = "euclidean"), clust_function = stats::hclust, clust_params = list(method = "ward.D"), cluster_species = FALSE, show_colnames = FALSE, discretize = TRUE, ... )
plot_profiles( profile_matrix = NULL, species_annotation = NULL, palette = "Greens", dist_function = stats::dist, dist_params = list(method = "euclidean"), clust_function = stats::hclust, clust_params = list(method = "ward.D"), cluster_species = FALSE, show_colnames = FALSE, discretize = TRUE, ... )
profile_matrix |
A matrix of phylogenomic profiles obtained
with |
species_annotation |
A 2-column data frame with species IDs in the first column (same as column names of profile matrix), and species annotation (e.g., higher-level taxonomic information) in the second column. |
palette |
A character vector of colors or a character scalar with the name of an RColorBrewer palette. Default: "RdYlBu". |
dist_function |
Function to use to calculate a distance matrix for
synteny clusters. Popular examples include |
dist_params |
A list with parameters to be passed to the function specified in parameter dist_function. Default: list(method = "euclidean"). |
clust_function |
Function to use to cluster the distance matrix
returned by the function specified in |
clust_params |
A list with additional parameters (if any) to be passed to the function specified in parameter clust_function. Default: list(method = "ward.D"). |
cluster_species |
Either a logical scalar (TRUE or FALSE) or
a character vector with the order in which species should be arranged.
TRUE or FALSE indicate whether hierarchical clustering should be applied
to rows (species). Ideally, the character vector
should contain the order of species in a phylogenetically meaningful way.
If users pass a named vector, vector names will be used to rename species.
If users have a species tree, they can read it
with |
show_colnames |
Logical indicating whether to show column names (i.e., cluster IDs) or not. Showing cluster IDs can be useful when visualizing a small subset of them. When visualizing all clusters, cluster IDs are impossible to read. Default: FALSE. |
discretize |
Logical indicating whether to discretize clusters in 4 categories: 0, 1, 2, and 3+. If FALSE, counts will be log2 transformed. Default: TRUE. |
... |
Additional parameters to |
A pheatmap object.
data(clusters) profile_matrix <- phylogenomic_profile(clusters) species_order <- c( "vra", "van", "pvu", "gma", "cca", "tpr", "mtr", "adu", "lja", "Lang", "car", "pmu", "ppe", "pbr", "mdo", "roc", "fve", "Mnot", "Zjuj", "jcu", "mes", "rco", "lus", "ptr" ) species_names <- c( "V. radiata", "V. angularis", "P. vulgaris", "G. max", "C. cajan", "T. pratense", "M. truncatula", "A. duranensis", "L. japonicus", "L. angustifolius", "C. arietinum", "P. mume", "P. persica", "P. bretschneideri", "M. domestica", "R. occidentalis", "F. vesca", "M. notabilis", "Z. jujuba", "J. curcas", "M. esculenta", "R. communis", "L. usitatissimum", "P. trichocarpa" ) names(species_order) <- species_names species_annotation <- data.frame( Species = species_order, Family = c(rep("Fabaceae", 11), rep("Rosaceae", 6), "Moraceae", "Ramnaceae", rep("Euphorbiaceae", 3), "Linaceae", "Salicaceae") ) p <- plot_profiles(profile_matrix, species_annotation, cluster_species = species_order) p <- plot_profiles(profile_matrix, species_annotation, cluster_species = species_order, discretize = FALSE)
data(clusters) profile_matrix <- phylogenomic_profile(clusters) species_order <- c( "vra", "van", "pvu", "gma", "cca", "tpr", "mtr", "adu", "lja", "Lang", "car", "pmu", "ppe", "pbr", "mdo", "roc", "fve", "Mnot", "Zjuj", "jcu", "mes", "rco", "lus", "ptr" ) species_names <- c( "V. radiata", "V. angularis", "P. vulgaris", "G. max", "C. cajan", "T. pratense", "M. truncatula", "A. duranensis", "L. japonicus", "L. angustifolius", "C. arietinum", "P. mume", "P. persica", "P. bretschneideri", "M. domestica", "R. occidentalis", "F. vesca", "M. notabilis", "Z. jujuba", "J. curcas", "M. esculenta", "R. communis", "L. usitatissimum", "P. trichocarpa" ) names(species_order) <- species_names species_annotation <- data.frame( Species = species_order, Family = c(rep("Fabaceae", 11), rep("Rosaceae", 6), "Moraceae", "Ramnaceae", rep("Euphorbiaceae", 3), "Linaceae", "Salicaceae") ) p <- plot_profiles(profile_matrix, species_annotation, cluster_species = species_order) p <- plot_profiles(profile_matrix, species_annotation, cluster_species = species_order, discretize = FALSE)
Process sequence data
process_input( seq = NULL, annotation = NULL, gene_field = "gene_id", filter_annotation = FALSE )
process_input( seq = NULL, annotation = NULL, gene_field = "gene_id", filter_annotation = FALSE )
seq |
A list of AAStringSet objects, each list element containing protein sequences for a given species. This list must have names (not NULL), and names of each list element must match the names of list elements in annotation. |
annotation |
A GRangesList, CompressedGRangesList, or list of GRanges with the annotation for the sequences in seq. This list must have names (not NULL), and names of each list element must match the names of list elements in seq. |
gene_field |
Character, name of the column in the GRanges objects that contains gene IDs. Default: "gene_id". |
filter_annotation |
Logical indicating whether annotation should be filtered to keep only genes that are also in seq. This is particularly useful if users want to remove information on non-protein coding genes from annotation, since such genes are typically not present in sets of whole-genome protein sequences. Default: FALSE. |
This function processes the input sequences and annotation to:
Remove whitespace and anything after it in sequence names
(i.e., names(seq[[x]])
, which is equivalent to FASTA headers), if
there is any.
Add a unique species identifier to sequence names. The species identifier consists of the first 3-5 strings of the element name. For instance, if the first element of the seq list is named "Athaliana", each sequence in it will have an identifier "Atha_" added to the beginning of each gene name (e.g., Atha_AT1G01010).
If sequences have an asterisk (*) representing stop codon, remove it.
Add a unique species identifier (same as above) to gene and chromosome names of each element of the annotation GRangesList/CompressedGRangesList.
Filter each element of the annotation GRangesList/CompressedGRangesList to keep only seqnames, ranges, and gene ID.
A list of 2 elements:
The processed list of AAStringSet objects from seq.
The processed GRangesList or CompressedGRangesList object from annotation.
data(annotation) data(proteomes) seq <- proteomes clean_data <- process_input(seq, annotation)
data(annotation) data(proteomes) seq <- proteomes clean_data <- process_input(seq, annotation)
Save the transposed binary profiles matrix to a file in PHYLIP format
profiles2phylip(transposed_profiles = NULL, outdir = tempdir())
profiles2phylip(transposed_profiles = NULL, outdir = tempdir())
transposed_profiles |
A binary and transposed profile matrix.
The profile matrix can be obtained with |
outdir |
Path to output directory. By default, files are saved in a temporary directory, so they will be deleted when the R session closes. If you want to keep the files, specify a custom output directory. |
Character specifying the path to the PHYLIP file.
data(clusters) profile_matrix <- phylogenomic_profile(clusters) tmat <- binarize_and_transpose(profile_matrix) profiles2phylip(tmat)
data(clusters) profile_matrix <- phylogenomic_profile(clusters) tmat <- binarize_and_transpose(profile_matrix) profiles2phylip(tmat)
Data obtained from Pico-PLAZA 3.0. Only the translated sequences of primary transcripts were included, and only genes from chromosomes 1, 2, and 3.
data(proteomes)
data(proteomes)
A list of AAStringSet objects containing
the elements Olucimarinus
, Osp_RCC809
, and Otauri
.
Van Bel, M., Silvestri, F., Weitz, E. M., Kreft, L., Botzki, A., Coppens, F., & Vandepoele, K. (2021). PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants. Nucleic acids research.
data(proteomes)
data(proteomes)
MCSCanX provides a clustering module for viewing the relationship of collinear segments in multiple genomes (or heavily redundant genomes). It takes the predicted pairwise segments from dynamic programming (DAGchainer in particular) and then tries to build consensus segments from a set of related, overlapping segments.
rcpp_mcscanx_file( blast_file, gff_file, prefix = "out", outdir = "", match_score = 50L, gap_penalty = -1L, match_size = 5L, e_value = 1e-05, max_gaps = 25L, overlap_window = 5L, is_pairwise = FALSE, in_synteny = 0L, species_id_length = 3L, verbose = FALSE )
rcpp_mcscanx_file( blast_file, gff_file, prefix = "out", outdir = "", match_score = 50L, gap_penalty = -1L, match_size = 5L, e_value = 1e-05, max_gaps = 25L, overlap_window = 5L, is_pairwise = FALSE, in_synteny = 0L, species_id_length = 3L, verbose = FALSE )
blast_file |
Character indicating the path to the BLAST/DIAMOND output file. |
gff_file |
Character indicating the path to the "gff" file, which is a tab-delimited file with 4 columns indicating the chromosome name, gene id, gene start position, and gene end position, respectively. |
prefix |
Character indicating the prefix to output files. Default: "out". |
outdir |
Character indicating the path to the output directory. Default: "". |
match_score |
Numeric indicating the match score. Default: 50. |
gap_penalty |
Numeric indicating the gap penalty. Default: -1. |
match_size |
Numeric indicating the minimum number of genes required to call synteny. Default: 5. |
e_value |
Numeric indicating the minimum e-value allowed. Default: 1e-5. |
max_gaps |
Numeric indicating the maximum number of gaps between
genes allowed. The unit measure of gaps is number of genes,
so |
overlap_window |
Numeric indicating the overlap window. Default: 5. |
is_pairwise |
Logical indicating whether only pairwise blocks should be reported. Default: FALSE. |
in_synteny |
Numeric indicating the patterns of collinear blocks, where 0 indicates intra and interspecies comparisons, 1 indicates intraspecies comparisons, and 2 indicates interspecies comparisons. Default: 0. |
species_id_length |
Integer indicating the length of the species IDs. Default: 3. 0: intra- and inter-species (default); 1: intra-species; 2: inter-species |
verbose |
Logical indicating whether to print progress messages to the screen. Default: FALSE. |
NULL, and a .collinearity file is created in the directory
specified in outdir
.
Kristian K Ullrich and Fabricio Almeida-Silva
Wang et al. (2012) MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic acids research. 40.7, e49-e49.
Haas et al. (2004) DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 20.18 3643-3646.
Read DIAMOND/BLAST tables as a list of data frames
read_diamond(diamond_dir = NULL)
read_diamond(diamond_dir = NULL)
diamond_dir |
Path to directory containing the tabular output of DIAMOND or similar programs (e.g., BLAST). |
A list of data frames with the tabular DIAMOND output.
# Path to output directory diamond_dir <- system.file("extdata", package = "syntenet") # Read output l <- read_diamond(diamond_dir)
# Path to output directory diamond_dir <- system.file("extdata", package = "syntenet") # Read output l <- read_diamond(diamond_dir)
Wrapper to run DIAMOND from an R session
run_diamond( seq = NULL, top_hits = 5, verbose = FALSE, outdir = tempdir(), threads = NULL, compare = "all", ... )
run_diamond( seq = NULL, top_hits = 5, verbose = FALSE, outdir = tempdir(), threads = NULL, compare = "all", ... )
seq |
A processed list of AAStringSet objects
as returned by |
top_hits |
Number of top hits to keep in DIAMOND search. Default: 5. |
verbose |
Logical indicating if progress messages should be printed. Default: FALSE. |
outdir |
Output directory for DIAMOND results. By default, output files are saved to a temporary directory. |
threads |
Number of threads to use. Default: let DIAMOND auto-detect and use all available virtual cores on the machine. |
compare |
Character scalar indicating which comparisons should be made when running DIAMOND. Possible modes are "all" (all-vs-all comparisons), "intraspecies" (intraspecies comparisons only), or "interspecies" (interspecies comparisons only). Alternatively, users can pass a 2-column data frame as input with the names of species to be compared. |
... |
Any additional arguments to |
A list of data frames containing DIAMOND's tabular output
for each pairwise combination of species. For n species, the list length
will be .
data(proteomes) data(annotation) seq <- process_input(proteomes, annotation)$seq[1:2] if(diamond_is_installed()) { diamond_results <- run_diamond(seq) }
data(proteomes) data(annotation) seq <- process_input(proteomes, annotation)$seq[1:2] if(diamond_is_installed()) { diamond_results <- run_diamond(seq) }
Wrapper to run last from an R session
run_last( seq = NULL, verbose = FALSE, outdir = tempdir(), threads = 1, compare = "all", lastD = 1e+06, ... )
run_last( seq = NULL, verbose = FALSE, outdir = tempdir(), threads = 1, compare = "all", lastD = 1e+06, ... )
seq |
A processed list of AAStringSet objects
as returned by |
verbose |
Logical indicating if progress messages should be printed. Default: FALSE. |
outdir |
Output directory for last results. By default, output files are saved to a temporary directory. |
threads |
Number of threads to use. Default: 1. |
compare |
Character scalar indicating which comparisons should be made when running last. Possible modes are "all" (all-vs-all comparisons), "intraspecies" (intraspecies comparisons only), or "interspecies" (interspecies comparisons only). Alternatively, users can pass a 2-column data frame as input with the names of species to be compared. |
lastD |
last option D: query letters per random alignment. Default: 1e6. |
... |
Any additional arguments to |
A list of data frames containing last's tabular output
for each pairwise combination of species. For n species, the list length
will be .
data(proteomes) data(annotation) seq <- process_input(proteomes, annotation)$seq[1:2] if(last_is_installed()) { last_results <- run_last(seq) }
data(proteomes) data(annotation) seq <- process_input(proteomes, annotation)$seq[1:2] if(last_is_installed()) { last_results <- run_last(seq) }
Data obtained from Ensembl Fungi. Only annotation data for primary transcripts were included.
data(scerevisiae_annot)
data(scerevisiae_annot)
A GRangesList as returned by process_input()
containing
the element Scerevisiae.
data(scerevisiae_annot)
data(scerevisiae_annot)
List obtained with run_diamond()
.
data(scerevisiae_diamond)
data(scerevisiae_diamond)
A list of data frames (length 1) containing the whole paranome of S. cerevisiae resulting from intragenome similarity searches.
data(scerevisiae_diamond)
data(scerevisiae_diamond)