Title: | Systematic quality checks on comparative genomics analyses |
---|---|
Description: | cogeqc aims to facilitate systematic quality checks on standard comparative genomics analyses to help researchers detect issues and select the most suitable parameters for each data set. cogeqc can be used to asses: i. genome assembly and annotation quality with BUSCOs and comparisons of statistics with publicly available genomes on the NCBI; ii. orthogroup inference using a protein domain-based approach and; iii. synteny detection using synteny network properties. There are also data visualization functions to explore QC summary statistics. |
Authors: | FabrÃcio Almeida-Silva [aut, cre] , Yves Van de Peer [aut] |
Maintainer: | FabrÃcio Almeida-Silva <[email protected]> |
License: | GPL-3 |
Version: | 1.11.0 |
Built: | 2024-12-19 03:23:25 UTC |
Source: | https://github.com/bioc/cogeqc |
Assess orthogroup inference based on functional annotation
assess_orthogroups( orthogroups = NULL, annotation = NULL, correct_overclustering = TRUE )
assess_orthogroups( orthogroups = NULL, annotation = NULL, correct_overclustering = TRUE )
orthogroups |
A 3-column data frame with columns Orthogroup,
Species, and Gene. This data frame can be created from
the 'Orthogroups.tsv' file generated by OrthoFinder with the function
|
annotation |
A list of 2-column data frames with columns Gene (gene ID) and Annotation (annotation ID). The names of list elements must correspond to species names as in the second column of orthogroups. For instance, if there are two species in the orthogroups data frame named "SpeciesA" and "SpeciesB", annotation must be a list of 2 data frames, and each list element must be named "SpeciesA" and "SpeciesB". |
correct_overclustering |
Logical indicating whether to correct for overclustering in orthogroups. Default: TRUE. |
A data frame.
data(og) data(interpro_ath) data(interpro_bol) # Subsetting annotation for demonstration purposes. annotation <- list(Ath = interpro_ath[1:1000,], Bol = interpro_bol[1:1000,]) assess <- assess_orthogroups(og, annotation)
data(og) data(interpro_ath) data(interpro_bol) # Subsetting annotation for demonstration purposes. annotation <- list(Ath = interpro_ath[1:1000,], Bol = interpro_bol[1:1000,]) assess <- assess_orthogroups(og, annotation)
Assess synteny network based on graph properties
assess_synnet(synnet = NULL, cc_type = "average")
assess_synnet(synnet = NULL, cc_type = "average")
synnet |
Edge list for the synteny network in a 2-column data frame, with columns 1 and 2 representing names of loci in anchor 1 and anchor 2, respectively. |
cc_type |
Type of clustering coefficient to be calculated. One of 'global' or 'average'. Default: 'average'. |
Network score is the product of the network's clustering coefficient, node count, and R squared for the scale-free topology fit.
A data frame with the following variables:
Numeric representing clustering coefficient.
Numeric representing number of nodes in the network.
Numeric indicating the coefficient of determination for the scale-free topology fit.
Numeric representing network score, which is the product of 'CC' and 'Node_number'.
data(synnet) assess_synnet(synnet)
data(synnet) assess_synnet(synnet)
assess_synnet
Assess list of synteny networks as in assess_synnet
assess_synnet_list(synnet_list = NULL, cc_type = "average")
assess_synnet_list(synnet_list = NULL, cc_type = "average")
synnet_list |
A list of networks, each network being an edge list as a 2-column data frame, with columns 1 and 2 representing names of loci in anchor 1 and anchor 2, respectively. |
cc_type |
Type of clustering coefficient to be calculated. One of 'global' or 'average'. Default: 'average'. |
A data frame with the following variables:
Numeric representing clustering coefficient.
Numeric representing number of nodes in the network.
Numeric indicating the coefficient of determination for the scale-free topology fit.
Numeric representing network score, which is the product of 'CC' and 'Node_number'.
Character of network name.
set.seed(123) data(synnet) net1 <- synnet net2 <- synnet[-sample(1:10000, 500), ] net3 <- synnet[-sample(1:10000, 1000), ] synnet_list <- list(net1 = net1, net2 = net2, net3 = net3) assess_synnet_list(synnet_list)
set.seed(123) data(synnet) net1 <- synnet net2 <- synnet[-sample(1:10000, 500), ] net3 <- synnet[-sample(1:10000, 1000), ] synnet_list <- list(net1 = net1, net2 = net2, net3 = net3) assess_synnet_list(synnet_list)
This object was created with the function read_busco()
using
a batch run of BUSCO on the genomes of Herbaspirillum seropedicae SmR1
and Herbaspirillum rubrisubalbicans M1.
data(batch_summary)
data(batch_summary)
A 2-column data frame with the following variables:
Factor of BUSCO classes
Numeric with the percentage of BUSCOs in each class.
Character with the lineage dataset used.
Character with the name of the FASTA file used.
data(batch_summary)
data(batch_summary)
Check if BUSCO is installed
busco_is_installed()
busco_is_installed()
Logical indicating whether BUSCO is installed or not.
busco_is_installed()
busco_is_installed()
Calculate homogeneity scores for orthogroups
calculate_H( orthogroup_df, correct_overclustering = TRUE, max_size = 200, update_score = TRUE )
calculate_H( orthogroup_df, correct_overclustering = TRUE, max_size = 200, update_score = TRUE )
orthogroup_df |
Data frame with orthogroups and their associated genes and annotation. The columns Gene, Orthogroup, and Annotation are mandatory, and they must represent Gene ID, Orthogroup ID, and Annotation ID (e.g., Interpro/PFAM), respectively. |
correct_overclustering |
Logical indicating whether to correct for overclustering in orthogroups. Default: TRUE. |
max_size |
Numeric indicating the maximum orthogroup size to consider. If orthogroups are too large, calculating Sorensen-Dice indices for all pairwise combinations could take a long time, so setting a limit prevents that. Default: 200. |
update_score |
Logical indicating whether to replace scores with corrected scores or not. If FALSE, the dispersal term and corrected scores are returned as separate variables in the output data frame. |
Homogeneity is calculated based on pairwise Sorensen-Dice similarity indices between gene pairs in an orthogroup, and they range from 0 to 1. Thus, if all genes in an orthogroup share the same domain, the orthogroup will have a homogeneity score of 1. On the other hand, if genes in an orthogroup do not have any domain in common, the orthogroup will have a homogeneity score of 0. The percentage of orthogroups with size greater than max_size will be subtracted from the homogeneity scores, since too large orthogroups typically have very low scores. Additionally, users can correct for overclustering by penalizing protein domains that appear in multiple orthogroups (default).
A 2-column data frame with the variables Orthogroup and Score, corresponding to orthogroup ID and orthogroup score, respectively. If update_score = FALSE, additional columns named Dispersal and Score_c are added, which correspond to the dispersal term and corrected scores, respectively.
data(og) data(interpro_ath) orthogroup_df <- merge(og[og$Species == "Ath", ], interpro_ath) # Filter data to reduce run time orthogroup_df <- orthogroup_df[1:10000, ] H <- calculate_H(orthogroup_df)
data(og) data(interpro_ath) orthogroup_df <- merge(og[og$Species == "Ath", ], interpro_ath) # Filter data to reduce run time orthogroup_df <- orthogroup_df[1:10000, ] H <- calculate_H(orthogroup_df)
This function helps users analyze their genome assembly stats in a context by comparing metrics obtained by users with "reference" metrics in closely-related organisms.
compare_genome_stats(ncbi_stats = NULL, user_stats = NULL)
compare_genome_stats(ncbi_stats = NULL, user_stats = NULL)
ncbi_stats |
A data frame of summary statistics for a particular
taxon obtained from the NCBI, as obtained with the
function |
user_stats |
A data frame with assembly statistics obtained by the user. A column named accession is mandatory, and it must contain unique identifiers for the genome(s) analyzed by the user. Dummy variables can be used as identifiers (e.g., "my_genome_001"), as long as they are unique. All other column containing assembly stats must have the same names as their corresponding columns in the data frame specified in ncbi_stats. For instance, stats on total number of genes and sequence length must be in columns named "gene_count_total" and "sequence_length", as in the ncbi_stats data frame. |
For each genome assembly statistic (e.g., "gene_count_total"), values in user_stats are compared to a distribution of values from ncbi_stats, and their percentile and rank in the distributions are reported.
A data frame with the following variables:
character, unique identifier as in user_stats$accession.
character, name of the genome assembly metric (e.g., "CC_ratio").
numeric, percentile in the distribution.
numeric, rank in the distribution (highest to lowest). For the variable "CC_ratio", ranks go from lowest to highest.
# Use case: user assembled a maize (Zea mays) genome ## Obtain stats for maize genomes on the NCBI ncbi_stats <- get_genome_stats(taxon = "Zea mays") ## Create a data frame of stats for fictional maize genome user_stats <- data.frame( accession = "my_lovely_maize", sequence_length = 2.4 * 1e9, gene_count_total = 50000, CC_ratio = 1 ) # Compare stats compare_genome_stats(ncbi_stats, user_stats)
# Use case: user assembled a maize (Zea mays) genome ## Obtain stats for maize genomes on the NCBI ncbi_stats <- get_genome_stats(taxon = "Zea mays") ## Create a data frame of stats for fictional maize genome user_stats <- data.frame( accession = "my_lovely_maize", sequence_length = 2.4 * 1e9, gene_count_total = 50000, CC_ratio = 1 ) # Compare stats compare_genome_stats(ncbi_stats, user_stats)
Compare inferred orthogroups to a reference set
compare_orthogroups(ref_orthogroups = NULL, test_orthogroups = NULL)
compare_orthogroups(ref_orthogroups = NULL, test_orthogroups = NULL)
ref_orthogroups |
Reference orthogroups in a 3-column data frame
with columns Orthogroup, Species, and Gene.
This data frame can be created from the 'Orthogroups.tsv' file
generated by OrthoFinder with the function |
test_orthogroups |
Test orthogroups that will be compared to ref_orthogroups in the same 3-column data frame format. |
This function compares a test set of orthogroups to a reference set and returns which orthogroups in the reference set are fully preserved in the test set (i.e., identical gene repertoire) and which are not. Species names (column 2) must be the same between reference and test set. If some species are not shared between reference and test sets, they will not be considered for the comparison.
A 2-column data frame with the following variables:
Character of orthogroup IDs.
A logical vector of preservation status. It is TRUE if the orthogroup in the reference set is fully preserved in the test set, and FALSE otherwise.
set.seed(123) data(og) og <- og[1:5000, ] ref <- og # Shuffle genes to simulate a different set test <- data.frame( Orthogroup = sample(og$Orthogroup, nrow(og), replace = FALSE), Species = og$Species, Gene = og$Gene ) comparison <- compare_orthogroups(ref, test) # Calculating percentage of preservation sum(comparison$Preserved) / length(comparison$Preserved)
set.seed(123) data(og) og <- og[1:5000, ] ref <- og # Shuffle genes to simulate a different set test <- data.frame( Orthogroup = sample(og$Orthogroup, nrow(og), replace = FALSE), Species = og$Species, Gene = og$Gene ) comparison <- compare_orthogroups(ref, test) # Calculating percentage of preservation sum(comparison$Preserved) / length(comparison$Preserved)
Goodness of fit test for the scale-free topology model
fit_sft(edges)
fit_sft(edges)
edges |
A 2-column data frame with network edges represented in each. Columns 1 and 2 represent nodes 1 and 2 of each edge. |
A numeric scalar with the R squared for the scale-free topology fit.
data(synnet) edges <- synnet fit_sft(edges)
data(synnet) edges <- synnet fit_sft(edges)
Get summary statistics for genomes on NCBI using the NCBI Datasets API
get_genome_stats(taxon = NULL, filters = NULL)
get_genome_stats(taxon = NULL, filters = NULL)
taxon |
Taxon for which summary statistics will be retrieved, either as a character scalar (e.g., "brassicaceae") or as a numeric scalar representing NCBI Taxonomy ID (e.g., 3700). |
filters |
(optional) A list of filters to use when querying the API
in the form of key-value pairs, with keys in list names and values in list
elements (e.g., |
Possible filters for the filters parameter can be accessed at https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api/#get-/genome/taxon/-taxons-/dataset_report.
A data frame with the following variables:
character, accession number.
character, data source.
numeric, NCBI Taxonomy ID.
character, species' scientific name.
character, species' common name.
character, species' ecotype.
character, species' strain.
character, species' isolate.
character, species' cultivar.
factor, assembly level ("Complete", "Chromosome", "Scaffold", or "Contig").
character, assembly status.
character, assembly name.
character, assembly type.
character, submission date (YYYY-MM-DD).
character, submitter name.
character, sequencing technology.
logical, indicator of wheter the genome is atypical.
character, RefSeq category.
numeric, number of chromosomes.
numeric, total sequence length.
numeric, ungapped sequence length.
numeric, number of contigs.
numeric, contig N50.
numeric, contig L50.
numeric, contig N50.
numeric, contig L50.
numeric, GC percentage (0-100).
character, name of annotation provider.
character, annotation release date (YYYY-MM-DD).
numeric, total number of genes.
numeric, number of protein-coding genes.
numeric, number of non-coding genes.
numeric, number of pseudogenes.
numeric, number of other genes.
numeric, ratio of the number of contigs to the number of chromosomes.
# Example 1: Search for A. thaliana genomes by tax ID ex1 <- get_genome_stats(taxon = 3702) # Example 2: Search for A. thaliana genomes by name ex2 <- get_genome_stats(taxon = "Arabidopsis thaliana") # Example 3: Search for chromosome-level Brassicaeae genomes ex3 <- get_genome_stats( taxon = "brassicaceae", filters = list(filters.assembly_level = "chromosome") )
# Example 1: Search for A. thaliana genomes by tax ID ex1 <- get_genome_stats(taxon = 3702) # Example 2: Search for A. thaliana genomes by name ex2 <- get_genome_stats(taxon = "Arabidopsis thaliana") # Example 3: Search for chromosome-level Brassicaeae genomes ex3 <- get_genome_stats( taxon = "brassicaceae", filters = list(filters.assembly_level = "chromosome") )
The annotation data were retrieved from PLAZA Dicots 5.0.
data(interpro_ath)
data(interpro_ath)
A 2-column data frame:
Character of gene IDs.
Character of Interpro domains.
Van Bel, M., Silvestri, F., Weitz, E. M., Kreft, L., Botzki, A., Coppens, F., & Vandepoele, K. (2021). PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants. Nucleic acids research.
data(interpro_ath)
data(interpro_ath)
The annotation data were retrieved from PLAZA Dicots 5.0.
data(interpro_bol)
data(interpro_bol)
A 2-column data frame:
Character of gene IDs.
Character of Interpro domains.
Van Bel, M., Silvestri, F., Weitz, E. M., Kreft, L., Botzki, A., Coppens, F., & Vandepoele, K. (2021). PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants. Nucleic acids research.
data(interpro_bol)
data(interpro_bol)
List BUSCO data sets
list_busco_datasets()
list_busco_datasets()
A hierarchically organized list of available data sets as returned
by busco --list-datasets
.
if(busco_is_installed()) { list_busco_datasets() }
if(busco_is_installed()) { list_busco_datasets() }
Data obtained from PLAZA Dicots 5.0.
data(og)
data(og)
A 3-column data frame with the following variables:
Orthogroup ID.
Abbreviation for species' name.
Gene ID
Van Bel, M., Silvestri, F., Weitz, E. M., Kreft, L., Botzki, A., Coppens, F., & Vandepoele, K. (2021). PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants. Nucleic acids research.
data(og)
data(og)
Plot BUSCO summary output
plot_busco(summary_df = NULL)
plot_busco(summary_df = NULL)
summary_df |
Data frame with BUSCO summary output as returned
by |
A ggplot object with a barplot of BUSCOs in each class.
# Single file result_dir <- system.file("extdata", package = "cogeqc") summary_df <- read_busco(result_dir) # Batch mode data(batch_summary) plot_busco(summary_df) plot_busco(batch_summary)
# Single file result_dir <- system.file("extdata", package = "cogeqc") summary_df <- read_busco(result_dir) # Batch mode data(batch_summary) plot_busco(summary_df) plot_busco(batch_summary)
Plot species-specific duplications
plot_duplications(stats_list = NULL)
plot_duplications(stats_list = NULL)
stats_list |
A list of data frames with Orthofinder summary stats
as returned by the function |
A ggplot object with a barplot of number of species-specific duplications.
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_duplications(stats_list)
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_duplications(stats_list)
Plot percentage of genes in orthogroups for each species
plot_genes_in_ogs(stats_list = NULL)
plot_genes_in_ogs(stats_list = NULL)
stats_list |
A list of data frames with Orthofinder summary stats
as returned by the function |
A ggplot object with a barplot of percentages of genes in orthogroups for each species.
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_genes_in_ogs(stats_list)
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_genes_in_ogs(stats_list)
Plot statistics on genome assemblies on the NCBI
plot_genome_stats(ncbi_stats = NULL, user_stats = NULL)
plot_genome_stats(ncbi_stats = NULL, user_stats = NULL)
ncbi_stats |
A data frame of summary statistics for a particular
taxon obtained from the NCBI, as obtained with the
function |
user_stats |
(Optional) A data frame with assembly statistics obtained by the user. Statistics in this data frame are highlighted in red if this data frame is passed. A column named accession is mandatory, and it must contain unique identifiers for the genome(s) analyzed by the user. Dummy variables can be used as identifiers (e.g., "my_genome_001"), as long as they are unique. All other column containing assembly stats must have the same names as their corresponding columns in the data frame specified in ncbi_stats. For instance, stats on total number of genes and sequence length must be in columns named "gene_count_total" and "sequence_length", as in the ncbi_stats data frame. |
A composition of ggplot objects made with patchwork.
# Example 1: plot stats on maize genomes on the NCBI ## Obtain stats for maize genomes on the NCBI ncbi_stats <- get_genome_stats(taxon = "Zea mays") plot_genome_stats(ncbi_stats) ## Plot stats # Example 2: highlight user-defined stats in the distribution ## Create a data frame of stats for fictional maize genome user_stats <- data.frame( accession = "my_lovely_maize", sequence_length = 2.4 * 1e9, gene_count_total = 50000, CC_ratio = 1 ) plot_genome_stats(ncbi_stats, user_stats)
# Example 1: plot stats on maize genomes on the NCBI ## Obtain stats for maize genomes on the NCBI ncbi_stats <- get_genome_stats(taxon = "Zea mays") plot_genome_stats(ncbi_stats) ## Plot stats # Example 2: highlight user-defined stats in the distribution ## Create a data frame of stats for fictional maize genome user_stats <- data.frame( accession = "my_lovely_maize", sequence_length = 2.4 * 1e9, gene_count_total = 50000, CC_ratio = 1 ) plot_genome_stats(ncbi_stats, user_stats)
Plot pairwise orthogroup overlap between species
plot_og_overlap(stats_list = NULL, clust = TRUE)
plot_og_overlap(stats_list = NULL, clust = TRUE)
stats_list |
A list of data frames with Orthofinder summary stats
as returned by the function |
clust |
Logical indicating whether to clust data based on overlap. Default: TRUE |
A ggplot object with a heatmap.
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_og_overlap(stats_list)
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_og_overlap(stats_list)
Plot orthogroup sizes per species
plot_og_sizes(orthogroups = NULL, log = FALSE, max_size = NULL)
plot_og_sizes(orthogroups = NULL, log = FALSE, max_size = NULL)
orthogroups |
A 3-column data frame with columns Orthogroup,
Species, and Gene. This data frame can be created from
the 'Orthogroups.tsv' file generated by OrthoFinder with the function
|
log |
Logical indicating whether to transform orthogroups sizes with natural logarithms. Default: FALSE. |
max_size |
Numeric indicating the maximum orthogroup size to consider.
If this parameter is not NULL, orthogroups larger
than |
A ggplot object with a violin plot.
data(og) plot_og_sizes(og, log = TRUE) plot_og_sizes(og, max_size = 100) plot_og_sizes(og, log = TRUE, max_size = 100)
data(og) plot_og_sizes(og, log = TRUE) plot_og_sizes(og, max_size = 100) plot_og_sizes(og, log = TRUE, max_size = 100)
This function is a wrapper for plot_species_tree
,
plot_duplications
, plot_genes_in_ogs
,
plot_species_specific_ogs
.
plot_orthofinder_stats(tree = NULL, stats_list = NULL, xlim = c(0, 1))
plot_orthofinder_stats(tree = NULL, stats_list = NULL, xlim = c(0, 1))
tree |
Tree object as returned by |
stats_list |
(optional) A list of data frames with Orthofinder summary stats
as returned by the function |
xlim |
Numeric vector of x-axis limits. This is useful if your node tip labels are not visible due to margin issues. Default: c(0, 1). |
A panel of ggplot objects.
data(tree) dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_orthofinder_stats(tree, xlim = c(0, 1.5), stats_list = stats_list)
data(tree) dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_orthofinder_stats(tree, xlim = c(0, 1.5), stats_list = stats_list)
Plot number of species-specific orthogroups
plot_species_specific_ogs(stats_list = NULL)
plot_species_specific_ogs(stats_list = NULL)
stats_list |
A list of data frames with Orthofinder summary stats
as returned by the function |
A ggplot object with a barplot of number of species-specific orthogroups for each species.
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_species_specific_ogs(stats_list)
dir <- system.file("extdata", package = "cogeqc") stats_list <- read_orthofinder_stats(dir) plot_species_specific_ogs(stats_list)
Plot species tree
plot_species_tree(tree = NULL, xlim = c(0, 1), stats_list = NULL)
plot_species_tree(tree = NULL, xlim = c(0, 1), stats_list = NULL)
tree |
Tree object as returned by |
xlim |
Numeric vector of x-axis limits. This is useful if your node tip labels are not visible due to margin issues. Default: c(0, 1). |
stats_list |
(optional) A list of data frames with Orthofinder summary
stats as returned by the function |
A ggtree/ggplot object with the species tree.
data(tree) plot_species_tree(tree)
data(tree) plot_species_tree(tree)
Read and parse BUSCO's summary report
read_busco(result_dir = NULL)
read_busco(result_dir = NULL)
result_dir |
Path to the directory where BUSCO results are stored. This function will look for the short_summary* file (single run) or short_summary* file (batch mode). |
A data frame with the following variables:
BUSCO class. One of Complete_SC, Complete_duplicate, Fragmented, or Missing
Frequency of BUSCOs in each class. If BUSCO was run in batch mode, this variable will contain relative frequencies. If BUSCO was run for a single file, it will contain absolute frequencies.
Name of the lineage dataset used.
Name of the input FASTA file.
result_dir <- system.file("extdata", package = "cogeqc") df <- read_busco(result_dir)
result_dir <- system.file("extdata", package = "cogeqc") df <- read_busco(result_dir)
Read and parse Orthofinder summary statistics
read_orthofinder_stats(stats_dir = NULL)
read_orthofinder_stats(stats_dir = NULL)
stats_dir |
Path to directory containing Orthofinder's comparative genomics statistics. In your Orthofinder results directory, this directory is named Comparative_Genomics_Statistics. |
A list of data frames with the following elements:
stats A data frame of summary stats per species with the following variables:
Factor of species names.
Numeric of number of genes.
Numeric of number of genes in orthogroups.
Numeric of percentage of genes in orthogroups.
Numeric of number of species-specific orthogroups.
Numeric of number of genes in species-specific orthogroups.
Numeric of percentage of genes in species-specific orthogroups.
Integer with number of duplications per species.
og_overlap A symmetric data frame of pairwise orthogroup overlap between species.
duplications A 2-column data frame with node IDs in the first column and number of gene duplications (50% support) in the second column.
stats_dir <- system.file("extdata", package = "cogeqc") ortho_stats <- read_orthofinder_stats(stats_dir)
stats_dir <- system.file("extdata", package = "cogeqc") ortho_stats <- read_orthofinder_stats(stats_dir)
This function converts the orthogroups file named Orthogroups.tsv to a parsed data frame.
read_orthogroups(orthogroups_path = NULL)
read_orthogroups(orthogroups_path = NULL)
orthogroups_path |
Path to Orthogroups/Orthogroups.tsv file generated by OrthoFinder. |
A 3-column data frame with orthogroups, species IDs and gene IDs, respectively.
Fabricio Almeida-Silva
path <- system.file("extdata", "Orthogroups.tsv.gz", package = "cogeqc") og <- read_orthogroups(path)
path <- system.file("extdata", "Orthogroups.tsv.gz", package = "cogeqc") og <- read_orthogroups(path)
Run BUSCO assessment of assembly and annotation quality
run_busco( sequence = NULL, outlabel = NULL, mode = c("genome", "transcriptome", "proteins"), lineage = NULL, auto_lineage = NULL, force = FALSE, threads = 1, outpath = NULL, download_path = tempdir() )
run_busco( sequence = NULL, outlabel = NULL, mode = c("genome", "transcriptome", "proteins"), lineage = NULL, auto_lineage = NULL, force = FALSE, threads = 1, outpath = NULL, download_path = tempdir() )
sequence |
An object of class DNAStringSet/AAStringSet/RNAStringSet or path to FASTA file with the genome, transcriptome, or protein sequences to be analyzed. If there are many FASTA files in a directory, you can input the path to this directory, so BUSCO will be run in all FASTA files inside it. |
outlabel |
Character with a recognizable short label for analysis directory and files. |
mode |
Character with BUSCO mode. One of 'genome', 'transcriptome', or 'proteins'. |
lineage |
Character with name of lineage to be used. |
auto_lineage |
Character indicating whether BUSCO should determine optimum lineage path automatically. One of 'euk', 'prok', 'all', or NULL. If 'euk', it will determine optimum lineage path on eukaryote tree. If 'prok', it will determine optimum lineage path on non-eukaryote trees. If 'all', it will determine optimum lineage path for all trees. If NULL, it will not automatically determine lineage, and lineage must be manually specified. Default: NULL. |
force |
Logical indicating whether existing runs with the same file names should be overwritten. Default: FALSE. |
threads |
Numeric with the number of threads/cores to use. Default: 1. |
outpath |
Path to results directory. If NULL, results will be stored in the current working directory. Default: NULL. |
download_path |
Path to directory where BUSCO datasets will be stored after downloading. Default: tempdir(). |
A character vector with the names of subdirectories and files in the results directory.
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc") download_path <- paste0(tempdir(), "/datasets") if(busco_is_installed()) { run_busco(sequence, outlabel = "Hse", mode = "genome", lineage = "burkholderiales_odb10", outpath = tempdir(), download_path = download_path) }
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc") download_path <- paste0(tempdir(), "/datasets") if(busco_is_installed()) { run_busco(sequence, outlabel = "Hse", mode = "genome", lineage = "burkholderiales_odb10", outpath = tempdir(), download_path = download_path) }
Synteny network for Brassica oleraceae, B. napus, and B. rapa
data(synnet)
data(synnet)
A 2-column data frame with the variables anchor1 and anchor2, containing names of loci in anchor 1 and anchor 2, respectively.
Zhao, T., & Schranz, M. E. (2019). Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proceedings of the National Academy of Sciences, 116(6), 2165-2174.
data(synnet)
data(synnet)
The data used to create this object was retrieved from Orthofinder's example output for model species, available in https://bioinformatics.plants.ox.ac.uk/davidemms/public_data/.
data(tree)
data(tree)
An object of class "phylo" as returned by treeio::read.tree()
.
Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1), 1-14.
data(tree)
data(tree)