When working on your own genome project or when using publicly
available genomes for comparative analyses, it is critical to assess the
quality of your data. Over the past years, several tools have been
developed and several metrics have been proposed to assess the quality
of a genome assembly and annotation. cogeqc
helps users
interpret their genome assembly statistics by comparing them with
statistics on publicly available genomes on the NCBI. Additionally,
cogeqc
also provides an interface to BUSCO (Simão et al. 2015), a popular tool to assess
gene space completeness. Graphical functions are available to make
publication-ready plots that summarize the results of quality
control.
You can install cogeqc
from Bioconductor with the
following code:
When analyzing and interpreting genome assembly statistics, it is
often useful to place your stats in a context by comparing them with
stats from genomes of closely-related or even the same species.
cogeqc
provides users with an interface to the NCBI
Datasets API, which can be used to retrieve summary stats for genomes on
NCBI. In this section, we will guide you on how to retrieve such
information and use it as a reference to interpret your data.
To obtain a data frame of summary statistics for NCBI genomes of a
particular taxon, you will use the function
get_genome_stats()
. In the taxon parameter, you
must specify the taxon from which data will be extracted. This can be
done either by passing a character scalar with taxon name or by passing
a numeric scalar with NCBI Taxonomy ID. For example, the code below
demonstrates two ways of extracting stats on maize (Zea mays)
genomes on NCBI:
# Example 1: get stats for all maize genomes using taxon name
maize_stats <- get_genome_stats(taxon = "Zea mays")
head(maize_stats)
#> accession source species_taxid species_name
#> 1 GCA_902167145.1 GENBANK 4577 Zea mays
#> 2 GCF_902167145.1 REFSEQ 4577 Zea mays
#> 3 GCA_022117705.1 GENBANK 4577 Zea mays
#> 4 GCA_029775835.1 GENBANK 4577 Zea mays
#> 5 GCA_905067065.1 GENBANK 4577 Zea mays
#> 6 GCA_963555555.1 GENBANK 381124 Zea mays subsp. mays
#> species_common_name species_ecotype species_strain species_isolate
#> 1 maize <NA> NA <NA>
#> 2 maize <NA> NA <NA>
#> 3 maize <NA> NA <NA>
#> 4 maize <NA> NA <NA>
#> 5 maize <NA> NA <NA>
#> 6 maize <NA> NA <NA>
#> species_cultivar assembly_level assembly_status
#> 1 B73 Chromosome current
#> 2 B73 Chromosome current
#> 3 Mo17-2021 <NA> current
#> 4 LT2357 Chromosome current
#> 5 <NA> Chromosome current
#> 6 <NA> Chromosome current
#> assembly_name assembly_type submission_date
#> 1 Zm-B73-REFERENCE-NAM-5.0 haploid NA
#> 2 Zm-B73-REFERENCE-NAM-5.0 haploid NA
#> 3 Zm-Mo17-REFERENCE-CAU-T2T-assembly haploid NA
#> 4 ASM2977583v1 haploid NA
#> 5 Zm-LH244-REFERENCE-BAYER-1.0 haploid NA
#> 6 Zm-PT-REFERENCE-HiLo-1.0 haploid NA
#> submitter sequencing_technology atypical
#> 1 MaizeGDB <NA> FALSE
#> 2 MaizeGDB <NA> FALSE
#> 3 China Agriculture University Oxford Nanopore PromethION; PacBio FALSE
#> 4 Beijing Lantron Seed Co., LTD. PacBio Sequel FALSE
#> 5 BAYER CROPSCIENCE <NA> FALSE
#> 6 MaizeGDB PacBio SEQUEL II FALSE
#> refseq_category chromosome_count sequence_length ungapped_length
#> 1 <NA> 10 2182075994 2178268108
#> 2 reference genome 10 2182075994 2178268108
#> 3 <NA> 10 2178604320 2178604320
#> 4 <NA> 10 2106865080 2106637080
#> 5 <NA> 10 2147745480 2107651308
#> 6 <NA> 10 2127995506 2127990406
#> contig_count contig_N50 contig_L50 scaffold_count scaffold_N50 scaffold_L50
#> 1 1393 47037903 16 685 226353449 5
#> 2 1393 47037903 16 685 226353449 5
#> 3 10 220303002 5 10 220303002 5
#> 4 460 15883073 41 10 222005600 5
#> 5 56173 84946 7498 10 225452224 5
#> 6 61 126598816 7 10 228863277 5
#> GC_percent annotation_provider annotation_release_date gene_count_total
#> 1 47 <NA> <NA> NA
#> 2 47 NCBI RefSeq 2020-08-09 49897
#> 3 47 <NA> <NA> NA
#> 4 47 <NA> <NA> NA
#> 5 47 <NA> <NA> NA
#> 6 47 <NA> <NA> NA
#> gene_count_coding gene_count_noncoding gene_count_pseudogene gene_count_other
#> 1 NA NA NA NA
#> 2 34337 10366 5194 NA
#> 3 NA NA NA NA
#> 4 NA NA NA NA
#> 5 NA NA NA NA
#> 6 NA NA NA NA
#> CC_ratio
#> 1 139.3
#> 2 139.3
#> 3 1.0
#> 4 46.0
#> 5 5617.3
#> 6 6.1
str(maize_stats)
#> 'data.frame': 122 obs. of 36 variables:
#> $ accession : chr "GCA_902167145.1" "GCF_902167145.1" "GCA_022117705.1" "GCA_029775835.1" ...
#> $ source : chr "GENBANK" "REFSEQ" "GENBANK" "GENBANK" ...
#> $ species_taxid : int 4577 4577 4577 4577 4577 381124 4577 4577 4577 4577 ...
#> $ species_name : chr "Zea mays" "Zea mays" "Zea mays" "Zea mays" ...
#> $ species_common_name : chr "maize" "maize" "maize" "maize" ...
#> $ species_ecotype : chr NA NA NA NA ...
#> $ species_strain : logi NA NA NA NA NA NA ...
#> $ species_isolate : chr NA NA NA NA ...
#> $ species_cultivar : chr "B73" "B73" "Mo17-2021" "LT2357" ...
#> $ assembly_level : Factor w/ 4 levels "Complete","Chromosome",..: 2 2 NA 2 2 2 2 2 2 2 ...
#> $ assembly_status : chr "current" "current" "current" "current" ...
#> $ assembly_name : chr "Zm-B73-REFERENCE-NAM-5.0" "Zm-B73-REFERENCE-NAM-5.0" "Zm-Mo17-REFERENCE-CAU-T2T-assembly" "ASM2977583v1" ...
#> $ assembly_type : chr "haploid" "haploid" "haploid" "haploid" ...
#> $ submission_date : logi NA NA NA NA NA NA ...
#> $ submitter : chr "MaizeGDB" "MaizeGDB" "China Agriculture University" "Beijing Lantron Seed Co., LTD." ...
#> $ sequencing_technology : chr NA NA "Oxford Nanopore PromethION; PacBio" "PacBio Sequel" ...
#> $ atypical : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ refseq_category : chr NA "reference genome" NA NA ...
#> $ chromosome_count : int 10 10 10 10 10 10 10 10 10 10 ...
#> $ sequence_length : num 2.18e+09 2.18e+09 2.18e+09 2.11e+09 2.15e+09 ...
#> $ ungapped_length : num 2.18e+09 2.18e+09 2.18e+09 2.11e+09 2.11e+09 ...
#> $ contig_count : int 1393 1393 10 460 56173 61 1016 1692 1191 1747 ...
#> $ contig_N50 : int 47037903 47037903 220303002 15883073 84946 126598816 161994764 55599674 49888411 49071148 ...
#> $ contig_L50 : int 16 16 5 41 7498 7 6 14 12 13 ...
#> $ scaffold_count : int 685 685 10 10 10 10 936 1502 800 1379 ...
#> $ scaffold_N50 : int 226353449 226353449 220303002 222005600 225452224 228863277 225306452 224665140 220287990 223168870 ...
#> $ scaffold_L50 : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ GC_percent : num 47 47 47 47 47 47 46.5 46.5 47 46.5 ...
#> $ annotation_provider : chr NA "NCBI RefSeq" NA NA ...
#> $ annotation_release_date: chr NA "2020-08-09" NA NA ...
#> $ gene_count_total : int NA 49897 NA NA NA NA NA NA NA NA ...
#> $ gene_count_coding : int NA 34337 NA NA NA NA NA NA NA NA ...
#> $ gene_count_noncoding : int NA 10366 NA NA NA NA NA NA NA NA ...
#> $ gene_count_pseudogene : int NA 5194 NA NA NA NA NA NA NA NA ...
#> $ gene_count_other : int NA NA NA NA NA NA NA NA NA NA ...
#> $ CC_ratio : num 139 139 1 46 5617 ...
# Example 2: get stats for all maize genomes using NCBI Taxonomy ID
maize_stats2 <- get_genome_stats(taxon = 4577)
# Checking if objects are the same
identical(maize_stats, maize_stats2)
#> [1] TRUE
As you can see, there are 122 maize genomes on the NCBI. You can also include filters in your searches by passing a list of key-value pairs with keys in list names and values in elements. For instance, to obtain only chromosome-scale and annotated maize genomes, you would run:
# Get chromosome-scale maize genomes with annotation
## Create list of filters
filt <- list(
filters.has_annotation = "true",
filters.assembly_level = "chromosome"
)
filt
#> $filters.has_annotation
#> [1] "true"
#>
#> $filters.assembly_level
#> [1] "chromosome"
## Obtain data
filtered_maize_genomes <- get_genome_stats(taxon = "Zea mays", filters = filt)
dim(filtered_maize_genomes)
#> [1] 4 36
For a full list of filtering parameters and possible arguments, see the API documentation.
Now, suppose you sequenced a genome, obtained assembly and annotation stats, and want to compare them to NCBI genomes to identify potential issues. Examples of situations you may encounter include:
The genome you assembled is huge and you think there might be a problem with your assembly.
Your gene annotation pipeline predicted n genes, but you are not sure if this number is reasonable compared to other assemblies of the same species or closely-related species.
To compare user-defined summary stats with NCBI stats, you will use
the function compare_genome_stats()
. This function will
include the values you observed for each statistic into a distribution
(based on NCBI stats) and return the percentile and rank of your
observed values in each distribution.
As an example, let’s go back to our maize stats we obtained in the previous section. Suppose you sequenced a new maize genome and observed the following values:
To compare your observed values with those for publicly available
maize genomes, you need to store them in a data frame. The column
accession is mandatory, and any other column will be
matched against columns in the data frame obtained with
get_genome_stats()
. Thus, make sure column names in your
data frame match column names in the reference data frame. Then, you can
compare both data frames as below:
# Check column names in the data frame of stats for maize genomes on the NCBI
names(maize_stats)
#> [1] "accession" "source"
#> [3] "species_taxid" "species_name"
#> [5] "species_common_name" "species_ecotype"
#> [7] "species_strain" "species_isolate"
#> [9] "species_cultivar" "assembly_level"
#> [11] "assembly_status" "assembly_name"
#> [13] "assembly_type" "submission_date"
#> [15] "submitter" "sequencing_technology"
#> [17] "atypical" "refseq_category"
#> [19] "chromosome_count" "sequence_length"
#> [21] "ungapped_length" "contig_count"
#> [23] "contig_N50" "contig_L50"
#> [25] "scaffold_count" "scaffold_N50"
#> [27] "scaffold_L50" "GC_percent"
#> [29] "annotation_provider" "annotation_release_date"
#> [31] "gene_count_total" "gene_count_coding"
#> [33] "gene_count_noncoding" "gene_count_pseudogene"
#> [35] "gene_count_other" "CC_ratio"
# Create a simulated data frame of stats for a maize genome
my_stats <- data.frame(
accession = "my_lovely_maize",
sequence_length = 2.4 * 1e9,
gene_count_total = 50000,
CC_ratio = 2
)
# Compare stats
compare_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)
#> accession variable percentile rank
#> 1 my_lovely_maize sequence_length 0.95121951 7
#> 2 my_lovely_maize gene_count_total 1.00000000 1
#> 3 my_lovely_maize CC_ratio 0.02469136 2
To have a visual representation of the summary stats obtained with
get_genome_stats()
, you will use the function
plot_genome_stats()
.
Finally, you can pass your data frame of observed stats to highlight your values (as red points) in the distributions.
One of the most common metrics to assess gene space completeness is
BUSCO (best universal single-copy orthologs) (Simão et al. 2015). cogeqc
allows
users to run BUSCO from an R session and visualize results graphically.
BUSCO summary statistics will help you assess which assemblies have high
quality based on the percentage of complete BUSCOs.
To run BUSCO from R, you will use the function
run_busco()
2. Here, we will use an example FASTA file
containing the first 1,000 lines of the Herbaspirilllum seropedicae
SmR1 genome (GCA_000143225), which was downloaded from Ensembl
Bacteria. We will run BUSCO using burkholderiales_odb10 as the
lineage dataset. To view all available datasets, run
list_busco_datasets()
.
# Path to FASTA file
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc")
# Path to directory where BUSCO datasets will be stored
download_path <- paste0(tempdir(), "/datasets")
# Run BUSCO if it is installed
if(busco_is_installed()) {
run_busco(sequence, outlabel = "Hse", mode = "genome",
lineage = "burkholderiales_odb10",
outpath = tempdir(), download_path = download_path)
}
The output will be stored in the directory specified in
outpath. You can read and parse BUSCO’s output with the
function read_busco()
. For example, let’s read the output
of a BUSCO run using the genome of the green algae Ostreococcus
tauri. The output directory is /extdata
.
# Path to output directory
output_dir <- system.file("extdata", package = "cogeqc")
busco_summary <- read_busco(output_dir)
busco_summary
#> Class Frequency Lineage
#> 1 Complete_SC 1412 chlorophyta_odb10
#> 2 Complete_duplicate 4 chlorophyta_odb10
#> 3 Fragmented 35 chlorophyta_odb10
#> 4 Missing 68 chlorophyta_odb10
This is an example output for a BUSCO run with a single FASTA file.
You can also specify a directory containing multiple FASTA files in the
sequence argument of run_busco()
. This way, BUSCO
will be run in batch mode. Let’s see what the output of BUSCO in batch
mode looks like:
data(batch_summary)
batch_summary
#> Class Frequency Lineage File
#> 1 Complete_SC 98.5 burkholderiales_odb10 Hse.fa
#> 2 Complete_SC 98.8 burkholderiales_odb10 Hru.fa
#> 3 Complete_duplicate 0.7 burkholderiales_odb10 Hse.fa
#> 4 Complete_duplicate 0.7 burkholderiales_odb10 Hru.fa
#> 5 Fragmented 0.4 burkholderiales_odb10 Hse.fa
#> 6 Fragmented 0.3 burkholderiales_odb10 Hru.fa
#> 7 Missing 0.4 burkholderiales_odb10 Hse.fa
#> 8 Missing 0.2 burkholderiales_odb10 Hru.fa
The only difference between this data frame and the previous one is
the column File, which contains information on the
FASTA file. The example dataset batch_summary
contains the
output of run_busco()
using a directory containing two
genomes (Herbaspirillum seropedicae SmR1 and Herbaspirillum
rubrisubalbicans M1) as parameter to the sequence
argument.
After using run_busco()
and parsing its output with
read_busco()
, users can visualize summary statistics with
plot_busco()
.
We usually consider genomes with >90% of complete BUSCOs as having high quality. Thus, we can conclude that the three genomes analyzed here are high-quality genomes.
This document was created under the following conditions:
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2024-11-19
#> pandoc 3.2.1 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ape 5.8 2024-04-11 [2] RSPM (R 4.4.0)
#> aplot 0.2.3 2024-06-17 [2] RSPM (R 4.4.0)
#> beeswarm 0.4.0 2021-06-01 [2] RSPM (R 4.4.0)
#> BiocGenerics 0.53.3 2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#> BiocManager 1.30.25 2024-08-28 [2] RSPM (R 4.4.0)
#> BiocStyle * 2.35.0 2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#> Biostrings 2.75.1 2024-11-07 [2] https://bioc.r-universe.dev (R 4.4.2)
#> bslib 0.8.0 2024-07-29 [2] RSPM (R 4.4.0)
#> buildtools 1.0.0 2024-11-18 [3] local (/pkg)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.4.0)
#> cli 3.6.3 2024-06-21 [2] RSPM (R 4.4.0)
#> cogeqc * 1.11.0 2024-11-19 [1] https://bioc.r-universe.dev (R 4.4.2)
#> colorspace 2.1-1 2024-07-26 [2] RSPM (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [2] RSPM (R 4.4.0)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.4.0)
#> dplyr 1.1.4 2023-11-17 [2] RSPM (R 4.4.0)
#> evaluate 1.0.1 2024-10-10 [2] RSPM (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [2] RSPM (R 4.4.0)
#> farver 2.1.2 2024-05-13 [2] RSPM (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.4.0)
#> fs 1.6.5 2024-10-30 [2] RSPM (R 4.4.0)
#> generics 0.1.3 2022-07-05 [2] RSPM (R 4.4.0)
#> GenomeInfoDb 1.43.1 2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#> GenomeInfoDbData 1.2.13 2024-11-19 [2] Bioconductor
#> ggbeeswarm 0.7.2 2023-04-29 [2] RSPM (R 4.4.0)
#> ggfun 0.1.7 2024-10-24 [2] RSPM (R 4.4.0)
#> ggplot2 3.5.1 2024-04-23 [2] RSPM (R 4.4.0)
#> ggplotify 0.1.2 2023-08-09 [2] RSPM (R 4.4.0)
#> ggtree 3.15.0 2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#> glue 1.8.0 2024-09-30 [2] RSPM (R 4.4.0)
#> gridGraphics 0.5-1 2020-12-13 [2] RSPM (R 4.4.0)
#> gtable 0.3.6 2024-10-25 [2] RSPM (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#> httr 1.4.7 2023-08-15 [2] RSPM (R 4.4.0)
#> igraph 2.1.1 2024-10-19 [2] RSPM (R 4.4.0)
#> IRanges 2.41.1 2024-11-17 [2] https://bioc.r-universe.dev (R 4.4.2)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] RSPM (R 4.4.0)
#> knitr 1.49 2024-11-08 [2] RSPM (R 4.4.0)
#> labeling 0.4.3 2023-08-29 [2] RSPM (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [2] RSPM (R 4.4.0)
#> lazyeval 0.2.2 2019-03-15 [2] RSPM (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.4.0)
#> maketools 1.3.1 2024-10-04 [3] RSPM (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [2] RSPM (R 4.4.0)
#> nlme 3.1-166 2024-08-14 [2] RSPM (R 4.4.0)
#> patchwork 1.3.0 2024-09-16 [2] RSPM (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [2] RSPM (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.4.0)
#> plyr 1.8.9 2023-10-02 [2] RSPM (R 4.4.0)
#> purrr 1.0.2 2023-08-10 [2] RSPM (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] RSPM (R 4.4.0)
#> Rcpp 1.0.13-1 2024-11-02 [2] RSPM (R 4.4.0)
#> reshape2 1.4.4 2020-04-09 [2] RSPM (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] RSPM (R 4.4.0)
#> rmarkdown 2.29 2024-11-04 [2] RSPM (R 4.4.0)
#> S4Vectors 0.45.2 2024-11-16 [2] https://bioc.r-universe.dev (R 4.4.2)
#> sass 0.4.9 2024-03-15 [2] RSPM (R 4.4.0)
#> scales 1.3.0 2023-11-28 [2] RSPM (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.4.0)
#> stringi 1.8.4 2024-05-06 [2] RSPM (R 4.4.0)
#> stringr 1.5.1 2023-11-14 [2] RSPM (R 4.4.0)
#> sys 3.4.3 2024-10-04 [2] RSPM (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [2] RSPM (R 4.4.0)
#> tidyr 1.3.1 2024-01-24 [2] RSPM (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [2] RSPM (R 4.4.0)
#> tidytree 0.4.6 2023-12-12 [2] RSPM (R 4.4.0)
#> treeio 1.31.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> UCSC.utils 1.3.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> utf8 1.2.4 2023-10-22 [2] RSPM (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [2] RSPM (R 4.4.0)
#> vipor 0.4.7 2023-12-18 [2] RSPM (R 4.4.0)
#> withr 3.0.2 2024-10-28 [2] RSPM (R 4.4.0)
#> xfun 0.49 2024-10-31 [2] RSPM (R 4.4.0)
#> XVector 0.47.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.4.0)
#> yulab.utils 0.1.8 2024-11-07 [2] RSPM (R 4.4.0)
#> zlibbioc 1.52.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.2)
#>
#> [1] /tmp/Rtmpa1SeZt/Rinst15362d3293f0
#> [2] /github/workspace/pkglib
#> [3] /usr/local/lib/R/site-library
#> [4] /usr/lib/R/site-library
#> [5] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Note: The CC ratio is the ratio of the number of contigs to the number of chromosome pairs, and it has been proposed in Wang and Wang (2022) as a measurement of contiguity that compensates for the flaws of N50 and allows cross-species comparisons.↩︎
Note: You must have BUSCO installed and
in your PATH to use run_busco()
. You can check if BUSCO is
installed by running busco_is_installed()
. If you don’t
have it already, you can manually install it or use a conda virtual
environment with the Bioconductor package Herper
(Paul, Carroll, and Barrows 2021).↩︎