Title: | Identification and classification of duplicated genes |
---|---|
Description: | doubletrouble aims to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The duplication modes are i. segmental duplication (SD); ii. tandem duplication (TD); iii. proximal duplication (PD); iv. transposed duplication (TRD) and; v. dispersed duplication (DD). Transposon-derived duplicates (TRD) can be further subdivided into rTRD (retrotransposon-derived duplication) and dTRD (DNA transposon-derived duplication). If users want a simpler classification scheme, duplicates can also be classified into SD- and SSD-derived (small-scale duplication) gene pairs. Besides classifying gene pairs, users can also classify genes, so that each gene is assigned a unique mode of duplication. Users can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks. |
Authors: | FabrÃcio Almeida-Silva [aut, cre] , Yves Van de Peer [aut] |
Maintainer: | FabrÃcio Almeida-Silva <[email protected]> |
License: | GPL-3 |
Version: | 1.7.0 |
Built: | 2024-12-29 06:55:26 UTC |
Source: | https://github.com/bioc/doubletrouble |
Data were obtained from Ensembl Fungi, and only CDS of primary transcripts were included.
data(cds_scerevisiae)
data(cds_scerevisiae)
A DNAStringSet object with CDS of S. cerevisiae.
data(cds_scerevisiae)
data(cds_scerevisiae)
Classify duplicate gene pairs based on their modes of duplication
classify_gene_pairs( annotation = NULL, blast_list = NULL, scheme = "standard", blast_inter = NULL, intron_counts, evalue = 1e-10, anchors = 5, max_gaps = 25, proximal_max = 10, collinearity_dir = NULL, outgroup_coverage = 70 )
classify_gene_pairs( annotation = NULL, blast_list = NULL, scheme = "standard", blast_inter = NULL, intron_counts, evalue = 1e-10, anchors = 5, max_gaps = 25, proximal_max = 10, collinearity_dir = NULL, outgroup_coverage = 70 )
annotation |
A processed GRangesList or CompressedGRangesList object as
returned by |
blast_list |
A list of data frames containing BLAST tabular output
for intraspecies comparisons.
Each list element corresponds to the BLAST output for a given species,
and names of list elements must match the names of list elements in
annotation. BLASTp, DIAMOND or simular programs must be run
on processed sequence data as returned by |
scheme |
Character indicating which classification scheme to use. One of "binary", "standard", "extended", or "full". See details below for information on what each scheme means. Default: "standard". |
blast_inter |
(Only valid if |
intron_counts |
(Only valid if |
evalue |
Numeric scalar indicating the E-value threshold. Default: 1e-10. |
anchors |
Numeric indicating the minimum required number of genes
to call a syntenic block, as in |
max_gaps |
Numeric indicating the number of upstream and downstream
genes to search for anchors, as in |
proximal_max |
Numeric scalar with the maximum distance (in number of genes) between two genes to consider them as proximal duplicates. Default: 10. |
collinearity_dir |
Character indicating the path to the directory
where .collinearity files will be stored. If NULL, files will
be stored in a subdirectory of |
outgroup_coverage |
Numeric indicating the minimum percentage of outgroup species to use to consider genes as transposed duplicates. Only valid if multiple outgroup species are present (see details below). Values should range from 0 to 100. Default: 70. |
The classification schemes increase in complexity (number of classes) in the order 'binary', 'standard', 'extended', and 'full'.
For classification scheme "binary", duplicates are classified into one of 'SD' (segmental duplications) or 'SSD' (small-scale duplications).
For classification scheme "standard" (default), duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), and 'DD' (dispersed duplication).
For classification scheme "extended", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'TRD' (transposon-derived duplication), and 'DD' (dispersed duplication).
Finally, for classification scheme "full", duplicates are classified into 'SD' (segmental duplication), 'TD' (tandem duplication), 'PD' (proximal duplication), 'rTRD' (retrotransposon-derived duplication), 'dTRD' (DNA transposon-derived duplication), and 'DD' (dispersed duplication).
A list of 3-column data frames of duplicated gene pairs (columns 1 and 2), and their modes of duplication (column 3).
# Load example data data(diamond_intra) data(diamond_inter) data(yeast_annot) data(yeast_seq) # Get processed annotation data annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation # Get list of intron counts library(txdbmaker) txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges) intron_counts <- lapply(txdb_list, get_intron_counts) # Classify duplicates - full scheme dup_class <- classify_gene_pairs( annotation = annotation, blast_list = diamond_intra, scheme = "full", blast_inter = diamond_inter, intron_counts = intron_counts ) # Check number of gene pairs per class table(dup_class$Scerevisiae$type)
# Load example data data(diamond_intra) data(diamond_inter) data(yeast_annot) data(yeast_seq) # Get processed annotation data annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation # Get list of intron counts library(txdbmaker) txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges) intron_counts <- lapply(txdb_list, get_intron_counts) # Classify duplicates - full scheme dup_class <- classify_gene_pairs( annotation = annotation, blast_list = diamond_intra, scheme = "full", blast_inter = diamond_inter, intron_counts = intron_counts ) # Check number of gene pairs per class table(dup_class$Scerevisiae$type)
Classify genes into unique modes of duplication
classify_genes(gene_pairs_list = NULL)
classify_genes(gene_pairs_list = NULL)
gene_pairs_list |
List of classified gene pairs as returned
by |
If a gene is present in pairs with different duplication modes, the gene is classified into a unique mode of duplication following the order of priority indicated in the levels of the factor type.
For scheme "binary", the order is SD > SSD. For scheme "standard", the order is SD > TD > PD > DD. For scheme "extended", the order is SD > TD > PD > TRD > DD. For scheme "full", the order is SD > TD > PD > rTRD > dTRD > DD.
A list of 2-column data frames with variables gene and type representing gene ID and duplication type, respectively.
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae cols <- c("dup1", "dup2", "type") gene_pairs_list <- list(Scerevisiae = scerevisiae_kaks[, cols]) class_genes <- classify_genes(gene_pairs_list)
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae cols <- c("dup1", "dup2", "type") gene_pairs_list <- list(Scerevisiae = scerevisiae_kaks[, cols]) class_genes <- classify_genes(gene_pairs_list)
This list contains a similarity search of S. cerevisiae against
C. glabrata, and it was obtained with run_diamond()
.
data(diamond_inter)
data(diamond_inter)
A list of data frames (length 1) containing the output of a DIAMOND search of S. cerevisiae against C. glabrata (outgroup).
data(diamond_inter)
data(diamond_inter)
List obtained with run_diamond()
.
data(diamond_intra)
data(diamond_intra)
A list of data frames (length 1) containing the whole paranome of S. cerevisiae resulting from intragenome similarity searches.
data(diamond_intra)
data(diamond_intra)
Get a duplicate count matrix for each genome
duplicates2counts(duplicate_list, shape = "long")
duplicates2counts(duplicate_list, shape = "long")
duplicate_list |
A list of data frames with the duplicated genes or
gene pairs and their modes of duplication as returned
by |
shape |
Character specifying the shape of the output data frame. One of "long" (data frame in the long shape, in the tidyverse sense), or "wide" (data frame in the wide shape, in the tidyverse sense). Default: "long". |
If shape = "wide", a count matrix containing the frequency of duplicated genes (or gene pairs) by mode for each species, with species in rows and duplication modes in columns. If shape = "long", a data frame in long format with the following variables:
Factor, type of duplication.
Numeric, number of duplicates.
Character, species name
data(fungi_kaks) # Get unique duplicates duplicate_list <- classify_genes(fungi_kaks) # Get count table counts <- duplicates2counts(duplicate_list)
data(fungi_kaks) # Get unique duplicates duplicate_list <- classify_genes(fungi_kaks) # Get count table counts <- duplicates2counts(duplicate_list)
Find peaks in a Ks distribution with Gaussian Mixture Models
find_ks_peaks(ks, npeaks = 2, min_ks = 0.01, max_ks = 4, verbose = FALSE)
find_ks_peaks(ks, npeaks = 2, min_ks = 0.01, max_ks = 4, verbose = FALSE)
ks |
A numeric vector of Ks values. |
npeaks |
Numeric scalar indicating the number of peaks in the Ks distribution. If you don't know how many peaks there are, you can include a range of values, and the number of peaks that produces the lowest BIC (Bayesian Information Criterion) will be selected as the optimal. Default: 2. |
min_ks |
Numeric scalar with the minimum Ks value. Removing very small Ks values is generally used to avoid the incorporation of allelic and/or splice variants and to prevent the fitting of a component to infinity. Default: 0.01. |
max_ks |
Numeric scalar indicating the maximum Ks value. Removing very large Ks values is usually performed to account for Ks saturation. Default: 4. |
verbose |
Logical indicating if messages should be printed on screen. Default: FALSE. |
A list with the following elements:
Numeric with the estimated means.
Numeric with the estimated standard deviations.
Numeric with the estimated mixture weights.
Numeric vector of filtered Ks distribution based on arguments passed to min_ks and max_ks.
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae ks <- scerevisiae_kaks$Ks # Find 2 peaks in Ks distribution peaks <- find_ks_peaks(ks, npeaks = 2) # From 2 to 4 peaks, verbose = TRUE to show BIC values peaks <- find_ks_peaks(ks, npeaks = c(2, 3, 4), verbose = TRUE)
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae ks <- scerevisiae_kaks$Ks # Find 2 peaks in Ks distribution peaks <- find_ks_peaks(ks, npeaks = 2) # From 2 to 4 peaks, verbose = TRUE to show BIC values peaks <- find_ks_peaks(ks, npeaks = c(2, 3, 4), verbose = TRUE)
This data set was obtained with classify_gene_pairs()
followed
by pairs2kaks()
.
data(fungi_kaks)
data(fungi_kaks)
A list of data frame with elements named saccharomyces_cerevisiae, candida_glabrata, and schizosaccharomyces_pombe. Each data frame contains the following variables:
Character, duplicated gene 1.
Character, duplicated gene 2.
Numeric, Ka values.
Numeric, Ks values.
Numeric, Ka/Ks values.
Character, mode of duplication
data(fungi_kaks)
data(fungi_kaks)
Get a list of anchor pairs for each species
get_anchors_list( blast_list = NULL, annotation = NULL, evalue = 1e-10, anchors = 5, max_gaps = 25, collinearity_dir = NULL )
get_anchors_list( blast_list = NULL, annotation = NULL, evalue = 1e-10, anchors = 5, max_gaps = 25, collinearity_dir = NULL )
blast_list |
A list of data frames containing BLAST tabular output
for intraspecies comparisons.
Each list element corresponds to the BLAST output for a given species,
and names of list elements must match the names of list elements in
|
annotation |
A processed GRangesList or CompressedGRangesList object as
returned by |
evalue |
Numeric scalar indicating the E-value threshold. Default: 1e-10. |
anchors |
Numeric indicating the minimum required number of genes
to call a syntenic block, as in |
max_gaps |
Numeric indicating the number of upstream and downstream
genes to search for anchors, as in |
collinearity_dir |
Character indicating the path to the directory
where .collinearity files will be stored. If NULL, files will
be stored in a subdirectory of |
A list of data frames representing intraspecies anchor pairs.
data(diamond_intra) data(yeast_annot) data(yeast_seq) blast_list <- diamond_intra # Get processed annotation for S. cerevisiae annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation # Get list of intraspecies anchor pairs anchorpairs <- get_anchors_list(blast_list, annotation)
data(diamond_intra) data(yeast_annot) data(yeast_seq) blast_list <- diamond_intra # Get processed annotation for S. cerevisiae annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation # Get list of intraspecies anchor pairs anchorpairs <- get_anchors_list(blast_list, annotation)
Get a data frame of intron counts per gene
get_intron_counts(txdb)
get_intron_counts(txdb)
txdb |
A |
The family of functions makeTxDbFrom*
from
the txdbmaker package can be used to create TxDb
objects
from a variety of input data types. You can create TxDb
objects
from e.g., GRanges
objects (makeTxDbFromGRanges()
),
GFF files (makeTxDbFromGFF()
),
an Ensembl database (makeTxDbFromEnsembl
), and
a Biomart database (makeTxDbFromBiomart
).
A data frame with intron counts per gene, with variables:
Character with gene IDs.
Numeric with number of introns per gene.
data(yeast_annot) # Create TxDb object from GRanges library(txdbmaker) txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]]) # Get intron counts intron_counts <- get_intron_counts(txdb)
data(yeast_annot) # Create TxDb object from GRanges library(txdbmaker) txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]]) # Get intron counts intron_counts <- get_intron_counts(txdb)
Classify gene pairs derived from segmental duplications
get_segmental(anchor_pairs = NULL, pairs = NULL)
get_segmental(anchor_pairs = NULL, pairs = NULL)
anchor_pairs |
A 2-column data frame with anchor pairs in columns 1 and 2. |
pairs |
A 2-column data frame with all duplicate pairs. This is equivalent to the first 2 columns of the tabular output of BLAST-like programs. |
A 3-column data frame with the variables:
Character, duplicated gene 1
Character, duplicated gene 2
Factor indicating duplication types, with levels "SD" (segmental duplication) or "DD" (dispersed duplication).
data(diamond_intra) data(yeast_annot) data(yeast_seq) blast_list <- diamond_intra # Get processed annotation for S. cerevisiae annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation[1] # Get list of intraspecies anchor pairs anchor_pairs <- get_anchors_list(blast_list, annotation) anchor_pairs <- anchor_pairs[[1]][, c(1, 2)] # Get duplicate pairs from DIAMOND output duplicates <- diamond_intra[[1]][, c(1, 2)] dups <- get_segmental(anchor_pairs, duplicates)
data(diamond_intra) data(yeast_annot) data(yeast_seq) blast_list <- diamond_intra # Get processed annotation for S. cerevisiae annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation[1] # Get list of intraspecies anchor pairs anchor_pairs <- get_anchors_list(blast_list, annotation) anchor_pairs <- anchor_pairs[[1]][, c(1, 2)] # Get duplicate pairs from DIAMOND output duplicates <- diamond_intra[[1]][, c(1, 2)] dups <- get_segmental(anchor_pairs, duplicates)
Classify gene pairs derived from tandem and proximal duplications
get_tandem_proximal(pairs = NULL, annotation_granges = NULL, proximal_max = 10)
get_tandem_proximal(pairs = NULL, annotation_granges = NULL, proximal_max = 10)
pairs |
A 3-column data frame with columns dup1, dup2,
and type indicating duplicated gene 1, duplicated gene 2, and
the mode of duplication associated with the pair. This data frame
is returned by |
annotation_granges |
A processed GRanges object as in each element
of the list returned by |
proximal_max |
Numeric scalar with the maximum distance (in number of genes) between two genes to consider them as proximal duplicates. Default: 10. |
A 3-column data frame with the variables:
Character, duplicated gene 1.
Character, duplicated gene 2.
Factor of duplication types, with levels "SD" (segmental duplication), "TD" (tandem duplication), "PD" (proximal duplication), and "DD" (dispersed duplication).
data(yeast_annot) data(yeast_seq) data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Get processed annotation for S. cerevisiae pdata <- annotation <- syntenet::process_input(yeast_seq, yeast_annot) annot <- pdata$annotation[[1]] # Get duplicated pairs pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")] pairs$dup1 <- paste0("Sce_", pairs$dup1) pairs$dup2 <- paste0("Sce_", pairs$dup2) # Get tandem and proximal duplicates td_pd_pairs <- get_tandem_proximal(pairs, annot)
data(yeast_annot) data(yeast_seq) data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Get processed annotation for S. cerevisiae pdata <- annotation <- syntenet::process_input(yeast_seq, yeast_annot) annot <- pdata$annotation[[1]] # Get duplicated pairs pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")] pairs$dup1 <- paste0("Sce_", pairs$dup1) pairs$dup2 <- paste0("Sce_", pairs$dup2) # Get tandem and proximal duplicates td_pd_pairs <- get_tandem_proximal(pairs, annot)
Classify gene pairs originating from transposon-derived duplications
get_transposed( pairs, blast_inter, annotation, evalue = 1e-10, anchors = 5, max_gaps = 25, collinearity_dir = NULL, outgroup_coverage = 70 )
get_transposed( pairs, blast_inter, annotation, evalue = 1e-10, anchors = 5, max_gaps = 25, collinearity_dir = NULL, outgroup_coverage = 70 )
pairs |
A 3-column data frame with columns dup1, dup2,
and type indicating duplicated gene 1, duplicated gene 2, and
the mode of duplication associated with the pair. This data frame
is returned by |
blast_inter |
A list of data frames of length 1
containing BLAST tabular output for the comparison between the target
species and an outgroup. Names of list elements must match the names of
list elements in |
annotation |
A processed GRangesList or CompressedGRangesList object as
returned by |
evalue |
Numeric scalar indicating the E-value threshold. Default: 1e-10. |
anchors |
Numeric indicating the minimum required number of genes
to call a syntenic block, as in |
max_gaps |
Numeric indicating the number of upstream and downstream
genes to search for anchors, as in |
collinearity_dir |
Character indicating the path to the directory
where .collinearity files will be stored. If NULL, files will
be stored in a subdirectory of |
outgroup_coverage |
Numeric indicating the minimum percentage of outgroup species to use to consider genes as transposed duplicates. Only valid if multiple outgroup species are present (see details below). Values should range from 0 to 100. Default: 70. |
If the list of interspecies DIAMOND tables contain comparisons of the same species to multiple outgroups (e.g., 'speciesA_speciesB', 'speciesA_speciesC'), this function will check if gene pairs are classified as transposed (i.e., only one gene is an ancestral locus) in each of the outgroup species, and then calculate the percentage of outgroup species in which each pair is considered 'transposed'. For instance, gene pair 1 is transposed based on 30\ on 100\ based on 0\ Parameter outgroup_coverage lets you choose a minimum percentage cut-off to classify pairs as transposed.
A 3-column data frame with the following variables:
Character, duplicated gene 1.
Character, duplicated gene 2.
Factor of duplication types, with levels "SD" (segmental duplication), "TD" (tandem duplication), "PD" (proximal duplication), "TRD" (transposon-derived duplication), and "DD" (dispersed duplication).
data(diamond_inter) data(diamond_intra) data(yeast_seq) data(yeast_annot) data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Get processed annotation pdata <- syntenet::process_input(yeast_seq, yeast_annot) annotation <- pdata$annotation # Get duplicated pairs pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")] pairs$dup1 <- paste0("Sce_", pairs$dup1) pairs$dup2 <- paste0("Sce_", pairs$dup2) # Classify pairs trd <- get_transposed(pairs, diamond_inter, annotation) annotation <- c(annotation, list(Cglabrata2 = annotation$Cglabrata)) blast_inter <- c(diamond_inter, list(Scerevisiae_Cglabrata2 = diamond_inter[[1]]))
data(diamond_inter) data(diamond_intra) data(yeast_seq) data(yeast_annot) data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Get processed annotation pdata <- syntenet::process_input(yeast_seq, yeast_annot) annotation <- pdata$annotation # Get duplicated pairs pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")] pairs$dup1 <- paste0("Sce_", pairs$dup1) pairs$dup2 <- paste0("Sce_", pairs$dup2) # Classify pairs trd <- get_transposed(pairs, diamond_inter, annotation) annotation <- c(annotation, list(Cglabrata2 = annotation$Cglabrata)) blast_inter <- c(diamond_inter, list(Scerevisiae_Cglabrata2 = diamond_inter[[1]]))
Classify TRD genes as derived from either DNA transposons or retrotransposons
get_transposed_classes(pairs, intron_counts)
get_transposed_classes(pairs, intron_counts)
pairs |
A 3-column data frame with columns dup1, dup2,
and type indicating duplicated gene 1, duplicated gene 2, and
the mode of duplication associated with the pair. This data frame
is returned by |
intron_counts |
A 2-column data frame with columns gene
and introns indicating the number of introns for each gene,
as returned by |
A 3-column data frame with the following variables:
Character, duplicated gene 1.
Character, duplicated gene 2.
Factor of duplication types, with levels "SD" (segmental duplication), "TD" (tandem duplication), "PD" (proximal duplication), "dTRD" (DNA transposon-derived duplication), "rTRD" (retrotransposon-derived duplication), and "DD" (dispersed duplication).
data(diamond_inter) data(diamond_intra) data(yeast_seq) data(yeast_annot) data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Get processed annotation pdata <- syntenet::process_input(yeast_seq, yeast_annot) annotation <- pdata$annotation # Get duplicated pairs pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")] pairs$dup1 <- paste0("Sce_", pairs$dup1) pairs$dup2 <- paste0("Sce_", pairs$dup2) # Classify pairs trd <- get_transposed(pairs, diamond_inter, annotation) # Create TxDb object from GRanges library(txdbmaker) txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]]) # Get intron counts intron_counts <- get_intron_counts(txdb) # Get TRD classes trd_classes <- get_transposed_classes(trd, intron_counts)
data(diamond_inter) data(diamond_intra) data(yeast_seq) data(yeast_annot) data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Get processed annotation pdata <- syntenet::process_input(yeast_seq, yeast_annot) annotation <- pdata$annotation # Get duplicated pairs pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")] pairs$dup1 <- paste0("Sce_", pairs$dup1) pairs$dup2 <- paste0("Sce_", pairs$dup2) # Classify pairs trd <- get_transposed(pairs, diamond_inter, annotation) # Create TxDb object from GRanges library(txdbmaker) txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]]) # Get intron counts intron_counts <- get_intron_counts(txdb) # Get TRD classes trd_classes <- get_transposed_classes(trd, intron_counts)
This data set was obtained with classify_gene_pairs()
followed
by pairs2kaks()
.
data(gmax_ks)
data(gmax_ks)
A data frame with the following variables:
Character, duplicated gene 1.
Character, duplicated gene 2.
Numeric, Ks values.
Factor, duplication mode.
data(gmax_ks)
data(gmax_ks)
Calculate Ka, Ks, and Ka/Ks from duplicate gene pairs
pairs2kaks(gene_pairs_list, cds, model = "MYN", threads = 1, verbose = FALSE)
pairs2kaks(gene_pairs_list, cds, model = "MYN", threads = 1, verbose = FALSE)
gene_pairs_list |
List of data frames containing duplicated gene pairs
as returned by |
cds |
List of DNAStringSet objects containing the coding sequences of each gene. |
model |
Character scalar indicating which codon model to use. Possible values are "Li", "NG86", "NG", "LWL", "LPB", "MLWL", "MLPB", "GY", "YN", "MYN", "MS", "MA", "GNG", "GLWL", "GLPB", "GMLWL", "GMLPB", "GYN", and "GMYN". Default: "MYN". |
threads |
Numeric indicating the number of threads to use. Default: 1. |
verbose |
Logical indicating whether progress messages should be printed on screen. Default: FALSE. |
A list of data frames containing gene pairs and their Ka, Ks, and Ka/Ks values.
data(diamond_intra) data(diamond_inter) data(yeast_annot) data(yeast_seq) data(cds_scerevisiae) blast_list <- diamond_intra blast_inter <- diamond_inter pdata <- syntenet::process_input(yeast_seq, yeast_annot) annot <- pdata$annotation["Scerevisiae"] # Binary classification scheme pairs <- classify_gene_pairs(annot, blast_list) td_pairs <- pairs[[1]][pairs[[1]]$type == "TD", ] gene_pairs_list <- list( Scerevisiae = td_pairs[seq(1, 3, by = 1), ] ) cds <- list(Scerevisiae = cds_scerevisiae) kaks <- pairs2kaks(gene_pairs_list, cds)
data(diamond_intra) data(diamond_inter) data(yeast_annot) data(yeast_seq) data(cds_scerevisiae) blast_list <- diamond_intra blast_inter <- diamond_inter pdata <- syntenet::process_input(yeast_seq, yeast_annot) annot <- pdata$annotation["Scerevisiae"] # Binary classification scheme pairs <- classify_gene_pairs(annot, blast_list) td_pairs <- pairs[[1]][pairs[[1]]$type == "TD", ] gene_pairs_list <- list( Scerevisiae = td_pairs[seq(1, 3, by = 1), ] ) cds <- list(Scerevisiae = cds_scerevisiae) kaks <- pairs2kaks(gene_pairs_list, cds)
Plot frequency of duplicates per mode for each species
plot_duplicate_freqs(dup_counts, plot_type = "facet", remove_zero = TRUE)
plot_duplicate_freqs(dup_counts, plot_type = "facet", remove_zero = TRUE)
dup_counts |
A data frame in long format with the number of
duplicates per mode for each species, as returned by
the function |
plot_type |
Character indicating how to plot frequencies. One of 'facet' (facets for each level of the variable type), 'stack' (levels of the variable type as stacked bars), or 'stack_percent' (levels of the variable type as stacked bars, with x-axis representing relative frequencies). Default: 'facet'. |
remove_zero |
Logical indicating whether or not to remove rows with zero values. Default: TRUE. |
A ggplot object.
data(fungi_kaks) # Get unique duplicates duplicate_list <- classify_genes(fungi_kaks) # Get count table dup_counts <- duplicates2counts(duplicate_list) # Plot counts plot_duplicate_freqs(dup_counts, plot_type = "stack_percent")
data(fungi_kaks) # Get unique duplicates duplicate_list <- classify_genes(fungi_kaks) # Get count table dup_counts <- duplicates2counts(duplicate_list) # Plot counts plot_duplicate_freqs(dup_counts, plot_type = "stack_percent")
Plot distribution of synonymous substitution rates (Ks)
plot_ks_distro( ks_df, min_ks = 0.01, max_ks = 2, bytype = FALSE, type_levels = NULL, plot_type = "histogram", binwidth = 0.03 )
plot_ks_distro( ks_df, min_ks = 0.01, max_ks = 2, bytype = FALSE, type_levels = NULL, plot_type = "histogram", binwidth = 0.03 )
ks_df |
A data frame with Ks values for each gene pair
as returned by |
min_ks |
Numeric indicating the minimum Ks value to keep. Default: 0.01. |
max_ks |
Numeric indicating the maximum Ks value to keep. Default: 2. |
bytype |
Logical indicating whether or not to plot the distribution
by type of duplication (requires a column named |
type_levels |
(Only valid if bytype is not NULL) Character indicating which levels of the variable specified in parameter group_by should be kept. By default, all levels are kept. |
plot_type |
Character indicating the type of plot to create. If bytype = TRUE, possible types are "histogram" or "violin". If bytype = FALSE, possible types are "histogram", "density", or "density_histogram". Default: "histogram". |
binwidth |
(Only valid if plot_type = "histogram") Numeric indicating the bin width. Default: 0.03. |
A ggplot object.
data(fungi_kaks) ks_df <- fungi_kaks$saccharomyces_cerevisiae # Plot distro plot_ks_distro(ks_df, bytype = TRUE)
data(fungi_kaks) ks_df <- fungi_kaks$saccharomyces_cerevisiae # Plot distro plot_ks_distro(ks_df, bytype = TRUE)
Plot histogram of Ks distribution with peaks
plot_ks_peaks(peaks = NULL, binwidth = 0.05)
plot_ks_peaks(peaks = NULL, binwidth = 0.05)
peaks |
A list with elements mean, sd,
lambda, and ks, as returned by the
function |
binwidth |
Numeric scalar with binwidth for the histogram. Default: 0.05. |
A ggplot object with a histogram and lines for each Ks peak.
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae ks <- scerevisiae_kaks$Ks # Find 2 peaks in Ks distribution peaks <- find_ks_peaks(ks, npeaks = 2) # Plot plot_ks_peaks(peaks, binwidth = 0.05)
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae ks <- scerevisiae_kaks$Ks # Find 2 peaks in Ks distribution peaks <- find_ks_peaks(ks, npeaks = 2) # Plot plot_ks_peaks(peaks, binwidth = 0.05)
Plot distributions of substitution rates (Ka, Ks, or Ka/Ks) per species
plot_rates_by_species( kaks_list, rate_column = "Ks", bytype = FALSE, range = c(0, 2), fill = "deepskyblue3", color = "deepskyblue4" )
plot_rates_by_species( kaks_list, rate_column = "Ks", bytype = FALSE, range = c(0, 2), fill = "deepskyblue3", color = "deepskyblue4" )
kaks_list |
A list of data frames with substitution rates per gene
pair in each species as returned by |
rate_column |
Character indicating the name of the column to plot. Default: "Ks". |
bytype |
Logical indicating whether or not to show distributions by type of duplication. Default: FALSE. |
range |
Numeric vector of length 2 indicating the minimum and maximum
values to plot. Default: |
fill |
Character with color to use for the fill aesthetic. Ignored if bytype = TRUE. Default: "deepskyblue3". |
color |
Character with color to use for the color aesthetic. Ignored if bytype = FALSE. Default: "deepskyblue4". |
Data will be plotted using the species order of the list. To change the order of the species to plot, reorder the list elements in kaks_list.
A ggplot object.
data(fungi_kaks) # Plot rates plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks")
data(fungi_kaks) # Plot rates plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks")
The purpose of this function is to classify gene pairs by age when there are 2+ Ks peaks. This way, newer gene pairs are found within a certain number of standard deviations from the highest peak, and older genes are found close within smaller peaks.
split_pairs_by_peak(ks_df, peaks, nsd = 2, binwidth = 0.05)
split_pairs_by_peak(ks_df, peaks, nsd = 2, binwidth = 0.05)
ks_df |
A 3-column data frame with gene pairs in columns 1 and 2, and Ks values for the gene pair in column 3. |
peaks |
A list with mean, standard deviation, and amplitude of Ks
peaks as generated by |
nsd |
Numeric with the number of standard deviations to consider for each peak. |
binwidth |
Numeric scalar with binwidth for the histogram. Default: 0.05. |
A list with the following elements:
A 4-column data frame with the variables dup1 (character), dup2 (character), ks (numeric), and peak (numeric), representing duplicate gene pair, Ks values, and peak ID, respectively.
A ggplot object with Ks peaks as returned by
plot_ks_peaks
, but with dashed red lines indicating
boundaries for each peak.
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Create a data frame of duplicate pairs and Ks values ks_df <- scerevisiae_kaks[, c("dup1", "dup2", "Ks")] # Create list of peaks peaks <- find_ks_peaks(ks_df$Ks, npeaks = 2) # Split pairs spairs <- split_pairs_by_peak(ks_df, peaks)
data(fungi_kaks) scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae # Create a data frame of duplicate pairs and Ks values ks_df <- scerevisiae_kaks[, c("dup1", "dup2", "Ks")] # Create list of peaks peaks <- find_ks_peaks(ks_df$Ks, npeaks = 2) # Split pairs spairs <- split_pairs_by_peak(ks_df, peaks)
Data obtained from Ensembl Fungi. Only annotation data protein-coding genes (with associated mRNA, exons, CDS, etc) are included.
data(yeast_annot)
data(yeast_annot)
A CompressedGRangesList containing the elements Scerevisiae and Cglabrata.
data(yeast_annot)
data(yeast_annot)
Data obtained from Ensembl Fungi. Only translated sequences of primary transcripts were included.
data(yeast_seq)
data(yeast_seq)
A list of AAStringSet objects with the elements Scerevisiae and Cglabrata.
data(yeast_seq)
data(yeast_seq)