Cross Species
Investigation and Analysis
(CoSIA
) is a package that provides researchers with an
alternative methodology for comparing across species and tissues using
normal wild-type RNA-Seq Gene Expression data from Bgee. Using RNA-Seq
Gene Expression data, CoSIA provides multiple visualization tools to
explore the transcriptome diversity and variation across genes, tissues,
and species. CoSIA uses the Coefficient of Variation and Shannon Entropy
and Specificity to calculate transcriptome diversity and variation.
CoSIA also provides additional conversion tools and utilities to provide
a streamlined methodology for cross-species comparison.
CoSIA is split into 3 methods that provide various resources in order for researchers to conduct cross species analysis using gene expression metrics.
Method 1 uses getConversion
to convert inputs
between different gene identifiers in the same species as well as
orthologs in different species. The other modules access tissue- and/or
species-specific gene expression.
Method 2 uses getGEx
to obtain raw read counts that
undergo Variance Stabilizing Transformation via DESeq2 methodology. Gene
expression values are visualized for a single gene across multiple
tissues in single model organism or across multiple species in
a single tissue using the plotting methods,
plotTissueGEx
& plotSpeciesGEx
,
respectively.
Method 3 uses getGExMetrics
to calculate
median-based Coefficient of Variation (variability) and Shannon Entropy
(diversity & specificity). There are two accompanying plotting
methods, plotCVGEx
& plotDSGEx
that are
used to visualize the variation and diversity & specificity (DS) of
gene expression across genes, tissues, and species.
In R:
Slot Name | Possible Value Options | Default |
---|---|---|
gene_set | “character”, c(characters), data.frame$column | N/A |
i_species; o_species; map_species | “h_sapiens”, “m_musculus”, “r_noregicus”, “d_rerio”, “d_melangoaster”, “c_elegans” | N/A |
mapping_tool | “annotationDBI”, “biomaRt” | annotationDBI |
input_id; output_ids | “Ensembl_id”, “Entrez_id”, “Symbol” | N/A |
ortholog_database | “NCBIOrtho”, “HomoloGene” | HomoloGene |
map_tissue | c(“tissue”), “tissue”; see
getTissues |
N/A |
metric_type | “CV_Tissue”, “CV_Species”, “DS_Gene”, “DS_Gene_all”, “DS_Tissue”, “DS_Tissue_all” | N/A |
getTissues
The function getTissues
retrieves tissues available for
a single species:
…or tissues shared across a list of species:
CoSIA::getTissues(c("m_musculus", "r_norvegicus"))
#> # A tibble: 21 × 1
#> Common_Anatomical_Entity_Name
#> <chr>
#> 1 adult mammalian kidney
#> 2 brain
#> 3 cerebellum
#> 4 colon
#> 5 duodenum
#> 6 esophagus
#> 7 frontal cortex
#> 8 heart
#> 9 ileum
#> 10 jejunum
#> # ℹ 11 more rows
NOTE: To compare across all shared tissues for your selected
species, you can assign the getTissues
output to an object
as input for map_tissues
when initializing a CoSIAn
object.
CoSIAn_Obj <- CoSIA::CoSIAn(
gene_set = unique(monogenic_kid_sample$Gene),
i_species = "h_sapiens",
o_species = c(
"h_sapiens",
"r_norvegicus"
),
input_id = "Symbol",
output_ids = "Ensembl_id",
map_species = c(
"h_sapiens",
"r_norvegicus"
),
map_tissues = c(
"adult mammalian kidney",
"heart"
),
mapping_tool = "annotationDBI",
ortholog_database = "HomoloGene",
metric_type = "CV_Species"
)
str(CoSIAn_Obj)
#> Formal class 'CoSIAn' [package "CoSIA"] with 13 slots
#> ..@ gene_set : chr [1:20] "SDCCAG8" "TACO1" "FANCA" "BMP7" ...
#> ..@ i_species : chr "h_sapiens"
#> ..@ input_id : chr "Symbol"
#> ..@ o_species : chr [1:2] "h_sapiens" "r_norvegicus"
#> ..@ output_ids : chr "Ensembl_id"
#> ..@ mapping_tool : chr "annotationDBI"
#> ..@ ortholog_database: chr "HomoloGene"
#> ..@ converted_id :'data.frame': 1 obs. of 1 variable:
#> .. ..$ X0: num 0
#> ..@ map_tissues : chr [1:2] "adult mammalian kidney" "heart"
#> ..@ map_species : chr [1:2] "h_sapiens" "r_norvegicus"
#> ..@ gex :'data.frame': 1 obs. of 1 variable:
#> .. ..$ X0: num 0
#> ..@ metric_type : chr "CV_Species"
#> ..@ metric :'data.frame': 1 obs. of 1 variable:
#> .. ..$ X0: num 0
NOTE: Any species you plan to compare must be specified in
map_species
AND o_species
.
NOTE: The getGEx
function requires that
“Ensembl_id” is included as an output_id
. Here, we convert
gene symbols into Ensembl IDs.
The following use cases provide running examples of CoSIA applications with Natera’s Monogenic Kidney Disease Panel. We will perform id conversion, obtain and visualize gene expression data, and calculate and visualize CV and DS of gene expression across three species (human, mouse, & rat) and two tissues (kidney & heart).
getConversion
)CoSIA can convert input ids to any of the types listed in Table @ref(tab:Table1.).
CoSIAn_Obj_convert <- CoSIA::getConversion(CoSIAn_Obj)
head(CoSIA::viewCoSIAn(CoSIAn_Obj_convert, "converted_id"))
#> [[1]]
#> [1] "SDCCAG8" "SDCCAG8" "TACO1" "FANCA" "BMP7" "AP2S1" "SCARB2"
#> [8] "AQP2" "XDH" "PCBD1" "AMN" "PPP3CA" "LMX1B" "NPHS1"
#> [15] "EBP" "SLX4" "CYP11B2" "TTR" "BICC1" "OFD1" "UCP3"
Now we will use the converted IDs with getGEx
to obtain
heart and kidney gene expression data for human, mouse and rat curated
from Bgee.
CoSIAn_Obj_gex <- CoSIA::getGEx(CoSIAn_Obj_convert)
head(CoSIA::viewCoSIAn(CoSIAn_Obj_gex, "gex"), 5)
These data can be visualized with plotSpeciesGEx
to plot
expression of a single gene in a single tissue across species or
plotTissueGEx
to plot expression of a single gene in a
single species across tissues.
Here we are plotting gene expression for the TACO1 gene in kidney tissue for human, mouse, and rat.
CoSIAn_Obj_gexplot <- CoSIA::plotSpeciesGEx(CoSIAn_Obj_gex, "adult mammalian kidney", "ENSG00000136463")
CoSIAn_Obj_gexplot
CoSIA produces an interactive plot of expression values for TACO1 in kidney for human mouse and rat. Hovering over the plot displays sample-specific VST read count values. For file size compliance, we have included a static plot for this output. NOTE: Expression values are not meant to be compared across species in this plot. The next two use cases demonstrate appropriate methods for comparing gene expression patterns across species.
Calculating and visualizing median-based coefficient of variation
allows for a relative comparison of gene expression variability between
species. In CoSIA, CV is calculated as the standard deviation over the
median using VST read count values. In the section
@ref(initializing-a-cosian-object), we set
metric_type= "CV_Species"
to calculate CV for the monogenic
kidney gene set across human and rat.
CoSIAn_Obj_CV <- CoSIA::getGExMetrics(CoSIAn_Obj_convert)
CoSIAn_Obj_CVplot <- CoSIA::plotCVGEx(CoSIAn_Obj_CV)
CoSIAn_Obj_CVplot
Here we see relatively low variability in gene expression of most
genes in rat and human, with the exception of relatively high levels of
variation in metric_type= "CV_Tissue"
displays the
variability of gene expression for the monogenic gene set across
selected, shared tissues.
Diversity and specificity metrics are also suitable for comparing gene expression patterns across species. In CoSIA, diversity and specificity (based on Shannon’s entropy) are calculated using min-max scaled median VST values generated each gene in a tissue- and species-specific manner. Values are used to calculate diversity and specificity as in Martínez & Reyes-Valdés, 2008.
In the final use case, we are calculating and visualizing diversity
and specificity for kidney and heart tissue across the monogenic kidney
gene set by setting metric_type= "DS_Tissue"
. There are
additional metric_types for the CoSIAn objects, view @ref(tab:Table2.)
This example is not run, but the code and the resulting static output is
included.
CoSIAn_Obj_DS <- CoSIA::CoSIAn(
gene_set = unique(monogenic_kid_sample$Gene),
i_species = "h_sapiens",
o_species = c("h_sapiens", "r_norvegicus"),
input_id = "Symbol",
output_ids = "Ensembl_id",
map_species = c("h_sapiens", "r_norvegicus"),
map_tissues = c("adult mammalian kidney", "heart"),
mapping_tool = "annotationDBI",
ortholog_database = "HomoloGene",
metric_type = "DS_Tissue"
)
CoSIAn_Obj_DS <- CoSIA::getConversion(CoSIAn_Obj_DS)
CoSIAn_Obj_DS <- CoSIA::getGExMetrics(CoSIAn_Obj_DS)
CoSIAn_Obj_DSplot <- CoSIA::plotDSGEx(CoSIAn_Obj_DS)
CoSIAn_Obj_DSplot
There is low specificity in kidney tissue, indicating there are more genes from the set that are expressed in kidney. We see higher specificity in heart tissue, indicating that there are fewer genes from the set that are expressed in heart.
Metric Type | Function |
---|---|
DS_Gene | calculates diversity and specificity across genes in
gene_set in the tissues listed in
map_tissue |
DS_Gene_all | calculates diversity and specificity across genes in
gene_set in the all the available tissues in a species |
DS_Tissue | calculates diversity and specificity across tissues listed in
map_tissue in the genes in gene_set |
DS_Tissue_all | calculates diversity and specificity across tissues listed in
map_tissue in the all the genes in a species |
Session info
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CoSIA_1.7.0 ExperimentHub_2.15.0 AnnotationHub_3.15.0
#> [4] BiocFileCache_2.15.0 dbplyr_2.5.0 BiocGenerics_0.53.3
#> [7] generics_0.1.3 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 gtable_0.3.6
#> [3] xfun_0.50 bslib_0.8.0
#> [5] ggplot2_3.5.1 Biobase_2.67.0
#> [7] vctrs_0.6.5 tools_4.4.2
#> [9] stats4_4.4.2 curl_6.1.0
#> [11] tibble_3.2.1 AnnotationDbi_1.69.0
#> [13] RSQLite_2.3.9 blob_1.2.4
#> [15] pkgconfig_2.0.3 homologene_1.4.68.19.3.27
#> [17] S4Vectors_0.45.2 lifecycle_1.0.4
#> [19] GenomeInfoDbData_1.2.13 compiler_4.4.2
#> [21] Biostrings_2.75.3 munsell_0.5.1
#> [23] GenomeInfoDb_1.43.2 htmltools_0.5.8.1
#> [25] sys_3.4.3 buildtools_1.0.0
#> [27] sass_0.4.9 yaml_2.3.10
#> [29] pillar_1.10.1 crayon_1.5.3
#> [31] jquerylib_0.1.4 cachem_1.1.0
#> [33] org.Hs.eg.db_3.20.0 mime_0.12
#> [35] tidyselect_1.2.1 digest_0.6.37
#> [37] dplyr_1.1.4 purrr_1.0.2
#> [39] BiocVersion_3.21.1 maketools_1.3.1
#> [41] fastmap_1.2.0 grid_4.4.2
#> [43] colorspace_2.1-1 cli_3.6.3
#> [45] org.Rn.eg.db_3.20.0 magrittr_2.0.3
#> [47] utf8_1.2.4 withr_3.0.2
#> [49] scales_1.3.0 filelock_1.0.3
#> [51] UCSC.utils_1.3.1 rappdirs_0.3.3
#> [53] bit64_4.6.0-1 rmarkdown_2.29
#> [55] XVector_0.47.2 httr_1.4.7
#> [57] bit_4.5.0.1 png_0.1-8
#> [59] memoise_2.0.1 evaluate_1.0.3
#> [61] knitr_1.49 annotationTools_1.81.0
#> [63] IRanges_2.41.2 rlang_1.1.4
#> [65] glue_1.8.0 DBI_1.2.3
#> [67] BiocManager_1.30.25 jsonlite_1.8.9
#> [69] R6_2.5.1