CoSIA, an R package for Cross Species Investigation and Analysis


Summary

Cross Species Investigation and Analysis (CoSIA) is a package that provides researchers with an alternative methodology for comparing across species and tissues using normal wild-type RNA-Seq Gene Expression data from Bgee. Using RNA-Seq Gene Expression data, CoSIA provides multiple visualization tools to explore the transcriptome diversity and variation across genes, tissues, and species. CoSIA uses the Coefficient of Variation and Shannon Entropy and Specificity to calculate transcriptome diversity and variation. CoSIA also provides additional conversion tools and utilities to provide a streamlined methodology for cross-species comparison.

Figure 1. CoSIA_Workflow

CoSIA is split into 3 methods that provide various resources in order for researchers to conduct cross species analysis using gene expression metrics.

  • Method 1 uses getConversion to convert inputs between different gene identifiers in the same species as well as orthologs in different species. The other modules access tissue- and/or species-specific gene expression.

  • Method 2 uses getGEx to obtain raw read counts that undergo Variance Stabilizing Transformation via DESeq2 methodology. Gene expression values are visualized for a single gene across multiple tissues in single model organism or across multiple species in a single tissue using the plotting methods, plotTissueGEx & plotSpeciesGEx, respectively.

  • Method 3 uses getGExMetrics to calculate median-based Coefficient of Variation (variability) and Shannon Entropy (diversity & specificity). There are two accompanying plotting methods, plotCVGEx & plotDSGEx that are used to visualize the variation and diversity & specificity (DS) of gene expression across genes, tissues, and species.


Installation

In R:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CoSIA")

Generating a CoSIAn object

Load CoSIA

library(CoSIA)
load("../inst/extdata/proccessed/monogenic_kidney_genes.rda")


# downsampling data for figure
set.seed(42)
monogenic_kid_sample <- dplyr::sample_n(monogenic_kidney_genes, 20)

Arguments and options table

(#tab:Table1.) CoSIA Arguments and Options.
Slot Name Possible Value Options Default
gene_set “character”, c(characters), data.frame$column N/A
i_species; o_species; map_species “h_sapiens”, “m_musculus”, “r_noregicus”, “d_rerio”, “d_melangoaster”, “c_elegans” N/A
mapping_tool “annotationDBI”, “biomaRt” annotationDBI
input_id; output_ids “Ensembl_id”, “Entrez_id”, “Symbol” N/A
ortholog_database “NCBIOrtho”, “HomoloGene” HomoloGene
map_tissue c(“tissue”), “tissue”; see getTissues N/A
metric_type “CV_Tissue”, “CV_Species”, “DS_Gene”, “DS_Gene_all”, “DS_Tissue”, “DS_Tissue_all” N/A

Find possible tissues with getTissues

The function getTissues retrieves tissues available for a single species:

CoSIA::getTissues("d_rerio")

…or tissues shared across a list of species:

CoSIA::getTissues(c("m_musculus", "r_norvegicus"))
#> # A tibble: 21 × 1
#>    Common_Anatomical_Entity_Name
#>    <chr>                        
#>  1 adult mammalian kidney       
#>  2 brain                        
#>  3 cerebellum                   
#>  4 colon                        
#>  5 duodenum                     
#>  6 esophagus                    
#>  7 frontal cortex               
#>  8 heart                        
#>  9 ileum                        
#> 10 jejunum                      
#> # ℹ 11 more rows

NOTE: To compare across all shared tissues for your selected species, you can assign the getTissues output to an object as input for map_tissues when initializing a CoSIAn object.

Initializing a CoSIAn object

CoSIAn_Obj <- CoSIA::CoSIAn(
    gene_set = unique(monogenic_kid_sample$Gene),
    i_species = "h_sapiens",
    o_species = c(
        "h_sapiens",
        "r_norvegicus"
    ),
    input_id = "Symbol",
    output_ids = "Ensembl_id",
    map_species = c(
        "h_sapiens",
        "r_norvegicus"
    ),
    map_tissues = c(
        "adult mammalian kidney",
        "heart"
    ),
    mapping_tool = "annotationDBI",
    ortholog_database = "HomoloGene",
    metric_type = "CV_Species"
)

str(CoSIAn_Obj)
#> Formal class 'CoSIAn' [package "CoSIA"] with 13 slots
#>   ..@ gene_set         : chr [1:20] "SDCCAG8" "TACO1" "FANCA" "BMP7" ...
#>   ..@ i_species        : chr "h_sapiens"
#>   ..@ input_id         : chr "Symbol"
#>   ..@ o_species        : chr [1:2] "h_sapiens" "r_norvegicus"
#>   ..@ output_ids       : chr "Ensembl_id"
#>   ..@ mapping_tool     : chr "annotationDBI"
#>   ..@ ortholog_database: chr "HomoloGene"
#>   ..@ converted_id     :'data.frame':    1 obs. of  1 variable:
#>   .. ..$ X0: num 0
#>   ..@ map_tissues      : chr [1:2] "adult mammalian kidney" "heart"
#>   ..@ map_species      : chr [1:2] "h_sapiens" "r_norvegicus"
#>   ..@ gex              :'data.frame':    1 obs. of  1 variable:
#>   .. ..$ X0: num 0
#>   ..@ metric_type      : chr "CV_Species"
#>   ..@ metric           :'data.frame':    1 obs. of  1 variable:
#>   .. ..$ X0: num 0

NOTE: Any species you plan to compare must be specified in map_species AND o_species.

NOTE: The getGEx function requires that “Ensembl_id” is included as an output_id. Here, we convert gene symbols into Ensembl IDs.

Use Cases with Monogenic Kidney Disease-Associated Genes

The following use cases provide running examples of CoSIA applications with Natera’s Monogenic Kidney Disease Panel. We will perform id conversion, obtain and visualize gene expression data, and calculate and visualize CV and DS of gene expression across three species (human, mouse, & rat) and two tissues (kidney & heart).

Use Case #1: Converting Gene Symbols to Ensembl IDs (getConversion)

CoSIA can convert input ids to any of the types listed in Table @ref(tab:Table1.).

CoSIAn_Obj_convert <- CoSIA::getConversion(CoSIAn_Obj)

head(CoSIA::viewCoSIAn(CoSIAn_Obj_convert, "converted_id"))
#> [[1]]
#>  [1] "SDCCAG8" "SDCCAG8" "TACO1"   "FANCA"   "BMP7"    "AP2S1"   "SCARB2" 
#>  [8] "AQP2"    "XDH"     "PCBD1"   "AMN"     "PPP3CA"  "LMX1B"   "NPHS1"  
#> [15] "EBP"     "SLX4"    "CYP11B2" "TTR"     "BICC1"   "OFD1"    "UCP3"

Use Case #2: Obtaining and visualizing curated non-diseased kidney and heart gene expression data for human, mouse, rat from Bgee

Now we will use the converted IDs with getGEx to obtain heart and kidney gene expression data for human, mouse and rat curated from Bgee.

CoSIAn_Obj_gex <- CoSIA::getGEx(CoSIAn_Obj_convert)

head(CoSIA::viewCoSIAn(CoSIAn_Obj_gex, "gex"), 5)

These data can be visualized with plotSpeciesGEx to plot expression of a single gene in a single tissue across species or plotTissueGEx to plot expression of a single gene in a single species across tissues.

Here we are plotting gene expression for the TACO1 gene in kidney tissue for human, mouse, and rat.

CoSIAn_Obj_gexplot <- CoSIA::plotSpeciesGEx(CoSIAn_Obj_gex, "adult mammalian kidney", "ENSG00000136463")

CoSIAn_Obj_gexplot

Figure 3. Gene Expression of TACO1 in Kidney Across Selected Species

CoSIA produces an interactive plot of expression values for TACO1 in kidney for human mouse and rat. Hovering over the plot displays sample-specific VST read count values. For file size compliance, we have included a static plot for this output. NOTE: Expression values are not meant to be compared across species in this plot. The next two use cases demonstrate appropriate methods for comparing gene expression patterns across species.


Use Case #3: Gene expression variability across species for kidney tissue by calculating and visualizing median-based Coefficient of Variation (CV)

Calculating and visualizing median-based coefficient of variation allows for a relative comparison of gene expression variability between species. In CoSIA, CV is calculated as the standard deviation over the median using VST read count values. In the section @ref(initializing-a-cosian-object), we set metric_type= "CV_Species" to calculate CV for the monogenic kidney gene set across human and rat.


CoSIAn_Obj_CV <- CoSIA::getGExMetrics(CoSIAn_Obj_convert)

CoSIAn_Obj_CVplot <- CoSIA::plotCVGEx(CoSIAn_Obj_CV)

CoSIAn_Obj_CVplot

Figure 4. Gene Expression Variability Across Species in Kidney Tissue

Here we see relatively low variability in gene expression of most genes in rat and human, with the exception of relatively high levels of variation in metric_type= "CV_Tissue" displays the variability of gene expression for the monogenic gene set across selected, shared tissues.


Use Case #4: Gene expression diversity and specificity across tissues and species for monogenic kidney-disease associated genes

Diversity and specificity metrics are also suitable for comparing gene expression patterns across species. In CoSIA, diversity and specificity (based on Shannon’s entropy) are calculated using min-max scaled median VST values generated each gene in a tissue- and species-specific manner. Values are used to calculate diversity and specificity as in Martínez & Reyes-Valdés, 2008.

In the final use case, we are calculating and visualizing diversity and specificity for kidney and heart tissue across the monogenic kidney gene set by setting metric_type= "DS_Tissue". There are additional metric_types for the CoSIAn objects, view @ref(tab:Table2.) This example is not run, but the code and the resulting static output is included.

CoSIAn_Obj_DS <- CoSIA::CoSIAn(
    gene_set = unique(monogenic_kid_sample$Gene),
    i_species = "h_sapiens",
    o_species = c("h_sapiens", "r_norvegicus"),
    input_id = "Symbol",
    output_ids = "Ensembl_id",
    map_species = c("h_sapiens", "r_norvegicus"),
    map_tissues = c("adult mammalian kidney", "heart"),
    mapping_tool = "annotationDBI",
    ortholog_database = "HomoloGene",
    metric_type = "DS_Tissue"
)

CoSIAn_Obj_DS <- CoSIA::getConversion(CoSIAn_Obj_DS)

CoSIAn_Obj_DS <- CoSIA::getGExMetrics(CoSIAn_Obj_DS)

CoSIAn_Obj_DSplot <- CoSIA::plotDSGEx(CoSIAn_Obj_DS)

CoSIAn_Obj_DSplot

Figure 5. Diversity vs Specificity of Genes in Geneset Across Anatomical Entity Names

There is low specificity in kidney tissue, indicating there are more genes from the set that are expressed in kidney. We see higher specificity in heart tissue, indicating that there are fewer genes from the set that are expressed in heart.

(#tab:Table2.) CoSIA Metric Type options.
Metric Type Function
DS_Gene calculates diversity and specificity across genes in gene_set in the tissues listed in map_tissue
DS_Gene_all calculates diversity and specificity across genes in gene_set in the all the available tissues in a species
DS_Tissue calculates diversity and specificity across tissues listed in map_tissue in the genes in gene_set
DS_Tissue_all calculates diversity and specificity across tissues listed in map_tissue in the all the genes in a species

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] CoSIA_1.7.0          ExperimentHub_2.15.0 AnnotationHub_3.15.0
#> [4] BiocFileCache_2.15.0 dbplyr_2.5.0         BiocGenerics_0.53.3 
#> [7] generics_0.1.3       BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.47.0           gtable_0.3.6             
#>  [3] xfun_0.49                 bslib_0.8.0              
#>  [5] ggplot2_3.5.1             Biobase_2.67.0           
#>  [7] vctrs_0.6.5               tools_4.4.2              
#>  [9] stats4_4.4.2              curl_6.0.1               
#> [11] tibble_3.2.1              AnnotationDbi_1.69.0     
#> [13] RSQLite_2.3.9             blob_1.2.4               
#> [15] pkgconfig_2.0.3           homologene_1.4.68.19.3.27
#> [17] S4Vectors_0.45.2          lifecycle_1.0.4          
#> [19] GenomeInfoDbData_1.2.13   compiler_4.4.2           
#> [21] Biostrings_2.75.3         munsell_0.5.1            
#> [23] GenomeInfoDb_1.43.2       htmltools_0.5.8.1        
#> [25] sys_3.4.3                 buildtools_1.0.0         
#> [27] sass_0.4.9                yaml_2.3.10              
#> [29] pillar_1.10.0             crayon_1.5.3             
#> [31] jquerylib_0.1.4           cachem_1.1.0             
#> [33] org.Hs.eg.db_3.20.0       mime_0.12                
#> [35] tidyselect_1.2.1          digest_0.6.37            
#> [37] dplyr_1.1.4               purrr_1.0.2              
#> [39] BiocVersion_3.21.1        maketools_1.3.1          
#> [41] grid_4.4.2                fastmap_1.2.0            
#> [43] colorspace_2.1-1          cli_3.6.3                
#> [45] org.Rn.eg.db_3.20.0       magrittr_2.0.3           
#> [47] utf8_1.2.4                withr_3.0.2              
#> [49] scales_1.3.0              filelock_1.0.3           
#> [51] UCSC.utils_1.3.0          rappdirs_0.3.3           
#> [53] bit64_4.5.2               rmarkdown_2.29           
#> [55] XVector_0.47.0            httr_1.4.7               
#> [57] bit_4.5.0.1               png_0.1-8                
#> [59] memoise_2.0.1             evaluate_1.0.1           
#> [61] knitr_1.49                annotationTools_1.81.0   
#> [63] IRanges_2.41.2            rlang_1.1.4              
#> [65] glue_1.8.0                DBI_1.2.3                
#> [67] BiocManager_1.30.25       jsonlite_1.8.9           
#> [69] R6_2.5.1                  zlibbioc_1.52.0