The purpose of this package is to compare the gene expression of the genes in two different conditions. The most typical case is when comparing gene expression in patients with a disease with gene expression in study participants without the disease. Hence, we may construct a network containing genes which are relevant for the development of the disease. The input data may come from different measurements of expression such as microarray, proteomics or RNA-seq as long as:
For differential gene-expression involving more than two separate
conditions, consider CoDiNA
(Morselli Gysi 2020) instead.
This package is hosted on Bioconductor. To install it, type:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("csdR")
Then,
should load the package into the current R session. For this vignette, we further load some auxiliary packages and set the random seed
This is a re-implementation and slight modification of the CSD algorithm presented by Voigt et al. 2017(Voigt 2017). In the first phase, the algorithm finds the co-expression between all genes within each condition using the Spearman correlation. For each pair of genes, we apply bootstrapping across the samples and compute the mean Spearman correlation ρ1 and ρ2 for the two conditions and the associated standard errors σ1 and σ2. In the second stage, the values for the two conditions are compared and gives us the following differential co-expression scores:
In this example, we are provided by two expression expression
matrices from thyroid glands, sick_expression
for patients
with thyroid cancer and normal_expression
for healthy
controls. To run the CSD analysis for these two conditions, we simply do
the following:
data("sick_expression")
data("normal_expression")
csd_results <- run_csd(
x_1 = sick_expression, x_2 = normal_expression,
n_it = 10, nThreads = 2L, verbose = FALSE
)
After obtaining these results, we may write them to disk. However, for datasets with thousands of genes, we will get millions upon millions of gene pairs. Writing the results to disk is likely to fill up gigabytes of valuable storage space while the disk IO itself might take a considerable amount of time. Furthermore, we must reduce the information load to create meaningful results, so we better to that while the data is still in memory. We decide to select the 100 edges with highest C, S, and D-score.
pairs_to_pick <- 100L
c_filter <- partial_argsort(csd_results$cVal, pairs_to_pick)
c_frame <- csd_results[c_filter, ]
s_filter <- partial_argsort(csd_results$sVal, pairs_to_pick)
s_frame <- csd_results[s_filter, ]
d_filter <- partial_argsort(csd_results$dVal, pairs_to_pick)
d_frame <- csd_results[d_filter, ]
Why does the csdR
package provide a general
partial_argsort
function which takes in a numeric vector
and spits out the indecies of the largest elements instead of a more
specialized function directly extracting the top results from the
dataframe? The answer is flexibility. Writing an additional line of code
and a dollar sign is not that much work after all and we may want more
flexible approaches such as displaying the union of the C, S- and
D-edges:
csd_filter <- c_filter %>%
union(s_filter) %>%
union(d_filter)
csd_frame <- csd_results[csd_filter, ]
The next logical step is to construct a network and do some analysis.
This is outside the scope of this package, but we will provide some
pointers for completeness. One viable approach is to use the ordinary
write.table
function to write the results of a file and
then use an external tools such as Cytoscape to further make
conclusions. Often, you may want to make an ontology enrichment of the
genes.
The other option is of course to continue using R. Here, we provide an example of combining the C-, S- and D-networks and coloring the edges blue, green and red, respectively depending of where they come from.
c_network <- graph_from_data_frame(c_frame, directed = FALSE)
s_network <- graph_from_data_frame(s_frame, directed = FALSE)
d_network <- graph_from_data_frame(d_frame, directed = FALSE)
E(c_network)$edge_type <- "C"
E(s_network)$edge_type <- "S"
E(d_network)$edge_type <- "D"
combined_network <- igraph::union(c_network, s_network, d_network)
# Auxillary function for combining
# the attributes of the three networks in a proper way
join_attributes <- function(graph, attribute) {
ifelse(
test = is.na(edge_attr(graph, glue("{attribute}_1"))),
yes = ifelse(
test = is.na(edge_attr(graph, glue("{attribute}_2"))),
yes = edge_attr(graph, glue("{attribute}_3")),
no = edge_attr(graph, glue("{attribute}_2"))
),
no = edge_attr(graph, glue("{attribute}_1"))
)
}
E(combined_network)$edge_type <- join_attributes(combined_network, "edge_type")
layout <- layout_nicely(combined_network)
E(combined_network)$color <- recode(E(combined_network)$edge_type,
C = "darkblue", S = "green", D = "darkred"
)
plot(combined_network, layout = layout,
vertex.size = 3, edge.width = 2, vertex.label.cex = 0.001)
As with any bootstrap procedure the number of iterations represented
by the argument n_it
needs to be sufficiently
large in order to get reproducible results. What this means is a
matter of trial and error. In general this means that you should re-run
the computations with a different random seed to see whether the number
of bootstrap iterations are sufficient. Experience has shown that ~ 100
iterations might be sufficient to reproduce almost the same results in
some cases, whereas in other cases, especially when the values are
close, you may choose to run several thousand iterations.
The parallelization of csdR
occurs within each
individual iteration. While parallelizing across the iterations would in
theory yield better CPU utilization, this approach is unfeasible due the
fact that memory consumption use would almost be proportional to the
number of threads. Instead, parallelization is used in computing the
co-expression matrix and updating the mean and variance for each
iteration. As shown in the csdR
paper, parallelism can
provide 2x-3x speedup, but there are diminishing returns in the
performance gained, so running more than 10 threads is most likely a
waste a resources. Limited memory bandwidth is the most likely culprit
behind the failure to scale, so systems with higher memory bandwidth are
more likely to utilize multiple threads better.
For datasets with 20 000 to 30 000 genes, a considerable amount of memory is consumed during the computations. It it therefore not recommended in such cases to run CSD on your laptop or even a workstation, but rather on a compute server with several hundreds GB of RAM.
How many gene pairs to select depends on the specific needs and how big a network you want to handle. A 10 000 edge network may not be easy to visualize, but quantitative network metrics can still be extracted. Also, generating more edges than necessary usually does not make any major harm as superfluous edges can quickly be filter out afterwards. However, if you select fewer edges than you actually need, you have to re-do all calculations to increase the number.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 glue_1.8.0 igraph_2.1.2 magrittr_2.0.3
#> [5] csdR_1.13.3 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 WGCNA_1.73 blob_1.2.4
#> [4] Biostrings_2.75.2 fastmap_1.2.0 rpart_4.1.23
#> [7] digest_0.6.37 lifecycle_1.0.4 cluster_2.1.6
#> [10] survival_3.7-0 KEGGREST_1.47.0 RSQLite_2.3.9
#> [13] compiler_4.4.2 rlang_1.1.4 Hmisc_5.2-1
#> [16] sass_0.4.9 tools_4.4.2 utf8_1.2.4
#> [19] yaml_2.3.10 data.table_1.16.4 knitr_1.49
#> [22] htmlwidgets_1.6.4 bit_4.5.0.1 foreign_0.8-87
#> [25] BiocGenerics_0.53.3 sys_3.4.3 nnet_7.3-19
#> [28] dynamicTreeCut_1.63-1 grid_4.4.2 stats4_4.4.2
#> [31] preprocessCore_1.69.0 fansi_1.0.6 colorspace_2.1-1
#> [34] fastcluster_1.2.6 GO.db_3.20.0 ggplot2_3.5.1
#> [37] scales_1.3.0 iterators_1.0.14 cli_3.6.3
#> [40] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
#> [43] rstudioapi_0.17.1 httr_1.4.7 DBI_1.2.3
#> [46] cachem_1.1.0 stringr_1.5.1 zlibbioc_1.52.0
#> [49] splines_4.4.2 parallel_4.4.2 impute_1.81.0
#> [52] AnnotationDbi_1.69.0 BiocManager_1.30.25 XVector_0.47.0
#> [55] matrixStats_1.4.1 base64enc_0.1-3 vctrs_0.6.5
#> [58] Matrix_1.7-1 jsonlite_1.8.9 IRanges_2.41.2
#> [61] S4Vectors_0.45.2 bit64_4.5.2 htmlTable_2.4.3
#> [64] Formula_1.2-5 maketools_1.3.1 foreach_1.5.2
#> [67] jquerylib_0.1.4 codetools_0.2-20 stringi_1.8.4
#> [70] gtable_0.3.6 GenomeInfoDb_1.43.2 UCSC.utils_1.3.0
#> [73] munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
#> [76] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13 R6_2.5.1
#> [79] doParallel_1.0.17 evaluate_1.0.1 lattice_0.22-6
#> [82] Biobase_2.67.0 backports_1.5.0 png_0.1-8
#> [85] RhpcBLASctl_0.23-42 memoise_2.0.1 bslib_0.8.0
#> [88] Rcpp_1.0.13-1 checkmate_2.3.2 gridExtra_2.3
#> [91] xfun_0.49 buildtools_1.0.0 pkgconfig_2.0.3