Informeasure: a tool to quantify nonlinear dependence between variables in biological regulatory networks from an information theory perspective

Introduction

Informeasure is an information theory R package for quantifying nonlinear dependence between variables in biological regulatory network inference. The package collects most of the information measures currently available: mutual information (MI), conditional mutual information (CMI)[1], interaction information (II)[2], partial information decomposition (PID)[3] and part mutual information (PMI)[4]. The corresponding functions all end in .measure(): MI.measure() for MI, CMI.measure() for CMI, II.measure() for II, PID.measure() for PID and PMI.measure() for PMI. The first estimator is used to infer bivariate networks, while the last four are dedicated to the analysis of trivariate networks. Here I estimate these information measures from breast cancer expression profile data generated by The Cancer Genome Atlas (TCGA), with applications to various types of transcriptome regulatory network inference.

Main functions demonstration

An information measure is typically computed by first discretizing the continuous variables into a count table, then estimating probabilities from the counts and, where required, estimating entropy from the (joint) probability matrix, and finally calculating the information value that best represents the association between the variables. The package adopts two of the most common discretization methods. The first (default) is a uniform width-based method that divides the continuous data into N bins of equal width. The second is a uniform frequency-based method that divides the continuous data into N bins containing equal numbers of observations. In both methods, the number of bins defaults to a rounded value based on the square root of the data size. For probability estimation, three types of estimator are available, following the entropy package[5]: the empirical estimator (default), the Dirichlet distribution estimator and the shrinkage estimator. The Dirichlet distribution estimator comes in four variants that differ in their prior values. The estimators are listed below:

method = “ML”: empirical estimator, also referred to as the maximum likelihood estimator,

method = “Jeffreys”: Dirichlet distribution estimator with prior a = 0.5,

method = “Laplace”: Dirichlet distribution estimator with prior a = 1,

method = “SG”: Dirichlet distribution estimator with prior a = 1/length(count table),

method = “minimax”: Dirichlet distribution estimator with prior a = sqrt(sum(count table))/length(count table),

method = “shrink”: shrinkage estimator.
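
The discretization and Dirichlet-type probability steps described above can be sketched in base R. This is a minimal illustration rather than the package's internal code: the helper names bin_equal_width(), bin_equal_freq() and dirichlet_prob() are hypothetical, the default of round(sqrt(n)) bins is an assumption based on the description of the default bin number, and the pseudo-count formula is the usual form of a Dirichlet estimator with a common prior a.

```r
# Equal-width binning: nbins intervals of identical width over the data range.
# (hypothetical helper, not part of Informeasure)
bin_equal_width <- function(x, nbins = round(sqrt(length(x)))) {
  breaks <- seq(min(x), max(x), length.out = nbins + 1)
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Equal-frequency binning: bin edges are quantiles, so counts are roughly equal.
# (hypothetical helper, not part of Informeasure)
bin_equal_freq <- function(x, nbins = round(sqrt(length(x)))) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = nbins + 1), names = FALSE)
  cut(x, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

# Dirichlet-type estimate: add the pseudo-count a to every cell, then normalise.
# a = 0 gives the empirical (ML) estimate.
dirichlet_prob <- function(counts, a = 0) {
  (counts + a) / (sum(counts) + a * length(counts))
}

x <- c(0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 2.2, 3.7, 4.1)  # 9 values, so 3 bins

width_counts <- tabulate(bin_equal_width(x), nbins = 3)
freq_counts  <- tabulate(bin_equal_freq(x),  nbins = 3)

# Prior values as listed above, applied to the equal-width counts.
priors <- c(ML = 0, Jeffreys = 0.5, Laplace = 1,
            SG = 1 / length(width_counts),
            minimax = sqrt(sum(width_counts)) / length(width_counts))
probs <- sapply(priors, function(a) dirichlet_prob(width_counts, a))
print(width_counts)
print(freq_counts)
print(round(probs, 3))
```

The four Dirichlet variants differ only in the value of a; larger priors pull the estimated probabilities toward the uniform distribution, which matters most when many bins are sparsely populated.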

MI.measure(): mutual information

For two variables, the representative measure is mutual information, which quantifies the mutual dependence between them. It can be used to identify dependence between proteins in protein-protein interaction network inference. The algorithm accepts two input formats: a plain data.frame and a SummarizedExperiment object.

# data.frame data type 
library(Informeasure)

load(system.file("extdata/tcga.brca.testdata.Rdata", package = "Informeasure"))

mRNAexpression   <- log2(mRNAexpression + 1)

x <- as.numeric(mRNAexpression[which(rownames(mRNAexpression) == "BRCA1"), ])
y <- as.numeric(mRNAexpression[which(rownames(mRNAexpression) == "BARD1"), ])

XY <- discretize2D(x,y)

MI.measure(XY)
##> [1] 0.6459387

# SummarizedExperiment data type
library(Informeasure)
library(SummarizedExperiment)

load(system.file("extdata/tcga.brca.testdata.Rdata", package = "Informeasure"))

mRNAexpression <- as.matrix(mRNAexpression)
se.mRNAexpression = SummarizedExperiment(assays = list(mRNAexpression = mRNAexpression))

assays(se.mRNAexpression)[["log2"]] <- log2(assays(se.mRNAexpression)[["mRNAexpression"]]+1)

x <- assays(se.mRNAexpression["BRCA1", ])$log2
y <- assays(se.mRNAexpression["BARD1", ])$log2

XY <- discretize2D(x,y)

MI.measure(XY)
##> [1] 0.6459387
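
To make explicit what a mutual information value computed from a two-dimensional count table represents, the sketch below evaluates MI = H(X) + H(Y) - H(X,Y) in base R with the empirical estimator and natural logarithms. The helpers entropy_ml() and mi_from_counts() are hypothetical names introduced for illustration; they are not the package's functions, and the package's estimator and log-base options may differ.

```r
# Empirical (ML) entropy of a count table, in nats; empty cells contribute 0.
entropy_ml <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Mutual information from a 2D count table (rows = bins of x, columns = bins of y).
mi_from_counts <- function(xy) {
  entropy_ml(rowSums(xy)) + entropy_ml(colSums(xy)) - entropy_ml(xy)
}

# Counts that factorise into row and column margins: x and y are independent.
independent <- outer(c(2, 2), c(3, 1))

# Counts on the diagonal only: x fully determines y.
dependent <- diag(c(5, 5))

mi_independent <- mi_from_counts(independent)
mi_dependent   <- mi_from_counts(dependent)
print(c(independent = mi_independent, dependent = mi_dependent))
```

For the independent table the value is zero up to floating-point error, and for the fully dependent binary table it equals log(2), the entropy of either variable.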

CMI.measure(): conditional mutual information

In the three-variable case, the most classic measure is conditional mutual information. It is widely used to evaluate the expected mutual information between two random variables conditioned on a third. This property makes conditional mutual information well suited to lncRNA-associated ceRNA network inference.

# data.frame data type
library(Informeasure)

load(system.file("extdata/tcga.brca.testdata.Rdata", package = "Informeasure"))

lncRNAexpression <- log2(lncRNAexpression + 1)
miRNAexpression  <- log2(miRNAexpression + 1)
mRNAexpression   <- log2(mRNAexpression + 1)

x <- as.numeric(miRNAexpression[which(rownames(miRNAexpression) == "hsa-miR-26a-5p"), ])
y <- as.numeric(mRNAexpression[which(rownames(mRNAexpression) == "PTEN"), ])
z <- as.numeric(lncRNAexpression[which(rownames(lncRNAexpression) == "PTENP1"), ])

XYZ <- discretize3D(x,y,z)

CMI.measure(XYZ)
##> [1] 0.7697107

# SummarizedExperiment data type
library(Informeasure)
library(SummarizedExperiment)

load(system.file("extdata/tcga.brca.testdata.Rdata", package="Informeasure"))

lncRNAexpression <- as.matrix(lncRNAexpression)
se.lncRNAexpression = SummarizedExperiment(assays = list(lncRNAexpression = lncRNAexpression))

miRNAexpression <- as.matrix(miRNAexpression)
se.miRNAexpression = SummarizedExperiment(assays = list(miRNAexpression = miRNAexpression))

mRNAexpression <- as.matrix(mRNAexpression)
se.mRNAexpression = SummarizedExperiment(assays = list(mRNAexpression = mRNAexpression))

assays(se.lncRNAexpression)[["log2"]] <- log2(assays(se.lncRNAexpression)[["lncRNAexpression"]] + 1)

assays(se.miRNAexpression)[["log2"]] <- log2(assays(se.miRNAexpression)[["miRNAexpression"]] + 1)

assays(se.mRNAexpression)[["log2"]] <- log2(assays(se.mRNAexpression)[["mRNAexpression"]] + 1)


x <- assays(se.miRNAexpression["hsa-miR-26a-5p", ])$log2
y <- assays(se.mRNAexpression["PTEN", ])$log2
z <- assays(se.lncRNAexpression["PTENP1", ])$log2

XYZ <- discretize3D(x,y,z)

CMI.measure(XYZ)
##> [1] 0.7697107
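
As a cross-check on what conditional mutual information measures, the following base R sketch computes I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) from a three-dimensional count table, using the empirical estimator and natural logarithms. The names entropy_ml() and cmi_from_counts() are hypothetical and are not the package's functions.

```r
# Empirical (ML) entropy of a count table, in nats.
entropy_ml <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Conditional mutual information I(X;Y|Z) from a 3D count table
# whose dimensions are ordered (x, y, z).
cmi_from_counts <- function(xyz) {
  h_xz  <- entropy_ml(apply(xyz, c(1, 3), sum))
  h_yz  <- entropy_ml(apply(xyz, c(2, 3), sum))
  h_z   <- entropy_ml(apply(xyz, 3, sum))
  h_xyz <- entropy_ml(xyz)
  h_xz + h_yz - h_xyz - h_z
}

# XOR table: x and y are independent bits and z = xor(x, y).
# Once z is known, x determines y, so I(X;Y|Z) = log(2).
xor_counts <- array(0, dim = c(2, 2, 2))
for (x in 0:1) for (y in 0:1) xor_counts[x + 1, y + 1, bitwXor(x, y) + 1] <- 1

# Copy table: x = y = z. Given z nothing is left to share, so I(X;Y|Z) = 0.
copy_counts <- array(0, dim = c(2, 2, 2))
copy_counts[1, 1, 1] <- 1
copy_counts[2, 2, 2] <- 1

cmi_xor  <- cmi_from_counts(xor_counts)
cmi_copy <- cmi_from_counts(copy_counts)
print(c(xor = cmi_xor, copy = cmi_copy))
```

The copy table shows the case in which conditioning on a strongly correlated third variable removes all of the measured dependence between x and y.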

II.measure(): interaction information

Interaction information, also known as co-information, measures the amount of information contained in a set of variables beyond that present in any subset of those variables. Here the number of variables is limited to three. It can be applied to explore the cooperative or competitive regulation mechanism of two miRNAs on a common target mRNA.

# data.frame data type
library(Informeasure)

load(system.file("extdata/tcga.brca.testdata.Rdata", package = "Informeasure"))

miRNAexpression  <- log2(miRNAexpression + 1)
mRNAexpression   <- log2(mRNAexpression + 1)

x <- as.numeric(miRNAexpression[which(rownames(miRNAexpression) == "hsa-miR-34a-5p"), ])
y <- as.numeric(mRNAexpression[which(rownames(mRNAexpression) == "MYC"), ])
z <- as.numeric(miRNAexpression[which(rownames(miRNAexpression) == "hsa-miR-34b-5p"), ])

XYZ <- discretize3D(x,y,z)

II.measure(XYZ)
##> [1] 0.4676038

# SummarizedExperiment data type
library(Informeasure)
library(SummarizedExperiment)

load(system.file("extdata/tcga.brca.testdata.Rdata", package="Informeasure"))

miRNAexpression <- as.matrix(miRNAexpression)
se.miRNAexpression = SummarizedExperiment(assays = list(miRNAexpression = miRNAexpression))

mRNAexpression <- as.matrix(mRNAexpression)
se.mRNAexpression = SummarizedExperiment(assays = list(mRNAexpression = mRNAexpression))

assays(se.miRNAexpression)[["log2"]] <- log2(assays(se.miRNAexpression)[["miRNAexpression"]] + 1)

assays(se.mRNAexpression)[["log2"]] <- log2(assays(se.mRNAexpression)[["mRNAexpression"]] + 1)

x <- assays(se.miRNAexpression["hsa-miR-34a-5p", ])$log2
y <- assays(se.mRNAexpression["MYC", ])$log2
z <- assays(se.miRNAexpression["hsa-miR-34b-5p", ])$log2

XYZ <- discretize3D(x,y,z)

II.measure(XYZ)
##> [1] 0.4676038
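
For three variables, interaction information can be written as the difference between a conditional and an unconditional mutual information. The base R sketch below assumes the convention II = I(X;Y|Z) - I(X;Y), under which positive values indicate synergy and negative values indicate redundancy; some authors use the opposite sign, so the sign convention of II.measure() should be checked against the package documentation. The helper names are hypothetical.

```r
# Empirical (ML) entropy of a count table, in nats.
entropy_ml <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Interaction information from a 3D count table with dimensions (x, y, z).
# ASSUMED convention: II = I(X;Y|Z) - I(X;Y).
ii_from_counts <- function(xyz) {
  xy <- apply(xyz, c(1, 2), sum)
  mi_xy <- entropy_ml(rowSums(xy)) + entropy_ml(colSums(xy)) - entropy_ml(xy)
  cmi_xy_z <- entropy_ml(apply(xyz, c(1, 3), sum)) +
    entropy_ml(apply(xyz, c(2, 3), sum)) -
    entropy_ml(xyz) - entropy_ml(apply(xyz, 3, sum))
  cmi_xy_z - mi_xy
}

# XOR table (z = xor(x, y)): purely synergistic, II = +log(2).
xor_counts <- array(0, dim = c(2, 2, 2))
for (x in 0:1) for (y in 0:1) xor_counts[x + 1, y + 1, bitwXor(x, y) + 1] <- 1

# Copy table (x = y = z): purely redundant, II = -log(2).
copy_counts <- array(0, dim = c(2, 2, 2))
copy_counts[1, 1, 1] <- 1
copy_counts[2, 2, 2] <- 1

ii_xor  <- ii_from_counts(xor_counts)
ii_copy <- ii_from_counts(copy_counts)
print(c(xor = ii_xor, copy = ii_copy))
```

Because interaction information is symmetric in its three arguments, the order in which the two miRNAs and the mRNA are passed does not change the value under this definition.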

PID.measure(): partial information decomposition

Partial information decomposition splits the information that two source variables provide about a common target into four parts: joint information (synergy), unique information from x, unique information from y and shared information (redundancy). It can also be applied to explore the cooperative or competitive regulation mechanism of two miRNAs on a common target mRNA.

# data.frame data type
library(Informeasure)

load(system.file("extdata/tcga.brca.testdata.Rdata", package = "Informeasure"))

miRNAexpression  <- log2(miRNAexpression + 1)
mRNAexpression   <- log2(mRNAexpression + 1)

x <- as.numeric(miRNAexpression[which(rownames(miRNAexpression) == "hsa-miR-34a-5p"), ])
y <- as.numeric(miRNAexpression[which(rownames(miRNAexpression) == "hsa-miR-34b-5p"), ])
z <- as.numeric(mRNAexpression[which(rownames(mRNAexpression) == "MYC"), ])

XYZ <- discretize3D(x,y,z)

PID.measure(XYZ)
##>    Synergy  Unique_X    Unique_Y Redundancy      PID
##> 1 0.670815 0.1854147 0.003058109  0.2032112 1.062499

# SummarizedExperiment data type
library(Informeasure)
library(SummarizedExperiment)

load(system.file("extdata/tcga.brca.testdata.Rdata", package="Informeasure"))

miRNAexpression <- as.matrix(miRNAexpression)
se.miRNAexpression = SummarizedExperiment(assays = list(miRNAexpression = miRNAexpression))

mRNAexpression <- as.matrix(mRNAexpression)
se.mRNAexpression = SummarizedExperiment(assays = list(mRNAexpression = mRNAexpression))

assays(se.miRNAexpression)[["log2"]] <- log2(assays(se.miRNAexpression)[["miRNAexpression"]] + 1)

assays(se.mRNAexpression)[["log2"]] <- log2(assays(se.mRNAexpression)[["mRNAexpression"]] + 1)

x <- assays(se.miRNAexpression["hsa-miR-34a-5p", ])$log2
y <- assays(se.miRNAexpression["hsa-miR-34b-5p", ])$log2
z <- assays(se.mRNAexpression["MYC", ])$log2

XYZ <- discretize3D(x,y,z)

PID.measure(XYZ)
##>    Synergy  Unique_X    Unique_Y Redundancy      PID
##> 1 0.670815 0.1854147 0.003058109  0.2032112 1.062499
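
The four parts can be illustrated with the minimum-specific-information redundancy proposed by Williams and Beer[3]. The base R sketch below is an illustration under that assumption, with z (the third dimension) as the target; the helper names are hypothetical, and whether PID.measure() uses exactly this redundancy function should be verified against the package source.

```r
# Entropy of a probability table, in nats.
entropy_p <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Specific information a source S carries about each target state z:
# I(Z = z; S) = sum_s p(s | z) * log(p(z | s) / p(z)).
# p_sz is the joint probability matrix (rows = source bins, columns = target bins).
specific_info <- function(p_sz) {
  p_s <- rowSums(p_sz)
  p_z <- colSums(p_sz)
  sapply(seq_along(p_z), function(k) {
    keep <- p_sz[, k] > 0
    sum(p_sz[keep, k] / p_z[k] * log(p_sz[keep, k] / (p_s[keep] * p_z[k])))
  })
}

# Decomposition from a 3D count table with dimensions (x, y, z); z is the target.
pid_imin <- function(xyz) {
  p <- xyz / sum(xyz)
  p_z <- apply(p, 3, sum)
  spec_x <- specific_info(apply(p, c(1, 3), sum))
  spec_y <- specific_info(apply(p, c(2, 3), sum))
  redundancy <- sum(p_z * pmin(spec_x, spec_y))  # shared information
  mi_xz <- sum(p_z * spec_x)                     # I(X;Z)
  mi_yz <- sum(p_z * spec_y)                     # I(Y;Z)
  mi_xy_z <- entropy_p(apply(p, c(1, 2), sum)) + entropy_p(p_z) - entropy_p(p)
  c(Synergy    = mi_xy_z - mi_xz - mi_yz + redundancy,
    Unique_X   = mi_xz - redundancy,
    Unique_Y   = mi_yz - redundancy,
    Redundancy = redundancy)
}

# XOR target: neither source alone is informative, so everything is synergy.
xor_counts <- array(0, dim = c(2, 2, 2))
for (x in 0:1) for (y in 0:1) xor_counts[x + 1, y + 1, bitwXor(x, y) + 1] <- 1

# Copy target (x = y = z): both sources carry the same bit, so everything is redundancy.
copy_counts <- array(0, dim = c(2, 2, 2))
copy_counts[1, 1, 1] <- 1
copy_counts[2, 2, 2] <- 1

pid_xor  <- pid_imin(xor_counts)
pid_copy <- pid_imin(copy_counts)
print(rbind(xor = pid_xor, copy = pid_copy))
```

In this sketch the four parts sum to the joint mutual information I(X,Y;Z), which is consistent with the PID column in the output above being the sum of the other four columns.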

PMI.measure(): part mutual information

Part mutual information measures the nonlinear direct dependence between two random variables given a third, especially when either variable is potentially strongly correlated with the third. This property also makes part mutual information well suited to lncRNA-associated ceRNA network inference.

# data.frame data type
library(Informeasure)

load(system.file("extdata/tcga.brca.testdata.Rdata", package = "Informeasure"))

lncRNAexpression <- log2(lncRNAexpression + 1)
miRNAexpression  <- log2(miRNAexpression + 1)
mRNAexpression   <- log2(mRNAexpression + 1)

x <- as.numeric(miRNAexpression[which(rownames(miRNAexpression)   == "hsa-miR-26a-5p"), ])
y <- as.numeric(mRNAexpression[which(rownames(mRNAexpression)     == "PTEN"), ])
z <- as.numeric(lncRNAexpression[which(rownames(lncRNAexpression) == "PTENP1"), ])

XYZ <- discretize3D(x,y,z)

PMI.measure(XYZ)
##> [1] 1.074813

# SummarizedExperiment data type
library(Informeasure)
library(SummarizedExperiment)

load(system.file("extdata/tcga.brca.testdata.Rdata", package="Informeasure"))

lncRNAexpression <- as.matrix(lncRNAexpression)
se.lncRNAexpression = SummarizedExperiment(assays = list(lncRNAexpression = lncRNAexpression))

miRNAexpression <- as.matrix(miRNAexpression)
se.miRNAexpression = SummarizedExperiment(assays = list(miRNAexpression = miRNAexpression))

mRNAexpression <- as.matrix(mRNAexpression)
se.mRNAexpression = SummarizedExperiment(assays = list(mRNAexpression = mRNAexpression))

assays(se.lncRNAexpression)[["log2"]] <- log2(assays(se.lncRNAexpression)[["lncRNAexpression"]] + 1)

assays(se.miRNAexpression)[["log2"]] <- log2(assays(se.miRNAexpression)[["miRNAexpression"]] + 1)

assays(se.mRNAexpression)[["log2"]] <- log2(assays(se.mRNAexpression)[["mRNAexpression"]] + 1)

x <- assays(se.miRNAexpression["hsa-miR-26a-5p", ])$log2
y <- assays(se.mRNAexpression["PTEN", ])$log2
z <- assays(se.lncRNAexpression["PTENP1", ])$log2

XYZ <- discretize3D(x,y,z)

PMI.measure(XYZ)
##> [1] 1.074813
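
The following base R sketch illustrates part mutual information under the definition I take from [4]: PMI(X;Y|Z) = sum over (x,y,z) of p(x,y,z) log[ p(x,y|z) / (p*(x|z) p*(y|z)) ], with p*(x|z) = sum_y p(x|z,y) p(y) and p*(y|z) = sum_x p(y|z,x) p(x). It uses the empirical estimator and natural logarithms; pmi_from_counts() is a hypothetical name, and the sketch is not the package's implementation.

```r
# Part mutual information from a 3D count table with dimensions (x, y, z).
# ASSUMED definition (see lead-in): the conditional marginals p(x|z) and p(y|z)
# in the CMI ratio are replaced by p*(x|z) and p*(y|z).
pmi_from_counts <- function(xyz) {
  p <- xyz / sum(xyz)
  d <- dim(p)
  p_x <- apply(p, 1, sum)
  p_y <- apply(p, 2, sum)
  p_z <- apply(p, 3, sum)
  p_xz <- apply(p, c(1, 3), sum)
  p_yz <- apply(p, c(2, 3), sum)
  total <- 0
  for (k in seq_len(d[3])) {
    slice <- matrix(p[, , k], d[1], d[2])  # p(x, y, z = k)
    # p*(x | z = k) = sum_y p(x | y, z = k) * p(y)
    w_y <- ifelse(p_yz[, k] > 0, p_y / p_yz[, k], 0)
    pstar_x <- as.vector(slice %*% w_y)
    # p*(y | z = k) = sum_x p(y | x, z = k) * p(x)
    w_x <- ifelse(p_xz[, k] > 0, p_x / p_xz[, k], 0)
    pstar_y <- as.vector(w_x %*% slice)
    idx <- which(slice > 0, arr.ind = TRUE)  # only non-empty cells contribute
    total <- total + sum(slice[idx] *
      log((slice[idx] / p_z[k]) / (pstar_x[idx[, 1]] * pstar_y[idx[, 2]])))
  }
  total
}

# Fully independent table: counts factorise over x, y and z, so PMI = 0.
indep_counts <- outer(outer(c(1, 3), c(2, 2)), c(1, 1))

# XOR table (z = xor(x, y)): x and y are directly dependent given z, PMI = log(2).
xor_counts <- array(0, dim = c(2, 2, 2))
for (x in 0:1) for (y in 0:1) xor_counts[x + 1, y + 1, bitwXor(x, y) + 1] <- 1

pmi_indep <- pmi_from_counts(indep_counts)
pmi_xor   <- pmi_from_counts(xor_counts)
print(c(independent = pmi_indep, xor = pmi_xor))
```

Under this definition, part mutual information is never smaller than conditional mutual information for the same table, which agrees with the PMI value above exceeding the CMI value reported for the same three variables.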

Conclusions

This package implements five currently popular information measures. Its base installation lets users apply these measures to infer bivariate and even multivariate biological regulatory networks. Note that the package is not limited to bioinformatics applications: other research fields can also use it to evaluate information relations between variables in general.

Acknowledgement

I would like to thank Ms. Song Jing for her careful proofreading of the manuscript, Mr. Xianghua Wang for his helpful discussions on the PMI algorithm, and Dr. Junpeng Zhang, Mr. Nitesh Turaga and Mr. Martin Morgan for their informative suggestions on writing the R package. I would also like to thank my family for their persistent support during my most difficult times in 2020!

References

[1] Wyner A D. A definition of conditional mutual information for arbitrary ensembles[J]. Information & Computation, 1978, 38(1): 51-59.

[2] McGill W J. Multivariate information transmission[J]. Psychometrika, 1954, 19(2): 97-116.

[3] Williams P L, Beer R D. Nonnegative Decomposition of Multivariate Information[J]. arXiv: Information Theory, 2010.

[4] Zhao J, Zhou Y, Zhang X, et al. Part mutual information for quantifying direct associations in networks[J]. Proceedings of the National Academy of Sciences of the United States of America, 2016, 113(18): 5130-5135.

[5] Hausser J, Strimmer K. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks[J]. The Journal of Machine Learning Research, 2009, 10: 1469-1484.

Session information

sessionInfo()
##> R version 4.4.2 (2024-10-31)
##> Platform: x86_64-pc-linux-gnu
##> Running under: Ubuntu 24.04.1 LTS
##> 
##> Matrix products: default
##> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
##> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
##> 
##> locale:
##>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
##> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
##> 
##> time zone: Etc/UTC
##> tzcode source: system (glibc)
##> 
##> attached base packages:
##> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
##> [8] base     
##> 
##> other attached packages:
##>  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##>  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##>  [5] IRanges_2.41.1              S4Vectors_0.45.2           
##>  [7] BiocGenerics_0.53.3         generics_0.1.3             
##>  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
##> [11] Informeasure_1.17.0         BiocStyle_2.35.0           
##> 
##> loaded via a namespace (and not attached):
##>  [1] Matrix_1.7-1            jsonlite_1.8.9          crayon_1.5.3           
##>  [4] compiler_4.4.2          BiocManager_1.30.25     entropy_1.3.1          
##>  [7] jquerylib_0.1.4         yaml_2.3.10             fastmap_1.2.0          
##> [10] lattice_0.22-6          R6_2.5.1                XVector_0.47.0         
##> [13] S4Arrays_1.7.1          knitr_1.49              DelayedArray_0.33.2    
##> [16] maketools_1.3.1         GenomeInfoDbData_1.2.13 bslib_0.8.0            
##> [19] rlang_1.1.4             cachem_1.1.0            xfun_0.49              
##> [22] sass_0.4.9              sys_3.4.3               SparseArray_1.7.2      
##> [25] cli_3.6.3               zlibbioc_1.52.0         grid_4.4.2             
##> [28] digest_0.6.37           lifecycle_1.0.4         evaluate_1.0.1         
##> [31] buildtools_1.0.0        abind_1.4-8             rmarkdown_2.29         
##> [34] httr_1.4.7              tools_4.4.2             htmltools_0.5.8.1      
##> [37] UCSC.utils_1.3.0