netprioR requires the following data as input: a list of gene networks in the form of adjacency matrices, a matrix of quantitative phenotypes, and a vector of a priori known class labels, with unknown labels set to NA.
In the following steps we simulate data for a set of N = 1000 genes and a two-class prioritisation task (positive vs. negative) and benchmark the performance of our model against the case where we prioritise solely based on phenotypes.
We simulate the case where we know 100 labels a priori, which corresponds to 10% labelled data, with equal numbers of 50 known positives and 50 known negatives.
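A minimal setup consistent with this description, assuming the 1000 genes are split evenly into two classes of 500 genes each, is:
nlabel <- 100                     # number of a priori known labels (10% of 1000 genes)
members_per_class <- c(500, 500)  # assumed equal class sizes summing to N = 1000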
Then we simulate the labels, randomly choosing equal numbers of a priori known labels for each class.
class.labels <- simulate_labels(values = c("Positive", "Negative"),
                                sizes = members_per_class,
                                nobs = c(nlabel/2, nlabel/2))
The list of simulated labels contains the complete vector of 1000 labels, labels.true, and the vector of observed labels, labels.obs. Unknown labels are set to NA.
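The components of the returned list can be inspected by name, for example:
names(class.labels)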
[1] "labels.true" "labels.obs" "labelled" "unlabelled"
Next, we simulate high noise and low noise network data in the form of 1000 x 1000 adjacency matrices. The low noise networks obey the class structure defined above, whereas the high noise networks do not.
networks <- list(LOW_NOISE1 = simulate_network_scalefree(nmemb = members_per_class, pclus = 0.8),
                 LOW_NOISE2 = simulate_network_scalefree(nmemb = members_per_class, pclus = 0.8),
                 HIGH_NOISE = simulate_network_random(nmemb = members_per_class, nnei = 1))
The networks are sparse binary adjacency matrices, which we can visualise as images. This allows us to see the structure within the low noise networks, where we observe 80% of all edges in the 2nd and 4th quadrants, i.e. within each class, as defined above.
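A minimal sketch of such a visualisation, assuming the simulated networks are (or can be coerced to) sparse Matrix objects, uses the image() method from the Matrix package:
library(Matrix)
# Plot the adjacency matrix of one low noise network;
# non-zero entries (edges) concentrate within the two diagonal class blocks
image(Matrix(networks$LOW_NOISE1, sparse = TRUE))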
We simulate phenotypes matching our simulated labels by drawing from two normal distributions whose difference in means reflects the phenotype effect size. We first set the effect size and then simulate the phenotypes.
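The value assigned to the effect size is an assumption here; for illustration, a moderate separation between the class means could be chosen:
effect_size <- 0.5  # illustrative value; the original value is not shown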
phenotypes <- simulate_phenotype(labels.true = class.labels$labels.true, meandiff = effect_size, sd = 1)
The higher the phenotype effect size, the easier it is to separate the two classes solely based on the phenotype. We visualise the phenotypes for the two classes as follows.
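One way to produce such a plot, sketched here with ggplot2 (the helper data frame pheno.df is introduced only for illustration), is a per-class density plot of the simulated phenotype:
library(ggplot2)
# Density of the simulated phenotype, coloured by true class
pheno.df <- data.frame(phenotype = phenotypes[, 1],
                       class = class.labels$labels.true)
ggplot(pheno.df, aes(x = phenotype, fill = class)) +
    geom_density(alpha = 0.5)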
Based on the simulated data above, we now fit the netprioR model for gene prioritisation. In this example, we use hyperparameters a = b = 0.1 for the Gamma prior of the network weights in order to obtain a sparsifying prior. We fit only one model, setting nrestarts to 1, whereas in practice multiple restarts are used in order to avoid cases where the EM algorithm gets stuck in local optima. The convergence threshold for the relative change in the log likelihood is set to 1e-6.
np <- netprioR(networks = networks,
               phenotypes = phenotypes,
               labels = class.labels$labels.obs,
               nrestarts = 1,
               thresh = 1e-6,
               a = 0.1,
               b = 0.1,
               fit.model = TRUE,
               use.cg = FALSE,
               verbose = FALSE)
We can investigate the netprioR object using the summary() function.
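For example:
summary(np)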
#Genes: 1000
#Networks: 3
#Phenotypes: 1
#Labels: 100
Classes: Negative Positive
Model:
Likelihood[log]: -2942.597
Fixed effects: 0.1365992
Network weights:
Network Weight
LOW_NOISE1 450.51024
LOW_NOISE2 487.36416
HIGH_NOISE 54.94762
It is also possible to plot the netprioR object to get an overview of the model fit. Individual plots can be produced by setting which = "weights|ranks|lik".
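For example (a sketch, assuming the plot method for netprioR objects accepts the which argument as described above):
plot(np)                     # overview of the model fit
plot(np, which = "weights")  # individual plot, e.g. only the network weights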
We can see that the relative weights of the low noise networks are much higher than that of the high noise network, indicating that, as expected, the low noise networks are more informative for the learning task.
Second, we evaluate the performance of the prioritised list of genes by comparing the imputed missing labels Yimp[u] against the true underlying labels Y and computing the receiver operating characteristic (ROC) curve.
roc.np <- ROC(np, true.labels = class.labels$labels.true, plot = TRUE, main = "Prioritisation: netprioR")
In addition, we compute the ROC curve for the case where we prioritise solely based on the phenotype.
unlabelled <- which(is.na(class.labels$labels.obs))
roc.x <- roc(cases = phenotypes[intersect(unlabelled, which(class.labels$labels.true == levels(class.labels$labels.true)[1])), 1],
             controls = phenotypes[intersect(unlabelled, which(class.labels$labels.true == levels(class.labels$labels.true)[2])), 1],
             direction = ">")
plot.roc(roc.x, main = "Prioritisation: Phenotype-only", print.auc = TRUE, print.auc.x = 0.2, print.auc.y = 0.1)
Comparing the area under the receiver operating characteristic curve (AUC) values for both cases, we can see that by integrating network data and a priori known labels for TPs and TNs, we gain about 0.15 in AUC.
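Assuming that ROC() returns a pROC roc object, as roc() from the pROC package does (an assumption for roc.np), the two AUC values can be extracted and compared directly:
pROC::auc(roc.np)  # AUC for the netprioR prioritisation
pROC::auc(roc.x)   # AUC for the phenotype-only prioritisation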
Here is the output of sessionInfo()
on the system on
which this document was compiled:
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.7-1 pROC_1.18.5 netprioR_1.33.0 ggplot2_3.5.1
[5] pander_0.6.5 dplyr_1.1.4 knitr_1.48 BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] sparseMVN_0.2.2 sass_0.4.9 utf8_1.2.4
[4] generics_0.1.3 lattice_0.22-6 digest_0.6.37
[7] magrittr_2.0.3 evaluate_1.0.1 grid_4.4.1
[10] iterators_1.0.14 fastmap_1.2.0 plyr_1.8.9
[13] foreach_1.5.2 doParallel_1.0.17 jsonlite_1.8.9
[16] gridExtra_2.3 BiocManager_1.30.25 fansi_1.0.6
[19] scales_1.3.0 codetools_0.2-20 jquerylib_0.1.4
[22] cli_3.6.3 rlang_1.1.4 munsell_0.5.1
[25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
[28] tools_4.4.1 parallel_4.4.1 colorspace_2.1-1
[31] buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1
[34] lifecycle_1.0.4 pkgconfig_2.0.3 pillar_1.9.0
[37] bslib_0.8.0 gtable_0.3.6 glue_1.8.0
[40] Rcpp_1.0.13 highr_0.11 xfun_0.48
[43] tibble_3.2.1 tidyselect_1.2.1 sys_3.4.3
[46] farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
[49] rmarkdown_2.28 maketools_1.3.1 compiler_4.4.1