McGeary, Lin et al. (2019) used RNA bind-n-seq (RBNS) to empirically
determine the affinities (i.e. dissociation constants) of selected miRNAs
towards random 12-nucleotide sequences (termed 12-mers). As expected, bound
sequences typically exhibited complementarity to the miRNA seed region
(positions 2-8 from the miRNA’s 5’ end), but the study also revealed
non-canonical binding and the importance of flanking di-nucleotides. Based
on these data, the authors developed a model which predicts 12-mer
dissociation constants (KD) based on the miRNA sequence. ScanMiR encodes a
compressed version of these predictions in the form of a KdModel object.
The 12-mer is defined as the 8 nucleotides opposite the miRNA’s extended seed region plus flanking dinucleotides on either side.
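As a minimal sketch of this definition (base R only; the helper names `rev_comp`, `canonical_seed` and `split_12mer` are hypothetical and not part of scanMiR), the canonical 8-nucleotide site can be derived as the reverse complement of miRNA positions 2-8 followed by an "A", and a 12-mer can be split into its two flanking dinucleotides and its central 8-mer. The 12-mer used here is an arbitrary illustration.

```r
# Reverse complement of a DNA string (base R, no Biostrings needed)
rev_comp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Canonical 8-nt target site: reverse complement of miRNA positions 2-8,
# followed by the "A" facing the first miRNA nucleotide
canonical_seed <- function(mirseq) {
  dna <- chartr("U", "T", toupper(mirseq))
  paste0(rev_comp(substr(dna, 2, 8)), "A")
}

# Split a 12-mer into 5' flank, central 8-mer and 3' flank
split_12mer <- function(x) {
  stopifnot(nchar(x) == 12)
  c(flank5 = substr(x, 1, 2),
    mer8   = substr(x, 3, 10),
    flank3 = substr(x, 11, 12))
}

seed <- canonical_seed("UUAAUGCUAAUCGUGAUAGGGGUU")  # hsa-miR-155-5p
seed
parts <- split_12mer("CTAGCATTAAGT")                # illustrative 12-mer
parts
```

For hsa-miR-155-5p this gives the same site as the "Canonical target seed" reported when the model is printed.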
The KdModel class contains the information on the sequence (12-mer)
affinity of a given miRNA. It is meant to compress the dissociation
constant (Kd) predictions from McGeary, Lin et al. (2019) and make them
easy to manipulate.
We can take a look at the example KdModel:
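The printout that follows can presumably be reproduced along these lines. This sketch assumes that the scanMiR package is installed and that the example model ships as the `SampleKdModel` dataset (the object name used later in this document).

```r
library(scanMiR)
# load the example model shipped with the package
data("SampleKdModel")
# printing the object shows the miRNA name, sequence and canonical seed
SampleKdModel
```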
## A `KdModel` for hsa-miR-155-5p (Conserved across mammals)
## Sequence: UUAAUGCUAAUCGUGAUAGGGGUU
## Canonical target seed: AGCATTA(A)
In addition to the information necessary to predict the binding
affinity to any given 12-mer sequence, the model contains, minimally,
the name and sequence of the miRNA. Since the KdModel class extends the
list class, any further information can be stored in it.
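For instance, an extra element can be attached with the usual list syntax. This is a small sketch assuming scanMiR is installed; the element name `myVariable` is arbitrary.

```r
library(scanMiR)
data("SampleKdModel")

mod <- SampleKdModel        # work on a copy of the example model
mod$myVariable <- "test"    # arbitrary extra element (hypothetical name)
mod$myVariable
mod$name                    # the mandatory elements remain accessible
```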
An overview of the binding affinities can be obtained with the following plot:
The plot gives the -log(Kd) values of the top 7-mers (including both canonical and non-canonical sites), with or without the final “A” vis-à-vis the first miRNA nucleotide.
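A plot of this kind can probably be generated as sketched below, assuming scanMiR is installed. The `what = "seeds"` argument is used here on the assumption that it restricts the figure to the 7-mer affinity panel; leaving it out should give the default display.

```r
library(scanMiR)
data("SampleKdModel")

# affinity overview of the top 7-mers for the example model
p <- plotKdModel(SampleKdModel, what = "seeds")
p
```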
To predict the dissociation constant (and binding type, if any) of a
given 12-mer sequence, you can use the assignKdType function:
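The two results shown next are consistent with calls of the following form, assuming scanMiR is installed. The 12-mers are illustrative choices: the first has no seed complementarity, while the second contains the canonical site AGCATTAA.

```r
library(scanMiR)
data("SampleKdModel")

# a single 12-mer without seed complementarity
assignKdType("ACGTACGTACGT", SampleKdModel)

# several 12-mers at once; the first one contains the canonical 8mer site
res <- assignKdType(c("CTAGCATTAAGT", "ACGTACGTACGT"), SampleKdModel)
res
```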
##            type log_kd
## 1 non-canonical      0
##            type log_kd
## 1          8mer  -5129
## 2 non-canonical      0
The log_kd column contains log(Kd) values (natural logarithm) multiplied by 1000 and stored as integers (which is more economical when dealing with millions of sites). In the example above, -5129 means -5.129, i.e. a dissociation constant of exp(-5.129) ≈ 0.0059225. The smaller the value, the stronger the relative affinity.
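The conversion back to a dissociation constant can be sketched in base R. The use of the natural logarithm is inferred from the numbers quoted above, not taken from the package documentation.

```r
# log_kd as stored: integer, log(Kd) multiplied by 1000
log_kd_int <- c(-5129L, 0L)

log_kd <- log_kd_int / 1000   # back to log(Kd)
kd <- exp(log_kd)             # back to Kd
signif(kd, 5)                 # the first value is about 0.0059225

# integers take less memory than doubles for the same number of sites
object.size(integer(1e6)) < object.size(numeric(1e6))
```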
A KdModelList object is simply a collection of KdModel objects. We can
build one in the following way:
# we create a copy of the KdModel, and give it a different name:
mod2 <- SampleKdModel
mod2$name <- "dummy-miRNA"
kml <- KdModelList(SampleKdModel, mod2)
kml
## An object of class "KdModelList"
## [[1]]
## A `KdModel` for hsa-miR-155-5p (Conserved across mammals)
## Sequence: UUAAUGCUAAUCGUGAUAGGGGUU
## Canonical target seed: AGCATTA(A)
## [[2]]
## A `KdModel` for dummy-miRNA (Conserved across mammals)
## Sequence: UUAAUGCUAAUCGUGAUAGGGGUU
## Canonical target seed: AGCATTA(A)
## A `KdModelList` object containing binding affinity models from 2 miRNAs.
##
##               Low-confidence             Poorly conserved 
##                            0                            0 
##     Conserved across mammals Conserved across vertebrates 
##                            2                            0 
Beyond operations typically performed on a list (e.g. subsetting), some specific slots of the respective KdModels can be accessed, for example:
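A short sketch of such operations, assuming scanMiR is installed. The conservation table shown earlier is consistent with a call to `summary()` on the list, and the per-model conservation levels with a call to `conservation()`.

```r
library(scanMiR)
data("SampleKdModel")

# rebuild the two-model list used in this section
mod2 <- SampleKdModel
mod2$name <- "dummy-miRNA"
kml <- KdModelList(SampleKdModel, mod2)

length(kml)                # ordinary list operations work
kml[[2]]$name              # extract one model and read its name
summary(kml)               # overview by conservation category
cons <- conservation(kml)  # conservation level of each model
cons
```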
##           hsa-miR-155-5p              dummy-miRNA 
## Conserved across mammals Conserved across mammals 
## 4 Levels: Low-confidence Poorly conserved ... Conserved across vertebrates
KdModel objects are meant to be created from a table assigning log_kd
values to 12-mer target sequences, as produced by the CNN from McGeary,
Lin et al. (2019). For the purpose of example, we create such a dummy
table:
##         X12mer log_kd
## 1 AAAGCAAAAAAA -0.428
## 2 CAAGCACAAACA -0.404
## 3 GAAGCAGAAAGA -0.153
## 4 TAAGCATAAATA -1.375
## 5 ACAGCAACAAAC -0.448
## 6 CCAGCACCAACC -0.274
A KdModel object can then be created from such a table with getKdModel,
supplying the miRNA sequence (mirseq) and name.
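The two steps can be sketched as follows, assuming scanMiR is installed and that its `dummyKdData()` helper returns a table of the kind printed above. The model name "my-miRNA" is an arbitrary label, and the sequence is that of the example miRNA written in the DNA alphabet.

```r
library(scanMiR)

# dummy table of 12-mers and their log_kd values
kd <- dummyKdData()
head(kd)

# build the model from the table, the miRNA sequence and a name
mod3 <- getKdModel(kd = kd,
                   mirseq = "TTAATGCTAATCGTGATAGGGGTT",
                   name = "my-miRNA")
mod3
```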
Alternatively, the kd argument can also be the path to the output file of
the CNN (and if mirseq and name are in the table, they can be omitted).
The scanMiRData package contains KdModel collections corresponding to all
human, mouse and rat miRBase miRNAs.
When calling getKdModel, the dissociation constants are stored as a
lightweight, overfitted linear model, with base Kd coefficients (stored as
integers in object$mer8) for each of the 1024 partially-matching 8-mers
(i.e. those with at least 4 consecutive matching nucleotides), to which are
added 8-mer-specific coefficients (stored in object$fl) that are multiplied
by a flanking score generated from the flanking di-nucleotides. The
flanking score is calculated based on the di-nucleotide effects
experimentally measured by McGeary, Lin et al. (2019). To save space, the
actual 8-mer sequences are not stored but generated when needed in a
deterministic fashion. The 8-mers can be obtained, in the right order, with
the getSeed8mers function.
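To make the structure of this linear model concrete, here is a toy sketch in base R. It assumes that the prediction for a 12-mer is the base coefficient of its central 8-mer plus the 8-mer-specific flanking coefficient multiplied by the flanking score, as described above. All numbers and the names `toy_mer8`, `toy_fl` and `predict_log_kd` are made up for illustration; they do not reflect scanMiR's internals or real coefficients.

```r
# Toy coefficients for three 8-mers (hypothetical values, stored as integers)
toy_mer8 <- c(AGCATTAA = -5000L, AGCATTAC = -4000L, TGCATTAA = -3500L)
toy_fl   <- c(AGCATTAA = 300L,   AGCATTAC = 250L,   TGCATTAA = 200L)

# log_kd = base coefficient + flanking coefficient * flanking score
predict_log_kd <- function(mer8, flank_score) {
  as.integer(round(toy_mer8[[mer8]] + toy_fl[[mer8]] * flank_score))
}

# same 8-mer with two different (made-up) flanking scores
weak   <- predict_log_kd("AGCATTAA", flank_score =  0.5)
strong <- predict_log_kd("AGCATTAA", flank_score = -0.5)
c(weak = weak, strong = strong)
```

Storing two integer vectors per miRNA in this way is far more compact than keeping one value for every possible 12-mer, which is the point of the compression.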
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scanMiR_1.13.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 generics_0.1.3 stringi_1.8.4
## [4] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
## [7] grid_4.4.2 fastmap_1.2.0 jsonlite_1.8.9
## [10] GenomeInfoDb_1.43.2 BiocManager_1.30.25 httr_1.4.7
## [13] UCSC.utils_1.3.0 scales_1.3.0 Biostrings_2.75.3
## [16] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
## [19] rlang_1.1.4 crayon_1.5.3 XVector_0.47.1
## [22] cowplot_1.1.3 munsell_0.5.1 withr_3.0.2
## [25] cachem_1.1.0 yaml_2.3.10 tools_4.4.2
## [28] parallel_4.4.2 BiocParallel_1.41.0 seqLogo_1.73.0
## [31] colorspace_2.1-1 ggplot2_3.5.1 GenomeInfoDbData_1.2.13
## [34] BiocGenerics_0.53.3 buildtools_1.0.0 vctrs_0.6.5
## [37] R6_2.5.1 stats4_4.4.2 lifecycle_1.0.4
## [40] pwalign_1.3.1 S4Vectors_0.45.2 IRanges_2.41.2
## [43] pkgconfig_2.0.3 bslib_0.8.0 pillar_1.10.0
## [46] gtable_0.3.6 glue_1.8.0 data.table_1.16.4
## [49] xfun_0.49 tibble_3.2.1 GenomicRanges_1.59.1
## [52] sys_3.4.3 knitr_1.49 farver_2.1.2
## [55] htmltools_0.5.8.1 labeling_0.4.3 rmarkdown_2.29
## [58] maketools_1.3.1 compiler_4.4.2