12-mer dissociation constants

McGeary, Lin et al. (2019) used RNA bind-n-seq (RBNS) to empirically determine the affinities (i.e. dissociation constants) of selected miRNAs towards random 12-nucleotide sequences (termed 12-mers). As expected, bound sequences typically exhibited complementarity to the miRNA seed region (positions 2-8 from the miRNA’s 5’ end), but the study also revealed non-canonical binding and the importance of flanking di-nucleotides. Based on these data, the authors developed a model that predicts 12-mer dissociation constants (KD) from the miRNA sequence. scanMiR encodes a compressed version of these predictions in the form of a KdModel object.

The 12-mer is defined as the 8 nucleotides opposite the miRNA’s extended seed region plus flanking dinucleotides on either side.
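As a rough illustration of this layout, the base-R sketch below splits a 12-mer into a left flank, an 8-nucleotide core and a right flank. It assumes the flanks occupy positions 1-2 and 11-12 of the 12-mer, which is consistent with the 8mer example sequence used later in this document. The helper `split12mer` is hypothetical and not part of scanMiR.

```r
# Hypothetical helper (not a scanMiR function): split 12-mers into
# a 2-nt left flank, the 8-nt core, and a 2-nt right flank.
split12mer <- function(x) {
  stopifnot(is.character(x), all(nchar(x) == 12L))
  data.frame(
    flank_left  = substr(x, 1, 2),
    core8       = substr(x, 3, 10),
    flank_right = substr(x, 11, 12),
    stringsAsFactors = FALSE
  )
}

parts <- split12mer(c("CTAGCATTAAGT", "ACGTACGTACGT"))
parts
```

For the first sequence, the core is `AGCATTAA`, i.e. the canonical target seed of hsa-miR-155-5p including the final "A".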

KdModels

The KdModel class contains the information concerning the sequence (12-mer) affinity of a given miRNA. It is meant to store the dissociation constant (Kd) predictions from McGeary, Lin et al. (2019) in a compressed and easily manipulated form.

We can take a look at the example KdModel:

library(scanMiR)
data(SampleKdModel)
SampleKdModel

## A `KdModel` for hsa-miR-155-5p (Conserved across mammals)
##   Sequence: UUAAUGCUAAUCGUGAUAGGGGUU
##   Canonical target seed: AGCATTA(A)

In addition to the information necessary to predict the binding affinity to any given 12-mer sequence, the model contains, minimally, the name and sequence of the miRNA. Since the KdModel class extends the list class, any further information can be stored:

SampleKdModel$myVariable <- "test"

An overview of the binding affinities can be obtained with the following plot:

plotKdModel(SampleKdModel, what="seeds")

The plot gives the -log(Kd) values of the top 7-mers (including both canonical and non-canonical sites), with or without the final “A” vis-à-vis the first miRNA nucleotide.

To predict the dissociation constant (and binding type, if any) of a given 12-mer sequence, you can use the assignKdType function:

assignKdType("ACGTACGTACGT", SampleKdModel)

##            type log_kd
## 1 non-canonical      0

# or using multiple sequences:
assignKdType(c("CTAGCATTAAGT","ACGTACGTACGT"), SampleKdModel)

##            type log_kd
## 1          8mer  -5129
## 2 non-canonical      0

The log_kd column contains log(Kd) values multiplied by 1000 and stored as integers (which is more economical when dealing with millions of sites). In the example above, -5129 means a log(Kd) of -5.129, i.e. a dissociation constant of exp(-5.129) ≈ 0.0059225. The smaller (more negative) the value, the stronger the relative affinity.
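The conversion can be reproduced in base R. This small sketch assumes that the logarithm is the natural one, which matches the value of 0.0059225 quoted above; the variable names are illustrative only.

```r
# Integer-encoded value as reported in the log_kd column
log_kd_int <- -5129L

# Undo the x1000 scaling, then exponentiate (natural log assumed)
log_kd <- log_kd_int / 1000
kd <- exp(log_kd)

round(kd, 7)
```

The same arithmetic applies element-wise to a whole log_kd column, since division and `exp` are vectorized in R.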

KdModelLists

A KdModelList object is simply a collection of KdModel objects. We can build one in the following way:

# we create a copy of the KdModel, and give it a different name:
mod2 <- SampleKdModel
mod2$name <- "dummy-miRNA"
kml <- KdModelList(SampleKdModel, mod2)
kml

## An object of class "KdModelList"
## [[1]]
## A `KdModel` for hsa-miR-155-5p (Conserved across mammals)
##   Sequence: UUAAUGCUAAUCGUGAUAGGGGUU
##   Canonical target seed: AGCATTA(A)
## [[2]]
## A `KdModel` for dummy-miRNA (Conserved across mammals)
##   Sequence: UUAAUGCUAAUCGUGAUAGGGGUU
##   Canonical target seed: AGCATTA(A)

summary(kml)

## A `KdModelList` object containing binding affinity models from 2 miRNAs.
## 
##               Low-confidence             Poorly conserved 
##                            0                            0 
##     Conserved across mammals Conserved across vertebrates 
##                            2                            0

Beyond operations typically performed on a list (e.g. subsetting), some specific slots of the respective KdModels can be accessed, for example:

conservation(kml)

##           hsa-miR-155-5p              dummy-miRNA 
## Conserved across mammals Conserved across mammals 
## 4 Levels: Low-confidence Poorly conserved ... Conserved across vertebrates

Creating a KdModel object

KdModel objects are meant to be created from a table assigning log_kd values to 12-mer target sequences, as produced by the CNN from McGeary, Lin et al. (2019). For the purpose of this example, we create a dummy table of this kind:

kd <- dummyKdData()
head(kd)

##         X12mer log_kd
## 1 AAAGCAAAAAAA -0.428
## 2 CAAGCACAAACA -0.404
## 3 GAAGCAGAAAGA -0.153
## 4 TAAGCATAAATA -1.375
## 5 ACAGCAACAAAC -0.448
## 6 CCAGCACCAACC -0.274

A KdModel object can then be created with:

mod3 <- getKdModel(kd=kd, mirseq="TTAATGCTAATCGTGATAGGGGTT", name = "my-miRNA")

Alternatively, the kd argument can also be the path to the output file of the CNN (and if mirseq and name are in the table, they can be omitted).

Common KdModel collections

The scanMiRData package contains KdModel collections corresponding to all human, mouse and rat miRBase miRNAs.

Under the hood

When calling getKdModel, the dissociation constants are stored as a lightweight overfitted linear model. The model has base Kd coefficients (stored as integers in object$mer8) for each of the 1024 partially-matching 8-mers (i.e. those with at least 4 consecutive matching nucleotides). To these are added 8-mer-specific coefficients (stored in object$fl) that are multiplied by a flanking score generated from the flanking di-nucleotides. The flanking score is calculated based on the di-nucleotide effects experimentally measured by McGeary, Lin et al. (2019). To save space, the actual 8-mer sequences are not stored but generated when needed in a deterministic fashion. The 8-mers can be obtained, in the right order, with the getSeed8mers function.
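The linear form described above can be sketched with toy numbers in base R. This is only an illustration of the idea (a base coefficient plus an 8-mer-specific coefficient times a flanking score), not scanMiR's actual implementation: the coefficients are invented, only three 8-mers are included, and the function `toyLogKd` as well as the scale and sign convention of `flank_score` are assumptions.

```r
# Toy coefficients for three 8-mers. The numbers are invented for
# illustration and are NOT taken from any real KdModel.
mer8 <- c(AGCATTAA = -5000L, AGCATTAC = -4200L, CGCATTAA = -3100L)  # base log_kd x 1000
fl   <- c(AGCATTAA = 300L,   AGCATTAC = 250L,   CGCATTAA = 100L)    # flank coefficients

# Hypothetical predictor: base coefficient of the 8-mer plus its
# flank coefficient multiplied by a (here user-supplied) flanking score.
toyLogKd <- function(core8, flank_score) {
  stopifnot(all(core8 %in% names(mer8)))
  unname(mer8[core8] + fl[core8] * flank_score)
}

toyLogKd("AGCATTAA", 0)
toyLogKd(c("AGCATTAA", "CGCATTAA"), c(-0.5, 1))
```

With a flanking score of 0 the prediction reduces to the base coefficient of the 8-mer, so in this sketch the flanks only shift the affinity of a site around its 8-mer baseline.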

Session info

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scanMiR_1.13.0   BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9              generics_0.1.3          stringi_1.8.4          
##  [4] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
##  [7] grid_4.4.2              fastmap_1.2.0           jsonlite_1.9.0         
## [10] GenomeInfoDb_1.43.4     BiocManager_1.30.25     httr_1.4.7             
## [13] UCSC.utils_1.3.1        scales_1.3.0            Biostrings_2.75.4      
## [16] codetools_0.2-20        jquerylib_0.1.4         cli_3.6.4              
## [19] rlang_1.1.5             crayon_1.5.3            XVector_0.47.2         
## [22] cowplot_1.1.3           munsell_0.5.1           withr_3.0.2            
## [25] cachem_1.1.0            yaml_2.3.10             tools_4.4.2            
## [28] parallel_4.4.2          BiocParallel_1.41.2     seqLogo_1.73.0         
## [31] colorspace_2.1-1        ggplot2_3.5.1           GenomeInfoDbData_1.2.13
## [34] BiocGenerics_0.53.6     buildtools_1.0.0        vctrs_0.6.5            
## [37] R6_2.6.1                stats4_4.4.2            lifecycle_1.0.4        
## [40] pwalign_1.3.2           S4Vectors_0.45.4        IRanges_2.41.3         
## [43] pkgconfig_2.0.3         bslib_0.9.0             pillar_1.10.1          
## [46] gtable_0.3.6            glue_1.8.0              data.table_1.17.0      
## [49] xfun_0.51               tibble_3.2.1            GenomicRanges_1.59.1   
## [52] sys_3.4.3               knitr_1.49              farver_2.1.2           
## [55] htmltools_0.5.8.1       labeling_0.4.3          rmarkdown_2.29         
## [58] maketools_1.3.2         compiler_4.4.2