scanMiR
can be used to identify potential binding sites given a set of miRNAs
and a set of transcripts. Furthermore, it determines the type of binding
site and, given a KdModel
object, the predicted affinity of
the site.
The main function used for determining matches of miRNAs in a given
set of sequences is findSeedMatches
. It accepts a set of
DNA sequences either as a character vector or as a DNAStringSet.
The miRNAs can be provided either as a character vector of seeds/miRNA
sequences or as a KdModelList
.
The seed must be given in the form of a (RNA or DNA) sequence of length 7 or 8 (the 8th nucleotide being the final ‘A’ - it is added if only 7 are given). Note that the seed should be given as it would appear in a match in the target sequence (i.e. the reverse complement of how it appears in the miRNA).
library(scanMiR)
# seed sequence of hsa-miR-155-5p
seed <- "AGCAUUAA"
# load a sample transcript
data("SampleTranscript")
# run scan
matches <- findSeedMatches(SampleTranscript, seed, verbose = FALSE)
matches
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | type
## <Rle> <IRanges> <Rle> | <factor>
## [1] seq1 491-498 * | 8mer
## [2] seq1 692-699 * | 7mer-m8
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
By default, a GRanges
object is returned. Apart from the position of the matches, it provides
information on the type of the putative binding site corresponding to
the match. Setting ret = "data.frame"
returns the same
information as a data.frame.
Alternatively, we can provide the full miRNA sequence, which results in additional information on supplementary 3’ pairing in the form of an aggregated score (see Section @ref(sec:3pAlignment) for further details).
# full sequence of the mature miR-155-5p transcript
miRNA <- "UUAAUGCUAAUCGUGAUAGGGGUU"
# run scan
matches <- findSeedMatches(SampleTranscript, miRNA, verbose = FALSE)
matches
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | type p3.score note
## <Rle> <IRanges> <Rle> | <factor> <integer> <Rle>
## [1] seq1 491-498 * | 8mer 12 TDMD?
## [2] seq1 692-699 * | 7mer-m8 0 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can take a closer look at the alignment of the first match:
##
## miRNA 3'-UUGGGGAUAGUGC-UAA-UCGUAAUU-5'
## |||||||||||| ||||||||
## target 5'-...NGAGUAACGACCCCUAUCACGUCCGCAGCAUUAAAU...-3'
Apart from the direct seed match (right), this representation also reveals the extensive supplementary 3’ pairing (left).
Finally, we can provide the miRNA in the form of a
KdModel
(see the vignette on
KdModels for further information). In this case
findSeedMatches
also returns the predicted affinity value
for each match. The log_kd
column contains log(Kd) values
multiplied by 1000, where Kd is the predicted dissociation constant of
miRNA:mRNA binding for the putative binding site.
# load sample KdModel
data("SampleKdModel")
# run scan
matches <- findSeedMatches(SampleTranscript, SampleKdModel, verbose = FALSE)
matches
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | type log_kd p3.score note
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <Rle>
## [1] seq1 491-498 * | 8mer -4868 12 TDMD?
## [2] seq1 692-699 * | 7mer-m8 -3702 0 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Running findSeedMatches
using a Kdmodel
also returns matches that correspond to non-canonical binding sites.
These are typically of low affinity, but may affect repression if
several of them are found on the same transcript. The scan can be
restricted to canonical sites using the option
onlyCanonical = TRUE
.
KdModel
collections corresponding to all human, mouse
and rat mirbase miRNAs can be obtained through the scanMiRData
package.
For canonical sites, we use the site types described in Grimson et al., Molecular Cell 2007 based on the matching nucleotides form the miRNA seed:
In addition, we include with two additional site types that are classically considered non-canonical, but have much stronger evidence of binding than other non-canonical ones, namely G-bulged sites that have an extra non-matching G in position 5–6 that is bulged out (see Chi, Hannon & Darnell, 2012, and wobble sites have a single G:U replacement (see Becker et al., 2019).
Also note that all binding sites are reported as the alignment of the seed region of the miRNA onto the target mRNA, and as such it is always 8 nucleotide long, even if only 6 or 7nt are actually matching the seed.
If the transcript sequences are provided as a
DNAStringSet
, one may specify the length of the open
reading frame region of the transcripts as a metadata column in order to
distinguish between matches in the ORF and 3’UTR regions.
library(Biostrings)
# generate set of random sequences
seqs <- DNAStringSet(getRandomSeq(length = 1000, n = 10))
# add vector of ORF lengths
mcols(seqs)$ORF.length <- sample(500:800, length(seqs))
# run scan
matches2 <- findSeedMatches(seqs, SampleKdModel, verbose = FALSE)
head(matches2)
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | type log_kd p3.score note ORF
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <Rle> <Rle>
## [1] seq1 921-928 * | 6mer -3195 0 - FALSE
## [2] seq2 4-11 * | 7mer-m8 -3093 0 - TRUE
## [3] seq2 863-870 * | 6mer-m8 -1521 0 - FALSE
## [4] seq3 73-80 * | 6mer -2491 0 - TRUE
## [5] seq4 586-593 * | non-canonical -1574 4 - FALSE
## [6] seq8 865-872 * | 6mer-a1 -1794 0 - FALSE
## -------
## seqinfo: 10 sequences from an unspecified genome; no seqlengths
Upon binding the seed regions, further complementary pairing of the
target to the 3’ region of the miRNA can increase affinity and further
stabilize the binding (Schirle,
Sheu-Gruttadauria and MacRae, 2014). Upon finding a seed match,
scanMiR
performs a local alignment on the upstream region
to identify such complementary 3’ binding. This is internally done by
the get3pAlignment()
function, the arguments to which
(e.g. the maximum size of the gap between the seed binding and the
complementary binding) can be passed via the
findSeedMatches
argument p3.params
. By
default, when running findSeedMatches
a 3’ score is
reported in the matches, which roughly corresponds to the number of
consecutive matching nucleotides (adjusting for small gaps and T/G
substitutions) within the constraints (see ?get3pAlignment
for more detail). More information (such as the size of the miRNA and
target loops between the two complementary regions) can be reported by
setting findSeedMatches(..., p3.extra=TRUE)
. In addition,
the pairing can be visualized with viewTargetAlignment
:
##
## miRNA 3'-UUGGGGAUAGUGC-UAA-UCGUAAUU-5'
## |||||||||||| ||||||||
## target 5'-...NGAGUAACGACCCCUAUCACGUCCGCAGCAUUAAAU...-3'
Some forms of 3’ bindings can however lead to drastic functional
consequences. For example, sufficient final complementary at the 3’ end
of the miRNA can lead to Target-Directed miRNA Degradation (TDMD, Sheu-Gruttadauria,
Pawlica et al., 2019). findSeedMatches
will also flag
such putative sites in the notes
column of the matches.
Finally, while circular RNAs can act as miRNA sponges, some miRNA
bindings can slice their circular structure Hansen et al., 2011
and free their cargo. findSeedMatches
will also flag such
sites in the notes
column.
The shadow
argument can be used to take into account the
observation that sites within the first ~15 nucleotides of the 3’UTR
show poor efficiency (Grimson
et al. 2007). findSeedMatches
will treat matches within
the first shadow
positions of the UTR in the same way as
matches in the ORF region. If no information on ORF lengths is provided,
it will simply ignore the first shadow
positions. The
default setting is shadow = 0
.
The parameter minDist
can be used to specify the minimum
distance between matches of the same miRNA (default 7). If there are
multiple matches within minDist
, only the highest affinity
match will be considered.
With ret = "aggregated"
one obtains a data.frame that
contains the predicted repression for each sequence-seed-pair aggregated
over all matches along with information about the types and number of
matches. Parameters for aggregation can be specified using
agg.params
. For further details, see Section
@ref(sec:aggregating).
scanMiR implements aggregation of miRNA sites based on the biochemical model from McGeary, Lin et al. (2019). This model first predicts the occupancy of AGO-miRNA complexes at each potential binding site as a function of the measured or estimated dissociation constants (Kds). It then assumes an additive effect of the miRNA on the basal decay rate of the transcript that is proportional to this occupancy.
The key parameters of this model are:
a
: the relative concentration of unbound AGO-miRNA
complexesb
: the factor that multiplies with the occupancy and is
added to the basal decay rate (can be interpreted as the additional
repression caused by a single bound AGO)c
: the penalty factor for sites that are found within
the ORF regionMore specifically, the occupancy of a mRNA m by miRNA g, with p matches in the ORF region and q matches in the 3’UTR region, is given by the following equation: $$ \begin{equation} N_{m,g} = \sum_{i=1}^{p}\left(\frac{a_g}{a_g + c_{\text{ORF}} K_{d,i}^{\text{ORF}}}\right) + \sum_{j=1}^{q}\left(\frac{a_g}{a_g + K_{d,j}^{\text{3'UTR}}}\right) \end{equation} $$
The corresponding background occupancy is estimated by substituting
the average affinity of nonspecifically bound sites (i.e. Kd = 1.0): $$
\begin{equation}
N_{m,g,\text{background}} =
\sum_{i=1}^{p}\left(\frac{a_g}{a_g + c_{\text{ORF}}}\right) +
\sum_{j=1}^{q}\left(\frac{a_g}{a_g + 1}\right)
\end{equation}
$$ In addition to this original model, scanMiR
includes a coefficient e
which adjusts the Kd values based
on the supplementary 3’ alignment:
$$ \begin{equation} N_{m,g} = \sum_{i=1}^{p}\left(\frac{a_g}{a_g + e_{i}c_{\text{ORF}} K_{d,i}^{\text{ORF}}}\right) + \sum_{j=1}^{q}\left(\frac{a_g}{a_g + e_{j}K_{d,j}^{\text{3'UTR}}}\right) \end{equation} $$
with ei = exp (p3 ⋅ p3.scorei).
p3
is a global parameter, and p3.scorei
is the 3’ alignment score (roughly corresponding to the number of
matched nucleotides, by default capped to 8 and set to 0 if below 3). Of
note, the default value of p3
is very small, leading to a
very mild effect. The importance of complementary binding seems to
depend on the miRNA, and at the moment there is no easy way to predict
this from the miRNA sequence. Our conservative estimate might not
attribute sufficient importance to this factor for some miRNAs.
The repression is then obtained as the log fold change of the two occupancies: repression = log (1 + bNm, g, background) − log (1 + bNm, g)
Because UTR and ORF lengths have been reported to influence the
efficacy of repression, scanMiR
also includes an additional
modifier to terms handling these effects:
repressionadj = repression ⋅ (1 + f ⋅ UTR.length + h ⋅ ORF.length)
While b
, c
, p3
, f
and h
are considered global parameters (i.e. the same for
different miRNAs and transcripts and also across experimental contexts),
a
is expected to be different for each miRNA in a given
experimental condition. However, as shown by McGeary, Lin et
al. (2019), the model performance is robust to changes in
a
over several orders of magnitude. Aggregation for all
miRNA-transcript pairs for a given data set is therefore usually based
on a single a
value.
Given a GRanges
or data.frame of matches as returned by
findSeedMatches
, aggregation can be performed by the
function aggregateMatches
:
## transcript repression 8mer 7mer 6mer non-canonical ORF.canonical
## 1 seq1 -0.33839851 0 0 1 0 0
## 2 seq10 -0.07139141 0 0 0 1 0
## 3 seq2 -0.13229858 0 0 1 0 1
## 4 seq3 -0.03831961 0 0 0 0 1
## 5 seq4 -0.06970898 0 0 0 1 0
## 6 seq8 -0.08990797 0 0 1 0 0
## ORF.nonCanonical
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
This returns a data.frame with predicted repression values for each
miRNA-transcript pair along with a count table of the different site
types. If matches
does not contain a log_kd
column, only the count table will be returned.
scanMiR uses the following default parameter values for aggregation that have been determined by fitting and validating the model using several experimental data sets:
## a b c p3 coef_utr coef_orf
## 0.007726 0.573500 0.181000 0.051000 -0.171060 -0.215460
## keepSiteInfo
## 1.000000
Where coef_utr
and coef_orf
respectively
correspond to the f
and h
in the above
formula. To disable these features, they can simply be set to zero.
keepSiteInfo
lets you choose whether the site count table
should be returned. The parameters can be passed directly to
aggregateMatches
, or passed to findSeedMatch
when doing aggregation on-the-fly using the agg.params
argument.
To deal with large amounts of sequences and/or seeds, both
findSeedMatches
and aggregateMatches
support
multithreading using the BiocParallel
package. This can be activated by passing
BP = MulticoreParam(ncores)
.
Depending on the system and the size of the scan (i.e. when including
all non-canonical sites), mutlithreading can potentially take a large
amount of memory. To avoid memory issues, the number of seeds processed
simultaneously by findSeedMatches
can be restricted using
the n_seeds
parameter. Alternatively, scan results can be
saved to temporary files using the useTmpFiles
argument
(see ?findSeedMatches
for more detail).
Note that in addition to the multithreading specified in its
arguments, aggregateMatches
uses the data.table
package, which is often set to use multi-threading by default (see
?data.table::setDTthreads
for more information). This can
leave to CPU usage higher than specified through the BP
argument of aggregateMatches
.
Binding sites for all miRNAs on all transcripts, especially when
including non-canonical sites, can easily amount to prohibitive amounts
of memory. The companion scanMiRApp package
includes a class implementing fast indexed access to on-disk
GenomicRanges and data.frames. The package additionally contains wrapper
(e.g. for performing full transcriptome scans) for common species and
for detecting enriched miRNA-target pairs, as well as a shiny interface
to scanMiR
.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Biostrings_2.75.3 GenomeInfoDb_1.43.2 XVector_0.47.1
## [4] IRanges_2.41.2 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [7] generics_0.1.3 scanMiR_1.13.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 stringi_1.8.4 digest_0.6.37
## [4] magrittr_2.0.3 evaluate_1.0.1 grid_4.4.2
## [7] fastmap_1.2.0 jsonlite_1.8.9 BiocManager_1.30.25
## [10] httr_1.4.7 UCSC.utils_1.3.0 scales_1.3.0
## [13] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
## [16] rlang_1.1.4 crayon_1.5.3 cowplot_1.1.3
## [19] munsell_0.5.1 withr_3.0.2 cachem_1.1.0
## [22] yaml_2.3.10 tools_4.4.2 parallel_4.4.2
## [25] BiocParallel_1.41.0 seqLogo_1.73.0 colorspace_2.1-1
## [28] ggplot2_3.5.1 GenomeInfoDbData_1.2.13 buildtools_1.0.0
## [31] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
## [34] pwalign_1.3.1 pkgconfig_2.0.3 bslib_0.8.0
## [37] pillar_1.10.0 gtable_0.3.6 glue_1.8.0
## [40] data.table_1.16.4 xfun_0.49 tibble_3.2.1
## [43] GenomicRanges_1.59.1 sys_3.4.3 knitr_1.49
## [46] farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
## [49] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2