Post-transcriptional modifications can be found abundantly in rRNA and tRNA and can be detected classically via several strategies. However, difficulties arise if the identity and the position of the modified nucleotides is to be determined at the same time. Classically, a primer extension, a form of reverse transcription (RT), would allow certain modifications to be accessed by blocks during the RT changes or changes in the cDNA sequences. Other modification would need to be selectively treated by chemical reactions to influence the outcome of the reverse transcription.
With the increased availability of high throughput sequencing, these classical methods were adapted to high throughput methods allowing more RNA molecules to be accessed at the same time. However, patterns of some modification cannot be detected by accessing small number of parameters. For these cases machine learning models can be trained on data from positions known to be modified in order to detect additional modified positions.
To extend the functionality of the RNAmodR
package and
classical detection strategies used for RiboMethSeq or AlkAnilineSeq
(see RNAmodR.RiboMethSeq
and
RNAmodR.AlkAnilineSeq
packages) towards detection through
machine learning models, RNAmodR.ML
provides classes and an
example workflow.
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'ExperimentHubData'
The ModifierML
class extends the Modifier
class from the RNAmodR
package and adds one slot,
mlModel
, a getter/setter
getMLModel
/setMLModel
, an additional
useMLModel
function to be called from the
aggregate
function.
The slot mlModel
can either be an empty character or
contain a name of a ModifierMLModel
class, which is loaded
upon creation of a ModifierML
object, and serves as a
wrapper around a machine learning model. For different types of machine
learning models ModifierMLModel
derived classes are
available, which currently are:
ModifierMLranger
for models generated with the
ranger
package (Wright and Ziegler 2017)ModifierMLkeras
for models generated with the
keras
package (Allaire and Chollet 2018)To illustrate how to develop a machine learning model with help from
the RNAmodR.ML
package, an example is given below.
Modifier
classAs an example for this vignette, we will try to detect D positions in
AlkAnilineSeq data. First define a specific ModifierML
class loading pileup and coverage data. In this example, the RNA
specific RNAModifierML
class is used.
setClass("ModMLExample",
contains = c("RNAModifierML"),
prototype = list(mod = c("D"),
score = "score",
dataType = c("PileupSequenceData",
"CoverageSequenceData"),
mlModel = character(0)))
# constructor function for ModMLExample
ModMLExample <- function(x, annotation = NA, sequences = NA, seqinfo = NA,
...){
RNAmodR:::Modifier("ModMLExample", x = x, annotation = annotation,
sequences = sequences, seqinfo = seqinfo, ...)
}
setClass("ModSetMLExample",
contains = "ModifierSet",
prototype = list(elementType = "ModMLExample"))
# constructor function for ModSetMLExample
ModSetMLExample <- function(x, annotation = NA, sequences = NA, seqinfo = NA,
...){
RNAmodR:::ModifierSet("ModMLExample", x, annotation = annotation,
sequences = sequences, seqinfo = seqinfo, ...)
}
Since the mlModel
slot contains an empty character, the
creation of the object will not automatically trigger a search for
modifications. However, it will aggregate the data in the format we want
to use. The aggregate_example
function is just an example
and the aggregation of the data is part of the model building. (See
(#Summary))
To gather training data, we just create a ModMLExample
object and let it do its job.
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
## is not available for this TxDb object
## OK
## Loading Pileup data from BAM files ... OK
## Loading Coverage data from BAM files ... OK
## Aggregating data and calculating scores ... Starting to search for 'Dihydrouridine' ...
## Warning: ML model not set. Skipped search for modifications ...
## done.
Afterwards we need to load/create coordinates for positions known to be modified as well as positions known to be unmodified.
data("dmod",package = "RNAmodR.ML")
# we just select the next U position from known positions
nextUPos <- function(gr){
nextU <- lapply(seq.int(1L,2L),
function(i){
subseq <- subseq(sequences(me)[dmod$Parent], start(dmod)+3L)
pos <- start(dmod) + 2L +
vapply(strsplit(as.character(subseq),""),
function(y){which(y == "U")[i]},integer(1))
ans <- dmod
ranges(ans) <- IRanges(start = pos, width = 1L)
ans
})
nextU <- do.call(c,nextU)
nextU$mod <- NULL
unique(nextU)
}
nondmod <- nextUPos(dmod)
nondmod <- nondmod[!(nondmod %in% dmod)]
coord <- unique(c(dmod,nondmod))
coord <- coord[order(as.integer(coord$Parent))]
With these coordinates the aggregated data of the
ModMLExample
can be subset to a training data set using the
function trainingData
.
How a specific model can be trained or what kind of strategies can be
employed to successfully train a model, is out of scope for the
vignette. For this example a random forest is trained using the
functionality provided by the ranger
package.
Now, the trained model can be used to create a new
ModifierMLModel
class and object.
setClass("ModifierMLexample",
contains = c("ModifierMLranger"),
prototype = list(model = model))
ModifierMLexample <- function(...){
new("ModifierMLexample")
}
mlmodel <- ModifierMLexample()
To be able to use the model via the ModifierMLModel
class, we also need to define an accessor to the predictions made by the
model. This function is called useModel
and is already
prefined for the ModifierMLModel
classes mentioned above in
secion Using RNAmodR.ML.
## Method Definition:
##
## function (x, y)
## {
## data <- getAggregateData(y)
## model <- x@model
## if (!is(model, "ranger")) {
## stop("model is not a ranger model")
## }
## unlisted_data <- unlist(data, use.names = FALSE)
## p <- predict(x@model, data.frame(unlisted_data))
## relist(p$predictions, data)
## }
## <bytecode: 0x560b762bcdc0>
## <environment: namespace:RNAmodR.ML>
##
## Signatures:
## x y
## target "ModifierMLranger" "ModifierML"
## defined "ModifierMLranger" "ModifierML"
If the results of these function is not usable for a specific model,
it can redefined for your needs. As defined by RNAmodR.ML
the function returns a NumericList
along the aggregated
data of the ModifierML
object.
The generated ModifierMLexample
wrapper can now be set
for the ModifierML
object using the setMLModel
function. (If a model is saved as part of a package, this step ist not
necessary, because it can be part of the class definition)
In order for the prediction data to be usable, another function has
to be implemented to save the predictions in the aggregated data. The
function is called useMLModel
.
setMethod(f = "useMLModel",
signature = signature(x = "ModMLExample"),
definition =
function(x){
predictions <- useModel(getMLModel(x), x)
data <- getAggregateData(x)
unlisted_data <- unlist(data, use.names = FALSE)
unlisted_data$score <- unlist(predictions)
x@aggregate <- relist(unlisted_data,data)
x
}
)
By re-running the aggregate
function and force an update
of the data, the predictions are made and used to populate the
score
column as outlined above.
During the model building phase some form of a performance
measurement usually is included. In addition to these model specific
measurements, RNAmodR.ML
includes the functionality from
the ROCR
package inherited from the RNAmodR
package. With this the performance of a model can evaluted over the
training set or any coordinates.
ModifierML
classSince we want to use the ModifierML
object to detect
modifications, we also need to define the findMod
function.
Based on the information on the performance, we set a threshold of
0.8
for the prediction score for detecting D modifications.
In the example below this threshold is hardcoded in the
find_mod_example
function, but can also be implemented
using the settings
function.
setMethod(
f = "findMod",
signature = signature(x = "ModMLExample"),
definition =
function(x){
find_mod_example(x, 25L)
}
)
Now we can redfine the ModMLExample
class with the slot
mlModel
already set to the ModifierMLexample
class. The ModMLExample
is now complete.
rm(me)
setClass("ModMLExample",
contains = c("RNAModifierML"),
prototype = list(mod = c("D"),
score = "score",
dataType = c("PileupSequenceData",
"CoverageSequenceData"),
mlModel = "ModifierMLexample"))
me <- ModMLExample(files[[1]], annotation, sequences)
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
## is not available for this TxDb object
## OK
## Loading Pileup data from BAM files ... OK
## Loading Coverage data from BAM files ... OK
## Aggregating data and calculating scores ... Starting to search for 'Dihydrouridine' ... done.
The detected modifications can be access using the
modifications
function.
## GRangesList object of length 37:
## $`1`
## GRanges object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | mod source type
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## [1] Q0020_15S_RRNA 48 + | D RNAmodR.ML RNAMOD
## [2] Q0020_15S_RRNA 323 + | D RNAmodR.ML RNAMOD
## score Parent
## <numeric> <character>
## [1] 0.936333 1
## [2] 0.832700 1
## -------
## seqinfo: 60 sequences from an unspecified genome; no seqlengths
##
## $`3`
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | mod source type score
## <Rle> <IRanges> <Rle> | <character> <character> <character> <numeric>
## [1] RDN18-1 229 + | D RNAmodR.ML RNAMOD 0.819467
## Parent
## <character>
## [1] 3
## -------
## seqinfo: 60 sequences from an unspecified genome; no seqlengths
##
## $`4`
## GRanges object with 4 ranges and 5 metadata columns:
## seqnames ranges strand | mod source type score
## <Rle> <IRanges> <Rle> | <character> <character> <character> <numeric>
## [1] RDN25-1 2 + | D RNAmodR.ML RNAMOD 0.860233
## [2] RDN25-1 748 + | D RNAmodR.ML RNAMOD 0.818367
## [3] RDN25-1 1720 + | D RNAmodR.ML RNAMOD 0.828700
## [4] RDN25-1 2571 + | D RNAmodR.ML RNAMOD 0.810800
## Parent
## <character>
## [1] 4
## [2] 4
## [3] 4
## [4] 4
## -------
## seqinfo: 60 sequences from an unspecified genome; no seqlengths
##
## ...
## <34 more elements>
Some of the modification found look reasonable. However, some of the positions seem to be noise.
Several options exist to improve the model: Either the threshold
applied to the prediction score can be raised to a higher value, like
0.9
or the model can maybe retrained by including these
position in another training data set. In addition, the training data
might be improved in general by higher sequencing depth.
nonValidMod <- mod[c("1","4")]
nonValidMod[["18"]] <- nonValidMod[["18"]][2]
nonValidMod[["26"]] <- nonValidMod[["26"]][2]
nonValidMod <- unlist(nonValidMod)
nonValidMod <- nonValidMod[,"Parent"]
coord <- unique(c(dmod,nondmod,nonValidMod))
coord <- coord[order(as.integer(coord$Parent))]
As an example, a new model is trained including the wrongly identified positions from the previous model as unmodified positions.
trainingData <- trainingData(me,coord)
trainingData <- unlist(trainingData, use.names = FALSE)
trainingData$labels <- as.integer(trainingData$labels)
model2 <- ranger(labels ~ ., data.frame(trainingData), num.trees = 2000)
setClass("ModifierMLexample2",
contains = c("ModifierMLranger"),
prototype = list(model = model2))
ModifierMLexample2 <- function(...){
new("ModifierMLexample2")
}
mlmodel2 <- ModifierMLexample2()
me2 <- me
setMLModel(me2) <- mlmodel2
me2 <- aggregate(me2, force = TRUE)
After updating the ModifierMLexample
class and
aggregating the data again the performance looks a bit better…
… and leads to a better result for detecting D modifications. Some positions are not detected anymore, which suggest that this model is still not an optimal solution and other factors could and should be improved upon as suggested above.
setMethod(
f = "findMod",
signature = signature(x = "ModMLExample"),
definition =
function(x){
find_mod_example(x, 25L)
}
)
me2 <- modify(me2, force = TRUE)
modifications(me2)
## GRanges object with 46 ranges and 5 metadata columns:
## seqnames ranges strand | mod source type
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## [1] tA-AGC-D 47 + | D RNAmodR.ML RNAMOD
## [2] tA-TGC-A 47 + | D RNAmodR.ML RNAMOD
## [3] tC-GCA-B 19 + | D RNAmodR.ML RNAMOD
## [4] tC-GCA-B 46 + | D RNAmodR.ML RNAMOD
## [5] tE-CTC-D 20 + | D RNAmodR.ML RNAMOD
## ... ... ... ... . ... ... ...
## [42] tV-CAC-D 47 + | D RNAmodR.ML RNAMOD
## [43] tW-CCA-G1 16 + | D RNAmodR.ML RNAMOD
## [44] tW-CCA-G1 46 + | D RNAmodR.ML RNAMOD
## [45] tY-GTA-D 21 + | D RNAmodR.ML RNAMOD
## [46] tY-GTA-D 22 + | D RNAmodR.ML RNAMOD
## score Parent
## <numeric> <character>
## [1] 0.961825 6
## [2] 0.954383 7
## [3] 0.921766 8
## [4] 0.947067 8
## [5] 0.847392 11
## ... ... ...
## [42] 0.929375 57
## [43] 0.806283 59
## [44] 0.826408 59
## [45] 0.906250 60
## [46] 0.802633 60
## -------
## seqinfo: 60 sequences from an unspecified genome; no seqlengths
In addition to training a single model, several models can be trained
and combined to a ModifierSet
.
An overall performance over several models can be analyzed or the individual performance compaired.
plotROC(mse, dmod, score= "score",
plot.args = list(avg = "threshold", spread.estimate = "stderror"))
If several models are trained and each provides useful information,
these can be package into a single ModifierMLModel
class to
combine the output of several models. Some of the functions outlined
above need, e.g. useMLModel
and/or useModel
,
to be modified to provide one or more scores for detecting a
modification.
If the created models can be saved to file, they can also be included in a package. This would allow others to use the models and the models can potentially be improved upon with increasing version numbers.
RNAmodR.ML
provides the interface for machine learning
models to be used with RNAmodR
to detect modified
nucleotides in high throughput sequencing data. For your own work
defining a working model might take some time. We hope that by using
RNAmodR.ML
the steps surrounding this crucial step might
become a bit easier.
However, if some steps or design choices made for
RNAmodR.ML
do not suit your need, let us know.
Contributions are always welcome as well.
We also want to provide some additional hints and suggestions for developing machine learning models.
keras
.cbind
on the data from the SequenceData
objects (cbind
works on SequenceData
objects,
if they are of the same type. Convert each of them to a
CompressedSplitDataFrameList
using
as(x,"CompressedSplitDataFrameList")
).trainingData
, a 2D tensor
is returned (sample, values). This can be converted into a 3D tensor by
providing a flanking value.## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ranger_0.17.0 Rsamtools_2.23.1 RNAmodR.Data_1.20.0
## [4] ExperimentHubData_1.33.0 AnnotationHubData_1.37.0 futile.logger_1.4.3
## [7] ExperimentHub_2.15.0 AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [10] dbplyr_2.5.0 RNAmodR.ML_1.21.0 RNAmodR_1.21.0
## [13] Modstrings_1.23.0 Biostrings_2.75.1 XVector_0.47.0
## [16] rtracklayer_1.67.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [19] IRanges_2.41.1 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [22] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] BiocIO_1.17.1 bitops_1.0-9
## [3] filelock_1.0.3 tibble_3.2.1
## [5] graph_1.85.0 XML_3.99-0.17
## [7] rpart_4.1.23 lifecycle_1.0.4
## [9] httr2_1.0.7 lattice_0.22-6
## [11] ensembldb_2.31.0 OrganismDbi_1.49.0
## [13] backports_1.5.0 magrittr_2.0.3
## [15] Hmisc_5.2-0 sass_0.4.9
## [17] rmarkdown_2.29 jquerylib_0.1.4
## [19] yaml_2.3.10 RUnit_0.4.33
## [21] Gviz_1.51.0 DBI_1.2.3
## [23] buildtools_1.0.0 RColorBrewer_1.1-3
## [25] abind_1.4-8 zlibbioc_1.52.0
## [27] purrr_1.0.2 AnnotationFilter_1.31.0
## [29] biovizBase_1.55.0 RCurl_1.98-1.16
## [31] nnet_7.3-19 VariantAnnotation_1.53.0
## [33] rappdirs_0.3.3 GenomeInfoDbData_1.2.13
## [35] AnnotationForge_1.49.0 maketools_1.3.1
## [37] codetools_0.2-20 DelayedArray_0.33.2
## [39] xml2_1.3.6 tidyselect_1.2.1
## [41] UCSC.utils_1.3.0 matrixStats_1.4.1
## [43] base64enc_0.1-3 GenomicAlignments_1.43.0
## [45] jsonlite_1.8.9 Formula_1.2-5
## [47] tools_4.4.2 progress_1.2.3
## [49] stringdist_0.9.12 Rcpp_1.0.13-1
## [51] glue_1.8.0 gridExtra_2.3
## [53] SparseArray_1.7.2 BiocBaseUtils_1.9.0
## [55] xfun_0.49 MatrixGenerics_1.19.0
## [57] dplyr_1.1.4 withr_3.0.2
## [59] formatR_1.14 BiocManager_1.30.25
## [61] fastmap_1.2.0 latticeExtra_0.6-30
## [63] fansi_1.0.6 caTools_1.18.3
## [65] digest_0.6.37 mime_0.12
## [67] R6_2.5.1 colorspace_2.1-1
## [69] gtools_3.9.5 jpeg_0.1-10
## [71] dichromat_2.0-0.1 biomaRt_2.63.0
## [73] RSQLite_2.3.8 utf8_1.2.4
## [75] data.table_1.16.2 prettyunits_1.2.0
## [77] httr_1.4.7 htmlwidgets_1.6.4
## [79] S4Arrays_1.7.1 pkgconfig_2.0.3
## [81] gtable_0.3.6 blob_1.2.4
## [83] sys_3.4.3 htmltools_0.5.8.1
## [85] RBGL_1.83.0 ProtGenerics_1.39.0
## [87] scales_1.3.0 Biobase_2.67.0
## [89] png_0.1-8 colorRamps_2.3.4
## [91] knitr_1.49 lambda.r_1.2.4
## [93] rstudioapi_0.17.1 reshape2_1.4.4
## [95] rjson_0.2.23 checkmate_2.3.2
## [97] curl_6.0.1 biocViews_1.75.0
## [99] cachem_1.1.0 stringr_1.5.1
## [101] KernSmooth_2.23-24 BiocVersion_3.21.1
## [103] parallel_4.4.2 foreign_0.8-87
## [105] AnnotationDbi_1.69.0 restfulr_0.0.15
## [107] pillar_1.9.0 grid_4.4.2
## [109] vctrs_0.6.5 gplots_3.2.0
## [111] cluster_2.1.6 htmlTable_2.4.3
## [113] evaluate_1.0.1 GenomicFeatures_1.59.1
## [115] cli_3.6.3 compiler_4.4.2
## [117] futile.options_1.0.1 rlang_1.1.4
## [119] crayon_1.5.3 interp_1.1-6
## [121] plyr_1.8.9 stringi_1.8.4
## [123] deldir_2.0-4 BiocParallel_1.41.0
## [125] BiocCheck_1.43.2 txdbmaker_1.3.1
## [127] munsell_0.5.1 lazyeval_0.2.2
## [129] Matrix_1.7-1 BSgenome_1.75.0
## [131] hms_1.1.3 bit64_4.5.2
## [133] ggplot2_3.5.1 KEGGREST_1.47.0
## [135] SummarizedExperiment_1.37.0 ROCR_1.0-11
## [137] memoise_2.0.1 bslib_0.8.0
## [139] bit_4.5.0