Title: | Repetitive Element Methylation Prediction |
---|---|
Description: | Machine learning-based tools to predict DNA methylation of locus-specific repetitive elements (RE) by learning surrounding genetic and epigenetic information. These tools provide genomewide and single-base resolution of DNA methylation prediction on RE that are difficult to measure using array-based or sequencing-based platforms, which enables epigenome-wide association study (EWAS) and differentially methylated region (DMR) analysis on RE. |
Authors: | Yinan Zheng [aut, cre], Lei Liu [aut], Wei Zhang [aut], Warren Kibbe [aut], Lifang Hou [aut, cph] |
Maintainer: | Yinan Zheng <[email protected]> |
License: | GPL-3 |
Version: | 1.31.0 |
Built: | 2024-12-02 06:23:10 UTC |
Source: | https://github.com/bioc/REMP |
Machine learning-based tools to predict DNA methylation of locus-specific repetitive elements (RE) by learning surrounding genetic and epigenetic information. These tools provide genomewide and single-base resolution of DNA methylation prediction on RE that are difficult to measure directly using array-based or sequencing-based platforms, which enables epigenome-wide association study (EWAS) and differentially methylated region (DMR) analysis on RE.
Start out generating required dataset for prediction using initREMP
.
The datasets include RE information, RE-CpG (i.e. CpGs located in RE region) information,
and gene annotation, which are maintained in a REMParcel
object.
It is recommended to save these generated data to the working directory so they can be used in the future.
Clean Illumina methylation dataset using grooMethy
. This function
can help identify and fix abnormal values and automatically impute missing values, which are
essential for downstream prediction.
Run remp
to predict genome-wide locus specific RE methylation.
Use the built-in accessors and utilities in REMProduct
object to get or
refine the prediction results.
Yinan Zheng [email protected], Lei Liu [email protected], Wei Zhang [email protected], Warren Kibbe [email protected], Lifang Hou [email protected]
Maintainer: Yinan Zheng [email protected]
Zheng Y, Joyce BT, Liu L, Zhang Z, Kibbe WA, Zhang W, Hou L. Prediction of genome-wide DNA methylation in repetitive elements. Nucleic Acids Res. 2017;45(15):8697-711. PubMed PMID: 28911103; PMCID: PMC5587781. http://dx.doi.org/10.1093/nar/gkx587.
A GRanges
dataset containing 500 Alu sequences that have CpGs profiled
by both Illumina 450k and EPIC array. The variables are as follows:
Alu.hg19.demo
Alu.hg19.demo
A GRanges
object.
seqnames: chromosome number
ranges: hg19 genomic position
strand: DNA strand
swScore: Smith Waterman (SW) alignment score
repName: Alu name
repClass: Alu class
repFamily: Alu family
Index: internal index (meaningless for external use; not communicable between genome builds)
Alu.hg19.demo
has the same format as the data object
returned by fetchRMSK
.
A GRanges object with 500 ranges and 3 metadata columns.
RepeatMasker database provided by package AnnotationHub
See fetchRMSK
to obtain the complete Alu/L1 dataset.
A GRanges
dataset containing 500 Alu sequences that have CpGs profiled
by both Illumina 450k and EPIC array. The variables are as follows:
Alu.hg38.demo
Alu.hg38.demo
A GRanges
object.
seqnames: chromosome number
ranges: hg38 genomic position
strand: DNA strand
swScore: Smith Waterman (SW) alignment score
repName: Alu name
repClass: Alu class
repFamily: Alu family
Index: internal index (meaningless for external use; not communicable between genome builds)
Alu.hg38.demo
has the same format as the data object
returned by fetchRMSK
.
A GRanges object with 500 ranges and 3 metadata columns.
RepeatMasker database (hg38) is provided by UCSC database downloaded from http://hgdownload.cse.ucsc.edu/goldenpath.
See fetchRMSK
to obtain the complete Alu/L1 dataset.
fetchRefSeqGene
is used to obtain refSeq gene database provided by AnnotationHub (hg19)
or UCSC web database (hg19/hg38).
fetchRefSeqGene( annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), mainOnly = FALSE, verbose = FALSE )
fetchRefSeqGene( annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), mainOnly = FALSE, verbose = FALSE )
annotation.source |
Character parameter. Specify the source of annotation databases, including
the RefSeq Gene annotation database and RepeatMasker annotation database. If |
genome |
Character parameter. Specify the build of human genome. Can be either |
mainOnly |
Logical parameter. See details. |
verbose |
Logical parameter. Should the function be verbose? |
When mainOnly = FALSE
, only the transcript location information will be returned,
Otherwise, a GRangesList
object containing gene regions
information will be added. Gene regions include: 2000 base pair upstream of the transcript
start site ($tss
)), 5'UTR ($fiveUTR
)), coding sequence ($cds
)),
exon ($exon
)), and 3'UTR ($threeUTR
)). The index
column is an internal
index that is used to facilitate data referral, which is meaningless for external use.
A single GRanges
(for main refgene data) object or a list incorporating
both GRanges
object (for main refgene data) and GRangesList
object
(for gene regions data).
if (!exists("refgene.hg19")) refgene.hg19 <- fetchRefSeqGene(annotation.source = "AH", genome = "hg19", verbose = TRUE) refgene.hg19
if (!exists("refgene.hg19")) refgene.hg19 <- fetchRefSeqGene(annotation.source = "AH", genome = "hg19", verbose = TRUE) refgene.hg19
fetchRMSK
is used to obtain specified RE database from RepeatMasker Database
provided by AnnotationHub.
fetchRMSK( REtype = c("Alu", "L1", "ERV"), annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), verbose = FALSE )
fetchRMSK( REtype = c("Alu", "L1", "ERV"), annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), verbose = FALSE )
REtype |
Type of RE. Currently |
annotation.source |
Character parameter. Specify the source of annotation databases, including
the RefSeq Gene annotation database and RepeatMasker annotation database. If |
genome |
Character parameter. Specify the build of human genome. Can be either |
verbose |
Logical parameter. Should the function be verbose? |
A GRanges
object containing RE database. 'repName' column
indicates the RE name; 'swScore' column indicates the SW score; 'Index' is an
internal index for RE to facilitate data referral, which is meaningless for external use.
L1 <- fetchRMSK(REtype = "L1", annotation.source = "AH", genome = "hg19", verbose = TRUE) L1
L1 <- fetchRMSK(REtype = "L1", annotation.source = "AH", genome = "hg19", verbose = TRUE) L1
findRECpG
is used to obtain RE-CpG genomic location data.
findRECpG( RE, REtype = c("Alu", "L1", "ERV"), genome = c("hg19", "hg38"), be = NULL, verbose = FALSE )
findRECpG( RE, REtype = c("Alu", "L1", "ERV"), genome = c("hg19", "hg38"), be = NULL, verbose = FALSE )
RE |
A |
REtype |
Type of RE. Currently |
genome |
Character parameter. Specify the build of human genome. Can be either |
be |
A |
verbose |
logical parameter. Should the function be verbose? |
CpG site is defined as 5'-C-p-G-3'. It is reasonable to assume that the methylation status across all CpG/CpG dyads are concordant. Maintenance methyltransferase exhibits a preference for hemimethylated CpG/CpG dyads (methylated on one strand only). As a result, methyaltion status of CpG sites in both forward and reverse strands are usually consistent. Therefore, to accommodate the cytosine loci in both strands, the returned genomic ranges cover the 'CG' sequence with width of 2. The 'strand' information indicates the strand of the RE. Locating CpG sites in RE sequences can be computation intensive. It is recommanded to get more than one work in the backend for a faster running speed.
A GRanges
object containing identified RE-CpG genomic
location data.
data(Alu.hg19.demo) RE.CpG <- findRECpG(RE = Alu.hg19.demo, REtype = "Alu", genome = "hg19", verbose = TRUE) RE.CpG
data(Alu.hg19.demo) RE.CpG <- findRECpG(RE = Alu.hg19.demo, REtype = "Alu", genome = "hg19", verbose = TRUE) RE.CpG
getBackend
is used to obtain BiocParallel
Back-end to allow
parallel computing.
getBackend(ncore, BPPARAM = NULL, verbose = FALSE)
getBackend(ncore, BPPARAM = NULL, verbose = FALSE)
ncore |
Number of cores used for parallel computing. By default max number
of cores available in the machine will be utilized. If |
BPPARAM |
An optional |
verbose |
Logical parameter. Should the function be verbose? |
A BiocParallel
object that can be used for parallel
computing.
# Non-parallel mode be <- getBackend(ncore = 1, verbose = TRUE) be # parallel mode (2 workers) be <- getBackend(ncore = 2, verbose = TRUE) be
# Non-parallel mode be <- getBackend(ncore = 1, verbose = TRUE) be # parallel mode (2 workers) be <- getBackend(ncore = 2, verbose = TRUE) be
getGM12878
is used to obtain public available methylation profiling
data of HapMap LCL sample GM12878.
getGM12878(arrayType = c("450k", "EPIC"), mapGenome = FALSE)
getGM12878(arrayType = c("450k", "EPIC"), mapGenome = FALSE)
arrayType |
Illumina methylation array type. Currently |
mapGenome |
Logical parameter. If |
Illumina 450k data were sourced and curated from ENCODE
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethyl450/wgEncodeHaibMethyl450Gm12878SitesRep1.bed.gz.
Illumina EPIC data were obtained from data package minfiDataEPIC
.
A RatioSet
or GenomicRatioSet
containing
beta value and M value of the methylation data.
## Not run: # Get GM12878 methylation data (450k array) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") GM12878_450k ## End(Not run) # Get GM12878 methylation data (EPIC array) if (!exists("GM12878_EPIC")) GM12878_EPIC <- getGM12878("EPIC") GM12878_EPIC
## Not run: # Get GM12878 methylation data (450k array) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") GM12878_450k ## End(Not run) # Get GM12878 methylation data (EPIC array) if (!exists("GM12878_EPIC")) GM12878_EPIC <- getGM12878("EPIC") GM12878_EPIC
GRannot
is used to annotate a GRanges
dataset with gene region
information using refseq gene database
GRannot(object.GR, refgene, symbol = FALSE, verbose = FALSE)
GRannot(object.GR, refgene, symbol = FALSE, verbose = FALSE)
object.GR |
An |
refgene |
A complete refGene annotation database returned by
|
symbol |
Logical parameter. Should the annotation return gene symbol? |
verbose |
Logical parameter. Should the function be verbose? |
The annotated gene region information includes: protein coding gene (InNM),
noncoding RNA gene (InNR), 2000 base pair upstream of the transcript start site (InTSS),
5'UTR (In5UTR), coding sequence (InCDS), exon (InExon), and 3'UTR (In3UTR). The intergenic
and intron regions can then be represented by the combination of these region data.
The number shown in these columns represent the row number or 'index' column in the
main refgene database obtained by fetchRefSeqGene
.
A GRanges
or a GRangesList
object containing refSeq
Gene database.
data(Alu.hg19.demo) if (!exists("refgene.hg19")) refgene.hg19 <- fetchRefSeqGene(annotation.source = "AH", genome = "hg19", verbose = TRUE) Alu.hg19.demo.refGene <- GRannot(Alu.hg19.demo, refgene.hg19, verbose = TRUE) Alu.hg19.demo.refGene
data(Alu.hg19.demo) if (!exists("refgene.hg19")) refgene.hg19 <- fetchRefSeqGene(annotation.source = "AH", genome = "hg19", verbose = TRUE) Alu.hg19.demo.refGene <- GRannot(Alu.hg19.demo, refgene.hg19, verbose = TRUE) Alu.hg19.demo.refGene
grooMethy
is used to automatically detect and fix data issues including zero beta
value, missing value, and infinite value.
grooMethy( methyDat, Seq.GR = NULL, impute = TRUE, imputebyrow = TRUE, mapGenome = FALSE, verbose = FALSE )
grooMethy( methyDat, Seq.GR = NULL, impute = TRUE, imputebyrow = TRUE, mapGenome = FALSE, verbose = FALSE )
methyDat |
A |
Seq.GR |
A |
impute |
If |
imputebyrow |
If |
mapGenome |
Logical parameter. If |
verbose |
Logical parameter. Should the function be verbose? |
For methylation data in beta value, if zero/one value exists, the logit transformation
from beta to M value will produce infinite value. Therefore, zero/one beta value
will be replaced with the smallest non-zero beta/largest non-one beta value found in the dataset.
grooMethy
can also handle missing value (i.e. NA
or NaN
) using KNN imputation (see
impute.knn
). The infinite value will be also treated as missing value for imputation.
If the original dataset is in beta value, grooMethy
will first transform it to M value
before imputation is carried out. If the imputed value is out of the original range (which is possible when
imputebyrow = FALSE
), mean value will be used instead. Warning: imputed
values for multimodal distributed CpGs (across samples) may not be correct. Please check package ENmix
to
identify the CpGs with multimodal distribution. Please note that grooMethy
is
also embedded in remp
so the user can run remp
directly without
explicitly running grooMethy
. For sequencing methylation data, please specify the genomic location of CpGs
in a GenomicRanges
object and specify it in Seq.GR
. For an example of Seq.GR
, Please
run minfi::getLocations(IlluminaHumanMethylation450kanno.ilmn12.hg19)
(the row names of the CpGs in Seq.GR
can be NULL
). The user should make sure the genome build of Seq.GR
match the build specified
in genome
parameter of function initREMP
and remprofile
(default is "hg19"
).
A RatioSet
or GenomicRatioSet
containing beta value and
M value of the methylation data.
# Get GM12878 methylation data (450k array) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") GM12878_450k <- grooMethy(GM12878_450k, verbose = TRUE) # Also works if data input is a matrix grooMethy(minfi::getBeta(GM12878_450k), verbose = TRUE)
# Get GM12878 methylation data (450k array) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") GM12878_450k <- grooMethy(GM12878_450k, verbose = TRUE) # Also works if data input is a matrix grooMethy(minfi::getBeta(GM12878_450k), verbose = TRUE)
initREMP
is used to initialize annotation database for RE methylation prediction.
Three RE types in human, Alu element (Alu), LINE-1 (L1), and endogenous retrovirus (ERV) are available.
initREMP( arrayType = c("450k", "EPIC", "Sequencing"), REtype = c("Alu", "L1", "ERV"), annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), RE = NULL, Seq.GR = NULL, ncore = NULL, BPPARAM = NULL, export = FALSE, work.dir = tempdir(), verbose = FALSE )
initREMP( arrayType = c("450k", "EPIC", "Sequencing"), REtype = c("Alu", "L1", "ERV"), annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), RE = NULL, Seq.GR = NULL, ncore = NULL, BPPARAM = NULL, export = FALSE, work.dir = tempdir(), verbose = FALSE )
arrayType |
Illumina methylation array type. Currently |
REtype |
Type of RE. Currently |
annotation.source |
Character parameter. Specify the source of annotation databases, including
the RefSeq Gene annotation database and RepeatMasker annotation database. If |
genome |
Character parameter. Specify the build of human genome. Can be either |
RE |
A |
Seq.GR |
A |
ncore |
Number of cores used for parallel computing. By default max number of cores
available in the machine will be utilized. If |
BPPARAM |
An optional |
export |
Logical. Should the returned |
work.dir |
Path to the directory where the generated data will be saved. Valid when
|
verbose |
Logical parameter. Should the function be verbose? |
Currently, we support two major types of RE in the human genome, Alu and L1. The main purpose of
initREMP
is to generate and annotate CpG/RE data using the refSeq Gene (hg19)
annotation database (provided by AnnotationHub
). These annotation data are crucial to
RE methylation prediction in remp
. Once generated, the data can be reused in the future
(data can be very large). Therefore, we recommend the user to save the output from
initREMP
to the local machine, so that user only need to run this function once
as long as there is no change to the RE database. To minimize the size of the resulting data file, the generated
annotation data are only for REs that contain RE-CpGs with neighboring profiled CpGs. By default, the
neighboring CpGs are confined within 1200 bp flanking window. This window size can be modified using
remp_options
. Note that the refSeq Gene database from UCSC is dynamic (updated periodically)
and reflecting the latest knowledge of gene, whereas the database from AnnotationHub is static and classic.
Using different sources will have a slight impact on the prediction results of RE methylation and gene annotation
of final results. For sequencing methylation data, please specify the genomic location of CpGs
in a GenomicRanges
object and specify it in Seq.GR
. For an example of Seq.GR
, Please
run minfi::getLocations(IlluminaHumanMethylation450kanno.ilmn12.hg19)
(the row names of the CpGs in
Seq.GR
can be NULL
). The user should make sure the genome build of Seq.GR
match the
build specified in genome
parameter (default is "hg19"
).
An REMParcel
object containing data needed for RE methylation prediction.
See remp
for RE methylation prediction.
if (!exists("remparcel")) { data(Alu.hg19.demo) remparcel <- initREMP(arrayType = "450k", REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, ncore = 1, verbose = TRUE) }
if (!exists("remparcel")) { data(Alu.hg19.demo) remparcel <- initREMP(arrayType = "450k", REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, ncore = 1, verbose = TRUE) }
remp
is used to predict genomewide methylation levels of locus-specific repetitive elements (RE).
Two major RE types in human, Alu element (Alu) and LINE-1 (L1) are available.
remp( methyDat = NULL, REtype = c("Alu", "L1", "ERV"), Seq.GR = NULL, parcel = NULL, work.dir = tempdir(), win = 1000, method = c("rf", "xgbTree", "svmLinear", "svmRadial", "naive"), autoTune = TRUE, param = NULL, seed = NULL, ncore = NULL, BPPARAM = NULL, verbose = FALSE )
remp( methyDat = NULL, REtype = c("Alu", "L1", "ERV"), Seq.GR = NULL, parcel = NULL, work.dir = tempdir(), win = 1000, method = c("rf", "xgbTree", "svmLinear", "svmRadial", "naive"), autoTune = TRUE, param = NULL, seed = NULL, ncore = NULL, BPPARAM = NULL, verbose = FALSE )
methyDat |
A |
REtype |
Type of RE. Currently |
Seq.GR |
A |
parcel |
An |
work.dir |
Path to the directory where the annotation data generated by |
win |
An integer specifying window size to confine the upstream and downstream flanking
region centered on the predicted CpG in RE for prediction. Default = |
method |
Name of model/approach for prediction. Currently |
autoTune |
Logical parameter. If |
param |
A list specifying fixed model tuning parameter(s) (not applicable for Random Forest, see Details).
For Extreme Gradient Boosting, |
seed |
Random seed for Random Forest model for reproducible prediction results.
Default is |
ncore |
Number of cores used for parallel computing. By default, max number of cores available
in the machine will be utilized. If |
BPPARAM |
An optional |
verbose |
Logical parameter. Should the function be verbose? |
Before running remp
, user should make sure the methylation data have gone through
proper quality control, background correction, and normalization procedures. Both beta value
and M value are allowed. Rows represents probes and columns represents samples. For array data,
please make sure to have row names that specify the Illumina probe ID (i.e. cg00000029). For sequencing
data, please provide the genomic location of CpGs in a GRanges
obejct and
specify it using Seq.GR
parameter. win = 1000
is based on previous findings showing that
neighboring CpGs are more likely to be co-modified within 1000 bp. User can specify narrower window size
for slight improvement of prediction accuracy at the cost of less predicted RE. Window size greater than 1000 is not
recommended as the machine learning models would not be able to learn much userful information
for prediction but introduce noise. Random Forest model (method = "rf"
) is recommented
as it offers more accurate prediction and it also enables prediction reliability functionality.
Prediction reliability is estimated by conditional standard deviation using Quantile Regression Forest.
Please note that if parallel computing is allowed, parallel Random Forest
(powered by package ranger
) will be used automatically. The performance of
Random Forest model is often relatively insensitive to the choice of mtry
.
Therefore, auto-tune will be turned off using Random Forest and mtry
will be set to one third
of the total number of predictors. For SVM, if autoTune = TRUE
, preset tuning parameter
search grid can be access and modified using remp_options
.
A REMProduct
object containing predicted RE methylation results.
See initREMP
to prepare necessary annotation database before running remp
.
# Obtain example Illumina example data (450k) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") # Make sure you have run 'initREMP' first. See ?initREMP. if (!exists("remparcel")) { data(Alu.hg19.demo) remparcel <- initREMP(arrayType = "450k", REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, ncore = 1, verbose = TRUE) } # With data template pre-built. See ?rempTemplate. if (!exists("template")) template <- rempTemplate(GM12878_450k, parcel = remparcel, win = 1000, verbose = TRUE) # Run remp with pre-built template: remp.res <- remp(template, ncore = 1) # Or run remp without pre-built template (identical results): ## Not run: remp.res <- remp(GM12878_450k, REtype = "Alu", parcel = remparcel, ncore = 1, verbose = TRUE) ## End(Not run) remp.res details(remp.res) rempB(remp.res) # Methylation data (beta value) # Extract CpG location information. # This accessor is inherit from class 'RangedSummarizedExperiment') rowRanges(remp.res) # RE annotation information rempAnnot(remp.res) # Add gene annotation remp.res <- decodeAnnot(remp.res, type = "symbol") rempAnnot(remp.res) # (Recommended) Trim off less reliable prediction remp.res <- rempTrim(remp.res) # Obtain RE-level methylation (aggregate by mean) remp.res <- rempAggregate(remp.res) rempB(remp.res) # Methylation data (beta value) # Extract RE location information rowRanges(remp.res) # Density plot across predicted RE remplot(remp.res)
# Obtain example Illumina example data (450k) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") # Make sure you have run 'initREMP' first. See ?initREMP. if (!exists("remparcel")) { data(Alu.hg19.demo) remparcel <- initREMP(arrayType = "450k", REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, ncore = 1, verbose = TRUE) } # With data template pre-built. See ?rempTemplate. if (!exists("template")) template <- rempTemplate(GM12878_450k, parcel = remparcel, win = 1000, verbose = TRUE) # Run remp with pre-built template: remp.res <- remp(template, ncore = 1) # Or run remp without pre-built template (identical results): ## Not run: remp.res <- remp(GM12878_450k, REtype = "Alu", parcel = remparcel, ncore = 1, verbose = TRUE) ## End(Not run) remp.res details(remp.res) rempB(remp.res) # Methylation data (beta value) # Extract CpG location information. # This accessor is inherit from class 'RangedSummarizedExperiment') rowRanges(remp.res) # RE annotation information rempAnnot(remp.res) # Add gene annotation remp.res <- decodeAnnot(remp.res, type = "symbol") rempAnnot(remp.res) # (Recommended) Trim off less reliable prediction remp.res <- rempTrim(remp.res) # Obtain RE-level methylation (aggregate by mean) remp.res <- rempAggregate(remp.res) rempB(remp.res) # Methylation data (beta value) # Extract RE location information rowRanges(remp.res) # Density plot across predicted RE remplot(remp.res)
Tools to manage global setting options for REMP package.
remp_options(...) remp_reset()
remp_options(...) remp_reset()
... |
Option names to retrieve option values or |
NULL
The following options are supported
.default.AluFamily.grep
Regular expression for 'grep' to extract Alu family to be included in the prediction.
.default.L1Family.grep
Regular expression for 'grep' to extract L1 family to be included in the prediction.
.default.ERVFamily.grep
Regular expression for 'grep' to extract ERV family to be included in the prediction.
.default.chr
List of human chromosome.
.default.GM12878.450k.URL
URL to download GM12878 450k methylation profiling data.
.default.RMSK.hg19.URL
URL to download RepeatMasker database in hg19 genome.
.default.RMSK.hg38.URL
URL to download RepeatMasker database in hg38 genome.
.default.refGene.hg19.URL
URL to download refSeq gene database in hg19 genome.
.default.refGene.hg38.URL
URL to download refSeq gene database in hg38 genome.
.default.AH.repeatmasker.hg19
AnnotationHub
data ID linked to RepeatMasker
annotation database (Mar 2020, build hg19).
.default.AH.repeatmasker.hg38
AnnotationHub
data ID linked to RepeatMasker
annotation database (Sep 2021, build hg38).
.default.AH.refgene.hg19
AnnotationHub
data ID linked to refSeq gene database
(build hg19)
.default.AH.hg38ToHg19.over.chain
AnnotationHub
hg38 to hg19 liftover chain data ID.
.default.AH.hg19ToHg38.over.chain
AnnotationHub
hg19 to hg38 liftover chain data ID.
.default.TSS.upstream
Define the upstream range of transcription start site region.
.default.TSS.downstream
Define the downstream range of transcription start site region.
.default.max.flankWindow
Define the max size of the flanking window surrounding the predicted RE-CpG.
.default.27k.total.probes
Total number of probes designed in Illumina 27k array.
.default.450k.total.probes
Total number of probes designed in Illumina 450k array.
.default.epic.total.probes
Total number of probes designed in Illumina EPIC array.
.default.450k.annotation
A character string associated with the Illumina 450k array annotation dataset.
.default.epic.annotation
A character string associated with the Illumina EPIC array annotation dataset.
.default.genomicRegionColNames
Define the names of the genomic regions for prediction.
.default.predictors
Define the names of predictors for RE methylation prediction.
.default.svmLinear.tune
Define the default C
(Cost) parameter for Support
Vector Machine (SVM) using linear kernel.
.default.svmRadial.tune
Define the default parameters (C
and sigma
) for SVM
using Radial basis function kernel.
.default.xgbTree.tune
Define the default parameters (nrounds
, eta
, max_depth
,
gamma
, colsample_bytree
, min_child_weight
, and subsample
) for Extreme Gradient
Boosting.
# Display all default settings remp_options() # Display a specified setting remp_options(".default.max.flankWindow") # Change default maximum flanking window size to 2000 remp_options(.default.max.flankWindow = 2000) # Reset all options remp_reset()
# Display all default settings remp_options() # Display a specified setting remp_options(".default.max.flankWindow") # Change default maximum flanking window size to 2000 remp_options(.default.max.flankWindow = 2000) # Reset all options remp_reset()
REMParcel
is a container class to organize required datasets for
RE methylation prediction generated from initREMP
and used in remp
.
REMParcel( REtype = "Unknown", genome = "Unknown", platform = "Unknown", RefGene = GRanges(), RE = GRanges(), RECpG = GRanges(), ILMN = GRanges() ) getParcelInfo(object) getRefGene(object) getRE(object) getRECpG(object) getILMN(object, ...) saveParcel(object, ...) ## S4 method for signature 'REMParcel' saveParcel(object, work.dir = tempdir(), verbose = FALSE, ...) ## S4 method for signature 'REMParcel' getParcelInfo(object) ## S4 method for signature 'REMParcel' getRefGene(object) ## S4 method for signature 'REMParcel' getRE(object) ## S4 method for signature 'REMParcel' getRECpG(object) ## S4 method for signature 'REMParcel' getILMN(object, REonly = FALSE)
REMParcel( REtype = "Unknown", genome = "Unknown", platform = "Unknown", RefGene = GRanges(), RE = GRanges(), RECpG = GRanges(), ILMN = GRanges() ) getParcelInfo(object) getRefGene(object) getRE(object) getRECpG(object) getILMN(object, ...) saveParcel(object, ...) ## S4 method for signature 'REMParcel' saveParcel(object, work.dir = tempdir(), verbose = FALSE, ...) ## S4 method for signature 'REMParcel' getParcelInfo(object) ## S4 method for signature 'REMParcel' getRefGene(object) ## S4 method for signature 'REMParcel' getRE(object) ## S4 method for signature 'REMParcel' getRECpG(object) ## S4 method for signature 'REMParcel' getILMN(object, REonly = FALSE)
REtype |
Type of RE ( |
genome |
Specify the build of human genome. Can be either |
platform |
Illumina methylation profiling platform ( |
RefGene |
refSeq gene annotation data, which can be obtained by |
RE |
Annotated RE genomic range data, which can be obtained by |
RECpG |
Genomic range data of annotated CpG site identified in RE DNA sequence, which can
be obtained by |
ILMN |
Illumina CpG probe genomic range data. |
object |
A |
... |
For |
work.dir |
For |
verbose |
For |
REonly |
For |
An object of class REMParcel
for the constructor.
getParcelInfo(object)
Return data type, RE type, and flanking window size information of the parcel.
getRefGene(object)
Return RefSeq gene annotation data.
getRE(object)
Return RE genomic location data for prediction (annotated by refSeq gene database).
getRECpG(object)
Return RE-CpG genomic location data for prediction.
getILMN(object, REonly = FALSE)
Return Illumina CpG probe genomic
location data for prediction (annotated by refSeq gene database). If
REonly = TRUE
, only probes within RE region are returned.
saveParcel(object, work.dir = tempdir(), verbose = FALSE, ...)
Save the object to local machine.
showClass("REMParcel")
showClass("REMParcel")
Class REMProduct
is to maintain RE methylation prediction results.
REMProduct
inherits Bioconductor's RangedSummarizedExperiment
class.
REMProduct( REtype = "Unknown", genome = "Unknown", platform = "Unknown", win = "Unknown", predictModel = "Unknown", QCModel = "Unknown", rempM = NULL, rempB = NULL, rempQC = NULL, cpgRanges = GRanges(), sampleInfo = DataFrame(), REannotation = GRanges(), RECpG = GRanges(), regionCode = DataFrame(), refGene = GRanges(), varImp = DataFrame(), REStats = DataFrame(), GeneStats = DataFrame(), Seed = NULL ) rempM(object) rempB(object) rempQC(object) rempAnnot(object) rempImp(object) rempStats(object) remplot(object, ...) details(object) decodeAnnot(object, ...) rempTrim(object, ...) rempAggregate(object, ...) rempCombine(object1, object2) ## S4 method for signature 'REMProduct' rempM(object) ## S4 method for signature 'REMProduct' rempB(object) ## S4 method for signature 'REMProduct' rempQC(object) ## S4 method for signature 'REMProduct' rempImp(object) ## S4 method for signature 'REMProduct' rempAnnot(object) ## S4 method for signature 'REMProduct' rempStats(object) ## S4 method for signature 'REMProduct' remplot(object, type = c("individual", "overall"), ...) ## S4 method for signature 'REMProduct' details(object) ## S4 method for signature 'REMProduct' decodeAnnot(object, type = c("symbol", "entrez"), ncore = 1, BPPARAM = NULL) ## S4 method for signature 'REMProduct' rempTrim(object, threshold = 1.7, missingRate = 0.2) ## S4 method for signature 'REMProduct' rempAggregate(object, NCpG = 2, ncore = 1, BPPARAM = NULL) ## S4 method for signature 'REMProduct,REMProduct' rempCombine(object1, object2)
REMProduct( REtype = "Unknown", genome = "Unknown", platform = "Unknown", win = "Unknown", predictModel = "Unknown", QCModel = "Unknown", rempM = NULL, rempB = NULL, rempQC = NULL, cpgRanges = GRanges(), sampleInfo = DataFrame(), REannotation = GRanges(), RECpG = GRanges(), regionCode = DataFrame(), refGene = GRanges(), varImp = DataFrame(), REStats = DataFrame(), GeneStats = DataFrame(), Seed = NULL ) rempM(object) rempB(object) rempQC(object) rempAnnot(object) rempImp(object) rempStats(object) remplot(object, ...) details(object) decodeAnnot(object, ...) rempTrim(object, ...) rempAggregate(object, ...) rempCombine(object1, object2) ## S4 method for signature 'REMProduct' rempM(object) ## S4 method for signature 'REMProduct' rempB(object) ## S4 method for signature 'REMProduct' rempQC(object) ## S4 method for signature 'REMProduct' rempImp(object) ## S4 method for signature 'REMProduct' rempAnnot(object) ## S4 method for signature 'REMProduct' rempStats(object) ## S4 method for signature 'REMProduct' remplot(object, type = c("individual", "overall"), ...) ## S4 method for signature 'REMProduct' details(object) ## S4 method for signature 'REMProduct' decodeAnnot(object, type = c("symbol", "entrez"), ncore = 1, BPPARAM = NULL) ## S4 method for signature 'REMProduct' rempTrim(object, threshold = 1.7, missingRate = 0.2) ## S4 method for signature 'REMProduct' rempAggregate(object, NCpG = 2, ncore = 1, BPPARAM = NULL) ## S4 method for signature 'REMProduct,REMProduct' rempCombine(object1, object2)
REtype |
Type of RE ( |
genome |
Specify the build of human genome. Can be either |
platform |
Illumina methylation profiling platform ( |
win |
Flanking window size of the predicting RE-CpG. |
predictModel |
Name of the model used for prediction. |
QCModel |
Name of the model used for prediction quality evaluation. |
rempM |
Predicted methylation level in M value. |
rempB |
Predicted methylation level in beta value (optional). |
rempQC |
Prediction quality scores, which is available only when Random Forest
model is used in |
cpgRanges |
Genomic ranges of the predicting RE-CpG. |
sampleInfo |
Sample information. |
REannotation |
Annotation data for the predicting RE. |
RECpG |
Annotation data for the RE-CpG profiled by Illumina platform. |
regionCode |
Internal index code defined in |
refGene |
refSeq gene annotation data, which can be obtained by |
varImp |
Importance of the predictors. |
REStats |
RE coverage statistics, which is internally generated in |
GeneStats |
Gene coverage statistics, which is internally generated in |
Seed |
Random seed for Random Forest model for reproducible prediction results. |
object |
A |
... |
For |
object1 |
A |
object2 |
A |
type |
For |
ncore |
For |
BPPARAM |
For |
threshold |
For |
missingRate |
For |
NCpG |
For |
An object of class REMProduct
for the constructor.
rempM(object)
Return M value of the prediction.
rempB(object)
Return beta value of the prediction.
rempQC(object)
Return prediction quality scores.
rempImp(object)
Return relative importance of predictors.
rempStats(object)
Return RE and gene coverage statistics.
rempAnnot(object)
Return annotation data for the predicted RE.
remplot(object, type = c("individual", "overall"), ...)
Make a density plot of predicted methylation
(beta values) in the REMProduct
object. If type = "individual"
, density curves will be
plotted for each of the samples; If type = "overall"
, one density curve of the mean methylation level
across the samples will be plotted. Default type = "individual"
.
details(object)
Display detailed descriptive statistics of the predicion results.
decodeAnnot(object, type = c("symbol", "entrez")), ncore = NULL, BPPARAM = NULL
Decode the
RE annotation data by Gene Symbol (when type = "Symbol"
) or Entrez Gene
(when type = "Entrez"
).Default type = "Symbol"
. Annotation data are provided by
org.Hs.eg.db
.
rempTrim(object, threshold = 1.7, missingRate = 0.2)
Any predicted CpG values with
quality score < threshold (default = 1.7, specified by threshold = 1.7
) will be replaced with NA.
CpGs contain more than missingRate * 100
rate across samples will be discarded. Relavant summary statistics will be re-evaluated.
rempAggregate(object, NCpG = 2, ncore = NULL, BPPARAM = NULL)
Aggregate the predicted RE-CpG
methylation by RE using mean. To ensure the reliability of the aggregation, by default only RE with at
least 2 predicted CpG sites (specified by NCpG = 2
) will be aggregated.
rempCombine(object1, object2)
Combine two REMProduct
objects by column.
showClass("REMProduct")
showClass("REMProduct")
remprofile
is used to extract profiled methylation of CpG sites in RE.
remprofile( methyDat, REtype = c("Alu", "L1", "ERV"), annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), Seq.GR = NULL, RE = NULL, impute = FALSE, imputebyrow = TRUE, verbose = FALSE )
remprofile( methyDat, REtype = c("Alu", "L1", "ERV"), annotation.source = c("AH", "UCSC"), genome = c("hg19", "hg38"), Seq.GR = NULL, RE = NULL, impute = FALSE, imputebyrow = TRUE, verbose = FALSE )
methyDat |
A |
REtype |
Type of RE. Currently |
annotation.source |
Character parameter. Specify the source of annotation databases, including
the RefSeq Gene annotation database and RepeatMasker annotation database. If |
genome |
Character parameter. Specify the build of human genome. Can be either |
Seq.GR |
A |
RE |
A |
impute |
Parameter used by |
imputebyrow |
Parameter used by |
verbose |
Logical parameter. Should the function be verbose? |
A REMProduct
object containing profiled RE methylation results.
data(Alu.hg19.demo) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") remprofile.res <- remprofile(GM12878_450k, REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, verbose = TRUE) details(remprofile.res) rempB(remprofile.res) # Methylation data (beta value) remprofile.res <- rempAggregate(remprofile.res) details(remprofile.res) rempB(remprofile.res) # Methylation data (beta value)
data(Alu.hg19.demo) if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") remprofile.res <- remprofile(GM12878_450k, REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, verbose = TRUE) details(remprofile.res) rempB(remprofile.res) # Methylation data (beta value) remprofile.res <- rempAggregate(remprofile.res) details(remprofile.res) rempB(remprofile.res) # Methylation data (beta value)
rempTemplate
is used to build a set of data templates for prediction. The data templates include
RE-CpGs and their methylation data for model training, neighboring CpGs of RE-CpGs and their
methylation data for model prediction, and other necessary information about the prediction.
This function is useful when one needs to experiment different tunning parameters so that these pre-built
data templates can be re-used and substaintially improve efficiency.
rempTemplate( methyDat = NULL, Seq.GR = NULL, parcel = NULL, win = 1000, verbose = FALSE )
rempTemplate( methyDat = NULL, Seq.GR = NULL, parcel = NULL, win = 1000, verbose = FALSE )
methyDat |
A |
Seq.GR |
A |
parcel |
An |
win |
An integer specifying window size to confine the upstream and downstream flanking
region centered on the predicted CpG in RE for prediction. Default = |
verbose |
Logical parameter. Should the function be verbose? |
A template
object containing a GRanges
object of neifhboring CpGs of RE-CpGs
to be predicted ($NBCpG_GR
) and their methylation dataset matrix ($NBCpG_methyDat
);
a GRanges
object of RE-CpGs for model training ($RECpG_GR
) and their methylation
dataset matrix ($RECpG_methyDat
); GRanges
objects of RefSeq Gene database ($refgene
)
and RE ($RE
) that are extracted from the parcel
input; a string of RE type ($REtype
)
and a string of methylation platform ($arrayType
). Note: the subset operator []
is supported.
if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") if (!exists("remparcel")) { data(Alu.hg19.demo) remparcel <- initREMP(arrayType = "450k", REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, ncore = 1, verbose = TRUE) } template <- rempTemplate(GM12878_450k, parcel = remparcel, win = 1000, verbose = TRUE) template ## To make a subset template[1]
if (!exists("GM12878_450k")) GM12878_450k <- getGM12878("450k") if (!exists("remparcel")) { data(Alu.hg19.demo) remparcel <- initREMP(arrayType = "450k", REtype = "Alu", annotation.source = "AH", genome = "hg19", RE = Alu.hg19.demo, ncore = 1, verbose = TRUE) } template <- rempTemplate(GM12878_450k, parcel = remparcel, win = 1000, verbose = TRUE) template ## To make a subset template[1]