Title: | Smooth modeling of bisulfite sequencing |
---|---|
Description: | This package aims to analyse count-based methylation data on predefined genomic regions, such as those obtained by targeted sequencing, and thus to identify differentially methylated regions (DMRs) that are associated with phenotypes or traits. The method is built a rich flexible model that allows for the effects, on the methylation levels, of multiple covariates to vary smoothly along genomic regions. At the same time, this method also allows for sequencing errors and can adjust for variability in cell type mixture. |
Authors: | Kaiqiong Zhao [aut], Kathleen Klein [cre], Audrey Lemaçon [ctb, ctr], Simon Laurin-Lemay [ctb, ctr], My Intelligent Machines Inc. [ctr], Celia Greenwood [ths, aut] |
Maintainer: | Kathleen Klein <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.13.0 |
Built: | 2024-10-13 05:24:03 UTC |
Source: | https://github.com/bioc/SOMNiBUS |
This function fits a (dispersion-adjusted) binomial regression model to regional methylation data, and reports the estimated smooth covariate effects and regional p-values for the test of DMRs (differentially methylation regions). Over or under dispersion across loci is accounted for in the model by the combination of a multiplicative dispersion parameter (or scale parameter) and a sample-specific random effect.
This method can deal with outcomes, i.e. the number of
methylated reads in a region, that are contaminated by known
false methylation calling rate (p0
) and false non-methylation
calling rate (1-p1
).
The covariate effects are assumed to smoothly vary across
genomic regions. In order to estimate them, the algorithm first
represents the functional parameters by a linear combination of a set of
restricted cubic splines (with dimension n.k
), and a smoothness
penalization term which depends on the smoothing parameters lambdas
is also added to control smoothness. The estimation is performed by an
iterated EM algorithm. Each M step constitutes an outer Newton's iteration to
estimate smoothing parameters lambdas
and an inner P-IRLS iteration to
estimate spline coefficients alpha
for the covariate effects.
Currently, the computation in the M step depends the implementation of
gam()
in package mgcv
.
binomRegMethModel( data, n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE, epsilon = 10^(-6), epsilon.lambda = 10^(-3), maxStep = 200, binom.link = "logit", method = "REML", covs = NULL, RanEff = TRUE, reml.scale = FALSE, scale = -2, verbose = TRUE )
binomRegMethModel( data, n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE, epsilon = 10^(-6), epsilon.lambda = 10^(-3), maxStep = 200, binom.link = "logit", method = "REML", covs = NULL, RanEff = TRUE, reml.scale = FALSE, scale = -2, verbose = TRUE )
data |
a data frame with rows as individual CpGs appearing in all the
samples. The first 4 columns should contain the information of |
n.k |
a vector of basis dimensions for the intercept and
individual covariates. |
p0 |
the probability of observing a methylated read when the underlying
true status is unmethylated. |
p1 |
the probability of observing a methylated read when the underlying
true status is methylated. |
Quasi |
whether a Quasi-likelihood estimation approach will be used; in other words, whether a multiplicative dispersion is added in the model or not. |
epsilon |
numeric; stopping criterion for the closeness of estimates of spline coefficients from two consecutive iterations. |
epsilon.lambda |
numeric; stopping criterion for the closeness of
estimates of smoothing parameter |
maxStep |
the algorithm will step if the iteration steps exceed
|
binom.link |
the link function used in the binomial regression model; the default is the logit link. |
method |
the method used to estimate the smoothing
parameters. The default is the 'REML' method which is generally better than
prediction based criterion |
covs |
a vector of covariate names. The covariates with names in
|
RanEff |
whether sample-level random effects are added or not |
reml.scale |
whether a REML-based scale (dispersion) estimator is used. The default is Fletcher-based estimator. |
scale |
negative values mean scale parameter should be estimated; if a positive value is provided, a fixed scale will be used. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
This function return a list
including objects:
est
: estimates of the spline basis coefficients alpha
lambda
: estimates of the smoothing parameters for each
functional parameters
est.pi
: predicted methylation levels for each row in the input
data
ite.points
: estimates of est
, lambda
at each EM
iteration
cov1
: estimated variance-covariance matrix of the basis
coefficients alphas
reg.out
: regional testing output obtained using Fletcher-based
dispersion estimate; an additional 'ID' row would appear if RanEff is TRUE
reg.out.reml.scale
:regional testing output obtained using
REML-based dispersion estimate;
reg.out.gam
:regional testing output obtained using
(Fletcher-based) dispersion estimate from mgcv package;
phi_fletcher
: Fletcher-based estimate of the (multiplicative)
dispersion parameter;
phi_reml
: REML-based estimate of the (multiplicative)
dispersion parameter;
phi_gam
: Estimated dispersion parameter reported by mgcv;
SE.out
: a matrix of the estimated pointwise Standard Errors
(SE); number of rows are the number of unique CpG sites in the input data and
the number of columns equal to the total number of covariates fitted in the
model (the first one is the intercept);
SE.out.REML.scale
: a matrix of the estimated pointwise Standard
Errors (SE); the SE calculated from the REML-based dispersion estimates
uni.pos
: the genomic postions for each row of CpG sites in the
matrix SE.out
;
Beta.out
: a matrix of the estimated covariate effects beta(t),
where t denotes the genomic positions;
ncovs
: number of functional paramters in the model (including
the intercept);
sigma00
: estimated variance for the random effect if RanEff is
TRUE; NA if RanEff is FALSE.
Kaiqiong Zhao
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5,3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200 )
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5,3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200 )
This function accepts an output object from function
binomRegMethModel
and print out a plot of the estimated
effect across the region for each test covariate.
binomRegMethModelPlot( BEM.obj, mfrow = NULL, same.range = FALSE, title = "Smooth covariate effects", covs = NULL, save = NULL, verbose = TRUE )
binomRegMethModelPlot( BEM.obj, mfrow = NULL, same.range = FALSE, title = "Smooth covariate effects", covs = NULL, save = NULL, verbose = TRUE )
BEM.obj |
an output object from function |
mfrow |
A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device. |
same.range |
specify whether the plots should be in the same vertical scale |
title |
the text for the title |
covs |
a vector of covariate names. The covariates with
names in |
save |
file name to create on disk. When the value is set to NULL, the plot is not saved. The default value is NULL. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
This function prints out a plot of smooth covariate effects and its pointwise confidence intervals
Kaiqiong Zhao, Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) head(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, Quasi = FALSE, RanEff = FALSE ) binomRegMethModelPlot(out, same.range=FALSE)
#------------------------------------------------------------# data(RAdat) head(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, Quasi = FALSE, RanEff = FALSE ) binomRegMethModelPlot(out, same.range=FALSE)
This function returns the predicted methylation levels
binomRegMethModelPred( BEM.obj, newdata = NULL, type = "proportion", verbose = TRUE )
binomRegMethModelPred( BEM.obj, newdata = NULL, type = "proportion", verbose = TRUE )
BEM.obj |
an output from the function |
newdata |
the data set whose predictions are calculated; with columns 'Position', covariate names matching the BEM.obj |
type |
return the predicted methylation proportion or the predicted response (in logit or other binom.link scale) |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
This function returns the predicted methylation levels
Kaiqiong Zhao
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, Quasi = FALSE, RanEff = FALSE ) binomRegMethModelPred(out)
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, Quasi = FALSE, RanEff = FALSE ) binomRegMethModelPred(out)
Simulate Bisulfite sequencing data from a Generalized Additive
Model with functional parameters varying with the genomic position. Both the
true methylated counts and observed methylated counts are generated, given
the error/conversion rate parameters p0
and p1
. In addition,
the true methylated counts can be simulated from a binomial or a dispersed
binomial distribution (Beta-binomial distribution).
binomRegMethModelSim( n, posit, theta.0, beta, phi, random.eff = FALSE, mu.e = 0, sigma.ee = 1, p0 = 0.003, p1 = 0.9, X, Z, binom.link = "logit", verbose = TRUE )
binomRegMethModelSim( n, posit, theta.0, beta, phi, random.eff = FALSE, mu.e = 0, sigma.ee = 1, p0 = 0.003, p1 = 0.9, X, Z, binom.link = "logit", verbose = TRUE )
n |
sample size |
posit |
a numeric vector of size |
theta.0 |
numeric vector of size |
beta |
numeric vector of size |
phi |
a vector of length |
random.eff |
indicates whether adding the subject-specific random effect
term |
mu.e |
number, the mean of the random effect. |
sigma.ee |
positive number, variance of the random effect |
p0 |
the probability of observing a methylated read when the underlying
true status is unmethylated. |
p1 |
the probability of observing a methylated read when the underlying
true status is methylated. |
X |
the matrix of the read coverage for each CpG in each sample; a
matrix of n rows and |
Z |
numeric matrix with |
binom.link |
the link function used for simulation |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
The function returns a list of following objects
S
a numeric matrix of n
rows and p
columns
containing the true methylation counts;
Y
a numeric matrix of n
rows and p
columns
containing the observed methylation counts;
theta
a numeric matrix of n
rows and p
columns
containing the methylation parameter (after the logit transformation);
pi
a numeric matrix of n
rows and p
columns
containing the true methylation proportions used to simulate the data.
Kaiqiong Zhao
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0, p1=1, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, RanEff = FALSE ) Z = as.matrix(RAdat.f[match(unique(RAdat.f$ID), RAdat.f$ID), c('T_cell', 'RA')]) set.seed(123) X = matrix(sample(80, nrow(Z)*length(out$uni.pos), replace = TRUE), nrow = nrow(Z), ncol = length(out$uni.pos))+10 simdat = binomRegMethModelSim(n=nrow(Z), posit= out$uni.pos, theta.0=out$Beta.out[,1], beta= out$Beta.out[,-1], random.eff=FALSE, mu.e=0,sigma.ee=1, p0=0.003, p1=0.9,X=X , Z=Z, binom.link='logit', phi = rep(1, length(out$uni.pos)))
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) out <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0, p1=1, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, RanEff = FALSE ) Z = as.matrix(RAdat.f[match(unique(RAdat.f$ID), RAdat.f$ID), c('T_cell', 'RA')]) set.seed(123) X = matrix(sample(80, nrow(Z)*length(out$uni.pos), replace = TRUE), nrow = nrow(Z), ncol = length(out$uni.pos))+10 simdat = binomRegMethModelSim(n=nrow(Z), posit= out$uni.pos, theta.0=out$Beta.out[,1], beta= out$Beta.out[,-1], random.eff=FALSE, mu.e=0,sigma.ee=1, p0=0.003, p1=0.9,X=X , Z=Z, binom.link='logit', phi = rep(1, length(out$uni.pos)))
This function accepts the data.frame
used as an input
for the function binomRegMethModelPred
with additional
columns containing the predictions generated by the function
binomRegMethModelPred
and columns containing the name of each
experimental group and returns a plot representing the predicted methylation
levels according to each experimental group.
binomRegMethPredPlot( pred, pred.type = "proportion", pred.col = "pred", group.col = NULL, title = "Predicted methylation levels", style = NULL, save = NULL, verbose = TRUE )
binomRegMethPredPlot( pred, pred.type = "proportion", pred.col = "pred", group.col = NULL, title = "Predicted methylation levels", style = NULL, save = NULL, verbose = TRUE )
pred |
|
pred.type |
type of prediction returned by the function
|
pred.col |
character defines the name of the column containing the prediction values. The default value is "pred". |
group.col |
character defines the name of the column containing the experimental groups. If the group.col is set to NULL, the resulting plot will be a simple scatter plot representing all predicted values disregarding any experimental design. The default value is NULL. |
title |
the text for the title |
style |
named list containing the wanted style
(color and line type) for each experimental groups. The first
level list is named according each experimental group and for each
experimental group there is a list containing the
The function accepts color name and its hexadecimal code.
The default value is NULL meaning that the colors will be chosen randomly and
the line style will be set to |
save |
file name to create on disk. When the value is set to NULL, the plot is not saved. The default value is NULL. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
This function prints out a plot of the predicted methylation levels according to preset experimental groups.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) BEM.obj <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, Quasi = FALSE, RanEff = FALSE, verbose = FALSE ) pos <- BEM.obj$uni.pos newdata <- expand.grid(pos, c(0, 1), c(0, 1)) colnames(newdata) <- c("Position", "T_cell", "RA") my.pred <- binomRegMethModelPred(BEM.obj, newdata, type = "link.scale", verbose = FALSE) newdata$group <- "" newdata[(newdata$RA == 0 & newdata$T_cell == 0),]$group <- "CTRL MONO" newdata[(newdata$RA == 0 & newdata$T_cell == 1),]$group <- "CTRL TCELL" newdata[(newdata$RA == 1 & newdata$T_cell == 0),]$group <- "RA MONO" newdata[(newdata$RA == 1 & newdata$T_cell == 1),]$group <- "RA TCELL" pred <- cbind(newdata, Pred = my.pred) style <- list("CTRL MONO" = list(color = "blue", type = "dashed"), "CTRL TCELL" = list(color = "green", type = "dashed"), "RA MONO" = list(color = "blue", type = "solid"), "RA TCELL" = list(color = "green", type = "solid")) g <- binomRegMethPredPlot(pred, pred.col = "Pred", group.col = "group", style = style, save = NULL, verbose = FALSE)
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) BEM.obj <- binomRegMethModel( data=RAdat.f, n.k=rep(5, 3), p0=0.003307034, p1=0.9, epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200, Quasi = FALSE, RanEff = FALSE, verbose = FALSE ) pos <- BEM.obj$uni.pos newdata <- expand.grid(pos, c(0, 1), c(0, 1)) colnames(newdata) <- c("Position", "T_cell", "RA") my.pred <- binomRegMethModelPred(BEM.obj, newdata, type = "link.scale", verbose = FALSE) newdata$group <- "" newdata[(newdata$RA == 0 & newdata$T_cell == 0),]$group <- "CTRL MONO" newdata[(newdata$RA == 0 & newdata$T_cell == 1),]$group <- "CTRL TCELL" newdata[(newdata$RA == 1 & newdata$T_cell == 0),]$group <- "RA MONO" newdata[(newdata$RA == 1 & newdata$T_cell == 1),]$group <- "RA TCELL" pred <- cbind(newdata, Pred = my.pred) style <- list("CTRL MONO" = list(color = "blue", type = "dashed"), "CTRL TCELL" = list(color = "green", type = "dashed"), "RA MONO" = list(color = "blue", type = "solid"), "RA TCELL" = list(color = "green", type = "solid")) g <- binomRegMethPredPlot(pred, pred.col = "Pred", group.col = "group", style = style, save = NULL, verbose = FALSE)
This function reads and converts Bismark's
'genome wide cytosine report' and 'coverage' into a
list
of data.frame
s (one per chromosome) to a format compatible
with SOMNiBUS' main functions runSOMNiBUS
and
binomRegMethModel
.
formatFromBismark(..., verbose = TRUE)
formatFromBismark(..., verbose = TRUE)
... |
parameters from |
verbose |
logical indicates the level of information provided by the algorithm during the process. The default value is TRUE. |
This function returns a list
of data.frame
s (one per
chromosome). Each data.frame
contains rows as individual CpGs
appearing in all the samples. The first 4 columns contain the information of
Meth_Counts
(methylated counts), Total_Counts
(read depths),
Position
(Genomic position for the CpG site) and ID
(sample ID).
The additional information (such as disease status, sex, age) extracted from
the BSseq object are listed in column 5 and onwards and will be considered
as covariate information by SOMNiBUS algorithms.
Audrey Lemaçon
read.bismark for parsing output from the Bismark alignment suite.
Other Parsing functions:
formatFromBSmooth()
,
formatFromBSseq()
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") dat <- formatFromBismark(infile, verbose = FALSE)
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") dat <- formatFromBismark(infile, verbose = FALSE)
This function reads and converts BSmooth's output into a
list
of data.frame
s (one per chromosome) to a format compatible
with SOMNiBUS' main functions runSOMNiBUS
and
binomRegMethModel
.
formatFromBSmooth(..., verbose = TRUE)
formatFromBSmooth(..., verbose = TRUE)
... |
parameters from |
verbose |
logical indicates the level of information provided by the algorithm during the process. The default value is TRUE. |
This function returns a list
of data.frame
s (one per
chromosome). Each data.frame
contains rows as individual CpGs
appearing in all the samples. The first 4 columns contain the information of
Meth_Counts
(methylated counts), Total_Counts
(read depths),
Position
(Genomic position for the CpG site) and ID
(sample ID).
The additional information (such as disease status, sex, age) extracted from
the BSseq object are listed in column 5 and onwards and will be considered
as covariate information by SOMNiBUS algorithms.
Audrey Lemaçon
read.bismark for parsing output from the Bismark alignment suite.
Other Parsing functions:
formatFromBSseq()
,
formatFromBismark()
indir <- system.file("extdata/ev_bt2_tab", package = "SOMNiBUS") dat <- formatFromBSmooth(indir, verbose = FALSE)
indir <- system.file("extdata/ev_bt2_tab", package = "SOMNiBUS") dat <- formatFromBSmooth(indir, verbose = FALSE)
This function reads and converts a BSseq object into a
list
of data.frame
s (one per chromosome) to a format compatible
with SOMNiBUS' main functions runSOMNiBUS
and
binomRegMethModel
.
formatFromBSseq(bsseq_dat, verbose = TRUE)
formatFromBSseq(bsseq_dat, verbose = TRUE)
bsseq_dat |
an object of class BSseq. |
verbose |
logical indicates the level of information provided by the algorithm during the process. The default value is TRUE. |
This function returns a list
of data.frame
s (one per
chromosome). Each data.frame
contains rows as individual CpGs
appearing in all the samples. The first 4 columns contain the information of
Meth_Counts
(methylated counts), Total_Counts
(read depths),
Position
(Genomic position for the CpG site) and ID
(sample ID).
The additional information (such as disease status, sex, age) extracted from
the BSseq object are listed in column 5 and onwards and will be considered
as covariate information by SOMNiBUS algorithms.
Audrey Lemaçon
BSseq for the BSseq class.
Other Parsing functions:
formatFromBSmooth()
,
formatFromBismark()
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- bsseq::BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) dat <- formatFromBSseq(BStest, verbose = FALSE)
M <- matrix(1:9, 3,3) colnames(M) <- c("A1", "A2", "A3") BStest <- bsseq::BSseq(pos = 1:3, chr = c("chr1", "chr2", "chr1"), M = M, Cov = M + 2) dat <- formatFromBSseq(BStest, verbose = FALSE)
A dataset containing methylation levels on one targeted region on chromosome 4 near gene BANK1 from cases with rheumatoid arthritis (RA) and controls.
RAdat
RAdat
A data frame of 5289 rows and 6 columns. Each row represents a CpG site for a sample. Columns include in order:
Number of methylated reads
Total number of reads; read-depth
Genomic position (in bp) for the CpG site
indicates which sample the CpG site belongs to
whether a sample is from T cell or monocyte
whether a sample is an RA patient or control
This example data include methylation levels of cell type separated blood samples of 22 rheumatoid arthritis (RA) patients and 21 healthy individuals. In the data set, 123 CpG sites are measured and there are 25 samples from circulating T cells and 18 samples from monocytes.
Dr. Marie Hudson (McGill University)
This example data include methylation levels on a region with 208 CpGs for 116 blood samples.
RAdat2
RAdat2
A data frame of 6064 rows and 13 columns. Each row represents a CpG site for a sample. Columns include in order:
Number of methylated reads
Total number of reads; read-depth
Genomic position (in bp) for the CpG site
indicates which sample the CpG site belongs to
binary indicator for a biomarker anti-citrullinated protein antibody
Age
2-female; 1-male
1-current or ex-smoker; 0-non-smoker
1-Smoking info is NA; 0-Smoking info is available
PC1 for the cell type proportions
PC2 for the cell type proportions
PC3 for the cell type proportions
PC4 for the cell type proportions
simulation is based a real data set provided by PI Dr. Sasha Bernatsky (McGill University)
This function splits the methylation data into regions (according to different approaches) and, for each region, fits a (dispersion-adjusted) binomial regression model to regional methylation data, and reports the estimated smooth covariate effects and regional p-values for the test of DMRs (differentially methylation regions). Over or under dispersion across loci is accounted for in the model by the combination of a multiplicative dispersion parameter (or scale parameter) and a sample-specific random effect.
This method can deal with outcomes, i.e. the number of
methylated reads in a region, that are contaminated by known
false methylation calling rate (p0
) and false non-methylation
calling rate (1-p1
).
The covariate effects are assumed to smoothly vary across
genomic regions. In order to estimate them, the algorithm first
represents the functional parameters by a linear combination
of a set of restricted cubic splines (with dimension
n.k
), and a smoothness penalization term
which depends on the smoothing parameters lambdas
is also
added to control smoothness. The estimation is performed by an iterated
EM algorithm. Each M step constitutes an outer Newton's iteration to estimate
smoothing parameters lambdas
and an inner P-IRLS iteration to estimate
spline coefficients alpha
for the covariate effects.
Currently, the computation in the M step depends the implementation of
gam()
in package mgcv
.
runSOMNiBUS( dat, split = list(approach = "region"), min.cpgs = 50, max.cpgs = 2000, n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE, epsilon = 10^(-6), epsilon.lambda = 10^(-3), maxStep = 200, binom.link = "logit", method = "REML", covs = NULL, RanEff = TRUE, reml.scale = FALSE, scale = -2, verbose = TRUE )
runSOMNiBUS( dat, split = list(approach = "region"), min.cpgs = 50, max.cpgs = 2000, n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE, epsilon = 10^(-6), epsilon.lambda = 10^(-3), maxStep = 200, binom.link = "logit", method = "REML", covs = NULL, RanEff = TRUE, reml.scale = FALSE, scale = -2, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing
in all the samples. The first 4 columns should contain the information of
|
split |
this
This list should also contain additional parameters specific to each partitioning approach (see the documentation of each approach for details). |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
n.k |
a vector of basis dimensions for the intercept and
individual covariates. |
p0 |
the probability of observing a methylated read when the underlying
true status is unmethylated. |
p1 |
the probability of observing a methylated read when the underlying
true status is methylated. |
Quasi |
whether a Quasi-likelihood estimation approach will be used; in other words, whether a multiplicative dispersion is added in the model or not. |
epsilon |
numeric; stopping criterion for the closeness of estimates of spline coefficients from two consecutive iterations. |
epsilon.lambda |
numeric; stopping criterion for the closeness of
estimates of smoothing parameter |
maxStep |
the algorithm will step if the iteration steps exceed
|
binom.link |
the link function used in the binomial regression model; the default is the logit link. |
method |
the method used to estimate the smoothing
parameters. The default is the 'REML' method which is generally better than
prediction based criterion |
covs |
a vector of covariate names. The covariates with names in
|
RanEff |
whether sample-level random effects are added or not |
reml.scale |
whether a REML-based scale (dispersion) estimator is used. The default is Fletcher-based estimator. |
scale |
negative values mean scale parameter should be estimated; if a positive value is provided, a fixed scale will be used. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
This function returns a list
of models (one by independent
region) including objects:
est
: estimates of the spline basis coefficients alpha
lambda
: estimates of the smoothing parameters for each
functional parameters
est.pi
: predicted methylation levels for each row in the input
data
ite.points
: estimates of est
, lambda
at each EM
iteration
cov1
: estimated variance-covariance matrix of the basis
coefficients alphas
reg.out
: regional testing output obtained using Fletcher-based
dispersion estimate; an additional 'ID' row would appear if RanEff is TRUE
reg.out.reml.scale
: regional testing output obtained using
REML-based dispersion estimate;
reg.out.gam
: regional testing output obtained using
(Fletcher-based) dispersion estimate from mgcv package;
phi_fletcher
: Fletcher-based estimate of the (multiplicative)
dispersion parameter;
phi_reml
: REML-based estimate of the (multiplicative)
dispersion parameter;
phi_gam
: Estimated dispersion parameter reported by mgcv;
SE.out
: a matrix of the estimated pointwise Standard Errors
(SE); number of rows are the number of unique CpG sites in the input data and
the number of columns equal to the total number of covariates fitted in the
model (the first one is the intercept);
SE.out.REML.scale
: a matrix of the estimated pointwise Standard
Errors (SE); the SE calculated from the REML-based dispersion estimates
uni.pos
: the genomic postions for each row of CpG sites in the
matrix SE.out
;
Beta.out
: a matrix of the estimated covariate effects beta(t),
where t denotes the genomic positions;
ncovs
: number of functional paramters in the model (including
the intercept);
sigma00
: estimated variance for the random effect if RanEff is
TRUE; NA if RanEff is FALSE.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) outs <- runSOMNiBUS( dat=RAdat.f, split = list(approach = "region", gap = 1e6), min.cpgs = 5, n.k = rep(5,3), p0 = 0.003, p1 = 0.9 )
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) outs <- runSOMNiBUS( dat=RAdat.f, split = list(approach = "region", gap = 1e6), min.cpgs = 5, n.k = rep(5,3), p0 = 0.003, p1 = 0.9 )
This function splits the methylation data into regions based on the genomic annotation provided under the form of a 1-based BED file
splitDataByBed( dat, chr, bed, gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
splitDataByBed( dat, chr, bed, gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing
in all the samples. The first 4 columns should contain the information of
|
chr |
character vector containing the chromosome information. Its length
should be equal to the number of rows in |
bed |
character, path to the 1-based BED file containing the annotations |
gap |
integer defining the maximum gap that is allowed between
two regions to be considered as overlapping.
According to the |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
A named list
of data.frame
containing the data of each
independent region.
Audrey Lemaçon
This function splits the methylation data into regions
based on the chromatin states predicted by ChromHMM software
(Ernst and Kellis (2012)).
The annotations come from the Bioconductor package annnotatr
.
Chromatin states determined by chromHMM are
available in hg19 for nine cell lines (Gm12878, H1hesc, Hepg2, Hmec, Hsmm,
Huvec, K562, Nhek, and Nhlf).
splitDataByChromatin( dat, chr, cell.line, states, gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
splitDataByChromatin( dat, chr, cell.line, states, gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing
in all the samples. The first 4 columns should contain the information of
|
chr |
character vector containing the chromosome information. Its length
should be equal to the number of rows in |
cell.line |
character defining the cell line of interest. Nine cell lines are available:
|
states |
character vector defining the chromatin states of interest among the following available options:
|
gap |
this integer defines the maximum gap that is allowed between
two regions to be considered as overlapping.
According to the |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
A list
of data.frame
containing the data of each
independent region.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByChromatin(dat = RAdat.f, cell.line = "huvec", chr = rep(x = "chr4", times = nrow(RAdat.f)), states = "Insulator", verbose = FALSE)
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByChromatin(dat = RAdat.f, cell.line = "huvec", chr = rep(x = "chr4", times = nrow(RAdat.f)), states = "Insulator", verbose = FALSE)
This function splits the methylation data into regions based on the density of CpGs.
splitDataByDensity( dat, window.size = 100, by = 1, min.density = 5, gap = 10, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
splitDataByDensity( dat, window.size = 100, by = 1, min.density = 5, gap = 10, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing
in all the samples. The first 4 columns should contain the information of
|
window.size |
this positive integer defines the size of the
sliding window in bp. Decimal values will be rounded to the nearest integer.
The value should be greater than 10. The default value is |
by |
positive integer defines by how many base pairs the
window moves at each increment. Decimal values will be rounded to the
nearest integer. The default value is |
min.density |
positive integer defines the minimum density
threshold for each window. Decimal values will be rounded to the
nearest integer. The default value is |
gap |
positive integer defining the gap width
beyond which we consider that two regions are independent.
Decimal values will be rounded to the nearest integer.
The default value is |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
A named list
of data.frame
containing the data of each
independent region.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByDensity(dat = RAdat.f, window.size = 100, by = 1, min.density = 5, gap = 10, min.cpgs = 50, verbose = FALSE)
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByDensity(dat = RAdat.f, window.size = 100, by = 1, min.density = 5, gap = 10, min.cpgs = 50, verbose = FALSE)
This function splits the methylation data into regions
based on the genes. The annotations are coming from the Bioconductor
package annnotatr
.
splitDataByGene( dat, chr, organism = "human", build = "hg38", types = "promoter", gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
splitDataByGene( dat, chr, organism = "human", build = "hg38", types = "promoter", gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing
in all the samples. The first 4 columns should contain the information of
|
chr |
character vector containing the chromosome information. Its length
should be equal to the number of rows in |
organism |
character defining the organism of interest
Only Homo sapiens ( |
build |
character defining the version of the genome build on which the
methylation data have been mapped. By default, the build is set to
|
types |
character vector defining the type of genic annotations to use among the following options:
|
gap |
this integer defines the maximum gap allowed between two regions
to be considered as overlapping.
According to the |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
A named list
of data.frame
containing the data of each
independent region.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) # Add a column containing the chromosome information RAdat$Chr <- "chr4" RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByGene(dat = RAdat.f, chr = rep(x = "chr1", times = nrow(RAdat.f)), verbose = FALSE)
#------------------------------------------------------------# data(RAdat) # Add a column containing the chromosome information RAdat$Chr <- "chr4" RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByGene(dat = RAdat.f, chr = rep(x = "chr1", times = nrow(RAdat.f)), verbose = FALSE)
This function splits the methylation data into regions based on the genomic annotations provided under the form of a GenomicRanges object.
splitDataByGRanges( dat, chr, annots, gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
splitDataByGRanges( dat, chr, annots, gap = -1, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing
in all the samples. The first 4 columns should contain the information of
|
chr |
character vector containing the chromosome information. Its length
should be equal to the number of rows in |
annots |
GenomicRanges object containing the annotations |
gap |
integer defining the maximum gap that is allowed between
two regions to be considered as overlapping.
According to the |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
A named list
of data.frame
containing the data of each
independent region.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) annot <- GenomicRanges::GRanges(seqnames = "chr1", IRanges::IRanges( start = c(102711720,102711844,102712006,102712503,102712702), end = c(102711757,102711909,102712195,102712637,102712712) )) results <- splitDataByGRanges(dat = RAdat.f, chr = rep(x = "chr1", times = nrow(RAdat.f)), annots = annot, gap = -1, min.cpgs = 5)
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) annot <- GenomicRanges::GRanges(seqnames = "chr1", IRanges::IRanges( start = c(102711720,102711844,102712006,102712503,102712702), end = c(102711757,102711909,102712195,102712637,102712712) )) results <- splitDataByGRanges(dat = RAdat.f, chr = rep(x = "chr1", times = nrow(RAdat.f)), annots = annot, gap = -1, min.cpgs = 5)
This function splits the methylation data into regions based on the spacing of CpGs.
splitDataByRegion( dat, gap = 1e+06, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
splitDataByRegion( dat, gap = 1e+06, min.cpgs = 50, max.cpgs = 2000, verbose = TRUE )
dat |
a data frame with rows as individual CpGs appearing in all the
samples. The first 4 columns should contain the information of |
gap |
positive integer defining the gap width
beyond which we consider that two regions are independent.
Odd and decimal values will be rounded to the next even numbers
(e.g. 8.2 and 8.7 become gaps of 8 and 10 respectively).
The default value is |
min.cpgs |
positive integer defining the minimum number of CpGs within a region for the algorithm to perform optimally. The default value is 50. |
max.cpgs |
positive integer defining the maximum number of CpGs within a region for the algorithm to perform optimally. The default value is 2000. |
verbose |
logical indicates if the algorithm should provide progress report information. The default value is TRUE. |
A named list
of data.frame
containing the data of each
independent region.
Audrey Lemaçon
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByRegion( dat=RAdat.f, gap = 1e6, min.cpgs = 5, verbose = FALSE)
#------------------------------------------------------------# data(RAdat) RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ]) results <- splitDataByRegion( dat=RAdat.f, gap = 1e6, min.cpgs = 5, verbose = FALSE)