The purpose of this vignette is to show users how to retrofit their
methylSig
< 0.99.0 code to work with the refactor in
version 0.99.0 and later.
In versions < 0.99.0 of methylSig
, the
methylSigReadData()
function read Bismark coverage files,
Bismark genome-wide CpG reports, or MethylDackel bedGraphs.
Additionally, users could destrand the data, filter by coverage, and
filter SNPs.
In versions >= 0.99.0 of methylSig
, the user should
read data with bsseq::read.bismark()
and then apply
functions that were once bundled within
methylSigReadData()
.
files = c(
system.file('extdata', 'bis_cov1.cov', package='methylSig'),
system.file('extdata', 'bis_cov2.cov', package='methylSig')
)
bsseq_stranded = bsseq::read.bismark(
files = files,
colData = data.frame(row.names = c('test1','test2')),
rmZeroCov = FALSE,
strandCollapse = FALSE
)
After reading data, filter by coverage. Note, we are changing our dataset to something we can use with the downstream functions.
# Load data for use in the rest of the vignette
data(BS.cancer.ex, package = 'bsseqData')
bs = BS.cancer.ex[1:10000]
bs = filter_loci_by_coverage(bs, min_count = 5, max_count = 500)
#> Warning in value[[3L]](cond): expected no additional right arguments for
#> 'DelayedNaryIsoOp', falling back to an unknown matrix
#> Warning in value[[3L]](cond): expected no additional right arguments for
#> 'DelayedNaryIsoOp', falling back to an unknown matrix
If the locations of C-to-T and G-to-A SNPs are known, or some other set of location should be removed:
In versions < 0.99.0 of methylSig
, the
methylSigTile()
function combined aggregating CpG data over
pre-defined tiles and genomic windows.
In versions >= 0.99.0 of methylSig
, tiling is
separated into two functions, tile_by_regions()
and
tile_by_windows()
. Users should chooose one or the
other.
windowed_bs = tile_by_windows(bs = bs, win_size = 10000)
#> Warning in value[[3L]](cond): expected no additional right arguments for
#> 'DelayedNaryIsoOp', falling back to an unknown matrix
#> Warning in value[[3L]](cond): expected no additional right arguments for
#> 'DelayedNaryIsoOp', falling back to an unknown matrix
# Collapsed promoters on chr21 and chr22
data(promoters_gr, package = 'methylSig')
promoters_bs = tile_by_regions(bs = bs, gr = promoters_gr)
#> Warning in value[[3L]](cond): expected no additional right arguments for
#> 'DelayedNaryIsoOp', falling back to an unknown matrix
#> Warning in value[[3L]](cond): expected no additional right arguments for
#> 'DelayedNaryIsoOp', falling back to an unknown matrix
In versions < 0.99.0 of methylSig
, the
methylSigCalc
function had a min.per.group
parameter to determine how many samples per group had to have coverage
in order to be tested.
In versions >= 0.99.0 of methylSig
, the
min.per.group
functionality is performed by a separate
function filter_loci_by_group_coverage()
. Also note the
change in form to define dispersion calculations, and the use of local
information.
# Look a the phenotype data for bs
bsseq::pData(bs)
#> DataFrame with 6 rows and 2 columns
#> Type Pair
#> <character> <character>
#> C1 cancer pair1
#> C2 cancer pair2
#> C3 cancer pair3
#> N1 normal pair1
#> N2 normal pair2
#> N3 normal pair3
# Require at least two samples from cancer and two samples from normal
bs = filter_loci_by_group_coverage(
bs = bs,
group_column = 'Type',
c('cancer' = 2, 'normal' = 2))
After removing loci with insufficient information, we can now use the
diff_methylsig()
test.
In versions < 0.99.0 of methylSig
, the
methylSigDSS
function also had a min.per.group
parameter to determine how many samples per group had to have coverage.
Users also couldn’t specify which methylation groups to recover. The
form of design
, formula
, and
contrast
, remain the same in versions >= 0.99.0.
In versions >= 0.99.0 of methylSig
, the single
methylSigDSS()
function is replaced by a fit function
diff_dss_fit()
and a test functiotn
diff_dss_test()
. As with diff_methylsig()
,
users should ensure enough samples have sufficient coverage with the
filter_loci_by_group_coverage()
function. The
design
and formula
are unchanged in their
forms.
If a continuous covariate is to be tested,
filter_loci_by_group_coverage()
should be skipped, as there
are no groups. In prior versions of methylSigDSS()
, this
was not possible, and the group constraints were incorrectly applied
prior to testing on a continuous covariate.
# IF NOT DONE PREVIOUSLY
# Require at least two samples from cancer and two samples from normal
bs = filter_loci_by_group_coverage(
bs = bs,
group_column = 'Type',
c('cancer' = 2, 'normal' = 2))
# Test the simplest model with an intercept and Type
diff_fit_simple = diff_dss_fit(
bs = bs,
design = bsseq::pData(bs),
formula = as.formula('~ Type'))
#> Fitting DML model for CpG site:
The contrast
parameter is also changed in its form. Note
the, additional parameters to specify how to recover group methylation.
methylation_group_column
and
methylation_groups
should be specified for group versus
group comparisons. For continuous covariates,
methylation_group_column
is sufficient, and the samples
will be grouped into top/bottom 25 percentile based on the continuous
covariate column name given in
methylation_group_column
.
# Test the simplest model for cancer vs normal
# Note, 2 rows corresponds to 2 columns in diff_fit_simple$X
simple_contrast = matrix(c(0,1), ncol = 1)
diff_simple_gr = diff_dss_test(
bs = bs,
diff_fit = diff_fit_simple,
contrast = simple_contrast,
methylation_group_column = 'Type',
methylation_groups = c('case' = 'cancer', 'control' = 'normal'))
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] methylSig_1.19.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 rjson_0.2.23
#> [3] xfun_0.50 bslib_0.8.0
#> [5] rhdf5_2.51.2 Biobase_2.67.0
#> [7] lattice_0.22-6 bitops_1.0-9
#> [9] rhdf5filters_1.19.0 tools_4.4.2
#> [11] generics_0.1.3 curl_6.2.0
#> [13] stats4_4.4.2 parallel_4.4.2
#> [15] R.oo_1.27.0 Matrix_1.7-2
#> [17] BSgenome_1.75.1 data.table_1.16.4
#> [19] S4Vectors_0.45.2 sparseMatrixStats_1.19.0
#> [21] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
#> [23] compiler_4.4.2 Rsamtools_2.23.1
#> [25] Biostrings_2.75.3 statmod_1.5.0
#> [27] munsell_0.5.1 codetools_0.2-20
#> [29] permute_0.9-7 GenomeInfoDb_1.43.4
#> [31] htmltools_0.5.8.1 sys_3.4.3
#> [33] buildtools_1.0.0 sass_0.4.9
#> [35] RCurl_1.98-1.16 yaml_2.3.10
#> [37] crayon_1.5.3 jquerylib_0.1.4
#> [39] R.utils_2.12.3 BiocParallel_1.41.0
#> [41] DelayedArray_0.33.4 cachem_1.1.0
#> [43] limma_3.63.3 abind_1.4-8
#> [45] gtools_3.9.5 locfit_1.5-9.10
#> [47] digest_0.6.37 restfulr_0.0.15
#> [49] maketools_1.3.1 splines_4.4.2
#> [51] fastmap_1.2.0 grid_4.4.2
#> [53] colorspace_2.1-1 cli_3.6.3
#> [55] SparseArray_1.7.4 DSS_2.55.0
#> [57] S4Arrays_1.7.1 XML_3.99-0.18
#> [59] h5mread_0.99.4 DelayedMatrixStats_1.29.1
#> [61] UCSC.utils_1.3.1 scales_1.3.0
#> [63] rmarkdown_2.29 XVector_0.47.2
#> [65] httr_1.4.7 matrixStats_1.5.0
#> [67] R.methodsS3_1.8.2 beachmat_2.23.6
#> [69] HDF5Array_1.35.8 evaluate_1.0.3
#> [71] knitr_1.49 BiocIO_1.17.1
#> [73] GenomicRanges_1.59.1 IRanges_2.41.2
#> [75] rtracklayer_1.67.0 rlang_1.1.5
#> [77] Rcpp_1.0.14 glue_1.8.0
#> [79] BiocManager_1.30.25 BiocGenerics_0.53.5
#> [81] bsseq_1.43.1 jsonlite_1.8.9
#> [83] R6_2.5.1 Rhdf5lib_1.29.0
#> [85] GenomicAlignments_1.43.0 MatrixGenerics_1.19.1