magpie Package User’s Guide

logo

Introduction

Background

MeRIP-Seq (methylated RNA immunoprecipitation followed by sequencing) is a method used to study the epigenetic regulation of gene expression by identifying methylated RNA molecules in a sample. It involves immunoprecipitation of methylated RNA using an antibody specific for methylated nucleotides, followed by high-throughput sequencing of the immunoprecipitated RNA. The resulting data provide a snapshot of the methylation status of RNA molecules, allowing researchers to investigate the role of methylation in the regulation of gene expression and other biological processes. After being collected from two or more biological conditions, MeRIP-Seq data is typically analyzed using computational tools to identify differentially methylated regions (DMRs). DMR detection can uncover the functional significance of m6A methylation and identify potential therapeutic targets for disease such as cancer.

To establish the statistical rigor of MeRIP-Seq experiments, it is important to carefully consider sample size during the study design process in order to ensure that the experiment is adequately powered to detect differentially methylated RNA (DMR) regions. However, there is no such tool available, so we developed the R package magpie, a simulation-based tool for performing power calculations on MeRIP-Seq data. magpie has two main functions:

  • Power calculation: Given a MeRIP-Seq dataset, various experimental scenarios, and a selected test method, magpie can calculate the statistical power of the experiment and output the results.

  • Results preservation and visualization: magpie can save the results of the power calculation as an Excel file, and it can also produce basic line plots that allow the user to visualize the results.

Installation

From GitHub:

install.packages("devtools") # if you have not installed "devtools" package
library(devtools)
install_github("https://github.com/dxd429/magpie",
    build_vignettes = TRUE
)

From Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("magpie")

To view the package vignette in HTML format, run the following lines in R:

library(magpie)
vignette("magpie")

Quickstart

To get started with magpie, users are expected to provide MeRIP-Seq data for parameter estimation. This includes paired input and IP .BAM files for each experimental condition, along with the matching annotation file in the *.sqlite format. Since .BAM files are generally large, it is also encouraged to use data of only one or a few chromosomes.

magpie offers the function quickPower() for users who need quick power calculation results without their own simulations. This function takes few seconds to run and uses pre-calculated results from three publicly available MeRIP-seq datasets. The details of these datasets can be found in @ref(quickPower)). To use this function and save and plot the results:

library(magpie)
power.test <- quickPower(dataset = "GSE46705") # Options are 'GSE46705', 'GSE55575', and 'GSE94613'.

Save results into .xlsx files:

### write out .xlsx
writeToxlsx(power.test, file = "test_TRESS.xlsx")

### write out stratified results
writeToxlsx_strata(power.test, file = "test_strata_TRESS.xlsx")

Produce basic line plots for visulization:

### plot FDR under sequencing depth 1x
plotRes(power.test, depth_factor = 1, value_option = "FDR")

### plot all in a panel under sequencing depth 1x
plotAll(power.test, depth_factor = 1)

### plot a FDR strata result
plotStrata(power.test, value_option = "FDR")

### plot all strata results in a panel
plotAll_Strata(power.test)

Input data

In order to best mimic the characteristics of real data, magpie expects users to provide real MeRIP-Seq datasets, if not directly extracting our pre-calculated results.

magpie requires paired input and IP .BAM files for each replicate of both conditions. Example file names are as follows: “Ctrl1.input.bam” & Ctrl1.ip.bam”, “Ctrl2.input.bam” & Ctrl2.ip.bam”, “Case1.input.bam” & Case1.ip.bam”, “Case2.input.bam” & Case2.ip.bam”, for a 2 controls vs 2 cases DMR calling study. Note that names of provided .BAM files should be ordered carefully so that samples from different conditions will not be mistreated.

For illustration purpose, we include a sample dataset (GSE46705) from a study investigating how METTL3-METTL14 complex mediates mammalian nuclear RNA N6-adenosine methylation in our experimental data package magpieData on GitHub, which can be installed with:

install.packages("devtools") # if you have not installed "devtools" package
library(devtools)
install_github("https://github.com/dxd429/magpieData",
    build_vignettes = TRUE
)

The data package contains four .BAM files of two wild type (WT) replicates, four .BAM files of two knockdown of complex METTL3 replicates only on chromosome 15, and one genome annotation file.

In terms of the annotation file, it should match the reference genome version when obtaining the .BAM files, and be saved in the format of *.sqlite, which can be created with R function makeTxDbFromUCSC() from Bioconductor package GenomicFeatures:

## Use "makeTxDbFromUCSC" function to create an annotation file of hg18
library(GenomicFeatures)
hg18 <- makeTxDbFromUCSC(genome = "hg18", tablename = "knownGene")
saveDb(hg18, file = "hg18.sqlite")

Power calculation

magpie offers two functions for power evaluation: powerEval() and quickPower(). If users prefer to quickly examine the power evaluation results using the built-in datasets, they can use the quickPower() function. Otherwise, users can use the powerEval() function to perform power evaluation on their own data and customize the simulation settings. The output of these two functions is a list of power measurements under various experimental settings, such as FDR, Precision, and Statistical power.

Power evaluation with powerEval()

With .BAM files and an annotation file, powerEval() enables users to specify number of simulations (nsim), sample sizes (N.reps), sequencing depths (depth_factor), FDR thresholds (thres), and testing methods (Test_method).

Users should always indicate the percentage of the whole genome covered by their dataset, whether it’s complete or partial (bam_factor). Here, bam_factor allows for simulating whole-genome data using a smaller subset of it, under the assumption that DMR signals are relatively evenly distributed across chromosomes.

We demonstrate the usage of powerEval() with the example dataset from R package magpieData:

library(magpieData)
library(magpie)
### Get the example data
BAM_path <- getBAMpath()
### Call powerEval()
power.test <- powerEval(
    Input.file = c("Ctrl1.chr15.input.bam", "Ctrl2.chr15.input.bam", "Case1.chr15.input.bam", "Case2.chr15.input.bam"),
    IP.file = c("Ctrl1.chr15.ip.bam", "Ctrl2.chr15.ip.bam", "Case1.chr15.ip.bam", "Case2.chr15.ip.bam"),
    BamDir = BAM_path,
    annoDir = paste0(BAM_path, "/hg18_chr15.sqlite"),
    variable = rep(c("Ctrl", "Trt"), each = 2),
    bam_factor = 0.03,
    nsim = 10,
    N.reps = c(2, 3, 5, 7),
    depth_factor = c(1, 2),
    thres = c(0.01, 0.05, 0.1),
    Test_method = "TRESS" ## TRESS or exomePeak2
)

To use your own data, replace Input.file and IP.file with the names of your .BAM files, and set BamDir and annoDir to the file paths for your data. Make sure that the order of the variable corresponds to the order of your data files.

Power evaluation with quickPower()

Another power calculation function is quickPower(). Unlike powerEval() which often takes a while to run, quickPower() produces results in seconds, by directly extracting results from three pre-evaluated datasets on GEO:

  • GSE46705: Human HeLa cell line: Two replicates of wild type (WT) and two replicates of knockdown (KD) of complex METTL3.

  • GSE55575: Mouse embryonic fibroblasts: Two replicates of wild type (WT) and four replicates of knockdown (KD) of WTAP.

  • GSE94613: Human leukemia cell line: Four replicates of wild type (WT) and eight replicates of knockdown (KD) of complex METTL3.

Here, we use quickPower() to get power evaluation results of GSE46705, tested by TRESS:

library(magpie)
power.test <- quickPower(dataset = "GSE46705")

Results preservation and visualization

Magpie computes power evaluation metrics for each experimental scenario defined in the function arguments, the results it generates include:

  • FDR: The ratio of number of false positives to the number of positive discoveries.

  • FDC: The ratio of number of false positives to the number of true positives.

  • Power: Statistical power.

  • Precision: The ratio of number of true positives to the number of positive discoveries.

Once users have obtained a list of power measurements using either powerEval() or quickPower(), they can use functions in magpie to save the results as a formatted .xlsx file and generate basic line plots.

Save results to .xlsx files

magpie provides two functions: writeToxlsx() and writeToxlsx_strata(), for saving results to a formatted .xlsx file. writeToxlsx() allows you to save power measurements for different sample sizes, FDR thresholds, and sequencing depths, while writeToxlsx_strata() writes out the results stratified by input expression level and for different sample sizes. Note that we only evaluate power under the original sequencing depth (depth_factor = 1) and FDR threshold of 0.05 (thres = 1).

Here we save the output from quickPower() to corresponding .xlsx files:

### write out .xlsx
writeToxlsx(power.test, file = "test_TRESS.xlsx")

### write out stratified results
writeToxlsx_strata(power.test, file = "test_strata_TRESS.xlsx")

The generated .xlsx files are formatted as follows:

FDR N.rep 0.01 0.05 0.1
2 0.36 0.48 0.57
3 0.14 0.27 0.38
5 0.06 0.13 0.21
7 0.04 0.11 0.17
FDR N.rep (0, 27.68] (27.68, 54.3] (54.3, 92.64] (92.64, Inf]
2 0.41 0.46 0.47 0.58
3 0.23 0.30 0.27 0.28
5 0.15 0.16 0.11 0.12
7 0.12 0.14 0.08 0.11

Generate figures

As mentioned before, magpie also provides four plotting functions: plotRes(), plotAll(), plotStrata(), and plotAll_Strata(), for figure generating. In general, plotRes() and plotStrata() produce individual plots, while plotAll() and `plotAll_Strata() produce four plots in a 2 x 2 panel.

Again, we demonstrate these four functions with the previous output from `quickPower()``:

### plot FDR under sequencing depth 1x
plotRes(power.test, depth_factor = 1, value_option = "FDR")

### plot all in a panel under sequencing depth 1x
plotAll(power.test, depth_factor = 1)

### plot a FDR strata result
plotStrata(power.test, value_option = "FDR")

### plot all strata results in a panel
plotAll_Strata(power.test)

Session info

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magpie_1.7.0     BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##   [1] viridisLite_0.4.2           blob_1.2.4                 
##   [3] Biostrings_2.75.3           bitops_1.0-9               
##   [5] fastmap_1.2.0               RCurl_1.98-1.16            
##   [7] GenomicAlignments_1.43.0    XML_3.99-0.17              
##   [9] digest_0.6.37               mime_0.12                  
##  [11] lifecycle_1.0.4             KEGGREST_1.47.0            
##  [13] RSQLite_2.3.9               magrittr_2.0.3             
##  [15] compiler_4.4.2              rlang_1.1.4                
##  [17] sass_0.4.9                  tools_4.4.2                
##  [19] yaml_2.3.10                 rtracklayer_1.67.0         
##  [21] knitr_1.49                  S4Arrays_1.7.1             
##  [23] bit_4.5.0.1                 curl_6.0.1                 
##  [25] DelayedArray_0.33.3         xml2_1.3.6                 
##  [27] plyr_1.8.9                  RColorBrewer_1.1-3         
##  [29] abind_1.4-8                 BiocParallel_1.41.0        
##  [31] BiocGenerics_0.53.3         sys_3.4.3                  
##  [33] aod_1.3.3                   grid_4.4.2                 
##  [35] stats4_4.4.2                TRESS_1.13.0               
##  [37] colorspace_2.1-1            ggplot2_3.5.1              
##  [39] scales_1.3.0                SummarizedExperiment_1.37.0
##  [41] cli_3.6.3                   rmarkdown_2.29             
##  [43] crayon_1.5.3                generics_0.1.3             
##  [45] rstudioapi_0.17.1           reshape2_1.4.4             
##  [47] httr_1.4.7                  rjson_0.2.23               
##  [49] DBI_1.2.3                   cachem_1.1.0               
##  [51] stringr_1.5.1               zlibbioc_1.52.0            
##  [53] parallel_4.4.2              AnnotationDbi_1.69.0       
##  [55] BiocManager_1.30.25         XVector_0.47.0             
##  [57] restfulr_0.0.15             matrixStats_1.4.1          
##  [59] vctrs_0.6.5                 Matrix_1.7-1               
##  [61] jsonlite_1.8.9              IRanges_2.41.2             
##  [63] S4Vectors_0.45.2            bit64_4.5.2                
##  [65] systemfonts_1.1.0           GenomicFeatures_1.59.1     
##  [67] maketools_1.3.1             locfit_1.5-9.10            
##  [69] jquerylib_0.1.4             glue_1.8.0                 
##  [71] codetools_0.2-20            stringi_1.8.4              
##  [73] gtable_0.3.6                GenomeInfoDb_1.43.2        
##  [75] GenomicRanges_1.59.1        BiocIO_1.17.1              
##  [77] UCSC.utils_1.3.0            munsell_0.5.1              
##  [79] tibble_3.2.1                pillar_1.10.0              
##  [81] htmltools_0.5.8.1           GenomeInfoDbData_1.2.13    
##  [83] R6_2.5.1                    kableExtra_1.4.0           
##  [85] evaluate_1.0.1              lattice_0.22-6             
##  [87] Biobase_2.67.0              png_0.1-8                  
##  [89] Rsamtools_2.23.1            openxlsx_4.2.7.1           
##  [91] memoise_2.0.1               bslib_0.8.0                
##  [93] zip_2.3.1                   Rcpp_1.0.13-1              
##  [95] svglite_2.1.3               SparseArray_1.7.2          
##  [97] DESeq2_1.47.1               xfun_0.49                  
##  [99] MatrixGenerics_1.19.0       buildtools_1.0.0           
## [101] pkgconfig_2.0.3