Borealis outlier methylation detection

Introduction

Borealis is a package for the detection of outlier methylation at single CpG-site resolution, where a cohort of 3 to 100+ samples can be processed and each sample is analyzed versus the rest of the cohort to identify outlier hypermethylated or hypomethylated CpG sites. This form of one vs many analysis differs from traditional case vs control group analyses and has been successful in domains such as rare genetic disease driver identification. Furthermore, the ability of Borealis to identify single CpG-site differences offers a higher resolution view of methylation, since increasing numbers of studies demonstrate single-site methylation aberrations in disease.

This vignette provides an introduction to some basic and advanced operations with Borealis using a region of chromosome 14 in a cohort of 20 individuals being investigated for causes of rare genetic disease. In real-world use, Borealis expects outputs from the aligner Bismark which is available on Github.

After completing this vignette users should be able to conduct in-depth methylation analysis and discover biological signals in their data. After learning basic functionality, creating summary metrics and looking for outlier samples, users will optionally learn how to annotate CpGs with epigenetic feature context using annotatr. Finally users will learn how to summarize CpG data across epigenetic features.

Citation

If you use borealis in your research, please cite Oliver et al. (2022).

Installation

The release version of borealis is available via Bioconductor and can be installed as follows:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("borealis")

The development version of borealis can be obtained via the Github repository.

It is easiest to install development versions with the CRANpkg(“devtools”) package as follows:

devtools::install_github('GarrettJenkinson/borealis')

Changelogs for development releases will be detailed on GitHub releases.

Running Borealis

Now let’s load the test data included with Borealis. This represents a specific region from chromosome 14 (hg19) for 20 individuals with undiagnosed rare disease.

The entire Borealis pipeline can be run with the single runBorealis command.

library("borealis")

# Set data locations
outdir <- tempdir()
extdata <- system.file("extdata","bismark", package="borealis")

# Run borealis
results <- runBorealis(extdata,nThreads=2, chrs="chr14", suffix=".gz",
                        outprefix=file.path(outdir,"vignette_borealis_"), 
                        modelOutPrefix=file.path(outdir,"vignette_CpG_model"))

Now let’s quickly make sure we generated one output per sample:

# Read in the name of all files for analysis
borealis_in <- dir(outdir,pattern="*DMLs.tsv")
length(borealis_in)
## [1] 20

Looks good? Let’s continue!

Basic post-processing and analysis of Borealis results

Read in entire cohort’s results

Now you should have successfully loaded the provided methylation data, run Borealis and created a list of its output files.

Let’s now have a look at the data and generate some summary metrics.

First we’ll read in the data for each of the 20 patients and create GRanges for each:

# Read in list of Borealis output files and create a dataframe for each
for (file in borealis_in) {
    name <- sub("_DMLs.tsv", "", file)
    assign(name,GenomicRanges::makeGRangesFromDataFrame(
                read.table(file.path(outdir,file), header=TRUE,
                stringsAsFactors=FALSE), start.field="pos", end.field="pos.1", 
                seqnames.field="chr", keep.extra.columns=TRUE))
}

# Create a list of the newly created dataframes
list_object_names <- ls(pattern="borealis_patient")
listGRanges <- lapply(list_object_names, get)

Let’s check we have 20 new GRange objects and confirm the naming convention:

length(list_object_names)
## [1] 20
list_object_names[1]
## [1] "vignette_borealis_patient_70_chr14"

Looks good - now let’s have a look at one row to get familiar with the fields of data:

listGRanges[[1]][1,]
## GRanges object with 1 range and 7 metadata columns:
##       seqnames    ranges strand |         x         n        mu       theta
##          <Rle> <IRanges>  <Rle> | <integer> <integer> <numeric>   <numeric>
##   [1]    chr14  24793719      * |        12        12  0.578665 2.49863e-16
##             pVal    isHypo   effSize
##        <numeric> <logical> <numeric>
##   [1] 0.00281942     FALSE  0.421335
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

To explain the data you see in brief:

  • x = number of methylated CpGs at a given position
  • n = the toal number of reads at that position
  • mu = the mean fraction of reads methylated at this position across all samples
  • theta = the disperson of the methylation values at this position
  • pVal= probability of deviation from normal methylation at this position
  • isHypo = if the site is hypomethylated at this position
    • Yes = the site is defined as hypomethylated
    • No = the site is defined as hypermethylated
    • NA = the change is not significant enough to call hyper/hypo
  • effSize = the effect size i.e. how large a deviation from normal methylation

Okay - now let’s start summarizing some of this information across samples.

Generating summary metrics across all samples

First we will add two new columns to each GRange. One column will be the sample ID for each patient, to enable us to distinguish between results later. The second will contain p-values adjusted for multiple comparisons at each CpG site per-patient.

We decouple the generation of adjusted p-values from the main package since users may wish to use uncorrected p-values or generate adjusted p-values in a specific fashion depending on their analysis.

# Add sample ID and a corrected p-value to each and output as new files (.padj)
for (i in seq_along(listGRanges)) {
    sample=sub("_chr.*", "", list_object_names[i])
    listGRanges[[i]]$sampleID <- sample
    listGRanges[[i]]$pAdj <- p.adjust( listGRanges[[i]]$pVal, method="BH")
}

Let’s have a look at the first entry to see those newly added columns:

listGRanges[[1]][1,]
## GRanges object with 1 range and 9 metadata columns:
##       seqnames    ranges strand |         x         n        mu       theta
##          <Rle> <IRanges>  <Rle> | <integer> <integer> <numeric>   <numeric>
##   [1]    chr14  24793719      * |        12        12  0.578665 2.49863e-16
##             pVal    isHypo   effSize               sampleID      pAdj
##        <numeric> <logical> <numeric>            <character> <numeric>
##   [1] 0.00281942     FALSE  0.421335 vignette_borealis_pa..  0.552606
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Looking good. We now have adjusted p-values and sample identifiers.

Now we will create a single dataframe containing data for all samples and generate our summary metrics.

# Create a single GRanges obect with data for all samples and a dataframe for summaries
combined_files <- Reduce(c,listGRanges)
combined_files_df<-data.frame(combined_files)

How many CpG sites worth of data do we have across all samples combined?

# How many rows of data in combined GRange object?
length(combined_files)
## [1] 3709

Lets create a table to summarize useful metrics about the CpG sites

# Create table of unique positions and mu/theta values
mu_theta <- unique(subset(combined_files_df, 
                    select=-c(x,n,pVal, isHypo, pAdj, effSize, sampleID)))

How many unique samples and unique CpG sites are we analyzing data from?

#Number of unique samples
length(unique(combined_files_df$sampleID))
## [1] 20
#Number of unique CpG sites
nrow(unique(mu_theta))
## [1] 241

Distribution of the mean methylation values and variability per CpG Site

#generate summaries for mu and theta
summary(mu_theta$mu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.028   0.068   0.122   0.332   0.780   0.921
summary(mu_theta$theta)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0144  0.0243  0.1417

Now let’s look at the distribitions of other important values.

# Create table of unique positions and depth/p-val/padj for each position in 
# each case
depth_pvals_eff <- unique(combined_files_df)

Summary of read depth distributions

#Summarize read depths
summary(depth_pvals_eff$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    15.0    22.0    24.6    33.0    71.0

Summary of uncorrected p-values

#Summarize pvals
summary(depth_pvals_eff$pVal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.001   0.366   0.608   0.602   0.874   1.000

Summary of corrected p-values

#Summarize corrected pvals
summary(depth_pvals_eff$pAdj)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.019   1.000   1.000   0.975   1.000   1.000

Summary of methylation fraction across sites

#Summarize fraction of methylation
summary(depth_pvals_eff$x/depth_pvals_eff$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.043   0.264   0.560   1.000

Summary of effect sizes

#Summarize effect size
summary(depth_pvals_eff$effSize)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.483  -0.069  -0.040  -0.017   0.021   0.595

Outliers and most significant CpG site

Now lets see if any samples are extreme outliers in terms of number of CpG sites called significant (p <= 0.05):

# Detection of outliers based on number of significant sites
# Count significant CpG sites per patient
signif_only <- subset(combined_files_df, pVal <= 0.05)
signif_counts <- dplyr::count(as.data.frame(signif_only),sampleID)

# Calculate the percentiles of the number of significant sites
sig_quantiles <- quantile(signif_counts$n,
                            probs=c(0.025, 0.05, 0.95, 0.975, 0.999))

# Check if nay patients are above the 99.9th percentile
subset(signif_counts, n >= sig_quantiles["99.9%"])
##                       sampleID  n
## 4 vignette_borealis_patient_72 51

We can see that one patient appears to be an extreme outlier.

Let’s also have a look at which site is most significant.

# What is the most significant CpG site between all samples?
subset(combined_files_df, pVal == min(combined_files$pVal))
##     seqnames    start      end width strand  x  n    mu  theta     pVal isHypo
## 365    chr14 24780569 24780569     1      * 17 28 0.117 0.0567 0.000592  FALSE
##     effSize                     sampleID  pAdj
## 365    0.49 vignette_borealis_patient_72 0.019

Depending on your use-case, this might be enough information for you and you could stop here, or you could extract the top 100 outlier sites.

Another useful step is contextualizing CpG sites on the basis of which epigenetic features they reside in. We provide an example of how to do so in the following section. We will do this for the single sample containing the most significant CpG site for the sake of brevity in the vignette.

Annotating outputs with epigenetic features

In this section we will add further biological context to our results by annotating epigenetically relevant features. We will use the annotatr package but you can use your favorite tool instead. The release version of annotatr is available via Bioconductor and can be installed as follows:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("annotatr")

Now let’s load the package and read in the data we created in the previous steps for the patient with the most significant CpG site.

We will annotate with annotations from annotatr.

Let’s will start by defining the annotations we want to utilize.

#Assign approproate genome version based on alignments
genome.version <- "hg19"

# Select annnotation classes we want from annotatr (can be user-customized)
myAnnots <- c('_genes_promoters', '_genes_5UTRs', '_genes_exons',
    '_genes_3UTRs','_cpg_islands')

Now let’s annotate for our patient of interest:

#Read in patient 72 Grange data for annotation
dmrs.gr<-subset(combined_files, 
    sampleID == "vignette_borealis_patient_72")


# Annotate using annotatr
myAnnots <- paste(genome.version,myAnnots,sep="")
annots.all.gr <- annotatr::build_annotations(genome = genome.version,
    annotations = myAnnots)
allAnnot <- annotatr::annotate_regions(regions=dmrs.gr,
    annotations=annots.all.gr, ignore.strand=TRUE, minoverlap=0)

Now let’s look at that most significant site again - with annotations.

# Extract the annotated site with the smallest p-value
subset(allAnnot, pVal == min(allAnnot$pVal))$annot
## GRanges object with 9 ranges and 5 metadata columns:
##       seqnames            ranges strand |             id       tx_id
##          <Rle>         <IRanges>  <Rle> |    <character> <character>
##   [1]    chr14 24779705-24780704      + | promoter:51107  uc001wos.3
##   [2]    chr14 24779855-24780576      - |     5UTR:72873  uc001woo.3
##   [3]    chr14 24779855-24780576      - |     5UTR:72876  uc001wop.3
##   [4]    chr14 24779861-24781259      + |    exon:468887  uc001wor.3
##   [5]    chr14 24779805-24781259      + |    exon:468889  uc010alo.3
##   [6]    chr14 24779871-24780947      + |    exon:468890  uc021rrp.1
##   [7]    chr14 24779855-24780576      - |    exon:481685  uc001woo.3
##   [8]    chr14 24779855-24780576      - |    exon:481692  uc001wop.3
##   [9]    chr14 24779875-24780932      * |   island:17101        <NA>
##           gene_id      symbol                 type
##       <character> <character>          <character>
##   [1]        1241       LTB4R hg19_genes_promoters
##   [2]       27141       CIDEB     hg19_genes_5UTRs
##   [3]       27141       CIDEB     hg19_genes_5UTRs
##   [4]       56413      LTB4R2     hg19_genes_exons
##   [5]       56413      LTB4R2     hg19_genes_exons
##   [6]        <NA>        <NA>     hg19_genes_exons
##   [7]       27141       CIDEB     hg19_genes_exons
##   [8]       27141       CIDEB     hg19_genes_exons
##   [9]        <NA>        <NA>     hg19_cpg_islands
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

This CpG site overlaps multiple genes and features in Patient 72 but perhaps most interesting is the LTB4R promoter, since promoters are so strongly linked to control of gene expression.

Let’s use a handy Borealis plotting function to investigate this site further.

# Use Borealis plotting function to investigate this site further
plotCpGsite("chr14:24780288", sampleOfInterest="patient_72",
            modelFile=file.path(outdir,"vignette_CpG_model_chr14.csv"),
            methCountFile=file.path(outdir,
                                "vignette_CpG_model_rawMethCount_chr14.tsv"),
            totalCountFile=file.path(outdir,
                                "vignette_CpG_model_rawTotalCount_chr14.tsv"))
## $`chr14:24780288`
plotCpGsite function demo

plotCpGsite function demo

The graph above shows us the expected methylation profile under our model, and how far our site deviates from the expected methylation profile. As you can see this site stands out. It is hypermethylated.

Now what if we wanted to know if this site is surrounded by other hypermethylated CpG sites? We could do this using the data we have already generated, but an advanced approach that will be useful for further analysis is to summarize all our data across epigenetic features.

That’s what we will do next.

Summarizing single-site data across epigenetic features

First lets set a p-value we will use to determine significance in our feature-summarized data. In our summarization we will make use of corrected p-values and ignore uncorrected p-values. Your analysis can optionally use an alternative approach.

padjThresh <- 0.05

Okay - now let’s do the summarization. You can use this code as-is or easily customize it to create summary metrics specific to your application.

# Calculate how may CpGs per annotatr feature and store in dataframe
allAnnot <- as.data.frame(allAnnot)
featureids <- allAnnot$annot.id
featurecnts <- as.data.frame(table(featureids))
colnames(featurecnts) <- c("annot.id", "NoSites")

Since the inputs are individual CpGs we are just creating a count of how many individual sites exist in every unique feature. Let’s look:

head(featurecnts)
##     annot.id NoSites
## 1 3UTR:40890       4
## 2 3UTR:40894       2
## 3 5UTR:70956       7
## 4 5UTR:70963       7
## 5 5UTR:72870      10
## 6 5UTR:72873      36

Okay, now we want to figure out how many of the CpGs in each feature have a significant corrected p-value, meaning they appear abnormal. We will subset the sample data on the basic of the p-value and create counts,

# Calculate how many sites per feature pass p-value threshold
# Add data to new summary dataframe
signifonly <- subset(allAnnot, pAdj<=padjThresh)
signifonly <- signifonly$annot.id
signifonlycnt <- as.data.frame(table(signifonly))
colnames(signifonlycnt) <- c("annot.id", "signifCount")
featurecnts <- merge(featurecnts, signifonlycnt, by.x="annot.id", 
                        by.y="annot.id", all.x=TRUE)

Now for convenience let’s figure out what fraction of sites in each feature are significant.

# What fraction of sites per feature pass p-value threshold?
featurecnts$fractionSignif <- featurecnts$signifCount/featurecnts$NoSites

And finally lets merge the new and orignal data for output:

# Let's combine the data for final output
locations <- subset(allAnnot, select=c("annot.id", "annot.seqnames",
                                        "annot.start", "annot.end"))
featurecnts <- merge(unique(locations), featurecnts, by="annot.id")
genemap <- unique(cbind(allAnnot$annot.symbol, allAnnot$annot.id, 
                        allAnnot$annot.tx_id,allAnnot$annot.width, 
                        allAnnot$sampleID))
colnames(genemap) <- c("annot.symbol", "annot.id", "annot.tx_id", "annot.width",
                        "SampleID")
    
summarized <- merge(featurecnts, genemap, by="annot.id")
summarized$signifCount[is.na(summarized$signifCount)] <- 0
summarized$fractionSignif[is.na(summarized$fractionSignif)] <- 0

Now let’s have a look at that promoter region!

# Select the LTB4R promoter region
subset(summarized, select=c(annot.symbol, NoSites, signifCount, fractionSignif),
        (annot.symbol=="LTB4R" & grepl("promoter", annot.id)))
##    annot.symbol NoSites signifCount fractionSignif
## 85        LTB4R      45          37          0.822

You can see that the outlier site from earlier isn’t alone - multiple other sites are affected in the same region. In fact, in that LTB4R promoter 78.7% of the CpG sites we have data for are called significant by Borealis! That’s 37 CpG sites and all are called hypermethylated. This could be a very relevant finding. Maybe you can now used the code provided to look at the data in different ways. What other genes overlap this region? What features are affected for each?

Thank you for trying Borealis and for working through this vignette. We hope it was helpful. This is as far as we will go but hopefully you have all you need to expand the analysis and apply it to your own data!

Good luck!

Session info

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.20.0                    
##  [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [3] GenomicFeatures_1.57.1                 
##  [4] AnnotationDbi_1.69.0                   
##  [5] GenomicRanges_1.57.2                   
##  [6] GenomeInfoDb_1.41.2                    
##  [7] IRanges_2.39.2                         
##  [8] S4Vectors_0.43.2                       
##  [9] borealis_1.11.0                        
## [10] Biobase_2.65.1                         
## [11] BiocGenerics_0.51.3                    
## [12] BiocStyle_2.33.1                       
## 
## loaded via a namespace (and not attached):
##   [1] sys_3.4.3                   jsonlite_1.8.9             
##   [3] magrittr_2.0.3              farver_2.1.2               
##   [5] rmarkdown_2.28              BiocIO_1.15.2              
##   [7] zlibbioc_1.51.2             vctrs_0.6.5                
##   [9] memoise_2.0.1               Rsamtools_2.21.2           
##  [11] DelayedMatrixStats_1.27.3   RCurl_1.98-1.16            
##  [13] htmltools_0.5.8.1           S4Arrays_1.5.11            
##  [15] AnnotationHub_3.15.0        curl_5.2.3                 
##  [17] Rhdf5lib_1.27.0             SparseArray_1.5.45         
##  [19] rhdf5_2.49.0                sass_0.4.9                 
##  [21] bslib_0.8.0                 bsseq_1.41.1               
##  [23] plyr_1.8.9                  cachem_1.1.0               
##  [25] buildtools_1.0.0            GenomicAlignments_1.41.0   
##  [27] mime_0.12                   lifecycle_1.0.4            
##  [29] iterators_1.0.14            pkgconfig_2.0.3            
##  [31] Matrix_1.7-1                R6_2.5.1                   
##  [33] fastmap_1.2.0               GenomeInfoDbData_1.2.13    
##  [35] MatrixGenerics_1.17.1       digest_0.6.37              
##  [37] colorspace_2.1-1            regioneR_1.37.0            
##  [39] RSQLite_2.3.7               labeling_0.4.3             
##  [41] filelock_1.0.3              fansi_1.0.6                
##  [43] httr_1.4.7                  abind_1.4-8                
##  [45] compiler_4.4.1              withr_3.0.2                
##  [47] bit64_4.5.2                 doParallel_1.0.17          
##  [49] BiocParallel_1.39.0         DBI_1.2.3                  
##  [51] highr_0.11                  HDF5Array_1.33.8           
##  [53] R.utils_2.12.3              MASS_7.3-61                
##  [55] rappdirs_0.3.3              DelayedArray_0.31.14       
##  [57] rjson_0.2.23                gtools_3.9.5               
##  [59] permute_0.9-7               tools_4.4.1                
##  [61] DSS_2.53.0                  R.oo_1.26.0                
##  [63] glue_1.8.0                  restfulr_0.0.15            
##  [65] nlme_3.1-166                rhdf5filters_1.17.0        
##  [67] grid_4.4.1                  reshape2_1.4.4             
##  [69] generics_0.1.3              snow_0.4-4                 
##  [71] gtable_0.3.6                BSgenome_1.73.1            
##  [73] tzdb_0.4.0                  R.methodsS3_1.8.2          
##  [75] data.table_1.16.2           hms_1.1.3                  
##  [77] utf8_1.2.4                  XVector_0.45.0             
##  [79] stringr_1.5.1               BiocVersion_3.20.0         
##  [81] foreach_1.5.2               pillar_1.9.0               
##  [83] limma_3.61.12               splines_4.4.1              
##  [85] dplyr_1.1.4                 BiocFileCache_2.13.2       
##  [87] lattice_0.22-6              survival_3.7-0             
##  [89] rtracklayer_1.65.0          bit_4.5.0                  
##  [91] gamlss.data_6.0-6           tidyselect_1.2.1           
##  [93] locfit_1.5-9.10             maketools_1.3.1            
##  [95] Biostrings_2.73.2           knitr_1.48                 
##  [97] SummarizedExperiment_1.35.5 xfun_0.48                  
##  [99] statmod_1.5.0               annotatr_1.33.0            
## [101] matrixStats_1.4.1           stringi_1.8.4              
## [103] UCSC.utils_1.1.0            yaml_2.3.10                
## [105] evaluate_1.0.1              codetools_0.2-20           
## [107] tibble_3.2.1                BiocManager_1.30.25        
## [109] cli_3.6.3                   munsell_0.5.1              
## [111] jquerylib_0.1.4             Rcpp_1.0.13                
## [113] dbplyr_2.5.0                png_0.1-8                  
## [115] XML_3.99-0.17               parallel_4.4.1             
## [117] ggplot2_3.5.1               readr_2.1.5                
## [119] blob_1.2.4                  sparseMatrixStats_1.17.2   
## [121] bitops_1.0-9                gamlss.dist_6.1-1          
## [123] scales_1.3.0                gamlss_5.4-22              
## [125] purrr_1.0.2                 crayon_1.5.3               
## [127] rlang_1.1.4                 cowplot_1.1.3              
## [129] KEGGREST_1.45.1
Oliver, Gavin R., Garrett Jenkinson, Rory J. Olson, Laura E. Shultz-Rogers, and Eric W. Klee. 2022. “Detection of Outlier Methylation from Bisulfite Sequencing Data with Novel Bioconductor Package BOREALIS.” bioRxiv. https://doi.org/10.1101/2022.05.19.492700.