Borealis is a package for the detection of outlier methylation at single CpG-site resolution, where a cohort of 3 to 100+ samples can be processed and each sample is analyzed versus the rest of the cohort to identify outlier hypermethylated or hypomethylated CpG sites. This form of one vs many analysis differs from traditional case vs control group analyses and has been successful in domains such as rare genetic disease driver identification. Furthermore, the ability of Borealis to identify single CpG-site differences offers a higher resolution view of methylation, since increasing numbers of studies demonstrate single-site methylation aberrations in disease.
This vignette provides an introduction to some basic and advanced operations with Borealis using a region of chromosome 14 in a cohort of 20 individuals being investigated for causes of rare genetic disease. In real-world use, Borealis expects outputs from the aligner Bismark which is available on Github.
After completing this vignette users should be able to conduct in-depth methylation analysis and discover biological signals in their data. After learning basic functionality, creating summary metrics and looking for outlier samples, users will optionally learn how to annotate CpGs with epigenetic feature context using annotatr. Finally users will learn how to summarize CpG data across epigenetic features.
If you use borealis in your research, please cite Oliver et al. (2022).
The release version of borealis is available via Bioconductor and can be installed as follows:
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("borealis")
The development version of borealis can be obtained via the Github repository.
It is easiest to install development versions with the CRANpkg(“devtools”) package as follows:
Changelogs for development releases will be detailed on GitHub releases.
Now let’s load the test data included with Borealis. This represents a specific region from chromosome 14 (hg19) for 20 individuals with undiagnosed rare disease.
The entire Borealis pipeline can be run with the single
runBorealis
command.
library("borealis")
# Set data locations
outdir <- tempdir()
extdata <- system.file("extdata","bismark", package="borealis")
# Run borealis
results <- runBorealis(extdata,nThreads=2, chrs="chr14", suffix=".gz",
outprefix=file.path(outdir,"vignette_borealis_"),
modelOutPrefix=file.path(outdir,"vignette_CpG_model"))
Now let’s quickly make sure we generated one output per sample:
# Read in the name of all files for analysis
borealis_in <- dir(outdir,pattern="*DMLs.tsv")
length(borealis_in)
## [1] 20
Looks good? Let’s continue!
Now you should have successfully loaded the provided methylation data, run Borealis and created a list of its output files.
Let’s now have a look at the data and generate some summary metrics.
First we’ll read in the data for each of the 20 patients and create GRanges for each:
# Read in list of Borealis output files and create a dataframe for each
for (file in borealis_in) {
name <- sub("_DMLs.tsv", "", file)
assign(name,GenomicRanges::makeGRangesFromDataFrame(
read.table(file.path(outdir,file), header=TRUE,
stringsAsFactors=FALSE), start.field="pos", end.field="pos.1",
seqnames.field="chr", keep.extra.columns=TRUE))
}
# Create a list of the newly created dataframes
list_object_names <- ls(pattern="borealis_patient")
listGRanges <- lapply(list_object_names, get)
Let’s check we have 20 new GRange objects and confirm the naming convention:
## [1] 20
## [1] "vignette_borealis_patient_70_chr14"
Looks good - now let’s have a look at one row to get familiar with the fields of data:
## GRanges object with 1 range and 7 metadata columns:
## seqnames ranges strand | x n mu theta
## <Rle> <IRanges> <Rle> | <integer> <integer> <numeric> <numeric>
## [1] chr14 24793719 * | 12 12 0.578665 2.49863e-16
## pVal isHypo effSize
## <numeric> <logical> <numeric>
## [1] 0.00281942 FALSE 0.421335
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
To explain the data you see in brief:
Okay - now let’s start summarizing some of this information across samples.
First we will add two new columns to each GRange. One column will be the sample ID for each patient, to enable us to distinguish between results later. The second will contain p-values adjusted for multiple comparisons at each CpG site per-patient.
We decouple the generation of adjusted p-values from the main package since users may wish to use uncorrected p-values or generate adjusted p-values in a specific fashion depending on their analysis.
# Add sample ID and a corrected p-value to each and output as new files (.padj)
for (i in seq_along(listGRanges)) {
sample=sub("_chr.*", "", list_object_names[i])
listGRanges[[i]]$sampleID <- sample
listGRanges[[i]]$pAdj <- p.adjust( listGRanges[[i]]$pVal, method="BH")
}
Let’s have a look at the first entry to see those newly added columns:
## GRanges object with 1 range and 9 metadata columns:
## seqnames ranges strand | x n mu theta
## <Rle> <IRanges> <Rle> | <integer> <integer> <numeric> <numeric>
## [1] chr14 24793719 * | 12 12 0.578665 2.49863e-16
## pVal isHypo effSize sampleID pAdj
## <numeric> <logical> <numeric> <character> <numeric>
## [1] 0.00281942 FALSE 0.421335 vignette_borealis_pa.. 0.552606
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Looking good. We now have adjusted p-values and sample identifiers.
Now we will create a single dataframe containing data for all samples and generate our summary metrics.
# Create a single GRanges obect with data for all samples and a dataframe for summaries
combined_files <- Reduce(c,listGRanges)
combined_files_df<-data.frame(combined_files)
## [1] 3709
Lets create a table to summarize useful metrics about the CpG sites
## [1] 20
## [1] 241
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.028 0.068 0.122 0.332 0.780 0.921
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0144 0.0243 0.1417
Now let’s look at the distribitions of other important values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 15.0 22.0 24.6 33.0 71.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001 0.366 0.608 0.602 0.874 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.019 1.000 1.000 0.975 1.000 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.043 0.264 0.560 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.483 -0.069 -0.040 -0.017 0.021 0.595
Now lets see if any samples are extreme outliers in terms of number of CpG sites called significant (p <= 0.05):
# Detection of outliers based on number of significant sites
# Count significant CpG sites per patient
signif_only <- subset(combined_files_df, pVal <= 0.05)
signif_counts <- dplyr::count(as.data.frame(signif_only),sampleID)
# Calculate the percentiles of the number of significant sites
sig_quantiles <- quantile(signif_counts$n,
probs=c(0.025, 0.05, 0.95, 0.975, 0.999))
# Check if nay patients are above the 99.9th percentile
subset(signif_counts, n >= sig_quantiles["99.9%"])
## sampleID n
## 4 vignette_borealis_patient_72 51
We can see that one patient appears to be an extreme outlier.
Let’s also have a look at which site is most significant.
# What is the most significant CpG site between all samples?
subset(combined_files_df, pVal == min(combined_files$pVal))
## seqnames start end width strand x n mu theta pVal isHypo
## 365 chr14 24780569 24780569 1 * 17 28 0.117 0.0567 0.000592 FALSE
## effSize sampleID pAdj
## 365 0.49 vignette_borealis_patient_72 0.019
Depending on your use-case, this might be enough information for you and you could stop here, or you could extract the top 100 outlier sites.
Another useful step is contextualizing CpG sites on the basis of which epigenetic features they reside in. We provide an example of how to do so in the following section. We will do this for the single sample containing the most significant CpG site for the sake of brevity in the vignette.
In this section we will add further biological context to our results by annotating epigenetically relevant features. We will use the annotatr package but you can use your favorite tool instead. The release version of annotatr is available via Bioconductor and can be installed as follows:
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("annotatr")
Now let’s load the package and read in the data we created in the previous steps for the patient with the most significant CpG site.
We will annotate with annotations from annotatr.
Let’s will start by defining the annotations we want to utilize.
#Assign approproate genome version based on alignments
genome.version <- "hg19"
# Select annnotation classes we want from annotatr (can be user-customized)
myAnnots <- c('_genes_promoters', '_genes_5UTRs', '_genes_exons',
'_genes_3UTRs','_cpg_islands')
Now let’s annotate for our patient of interest:
#Read in patient 72 Grange data for annotation
dmrs.gr<-subset(combined_files,
sampleID == "vignette_borealis_patient_72")
# Annotate using annotatr
myAnnots <- paste(genome.version,myAnnots,sep="")
annots.all.gr <- annotatr::build_annotations(genome = genome.version,
annotations = myAnnots)
allAnnot <- annotatr::annotate_regions(regions=dmrs.gr,
annotations=annots.all.gr, ignore.strand=TRUE, minoverlap=0)
Now let’s look at that most significant site again - with annotations.
# Extract the annotated site with the smallest p-value
subset(allAnnot, pVal == min(allAnnot$pVal))$annot
## GRanges object with 9 ranges and 5 metadata columns:
## seqnames ranges strand | id tx_id
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr14 24779705-24780704 + | promoter:51107 uc001wos.3
## [2] chr14 24779855-24780576 - | 5UTR:72873 uc001woo.3
## [3] chr14 24779855-24780576 - | 5UTR:72876 uc001wop.3
## [4] chr14 24779861-24781259 + | exon:468887 uc001wor.3
## [5] chr14 24779805-24781259 + | exon:468889 uc010alo.3
## [6] chr14 24779871-24780947 + | exon:468890 uc021rrp.1
## [7] chr14 24779855-24780576 - | exon:481685 uc001woo.3
## [8] chr14 24779855-24780576 - | exon:481692 uc001wop.3
## [9] chr14 24779875-24780932 * | island:17101 <NA>
## gene_id symbol type
## <character> <character> <character>
## [1] 1241 LTB4R hg19_genes_promoters
## [2] 27141 CIDEB hg19_genes_5UTRs
## [3] 27141 CIDEB hg19_genes_5UTRs
## [4] 56413 LTB4R2 hg19_genes_exons
## [5] 56413 LTB4R2 hg19_genes_exons
## [6] <NA> <NA> hg19_genes_exons
## [7] 27141 CIDEB hg19_genes_exons
## [8] 27141 CIDEB hg19_genes_exons
## [9] <NA> <NA> hg19_cpg_islands
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
This CpG site overlaps multiple genes and features in Patient 72 but perhaps most interesting is the LTB4R promoter, since promoters are so strongly linked to control of gene expression.
Let’s use a handy Borealis plotting function to investigate this site further.
# Use Borealis plotting function to investigate this site further
plotCpGsite("chr14:24780288", sampleOfInterest="patient_72",
modelFile=file.path(outdir,"vignette_CpG_model_chr14.csv"),
methCountFile=file.path(outdir,
"vignette_CpG_model_rawMethCount_chr14.tsv"),
totalCountFile=file.path(outdir,
"vignette_CpG_model_rawTotalCount_chr14.tsv"))
## $`chr14:24780288`
The graph above shows us the expected methylation profile under our model, and how far our site deviates from the expected methylation profile. As you can see this site stands out. It is hypermethylated.
Now what if we wanted to know if this site is surrounded by other hypermethylated CpG sites? We could do this using the data we have already generated, but an advanced approach that will be useful for further analysis is to summarize all our data across epigenetic features.
That’s what we will do next.
First lets set a p-value we will use to determine significance in our feature-summarized data. In our summarization we will make use of corrected p-values and ignore uncorrected p-values. Your analysis can optionally use an alternative approach.
Okay - now let’s do the summarization. You can use this code as-is or easily customize it to create summary metrics specific to your application.
# Calculate how may CpGs per annotatr feature and store in dataframe
allAnnot <- as.data.frame(allAnnot)
featureids <- allAnnot$annot.id
featurecnts <- as.data.frame(table(featureids))
colnames(featurecnts) <- c("annot.id", "NoSites")
Since the inputs are individual CpGs we are just creating a count of how many individual sites exist in every unique feature. Let’s look:
## annot.id NoSites
## 1 3UTR:40890 4
## 2 3UTR:40894 2
## 3 5UTR:70956 7
## 4 5UTR:70963 7
## 5 5UTR:72870 10
## 6 5UTR:72873 36
Okay, now we want to figure out how many of the CpGs in each feature have a significant corrected p-value, meaning they appear abnormal. We will subset the sample data on the basic of the p-value and create counts,
# Calculate how many sites per feature pass p-value threshold
# Add data to new summary dataframe
signifonly <- subset(allAnnot, pAdj<=padjThresh)
signifonly <- signifonly$annot.id
signifonlycnt <- as.data.frame(table(signifonly))
colnames(signifonlycnt) <- c("annot.id", "signifCount")
featurecnts <- merge(featurecnts, signifonlycnt, by.x="annot.id",
by.y="annot.id", all.x=TRUE)
Now for convenience let’s figure out what fraction of sites in each feature are significant.
# What fraction of sites per feature pass p-value threshold?
featurecnts$fractionSignif <- featurecnts$signifCount/featurecnts$NoSites
And finally lets merge the new and orignal data for output:
# Let's combine the data for final output
locations <- subset(allAnnot, select=c("annot.id", "annot.seqnames",
"annot.start", "annot.end"))
featurecnts <- merge(unique(locations), featurecnts, by="annot.id")
genemap <- unique(cbind(allAnnot$annot.symbol, allAnnot$annot.id,
allAnnot$annot.tx_id,allAnnot$annot.width,
allAnnot$sampleID))
colnames(genemap) <- c("annot.symbol", "annot.id", "annot.tx_id", "annot.width",
"SampleID")
summarized <- merge(featurecnts, genemap, by="annot.id")
summarized$signifCount[is.na(summarized$signifCount)] <- 0
summarized$fractionSignif[is.na(summarized$fractionSignif)] <- 0
Now let’s have a look at that promoter region!
# Select the LTB4R promoter region
subset(summarized, select=c(annot.symbol, NoSites, signifCount, fractionSignif),
(annot.symbol=="LTB4R" & grepl("promoter", annot.id)))
## annot.symbol NoSites signifCount fractionSignif
## 85 LTB4R 45 37 0.822
You can see that the outlier site from earlier isn’t alone - multiple other sites are affected in the same region. In fact, in that LTB4R promoter 78.7% of the CpG sites we have data for are called significant by Borealis! That’s 37 CpG sites and all are called hypermethylated. This could be a very relevant finding. Maybe you can now used the code provided to look at the data in different ways. What other genes overlap this region? What features are affected for each?
Thank you for trying Borealis and for working through this vignette. We hope it was helpful. This is as far as we will go but hopefully you have all you need to expand the analysis and apply it to your own data!
Good luck!
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.20.0
## [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [3] GenomicFeatures_1.59.1
## [4] AnnotationDbi_1.69.0
## [5] GenomicRanges_1.59.1
## [6] GenomeInfoDb_1.43.2
## [7] IRanges_2.41.1
## [8] S4Vectors_0.45.2
## [9] borealis_1.11.0
## [10] Biobase_2.67.0
## [11] BiocGenerics_0.53.3
## [12] generics_0.1.3
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9
## [3] magrittr_2.0.3 farver_2.1.2
## [5] rmarkdown_2.29 BiocIO_1.17.1
## [7] zlibbioc_1.52.0 vctrs_0.6.5
## [9] memoise_2.0.1 Rsamtools_2.23.1
## [11] DelayedMatrixStats_1.29.0 RCurl_1.98-1.16
## [13] htmltools_0.5.8.1 S4Arrays_1.7.1
## [15] AnnotationHub_3.15.0 curl_6.0.1
## [17] Rhdf5lib_1.29.0 SparseArray_1.7.2
## [19] rhdf5_2.51.0 sass_0.4.9
## [21] bslib_0.8.0 bsseq_1.43.0
## [23] plyr_1.8.9 cachem_1.1.0
## [25] buildtools_1.0.0 GenomicAlignments_1.43.0
## [27] mime_0.12 lifecycle_1.0.4
## [29] iterators_1.0.14 pkgconfig_2.0.3
## [31] Matrix_1.7-1 R6_2.5.1
## [33] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [35] MatrixGenerics_1.19.0 digest_0.6.37
## [37] colorspace_2.1-1 regioneR_1.39.0
## [39] RSQLite_2.3.8 labeling_0.4.3
## [41] filelock_1.0.3 fansi_1.0.6
## [43] httr_1.4.7 abind_1.4-8
## [45] compiler_4.4.2 withr_3.0.2
## [47] bit64_4.5.2 doParallel_1.0.17
## [49] BiocParallel_1.41.0 DBI_1.2.3
## [51] HDF5Array_1.35.1 R.utils_2.12.3
## [53] MASS_7.3-61 rappdirs_0.3.3
## [55] DelayedArray_0.33.2 rjson_0.2.23
## [57] gtools_3.9.5 permute_0.9-7
## [59] tools_4.4.2 DSS_2.55.0
## [61] R.oo_1.27.0 glue_1.8.0
## [63] restfulr_0.0.15 nlme_3.1-166
## [65] rhdf5filters_1.19.0 grid_4.4.2
## [67] reshape2_1.4.4 snow_0.4-4
## [69] gtable_0.3.6 BSgenome_1.75.0
## [71] tzdb_0.4.0 R.methodsS3_1.8.2
## [73] data.table_1.16.2 hms_1.1.3
## [75] utf8_1.2.4 XVector_0.47.0
## [77] BiocVersion_3.21.1 foreach_1.5.2
## [79] pillar_1.9.0 stringr_1.5.1
## [81] limma_3.63.2 splines_4.4.2
## [83] dplyr_1.1.4 BiocFileCache_2.15.0
## [85] lattice_0.22-6 survival_3.7-0
## [87] rtracklayer_1.67.0 bit_4.5.0
## [89] gamlss.data_6.0-6 tidyselect_1.2.1
## [91] locfit_1.5-9.10 maketools_1.3.1
## [93] Biostrings_2.75.1 knitr_1.49
## [95] SummarizedExperiment_1.37.0 xfun_0.49
## [97] statmod_1.5.0 annotatr_1.33.0
## [99] matrixStats_1.4.1 stringi_1.8.4
## [101] UCSC.utils_1.3.0 yaml_2.3.10
## [103] evaluate_1.0.1 codetools_0.2-20
## [105] tibble_3.2.1 BiocManager_1.30.25
## [107] cli_3.6.3 munsell_0.5.1
## [109] jquerylib_0.1.4 Rcpp_1.0.13-1
## [111] dbplyr_2.5.0 png_0.1-8
## [113] XML_3.99-0.17 parallel_4.4.2
## [115] ggplot2_3.5.1 readr_2.1.5
## [117] blob_1.2.4 sparseMatrixStats_1.19.0
## [119] bitops_1.0-9 gamlss.dist_6.1-1
## [121] scales_1.3.0 gamlss_5.4-22
## [123] purrr_1.0.2 crayon_1.5.3
## [125] rlang_1.1.4 cowplot_1.1.3
## [127] KEGGREST_1.47.0