preciseTAD
is an R package designed to transform domain
boundary calling into a supervised machine learning framework.
preciseTAD
offers full functionality in building the
predictor-response data and selecting the best model for precise
prediction of boundary regions. This functionality can be broken into 2
primary usages:
The training/testing data used for modeling is represented as a matrix with rows being genomic regions, columns being genomic annotations, and cells containing measures of association between them. Users have the option to concatenate genomic regions from multiple chromosomes.
To create the row-wise dimension of the data, preciseTAD
uses shifted binning, a strategy for making the dimensions of
the data matrix used for modeling by segmenting the linear genome into
nonoverlapping regions. This step is transparent for the user. To create
shifted bins, chromosome-specific bins start at half of the resolution
r, and continue in congruent intervals of length r
until the end of the chromosome (mod r + r/2), using
hg19 genomic coordinates. The shifted genomic bins, are then defined as
boundary regions (Y = 1) if they contain a called boundary, and
non-boundary regions (Y = 0) otherwise, thus establishing the
binary response vector (Y) used for classification.
Intuitively, shifted bins are centered on borders between the original
bins, thus capturing potential boundaries. We found the shifted binning
strategy improves model performance.
The column-wise dimension is formed by genomic annotations, such as
transcription factor binding sites (TFBSs), histone modification marks,
chromatin states. The (log2)
distances, which enumerate the genomic distance from the center of
each genomic bin to the center of the nearest ChIP-seq peak region of
interest, form the feature space. Other feature type options include
binary overlaps - an indicator for whether a ChIP-seq region
overlaps with genomic bin, count overlaps - the number of
overlaps in each bin, and percent overlaps - the percentage of
overlap between any bin and the total width of all ChIP-seq regions
overlapping it. The customized training/testing data formation offered
by preciseTAD
allows users to implement any binary
classification machine learning algorithm.
preciseTAD
implements a random forest (RF) model,
allowing tuning hyperparameters and applying feature reduction. The
primary inputs are the training and testing data, a list of
hyperparameter values, the number of cross-validation folds to use (if a
grid of hyperparameter values is provided), and the metric used for
optimization. The output includes the model object (necessary for
downstream prediction of boundaries), the variable importance values,
and a list of performance metrics when validating the model on the
testing data (see Table 1). This model is then used to predict
base-level precise boundary locations.
To predict the base-level location of domain boundaries, the
preciseTAD
model is applied for each base annotated with
the aforementioned genomic annotations. The base-level probabilities are
clustered using density-based clustering and scalable data partitioning
techniques, to narrow boundary regions and points. First, the
probability vector, pni,
is extracted (ni representing
the length of chromosome i).
Next, DBSCAN (Density-based Spatial Clustering of Applications
with Noise) (Ester et al. 1996; Hahsler,
Piekenbrock, and Doran 2019) is applied to the matrix of pairwise
genomic distances between bases with pni ≥ t,
where t is a threshold
determined by the user. The resulting clusters of highly predictive
bases identified by DBSCAN are termed preciseTAD boundary
regions (PTBR). To precisely identify a single base among each
PTBR, preciseTAD implements partitioning around medoids (PAM)
within each cluster. The corresponding cluster medoid is defined as a
preciseTAD boundary point (PTBP), making it the most
representative and biologically meaningful base within each clustered
PTBR. The output includes a list of genomic coordinates of PTBPs and
PTBSs.
preciseTAD
allows us to use the pre-trained model to
predict the precise location of domain boundaries in cell types without
Hi-C data but with genome annotation data. Specifically, only
cell-type-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and
ZNF143 transcription factor binding sites is required. We provide models
pre-trained using GM12878 and K562 genome annotation data, and
Arrowhead- and Peakachu-predicted boundaries for chromosome-specific
prediction of precise domain boundaries in other cell types.
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
BiocManager::install("preciseTAD")
# For the latest version install from GitHub
# BiocManager::install("dozmorovlab/preciseTAD")
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'GenomicRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'IRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'GenomeInfoDb'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'XVector'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'AnnotationDbi'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'Biostrings'
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'GenomicFeatures'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'rtracklayer'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'GenomicAlignments'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'SummarizedExperiment'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'S4Arrays'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'DelayedArray'
## Warning: replacing previous import 'BiocGenerics::setequal' by 'S4Vectors::setequal' when loading 'SparseArray'
## Warning: replacing previous import 'S4Arrays::read_block' by 'DelayedArray::read_block' when loading 'SummarizedExperiment'
preciseTAD
requires users to supply genomic coordinates
of the “ground-truth” domain boundaries to establish the response vector
(Y). As an example, we consider TAD boundaries derived
from the popular Arrowhead TAD caller, a part of the Juicebox
suite of tools developed by Aiden Lab (Durand et
al. 2016). To get boundaries, Arrowhead was applied on
the autosomal chromosomes using 5 kb GM12878 Hi-C data (Rao et al. 2014) (GSE63525).
An example of the command line script used to called TADs with
Arrowhead is provided below, as well as the first three columns
of the resulting TAD coordinates. A more detailed tutorial for
implementing Arrowhead can be found here.
arrowhead -c 1 \ #chromosome to call TADs on
-r 5000 \ #HiC data resolution
~/GSE63525_GM12878_insitu_primary.hic \ #location of the .HIC file
~/arrowhead_output #location to store the output
## V1 V2 V3
## 1 1 49375000 50805000
## 2 1 16830000 17230000
## 3 1 163355000 164860000
## 4 1 231935000 233400000
## 5 1 149035000 149430000
## 6 1 3995000 5505000
Users will then need to transform this TAD coordinate data into a
GRanges object of unique boundary coordinates using the
preciseTAD::extractBoundaries()
function. Note that the
input for this function is a BED-like data frame where the second and
third columns are genomic coordinates of domain anchor’s midpoints. Here
we only extract boundaries for CHR1 and CHR22
(CHR = c("CHR1", "CHR22")
parameter). We specify
preprocess=FALSE
because we are only interested in all
boundaries and not filtering TADs by length. Lastly, we specify
resolution=5000
to match the resolution used by
Arrowhead (although this argument is ignored given that
preprocess=FALSE
). As shown below, there were a total of
1901 unique TAD boundaries reported by Arrowhead for
chromosomes 1 and 22.
bounds <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb, filter = FALSE, CHR = c("CHR1", "CHR22"), resolution = 5000)
# View unique boundaries
bounds
## GRanges object with 1901 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 787 chr1 815000-815001 *
## 9196 chr1 890000-890001 *
## 37 chr1 915000-915001 *
## 8923 chr1 1005000-1005001 *
## 275 chr1 1015000-1015001 *
## ... ... ... ...
## 8379 chr22 50815000-50815001 *
## 16788 chr22 50935000-50935001 *
## 8382 chr22 50945000-50945001 *
## 16791 chr22 51060000-51060001 *
## 16670 chr22 51235000-51235001 *
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
To identify genomic features best predictive of TAD boundaries,
preciseTAD
requires functional genomic annotation data.
They are used to establish the feature space (X = {X1, X2, ⋯, Xp}).
Cell type-specific genomic annotation data can be downloaded from the ENCODE
data portal as BED files. Once you have downloaded your preferred list
of functional genomic annotations, store them in a specific file
location (i.e., “./path/to/BEDfiles”). These files can then be converted
into a GRangesList object and used to create the feature space using the
preciseTAD::bedToGRangesList()
function. The
signal
argument refers to the column in the BED files
containing peak signal strength values and is used to assign metadata to
the corresponding GRanges (only necessary for downstream plotting). We
have already provided a GRangesList object with 26 transcription factor
binding sites (TFBS) specific to the GM12878 cell type. Once you load it
in, you can see the list of transcription factors using the following
commands.
# path <- "./path/to/BEDfiles"
# tfbsList <- bedToGRangesList(filepath=path, bedList=NULL, bedNames=NULL, pattern = "*.bed", signal=4)
data("tfbsList")
names(tfbsList)
## [1] "Gm12878-Atf3-Haib" "Gm12878-Cfos-Sydh" "Gm12878-Cmyc-Uta" "Gm12878-Ctcf-Broad" "Gm12878-Egr1-Haib" "Gm12878-Ets1-Haib" "Gm12878-Gabp-Haib" "Gm12878-Jund-Sydh" "Gm12878-Max-Sydh" "Gm12878-Mazab85725-Sydh" "Gm12878-Mef2a-Haib" "Gm12878-Mxi1-Sydh" "Gm12878-P300-Sydh" "Gm12878-Pol2-Haib" "Gm12878-Pu1-Haib"
## [16] "Gm12878-Rad21-Haib" "Gm12878-Rfx5-Sydh" "Gm12878-Sin3a-Sydh" "Gm12878-Six5-Haib" "Gm12878-Smc3-Sydh" "Gm12878-Sp1-Haib" "Gm12878-Srf-Haib" "Gm12878-Taf1-Haib" "Gm12878-Tr4-Sydh" "Gm12878-Yy1-Sydh" "Gm12878-Znf143-Sydh"
## GRangesList object of length 26:
## $`Gm12878-Atf3-Haib`
## GRanges object with 1677 ranges and 1 metadata column:
## seqnames ranges strand | coverage
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 1167429-1167599 * | 28.5019
## [2] chr1 1510215-1510385 * | 64.7440
## [3] chr1 1655730-1655900 * | 25.1686
## [4] chr1 1709843-1710013 * | 15.5525
## [5] chr1 2343938-2344108 * | 19.6961
## ... ... ... ... . ...
## [1673] chrX 118699292-118699462 * | 17.6970
## [1674] chrX 119947928-119948098 * | 13.8884
## [1675] chrX 128481457-128481627 * | 26.1075
## [1676] chrX 132399481-132399651 * | 20.7070
## [1677] chrX 136520393-136520563 * | 24.4197
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
##
## ...
## <25 more elements>
Using the “ground-truth” boundaries and the following TFBS, we can
build the data matrix that will be used for predictive modeling. The
preciseTAD::createTADdata()
function can create the
training and testing data. Here, we specify to train on chromosome 1 and
test on chromosome 22. Additionally, we specify
resolution = 5000
to construct 5kb shifted genomic bins (to
match the Hi-C data resolution), featureType = "distance"
for a log2(distance + 1)-type
feature space, and resampling = "rus"
to apply random
under-sampling (RUS) on the training data to balance classes of boundary
vs. non-boundary regions. We also specify a seed to ensure
reproducibility when performing the resampling. The result is a list
containing two data frames: (1) the resampled (if specified) training
data, and (2) the testing data.
set.seed(123)
tadData <- createTADdata(bounds.GR = bounds,
resolution = 5000,
genomicElements.GR = tfbsList,
featureType = "distance",
resampling = "rus",
trainCHR = "CHR1",
predictCHR = "CHR22")
# View subset of training data
tadData[[1]][1:5,1:4]
## y Gm12878-Atf3-Haib Gm12878-Cfos-Sydh Gm12878-Cmyc-Uta
## 5799 No 18.73205 17.73674 14.60901
## 32930 No 21.23327 15.32696 15.03635
## 5767 Yes 18.07127 17.99360 13.22415
## 8504 Yes 19.24515 19.26026 18.31016
## 39486 No 21.23396 17.28550 19.09578
##
## No Yes
## 1595 1595
## y Gm12878-Atf3-Haib Gm12878-Cfos-Sydh Gm12878-Cmyc-Uta
## 1 No 24.08978 24.1100 24.12327
## 2 No 24.08937 24.1096 24.12287
## 3 No 24.08897 24.1092 24.12248
## 4 No 24.08857 24.1088 24.12208
## 5 No 24.08816 24.1084 24.12169
We can now implement our machine learning algorithm of choice to
predict boundary regions. Here, we opt for the random forest algorithm
for binary classification. preciseTAD
offers functionality
for performing recursive feature elimination (RFE) as a form of feature
reduction through the use of the preciseTAD::TADrfe()
function, which is a wrapper for the rfe
function in the
caret
package (Kuhn 2012).
preciseTAD::TADrfe()
implements a random forest model on
the best subset of features from 2 to the maximum number of features in
the data by powers of 2, using 5-fold cross-validation. We specify
accuracy as the performance metric. An example is shown below.
set.seed(123)
rfe_res <- TADrfe(trainData = tadData[[1]],
tuneParams = list(ntree = 500, nodesize = 1),
cvFolds = 5,
cvMetric = "Accuracy",
verbose = TRUE)
## Variables MCC ROC Sens Spec Pos Pred Value Neg Pred Value Accuracy Kappa MCCSD ROCSD SensSD SpecSD Pos Pred ValueSD Neg Pred ValueSD AccuracySD KappaSD
## 1 2 0.3491082 0.7420849 0.6526646 0.6952978 0.6822854 0.6667258 0.6739812 0.3479624 0.02627631 0.010098192 0.009508308 0.02993689 0.019873426 0.008273672 0.013160545 0.02632109
## 2 4 0.4613042 0.8026680 0.6984326 0.7611285 0.7455349 0.7163087 0.7297806 0.4595611 0.01904174 0.009974699 0.015099178 0.02275698 0.015584018 0.007949199 0.008769029 0.01753806
## 3 8 0.5076806 0.8229440 0.7122257 0.7931034 0.7749194 0.7340188 0.7526646 0.5053292 0.01922275 0.010486133 0.022648767 0.01063061 0.008681254 0.014664805 0.010711191 0.02142238
## 4 16 0.5150537 0.8263765 0.7197492 0.7943574 0.7779659 0.7394767 0.7570533 0.5141066 0.01881406 0.014157078 0.022886160 0.01696849 0.012896280 0.014149142 0.010802546 0.02160509
## 5 26 0.5307963 0.8294258 0.7210031 0.8075235 0.7894162 0.7439457 0.7642633 0.5285266 0.02605025 0.015234024 0.036287890 0.01480341 0.009062094 0.022029556 0.013957766 0.02791553
## Overall var Variables Resample
## 1 26.76923 Gm12878-Smc3-Sydh 26 Fold1
## 2 23.70939 Gm12878-Rad21-Haib 26 Fold1
## 3 20.08605 Gm12878-Ctcf-Broad 26 Fold1
## 4 17.65042 Gm12878-Znf143-Sydh 26 Fold1
## 5 14.17874 Gm12878-Pol2-Haib 26 Fold1
## 6 13.25481 Gm12878-Gabp-Haib 26 Fold1
Recursive feature elimination results indicate that model accuracy begins to stabilize when only considering the top four transcription factors for building random forest predictive models (Figure 1A). After aggregating the variable importance values of the top four TFBS across each cross-fold, we see that the top four most important TFBS in each of the five folds are the SMC3, RAD21, CTCF, and ZNF143 (Figure 1B). These are known components of the loop-extrusion model that has been proposed as a mechanism for the 3D architecture of the human genome (Sanborn et al. 2015; Fudenberg et al. 2016; Hansen et al. 2018).
Now that we have suitably reduced our feature space, we can implement
a random forest algorithm built simply on the TFBS mentioned above
(SMC3, RAD21, CTCF, and ZNF143). We can take advantage of the
preciseTAD::TADrandomForest()
function, which is a wrapper
of the randomForest
package (Breiman
2001; Liaw, Wiener, et al. 2002). We specify the training and
testing data, the hyperparameter values, the number of cross-validation
folds, the performance metric to consider (here, accuracy), the seed to
initialize for reproducibility, an indicator for retaining the model
object, an indicator for retaining variable importances, the variable
importance measure to consider (here, mean decrease in accuracy (MDA)),
and an indicator for retaining model performances based on the test
data. The function returns a list containing: 1) a trained model from
caret with model information (tadModel[[1]]
), 2) a
data.frame
of variable importance for each feature included
in the model (tadModel[[2]]
), and 3) a
data.frame
of various model performance metrics.
# Restrict the data matrix to include only SMC3, RAD21, CTCF, and ZNF143
tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad",
"Gm12878-Rad21-Haib",
"Gm12878-Smc3-Sydh",
"Gm12878-Znf143-Sydh")]
set.seed(123)
tadData <- createTADdata(bounds.GR = bounds,
resolution = 5000,
genomicElements.GR = tfbsList_filt,
featureType = "distance",
resampling = "rus",
trainCHR = "CHR1",
predictCHR = "CHR22")
# Run RF
set.seed(123)
tadModel <- TADrandomForest(trainData = tadData[[1]],
testData = tadData[[2]],
tuneParams = list(mtry = 2,
ntree = 500,
nodesize = 1),
cvFolds = 3,
cvMetric = "Accuracy",
verbose = FALSE,
model = TRUE,
importances = TRUE,
impMeasure = "MDA",
performances = TRUE)
# View model performances
performances <- tadModel[[3]]
performances$Performance <- round(performances$Performance, digits = 2)
rownames(performances) <- performances$Metric
kable(t(performances), caption = "List of model performances when validating an RF built on CHR1 on CHR22 test data.")
TN | FN | FP | TP | Total | Sensitivity | Specificity | Kappa | Accuracy | BalancedAccuracy | Precision | FPR | FNR | NPV | MCC | F1 | AUC | Youden | AUPRC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metric | TN | FN | FP | TP | Total | Sensitivity | Specificity | Kappa | Accuracy | BalancedAccuracy | Precision | FPR | FNR | NPV | MCC | F1 | AUC | Youden | AUPRC |
Performance | 6400.00 | 67.00 | 2954.00 | 239.00 | 9660.00 | 0.78 | 0.68 | 0.08 | 0.69 | 0.73 | 0.07 | 0.32 | 0.01 | 0.99 | 0.17 | 0.14 | 0.80 | 0.47 | 0.09 |
As you may know, there exist other machine learning binary
classifiers that can be used in this setting. For example, suppose we
opt to implement a support vector machine (SVM). This is easy enough to
accomplish given that preciseTAD::createTADdata()
conveniently sets up the training and testing data sets. We use the
e1071
package to run the SVM with a radial kernel, cost =
1, and gamma = 0.5 using the example command below. We see that the SVM
model’s accuracy is only 0.67, whereas the accuracy from our random
forest was 0.69 (Table 1).
svmModel <- svm(y ~ ., data = tadData[[1]], kernel = "radial", cost = 1, gamma = 0.5)
svmPreds <- predict(svmModel, tadData[[2]][, -1], positive = "Yes")
# View confusion matrix
table(svmPreds, tadData[[2]][, 1])
##
## svmPreds No Yes
## No 6278 49
## Yes 3076 257
Recall that our model classifies boundary regions,
in that each prediction refers to a genomic bin of width 5000 bases. To
predict boundary coordinates at the base resolution more precisely, we
can leverage our model through the use of the
preciseTAD::preciseTAD()
function. Conceptually, instead of
genomic bins, we annotate each base with the selected genomic
annotations and featureType. We then apply our model on this annotation
matrix to predict the probability of each base is a boundary given the
associated genomic annotations. To minimize computational costs, the
base-level prediction should be performed in selected regions.
Suppose we want to precisely predict the domain boundary coordinates
for the 2Mb section of CHR22:25,500,000-27,500,000. To do so, we specify
chromCoords = list(25500000, 27500000)
. Additionally, we
set a range of probability threshold values including 0.975, 0.99, and
1.0 used for constructing PTBRs. For DBSCAN, we assign a range of ϵ-neighborhood values including
5000, 10000, 15000, 20000, and 30000 and maintain 3 as the
MinPts value. We specify verbose=TRUE
so that the
function will print out some minor results for each combination of (t, ϵ) as well as the
overall best combination
# Run preciseTAD
set.seed(123)
pt <- preciseTAD(genomicElements.GR = tfbsList_filt,
featureType = "distance",
CHR = "CHR22",
chromCoords = list(17000000, 18000000),
tadModel = tadModel[[1]],
threshold = 1,
verbose = TRUE,
parallel = NULL,
DBSCAN_params = list(eps = c(30000), # c(10000, 30000, 50000)
MinPts = c(100)), # c(3, 10, 100)
slope = 5000,
genome = "hg19",
BaseProbs = TRUE)
## [1] "Establishing bp resolution test data using a distance type feature space"
## [1] "Establishing probability vector"
## [1] "Determining optimal combination of epsilon neighborhood (eps) and minimum number of points (MinPts)"
## [1] "Initializing DBSCAN for MinPts=100 and eps=30000"
## [1] " preciseTAD identified 2 PTBRs"
## [1] " Establishing PTBPs"
## [1] " Cluster 1 out of 2"
## [1] " Cluster 2 out of 2"
## [1] "Optimal combination of MinPts and eps is = (100, 30000)"
## [1] "preciseTAD identified a total of 1678 base pairs whose predictive probability was equal to or exceeded a threshold of 1"
## [1] "Initializing DBSCAN for MinPts = 100 and eps = 30000"
## [1] "preciseTAD identified 2 PTBRs"
## [1] "Establishing PTBPs"
## [1] "Cluster 1 out of 2"
## [1] "Cluster 2 out of 2"
# What is getting returned? PTBRs - precide TAD boundary regions, and PTBPs - boundary points
names(pt)
## [1] "preciseTADparams" "PTBR" "PTBP" "Summaries"
When specifying verbose=TRUE
the function provides brief
output for all t × ϵ
combinations including how many PTBRs/PTBPs result from each
combination. NOTE: it is advised to provide verbose=FALSE
when implementing preciseTAD using multiple t and ϵ on the entire chromosome.
Additionally, we see from the verbose statements that the optimal
combination of (t, ϵ) = (1, 30000).
The full output of the preciseTAD::preciseTAD()
function
is a list with 4 elements: 1) data frame with average (and standard
deviation) normalized enrichment (NE) values for each combination of t
and eps (only if multiple values are provided for at least paramenter;
all subsequent summaries are applied to optimal combination of (t,
eps)), 2) the genomic coordinates spanning each preciseTAD predicted
region (PTBR), 3) the genomic coordinates of preciseTAD predicted
boundaries points (PTBP). 4) a named list including summary statistics
of the following: PTBRWidth - PTBR width, PTBRCoverage - the ratio of
base level coordinates with probabilities that exceed the threshold to
PTBRWidth, DistanceBetweenPTBR - the genomic distance between the end of
the previous PTBR and the start of the subsequent PTBR, NumSubRegions -
the number of elements in each PTBR cluster, SubRegionWidth - the
genomic coordinates spanning the subregion associated with each PTBR,
DistBetweenSubRegions - the genomic distance between the end of the
previous PTBR-specific region and the start of the subsequent
PTBR-specific region, NormilizedEnrichment - the normalized enrichment
of the genomic annotations used in the model around flanked PTBPs, and
BaseProbs - a numeric vector of probabilities for each corresponding
base coordinate. Normalized enrichment is calculated as the total number
of peak regions that overlap with flanked predicted boundaries divided
by the number of predicted boundaries. A schematic of how each of the
summaries are calculated is presented in Figure 2.
The vector of base-specific probabilities of being a boundary
(BaseProbs
) may be large, depending on the
chromCoords
settings. In our example, it has 2 million
entries. Be careful when setting BaseProbs = TRUE
to avoid
memory issues. To see how the probabilities for the first five hundred
bases look like, we do:
We can see that the probability of being a boundary seems to peak around 200 and 400 bases. It is well below 1, but may be worth further investigation.
Juicebox is an interface provided by Aiden Lab that allows for
superimposing boundary coordinates onto Hi-C contact maps. To visualize
domains flanked by the predicted boundaries, you must first select a
Hi-C map. As an example, you can import the contact matrix for GM12878
derived by Rao et al. 2014 by choosing
File -> Open... -> Rao and Huntley et al. -> GM12878 -> in situ Mbol -> primary
.
To format preciseTAD
results to use in Juicebox, users can
take advantage of preciseTAD::juicer_func()
, which is a
function available in the preciseTAD
R package that
transforms a GRanges object into a data frame as shown below.
You will then need to save the PTBPs as a BED file using
write.table
as shown below. Once saved, import them into
Juicebox using
Show Annotation Panel -> 2D Annotations -> Add Local -> pt_juice.bed
.
preciseTAD
allows us to use the pre-trained model to
predict the precise location of domain boundaries in cell types without
Hi-C data but with genome annotation data. Specifically, only the cell
type-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and
ZNF143 transcription factor binding sites is required. These data need
to be assembled in a GRanges object using the aforementioned
bedToGRangesList()
function (i.e., creating another
cell-type-specific tfbsList_filt
object).
The pre-trained model tadModel
can be obtained by using
bedToGRangesList()
, extractBoundaries()
,
createTADdata()
, and TADrandomForest()
functions, as described. Alternatively, chromosome-specific pre-trained
models can be downloaded from the OneDrive
storage.
We recommend models pre-trained using GM12878 genome annotation data and
Peakachu peaks. Note that each model is annotated with “CHRi_holdout”,
representing the chromosome held out of training, and thus to be used to
predict boundaries. It should be noted that the pretrained
preciseTAD models were trained on cancer cell lines (GM12878
and K562). Although we expect that the fundamental rules of domain
boundary formation (CTCF etc. binding) will be preserved in cancer
cells, use caution when applying the pretrained models to predict
boindaries in normal cell lines.
As an example, we consider the Hela-S3 cell line. We will download and use the CHR22_GM12878_5kb_Peakachu model to predict boundaries for chr22 in the Hela-S3 cell line.
# Read in RF model built on chr1-chr21
tadModel <- readRDS("~/Downloads/CHR22_GM12878_5kb_Peakachu.rds")
The Hela-S3-specific BED-formatted TFBS data can be downloaded at ENCODE. The selected TFBS data can be downloaded from the supplementary repository. They can be converted into a GRangesList.
# Establishing Hela-specific genomic annotations
# Reading in CTCF, RAD21, SMC3, and ZNF143 from repository and storing as list
ctcf <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/ctcf.bed", sep = "\t", header = FALSE))
rad21 <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/rad21.bed", sep = "\t", header = FALSE))
smc3 <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/smc3.bed", sep = "\t", header = FALSE))
znf143 <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/znf143.bed", sep = "\t", header = FALSE))
helaList <- list(ctcf, rad21, smc3, znf143)
hela.GR <- bedToGRangesList(bedList=helaList, bedNames=c("Ctcf", "Rad21", "Smc3", "Znf143"), signal=5)
Now we are ready to predict boundaries on chr22 for Hela-S3.
# Run preciseTAD
set.seed(123)
pt <- preciseTAD(genomicElements.GR = hela.GR,
featureType = "distance",
CHR = "CHR22",
chromCoords = list(25500000, 27500000),
tadModel = tadModel[[1]],
threshold = 1.0,
verbose = FALSE,
parallel = NULL,
DBSCAN_params = list(30000, 100),
slope = 5000,
BaseProbs = FALSE,
savetobed = FALSE)