Most cellular processes are regulated by RNA-binding proteins (RBPs). Knowledge on their exact positioning can be obtained from individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments. In a recent publication we described a complete analysis workflow to detect RBP binding sites from iCLIP data. The workflow covers all essential steps from quality control of sequencing reads, different peak calling options, to the downstream analysis and definition of binding sites. The pre-processing and peak calling steps rely on publicly available software, whereas the definition of the final binding sites follows a custom procedure implemented in BindingSiteFinder. This vignette explains how equally sized binding sites can be defined from a genome-wide iCLIP coverage.
The workflow described herein is based on our recently published complete iCLIP analysis pipeline (Busch et al. 2020). Thus, we expect the user to have preprocessed their iCLIP sequencing reads up to the point of the peak calling step. In brief, this includes basic processing of the sequencing reads, such as quality filtering, barcode handling, mapping and the generation of a single nucleotide crosslink file for all replicates under consideration. As we describe in our manuscript, replicate .bam files may or may not be merged prior to peak calling, for which we suggest PureCLIP (Krakau, Richard, and Marsico 2017). For simplicity, we address only the case in which peak calling was performed on the merge of all replicates.
The BindingSiteFinder
package is available at https://bioconductor.org and can be
installed via BiocManager::install
:
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("BindingSiteFinder")
Note: If you use BindingSiteFinder in published research, please cite:
Busch, A., Brüggemann, M., Ebersberger, S., & Zarnack, K. (2020) iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites. Methods, 178, 49-62. https://doi.org/10.1016/j.ymeth.2019.11.008
This part is for the impatient user that wants to start as fast as possible. For that reason this section contains only the most basic steps and function calls to get to the final binding site set. For more details on each step see the Standard workflow section.
It all starts with your input iCLIP data, which has to be in form of
a BSFDataSet
. This data-structure combines crosslink sites
(the output from PureCLIP) with the individual crosslinks per
replicate (iCLIP coverage). Further instructions on the data-set
constructions are given in the chapter Construction
of the BindingSiteFinder dataset.
files <- system.file("extdata", package="BindingSiteFinder")
load(list.files(files, pattern = ".rda$", full.names = TRUE))
bds
## Dataset: Test set
## Object of class BSFDataSet
## #N Ranges: 19,749
## Width ranges: 1
## #N Samples: 4
## #N Conditions: 1
To intersect binding sites with genomic features, such as genes and
transcript regions the respective regions have to be presented as
GenomicRanges
objects. Further instructions on how to
import annotations from various sources are given in chapter Construct the
annotation objects.
## GRanges object with 1387 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000008735.14 chr22 50600793-50613981 + | ENSG00000008735.14
## ENSG00000015475.19 chr22 17734138-17774770 - | ENSG00000015475.19
## ENSG00000025708.14 chr22 50525752-50530032 - | ENSG00000025708.14
## ENSG00000025770.19 chr22 50508224-50524780 + | ENSG00000025770.19
## ENSG00000040608.14 chr22 20241415-20283246 - | ENSG00000040608.14
## ... ... ... ... . ...
## ENSG00000288106.1 chr22 38886837-38903794 + | ENSG00000288106.1
## ENSG00000288153.1 chr22 30664961-30674134 + | ENSG00000288153.1
## ENSG00000288262.1 chr22 11474744-11479643 + | ENSG00000288262.1
## ENSG00000288540.1 chr22 48320412-48333199 - | ENSG00000288540.1
## ENSG00000288683.1 chr22 18077989-18131720 + | ENSG00000288683.1
## gene_type gene_name
## <character> <character>
## ENSG00000008735.14 protein_coding MAPK8IP2
## ENSG00000015475.19 protein_coding BID
## ENSG00000025708.14 protein_coding TYMP
## ENSG00000025770.19 protein_coding NCAPH2
## ENSG00000040608.14 protein_coding RTN4R
## ... ... ...
## ENSG00000288106.1 lncRNA AL022318.5
## ENSG00000288153.1 unprocessed_pseudogene AC003072.2
## ENSG00000288262.1 unprocessed_pseudogene CT867976.1
## ENSG00000288540.1 lncRNA AL008720.2
## ENSG00000288683.1 protein_coding AC016027.6
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## GRangesList object of length 4:
## $CDS
## GRanges object with 17035 ranges and 4 metadata columns:
## seqnames ranges strand | cds_id exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <integer> <character>
## chr22 15528192-15529139 + | 1 <NA> <NA>
## chr22 15690078-15690314 + | 2 <NA> <NA>
## chr22 15690078-15690709 + | 3 <NA> <NA>
## chr22 15690246-15690709 + | 4 <NA> <NA>
## chr22 15690426-15690709 + | 5 <NA> <NA>
## . ... ... ... . ... ... ...
## chr22 50782188-50782294 - | 17031 <NA> <NA>
## chr22 50782188-50782294 - | 17032 <NA> <NA>
## chr22 50782188-50782294 - | 17033 <NA> <NA>
## chr22 50782188-50782294 - | 17034 <NA> <NA>
## chr22 50782234-50782294 - | 17035 <NA> <NA>
## exon_rank
## <integer>
## <NA>
## <NA>
## <NA>
## <NA>
## <NA>
## . ...
## <NA>
## <NA>
## <NA>
## <NA>
## <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
The fastest way to accurately define binding site with very little
manual intervention is running the wrapper function
BSFind()
. It will perform all necessary computations and
will store all results and diagnostics internally. Various plotting
functions can be used to make all sorts of diagnostics. Export functions
can be used to extract resulting binding sites for further processing.
Each step is covered in more detail in the Standard workflow section.
To get a fast glance at the data-set with minimal user intervention
that is also visually appealing, the quickFigure()
function
exists.
# not run
quickFigure(bdsQuick, what = "main", save.filename = "./my_fig_main.pdf", save.device = "pdf")
# not run
quickFigure(bdsQuick, what = "supp", save.filename = "./my_fig_supp.pdf", save.device = "pdf")
We showcase the standard workflow on the example of the RNA-binding protein U2AF2. The data-set consists of four replicates. Sequencing reads from all replicates were pre-processed up to the point of .bam files. These were then merged and subjected to peak calling with PureCLIP. Resulting peaks as well as the processed coverage files (.bigwigs) are what is needed as input for the following sections.
To build the BSFDataSet
one needs two types of
information. First, the result of the peak calling step (crosslink
sites) and second the iCLIP coverage in the form of .bigwig files for
each replicate. Here we first import the crosslink sites to be
represented as a GenomicRanges
object. We also expect that
all ranges are a single nucleotide wide1.
# load crosslink sites
csFile <- system.file("extdata", "PureCLIP_crosslink_sites_examples.bed",
package="BindingSiteFinder")
cs = rtracklayer::import(con = csFile, format = "BED", extraCols=c("additionalScores" = "character"))
cs$additionalScores = NULL
cs
## GRanges object with 1000 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 629906 + | 3 27.30000
## [2] chr1 629911 + | 3 1.72212
## [3] chr1 629912 + | 3 15.68900
## [4] chr1 629913 + | 3 13.82090
## [5] chr1 629915 + | 3 6.36911
## ... ... ... ... . ... ...
## [996] chr1 2396116 + | 3 7.74693
## [997] chr1 2396990 + | 3 3.92442
## [998] chr1 2396991 + | 3 14.52750
## [999] chr1 2396992 + | 3 6.59018
## [1000] chr1 2397001 + | 3 1.27159
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The .bigwig files with the coverage information are located in a
single folder. The path is given in the files
variable. We
then use the path to list all .bigwig files from each strand
separately.
# Load clip signal files and define meta data object
files <- system.file("extdata", package="BindingSiteFinder")
clipFilesP <- list.files(files, pattern = "plus.bw$", full.names = TRUE)
clipFilesM <- list.files(files, pattern = "minus.bw$", full.names = TRUE)
The strand specific vectors with the path
information
that point to each replicate .bigwig file are combined with additional
meta information into a meta table data.frame
. This
data.frame
has to have at minimum three columns, which must
be named condition
, clPlus
and
clMinus
. The clPlus
and clMinus
columns point towards the strand-specific coverage for each replicate.
This information will be imported as Rle objects
upon
object initialization.
# make the meta table
meta = data.frame(
id = c(1:4),
condition = factor(rep("WT", 4)),
clPlus = clipFilesP, clMinus = clipFilesM)
meta
## id condition
## 1 1 WT
## 2 2 WT
## 3 3 WT
## 4 4 WT
## clPlus
## 1 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep1_clip_plus.bw
## 2 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep2_clip_plus.bw
## 3 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep3_clip_plus.bw
## 4 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep4_clip_plus.bw
## clMinus
## 1 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep1_clip_minus.bw
## 2 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep2_clip_minus.bw
## 3 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep3_clip_minus.bw
## 4 /tmp/Rtmp6Am8Jd/Rinst24e5562aa4bd/BindingSiteFinder/extdata/rep4_clip_minus.bw
With the crosslink sites and the meta table ready we are good to
construct the BSFDataSet
object using the
BSFDataSetFromBigWig()
function. We can get information of
the constructed object using it’s show
method, which show
the number of ranges and crosslinks present in the object2.
## Dataset: Test set
## Object of class BSFDataSet
## #N Ranges: 19,749
## Width ranges: 1
## #N Samples: 4
## #N Conditions: 1
To take advantage of the full functionality of the package one needs to provide gene annotation information. These will be used to map computed binding sites directly on target genes and and transcript regions of protein-coding genes. Here, we use GENCODE hg38 gene annotations.
To present gene-level information we import the annotation as a .gff3
file into a GenomicRanges
object. To combine the ranges
with the available meta information for each gene (such as a
gene_id
, gene_name
or gene_type
),
we have to read our annotation twice. First, we import the ranges into a
TxDB
database using functionalists from GenomicFeatures.
This database can be queried using the genes()
function to
retrieve the positions of all genes from our resource as
GRanges
objects3. Second we use rtracklayer
to
import all ranges with the corresponding meta information. We then match
both imports to create a single GRanges
object which
contains all genes from our annotation with relevant meta information
added4.
library(txdbmaker)
# Make annotation database from gff3 file
annoFile = "./gencode_v37_annotation.gff3"
annoDb = txdbmaker::makeTxDbFromGFF(file = annoFile, format = "gff3")
annoInfo = rtracklayer::import(annoFile, format = "gff3")
# Get genes as GRanges
gns = genes(annoDb)
idx = match(gns$gene_id, annoInfo$gene_id)
meta = cbind(elementMetadata(gns),
elementMetadata(annoInfo)[idx,])
meta = meta[,!duplicated(colnames(meta))]
elementMetadata(gns) = meta
# Load GRanges with genes
geneFile <- list.files(files, pattern = "gns.rds$", full.names = TRUE)
load(geneFile)
gns
## GRanges object with 1387 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000008735.14 chr22 50600793-50613981 + | ENSG00000008735.14
## ENSG00000015475.19 chr22 17734138-17774770 - | ENSG00000015475.19
## ENSG00000025708.14 chr22 50525752-50530032 - | ENSG00000025708.14
## ENSG00000025770.19 chr22 50508224-50524780 + | ENSG00000025770.19
## ENSG00000040608.14 chr22 20241415-20283246 - | ENSG00000040608.14
## ... ... ... ... . ...
## ENSG00000288106.1 chr22 38886837-38903794 + | ENSG00000288106.1
## ENSG00000288153.1 chr22 30664961-30674134 + | ENSG00000288153.1
## ENSG00000288262.1 chr22 11474744-11479643 + | ENSG00000288262.1
## ENSG00000288540.1 chr22 48320412-48333199 - | ENSG00000288540.1
## ENSG00000288683.1 chr22 18077989-18131720 + | ENSG00000288683.1
## gene_type gene_name
## <character> <character>
## ENSG00000008735.14 protein_coding MAPK8IP2
## ENSG00000015475.19 protein_coding BID
## ENSG00000025708.14 protein_coding TYMP
## ENSG00000025770.19 protein_coding NCAPH2
## ENSG00000040608.14 protein_coding RTN4R
## ... ... ...
## ENSG00000288106.1 lncRNA AL022318.5
## ENSG00000288153.1 unprocessed_pseudogene AC003072.2
## ENSG00000288262.1 unprocessed_pseudogene CT867976.1
## ENSG00000288540.1 lncRNA AL008720.2
## ENSG00000288683.1 protein_coding AC016027.6
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
To represent all possible transcript regions, we re-use the
previously created TxDB
annotation database. The GenomicFeatures
package provides dedicated information to retrieve specific sets of
ranges from such a database object, like CDS, UTRs,
etc. Here we store these ranges into a GRangestList
and
name the elements accordingly5.
# Count the overlaps of each binding site for each region of the transcript.
cdseq = cds(annoDb)
intrns = unlist(intronsByTranscript(annoDb))
utrs3 = unlist(threeUTRsByTranscript(annoDb))
utrs5 = unlist(fiveUTRsByTranscript(annoDb))
regions = GRangesList(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
Now that we have prepared our input data, we are good to use the main
workhorse, the BSFind()
function. This function is
essentially a high-level wrapper around all core utility functions and
provides an easy entry point into the analysis. At it’s core each
analysis step consists of a utility function that computes the
respective result. Each utility function further comes along with a set
of diagnostic/ result plotting functions. These can be used to assess
the quality of every computational step. We recommend to once run the
entire analysis with BSFind()
and then follow up on each
step in detail using the plot functionalities6.
bdsOut = BSFind(object = bds, anno.genes = gns, anno.transcriptRegionList = regions,
est.subsetChromosome = "chr22", veryQuiet = TRUE, est.maxBsWidth = 29)
It is good practice to once visualize the entire binding site
definition process after BSFind()
was executed. This helps
to keep track what steps were performed and how the flow into each
other7.
As a default all of the shown steps are executed sequentially in the given order. To change that order, each utility function can be called individually to set up a customized workflow (for details see Construct your own pipeline).
The first and main step of any binding site analysis is the actual binding site definition step. In the following we showcase how you can apply the main workflow functions to accurately define binding sites and judge the results.
An optional step prior to the actual merging of crosslink sites into binding sites is pre-filtering. Depending on the experiment type or sequencing depth, it might be useful to retain only the most informative crosslink sites. In the case of PureCLIP, the positions called as significantly enriched (hereafter called PureCLIP sites) are associated with a binding affinity strength score (hereafter called PureCLIP score) which can be used as a metric for the pre-filtering step. As default PureCLIP sites with the lowest 1% score will be removed.
In case the option bsSize
is not set, the optimal
binding site width is estimated from the data directly (using the
estimateBsWidth
function). Since the final binding site
width depends on the quality of the input crosslink sites,
bsSize
is estimated together with the gene-wise filter
level (cutoff.geneWiseFilter
). That is filtering by the top
X% crosslink sites per gene. Filtering on gene level has been shown to
be really effective in diluting the most informative signal from the
data8.
Signal-to-flank score values are estimated on all binding
sites on a given chromosome (est.bsResolution
) for varying
levels of accuracy. For est.geneResolution
options “coarse”
results in 20% steps, “medium” in 10%, “fine” in 5% and “finest” in 1%
steps. For est.bsResolution
option “coarse” and “medium”
approximate binding sites by extending the center of the each merged
crosslink region. In “coarse” the center is found by the maximum
PureCLIP score, in “medium” the center is found by the position with the
most stacked crosslinks. Option “fine” employs a full binding site
definition cycle. Binding sites are tested from the minimum size, which
is 3, up to maxBsWidth
.
After iterating over all sets given the above resolutions a mean
score per binding site width is calculated and the binding site width
that yielded the highest score is selected as optimal. The
est.minimumStepGain
option allows control over the minimum
additional gain in the score that a tested width has to have to be
selected as the best option (the default is 2% increase). This
essentially lets binding site sizes to be computed more conservatively,
opting for smaller binding sites rather than larger ones. This is
particularly useful when the estimated mean score flattens out at a
certain point showing only a very small increase in the score per
binding site width increase. Samples with broad binding RBPs or samples
with poor coverage tend to show this behavior.
In case you already know your binding site width and/or gene-wise
filtering threshold you can invoke BSFind()
with these
parameters, to avoid the execution of the utility function
estimateBsWidth()
. For details see section If
you already know your binding site size.
In case you want to manually investigate which binding site size to use see the section How to manually test different binding site sizes.
The gene-wise filter cutoff estimated in the previous step
(cutoff.geneWiseFilter
) is applied to keep only the top
percent of crosslink sites, ranked by their PureCLIP score. In
case the value for cutoff.geneWiseFilter
was given directly
without estimation, that value is used directly.
Such a filter essentially helps to improve the binding site quality by using only the defined subset per gene of all PureCLIP sites for the binding site definition. This is because the strength of a binding site is tied to the expression level of the hosting gene. Thus a gene-wise percentage filter will increase the quality on all hosting genes in the same order.
Since a single PureCLIP site might overlap with multiple different genes given in the annotation, a diagnostic function helps to visualize the magnitude of that problem.
To handle such overlapping cases the overlaps
parameter
can be used with options "keepSingle"
,
"removeAll"
and "keepAll"
. We recommend to
stick with the default (overlaps="keepSingle"
), which
reduces the number of crosslink sites on overlapping loci to a single
instance. This will conserve the number of crosslink sites. The
additional options overlaps="removeAll"
and
overlaps="keepAll"
will either increase or decrease the
number of crosslink sites in the set and should be handled with
care.
In this step we turn the single-nucleotide wide PureCLIP sites into
binding sites of a desired size (bsSize
). This most
essential parameter can either be estimated automatically, or be set
directly. For the inexperienced user we recommend to use the estimated
size in almost all cases. An exception would be very sparse or
incomplete data, which should result in warning messages during the
estimation step. If you do want to use a manual size, because you have
eg. prior-knowledge of your RBP please have a look at the If
you already know your binding site size section first.
Binding site merging happens by concatenation of neighboring crosslink sites into larger regions. These regions are then split up again using the crosslink coverage to guide the position of the final binding sites. This process happens iteratively, where each merge and resize round resolves more regions unitl all regions are converted into binding sites. The following diagnostic plots the number of regions that are still left to be resolved after each round and compares it to the size of these regions9.
A variety of different filtering options for placing a final binding
site inside a merged region exist. Most importantly
merge.minClSites
gives the minimum number of initial
crosslink sites that have to overlap a final binding site. How this and
all other options affect the final binding site set size is displayed by
the following diagnostic.
Since the initially used crosslink sites are computed from the merged signal of all replicates, binding sites resulting from the previous merge might not be reproducible among all replicates. For that reason, we specifically ask in this section which of the computed binding sites are reproduced by the individual replicates.
Since replicates might differ in library size, a replicate-specific threshold is computed base on the binding site support distribution. That is how many crosslinks fall into the computed binding site for each replicate. In other words, we compute the number of crosslinks per binding site and replicate. This results in the following histogram.
When calling BSFind()
we did not specify any further
cutoff for this particular step, so the 5% quantile
(cutoff
) with a lower boundary of 1 crosslink
(minCrosslinks
) is used as default. Here for example a
binding site would be called reproducible by replicate No.3 if at least
4 crosslinks from that replicate fall into the range of the binding
site[^8].
[^8] Note: These settings can be adapted to for example produce high and low confidence binding site sets.
After computing the replicate-specific thresholds as indicated above, one must decide on the number of replicates that must meet the thresholds for a binding site to be seen as reproducible.
A binding site that meets the threshold for all replicates in the set
would be highly reproducible, whereas a binding site that meets the
threshold for a single replicate only would be not reproducible at all.
To find a balance in this question we suggest to stay with the default
of N-1 for the nReps
parameter. However if your replicates
show a very high variability among each other changing this parameter
could be beneficial to increase the specificity of the reproducible set.
The following diagnostic assists in that decision, by visualizing the
degree of which reproducible sets overlap among each other.
A common question for reproducibility is the visualization of the between replicate pairwise reproducibility. The following scatter plot does exactly that. Depending on when the correlations are calculated, the binding site set is either already corrected for reproducibility or not. To compare pairwise correlations before and after the application of the reproducibility filtering see Reproducibility scatter before and after correction.
A major question the follow from binding site definition is the assessment of the genomic targets that the RBP under consideration binds to. In the following section we showcase how target genes and transcript regions can be identified and what one has to consider doing so. In case you don’t have a gene annotation at hand but still want to compute binding sites, have a look at the section Work without a gene annotation.
Here we essentially overlap the annotation given in
anno.genes
with the reproducible set of binding site
computed above. Depending on the source and organism, gene annotations
overlap each other to some degree. We implemented several strategies to
resolve these overlaps with overlaps="frequency"
being the
recommended default. Here the most frequently present gene_type
is used to resolve overlaps. This however requires that the annotated
genes (given in anno.genes
) carry gene type information as
a meta column. The exact column name can be specified with
match.geneType
. The same requirement holds true for the
option overlaps="hierarchy"
, where a manual hierarchical
rule on the gene type is used (overlaps.rule
is required)10. The
following plot visualizes among which gene types the most overlaps
exist11.
The final target gene spectrum can now be visualized with the following plot. Depending on which of the overlap strategies are being used, numbers in this plot will change slightly.
The assigned gene with it’s ID, name and gene type (if given) is also
attached to the binding site meta data, which can be retrieved as
GRanges
object using the getRanges()
function.
Similar to the target gene, each binding site can be assigned to a
hosting transcript region. This is done using the
anno.transcriptRegionList
resource that we defined earlier.
Essentially each binding site is overlapped with each of the ranges
given in the region list. As one would expect, the problem of
overlapping annotations is increased manifold on the level of transcript
regions compared to entire genes.
Options to resolve these are similar to those on gene level. Option
overlaps="frequency"
will resolve based on the most
frequently observed transcript region, overlaps="hierarchy"
uses a manual hierarchic assignment based on
overlaps.rule
12. Additionally, option
overlaps="flag"
will just flag all overlaps with the tag
ambiguous and option overlaps="remove"
will remove
all those cases. The following plot illustrates the degree of overlap
for our example data-set.
We observe that most overlaps are generated by binding sites that are
either within an intron or a 3’UTR. As indicated by the plot we use the
default strategy overlaps="frequency"
is used to resolve
these cases. Since introns were most frequently observed in the data,
the transcript region intron was selected in these cases.
The final binding spectrum on transcript region level can be visualized with the following plot. Depending on which of the overlap strategies are being used, numbers in this plot will change slightly.
The assigned transcript region is also attached to the binding site
meta data, which can be retrieved as GRanges
object using
the getRanges()
function.
If you want to learn about how to normalize these profiles by the length of each region see section Normalize transcript regions.
The described matching of binding sites and transcript regions is
entirely based on overlaps with the regions given in
anno.transcriptRegionList
. This means one can customize
these regions by using different sets of ranges to extend and modify
this list. For further details on how to do this see section Use
a different set of transcript regions.
Another common task is to assess the strength and shape of a binding site. We provide two additional metrics as part of our core workflow that allow to assess exactly that.
The affinity of a binding site to the RNA is already given to some
degree by PureCLIP. The PureCLIP model computes a
strength score that captures affinity for each crosslink site. Here we
re-map these scores from the initial crosslink sites to the final
binding sites. This is done through the match.score
argument that point to the meta column name with the score from the
initial crosslink sites13. As for the default we assign the highest
score from all overlapping crosslink site to the binding site
(match.option="max"
). The following plot shows the
distribution of scores as density for all binding sites after
re-assignment.
How well the defined binding sites fit to the observed crosslink coverage can be assess through binding site definedness. This also gives first insights into the overall binding site shape and the binding mode of the RBP. Here we assess this property through a percent bound score, that divides the number of crosslinks inside the binding sites by the total number of crosslinks in a region around the binding site. As default, this background region is two times the size of the binding site. In other words, we compare the signal within a binding site to a same sized window directly flanking the binding site on both sides. A splicing factor that recognizes very clear motifs will for example form very sharp peaks, which will lead to a high percent bound degree. A broad binding RBP on the other hand, that scans more along the RNA, will lead to a lower definedness score.
Depending on which steps were executed in the Genomic target
identification section, the plot can be grouped by the assignment
(by="all"
, by="transcript_region"
or
by="gene_type"
).
For proper integration with any other downstream analysis and application we provide a set of export functionalities that allow to quickly and easily retrieve any type of result.
If one stays inside the R/ Bioconductor environment one can simply
extract the final binding sites as GenomicRanges
, which
enables full functionality that comes with this powerful data class. As
indicated by the print all information from each processing step is kept
in the meta data of the granges object. This makes it easy to combine
your analysis results with orthogonal data and projects.
## GRanges object with 2017 ranges and 9 metadata columns:
## seqnames ranges strand | bsID bsSize currIdx
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## 1 chr22 11630138-11630142 + | BS1 5 1
## 2 chr22 11630413-11630417 + | BS2 5 2
## 3 chr22 15785546-15785550 + | BS3 5 3
## 4 chr22 15786268-15786272 + | BS4 5 4
## 5 chr22 15786287-15786291 + | BS5 5 5
## ... ... ... ... . ... ... ...
## 2188 chr22 50626769-50626773 - | BS2368 5 2013
## 2190 chr22 50769684-50769688 - | BS2370 5 2014
## 2191 chr22 50776992-50776996 - | BS2371 5 2015
## 2192 chr22 50777869-50777873 - | BS2372 5 2016
## 2193 chr22 50783298-50783302 - | BS2373 5 2017
## geneID geneName geneType transcriptRegion score
## <character> <character> <character> <character> <numeric>
## 1 <NA> <NA> Intergenic OTHER 334.1870
## 2 <NA> <NA> Intergenic OTHER 77.5833
## 3 ENSG00000206195.11 DUXAP8 lncRNA INTRON 15.2087
## 4 ENSG00000206195.11 DUXAP8 lncRNA INTRON 16.5390
## 5 ENSG00000206195.11 DUXAP8 lncRNA INTRON 13.8455
## ... ... ... ... ... ...
## 2188 ENSG00000100299.18 ARSA protein_coding INTRON 37.1189
## 2190 ENSG00000079974.18 RABL2B protein_coding INTRON 20.3594
## 2191 ENSG00000079974.18 RABL2B protein_coding INTRON 37.5507
## 2192 ENSG00000079974.18 RABL2B protein_coding INTRON 22.7803
## 2193 ENSG00000079974.18 RABL2B protein_coding INTRON 16.2296
## signalToFlankRatio
## <numeric>
## 1 0.891358
## 2 0.826923
## 3 0.612903
## 4 0.781250
## 5 0.372881
## ... ...
## 2188 0.885714
## 2190 0.892857
## 2191 0.636364
## 2192 0.810811
## 2193 0.814815
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
For visual inspection and integration with further data-sets results can be easily exported in the widely used .bed format. This offers a quick way for manual inspection of the resulting binding sites in tools like IGV or UCSC genome browser.
Gene targets from the target gene assignment can be exported to
either csv
or xslx
grouped by
geneType
or transcriptRegion
. Genes can
additionally be sorted by the number of overlapping binding sites or by
binding site strength score.
To export iCLIP coverage directly we do not provide a dedicated
function, but rather we use the export
function from the rtracklayer
package. For details see the section Exporting iCLIP
signal.
The analysis workflow described in the Standard workflow section contains all steps for a comprehensive transcriptome-wide binding site analysis. However in some instances it might be useful to drill these analysis down to specific binding sites. In the following chapter we provide some useful ideas and functions when this could be useful and what type of questions could be addressed.
To visualize the iCLIP signal (aka coverage) on a specific binding
site, we provide the bindingSiteCoveragePlot()
function. It
takes a reference index for the binding site and then plots the coverage
as bars in a defined range around a selected binding site. The function
is based on the Gviz
package. The plotIdx
indicates which range should be used
as center, the flankPos
parameter allows to zoom in and
out. In addition to the selected range, also all other ranges which fall
into the selected window will be shown.
To specifically assess for example the reproducibility on a particular binding site, one could plot the coverage from each replicate on that site. The plot below shows the coverage from each replicate on an exemplaric binding site.
Sometimes it can also be useful to merge all replicates into a single
combined coverage. This can be done by setting
mergeReplicates=TRUE
.
A common case that makes use of this function is when one wants to see why a particular binding site is lost after a certain filtering step. Here, we look at a binding site that was filtered out from the final object by the reproducibility filter. Doing so, we can visually confirm that binding sites 5 and 6 were correctly removed by the reproducibility filter function.
rangesBeforeRepFilter = getRanges(bds)
rangesAfterRepFilter = getRanges(bdsOut)
idx = which(!match(rangesBeforeRepFilter, rangesAfterRepFilter, nomatch = 0) > 0)
rangesRemovedByFilter = rangesBeforeRepFilter[idx]
bdsRemovedRanges = setRanges(bds, rangesRemovedByFilter)
bindingSiteCoveragePlot(bdsRemovedRanges, plotIdx = 2, flankPos = 50)
We could also look at the final binding site definition and see how
they were derived from the initial PureCLIP sites. To achieve
this, we make use of the customRange
slot to add these
sites at the bottom of the coverage plot.
Another use case would be to check the coverage on some example genes, or as we do here on the binding site with the highest number of crosslinks:
bindingSiteCoverage = coverageOverRanges(bdsOut, returnOptions = "merge_positions_keep_replicates")
idxMaxCountsBs = which.max(rowSums(as.data.frame(mcols(bindingSiteCoverage))))
bindingSiteCoveragePlot(bdsOut, plotIdx = idxMaxCountsBs, flankPos = 100, mergeReplicates = FALSE, shiftPos = 50)
It is also possible to anchor the plot on any other
GenomicRange
. Here, we take annotated CDS regions and ask
for the one with the most overlapping binding sites. We then use this
ranges as center for the plot and further zoom in to a particular range.
We then make use of the customRange
slot to re-include the
binding site ranges as additional annotation shown underneath the
signal. Additionally, one could also add a custom annotation track in
the form of a GenomicRanges
or TxDB
object.
bdsCDS = setRanges(bdsOut, regions$CDS)
cdsWithMostBs = which.max(countOverlaps(regions$CDS, getRanges(bdsOut)))
bindingSiteCoveragePlot(bdsCDS, plotIdx = cdsWithMostBs, showCentralRange = FALSE,
flankPos = 250, shiftPos = 50, mergeReplicates = TRUE,
highlight = FALSE, customRange = getRanges(bdsOut),
customAnnotation = regions$CDS)
When assessing the number of binding sites per transcript region, it is common sense that more binding sites will map to intron as appose to for example 5’UTRs due to the size difference of these regions. Here we describe how one can be used to normalize for this effect.
The normalization scheme we provide is based on the hosting region
length of each binding site. That means the number of binding sites per
region can be divided by either the sum, mean or median of the hosting
region length from all binding sites (see normalize.factor
when using transcriptRegionSpectrumPlot()
). To avoid biases
by very short or long transcript regions an upper and lower boundary can
be set (see normalize.exclude.upper
and
normalize.exclude.lower
when using
assignToTranscriptRegions()
).
Here we show the transcript region spectrum on a percentage scale once normalized and once un-normalized.
In some instances the estimateBsWidth()
function used in
the BSFind()
wrapper might not lead to optimal results.
This can be the case if the starting data is of below average quality
with very few crosslink sites to begin with. Here manual binding site
profiling can be used to achieve the best possible result given the
input data.
To guide the choice of a good width, we recommend to explicitly
compute binding sites for a variety of width. Here we selected 3, 7, 19
and 29nt and manually execute makeBindingSites()
14.
# compute binding sites
bds1 <- makeBindingSites(object = bds, bsSize = 3)
bds2 <- makeBindingSites(object = bds, bsSize = 9)
bds3 <- makeBindingSites(object = bds, bsSize = 19)
bds4 <- makeBindingSites(object = bds, bsSize = 29)
# summarize in list
l = list(`1. bsSize = 3` = bds1, `2. bsSize = 9` = bds2,
`3. bsSize = 19` = bds3, `4. bsSize = 29` = bds4)
The combined coverage over these different binding sites can help to
quickly judge if the selected size does fit the data appropriately
(rangeCoveragePlot()
). Here, each plot is centered around
the binding site’s midpoint and the computed width is indicated by the
gray frame.
In our example, size = 3 appears too small, since not all of the relevant peak signal seems to be captured. On the contrary, size = 29 appears extremely large. Here, we decided for size = 9 because it seems to capture the central coverage peak best.
In case you already know what binding site size you want to use
BSFind()
can be called directly with that input size. This
is usually the case if one has prior knowledge from orthogonal
experiments or if the present data-set should be fix to the same size as
another data-set for comparison. With a fixed binding site at hand we
can call BSFind()
with that explicit size to avoid
estimation and use our manual input instead15.
bdsManual = BSFind(object = bds, anno.genes = gns, anno.transcriptRegionList = regions, bsSize = 9, cutoff.geneWiseFilter = 0)
If you are not sure how to choose an appropriate binding site size
for your data-set, see Standard
workflow which makes use of the estimateBsWidth()
function, or see the chapter How
to manually test different binding site sizes, to manually select a
size value.
In the case on does not want to work with BSFind()
as
main wrapper function, all core utility function can be called
separately. In our example from above where we manually decided on a
binding site width of 9nt we could also call
makeBindingSites()
directly. This avoids turning off
certain default options and will result in the most basic result
possible, a simple set of merged binding site.
If one is using more than one utility function with some pre-defined input options it is recommended to construct a small custom pipeline. This greatly increase readability and reproducibility of the code. Here we extend our binding site merging to 9nt from above by a pre-filter and add another layer of post-processing to it.
customBSFind <- function(object) {
this.obj = pureClipGlobalFilter(object)
this.obj = makeBindingSites(this.obj, bsSize = 9)
this.obj = calculateSignalToFlankScore(this.obj)
return(this.obj)
}
bdsSimple = customBSFind(bds)
Processing steps in custom pipelines can also be visualized with the flow chart function.
Setting up a custom pipeline can also be needed when one does not have any supporting gene annotation at hand. In those cases a simplified workflow pipeline can be defined that still allows binding site definition. To achieve this we simply combine all processing functions that do not require an explicit annotation into a new reduced wrapper16.
noAnnotationBSFind <- function(object) {
this.obj = pureClipGlobalFilter(object)
this.obj = makeBindingSites(this.obj, bsSize = 9)
this.obj = reproducibilityFilter(this.obj)
this.obj = annotateWithScore(this.obj, getRanges(object))
this.obj = calculateSignalToFlankScore(this.obj)
return(this.obj)
}
bdsSimple = noAnnotationBSFind(bds)
Since no gene or transcript level assignment was performed
(assignToGenes()
and
assignToTranscriptRegions()
was not run), the related
diagnostic plots are not available. Yet the diagnostics for the
functions we have called in noAnnotationBSFind()
can be
used to assess the performance of our analysis.
In order to visualize how the reproducibility filtering function
improves pairwise replicate reproducibility, the
reproducibilityScatterPlot()
function can be called before
and after reproducibility filtering.
When assigning binding sites to transcript regions, the
assignToTranscriptRegions()
utility function takes the
annotation ranges as a GRangesList
input. Then binding
sites are simply assigned to each list entry using the respective list
name. This fact can be exploited to essentially use any kind of range
set as input to assign binding site to any custom set of genomic ranges,
while handling overlaps as it is described in
assignToTranscriptRegions()
.
Besides the standard workflow and main processing functions a variety of further useful functions that come in handy when working with iCLIP data are provided to the user. In the following sections we demonstrate some these functionalities.
Sub-setting a BSFinderData
object can be useful in a
variety of cases, e.g. for reducing the object size for faster parameter
testing, limiting the analysis to some candidate genes etc. Here, we
subset the object by a random set of 100 binding sites and plot their
count distribution.
set.seed(1234)
bdsSub = bds[sample(seq_along(getRanges(bds)), 100, replace = FALSE)]
cov = coverageOverRanges(bdsSub, returnOptions = "merge_positions_keep_replicates")
df = mcols(cov) %>%
as.data.frame() %>%
pivot_longer(everything())
ggplot(df, aes(x = name, y = log2(value+1), fill = name)) +
geom_violin() +
geom_boxplot(width = 0.1, fill = "white") +
scale_fill_brewer(palette = "Greys") +
theme_bw() +
theme(legend.position = "none") +
labs(x = "Samples", y = "#Crosslinks (log2)")
Depending on the task at hand, one either wants to keep the iCLIP
signal separated by replicates or merge the signal over the replicates
(e.g. of the same condition). Merging signal can be done using the
collapseReplicates()
function. Doing so allows for example
to identify the proportion of crosslink events that each sample
contributes to the total of a binding site. Here, we did this for the
first 100 binding sites. We sort all binding sites by their fraction and
color the plot based on the replicate.
bdsMerge = collapseReplicates(bds)[1:100]
covTotal = coverageOverRanges(bdsMerge, returnOptions = "merge_positions_keep_replicates")
covRep = coverageOverRanges(bds[1:100], returnOptions = "merge_positions_keep_replicates")
df = cbind.data.frame(mcols(covTotal), mcols(covRep)) %>%
mutate(rep1 = round(`1_WT`/ WT, digits = 2) * 100,
rep2 = round(`2_WT`/ WT, digits = 2) * 100,
rep3 = round(`3_WT`/ WT, digits = 2) * 100,
rep4 = round(`4_WT`/ WT, digits = 2) * 100) %>%
tibble::rowid_to_column("BsID") %>%
dplyr::select(BsID, rep1, rep2, rep3, rep4) %>%
pivot_longer(-BsID) %>%
group_by(BsID) %>%
arrange(desc(value), .by_group = TRUE) %>%
mutate(name = factor(name, levels = name)) %>%
group_by(name) %>%
arrange(desc(value), .by_group = TRUE) %>%
mutate(BsID = factor(BsID, levels = BsID))
ggplot(df, aes(x = BsID, y = value, fill = name)) +
geom_col(position = "fill", width = 1) +
theme_bw() +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 7)) +
labs(x = "Binding site ID",
y = "Percentage",
fill = "Replicate"
) +
scale_y_continuous(labels = scales::percent)
Similar to exporting the binding sites as BED file for
external visualization, one can do the same with the iCLIP signal
itself. Here we use the export
function from the rtracklayer
package. We could now first merge the signal from all replicates into a
single combined signal. This signal can then be exported to
.bigwig format for easy visualization in IGV or th UCSC genome
browser.
Each BSFDataSet
contains the binding sites as ranges as
well as the individual crosslinks per sample. With the
clipCoverage()
function one can easily compute all sorts of
coverage over these ranges.
In the purest form this would result in a matrix with the number of
columns being equal to the number of nucleotides in the binding sites
(bsSize
) and the number of rows being equal to the number
of binding sites. If this matrix is computed for all samples the result
is a list of matrices with the described dimensions.
To simplify this output three parameter options exist to merge either
of the three dimensions (samples.merge
,
positions.merge
, ranges.merge
). Merging
samples will combine the coverage signal from individual samples17.
Merging positions combine all single nucleotide positions of the binding
sites into a single value. Merging ranges will combine all ranges. For
each merging operation one can choose between using the sum
or mean
. Depending on the shape of the result it can either
be attached to the ranges present in the BSFDataSet
which
results in a granges
type of output, or it can be returned
as a data.frame
.
## GRanges object with 2017 ranges and 13 metadata columns:
## seqnames ranges strand | bsID bsSize currIdx
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## BS1 chr22 11630138-11630142 + | BS1 5 1
## BS2 chr22 11630413-11630417 + | BS2 5 2
## BS3 chr22 15785546-15785550 + | BS3 5 3
## BS4 chr22 15786268-15786272 + | BS4 5 4
## BS5 chr22 15786287-15786291 + | BS5 5 5
## ... ... ... ... . ... ... ...
## BS2368 chr22 50626769-50626773 - | BS2368 5 2013
## BS2370 chr22 50769684-50769688 - | BS2370 5 2014
## BS2371 chr22 50776992-50776996 - | BS2371 5 2015
## BS2372 chr22 50777869-50777873 - | BS2372 5 2016
## BS2373 chr22 50783298-50783302 - | BS2373 5 2017
## geneID geneName geneType transcriptRegion
## <character> <character> <character> <character>
## BS1 <NA> <NA> Intergenic OTHER
## BS2 <NA> <NA> Intergenic OTHER
## BS3 ENSG00000206195.11 DUXAP8 lncRNA INTRON
## BS4 ENSG00000206195.11 DUXAP8 lncRNA INTRON
## BS5 ENSG00000206195.11 DUXAP8 lncRNA INTRON
## ... ... ... ... ...
## BS2368 ENSG00000100299.18 ARSA protein_coding INTRON
## BS2370 ENSG00000079974.18 RABL2B protein_coding INTRON
## BS2371 ENSG00000079974.18 RABL2B protein_coding INTRON
## BS2372 ENSG00000079974.18 RABL2B protein_coding INTRON
## BS2373 ENSG00000079974.18 RABL2B protein_coding INTRON
## score signalToFlankRatio 1_WT 2_WT 3_WT 4_WT
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## BS1 334.1870 0.891358 55 98 132 76
## BS2 77.5833 0.826923 11 21 39 15
## BS3 15.2087 0.612903 0 6 8 5
## BS4 16.5390 0.781250 3 9 12 1
## BS5 13.8455 0.372881 1 5 14 2
## ... ... ... ... ... ... ...
## BS2368 37.1189 0.885714 3 7 17 4
## BS2370 20.3594 0.892857 2 9 12 2
## BS2371 37.5507 0.636364 2 9 9 8
## BS2372 22.7803 0.810811 4 8 13 5
## BS2373 16.2296 0.814815 4 5 8 5
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Here we added the number of crosslinks per binding site and replicate
to the granges
object of the binding sites.
In this section we go beyond basic binding site definition and demonstrate how binding sites of a given RBP can be compared between conditions. For example one might be interested in how the binding pattern on of an RBP changes upon a specific mutation or condition.
Comparing two sets of binding sites from different biological conditions to each other is not straight forward, since a change observed between any two binding sites can also be caused by expression level changes of the hosting transcript. For that reason the following computational approach will correct for this effect resulting in a log2 fold-change for each binding site that is corrected for a potential transcript expression level change.
The main idea of the approach is to divide given iCLIP crosslinks into two bins for each gene. The first bin holds all crosslinks that are overlapping defined binding sites and thus can be attributed to the binding affinity of the RBP. We call this bin the binding-bin. The second bin holds all other crosslinks, which can be attributed to the expression level of the hosting RNA. We call this bin the background-bin.
We now make use of the condition specific changes in the
background-bin to correct all fold-changes for each binding
site in the binding-bin. This is done using the
DESeq2
likelihood ratio test, by comparing a “full” model
to a reduced “null” model.
In the most basic approach one wants to compute fold-changes for each binding site for the same RBP found in of two conditions. For the following example we name our conditions WT and KO.
At first binding sites have to be defined for each condition separately. This ensures that regardless of sequencing depth or expression levels not binding site can be missed or masked by either condition 18.
Here we load two pre-computed data sets, each having two replicates for WT and two for the KO 19. To learn how these sets can be generated from two individual data set, see the Starting from scratch section.
# make artifical data sets
myDataSets = makeExampleDataSets()
bds.wt = myDataSets$wt
bds.ko = myDataSets$ko
bds.wt
## Dataset: Example.WT
## Object of class BSFDataSet
## #N Ranges: 2,522
## Width ranges: 5
## #N Samples: 2
## #N Conditions: 2
## Dataset: Example.KO
## Object of class BSFDataSet
## #N Ranges: 2,554
## Width ranges: 5
## #N Samples: 2
## #N Conditions: 2
We follow up on the separate binding site definition by combining
both binding site sets into a single set on which we then perform the
differential binding testing. Merging two BSFDataSets
can
be done using the +
operator as shown below or by using the
combineBSF()
function.
## GRanges object with 5076 ranges and 6 metadata columns:
## seqnames ranges strand | bsID bsSize
## <Rle> <IRanges> <Rle> | <character> <integer>
## 1 chr22 11630138-11630142 + | BS1 5
## 2 chr22 11630413-11630417 + | BS2 5
## 3 chr22 15785523-15785527 + | BS3 5
## 4 chr22 15785540-15785544 + | BS4 5
## 5 chr22 15785546-15785550 + | BS5 5
## ... ... ... ... . ... ...
## 5072 chr22 50776987-50776991 - | BS5072 5
## 5073 chr22 50776992-50776996 - | BS5073 5
## 5074 chr22 50777821-50777825 - | BS5074 5
## 5075 chr22 50777869-50777873 - | BS5075 5
## 5076 chr22 50783298-50783302 - | BS5076 5
## geneID geneName geneType dataset
## <character> <character> <character> <character>
## 1 <NA> <NA> Intergenic Example.KO
## 2 <NA> <NA> Intergenic Example.KO
## 3 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## 4 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## 5 ENSG00000206195.11 DUXAP8 lncRNA Example.WT
## ... ... ... ... ...
## 5072 ENSG00000079974.18 RABL2B protein_coding Example.WT
## 5073 ENSG00000079974.18 RABL2B protein_coding Example.WT
## 5074 ENSG00000079974.18 RABL2B protein_coding Example.KO
## 5075 ENSG00000079974.18 RABL2B protein_coding Example.WT
## 5076 ENSG00000079974.18 RABL2B protein_coding Example.KO
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
To perform the differential testing we first need to extract crosslink counts for each binding site and background-bin per gene. It is all crosslinks that are attributed for the background-bin that are used for the expression change correction. This makes the differentiation of the iCLIP signal in binding-bin and background-bin one of the must critical steps.
To fine-tune this step a variety of options exist. First each binding site can be extend by an offset region. All crosslinks overlapping the offset will be removed. This means they are not counted towards the background-bin or the binding-bin. The same holds true for the blacklist-region which is an additional container which can be used to remove counts from certain ranges from the background-bin20.
Calculating the background counts requires a gene annotation
information to be present in the form of a GRanges
object.
For more details on how to generate this object from a .gtf/.gff3 file
please see the section Create annotation for
genes. The final counts are added to the meta columns of the binding
site GRanges
objects 21.
## GRanges object with 4961 ranges and 14 metadata columns:
## seqnames ranges strand | bsID bsSize
## <Rle> <IRanges> <Rle> | <character> <integer>
## BS3 chr22 15785523-15785527 + | BS3 5
## BS4 chr22 15785540-15785544 + | BS4 5
## BS5 chr22 15785546-15785550 + | BS5 5
## BS6 chr22 15785946-15785950 + | BS6 5
## BS7 chr22 15785962-15785966 + | BS7 5
## ... ... ... ... . ... ...
## BS5072 chr22 50776987-50776991 - | BS5072 5
## BS5073 chr22 50776992-50776996 - | BS5073 5
## BS5074 chr22 50777821-50777825 - | BS5074 5
## BS5075 chr22 50777869-50777873 - | BS5075 5
## BS5076 chr22 50783298-50783302 - | BS5076 5
## geneID geneName geneType dataset
## <character> <character> <character> <character>
## BS3 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## BS4 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## BS5 ENSG00000206195.11 DUXAP8 lncRNA Example.WT
## BS6 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## BS7 ENSG00000206195.11 DUXAP8 lncRNA Example.WT
## ... ... ... ... ...
## BS5072 ENSG00000079974.18 RABL2B protein_coding Example.WT
## BS5073 ENSG00000079974.18 RABL2B protein_coding Example.WT
## BS5074 ENSG00000079974.18 RABL2B protein_coding Example.KO
## BS5075 ENSG00000079974.18 RABL2B protein_coding Example.WT
## BS5076 ENSG00000079974.18 RABL2B protein_coding Example.KO
## counts.bs.1_WT counts.bs.2_WT counts.bs.3_KO counts.bs.4_KO
## <numeric> <numeric> <numeric> <numeric>
## BS3 2 5 6 2
## BS4 3 1 6 2
## BS5 0 6 8 5
## BS6 1 1 5 0
## BS7 0 2 8 1
## ... ... ... ... ...
## BS5072 0 1 7 1
## BS5073 2 9 9 8
## BS5074 2 2 5 3
## BS5075 4 8 13 5
## BS5076 4 5 8 5
## counts.bg.1_WT counts.bg.2_WT counts.bg.3_KO counts.bg.4_KO
## <numeric> <numeric> <numeric> <numeric>
## BS3 11 71 74 22
## BS4 11 71 74 22
## BS5 11 71 74 22
## BS6 11 71 74 22
## BS7 11 71 74 22
## ... ... ... ... ...
## BS5072 11 46 71 27
## BS5073 11 46 71 27
## BS5074 11 46 71 27
## BS5075 11 46 71 27
## BS5076 11 46 71 27
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Depending on the crosslink signal at hand some genes might not be suitable for differential testing. This is for example when there is overall not enough signal per gene, if the distribution between conditions is heavily tailored towards one side or if there is a strong imbalance between the background-bin and binding-bin per gene. The following function can be used to remove such cases to create a more reliable test set22.
## Warning in filterBsBackground(bds.diff): Found 34 genes with less total counts than 100.
## Removing a total of 34 genes and 39 binding sites.
## Warning in filterBsBackground(bds.diff): Found 12 genes with bs to bg ratio not meeting thresholds.
## Removing a total of 12 genes and 30 binding sites.
## Warning in filterBsBackground(bds.diff): Found 0 genes with condition balance ratio not meeting thresholds.
## Removing a total of 0 genes and 0 binding sites.
Finally differential testing can be carried out using the
calculateBsFoldChange()
function. This function computes
fold-changes for each binding site using the background-bin for
correction and the likelihood ratio test for significance testing (for
details on these test see the DESeq2
user manual).
Additionally, gene-level fold-changes are computed for each hosting gene
using the standard Wald test (again see the DESeq2
user
manual for further details)23.
## GRanges object with 4892 ranges and 26 metadata columns:
## seqnames ranges strand | bsID bsSize
## <Rle> <IRanges> <Rle> | <character> <integer>
## BS3 chr22 15785523-15785527 + | BS3 5
## BS4 chr22 15785540-15785544 + | BS4 5
## BS5 chr22 15785546-15785550 + | BS5 5
## BS6 chr22 15785946-15785950 + | BS6 5
## BS7 chr22 15785962-15785966 + | BS7 5
## ... ... ... ... . ... ...
## BS5072 chr22 50776987-50776991 - | BS5072 5
## BS5073 chr22 50776992-50776996 - | BS5073 5
## BS5074 chr22 50777821-50777825 - | BS5074 5
## BS5075 chr22 50777869-50777873 - | BS5075 5
## BS5076 chr22 50783298-50783302 - | BS5076 5
## geneID geneName geneType dataset
## <character> <character> <character> <character>
## BS3 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## BS4 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## BS5 ENSG00000206195.11 DUXAP8 lncRNA Example.WT
## BS6 ENSG00000206195.11 DUXAP8 lncRNA Example.KO
## BS7 ENSG00000206195.11 DUXAP8 lncRNA Example.WT
## ... ... ... ... ...
## BS5072 ENSG00000079974.18 RABL2B protein_coding Example.WT
## BS5073 ENSG00000079974.18 RABL2B protein_coding Example.WT
## BS5074 ENSG00000079974.18 RABL2B protein_coding Example.KO
## BS5075 ENSG00000079974.18 RABL2B protein_coding Example.WT
## BS5076 ENSG00000079974.18 RABL2B protein_coding Example.KO
## counts.bs.1_WT counts.bs.2_WT counts.bs.3_KO counts.bs.4_KO
## <numeric> <numeric> <numeric> <numeric>
## BS3 2 5 6 2
## BS4 3 1 6 2
## BS5 0 6 8 5
## BS6 1 1 5 0
## BS7 0 2 8 1
## ... ... ... ... ...
## BS5072 0 1 7 1
## BS5073 2 9 9 8
## BS5074 2 2 5 3
## BS5075 4 8 13 5
## BS5076 4 5 8 5
## counts.bg.1_WT counts.bg.2_WT counts.bg.3_KO counts.bg.4_KO
## <numeric> <numeric> <numeric> <numeric>
## BS3 11 71 74 22
## BS4 11 71 74 22
## BS5 11 71 74 22
## BS6 11 71 74 22
## BS7 11 71 74 22
## ... ... ... ... ...
## BS5072 11 46 71 27
## BS5073 11 46 71 27
## BS5074 11 46 71 27
## BS5075 11 46 71 27
## BS5076 11 46 71 27
## bs.baseMean bs.log2FoldChange bs.lfcSE bs.stat bs.pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## BS3 18.06576 0.00201385 0.814051 6.13075e-06 0.998024
## BS4 17.21228 0.49510922 1.263454 1.48423e-01 0.700047
## BS5 19.86013 1.00907610 1.040451 9.79359e-01 0.322357
## BS6 8.30315 0.92938363 1.450254 4.27508e-01 0.513214
## BS7 10.09457 1.87552092 1.359956 2.17815e+00 0.139982
## ... ... ... ... ... ...
## BS5072 8.45359 2.1633690 1.395628 2.9873966 0.083915
## BS5073 32.75253 -0.0813876 0.622872 0.0170241 0.896190
## BS5074 16.31571 0.2781398 0.929807 0.0907884 0.763177
## BS5075 34.75249 -0.1357579 0.602860 0.0505005 0.822195
## BS5076 29.04008 -0.1901605 0.684449 0.0765903 0.781972
## bs.padj bg.baseMean bg.log2FoldChange bg.lfcSE bg.stat bg.pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## BS3 0.999250 34.6044 -0.256956 0.464566 -0.55311 0.580188
## BS4 0.936368 34.6044 -0.256956 0.464566 -0.55311 0.580188
## BS5 0.811975 34.6044 -0.256956 0.464566 -0.55311 0.580188
## BS6 0.893694 34.6044 -0.256956 0.464566 -0.55311 0.580188
## BS7 0.621117 34.6044 -0.256956 0.464566 -0.55311 0.580188
## ... ... ... ... ... ... ...
## BS5072 0.511223 31.4992 0.251674 0.466657 0.539312 0.589672
## BS5073 0.983324 31.4992 0.251674 0.466657 0.539312 0.589672
## BS5074 0.950474 31.4992 0.251674 0.466657 0.539312 0.589672
## BS5075 0.966006 31.4992 0.251674 0.466657 0.539312 0.589672
## BS5076 0.956831 31.4992 0.251674 0.466657 0.539312 0.589672
## bg.padj
## <numeric>
## BS3 0.995469
## BS4 0.995469
## BS5 0.995469
## BS6 0.995469
## BS7 0.995469
## ... ...
## BS5072 0.995469
## BS5073 0.995469
## BS5074 0.995469
## BS5075 0.995469
## BS5076 0.995469
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Resulting fold-changes and P values are again added to the meta
columns of the binding sites (GRanges
format). As a quick
visualization of the results the functions plotBsMA()
and
plotBsVolcano()
can be used, each of which can produce a
plot either for the binding sites (what = "bs"
) or for the
background (what = "bg"
).
The following code chunks show how one can start fully from scratch with the binding site definition for the purpose of a downstream differential binding analysis. In the following example we will demonstrate the binding site definition for the conditions WT and KO.
At first one has to construct a BSFDataSet
for each
condition with the crosslinks form the peak-calling step (PureCLIP
output). This serves as input for the binding sites definition with
BSFind()
(for further details see the Performing the main
analysis section). As a result one gets a set of binding sites. This
has to be done for each condition (in our example these are WT
and KO)24.
############################
# Construct WT binding sites
############################
# Locate peak calling result
peaks.wt = "./myFolder/wt/pureclip.bed"
peaks.wt = import(con = peaks.wt, format = "BED", extraCols=c("additionalScores" = "character"))
peaks.wt = keepStandardChromosomes(peaks.wt, pruning.mode = "coarse")
# Locate bigwig files
clipFiles = "./myFolder/wt/bigwig/"
clipFiles = list.files(clipFiles, pattern = ".bw$", full.names = TRUE)
clipFiles = clipFiles[grepl("WT", clipFiles)]
clipFilesP = clipFiles[grep(clipFiles, pattern = "plus")]
clipFilesM = clipFiles[grep(clipFiles, pattern = "minus")]
# Organize clip data in dataframe
colData = data.frame(
name = "Example.WT", id = c(1:2),
condition = factor(c("WT", "WT"), levels = c("WT")),
clPlus = clipFilesP, clMinus = clipFilesM)
# Construct BSFDataSet object
bdsWT.initial = BSFDataSetFromBigWig(ranges = peaksInitial, meta = colData)
# Define binding sites
bdsWT = BSFind(bdsWT.initial, anno.genes = gns, anno.transcriptRegionList = trl)
############################
# Construct KO binding sites
############################
# Locate peak calling result
peaks.ko = "./myFolder/ko/pureclip.bed"
peaks.ko = import(con = peaks.ko, format = "BED", extraCols=c("additionalScores" = "character"))
peaks.ko = keepStandardChromosomes(peaks.ko, pruning.mode = "coarse")
# Locate bigwig files
clipFiles = "./myFolder/ko/bigwig/"
clipFiles = list.files(clipFiles, pattern = ".bw$", full.names = TRUE)
clipFiles = clipFiles[grepl("KO", clipFiles)]
clipFilesP = clipFiles[grep(clipFiles, pattern = "plus")]
clipFilesM = clipFiles[grep(clipFiles, pattern = "minus")]
# Organize clip data in dataframe
colData = data.frame(
name = "Example.KO", id = c(1:2),
condition = factor(c("KO", "KO"), levels = c("KO")),
clPlus = clipFilesP, clMinus = clipFilesM)
# Construct BSFDataSet object
bdsKO.initial = BSFDataSetFromBigWig(ranges = peaksInitial, meta = colData)
# Define binding sites
bdsKO = BSFind(bdsKO.initial, anno.genes = gns, anno.transcriptRegionList = trl)
Sometimes one already has a set of binding sites computed and just
wants to perform the differential binding analysis part for the specific
set at hand. In this case the pre-computed binding sites can be imported
alongside the bigwig files from both conditions into a single
BSFDataSet
. This set can then directly be used for
background generation (see section Count matrix
generation).
# Locate bigwig files - KO
clipFilesKO = "./myFolder/ko/bigwig/"
clipFilesKO = list.files(clipFilesKO, pattern = ".bw$", full.names = TRUE)
clipFilesKO = clipFilesKO[grepl("KO", clipFilesKO)]
# Locate bigwig files - WT
clipFilesWT = "./myFolder/wt/bigwig/"
clipFilesWT = list.files(clipFilesWT, pattern = ".bw$", full.names = TRUE)
clipFilesWT = clipFilesWT[grepl("WT", clipFilesWT)]
# Merge both conditions
clipFiles = c(clipFilesWT, clipFilesKO)
# Split by strand
clipFilesP = clipFiles[grep(clipFiles, pattern = "plus")]
clipFilesM = clipFiles[grep(clipFiles, pattern = "minus")]
# Organize clip data in dataframe
colData = data.frame(
name = "Example.KO", id = c(1:4),
condition = factor(c("WT", "WT", "KO", "KO"), levels = c("WT", "KO")),
clPlus = clipFilesP, clMinus = clipFilesM)
# Construct BSFDataSet object from pre-defined binding sites
# -> myBindingSites must be a GRanges object
bds.comb = BSFDataSetFromBigWig(ranges = myBindingSites, meta = colData)
In the case of a more complex experimental design that requires the comparison of more than two conditions, we recommend to compute as many pair-wise comparison as necessary. If one for example wants to incorporate an additional control condition (named CTRL) to the above setup, we would split the comparisons into three pair-wise designs (CTRL-WT, CTRL-KO and WT-KO). To ensure that all sets contain all binding sites found in either of the three conditions a merge over all three conditions has to be done prior to the testing.
Assuming that binding sites were defined for each condition, one would merge all binding sites doing the following:25.
Next all binding site ranges from this set can be extracted and used
to create a new BSFDataSet
object for each comparison that
should be computed (here we compare WT-vs-KO). This is very similar to
starting a differential binding analysis from an existing set of binding
sites (see chapter Starting
from existing binding sites).
# extract binding site ranges from complex merge
bindingSites.complex = getRanges(bds.complex)
# Locate bigwig files - KO
clipFilesKO = "./myFolder/ko/bigwig/"
clipFilesKO = list.files(clipFilesKO, pattern = ".bw$", full.names = TRUE)
clipFilesKO = clipFilesKO[grepl("KO", clipFilesKO)]
# Locate bigwig files - WT
clipFilesWT = "./myFolder/wt/bigwig/"
clipFilesWT = list.files(clipFilesWT, pattern = ".bw$", full.names = TRUE)
clipFilesWT = clipFilesWT[grepl("WT", clipFilesWT)]
# Merge both conditions
clipFiles = c(clipFilesWT, clipFilesKO)
# Split by strand
clipFilesP = clipFiles[grep(clipFiles, pattern = "plus")]
clipFilesM = clipFiles[grep(clipFiles, pattern = "minus")]
# Organize clip data in dataframe
colData = data.frame(
name = "Example.KO", id = c(1:2),
condition = factor(c("KO", "KO"), levels = c("KO")),
clPlus = clipFilesP, clMinus = clipFilesM)
# Construct BSFDataSet object from pre-defined binding sites
bds.comb = BSFDataSetFromBigWig(ranges = bindingSites.complex, meta = colData)
In the case that the binding site definition was performed using a
particularly high gene-wise level filter or if differential testing
should be performed only on a subset of all defined binding sites, we
recommend to use this information for the background generation by
making use of the blacklist
argument. This ensures that
counts from all ranges that were at some point in the analysis
identified as either a peak or a binding site will be removed from the
background-bin.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_1.1.4 forcats_1.0.0
## [3] BindingSiteFinder_2.5.2 ComplexHeatmap_2.23.0
## [5] tidyr_1.3.1 ggplot2_3.5.1
## [7] rtracklayer_1.67.0 GenomicAlignments_1.43.0
## [9] Rsamtools_2.23.1 Biostrings_2.75.3
## [11] XVector_0.47.1 SummarizedExperiment_1.37.0
## [13] Biobase_2.67.0 MatrixGenerics_1.19.0
## [15] matrixStats_1.4.1 GenomicRanges_1.59.1
## [17] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [19] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [21] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] BiocIO_1.17.1 bitops_1.0-9 filelock_1.0.3
## [4] tibble_3.2.1 polyclip_1.10-7 XML_3.99-0.17
## [7] rpart_4.1.23 lifecycle_1.0.4 httr2_1.0.7
## [10] doParallel_1.0.17 lattice_0.22-6 ensembldb_2.31.0
## [13] MASS_7.3-61 ggdist_3.3.2 backports_1.5.0
## [16] magrittr_2.0.3 Hmisc_5.2-1 sass_0.4.9
## [19] rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10
## [22] Gviz_1.51.0 DBI_1.2.3 buildtools_1.0.0
## [25] RColorBrewer_1.1-3 abind_1.4-8 zlibbioc_1.52.0
## [28] purrr_1.0.2 AnnotationFilter_1.31.0 biovizBase_1.55.0
## [31] RCurl_1.98-1.16 nnet_7.3-19 VariantAnnotation_1.53.0
## [34] tweenr_2.0.3 rappdirs_0.3.3 circlize_0.4.16
## [37] GenomeInfoDbData_1.2.13 maketools_1.3.1 svglite_2.1.3
## [40] codetools_0.2-20 DelayedArray_0.33.3 xml2_1.3.6
## [43] ggforce_0.4.2 tidyselect_1.2.1 shape_1.4.6.1
## [46] UCSC.utils_1.3.0 farver_2.1.2 viridis_0.6.5
## [49] BiocFileCache_2.15.0 base64enc_0.1-3 jsonlite_1.8.9
## [52] GetoptLong_1.0.5 Formula_1.2-5 iterators_1.0.14
## [55] systemfonts_1.1.0 foreach_1.5.2 tools_4.4.2
## [58] progress_1.2.3 Rcpp_1.0.13-1 glue_1.8.0
## [61] gridExtra_2.3 SparseArray_1.7.2 xfun_0.49
## [64] DESeq2_1.47.1 distributional_0.5.0 withr_3.0.2
## [67] BiocManager_1.30.25 fastmap_1.2.0 GGally_2.2.1
## [70] latticeExtra_0.6-30 digest_0.6.37 R6_2.5.1
## [73] colorspace_2.1-1 Cairo_1.6-2 jpeg_0.1-10
## [76] dichromat_2.0-0.1 biomaRt_2.63.0 RSQLite_2.3.9
## [79] data.table_1.16.4 prettyunits_1.2.0 httr_1.4.7
## [82] htmlwidgets_1.6.4 S4Arrays_1.7.1 ggstats_0.7.0
## [85] pkgconfig_2.0.3 gtable_0.3.6 blob_1.2.4
## [88] sys_3.4.3 htmltools_0.5.8.1 ProtGenerics_1.39.1
## [91] clue_0.3-66 scales_1.3.0 kableExtra_1.4.0
## [94] png_0.1-8 knitr_1.49 rstudioapi_0.17.1
## [97] rjson_0.2.23 checkmate_2.3.2 curl_6.0.1
## [100] cachem_1.1.0 GlobalOptions_0.1.2 stringr_1.5.1
## [103] parallel_4.4.2 vipor_0.4.7 foreign_0.8-87
## [106] AnnotationDbi_1.69.0 ggrastr_1.0.2 restfulr_0.0.15
## [109] pillar_1.10.0 vctrs_0.6.5 dbplyr_2.5.0
## [112] cluster_2.1.8 beeswarm_0.4.0 htmlTable_2.4.3
## [115] evaluate_1.0.1 GenomicFeatures_1.59.1 cli_3.6.3
## [118] locfit_1.5-9.10 compiler_4.4.2 rlang_1.1.4
## [121] crayon_1.5.3 labeling_0.4.3 interp_1.1-6
## [124] plyr_1.8.9 ggbeeswarm_0.7.2 stringi_1.8.4
## [127] viridisLite_0.4.2 deldir_2.0-4 BiocParallel_1.41.0
## [130] munsell_0.5.1 lazyeval_0.2.2 Matrix_1.7-1
## [133] BSgenome_1.75.0 hms_1.1.3 bit64_4.5.2
## [136] KEGGREST_1.47.0 memoise_2.0.1 bslib_0.8.0
## [139] bit_4.5.0.1
Note: We remove the column additionalScore
after the initial load, to clean our input↩︎
Note: We load a previously compiled
BSFDataSet
to save disc space.↩︎
Note: It is of particular importance that one uses a valid format for the geneID (ENSEMBL, UCSC, …) at this point.↩︎
Note: We again only load a previously compiled object at this point to save disc space.↩︎
Note: Just like for the genes, we load a pre-compiled list to save disc space.↩︎
Note: Here we set
est.subsetChromosom='chr22'
because we have only signal on
chromosome 22 to save disc space↩︎
Note: Not all binding sites can be assigned to genes or transcript regions, in these cases NA is introduced, so no binding site is lost.↩︎
Note: Since we did not specify bsSize
in
BSFind()
the binding site size is estimated
automatically.↩︎
Note: Since we use a very small data-set for demonstration purpose, this plot might look different with a real-sized data-set.↩︎
Note: If an overlaps exists, but gene types are identical options “frequency” and “hierarchy” will cause the gene that was seen first to be selected as representative. ↩︎
Note: Strategies that work without gene type
information are overlaps="remove"
and
overlaps="keep"
that respectively remove or keep all
overlapping cases.↩︎
Note: This time the transcript type is pulled from the
names of the GRangesList
defined in
anno.transcriptRegionList
. ↩︎
Note: Essentially any numerical score can be transferred from the initial crosslinks to the final binding site using this option.↩︎
Note: Notice that we do not subset on a specific chromosome, but rather include all ranges to avoid any local pitfalls. ↩︎
Note: We also set cutoff.geneWiseFilter=0
to turn off gene-wise filtering.↩︎
Note: We highly recommend using an adequate gene annotation since it improves the quality of the analysis.↩︎
Notes: Samples can also be combined by condition if multiple conditions are present.↩︎
Note: This requires that both data sets need to be handled strict separately, from pre-processing to peak-calling and binding site definition.↩︎
Note: Both data-sets were artificially created from the same condition for demonstration purposes.↩︎
Note: The background counts per replicate is identical for binding sites on the same hosting gene.↩︎
Note: Binding sites that do not overlap a distinct gene may end with NA values as background counts and can be removed in the next step.↩︎
Note: The plotBsBackgroundFilter()
function can be used to visualize each filtering option.↩︎
Note: Depending on the data one might have to use the
removeNA
argument to remove edge cases where no
background-bin
counts could be computed. It is worth to
manually look at such cases.↩︎
Note: It is desired that the same binding site size is
used for both conditions, which can require setting the
bsSize
argument explicitly when calling
BSFind()
.↩︎
Note: This also merges the meta data which we have to revert for the specific comparison that we want to compute.↩︎