R version: R version 4.4.2 (2024-10-31)
Bioconductor version: 3.20
Package
version: 1.21.0
Biological information encoded in the DNA sequence is often organized into independent modules or clusters. For instance, the eukaryotic system of expression relies on combinations of homotypic or heterotypic transcription factors (TFs) which play an important role in activating and repressing target genes. Identifying clusters of genomic features is a frequent task and includes application such as genomic regions enriched for the presence of a combination of transcription factor binding sites or enriched for the presence of mutations often referred as regions with increase mutational hotspots.
fcScan is designed to detect clusters of genomic features, based on user defined search criteria. Such criteria include:
fcScan is designed to handle large number of genomic features (i.e data generated by High Throughput Sequencing).
fcScan depends on the following packages:
Currently, fcScan has one main function, the getCluster
function. Additional functionality will be added for future releases
including cross-species identification of orthologous clusters.
The input for getCluster
is given through the parameter
x
. This function accepts input in data frame, GRanges as
well as vector of files in BED and VCF (compressed or not) formats. BED
and VCF files are loaded by packages rtracklayer
and
VariantAnnotation
respectively. There is no limit to the
number of files the user can define.
When input is data frame or GRanges objects are given, they should contain the following “named” 5 columns:
seqnames | start | end | strand | site |
---|---|---|---|---|
chr1 | 10 | 15 | + | a |
chr1 | 17 | 20 | + | b |
chr1 | 25 | 30 | + | b |
chr1 | 27 | 35 | + | a |
chr1 | 32 | 40 | + | c |
chr1 | 41 | 48 | + | b |
The seqnames, start and end columns define the name of the chromosome, the starting and ending position of the feature in the chromosome, respectively. The strand column defines the strand. The site column contains the ID of the site and will be used for clustering. The start and end columns are numeric while the remaining columns are characters.
Note: when input is data frame, the data is considered
zero-based
as in BED format.
Window size is set using w
. It defines the
maximum
size of the clusters.
The clustering condition c
defines the number and name
of genomic features to search for based on features names defined in the
site
column c = c("a" = 1, "b" = 2)
This searches for clusters with 3 sites, One a site and two
b sites.
Another way of writing the condition, only if input is a vector to
file paths, is the following
x = ("a.bed", "b.bed"), c = c(1,2)
Given 2 files, a.bed and b.bed, this condition states
that the user is looking for clusters made from 1 “a” site and 2 “b”
sites. In this case, the order of sites defined in c
is
relative to the order of files.
When input is a data frame or GRanges object (instead of files), the
user needs to explicitly define the site names along
with the desired number relative to each site. For instance, giving the
condition as c = c(1,2)
for a data frame or GRanges is not
allowed.
x = dataframe, c = c("a" = 1, "b" = 2)
where
a
and b
are valid site names in
site
column in the dataframe/GRanges
Users can exclude clusters containing a specific site(s). This is
done by specifying zero 0
in the condition as
c = c("a" = 1, "b" = 2, "c" = 0)
. In this case, any
cluster(s) containing c
site will be excluded even if it
has 1 a
and 2 b
sites.
By default, clustering will be performed on both strands and on all
seqnames unless specified by the user using the s
and
seqnames
arguments to limit the search on a specific strand
and/or seqname.
Users can choose to cluster on one specific seqname
(seqnames = "chr1")
, or on several seqnames
(seqnames = c("chr1","chr3","chr4"))
(Default for seqnames
is NULL) meaning
that clustering on all seqnames will be performed.
For s
, the values allowed are:
The gap/overlap between adjacent clusters, and not sites, can be
controled using the overlap
option. When
overlap
is a positive integer, adjacent clusters will be
separated by a minimum of the given value. When overlap
is
negative, adjacent clusters will be allowed to overlap by a maximum of
the given value. (Default is set to 0)
greedy
allows the user to control the number of genomic
features found in clusters.
When greedy = FALSE
, getCluster
will build
clusters having the required window size and will label TRUE
the ones that contain the exact number of sites
provided in the condition argument. Clusters having the user defined
window size but not satisfying the condition will be labelled as
FALSE.
When greedy = TRUE
, additional sites defined in
condition will be added to the cluster as long as the cluster size is
below the defined window size. (Default is set to
FALSE)
The order
option defines the desired order of the sites
within identified clusters. For instance,
order = c("a","b","b")
will search for clusters containing
1 a
and 2 b
sites and checks if they are in
the specified order. Clusters with 1 a
and 2 b
sites that do not contain the specified order will be rejected. When
greedy is set to TRUE
, order can be satisfied if a
subcluster contains the desired order. For example if a cluster has
a, a, b, b, b
sites, it satisfies the required order (a, b,
b) and therefore will be considered as a correct cluster . (Default is
set to NULL)
The site_orientation
option defines the orientation or
strandness of sites in the found clusters. This option cannot be used if
order
is NULL
. site_orientation
should be specified for each site in order
. For instance,
if order = c("a","b","b")
, we can define
site_orientation
for each site respectively as follow:
site_orientation = c("+","-","-")
. The cluster will be
correct if it satisfies the required order and sites orientation.
(Default is set to NULL)
The site_overlap
option defines the maximum or minimum
distance allowed between sites in the cluster. When
site_overlap
is a positive integer, sites within a cluster
should have minimum distance and above
, then the cluster is
considered TRUE
cluster. When site_overlap
is
a negative integer, sites within a cluster should have
max distance and below
, then the cluster is considered
TRUE
cluster. When site_overlap
is zero,
distance between sites is not taken into consideration. (Default is set
to 0)
The cluster_by
option allows the usage of different ways
in scanning for sites. Using this option, scanning does not ‘jump’
clusters found. cluster_by
can be startsEnds
,
endsStarts
, starts
, ends
, or
middles
. If we choose startsEnds
, the
clustering begins at the start of the first site and it ends at the end
of the last site within the window size. If we choose
endsStarts
, scanning begins at the end of the first site
and it ends at the start of the last site within the window size. if we
choose starts
, the clustering begins at the start of the
first site and it ends at the start of the last site within the window
size. if we choose ends
, the clustering begins at the end
of the first site and it ends at the end of the last site within the
window size. Also, if we choose middles
, the clustering
begins at the middle of the first site and it ends at the middle of the
last site within the window size. (Default is set to
startsEnds)
The allow_clusters_overlap
allows the user whether to
include overlapping clusters or not. (Default is set to FALSE)
This option can be only used in case where the “cluster_by” choosen
is “startsEnds” or “ends”, otherwise it is ignored. If it is set to
TRUE
, partially overlapping sites within window size (at
the end of the cluster) are allowed. However, if it is set to
FALSE
, partially overlapping sites are not included in the
cluster found. (Default is set to FALSE)
This option allows the user to choose which partially overlapping
site are included. The amount of overlap between the last site found and
the window size controls the addition or removal of those sites. For
example, if partial_overlap_percentage
is set to 0.6, this
means that the window size should overlap this site by 60% of its size.
However, if it fails, this site won’t be included in the cluster found.
This option can be only used in case where the “cluster_by” chosen is
“startsEnds” or “ends”, otherwise it is ignored. (Default is set to
NULL)
This option allows to distribute the run over multiple threads. It is
recommended to use multiple threads when the input dataset is large.
When threads
is set to 0, getCluster
reserves
the adequate number of threads and performs the run. This will
dramatically increase the speed of getCluster
. The user can
choose the number of threads between 1 and the maximum number of
available threads. (Default is set to 1)
The verbose
option allows the printing of additional
messages and the list of clusters that failed for lack of correct
combination of sites. This option is used mainly for debugging purposes.
(Default is set to FALSE)
The output of getCluster
is a GRanges object with
fields:
The algorithm returns all clusters containing the correct count of
sites/features, unless verbose is set to TRUE
. If the
combination, overlap and order options are satisfied, the cluster is
considered a TRUE
cluster.
The status of a cluster can be either PASS, excludedSites,
orderFail, siteOrientation or siteOverlap PASS
is a cluster
that satisfied the desired combination, overlap, order and sites
orientation. orderFail
is a cluster that had the required
combination but did not satisfy the required order of sites.
excludedSites
is a cluster that had the required
combination and order but it has one or more sites to exclude.
siteOrientation
is a cluster that had the required
combination and order but it has one or more sites with different
orientation than requested. siteOverlap
is a cluster that
meets all the requirements but the distance between its sites is not
respected.
NOTE: If the user is using greedy = FALSE
and
order
contains values more than in the condition parameter
(c
), an error will be raised. However, if
greedy = TRUE
, then using order
with more
values than the condition parameter is allowed since the cluster may
contain more sites than the required c
condition as long as
the window size is satisfied.
Example using getCluster
:
getCluster
looks for desired genomic regions of interest
like transcription factor binding sites, within a window size and
specific condition.
This function accepts a data frame and GRanges object. getCluster
also accepts BED or VCF (or mix of both) files as input. The output of
getCluster
is a GRanges object that contains the genomic
coordinates (seqnames, ranges and strand) and three metadata
columns:
sites: contains clusters of sites that conforms with the
condition c
specified in getCluster
.
isCluster: TRUE
if the cluster found conform with
the condition c
and the order
(if indicated in
condition) and FALSE
if the cluster fails to conform with
the condition or order.
status: PASS
if isCluster
equals
TRUE
. However, if isCluster
is
FALSE
, status shows why the found cluster is not a
TRUE
cluster. If the order of sites is not respected in the
found cluster, status would return OrderFail
. in Addition,
if the cluster found contains non desired sites, it returns
excludedSites
. Moreover, if the sites orientation is not
respected in found cluster, status would return
siteOrientation
. Finally , if the distance between sites in
the cluster found does not conform to the condition specified by the
user, status would return siteOverlap
.
In this example, we ask getCluster
to look for clusters
that contains one site “s1”, one site “s2” and zero “s3” sites. In
addition, we requested clusters to have sites in the order s1,s2 and
having orientation “+”,“+” respectively with minimum distance between
sites of 2bp.
x1 = data.frame(seqnames = rep("chr1", times = 17),
start = c(1,10,17,25,27,32,41,47,60,70,87,94,99,107,113,121,132),
end = c(8,15,20,30,35,40,48,55,68,75,93,100,105,113,120,130,135),
strand = c("+","+","+","+","+","+","+","+","+",
"+","+","+","+","+","+","+","-"),
site = c("s3","s1","s2","s2","s1","s2","s1","s1","s2","s1","s2",
"s2","s1","s2","s1","s1","s2"))
clusters = getCluster(x1, w = 20, c = c("s1" = 1, "s2" = 1, "s3" = 0),
greedy = TRUE, order = c("s1","s2"), site_orientation=c("+","+"),
site_overlap = 2, verbose = TRUE)
#> 17 entries loaded
#> Running getCluster using 1 threads
#> Time difference of 0.2140579 secs
clusters
#> GRanges object with 9 ranges and 3 metadata columns:
#> seqnames ranges strand | sites isCluster status
#> <Rle> <IRanges> <Rle> | <character> <logical> <character>
#> [1] chr1 2-20 * | s3,s1,s2 FALSE excludedSites
#> [2] chr1 11-30 * | s1,s2,s2 TRUE PASS
#> [3] chr1 33-48 * | s2,s1 FALSE orderFail
#> [4] chr1 48-68 * | s1,s2 TRUE PASS
#> [5] chr1 88-105 * | s2,s2,s1 FALSE orderFail
#> [6] chr1 95-113 * | s2,s1,s2 FALSE siteOverlap
#> [7] chr1 100-120 * | s1,s2,s1 FALSE siteOverlap
#> [8] chr1 108-120 * | s2,s1 FALSE orderFail
#> [9] chr1 122-135 * | s1,s2 FALSE siteOrientation
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Another example but using GRanges as input: in this example, we ask
getCluster
to look for clusters that contains one site
s1
and two sites s2
within a window size of 25
bp. Also, we requested clusters to be searched as +
strand.
suppressMessages(library(GenomicRanges))
x = GRanges(
seqnames = Rle("chr1", 16),
ranges = IRanges(start = c(10L,17L,25L,27L,32L,41L,47L,
60L,70L,87L,94L,99L,107L,113L,121L,132L),
end = c(15L,20L,30L,35L,40L,48L,55L,68L,75L,93L,100L,105L,
113L,120L,130L,135L)),
strand = Rle("+",16),
site = c("s1","s2","s2","s1","s2","s1","s1","s2",
"s1","s2","s2","s1","s2","s1","s1","s2"))
clusters = getCluster(x, w = 25, c = c("s1"=1,"s2"=2), s = "+")
#> 16 entries loaded
#> Running getCluster using 1 threads
#> Time difference of 0.0783329 secs
clusters
#> GRanges object with 2 ranges and 3 metadata columns:
#> seqnames ranges strand | sites isCluster status
#> <Rle> <IRanges> <Rle> | <character> <logical> <character>
#> [1] chr1 10-30 + | s1,s2,s2 TRUE PASS
#> [2] chr1 87-105 + | s2,s2,s1 TRUE PASS
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2 IRanges_2.41.2
#> [4] S4Vectors_0.45.2 BiocGenerics_0.53.3 generics_0.1.3
#> [7] fcScan_1.21.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 SummarizedExperiment_1.37.0
#> [3] rjson_0.2.23 xfun_0.49
#> [5] bslib_0.8.0 Biobase_2.67.0
#> [7] lattice_0.22-6 vctrs_0.6.5
#> [9] tools_4.4.2 bitops_1.0-9
#> [11] curl_6.0.1 parallel_4.4.2
#> [13] AnnotationDbi_1.69.0 RSQLite_2.3.9
#> [15] blob_1.2.4 Matrix_1.7-1
#> [17] BSgenome_1.75.0 lifecycle_1.0.4
#> [19] GenomeInfoDbData_1.2.13 compiler_4.4.2
#> [21] Rsamtools_2.23.1 Biostrings_2.75.3
#> [23] codetools_0.2-20 htmltools_0.5.8.1
#> [25] sys_3.4.3 buildtools_1.0.0
#> [27] sass_0.4.9 RCurl_1.98-1.16
#> [29] yaml_2.3.10 crayon_1.5.3
#> [31] jquerylib_0.1.4 BiocParallel_1.41.0
#> [33] cachem_1.1.0 DelayedArray_0.33.3
#> [35] iterators_1.0.14 foreach_1.5.2
#> [37] abind_1.4-8 digest_0.6.37
#> [39] restfulr_0.0.15 VariantAnnotation_1.53.0
#> [41] maketools_1.3.1 fastmap_1.2.0
#> [43] grid_4.4.2 cli_3.6.3
#> [45] SparseArray_1.7.2 S4Arrays_1.7.1
#> [47] GenomicFeatures_1.59.1 XML_3.99-0.17
#> [49] UCSC.utils_1.3.0 bit64_4.5.2
#> [51] rmarkdown_2.29 XVector_0.47.1
#> [53] httr_1.4.7 matrixStats_1.4.1
#> [55] bit_4.5.0.1 png_0.1-8
#> [57] memoise_2.0.1 evaluate_1.0.1
#> [59] knitr_1.49 BiocIO_1.17.1
#> [61] doParallel_1.0.17 rtracklayer_1.67.0
#> [63] rlang_1.1.4 Rcpp_1.0.13-1
#> [65] DBI_1.2.3 BiocManager_1.30.25
#> [67] jsonlite_1.8.9 R6_2.5.1
#> [69] plyr_1.8.9 MatrixGenerics_1.19.0
#> [71] GenomicAlignments_1.43.0 zlibbioc_1.52.0