fcScan: features cluster Scan

Version Info

R version: R version 4.4.2 (2024-10-31)
Bioconductor version: 3.20
Package version: 1.21.0

Introduction

Biological information encoded in the DNA sequence is often organized into independent modules or clusters. For instance, the eukaryotic system of expression relies on combinations of homotypic or heterotypic transcription factors (TFs) which play an important role in activating and repressing target genes. Identifying clusters of genomic features is a frequent task and includes application such as genomic regions enriched for the presence of a combination of transcription factor binding sites or enriched for the presence of mutations often referred as regions with increase mutational hotspots.

fcScan is designed to detect clusters of genomic features, based on user defined search criteria. Such criteria include:

  • A Path to BED/VCF files, a dataframe or a GRanges object as input (required)
  • A window size specifying the maximum cluster size allowed (required)
  • The combination of features required and/or to exclude (required)
  • The order of features required within identified clusters (optional)
  • The allowed distance or overlap between identified clusters (optional)
  • The strand(s) to build the clusters on (optional)
  • The seqname(s) to build the clusters on (optional)
  • The sites orientation to build the clusters on (optional)
  • The distance between site in the cluster (optional)

fcScan is designed to handle large number of genomic features (i.e data generated by High Throughput Sequencing).

Dependencies

fcScan depends on the following packages:

  • stats
  • plyr
  • utils
  • rtracklayer
  • SummarizedExperiment
  • IRanges
  • VariantAnnotation
  • GenomicRanges

Overview

Currently, fcScan has one main function, the getCluster function. Additional functionality will be added for future releases including cross-species identification of orthologous clusters.

Input arguments for getCluster

Input data (x)

The input for getCluster is given through the parameter x. This function accepts input in data frame, GRanges as well as vector of files in BED and VCF (compressed or not) formats. BED and VCF files are loaded by packages rtracklayer and VariantAnnotation respectively. There is no limit to the number of files the user can define.

When input is data frame or GRanges objects are given, they should contain the following “named” 5 columns:

  • seqnames : Contains the seqname name
  • start: Contains the start coordinates
  • end: Contains the end coordinates
  • strand: Contains the strand relative to each site
  • site : Contains the category name relative to genomic site
seqnames start end strand site
chr1 10 15 + a
chr1 17 20 + b
chr1 25 30 + b
chr1 27 35 + a
chr1 32 40 + c
chr1 41 48 + b

The seqnames, start and end columns define the name of the chromosome, the starting and ending position of the feature in the chromosome, respectively. The strand column defines the strand. The site column contains the ID of the site and will be used for clustering. The start and end columns are numeric while the remaining columns are characters.

Note: when input is data frame, the data is considered zero-based as in BED format.

Window Size (w)

Window size is set using w. It defines the maximum size of the clusters.

Condition (c)

The clustering condition c defines the number and name of genomic features to search for based on features names defined in the site column c = c("a" = 1, "b" = 2)
This searches for clusters with 3 sites, One a site and two b sites.

Another way of writing the condition, only if input is a vector to file paths, is the following x = ("a.bed", "b.bed"), c = c(1,2)
Given 2 files, a.bed and b.bed, this condition states that the user is looking for clusters made from 1 “a” site and 2 “b” sites. In this case, the order of sites defined in c is relative to the order of files.

When input is a data frame or GRanges object (instead of files), the user needs to explicitly define the site names along with the desired number relative to each site. For instance, giving the condition as c = c(1,2) for a data frame or GRanges is not allowed.

x = dataframe, c = c("a" = 1, "b" = 2) where a and b are valid site names in site column in the dataframe/GRanges

Users can exclude clusters containing a specific site(s). This is done by specifying zero 0 in the condition as c = c("a" = 1, "b" = 2, "c" = 0). In this case, any cluster(s) containing c site will be excluded even if it has 1 a and 2 b sites.

Seqnames (seqnames) and Strand (s).

By default, clustering will be performed on both strands and on all seqnames unless specified by the user using the s and seqnames arguments to limit the search on a specific strand and/or seqname.

Users can choose to cluster on one specific seqname (seqnames = "chr1"), or on several seqnames (seqnames = c("chr1","chr3","chr4"))
(Default for seqnames is NULL) meaning that clustering on all seqnames will be performed.

For s, the values allowed are:

  • + : Build clusters on positive strand
  • - : Build clusters on negative strand
  • * : Clusters are not strand specific
    (Default is set to *)

Overlap (overlap)

The gap/overlap between adjacent clusters, and not sites, can be controled using the overlap option. When overlap is a positive integer, adjacent clusters will be separated by a minimum of the given value. When overlap is negative, adjacent clusters will be allowed to overlap by a maximum of the given value. (Default is set to 0)

Greedy vs Non-Greedy (greedy)

greedy allows the user to control the number of genomic features found in clusters.

When greedy = FALSE, getCluster will build clusters having the required window size and will label TRUE the ones that contain the exact number of sites provided in the condition argument. Clusters having the user defined window size but not satisfying the condition will be labelled as FALSE.

When greedy = TRUE, additional sites defined in condition will be added to the cluster as long as the cluster size is below the defined window size. (Default is set to FALSE)

Order (order)

The order option defines the desired order of the sites within identified clusters. For instance, order = c("a","b","b") will search for clusters containing 1 a and 2 b sites and checks if they are in the specified order. Clusters with 1 a and 2 b sites that do not contain the specified order will be rejected. When greedy is set to TRUE, order can be satisfied if a subcluster contains the desired order. For example if a cluster has a, a, b, b, b sites, it satisfies the required order (a, b, b) and therefore will be considered as a correct cluster . (Default is set to NULL)

Sites Orientation (site_orientation)

The site_orientation option defines the orientation or strandness of sites in the found clusters. This option cannot be used if order is NULL. site_orientation should be specified for each site in order. For instance, if order = c("a","b","b"), we can define site_orientation for each site respectively as follow: site_orientation = c("+","-","-"). The cluster will be correct if it satisfies the required order and sites orientation. (Default is set to NULL)

Distance between sites in clusters (site_overlap)

The site_overlap option defines the maximum or minimum distance allowed between sites in the cluster. When site_overlap is a positive integer, sites within a cluster should have minimum distance and above, then the cluster is considered TRUE cluster. When site_overlap is a negative integer, sites within a cluster should have max distance and below, then the cluster is considered TRUE cluster. When site_overlap is zero, distance between sites is not taken into consideration. (Default is set to 0)

Clustering option (cluster_by)

The cluster_by option allows the usage of different ways in scanning for sites. Using this option, scanning does not ‘jump’ clusters found. cluster_by can be startsEnds, endsStarts, starts, ends, or middles. If we choose startsEnds, the clustering begins at the start of the first site and it ends at the end of the last site within the window size. If we choose endsStarts, scanning begins at the end of the first site and it ends at the start of the last site within the window size. if we choose starts, the clustering begins at the start of the first site and it ends at the start of the last site within the window size. if we choose ends, the clustering begins at the end of the first site and it ends at the end of the last site within the window size. Also, if we choose middles, the clustering begins at the middle of the first site and it ends at the middle of the last site within the window size. (Default is set to startsEnds)

Overlapping Clusters option (allow_clusters_overlap)

The allow_clusters_overlap allows the user whether to include overlapping clusters or not. (Default is set to FALSE)

Include partially overlapping sites (include_partial_sites)

This option can be only used in case where the “cluster_by” choosen is “startsEnds” or “ends”, otherwise it is ignored. If it is set to TRUE, partially overlapping sites within window size (at the end of the cluster) are allowed. However, if it is set to FALSE, partially overlapping sites are not included in the cluster found. (Default is set to FALSE)

Control addition of partially overlapping sites (partial_overlap_percentage)

This option allows the user to choose which partially overlapping site are included. The amount of overlap between the last site found and the window size controls the addition or removal of those sites. For example, if partial_overlap_percentage is set to 0.6, this means that the window size should overlap this site by 60% of its size. However, if it fails, this site won’t be included in the cluster found. This option can be only used in case where the “cluster_by” chosen is “startsEnds” or “ends”, otherwise it is ignored. (Default is set to NULL)

Multi-threading (threads)

This option allows to distribute the run over multiple threads. It is recommended to use multiple threads when the input dataset is large. When threads is set to 0, getCluster reserves the adequate number of threads and performs the run. This will dramatically increase the speed of getCluster. The user can choose the number of threads between 1 and the maximum number of available threads. (Default is set to 1)

Verbose (verbose)

The verbose option allows the printing of additional messages and the list of clusters that failed for lack of correct combination of sites. This option is used mainly for debugging purposes. (Default is set to FALSE)

Output of getCluster

The output of getCluster is a GRanges object with fields:

  • seqnames: The seqname on which a cluster is found
  • ranges: The ranges of the cluster
  • strand: The strand of the cluster, if any
  • sites: The combination of sites that define the cluster
  • isCluster: A logical indicating if the cluster is TRUE or FALSE
  • status: Describes the reason behind the rejection of a cluster

The algorithm returns all clusters containing the correct count of sites/features, unless verbose is set to TRUE. If the combination, overlap and order options are satisfied, the cluster is considered a TRUE cluster.

The status of a cluster can be either PASS, excludedSites, orderFail, siteOrientation or siteOverlap PASS is a cluster that satisfied the desired combination, overlap, order and sites orientation. orderFail is a cluster that had the required combination but did not satisfy the required order of sites. excludedSites is a cluster that had the required combination and order but it has one or more sites to exclude. siteOrientation is a cluster that had the required combination and order but it has one or more sites with different orientation than requested. siteOverlap is a cluster that meets all the requirements but the distance between its sites is not respected.

NOTE: If the user is using greedy = FALSE and order contains values more than in the condition parameter (c), an error will be raised. However, if greedy = TRUE, then using order with more values than the condition parameter is allowed since the cluster may contain more sites than the required c condition as long as the window size is satisfied.

Example on Clustering

Example using getCluster:

getCluster looks for desired genomic regions of interest like transcription factor binding sites, within a window size and specific condition.

This function accepts a data frame and GRanges object. getCluster also accepts BED or VCF (or mix of both) files as input. The output of getCluster is a GRanges object that contains the genomic coordinates (seqnames, ranges and strand) and three metadata columns:

  1. sites: contains clusters of sites that conforms with the condition c specified in getCluster.

  2. isCluster: TRUE if the cluster found conform with the condition c and the order (if indicated in condition) and FALSE if the cluster fails to conform with the condition or order.

  3. status: PASS if isCluster equals TRUE. However, if isCluster is FALSE, status shows why the found cluster is not a TRUE cluster. If the order of sites is not respected in the found cluster, status would return OrderFail. in Addition, if the cluster found contains non desired sites, it returns excludedSites. Moreover, if the sites orientation is not respected in found cluster, status would return siteOrientation. Finally , if the distance between sites in the cluster found does not conform to the condition specified by the user, status would return siteOverlap.

In this example, we ask getCluster to look for clusters that contains one site “s1”, one site “s2” and zero “s3” sites. In addition, we requested clusters to have sites in the order s1,s2 and having orientation “+”,“+” respectively with minimum distance between sites of 2bp.

x1 = data.frame(seqnames = rep("chr1", times = 17),
    start = c(1,10,17,25,27,32,41,47,60,70,87,94,99,107,113,121,132),
    end = c(8,15,20,30,35,40,48,55,68,75,93,100,105,113,120,130,135),
    strand = c("+","+","+","+","+","+","+","+","+",
        "+","+","+","+","+","+","+","-"),
    site = c("s3","s1","s2","s2","s1","s2","s1","s1","s2","s1","s2",
        "s2","s1","s2","s1","s1","s2"))

clusters = getCluster(x1, w = 20, c = c("s1" = 1, "s2" = 1, "s3" = 0), 
    greedy = TRUE, order = c("s1","s2"), site_orientation=c("+","+"), 
    site_overlap = 2, verbose = TRUE)
#> 17  entries loaded 
#> Running getCluster using 1 threads
#> Time difference of 0.2140579 secs

clusters
#> GRanges object with 9 ranges and 3 metadata columns:
#>       seqnames    ranges strand |       sites isCluster          status
#>          <Rle> <IRanges>  <Rle> | <character> <logical>     <character>
#>   [1]     chr1      2-20      * |    s3,s1,s2     FALSE   excludedSites
#>   [2]     chr1     11-30      * |    s1,s2,s2      TRUE            PASS
#>   [3]     chr1     33-48      * |       s2,s1     FALSE       orderFail
#>   [4]     chr1     48-68      * |       s1,s2      TRUE            PASS
#>   [5]     chr1    88-105      * |    s2,s2,s1     FALSE       orderFail
#>   [6]     chr1    95-113      * |    s2,s1,s2     FALSE     siteOverlap
#>   [7]     chr1   100-120      * |    s1,s2,s1     FALSE     siteOverlap
#>   [8]     chr1   108-120      * |       s2,s1     FALSE       orderFail
#>   [9]     chr1   122-135      * |       s1,s2     FALSE siteOrientation
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Another example but using GRanges as input: in this example, we ask getCluster to look for clusters that contains one site s1 and two sites s2 within a window size of 25 bp. Also, we requested clusters to be searched as + strand.

suppressMessages(library(GenomicRanges))

x = GRanges(
    seqnames = Rle("chr1", 16),
    ranges = IRanges(start = c(10L,17L,25L,27L,32L,41L,47L,
        60L,70L,87L,94L,99L,107L,113L,121L,132L),
    end = c(15L,20L,30L,35L,40L,48L,55L,68L,75L,93L,100L,105L,
        113L,120L,130L,135L)),
    strand = Rle("+",16),
    site = c("s1","s2","s2","s1","s2","s1","s1","s2",
        "s1","s2","s2","s1","s2","s1","s1","s2"))
    
clusters = getCluster(x, w = 25, c = c("s1"=1,"s2"=2), s = "+")
#> 16  entries loaded 
#> Running getCluster using 1 threads
#> Time difference of 0.0783329 secs

clusters
#> GRanges object with 2 ranges and 3 metadata columns:
#>       seqnames    ranges strand |       sites isCluster      status
#>          <Rle> <IRanges>  <Rle> | <character> <logical> <character>
#>   [1]     chr1     10-30      + |    s1,s2,s2      TRUE        PASS
#>   [2]     chr1    87-105      + |    s2,s2,s1      TRUE        PASS
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Session Info

sessionInfo() 
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2  IRanges_2.41.2      
#> [4] S4Vectors_0.45.2     BiocGenerics_0.53.3  generics_0.1.3      
#> [7] fcScan_1.21.0        BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.47.0             SummarizedExperiment_1.37.0
#>  [3] rjson_0.2.23                xfun_0.49                  
#>  [5] bslib_0.8.0                 Biobase_2.67.0             
#>  [7] lattice_0.22-6              vctrs_0.6.5                
#>  [9] tools_4.4.2                 bitops_1.0-9               
#> [11] curl_6.0.1                  parallel_4.4.2             
#> [13] AnnotationDbi_1.69.0        RSQLite_2.3.9              
#> [15] blob_1.2.4                  Matrix_1.7-1               
#> [17] BSgenome_1.75.0             lifecycle_1.0.4            
#> [19] GenomeInfoDbData_1.2.13     compiler_4.4.2             
#> [21] Rsamtools_2.23.1            Biostrings_2.75.3          
#> [23] codetools_0.2-20            htmltools_0.5.8.1          
#> [25] sys_3.4.3                   buildtools_1.0.0           
#> [27] sass_0.4.9                  RCurl_1.98-1.16            
#> [29] yaml_2.3.10                 crayon_1.5.3               
#> [31] jquerylib_0.1.4             BiocParallel_1.41.0        
#> [33] cachem_1.1.0                DelayedArray_0.33.3        
#> [35] iterators_1.0.14            foreach_1.5.2              
#> [37] abind_1.4-8                 digest_0.6.37              
#> [39] restfulr_0.0.15             VariantAnnotation_1.53.0   
#> [41] maketools_1.3.1             fastmap_1.2.0              
#> [43] grid_4.4.2                  cli_3.6.3                  
#> [45] SparseArray_1.7.2           S4Arrays_1.7.1             
#> [47] GenomicFeatures_1.59.1      XML_3.99-0.17              
#> [49] UCSC.utils_1.3.0            bit64_4.5.2                
#> [51] rmarkdown_2.29              XVector_0.47.1             
#> [53] httr_1.4.7                  matrixStats_1.4.1          
#> [55] bit_4.5.0.1                 png_0.1-8                  
#> [57] memoise_2.0.1               evaluate_1.0.1             
#> [59] knitr_1.49                  BiocIO_1.17.1              
#> [61] doParallel_1.0.17           rtracklayer_1.67.0         
#> [63] rlang_1.1.4                 Rcpp_1.0.13-1              
#> [65] DBI_1.2.3                   BiocManager_1.30.25        
#> [67] jsonlite_1.8.9              R6_2.5.1                   
#> [69] plyr_1.8.9                  MatrixGenerics_1.19.0      
#> [71] GenomicAlignments_1.43.0    zlibbioc_1.52.0