nearBynding Vignette

Introduction

nearBynding is a package designed to discern annotated RNA structures proximal to protein binding. nearBynding allows users to annotate RNA structure contexts via CapR or input their own annotations in BED/bedGraph format. It accomodates protein binding information from CLIP-seq experiments as either aligned CLIP-seq reads or peak-called intervals. This vignette will walk you through:

* The external software necessary to support this pipeline
* Creating a concatenated transcriptome
* Extracting and folding RNA from the transcriptome via CapR
* Mapping protein-binding and RNA structure information onto a transcriptome
* Running StereoGene to identify RNA structure proximal to protein binding
* Visualizing binding results
* Determining the distance between binding contexts

Before running any of these examples, it is highly recommended that the user establishes a new empty directory and uses setwd() to make certain that all outputs are deposited there. Some of the functions below create multiple output files.

Installation

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("nearBynding")

External Software Dependencies

Add all dependency directories to your PATH after installation.

bedtools

bedtools is available for installation here.

Installation instructions will vary by operating system.

CapR

Download the zip file from the github repository, unzip the file, and move it to a directory where you want to permanently store the function.

In the command line, access the folder where the unzipped file is stored.

cd CapR-master
make
./CapR

If installation is successful, the final line will output

Error: The number of argument is invalid.

StereoGene

Download the zip file from the github repository, unzip the file, and move it to a directory where you want to permanently store the function.

In the command line, access the folder where the unzipped file is stored.

cd stereogene-master
cd src
make
./stereogene -h

If installation is successful, the final line will output a menu of argument options.

1. Concatenate the Transcriptome

Although nearBynding is designed to support whole-genome analyses, we will exclusively be evaluating protein-coding genes of chromosomes 4 and 5 through this vignette.

First, a list of transcripts must be identified for analysis. A recommended criterium for selection is that the transcripts be expressed in the cell type used for CLIP-seq experiments. For this vignette, 50 random transcripts have been selected, and the 3’UTR structure of each transcript will be used for analysis, though any region of a transcript such as 5’UTR or CDS could be assessed instead.

This step creates a chain file that will be used to map the selected regions of transcripts end-to-end, excluding the intergenic regions and undesired transcripts that comprise the majority of the genome.

# load transcript list
load(system.file("extdata/transcript_list.Rda", package="nearBynding"))
# get GTF file
gtf<-system.file("extdata/Homo_sapiens.GRCh38.chr4&5.gtf", 
                    package="nearBynding")

GenomeMappingToChainFile(genome_gtf = gtf,
                        out_chain_name = "test.chain",
                        RNA_fragment = "three_prime_utr",
                        transcript_list = transcript_list,
                        alignment = "hg38")

A file containing the sizes of each concatenated chromosome in the chain file will be required for downstream analysis.

getChainChrSize(chain = "test.chain", 
                out_chr = "chr4and5_3UTR.size")

2. Fold RNA via CapR

In order to fold the 3’UTRs, the sequences must first be extracted. This is achieved with the following code:

ExtractTranscriptomeSequence(transcript_list = transcript_list,
                    ref_genome = "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
                    genome_gtf = gtf,
                    RNA_fragment = "three_prime_utr",
                    exome_prefix = "chr4and5_3UTR")

The reference genome can be found through Ensembl, but for users who do not want to download that 3.2GB file for the sake of this vignette, the FASTA output of the above code is available via:

chr4and5_3UTR.fa <- system.file("extdata/chr4and5_3UTR.fa", 
                                package="nearBynding")

These sequences can then be submitted to CapR for folding.

runCapR(in_file = chr4and5_3UTR.fa)

Warning: This step can take hours or even days depending on how many transcripts are submitted, how long the RNA fragments are, and the maximum distance between base-paired nucleotides submitted to the CapR algorithm.

3. Map to Transcriptome

The nearBynding pipeline can accomodate either a BAM file of aligned CLIP-seq reads or a BED file of peak intervals. BAM files must be sorted and converted to a bedGraph file first, whereas BED files can be read-in directly.

BAM file input

bam <- system.file("extdata/chr4and5.bam", package="nearBynding")
sorted_bam<-sortBam(bam, "chr4and5_sorted")

CleanBAMtoBG(in_bam = sorted_bam)

Map Protein Binding and RNA Structure to the Transcriptome

BED or bedGraph files can then be mapped onto the concatenated transcriptome using the chain file created by GenomeMappingToChainFile(). This way, only the protein binding from transcriptomic regions of interest will be considered in the final protein binding analysis.

liftOverToExomicBG(input = "chr4and5_sorted.bedGraph",
                    chain = "test.chain",
                    chrom_size = "chr4and5_3UTR.size",
                    output_bg = "chr4and5_liftOver.bedGraph")

For BED file inputs, use the additional argument format = "BED".

The RNA structure information from the CapR output also needs to mapped onto the concatenated transcriptome. There are six different binding contexts calculated by CapR – stem, hairpin, multibranch, exterior, internal, and bulge. processCapRout() parses the CapR output, converts it into six separate bedGraph files, and then performs liftOverToExomic() for all the files.

For this sake of this vignette, the CapR outfile is available:

processCapRout(CapR_outfile = system.file("extdata/chr4and5_3UTR.out", 
                                            package="nearBynding"),
                chain = "test.chain",
                output_prefix = "chr4and5_3UTR",
                chrom_size = "chr4and5_3UTR.size",
                genome_gtf = gtf,
                RNA_fragment = "three_prime_utr")

It is possible for users to input their own RNA structure information rather than using CapR. The information should be in BED file format and can be input into liftOverToExomicBG() to map the RNA structure data to the same transcriptome as the protein binding data.

4. Calculate Cross-correlation via StereoGene

This is the process that directly answers the question, “What does RNA structure look like around where the protein is binding?” StereoGene is used to calculate the cross-correlation between RNA structure and protein binding in order to visualize the RNA structure landscape surrounding protein binding.

If CapR is used to determine RNA structure, runStereogeneOnCapR() initiates StereoGene for a given protein against all CapR-generated RNA structure contexts.

For the sake of this vignette, use outfiles() to pull the StereoGene output files to your local directory if you do not want to run StereoGene.

runStereogeneOnCapR(protein_file = "chr4and5_liftOver.bedGraph",
                    chrom_size = "chr4and5_3UTR.size",
                    name_config = "chr4and5_3UTR.cfg",
                    input_prefix = "chr4and5_3UTR")

If external RNA structure data is being tested, runStereogene() initiates analysis for a given protein and a single RNA structure context.

Note: The input track file order matters! The correct order is:

1) RNA structure
2) protein binding

Otherwise, data visualization will be inverted and all downstream analysis will be backwards.

runStereogene(track_files = c("chr4and5_3UTR_stem_liftOver.bedGraph", 
                                "chr4and5_liftOver.bedGraph"),
                name_config = "chr4and5_3UTR.cfg")

5. Visualize Results

The cross-correlation output of StereoGene can be visualized as either a heatmap or a line plot. Examples of both are below.

For CapR-derived RNA structure, all six contexts can be viewed simultaneously.

visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR",
                        protein_file = "chr4and5_liftOver",
                        heatmap = TRUE,
                        out_file = "all_contexts_heatmap",
                        x_lim = c(-500, 500))
visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR",
                        protein_file = "chr4and5_liftOver",
                        x_lim = c(-500, 500),
                        out_file = "all_contexts_line",
                        y_lim = c(-18, 22))

Warning: This step may take up to an hour for a full transcriptome.

Alternatively, a single context can be viewed at a time.

visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver",
                    protein_file = "chr4and5_liftOver",
                    out_file = "stem_line",
                    x_lim = c(-500, 500))
visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver",
                    protein_file = "chr4and5_liftOver",
                    heatmap = TRUE,
                    out_file = "stem_heatmap",
                    x_lim = c(-500, 500))

Although this specific, limited example does not provide a particularly visually stimulating image, larger data sets (which provide many more StereoGene windows) result in narrower peaks and less noise.

6. Calculate Distance

In order to determine the similarity of two binding contexts, we can calculate the Wasserstein distance between curves. A small value suggests two binding contexts are very similar, whereas larger values suggest substantial differences.

For example, calculate the distance between the stem and hairpin contexts visualized above.

bindingContextDistance(RNA_context = "chr4and5_3UTR_stem_liftOver",
                        protein_file = "chr4and5_liftOver",
                        RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
#> [1] 11.75237

Now compare it to the distance between internal and hairpin contexts.

bindingContextDistance(RNA_context = "chr4and5_3UTR_internal_liftOver",
                        protein_file = "chr4and5_liftOver",
                        RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
#> [1] 1.788653

We can see that the stem context is less similar to the hairpin context than the internal context, and this is reflected in the calculated distances.

Questions? Comments? Please email Veronica Busa at

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] Rsamtools_2.21.2     Biostrings_2.75.0    XVector_0.45.0      
#>  [4] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2  IRanges_2.39.2      
#>  [7] S4Vectors_0.43.2     BiocGenerics_0.53.0  nearBynding_1.17.0  
#> [10] rmarkdown_2.28      
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1                        
#>  [2] dplyr_1.1.4                             
#>  [3] blob_1.2.4                              
#>  [4] R.utils_2.12.3                          
#>  [5] bitops_1.0-9                            
#>  [6] fastmap_1.2.0                           
#>  [7] RCurl_1.98-1.16                         
#>  [8] GenomicAlignments_1.41.0                
#>  [9] transport_0.15-4                        
#> [10] XML_3.99-0.17                           
#> [11] digest_0.6.37                           
#> [12] lifecycle_1.0.4                         
#> [13] plyranges_1.25.0                        
#> [14] KEGGREST_1.45.1                         
#> [15] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
#> [16] RSQLite_2.3.7                           
#> [17] magrittr_2.0.3                          
#> [18] compiler_4.4.1                          
#> [19] rlang_1.1.4                             
#> [20] sass_0.4.9                              
#> [21] tools_4.4.1                             
#> [22] utf8_1.2.4                              
#> [23] yaml_2.3.10                             
#> [24] data.table_1.16.2                       
#> [25] rtracklayer_1.65.0                      
#> [26] knitr_1.48                              
#> [27] S4Arrays_1.5.11                         
#> [28] bit_4.5.0                               
#> [29] curl_5.2.3                              
#> [30] DelayedArray_0.33.1                     
#> [31] KernSmooth_2.23-24                      
#> [32] abind_1.4-8                             
#> [33] BiocParallel_1.41.0                     
#> [34] sys_3.4.3                               
#> [35] R.oo_1.26.0                             
#> [36] grid_4.4.1                              
#> [37] fansi_1.0.6                             
#> [38] caTools_1.18.3                          
#> [39] colorspace_2.1-1                        
#> [40] ggplot2_3.5.1                           
#> [41] gtools_3.9.5                            
#> [42] scales_1.3.0                            
#> [43] SummarizedExperiment_1.35.5             
#> [44] cli_3.6.3                               
#> [45] crayon_1.5.3                            
#> [46] generics_0.1.3                          
#> [47] httr_1.4.7                              
#> [48] rjson_0.2.23                            
#> [49] DBI_1.2.3                               
#> [50] cachem_1.1.0                            
#> [51] zlibbioc_1.51.2                         
#> [52] parallel_4.4.1                          
#> [53] AnnotationDbi_1.69.0                    
#> [54] restfulr_0.0.15                         
#> [55] matrixStats_1.4.1                       
#> [56] vctrs_0.6.5                             
#> [57] Matrix_1.7-1                            
#> [58] jsonlite_1.8.9                          
#> [59] bit64_4.5.2                             
#> [60] GenomicFeatures_1.57.1                  
#> [61] maketools_1.3.1                         
#> [62] jquerylib_0.1.4                         
#> [63] glue_1.8.0                              
#> [64] codetools_0.2-20                        
#> [65] gtable_0.3.6                            
#> [66] BiocIO_1.17.0                           
#> [67] UCSC.utils_1.1.0                        
#> [68] munsell_0.5.1                           
#> [69] tibble_3.2.1                            
#> [70] pillar_1.9.0                            
#> [71] gplots_3.2.0                            
#> [72] htmltools_0.5.8.1                       
#> [73] GenomeInfoDbData_1.2.13                 
#> [74] R6_2.5.1                                
#> [75] evaluate_1.0.1                          
#> [76] Biobase_2.67.0                          
#> [77] lattice_0.22-6                          
#> [78] highr_0.11                              
#> [79] png_0.1-8                               
#> [80] R.methodsS3_1.8.2                       
#> [81] memoise_2.0.1                           
#> [82] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 
#> [83] bslib_0.8.0                             
#> [84] Rcpp_1.0.13                             
#> [85] SparseArray_1.5.45                      
#> [86] xfun_0.48                               
#> [87] MatrixGenerics_1.17.1                   
#> [88] buildtools_1.0.0                        
#> [89] pkgconfig_2.0.3