nearBynding is a package designed to discern annotated RNA structures proximal to protein binding. nearBynding allows users to annotate RNA structure contexts via CapR or input their own annotations in BED/bedGraph format. It accomodates protein binding information from CLIP-seq experiments as either aligned CLIP-seq reads or peak-called intervals. This vignette will walk you through:
* The external software necessary to support this pipeline
* Creating a concatenated transcriptome
* Extracting and folding RNA from the transcriptome via CapR
* Mapping protein-binding and RNA structure information onto a transcriptome
* Running StereoGene to identify RNA structure proximal to protein binding
* Visualizing binding results
* Determining the distance between binding contexts
Before running any of these examples, it is highly recommended that
the user establishes a new empty directory and uses setwd()
to make certain that all outputs are deposited there. Some of the
functions below create multiple output files.
Add all dependency directories to your PATH after installation.
bedtools is available for installation here.
Installation instructions will vary by operating system.
Download the zip file from the github repository, unzip the file, and move it to a directory where you want to permanently store the function.
In the command line, access the folder where the unzipped file is stored.
If installation is successful, the final line will output
Error: The number of argument is invalid.
Download the zip file from the github repository, unzip the file, and move it to a directory where you want to permanently store the function.
In the command line, access the folder where the unzipped file is stored.
If installation is successful, the final line will output a menu of argument options.
Although nearBynding is designed to support whole-genome analyses, we will exclusively be evaluating protein-coding genes of chromosomes 4 and 5 through this vignette.
First, a list of transcripts must be identified for analysis. A recommended criterium for selection is that the transcripts be expressed in the cell type used for CLIP-seq experiments. For this vignette, 50 random transcripts have been selected, and the 3’UTR structure of each transcript will be used for analysis, though any region of a transcript such as 5’UTR or CDS could be assessed instead.
This step creates a chain file that will be used to map the selected regions of transcripts end-to-end, excluding the intergenic regions and undesired transcripts that comprise the majority of the genome.
# load transcript list
load(system.file("extdata/transcript_list.Rda", package="nearBynding"))
# get GTF file
gtf<-system.file("extdata/Homo_sapiens.GRCh38.chr4&5.gtf",
package="nearBynding")
GenomeMappingToChainFile(genome_gtf = gtf,
out_chain_name = "test.chain",
RNA_fragment = "three_prime_utr",
transcript_list = transcript_list,
alignment = "hg38")
A file containing the sizes of each concatenated chromosome in the chain file will be required for downstream analysis.
In order to fold the 3’UTRs, the sequences must first be extracted. This is achieved with the following code:
ExtractTranscriptomeSequence(transcript_list = transcript_list,
ref_genome = "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
genome_gtf = gtf,
RNA_fragment = "three_prime_utr",
exome_prefix = "chr4and5_3UTR")
The reference genome can be found through Ensembl, but for users who do not want to download that 3.2GB file for the sake of this vignette, the FASTA output of the above code is available via:
These sequences can then be submitted to CapR for folding.
Warning: This step can take hours or even days depending on how many transcripts are submitted, how long the RNA fragments are, and the maximum distance between base-paired nucleotides submitted to the CapR algorithm.
The nearBynding pipeline can accomodate either a BAM file of aligned CLIP-seq reads or a BED file of peak intervals. BAM files must be sorted and converted to a bedGraph file first, whereas BED files can be read-in directly.
BED or bedGraph files can then be mapped onto the concatenated
transcriptome using the chain file created by
GenomeMappingToChainFile()
. This way, only the protein
binding from transcriptomic regions of interest will be considered in
the final protein binding analysis.
liftOverToExomicBG(input = "chr4and5_sorted.bedGraph",
chain = "test.chain",
chrom_size = "chr4and5_3UTR.size",
output_bg = "chr4and5_liftOver.bedGraph")
For BED file inputs, use the additional argument
format = "BED"
.
The RNA structure information from the CapR output also needs to
mapped onto the concatenated transcriptome. There are six different
binding contexts calculated by CapR – stem, hairpin,
multibranch, exterior, internal, and
bulge. processCapRout()
parses the CapR output,
converts it into six separate bedGraph files, and then performs
liftOverToExomic()
for all the files.
For this sake of this vignette, the CapR outfile is available:
processCapRout(CapR_outfile = system.file("extdata/chr4and5_3UTR.out",
package="nearBynding"),
chain = "test.chain",
output_prefix = "chr4and5_3UTR",
chrom_size = "chr4and5_3UTR.size",
genome_gtf = gtf,
RNA_fragment = "three_prime_utr")
It is possible for users to input their own RNA structure information
rather than using CapR. The information should be in BED file format and
can be input into liftOverToExomicBG()
to map the RNA
structure data to the same transcriptome as the protein binding
data.
This is the process that directly answers the question, “What does RNA structure look like around where the protein is binding?” StereoGene is used to calculate the cross-correlation between RNA structure and protein binding in order to visualize the RNA structure landscape surrounding protein binding.
If CapR is used to determine RNA structure,
runStereogeneOnCapR()
initiates StereoGene for a given
protein against all CapR-generated RNA structure contexts.
For the sake of this vignette, use outfiles()
to pull
the StereoGene output files to your local directory if you do not want
to run StereoGene.
runStereogeneOnCapR(protein_file = "chr4and5_liftOver.bedGraph",
chrom_size = "chr4and5_3UTR.size",
name_config = "chr4and5_3UTR.cfg",
input_prefix = "chr4and5_3UTR")
If external RNA structure data is being tested,
runStereogene()
initiates analysis for a given protein and
a single RNA structure context.
Note: The input track file order matters! The correct order is:
1) RNA structure
2) protein binding
Otherwise, data visualization will be inverted and all downstream analysis will be backwards.
The cross-correlation output of StereoGene can be visualized as either a heatmap or a line plot. Examples of both are below.
For CapR-derived RNA structure, all six contexts can be viewed simultaneously.
visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR",
protein_file = "chr4and5_liftOver",
heatmap = TRUE,
out_file = "all_contexts_heatmap",
x_lim = c(-500, 500))
visualizeCapRStereogene(CapR_prefix = "chr4and5_3UTR",
protein_file = "chr4and5_liftOver",
x_lim = c(-500, 500),
out_file = "all_contexts_line",
y_lim = c(-18, 22))
Warning: This step may take up to an hour for a full transcriptome.
Alternatively, a single context can be viewed at a time.
visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver",
protein_file = "chr4and5_liftOver",
out_file = "stem_line",
x_lim = c(-500, 500))
visualizeStereogene(context_file = "chr4and5_3UTR_stem_liftOver",
protein_file = "chr4and5_liftOver",
heatmap = TRUE,
out_file = "stem_heatmap",
x_lim = c(-500, 500))
Although this specific, limited example does not provide a particularly visually stimulating image, larger data sets (which provide many more StereoGene windows) result in narrower peaks and less noise.
In order to determine the similarity of two binding contexts, we can calculate the Wasserstein distance between curves. A small value suggests two binding contexts are very similar, whereas larger values suggest substantial differences.
For example, calculate the distance between the stem and hairpin contexts visualized above.
bindingContextDistance(RNA_context = "chr4and5_3UTR_stem_liftOver",
protein_file = "chr4and5_liftOver",
RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
#> [1] 11.75237
Now compare it to the distance between internal and hairpin contexts.
bindingContextDistance(RNA_context = "chr4and5_3UTR_internal_liftOver",
protein_file = "chr4and5_liftOver",
RNA_context_2 = "chr4and5_3UTR_hairpin_liftOver")
#> [1] 1.788653
We can see that the stem context is less similar to the hairpin context than the internal context, and this is reflected in the calculated distances.
Questions? Comments? Please email Veronica Busa at [email protected]
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] Rsamtools_2.21.2 Biostrings_2.75.0 XVector_0.45.0
#> [4] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2 IRanges_2.39.2
#> [7] S4Vectors_0.43.2 BiocGenerics_0.53.0 nearBynding_1.17.0
#> [10] rmarkdown_2.28
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1
#> [2] dplyr_1.1.4
#> [3] blob_1.2.4
#> [4] R.utils_2.12.3
#> [5] bitops_1.0-9
#> [6] fastmap_1.2.0
#> [7] RCurl_1.98-1.16
#> [8] GenomicAlignments_1.41.0
#> [9] transport_0.15-4
#> [10] XML_3.99-0.17
#> [11] digest_0.6.37
#> [12] lifecycle_1.0.4
#> [13] plyranges_1.25.0
#> [14] KEGGREST_1.45.1
#> [15] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
#> [16] RSQLite_2.3.7
#> [17] magrittr_2.0.3
#> [18] compiler_4.4.1
#> [19] rlang_1.1.4
#> [20] sass_0.4.9
#> [21] tools_4.4.1
#> [22] utf8_1.2.4
#> [23] yaml_2.3.10
#> [24] data.table_1.16.2
#> [25] rtracklayer_1.65.0
#> [26] knitr_1.48
#> [27] S4Arrays_1.5.11
#> [28] bit_4.5.0
#> [29] curl_5.2.3
#> [30] DelayedArray_0.33.1
#> [31] KernSmooth_2.23-24
#> [32] abind_1.4-8
#> [33] BiocParallel_1.41.0
#> [34] sys_3.4.3
#> [35] R.oo_1.26.0
#> [36] grid_4.4.1
#> [37] fansi_1.0.6
#> [38] caTools_1.18.3
#> [39] colorspace_2.1-1
#> [40] ggplot2_3.5.1
#> [41] gtools_3.9.5
#> [42] scales_1.3.0
#> [43] SummarizedExperiment_1.35.5
#> [44] cli_3.6.3
#> [45] crayon_1.5.3
#> [46] generics_0.1.3
#> [47] httr_1.4.7
#> [48] rjson_0.2.23
#> [49] DBI_1.2.3
#> [50] cachem_1.1.0
#> [51] zlibbioc_1.51.2
#> [52] parallel_4.4.1
#> [53] AnnotationDbi_1.69.0
#> [54] restfulr_0.0.15
#> [55] matrixStats_1.4.1
#> [56] vctrs_0.6.5
#> [57] Matrix_1.7-1
#> [58] jsonlite_1.8.9
#> [59] bit64_4.5.2
#> [60] GenomicFeatures_1.57.1
#> [61] maketools_1.3.1
#> [62] jquerylib_0.1.4
#> [63] glue_1.8.0
#> [64] codetools_0.2-20
#> [65] gtable_0.3.6
#> [66] BiocIO_1.17.0
#> [67] UCSC.utils_1.1.0
#> [68] munsell_0.5.1
#> [69] tibble_3.2.1
#> [70] pillar_1.9.0
#> [71] gplots_3.2.0
#> [72] htmltools_0.5.8.1
#> [73] GenomeInfoDbData_1.2.13
#> [74] R6_2.5.1
#> [75] evaluate_1.0.1
#> [76] Biobase_2.67.0
#> [77] lattice_0.22-6
#> [78] highr_0.11
#> [79] png_0.1-8
#> [80] R.methodsS3_1.8.2
#> [81] memoise_2.0.1
#> [82] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [83] bslib_0.8.0
#> [84] Rcpp_1.0.13
#> [85] SparseArray_1.5.45
#> [86] xfun_0.48
#> [87] MatrixGenerics_1.17.1
#> [88] buildtools_1.0.0
#> [89] pkgconfig_2.0.3